My first blog post: a little record of a code-reading newbie's struggles. I don't want to slack off too much either, so a bit of progress every day; hopefully I can get most of it done this month. Leaving a trail here to console myself, and to build a good habit while I'm at it.
(Stream-of-consciousness writing, I'll just put down whatever comes to mind... as long as I can understand it myself.)
Another day of staring blankly at the OLTR code. It still doesn't run; right now I'm modifying the code from GitHub, starting by swapping in the CIFAR10 dataset. The code architecture is ridiculously convoluted.
Let me walk through how the code runs.
import argparse
from utils import source_import  # repo utility that imports a module from a file path

parser = argparse.ArgumentParser()
parser.add_argument('--config', default='./temp.py', type=str)
parser.add_argument('--test', default=False, action='store_true')
args = parser.parse_args()
test_mode = args.test
config = source_import("./temp.py").config  # note: path hardcoded here, args.config goes unused
training_opt = config['training_opt']
Reading through the parser: it really just reads in the config dict and test_mode; I deleted everything else since it isn't needed.
(Note: test_mode probably has to change; I'll just split things into two scripts.)
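source_import keeps showing up, so a note to self: it imports a module straight from a file path, which is how the config dict gets pulled out of temp.py. A minimal sketch of the idea with importlib (the repo's utils has its own version; this one is my assumption, not a copy):

import importlib.util

def source_import(file_path):
    # Load a .py file as a module so its top-level names (e.g. `config`) become attributes
    spec = importlib.util.spec_from_file_location('imported_module', file_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # actually executes the file
    return module

# usage: config = source_import('./temp.py').config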
First, a look inside the training mode...
sampler_defs = training_opt['sampler']
if sampler_defs:
    sampler_dic = {'sampler': source_import(sampler_defs['def_file']).get_sampler(),
                   'num_samples_cls': sampler_defs['num_samples_cls']}
else:
    sampler_dic = None

relatin_opt = config['memory']  # this assignment was missing from my snippet; 'relatin' is the repo's own spelling
data = {x: load_data(phase=x,
                     batch_size=training_opt['batch_size'],
                     sampler_dic=sampler_dic,
                     num_workers=training_opt['num_workers'])
        for x in (['train', 'val', 'train_plain'] if relatin_opt['init_centroids'] else ['train', 'val'])}
training_model = model(config, data, test=False)
training_model.train()
There's a sampler here. A DataLoader always draws indices through some sampler: sequential by default, random when shuffle=True, and shuffle has to stay False once a custom sampler is passed in. This repo seems to define its own custom sampler (haven't figured it out yet).
Note that whether a sampler is used at all depends on training_opt, a field of the config dict, and that stage 1 doesn't need one. The sampler goes in as an argument when building data.
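To pin down how sampler_dic probably reaches the DataLoader, a sketch of what load_data presumably does inside (my guess at its shape, not the repo's exact code; the key constraint is that a custom sampler and shuffle=True are mutually exclusive in PyTorch):

from torch.utils.data import DataLoader

def load_data_sketch(dataset, batch_size, sampler_dic=None, num_workers=4):
    if sampler_dic is not None:
        # get_sampler() handed back a sampler factory; instantiate it on this dataset
        sampler = sampler_dic['sampler'](dataset, sampler_dic['num_samples_cls'])
        return DataLoader(dataset, batch_size=batch_size, shuffle=False,
                          sampler=sampler, num_workers=num_workers)
    # no custom sampler: fall back to plain random shuffling
    return DataLoader(dataset, batch_size=batch_size, shuffle=True,
                      num_workers=num_workers)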
What matters most is this data dict; that's where most of my changes will probably land.
Building data involves relatin_opt, which equals config['memory']; it holds two flags, centroids and init_centroids. In stage 1 both are False; for the meta-embedding stage both are True. So in stage 1, data only has two entries, train and val.
So for stage 1, data is just {'train': load_data(phase='train', batch_size=64, sampler_dic=None, num_workers=4), 'val': load_data(phase='val', batch_size=64, sampler_dic=None, num_workers=4)}. The concrete changes go into dataloader.py (mainly the return values; the GitHub code also writes a .txt log there that can be deleted).
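One thing I noticed for the CIFAR10 swap: the train loop later unpacks three values per batch, (inputs, labels, _), so whatever Dataset I substitute in has to return an index as the third element. A minimal wrapper over torchvision's CIFAR10 (my own sketch; the class name is made up):

from torchvision import datasets

class CIFAR10WithIndex(datasets.CIFAR10):
    """CIFAR10 whose samples come back as (image, label, index) triples."""
    def __getitem__(self, index):
        img, label = super().__getitem__(index)
        return img, label, index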
OK. Then training_model is model() called with three arguments: config, data, and test=False.
The main event is runnet.py, and I genuinely can't read it...
self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
self.config = config
self.training_opt = self.config['training_opt']
self.memory = self.config['memory']
self.data = data
self.test_mode = test

# Initialize model
self.init_models()

# Under training mode, initialize training steps, optimizers,
# schedulers, criterions, and centroids
if not self.test_mode:
    # If using steps for training, we need to calculate training steps
    # for each epoch based on actual number of training data instead of
    # oversampled data number
    print('Using steps for training.')
    self.training_data_num = len(self.data['train'].dataset)
    self.epoch_steps = int(self.training_data_num / self.training_opt['batch_size'])

    # Initialize model optimizer and scheduler
    print('Initializing model optimizer.')
    self.scheduler_params = self.training_opt['scheduler_params']
    self.model_optimizer, \
    self.model_optimizer_scheduler = self.init_optimizers(self.model_optim_params_list)
    self.init_criterions()
    if self.memory['init_centroids']:  # only when the centroids need initializing (stage 2)
        self.criterions['FeatureLoss'].centroids.data = \
            self.centroids_cal(self.data['train_plain'])
A look at this init code: it stores config, training_opt, data, and test (which equals False here).
Train mode has to compute the steps for training: the dataset size and the per-epoch step count (dataset size divided by batch size), plus the overall learning-rate/scheduler params; it also initializes the optimizer params and the loss functions. Stage 1 doesn't need to initialize centroids; stage 2 does.
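I haven't read centroids_cal itself yet, but conceptually initializing the centroids should just mean averaging each class's features over the train_plain loader. A sketch of that idea (my reconstruction, not the repo's code; assumes the feature model returns one feature vector per sample):

import torch

def centroids_cal_sketch(feat_model, loader, num_classes, feat_dim, device):
    centroids = torch.zeros(num_classes, feat_dim, device=device)
    counts = torch.zeros(num_classes, device=device)
    feat_model.eval()
    with torch.no_grad():
        for inputs, labels, _ in loader:
            feats = feat_model(inputs.to(device))  # per-sample feature vectors
            for feat, label in zip(feats, labels):
                centroids[label] += feat
                counts[label] += 1
    return centroids / counts.unsqueeze(1)  # per-class mean feature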
The init_models part initializes the model. The networks dict in the config holds the feature network itself plus the classifier, and each entry consists of basic params and optimizer params. The stage1_weights param needs a closer look; stage 1 doesn't use it. Here the network model and the classifier are create()-d straight from those config params (which seemed weird at first, I couldn't see why that works?? see the sketch below).
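If I'm reading it right, the create() "magic" is just the dynamic-import trick again: every def_file listed under the networks config is a module that exposes a create() factory, so init_models can build each part purely from config. Roughly (my reconstruction; the exact key names are assumptions):

networks = {}
for key, val in config['networks'].items():
    def_file = val['def_file']   # path to a module, e.g. the feat_model or classifier definition
    model_args = val['params']   # kwargs handed to the factory
    # the dynamically imported module must define create()
    networks[key] = source_import(def_file).create(**model_args)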
The fix part I don't really get yet. I'll come back and fill it in later.
def train(self):
    # When training the network
    print_str = ['Phase: train']
    time.sleep(0.25)

    # Initialize best model
    best_model_weights = {}
    best_model_weights['feat_model'] = copy.deepcopy(self.networks['feat_model'].state_dict())
    best_model_weights['classifier'] = copy.deepcopy(self.networks['classifier'].state_dict())
    best_acc = 0.0
    best_epoch = 0
    best_centroids = self.centroids  # was missing: referenced at save time even if no epoch improves
    end_epoch = self.training_opt['num_epochs']

    # Loop over epochs
    for epoch in range(1, end_epoch + 1):
        for model in self.networks.values():
            model.train()
        torch.cuda.empty_cache()

        # Iterate over dataset
        for step, (inputs, labels, _) in enumerate(self.data['train']):
            # Break when step equal to epoch step
            if step == self.epoch_steps:
                break
            inputs, labels = inputs.to(self.device), labels.to(self.device)

            # If on training phase, enable gradients
            with torch.set_grad_enabled(True):
                # If training, forward with loss, and no top 5 accuracy calculation
                self.batch_forward(inputs, labels,
                                   centroids=self.memory['centroids'],
                                   phase='train')
                self.batch_loss(labels)
                self.batch_backward()

                # Output minibatch training results
                if step % self.training_opt['display_step'] == 0:
                    minibatch_loss_feat = self.loss_feat.item() \
                        if 'FeatureLoss' in self.criterions.keys() else None
                    minibatch_loss_perf = self.loss_perf.item()
                    _, preds = torch.max(self.logits, 1)
                    minibatch_acc = mic_acc_cal(preds, labels)
                    print_str = ['Epoch: [%d/%d]'
                                 % (epoch, self.training_opt['num_epochs']),
                                 'Step: %5d'
                                 % (step),
                                 'Minibatch_loss_feature: %.3f'
                                 % (minibatch_loss_feat) if minibatch_loss_feat else '',
                                 'Minibatch_loss_performance: %.3f'
                                 % (minibatch_loss_perf),
                                 'Minibatch_accuracy_micro: %.3f'
                                 % (minibatch_acc)]

        # Set model modes and set scheduler
        # In training, step optimizer scheduler once per epoch
        self.model_optimizer_scheduler.step()
        if self.criterion_optimizer:
            self.criterion_optimizer_scheduler.step()

        # After every epoch, validation
        self.eval(phase='val')

        # Under validation, the best model needs to be updated
        if self.eval_acc_mic_top1 > best_acc:
            best_epoch = copy.deepcopy(epoch)
            best_acc = copy.deepcopy(self.eval_acc_mic_top1)
            best_centroids = copy.deepcopy(self.centroids)
            best_model_weights['feat_model'] = copy.deepcopy(self.networks['feat_model'].state_dict())
            best_model_weights['classifier'] = copy.deepcopy(self.networks['classifier'].state_dict())

    print()
    print('Training Complete.')
    print_str = ['Best validation accuracy is %.3f at epoch %d' % (best_acc, best_epoch)]
    print(print_str, "*********")

    # Save the best model and best centroids if calculated
    self.save_model(epoch, best_epoch, best_model_weights, best_acc, centroids=best_centroids)
    print('Done')
The train method here is also pretty important.
In train mode, batch_forward is called with (inputs, labels, centroids=self.memory['centroids'], phase='train'); in stage 1 that centroids flag is False.
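My rough reconstruction of what batch_forward boils down to in stage 1, pieced together from the call above; don't take the signature as gospel:

def batch_forward(self, inputs, labels=None, centroids=False, phase='train'):
    # Backbone: extract features for the batch
    self.features, self.feature_maps = self.networks['feat_model'](inputs)
    # Stage 1: centroids is False, so the classifier gets no centroid/memory input
    self.centroids = self.criterions['FeatureLoss'].centroids.data if centroids else None
    # Classifier: turn features (plus centroids in stage 2) into logits
    self.logits, self.direct_memory_feature = self.networks['classifier'](self.features, self.centroids)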
---------------------------------------------------------------------------------------------------------------------------------
Dividing line: 11.05