1. A BN layer cannot be trained on a batch that contains only 1 image
File "/home/user02/wildkid1024/haq/models/mobilenet.py", line 71, in forward
    x = self.features(x)
  File "/home/user02/anaconda2/envs/py3_dl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user02/anaconda2/envs/py3_dl/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/user02/anaconda2/envs/py3_dl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user02/anaconda2/envs/py3_dl/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/user02/anaconda2/envs/py3_dl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user02/wildkid1024/haq/lib/utils/utils.py", line 244, in lambda_forward
    return m.old_forward(x)
  File "/home/user02/anaconda2/envs/py3_dl/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 76, in forward
    exponential_average_factor, self.eps)
  File "/home/user02/anaconda2/envs/py3_dl/lib/python3.6/site-packages/torch/nn/functional.py", line 1619, in batch_norm
    raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 512, 1, 1])

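Problem analysis: The model uses batch normalization, and during batched training one of the batches ended up with a single sample. For example, if the dataset has 17 samples in total and batch_size is 8, the last batch contains only 1 sample (17 = 2 × 8 + 1). In training mode BN cannot compute batch statistics from one value per channel, so it raises this error.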
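The arithmetic behind the 17-sample example can be checked with a few lines of plain Python. `batch_sizes` is a hypothetical helper written for this note, not part of PyTorch; it only mimics how a sequential loader splits a dataset into batches.

```python
def batch_sizes(num_samples, batch_size, drop_last=False):
    """Hypothetical helper: sizes of the batches a sequential loader would yield."""
    full, remainder = divmod(num_samples, batch_size)
    sizes = [batch_size] * full
    # The leftover samples form one short batch unless it is dropped.
    if remainder and not drop_last:
        sizes.append(remainder)
    return sizes


print(batch_sizes(17, 8))                  # [8, 8, 1]
print(batch_sizes(17, 8, drop_last=True))  # [8, 8]
```

The trailing batch of size 1 is the one that triggers the ValueError in the BN layer.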
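Solution: 1. Set the DataLoader's drop_last parameter to True so the incomplete last batch is discarded. 2. Manually skip any batch that contains only 1 sample. 3. If this happens during validation, call model.eval() so that the BN layers use their running statistics instead of batch statistics. 4. If training really can only use 1 sample at a time, replace BN with InstanceNorm (note that InstanceNorm still needs more than one spatial element per channel, so it does not help for a 1×1 feature map such as the one in this traceback).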
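A minimal sketch of fixes 1 and 3, assuming a toy TensorDataset of 17 random samples shaped like the tensor in the traceback. The dataset, the standalone BatchNorm2d layer, and the variable names are illustrative only and do not come from the original project.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy data (assumption): 17 samples, so batch_size=8 would leave a final batch of 1.
dataset = TensorDataset(torch.randn(17, 512, 1, 1))

# Fix 1: drop the incomplete last batch.
loader = DataLoader(dataset, batch_size=8, drop_last=True)
sizes = [x.shape[0] for (x,) in loader]
print(sizes)

bn = nn.BatchNorm2d(512)
single = torch.randn(1, 512, 1, 1)

# In training mode a single value per channel raises the ValueError from the traceback.
bn.train()
try:
    bn(single)
    train_failed = False
except ValueError as err:
    train_failed = True
    print(err)

# Fix 3: in evaluation mode BN uses its running statistics, so one sample is fine.
bn.eval()
with torch.no_grad():
    out = bn(single)
print(out.shape)
```

drop_last=True silently discards up to batch_size - 1 samples per epoch, which is usually acceptable for training with shuffling but not for validation, where model.eval() is the better fit.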

  1. The variable was not marked as differentiable before running autograd
  File "/home/user02/anaconda2/envs/py3_dl/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/user02/anaconda2/envs/py3_dl/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

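Problem analysis: backward() was called on a tensor that has requires_grad=False and no grad_fn, so autograd has no computation graph from which to take partial derivatives. This happens when nothing that feeds into the loss has requires_grad=True, or when the forward pass ran with gradient tracking disabled.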
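Solution: Check whether the forward pass runs inside torch.no_grad() (often used together with model.eval() in validation code) and remove it from the training step. Note that model.eval() by itself does not turn off gradient tracking. If neither is present, set var.requires_grad = True on the input.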
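A small sketch that reproduces the error and shows both fixes. The nn.Linear model and the tensor names are placeholders chosen for this example, not the code from the traceback.

```python
import torch
from torch import nn

model = nn.Linear(4, 1)
x = torch.randn(2, 4)

# A forward pass under torch.no_grad() records no graph: the result has no grad_fn.
with torch.no_grad():
    detached_loss = model(x).sum()
try:
    detached_loss.backward()
    failed = False
except RuntimeError as err:
    failed = True
    print(err)

# Without no_grad() the graph is recorded and backward() works.
loss = model(x).sum()
loss.backward()

# If all parameters are frozen, mark the input as requiring gradients instead.
for p in model.parameters():
    p.requires_grad_(False)
inp = torch.randn(2, 4)
inp.requires_grad = True
model(inp).sum().backward()
print(inp.grad.shape)
```

Checking `loss.requires_grad` and `loss.grad_fn` just before calling backward() is a quick way to find out where the graph was cut.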

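  1. After training for one epoch in PyTorch, training cannot continue: the DataLoader hangs, or the program exits with a non-zero status
    Problem analysis: This is related to PyTorch's parallel data loading. When the DataLoader reads data with multiple workers, a deadlock can occur.
    Solution: 1. Check whether the data is read with cv2.imread; if so, consider switching to PIL's Image for reading, or disable OpenCV's own multithreading with cv2.setNumThreads(0) and cv2.ocl.setUseOpenCL(False). 2. Set num_workers to 0, at the cost of slower data loading. If you do not want to set it to 0, set pin_memory=True so that batches are copied into page-locked (pinned) memory.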
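A hedged sketch of how these settings might be applied. The toy TensorDataset stands in for a real image dataset, and the OpenCV calls are guarded so the snippet still runs when cv2 is not installed.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

try:
    import cv2
    # Turn off OpenCV's internal threading and OpenCL before any workers start.
    cv2.setNumThreads(0)
    cv2.ocl.setUseOpenCL(False)
except ImportError:
    cv2 = None  # OpenCV is not installed; nothing to disable.

# Toy dataset (assumption) in place of the real image dataset.
dataset = TensorDataset(torch.randn(16, 3))

# num_workers=0 loads data in the main process: slower, but no worker deadlock.
loader = DataLoader(dataset, batch_size=4, num_workers=0)
batches = [x.shape[0] for (x,) in loader]
print(batches)
```

If the hang disappears with num_workers=0, the cause is most likely in the worker processes, and the OpenCV settings can then be tried with a larger num_workers.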