ValueError: `val_check_interval` (500) must be less than or equal to the number of the training batches (8). If you want to disable validation set `limit_val_batches` to 0.0 instead.
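Reduce the value of val_check_interval so that it is less than or equal to the total number of training batches. For example, with 8 batches you can set it to any value up to 8, such as 4 or 8, which runs validation every 4 or every 8 training batches. This ensures that validation runs at least once during training.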
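Disable validation. If you do not want to run validation in this training run, set limit_val_batches to 0.0, which skips the validation step entirely. Note, however, that given the format of the config file you provided, this option may have to be set directly in the code rather than in the config file.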
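If the option does have to be set in code, one way is to override the trainer settings after the config is loaded and before the trainer is built. The following is only a minimal sketch: it assumes the `trainer` section has already been loaded into a plain dict, and `disable_validation` is a hypothetical helper, not part of PyTorch Lightning. How your project hands these settings to the trainer is also an assumption.

```python
from typing import Any, Dict


def disable_validation(trainer_cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the trainer settings with validation switched off.

    Hypothetical helper for illustration; not a PyTorch Lightning API.
    """
    cfg = dict(trainer_cfg)  # shallow copy so the loaded config is not mutated
    # This is the setting the error message itself suggests.
    cfg["limit_val_batches"] = 0.0
    # Drop the oversized interval so the trainer falls back to its default.
    cfg.pop("val_check_interval", None)
    return cfg


# Hypothetical values mirroring part of the `trainer` section of the config.
trainer_cfg = {"precision": 32, "max_steps": 150001, "val_check_interval": 500}
trainer_kwargs = disable_validation(trainer_cfg)
print(trainer_kwargs)
```

The resulting dict could then be passed on wherever your code constructs the trainer, for example as keyword arguments, if that is how your project builds it.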
For the first option, the modified section would look like this:
trainer:
  accelerator: ddp
  precision: 32
  # Indices of GPUs used for training.
  gpus: [0, 1, 2]
  # Path to save logs and checkpoints.
  default_root_dir: /HOME/scz0csh/202005010215
  # Max number of training steps (batches).
  max_steps: 150001
  # Validation frequency in terms of training steps.
  # val_check_interval: 500
  val_check_interval: 8  # or any number less than or equal to 8
  # Log frequency of tensorboard logger.
  log_every_n_steps: 50
  # Accumulate gradients from multiple batches so as to increase batch size.
  accumulate_grad_batches: 1
Make sure val_check_interval is set to an appropriate value so that validation still runs properly when there are only a few batches.
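If the number of batches changes between runs, the interval can also be capped automatically instead of being edited by hand. The sketch below is a hypothetical helper, not part of PyTorch Lightning. It assumes you know the number of training batches, for example the figure reported in the error message, and simply clamps the requested interval to it.

```python
def pick_val_check_interval(requested: int, num_training_batches: int) -> int:
    """Clamp an integer validation interval to the number of training batches.

    Hypothetical helper for illustration; not a PyTorch Lightning API.
    """
    if num_training_batches < 1:
        raise ValueError("need at least one training batch")
    if requested < 1:
        raise ValueError("interval must be a positive number of batches")
    # Never ask for validation less often than once per pass over the batches.
    return min(requested, num_training_batches)


print(pick_val_check_interval(500, 8))  # 8
print(pick_val_check_interval(4, 8))    # 4
```

With the values from the error above, a requested interval of 500 is reduced to 8, while an interval that already fits, such as 4, is left unchanged.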