TensorFlow Object Detection API: multi-GPU pitfalls

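While training Faster R-CNN with the TensorFlow Object Detection API, I ran into a ValueError in multi-GPU mode caused by a wrong batch_size, a NameError during evaluation, and an InvalidArgumentError when resuming training. The fixes were to set batch_size to a suitable value and to check the code that derives it from the GPU and the network name. Following an answer on StackOverflow, increasing batch_size resolved the last problem.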


Notes on using the TensorFlow Object Detection API:

1. Faster R-CNN training

Training script:
python train.py \
--logtostderr \
--pipeline_config_path="PATH WHERE CONFIG FILE RESIDES" \
--train_dir="PATH WHERE MODEL DIRECTORY RESIDES"

Error message:
InvalidArgumentError (see above for traceback): ConcatOp : Dimensions of inputs should match: shape[0] = [1,890,600,3] vs. shape[1] = [1,766,600,3] [[Node: concat_1 = ConcatV2[N=10, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/cpu:0"](Preprocessor/sub, Preprocessor_1/sub, Preprocessor_2/sub, Preprocessor_3/sub, Preprocessor_4/sub, Preprocessor_5/sub, Preprocessor_6/sub, Preprocessor_7/sub, Preprocessor_8/sub, Preprocessor_9/sub, concat_1/axis)]]

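Cause: the Faster R-CNN family uses keep_aspect_ratio_resizer, so images of different sizes still have different sizes after preprocessing, which makes batching impossible.
A batch tensor has shape [num_batch, height, width, channels]; when (height, width) differs from one example to the next, the concat fails with the error above.
***Fix:*** set batch_size to 1.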
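
To make the failure concrete, here is a minimal sketch of the resize arithmetic in plain Python. It assumes min_dimension=600 and max_dimension=1024 (values I believe appear in the sample Faster R-CNN configs, not something stated in this post). The helper names `resized_shape` and `can_batch` are made up for illustration, and the real resizer may round differently. The input sizes 1780x1200 and 1532x1200 are hypothetical, chosen so the outputs match the shapes in the error message.

```python
def resized_shape(height, width, min_dimension=600, max_dimension=1024):
    """Approximate keep_aspect_ratio_resizer (simplified, assumed behavior).

    Scale so the short side equals min_dimension, unless that pushes the
    long side past max_dimension; then scale the long side to max_dimension.
    """
    scale = min_dimension / min(height, width)
    if max(height, width) * scale > max_dimension:
        scale = max_dimension / max(height, width)
    return (round(height * scale), round(width * scale))


def can_batch(shapes):
    """A batch needs every example to share the same (height, width)."""
    return len(set(shapes)) <= 1


# Hypothetical image sizes; the aspect ratio is kept, so heights differ.
shapes = [resized_shape(1780, 1200), resized_shape(1532, 1200)]
print(shapes)             # [(890, 600), (766, 600)]
print(can_batch(shapes))  # False
```

With batch_size set to 1 every batch holds a single image, so the shapes never have to agree.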

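2. Multi-GPU training:

model_main.py does not support multiple GPUs, because the Estimator distribution strategies do not work with tf.contrib.slim, so it has no distributed or multi-GPU training path. According to the related issue, Keras-based code is expected to be merged later.
For now, training has to go through object_detection/legacy/train.py.

Single-machine multi-GPU training failed with the following error: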

	 File "/tensorflow/models/research/object_detection/legacy/trainer.py", line 180, in _create_losses
train_config.use_multiclass_scores)

ValueError: not enough values to unpack (expected 7, got 0)

Script:

python object_detection/legacy/train.py \
    --pipeline_config_path=/root/research/Datasets/bak/model/pipeline.config \
    --train_dir=/root/research/Datasets/bak/model/myTrain2 \
    --worker_replicas=2 \
    --num_clones=2 --ps_tasks=1

Fix: the previous step set batch_size to 1 for Faster R-CNN, but here batch_size = per_batch_size * num_gpus, so it should be 2. After that change the run succeeded.
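
The arithmetic behind this fix can be sketched as follows. As far as I can tell, the legacy trainer splits the configured batch_size across clones with integer division, so batch_size=1 with num_clones=2 leaves each clone with zero examples, which would explain the "got 0" in the error. Treat that as an assumption to check against your own copy of trainer.py. The helper names `per_clone_batch_size` and `check_batch_size` are hypothetical.

```python
def per_clone_batch_size(batch_size, num_clones):
    """Examples each GPU clone receives per step (integer division)."""
    if num_clones < 1:
        raise ValueError("num_clones must be at least 1")
    return batch_size // num_clones


def check_batch_size(batch_size, num_clones):
    """Fail early if a clone would receive an empty batch."""
    per_clone = per_clone_batch_size(batch_size, num_clones)
    if per_clone == 0:
        raise ValueError(
            "batch_size=%d is too small for num_clones=%d"
            % (batch_size, num_clones))
    return per_clone


print(per_clone_batch_size(1, 2))  # 0
print(check_batch_size(2, 2))      # 1
```

Running a check like this before launching train.py turns the obscure unpacking error into an explicit message about the batch size.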

3. Error when evaluating with eval.py:
    The message:

    See tf.nn.softmax_cross_entropy_with_logits_v2.
    Traceback (most recent call last):
    File "object_detection/legacy/eval.py", line 154, in <module>
    tf.app.run()
    File "/root/anaconda3/envs/cvtf/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
    File "/root/anaconda3/envs/cvtf/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 306, in new_func
    return func(*args, **kwargs)
    File "object_detection/legacy/eval.py", line 145, in main
    graph_hook_fn=graph_rewriter_fn)
    File "/root/research/object_detection/legacy/evaluator.py", line 274, in evaluate
    evaluator_list = get_evaluators(eval_config, categories)
    File "/root/research/object_detection/legacy/evaluator.py", line 166, in get_evaluators
    EVAL_METRICS_CLASS_DICT[eval_metric_fn_key](categories=categories))
    File "/root/research/object_detection/utils/object_detection_evaluation.py", line 470, in __init__
    use_weighted_mean_ap=False)
    File "/root/research/object_detection/utils/object_detection_evaluation.py", line 194, in __init__
    self._build_metric_names()
    File "/root/research/object_detection/utils/object_detection_evaluation.py", line 213, in _build_metric_names
    category_name = unicode(category_name, 'utf-8')
    NameError: name 'unicode' is not defined


Fix:

Cause: Python 3 has no unicode type; it has been renamed to str.

Replace this: category_name = unicode(category_name, 'utf-8')
with this: category_name = str(category_name, 'utf-8')
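
One caveat with this replacement: in Python 3, str(value, 'utf-8') only accepts bytes and raises TypeError when value is already a str. A small helper that handles both cases is sketched below. `to_text` is a name I made up for illustration; it is not part of the Object Detection API.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding only when it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return str(value)


print(to_text(b"person"))  # person
print(to_text("person"))   # person

# Decoding an existing str is rejected by Python 3.
try:
    str("person", "utf-8")
except TypeError as err:
    print("TypeError:", err)
```
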
4. Resuming ssd_mobilenet_v1 training from a checkpoint raised the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance
[[node ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance (defined at /root/research/object_detection/legacy/trainer.py:354) = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance/tag, FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance/read)]]
[[{{node BoxPredictor_0/BoxEncodingPredictor/weights/read/_341}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1416_BoxPredictor_0/BoxEncodingPredictor/weights/read", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Caused by op 'ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance', defined at:

***

Problem:

The code that derives batch_size from the GPU and the network name had a bug and set batch_size to 1.
A similar answer: https://stackoverflow.com/a/49260201/9498482

Fix: with a larger batch_size the error no longer occurs.
