TensorFlow Object Detection API usage notes:
1. Training faster-rcnn
Training script:
python train.py \
--logtostderr \
--pipeline_config_path="PATH WHERE CONFIG FILE RESIDES" \
--train_dir="PATH WHERE MODEL DIRECTORY RESIDES"
Error message:
InvalidArgumentError (see above for traceback): ConcatOp : Dimensions of inputs should match: shape[0] = [1,890,600,3] vs. shape[1] = [1,766,600,3] [[Node: concat_1 = ConcatV2[N=10, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/cpu:0"](Preprocessor/sub, Preprocessor_1/sub, Preprocessor_2/sub, Preprocessor_3/sub, Preprocessor_4/sub, Preprocessor_5/sub, Preprocessor_6/sub, Preprocessor_7/sub, Preprocessor_8/sub, Preprocessor_9/sub, concat_1/axis)]]
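Cause: the FRCNN family of models uses keep_aspect_ratio_resizer, so images that differ in size before preprocessing still differ in size afterwards, which makes batching impossible.
A batched tensor has shape [num_batch, height, width, channels]; as soon as (height, width) differs from one example to the next, the concat fails with the error above.
***Solution:*** set batch_size to 1.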
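The shape mismatch in the error above can be reproduced with a minimal sketch of an aspect-preserving resize. This is not the library's implementation: the function name is hypothetical, the limits 600 and 1024 are assumed values of the kind found in the sample faster-rcnn configs, and the two input sizes are made up so that they land on the shapes shown in the error.

```python
def keep_aspect_ratio_shape(height, width, min_dimension=600, max_dimension=1024):
    """Approximate the (height, width) produced by an aspect-preserving resize.

    Hypothetical helper for illustration only; the default limits are assumed.
    """
    # First try to scale the shorter side to min_dimension.
    scale = min_dimension / min(height, width)
    # If the longer side would then exceed max_dimension, scale by it instead.
    if round(max(height, width) * scale) > max_dimension:
        scale = max_dimension / max(height, width)
    return (round(height * scale), round(width * scale))


# Two made-up input images with the same width but different heights.
shape_a = keep_aspect_ratio_shape(1780, 1200)
shape_b = keep_aspect_ratio_shape(1532, 1200)

# Images can only be stacked into one [num_batch, height, width, channels]
# tensor when their (height, width) agree.
can_batch = shape_a == shape_b
print(shape_a, shape_b, can_batch)
```

Because the aspect ratio is preserved, only one side is pinned to a fixed size and the other varies per image, which is why a batch of more than one image cannot be concatenated.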
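2. Multi-GPU training:
model_main.py does not support multiple GPUs, because the Estimator distribution strategies do not work with tf.contrib.slim, so it has no distributed or multi-GPU training path. According to the related issue, Keras-based code is to be merged in later.
For now, multi-GPU training therefore has to use object_detection/legacy/train.py.
Single-machine multi-GPU training then failed with the following error: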
File "/tensorflow/models/research/object_detection/legacy/trainer.py", line 180, in _create_losses
train_config.use_multiclass_scores)
ValueError: not enough values to unpack (expected 7, got 0)
Script:
python object_detection/legacy/train.py \
--pipeline_config_path=/root/research/Datasets/bak/model/pipeline.config \
--train_dir=/root/research/Datasets/bak/model/myTrain2 \
--worker_replicas=2 \
--num_clones=2 --ps_tasks=1
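Solution: batch_size was set to 1 for faster-rcnn in the previous step, but here batch_size = per_batch_size * num_gpus, so a value of 1 cannot be split across 2 GPUs. It should be set to 2. After rerunning with that value the problem was solved.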
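The arithmetic can be checked with a small sketch. It assumes, as I understand the legacy trainer, that the configured batch_size is divided by num_clones with integer division; the helper name is hypothetical and not part of the API.

```python
def per_clone_batch_size(batch_size, num_clones):
    """Return how many examples each clone (GPU) receives per step.

    Hypothetical helper; assumes the total batch is split by integer division.
    """
    if num_clones < 1:
        raise ValueError("num_clones must be at least 1")
    return batch_size // num_clones


# batch_size=1 with two clones leaves each clone with zero examples.
starved = per_clone_batch_size(batch_size=1, num_clones=2)

# batch_size=2 with two clones gives one example per clone, which also keeps
# the faster-rcnn constraint of one image per batched tensor.
healthy = per_clone_batch_size(batch_size=2, num_clones=2)
print(starved, healthy)
```

Under that assumption, the batch_size in pipeline.config is the total across GPUs, so the smallest workable value equals num_clones.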
- Error when evaluating with eval.py:
The output is as follows:
See tf.nn.softmax_cross_entropy_with_logits_v2.
Traceback (most recent call last):
File "object_detection/legacy/eval.py", line 154, in <module>
tf.app.run()
File "/root/anaconda3/envs/cvtf/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/root/anaconda3/envs/cvtf/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 306, in new_func
return func(*args, **kwargs)
File "object_detection/legacy/eval.py", line 145, in main
graph_hook_fn=graph_rewriter_fn)
File "/root/research/object_detection/legacy/evaluator.py", line 274, in evaluate
evaluator_list = get_evaluators(eval_config, categories)
File "/root/research/object_detection/legacy/evaluator.py", line 166, in get_evaluators
EVAL_METRICS_CLASS_DICT[eval_metric_fn_key](categories=categories))
File "/root/research/object_detection/utils/object_detection_evaluation.py", line 470, in __init__
use_weighted_mean_ap=False)
File "/root/research/object_detection/utils/object_detection_evaluation.py", line 194, in __init__
self._build_metric_names()
File "/root/research/object_detection/utils/object_detection_evaluation.py", line 213, in _build_metric_names
category_name = unicode(category_name, 'utf-8')
NameError: name 'unicode' is not defined
Solution:
Reason: Python 3 has no unicode type; its role is taken by str.
Replace this: category_name = unicode(category_name, 'utf-8')
with this: category_name = str(category_name, 'utf-8')
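One caveat with the replacement: in Python 3, str(value, 'utf-8') only accepts bytes-like input and raises TypeError when the value is already a str. A slightly more defensive variant could look like the sketch below; to_text is a hypothetical helper name, not something the Object Detection API provides.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding only when it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return str(value)


# Byte strings are decoded; text passes through unchanged.
decoded = to_text(b"cat")
unchanged = to_text("dog")

# The direct str(..., 'utf-8') call fails when given text instead of bytes.
try:
    str("dog", "utf-8")
    direct_call_failed = False
except TypeError:
    direct_call_failed = True

print(decoded, unchanged, direct_call_failed)
```

With such a helper, the patched line would read category_name = to_text(category_name), which works whether the label map yields bytes or text.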
- Resuming training of ssd_mobilenet_v1 from a checkpoint reported the following error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance
[[node ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance (defined at /root/research/object_detection/legacy/trainer.py:354) = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance/tag, FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance/read)]]
[[{{node BoxPredictor_0/BoxEncodingPredictor/weights/read/_341}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1416_BoxPredictor_0/BoxEncodingPredictor/weights/read", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Caused by op 'ModelVars/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance', defined at:
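***Problem:*** the code logic that determines batch_size from the GPU ID and network name had a bug and set batch_size to 1.
A related answer: https://stackoverflow.com/a/49260201/9498482
Solution: once batch_size was set to a larger value, the error no longer occurred.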