10-2 [mmaction2 commercial-grade action recognition] Modifying the video frame IDs of the AVA dataset and analyzing how the SlowFast training data is fed in

Official mmaction2 GitHub: https://github.com/open-mmlab/mmaction2
GPU platform: https://cloud.videojj.com/auth/register?inviter=18452&activityChannel=student_invite
mmaction2 documentation: https://mmaction2.readthedocs.io/zh_CN/latest/faq.html?highlight=start_index#id3

Links to the posts in this series

00 [mmaction2 commercial-grade action recognition] Quickly setting up mmaction2 with PyTorch 1.6.0 and PyTorch 1.8.0

03 [mmaction2 commercial-grade action recognition] Using mmaction to build Faster R-CNN for batch detection on images, with output in VIA format

04 [mmaction2 commercial-grade action recognition] Using YOLOv3 to detect people for the SlowFast detection algorithm

!!! To be released after the paper is published !!! 05 [mmaction2 commercial-grade action recognition] Fusing SlowFast with YOLOv5 (i.e., the detection part uses YOLOv5)

!!! To be released after the paper is published !!! 06 [mmaction2 commercial-grade action recognition] Fusing SlowFast with YOLOv5 and DeepSORT (i.e., the tracking part uses DeepSORT)

!!! To be released after the paper is published !!! 07 [mmaction2 commercial-grade action recognition] Replacing YOLOv5 with yolov5-crowdhuman

08 [mmaction2 commercial-grade action recognition] Building a custom AVA dataset: cropping videos into frames

!!! To be released after the paper is published !!! 10-1 [mmaction2 commercial-grade action recognition] A scaled-down AVA dataset and SlowFast training

!!! To be released after the paper is published !!! 10-2 [mmaction2 commercial-grade action recognition] Modifying the frame IDs of the AVA dataset and analyzing how the SlowFast training data is fed in

12 [mmaction2 commercial-grade action recognition] Reproducing X3D, demo implementation, detecting your own videos: Expanding Architectures for Efficient Video Recognition

AVA Annotation Explained

From: [Doc] AVA annotations explained: https://github.com/open-mmlab/mmaction2/pull/1097/commits/d7a61f7ed6fdd9326affe8d8bca04cd15610a931

The content below is copied directly from that PR.

In this section, we explain the annotation format of AVA in detail:

mmaction2
├── data
│   ├── ava
│   │   ├── annotations
│   │   |   ├── ava_dense_proposals_train.FAIR.recall_93.9.pkl
│   │   |   ├── ava_dense_proposals_val.FAIR.recall_93.9.pkl
│   │   |   ├── ava_dense_proposals_test.FAIR.recall_93.9.pkl
│   │   |   ├── ava_train_v2.1.csv
│   │   |   ├── ava_val_v2.1.csv
│   │   |   ├── ava_train_excluded_timestamps_v2.1.csv
│   │   |   ├── ava_val_excluded_timestamps_v2.1.csv
│   │   |   ├── ava_action_list_v2.1_for_activitynet_2018.pbtxt

The proposals generated by human detectors

In the annotation folder, ava_dense_proposals_[train/val/test].FAIR.recall_93.9.pkl are human proposals generated by a human detector. They are used in training, validation and testing respectively. Take ava_dense_proposals_train.FAIR.recall_93.9.pkl as an example: it is a dictionary of size 203626. Each key consists of the videoID and the timestamp. For example, the key -5KQ66BBWC4,0902 means the values are the detection results for the frame at the 902nd second in the video -5KQ66BBWC4. The values in the dictionary are numpy arrays with shape N × 5, where N is the number of detected human bounding boxes in the corresponding frame. The format of a bounding box is [x1, y1, x2, y2, score], with 0 ≤ x1, y1, x2, y2, score ≤ 1. (x1, y1) is the top-left corner of the bounding box and (x2, y2) is the bottom-right corner; (0, 0) is the top-left corner of the image, while (1, 1) is the bottom-right corner of the image.
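
For instance, a quick way to inspect this structure (a small sketch; the path follows the directory layout above):

import pickle

# Load the proposal file and look at one entry.
with open('data/ava/annotations/ava_dense_proposals_train.FAIR.recall_93.9.pkl', 'rb') as f:
    proposals = pickle.load(f, encoding='iso-8859-1')

print(len(proposals))                # 203626 entries
key = '-5KQ66BBWC4,0902'             # "videoID,timestamp"
print(proposals[key].shape)          # (N, 5): x1, y1, x2, y2, score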

The ground-truth labels for spatio-temporal action detection

In the annotation folder, ava_[train/val]_v[2.1/2.2].csv are ground-truth labels for spatio-temporal action detection, which are used during training & validation. Take ava_train_v2.1.csv as an example: it is a csv file with 837318 lines, and each line is the annotation for a human instance in one frame. For example, the first line in ava_train_v2.1.csv is '-5KQ66BBWC4,0902,0.077,0.151,0.283,0.811,80,1': the first two items -5KQ66BBWC4 and 0902 indicate that it corresponds to the 902nd second in the video -5KQ66BBWC4. The next four items [0.077 (x1), 0.151 (y1), 0.283 (x2), 0.811 (y2)] give the location of the bounding box; the bbox format is the same as for the human proposals. The next item 80 is the action label. The last item 1 is the ID of this bounding box.
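
As a small illustration, the sketch below parses such annotation rows into named fields (the path follows the directory layout above):

import csv

# Each row: video_id, timestamp, x1, y1, x2, y2, action_id, entity_id
with open('data/ava/annotations/ava_train_v2.1.csv') as f:
    for video_id, timestamp, x1, y1, x2, y2, action_id, entity_id in csv.reader(f):
        bbox = [float(x1), float(y1), float(x2), float(y2)]   # normalized to [0, 1]
        print(video_id, int(timestamp), bbox, int(action_id), int(entity_id))
        break  # only show the first row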

Excluded timestamps

ava_[train/val]_excluded_timestamps_v[2.1/2.2].csv contains excluded timestamps which are not used during training or validation. The format is video_id, second_idx.

Label map

ava_action_list_v[2.1/2.2]_for_activitynet_[2018/2019].pbtxt contains the label map of the AVA dataset, which maps the action name to the label index.
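
For reference, a minimal parsing sketch (assuming the label map uses the usual name: / id: or label_id: line layout; the real parser in mmaction2 may differ):

def read_labelmap(path):
    # Parse an AVA label map .pbtxt into {label index: action name}.
    labelmap = {}
    name = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith('name:'):
                name = line.split('"')[1]
            elif line.startswith(('label_id:', 'id:')):
                labelmap[int(line.split(':')[1])] = name
    return labelmap

labels = read_labelmap('data/ava/annotations/ava_action_list_v2.1_for_activitynet_2018.pbtxt')
print(len(labels), labels.get(80))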

Some issues

What is the function of the proposals in ava_dense_proposals_train.FAIR.recall_93.9.pkl when training SlowFast?

It provides the human proposal boxes for AVA videos. During training, we use the proposal boxes and RoIAlign to obtain instance-level features.

The proposal boxes are already provided in train.csv. Are the proposal boxes provided again by the pkl file to improve recognition accuracy?

Not exactly. Boxes provided in the CSV files are ground truth (annotated by humans); boxes provided in the pickle files are proposals (predicted by detectors).

So what is the specific role of this pkl file?

The pkl files contain the proposals; we use the proposal boxes and RoIAlign to obtain instance-level features during training and testing. We cannot assume ground-truth bounding boxes are available, so we need to use proposal boxes for training and testing.

What is the function of the proposals in ava_dense_proposals_train.FAIR.recall_93.9.pkl when training SlowFast? Are they helpful for training the SlowFast model?

The proposals were generated by a person detector with a 93.9% recall.
AVA is a spatiotemporal detection dataset centered on human behavior; the proposals are used to determine the spatial positions of people.

But in the AVA dataset there are already person boxes in the CSV, which look similar to the proposals,
so I am confused about this.
Can you explain it with specific code? That would help in understanding the function of the proposals.

The person boxes in the AVA dataset are the ground truth; we cannot use them when testing accuracy.

On this spatiotemporal dataset, our task is to find the position of the persons in each keyframe and recognize the actions they are performing.
Many current methods split this into two tasks: detecting the person and recognizing the behavior; the former is done by a person detector, and the latter by a video understanding model.
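
To make the answers above concrete, here is a rough, illustrative sketch of the idea: keep the detector proposals above a confidence threshold and hand those boxes (not the ground truth) to RoIAlign to pool instance-level features. The 0.9 threshold mirrors the person_det_score_thr=0.9 used in the configs later in this post; the actual filtering logic inside mmaction2's AVADataset and RoI head may differ in detail.

import numpy as np

# Detector proposals for one keyframe: x1, y1, x2, y2, score (all in [0, 1]).
proposals = np.array([[0.003, 0.125, 0.119, 0.837, 0.742],
                      [0.626, 0.153, 0.797, 0.838, 0.987],
                      [0.706, 0.105, 0.787, 0.310, 0.109]])

keep = proposals[:, 4] >= 0.9     # cf. person_det_score_thr=0.9
rois = proposals[keep, :4]        # these boxes are what RoIAlign pools features from
print(rois)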

Modifying the video frame IDs of the AVA dataset

In this part, I modify the video frame IDs to get a deeper understanding of how the AVA dataset is put together.

Modifying ava_train

Below is a portion of the AVA dataset:

[screenshot: sample rows of the ava_train CSV]
Column 1: the video name
Column 2: the video frame ID; for example, the frame at 15:02 is written as 902 and the frame at 15:03 as 903 (this is the part I was wondering about: when building a custom dataset, can this 902 be changed to 2?)
Columns 3 to 6: the person's coordinates (x1, y1, x2, y2)
Column 7: the action class ID
Column 8: the person ID

When we build our own AVA-style dataset, we will not start from the 900th second the way the official data does; we usually start from second 0.
Following this thought about column 2, I used the code below to subtract 900 from every value in that column. Code and result:
Code:

import csv

minCsv2 = []
# Read the mini train CSV and shift every timestamp (column 2) back by 900 seconds.
with open('./data/ava/annotations/ava_train_v2.2_mini.csv', 'r') as db01:
    for row in csv.reader(db01):
        row[1] = str(int(row[1]) - 900)
        minCsv2.append(row)

# Write the shifted rows out as a new annotation file.
with open('./data/ava/annotations/ava_train_v2.2_mini2.csv', "w") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(minCsv2)

[screenshot: the resulting CSV, with the frame IDs now starting from 2]
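
As a quick sanity check (a small sketch over the file just written), you can read the new CSV back and confirm the timestamp range:

import csv

# Read back the shifted CSV and report the timestamp range.
with open('./data/ava/annotations/ava_train_v2.2_mini2.csv') as f:
    timestamps = [int(row[1]) for row in csv.reader(f)]
print('min/max timestamp:', min(timestamps), max(timestamps))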

Scaling down ava_dense_proposals_train

Below is part of the content of ava_dense_proposals_train.FAIR.recall_93.9.pkl:

-5KQ66BBWC4,0902 [[0.003    0.125    0.119    0.837    0.742486]
 [0.626    0.153    0.797    0.838    0.987177]
 [0.326    0.185    0.47     0.887    0.996382]
 [0.508    0.117    0.648    0.777    0.903317]
 [0.222    0.031    0.362    0.529    0.983264]
 [0.108    0.143    0.283    0.805    0.547549]
 [0.773    0.143    0.862    0.351    0.82769 ]
 [0.706    0.105    0.787    0.31     0.108642]
 [0.805    0.289    0.997    0.991    0.983301]
 [0.852    0.175    0.929    0.335    0.178122]]
-5KQ66BBWC4,0903 [[0.516    0.134    0.659    0.788    0.995238]
 [0.628    0.163    0.781    0.84     0.996272]
 [0.326    0.172    0.489    0.895    0.999214]
 [0.145    0.161    0.301    0.831    0.9853  ]
 [0.876    0.157    0.993    0.443    0.871956]
 [0.736    0.113    0.815    0.357    0.716372]
 [0.8      0.293    0.997    0.961    0.94542 ]
 [0.791    0.16     0.879    0.501    0.503257]
 [0.522    0.137    0.656    0.373    0.083969]
 [0.838    0.13     0.995    0.574    0.090968]
 [0.552    0.115    0.669    0.304    0.106534]
 [0.233    0.024    0.362    0.522    0.9284  ]
 [0.009    0.183    0.147    0.84     0.991572]
 [0.781    0.146    0.865    0.381    0.271116]
 [0.592    0.062    0.682    0.265    0.34697 ]]
-5KQ66BBWC4,0904 [[0.215    0.018    0.988    0.991    0.999776]]
-5KQ66BBWC4,0905 [[0.192    0.072    0.396    0.971    0.990042]
 [0.391    0.033    0.552    0.625    0.994892]
 [0.607    0.062    0.814    0.976    0.995895]
 [0.852    0.079    0.998    0.892    0.950123]
 [0.059    0.076    0.227    0.893    0.976499]
 [0.287    0.086    0.389    0.302    0.41169 ]]

We first also scale down ava_dense_proposals_train.FAIR.recall_93.9.pkl, keeping only 2 of the videos. The code is as follows:

# Extract the proposals of the specified videos from the pkl.
import pickle

videos = ["053oq2xB3oU", "Ytga8ciKWJc"]
minPkl = {}

with open('ava_dense_proposals_train.FAIR.recall_93.9.pkl', 'rb') as f:
    info = pickle.load(f, encoding='iso-8859-1')

for i in info:
    # Keys look like "videoID,timestamp"; keep only the chosen videos.
    name, vID = i.split(',')
    if name in videos:
        minPkl[i] = info[i]

with open('ava_dense_proposals_train_mini.pkl', "wb") as pklfile:
    pickle.dump(minPkl, pklfile)

After running it, we get a proposals file that contains only "053oq2xB3oU" and "Ytga8ciKWJc".

Modifying ava_dense_proposals_train

Here, what we modify is again the video frame ID: subtract 900 from every frame ID so that the IDs start from 2. The code is as follows:

import pickle

minPkl = {}

# Shift the timestamp part of every "videoID,timestamp" key back by 900 seconds.
with open('ava_dense_proposals_train_mini.pkl', 'rb') as f:
    info = pickle.load(f, encoding='iso-8859-1')
    for i in info:
        name, vID = i.split(',')
        vID = str(int(vID) - 900)
        key = name + ',' + vID
        minPkl[key] = info[i]

with open('ava_dense_proposals_train_mini2.pkl', "wb") as pklfile:
    pickle.dump(minPkl, pklfile)

Below is part of the content of ava_dense_proposals_train_mini2.pkl:

053oq2xB3oU,2 [[0.498    0.357    0.586    0.543    0.238802]
 [0.495    0.229    0.864    0.792    0.876006]
 [0.711    0.233    0.947    0.838    0.839101]
 [0.509    0.218    0.766    0.622    0.221814]
 [0.711    0.254    0.861    0.619    0.397694]
 [0.009    0.125    0.657    0.887    0.993884]
 [0.006    0.229    0.22     0.839    0.160506]]
053oq2xB3oU,3 [[0.811    0.234    0.98     0.856    0.948427]
 [0.373    0.264    0.564    0.604    0.992532]
 [0.481    0.226    0.785    0.825    0.845053]
 [0.709    0.247    0.893    0.876    0.893365]
 [0.332    0.203    0.893    0.896    0.300915]
 [0.409    0.231    0.677    0.713    0.142233]
 [0.008    0.126    0.538    0.907    0.98535 ]]
053oq2xB3oU,4 [[0.446    0.205    0.818    0.872    0.901434]
 [0.016    0.13     0.504    0.875    0.783437]
 [0.711    0.231    0.918    0.862    0.943851]
 [0.277    0.247    0.566    0.677    0.939575]
 [0.215    0.212    0.612    0.865    0.274805]
 [0.367    0.238    0.692    0.798    0.168288]
 [0.839    0.222    0.992    0.856    0.973397]
 [0.006    0.163    0.264    0.844    0.073463]]

Modifying the config file

There are 2 main changes to the config file: one concerns how the annotation files are loaded, and the other is adding start_index.

[screenshots: the two modified parts of the config]
The config file, my_slowfast_kinetics_pretrained_r50_4x16x1_20e_ava_rgb2.py, is as follows:

# model setting
model = dict(
    type='FastRCNN',
    backbone=dict(
        type='ResNet3dSlowFast',
        pretrained=None,
        resample_rate=8,
        speed_ratio=8,
        channel_ratio=8,
        slow_pathway=dict(
            type='resnet3d',
            depth=50,
            pretrained=None,
            lateral=True,
            conv1_kernel=(1, 7, 7),
            dilations=(1, 1, 1, 1),
            conv1_stride_t=1,
            pool1_stride_t=1,
            inflate=(0, 0, 1, 1),
            spatial_strides=(1, 2, 2, 1)),
        fast_pathway=dict(
            type='resnet3d',
            depth=50,
            pretrained=None,
            lateral=False,
            base_channels=8,
            conv1_kernel=(5, 7, 7),
            conv1_stride_t=1,
            pool1_stride_t=1,
            spatial_strides=(1, 2, 2, 1))),
    roi_head=dict(
        type='AVARoIHead',
        bbox_roi_extractor=dict(
            type='SingleRoIExtractor3D',
            roi_layer_type='RoIAlign',
            output_size=8,
            with_temporal_pool=True),
        bbox_head=dict(
            type='BBoxHeadAVA',
            in_channels=2304,
            num_classes=81,
            multilabel=True,
            dropout_ratio=0.5)),
    train_cfg=dict(
        rcnn=dict(
            assigner=dict(
                type='MaxIoUAssignerAVA',
                pos_iou_thr=0.9,
                neg_iou_thr=0.9,
                min_pos_iou=0.9),
            sampler=dict(
                type='RandomSampler',
                num=32,
                pos_fraction=1,
                neg_pos_ub=-1,
                add_gt_as_proposals=True),
            pos_weight=1.0,
            debug=False)),
    test_cfg=dict(rcnn=dict(action_thr=0.002)))

dataset_type = 'AVADataset'
data_root = 'data/ava/rawframes'
anno_root = 'data/ava/annotations'


#ann_file_train = f'{anno_root}/ava_train_v2.1.csv'
ann_file_train = f'{anno_root}/ava_train_v2.2_mini2.csv'
#ann_file_val = f'{anno_root}/ava_val_v2.1.csv'
ann_file_val = f'{anno_root}/ava_train_v2.2_mini2.csv'

#exclude_file_train = f'{anno_root}/ava_train_excluded_timestamps_v2.1.csv'
#exclude_file_val = f'{anno_root}/ava_val_excluded_timestamps_v2.1.csv'

exclude_file_train = f'{anno_root}/ava_train_excluded_timestamps_v2.2.csv'
exclude_file_val = f'{anno_root}/ava_val_excluded_timestamps_v2.2.csv'

#label_file = f'{anno_root}/ava_action_list_v2.1_for_activitynet_2018.pbtxt'
label_file = f'{anno_root}/ava_action_list_v2.2_for_activitynet_2019.pbtxt'

#proposal_file_train = (f'{anno_root}/ava_dense_proposals_train.FAIR.'
#                       'recall_93.9.pkl')
proposal_file_train = (f'{anno_root}/ava_dense_proposals_train_mini2.pkl')

#proposal_file_val = f'{anno_root}/ava_dense_proposals_val.FAIR.recall_93.9.pkl'
proposal_file_val = f'{anno_root}/ava_dense_proposals_train_mini2.pkl'

img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_bgr=False)

train_pipeline = [
    dict(type='SampleAVAFrames', clip_len=32, frame_interval=2),
    dict(type='RawFrameDecode'),
    dict(type='RandomRescale', scale_range=(256, 320)),
    dict(type='RandomCrop', size=256),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW', collapse=True),
    # Rename is needed to use mmdet detectors
    dict(type='Rename', mapping=dict(imgs='img')),
    dict(type='ToTensor', keys=['img', 'proposals', 'gt_bboxes', 'gt_labels']),
    dict(
        type='ToDataContainer',
        fields=[
            dict(key=['proposals', 'gt_bboxes', 'gt_labels'], stack=False)
        ]),
    dict(
        type='Collect',
        keys=['img', 'proposals', 'gt_bboxes', 'gt_labels'],
        meta_keys=['scores', 'entity_ids'])
]
# The testing is w/o. any cropping / flipping
val_pipeline = [
    dict(type='SampleAVAFrames', clip_len=32, frame_interval=2),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW', collapse=True),
    # Rename is needed to use mmdet detectors
    dict(type='Rename', mapping=dict(imgs='img')),
    dict(type='ToTensor', keys=['img', 'proposals']),
    dict(type='ToDataContainer', fields=[dict(key='proposals', stack=False)]),
    dict(
        type='Collect',
        keys=['img', 'proposals'],
        meta_keys=['scores', 'img_shape'],
        nested=True)
]

data = dict(
    #videos_per_gpu=9,
    #workers_per_gpu=2,
    videos_per_gpu=5,
    workers_per_gpu=2,
    val_dataloader=dict(videos_per_gpu=1),
    test_dataloader=dict(videos_per_gpu=1),
    train=dict(
        type=dataset_type,
        ann_file=ann_file_train,
        exclude_file=exclude_file_train,
        pipeline=train_pipeline,
        label_file=label_file,
        proposal_file=proposal_file_train,
        person_det_score_thr=0.9,
        data_prefix=data_root,
        start_index=1,),
    val=dict(
        type=dataset_type,
        ann_file=ann_file_val,
        exclude_file=exclude_file_val,
        pipeline=val_pipeline,
        label_file=label_file,
        proposal_file=proposal_file_val,
        person_det_score_thr=0.9,
        data_prefix=data_root,
        start_index=1,))
data['test'] = data['val']

optimizer = dict(type='SGD', lr=0.1125, momentum=0.9, weight_decay=0.00001)
# this lr is used for 8 gpus

optimizer_config = dict(grad_clip=dict(max_norm=40, norm_type=2))
# learning policy

lr_config = dict(
    policy='step',
    step=[10, 15],
    warmup='linear',
    warmup_by_epoch=True,
    warmup_iters=5,
    warmup_ratio=0.1)
total_epochs = 20
checkpoint_config = dict(interval=1)
workflow = [('train', 1)]
evaluation = dict(interval=1, save_best='mAP@0.5IOU')
log_config = dict(
    interval=20, hooks=[
        dict(type='TextLoggerHook'),
    ])
dist_params = dict(backend='nccl')
log_level = 'INFO'
work_dir = ('./work_dirs/ava/'
            'slowfast_kinetics_pretrained_r50_4x16x1_20e_ava_rgb')
load_from = ('https://download.openmmlab.com/mmaction/recognition/slowfast/'
             'slowfast_r50_4x16x1_256e_kinetics400_rgb/'
             'slowfast_r50_4x16x1_256e_kinetics400_rgb_20200704-bcde7ed7.pth')
resume_from = None
find_unused_parameters = False

Training

The command is as follows:

python tools/train.py configs/detection/ava/my_slowfast_kinetics_pretrained_r50_4x16x1_20e_ava_rgb2.py --validate

Then let us check whether it can train normally.

Modifying the training config file

The config file:

# model setting
model = dict(
    type='FastRCNN',
    backbone=dict(
        type='ResNet3dSlowFast',
        pretrained=None,
        resample_rate=8,
        speed_ratio=8,
        channel_ratio=8,
        slow_pathway=dict(
            type='resnet3d',
            depth=50,
            pretrained=None,
            lateral=True,
            conv1_kernel=(1, 7, 7),
            dilations=(1, 1, 1, 1),
            conv1_stride_t=1,
            pool1_stride_t=1,
            inflate=(0, 0, 1, 1),
            spatial_strides=(1, 2, 2, 1)),
        fast_pathway=dict(
            type='resnet3d',
            depth=50,
            pretrained=None,
            lateral=False,
            base_channels=8,
            conv1_kernel=(5, 7, 7),
            conv1_stride_t=1,
            pool1_stride_t=1,
            spatial_strides=(1, 2, 2, 1))),
    roi_head=dict(
        type='AVARoIHead',
        bbox_roi_extractor=dict(
            type='SingleRoIExtractor3D',
            roi_layer_type='RoIAlign',
            output_size=8,
            with_temporal_pool=True),
        bbox_head=dict(
            type='BBoxHeadAVA',
            in_channels=2304,
            num_classes=81,
            multilabel=True,
            dropout_ratio=0.5)),
    train_cfg=dict(
        rcnn=dict(
            assigner=dict(
                type='MaxIoUAssignerAVA',
                pos_iou_thr=0.9,
                neg_iou_thr=0.9,
                min_pos_iou=0.9),
            sampler=dict(
                type='RandomSampler',
                num=32,
                pos_fraction=1,
                neg_pos_ub=-1,
                add_gt_as_proposals=True),
            pos_weight=1.0,
            debug=False)),
    test_cfg=dict(rcnn=dict(action_thr=0.002)))

dataset_type = 'AVADataset'
data_root = 'data/ava/rawframes'
anno_root = 'data/ava/annotations'


#ann_file_train = f'{anno_root}/ava_train_v2.1.csv'
ann_file_train = f'{anno_root}/ava_train_v2.2_mini2.csv'
#ann_file_val = f'{anno_root}/ava_val_v2.1.csv'
ann_file_val = f'{anno_root}/ava_train_v2.2_mini2.csv'

#exclude_file_train = f'{anno_root}/ava_train_excluded_timestamps_v2.1.csv'
#exclude_file_val = f'{anno_root}/ava_val_excluded_timestamps_v2.1.csv'

exclude_file_train = f'{anno_root}/ava_train_excluded_timestamps_v2.2.csv'
exclude_file_val = f'{anno_root}/ava_val_excluded_timestamps_v2.2.csv'

#label_file = f'{anno_root}/ava_action_list_v2.1_for_activitynet_2018.pbtxt'
label_file = f'{anno_root}/ava_action_list_v2.2_for_activitynet_2019.pbtxt'

proposal_file_train = (f'{anno_root}/ava_dense_proposals_train.FAIR.'
                       'recall_93.9.pkl')
proposal_file_val = f'{anno_root}/ava_dense_proposals_val.FAIR.recall_93.9.pkl'

img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_bgr=False)

train_pipeline = [
    dict(type='SampleAVAFrames', clip_len=32, frame_interval=2),
    dict(type='RawFrameDecode'),
    dict(type='RandomRescale', scale_range=(256, 320)),
    dict(type='RandomCrop', size=256),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW', collapse=True),
    # Rename is needed to use mmdet detectors
    dict(type='Rename', mapping=dict(imgs='img')),
    dict(type='ToTensor', keys=['img', 'proposals', 'gt_bboxes', 'gt_labels']),
    dict(
        type='ToDataContainer',
        fields=[
            dict(key=['proposals', 'gt_bboxes', 'gt_labels'], stack=False)
        ]),
    dict(
        type='Collect',
        keys=['img', 'proposals', 'gt_bboxes', 'gt_labels'],
        meta_keys=['scores', 'entity_ids'])
]
# The testing is w/o. any cropping / flipping
val_pipeline = [
    dict(type='SampleAVAFrames', clip_len=32, frame_interval=2),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW', collapse=True),
    # Rename is needed to use mmdet detectors
    dict(type='Rename', mapping=dict(imgs='img')),
    dict(type='ToTensor', keys=['img', 'proposals']),
    dict(type='ToDataContainer', fields=[dict(key='proposals', stack=False)]),
    dict(
        type='Collect',
        keys=['img', 'proposals'],
        meta_keys=['scores', 'img_shape'],
        nested=True)
]

data = dict(
    #videos_per_gpu=9,
    #workers_per_gpu=2,
    videos_per_gpu=5,
    workers_per_gpu=2,
    val_dataloader=dict(videos_per_gpu=1),
    test_dataloader=dict(videos_per_gpu=1),
    train=dict(
        type=dataset_type,
        ann_file=ann_file_train,
        exclude_file=exclude_file_train,
        pipeline=train_pipeline,
        label_file=label_file,
        proposal_file=proposal_file_train,
        person_det_score_thr=0.9,
        data_prefix=data_root),
    val=dict(
        type=dataset_type,
        ann_file=ann_file_val,
        exclude_file=exclude_file_val,
        pipeline=val_pipeline,
        label_file=label_file,
        proposal_file=proposal_file_val,
        person_det_score_thr=0.9,
        data_prefix=data_root))
data['test'] = data['val']

optimizer = dict(type='SGD', lr=0.1125, momentum=0.9, weight_decay=0.00001)
# this lr is used for 8 gpus

optimizer_config = dict(grad_clip=dict(max_norm=40, norm_type=2))
# learning policy

lr_config = dict(
    policy='step',
    step=[10, 15],
    warmup='linear',
    warmup_by_epoch=True,
    warmup_iters=5,
    warmup_ratio=0.1)
total_epochs = 20
checkpoint_config = dict(interval=1)
workflow = [('train', 1)]
evaluation = dict(interval=1, save_best='mAP@0.5IOU')
log_config = dict(
    interval=20, hooks=[
        dict(type='TextLoggerHook'),
    ])
dist_params = dict(backend='nccl')
log_level = 'INFO'
work_dir = ('./work_dirs/ava/'
            'slowfast_kinetics_pretrained_r50_4x16x1_20e_ava_rgb')
load_from = ('https://download.openmmlab.com/mmaction/recognition/slowfast/'
             'slowfast_r50_4x16x1_256e_kinetics400_rgb/'
             'slowfast_r50_4x16x1_256e_kinetics400_rgb_20200704-bcde7ed7.pth')
resume_from = None
find_unused_parameters = False

The training command:

python tools/train.py configs/detection/ava/my_slowfast_kinetics_pretrained_r50_4x16x1_20e_ava_rgb2.py --validate

Problems I ran into during implementation and how I solved them (can be skipped)

But I found that the following problem appeared again (according to what I found online, it means the dataset annotations do not match the images):
[screenshot: the error message]

I guessed that the error came from these three files:
ava_dense_proposals_test.FAIR.recall_93.9.pkl
ava_dense_proposals_train.FAIR.recall_93.9.pkl
ava_dense_proposals_val.FAIR.recall_93.9.pkl

So I printed them out and found that these are the person boxes, which also contain the video frame IDs and the person coordinates, so they too need to be scaled down and have their frame IDs start from 2.

After running it, the same error as above still appeared. Tracing it, I found the spot in
/home/mmaction2/mmaction/datasets/pipelines/loading.py:
[screenshot: the print statement added in loading.py]
The printed result is:

results {'frame_dir': '/home/mmaction2/data/ava/rawframes/053oq2xB3oU', 'video_id': '053oq2xB3oU', 'timestamp': 319, 'img_key': '053oq2xB3oU,0319', 'shot_info': (0, 27000), 'fps': 30, 'filename_tmpl': 'img_{:05}.jpg', 'modality': 'RGB', 'start_index': 0, 'timestamp_start': 900, 'timestamp_end': 1800, 'proposals': array([[0, 0, 1, 1]]), 'scores': array([1]), 'gt_bboxes': array([[0.024, 0.167, 0.702, 0.861]]), 'gt_labels': array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0.]], dtype=float32), 'entity_ids': array([393]), 'frame_inds': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 'clip_len': 32, 'frame_interval': 2, 'num_clips': 1, 'crop_quadruple': array([0., 0., 1., 1.], dtype=float32)}

The most important parts here are 'timestamp_start': 900 and 'timestamp_end': 1800; we need to change them to 'timestamp_start': 0 and 'timestamp_end': 900.
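
To see why this matters, here is a purely illustrative sketch of how an AVA-style frame sampler turns the annotated timestamp into frame indices (a simplification of what SampleAVAFrames does, not the actual mmaction2 code). With timestamp=319 but timestamp_start=900, the center index goes negative and everything is clipped to 0, which matches the all-zero frame_inds printed above:

import numpy as np

def sample_ava_frames(timestamp, timestamp_start, fps=30, clip_len=32,
                      frame_interval=2, shot_info=(0, 27000)):
    # Center the clip on the annotated keyframe, measured in frames from
    # the start of the extracted frame sequence.
    center_index = fps * (timestamp - timestamp_start)
    start = center_index - (clip_len // 2) * frame_interval
    inds = np.arange(start, start + clip_len * frame_interval, frame_interval)
    # Indices that fall outside the shot are clipped, not dropped.
    return np.clip(inds, shot_info[0], shot_info[1] - 1)

print(sample_ava_frames(timestamp=319, timestamp_start=900))      # all zeros
print(sample_ava_frames(timestamp=319, timestamp_start=0)[:5])    # sensible indices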

Tracing the code backwards, we find that results is defined in /home/mmaction2/mmaction/datasets/ava_dataset.py:

[screenshot: where results is built in ava_dataset.py]

But the value is not given directly there; it is just a parameter being passed along: self.start_index.
Tracing further back to /home/mmaction2/mmaction/datasets/base.py, there is still no direct answer, only another clue.

[screenshot: the start_index parameter in base.py]
I tried tracing it in several ways but, probably because my skills are still limited, I did not manage to find where the value of start_index comes from. Looking around, I noticed that /home/mmaction2/mmaction/datasets/ava_dataset.py also contains this start_index:
[screenshot: start_index in ava_dataset.py]

This still did not answer where start_index comes from, so I turned to the mmaction2 documentation:
https://mmaction2.readthedocs.io/zh_CN/latest/faq.html?highlight=start_index#id3

The documentation says:

FileNotFound errors such as No such file or directory: xxx/xxx/img_00300.jpg
In MMAction2, the default value of start_index is 1 for raw-frame datasets and 0 for video datasets. If a FileNotFound error occurs on the first or last frame of a video, adjust the start_index in the data pipeline of the config file according to the offset of the first frame of the video (i.e., xxx_00000.jpg or xxx_00001.jpg).
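
In other words, the decoded filename is essentially filename_tmpl formatted with start_index plus the sampled frame index, so if the extracted frames start at img_00001.jpg, a clipped frame index of 0 combined with start_index=0 points at a non-existent img_00000.jpg. A tiny illustration (a simplification of what RawFrameDecode does, not the real code):

filename_tmpl = 'img_{:05}.jpg'
frame_ind = 0  # the clipped index from the sampler sketch above

for start_index in (0, 1):
    print(start_index, '->', filename_tmpl.format(frame_ind + start_index))
# 0 -> img_00000.jpg  (does not exist, so FileNotFound is raised)
# 1 -> img_00001.jpg  (the first extracted frame)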

So I made the change in the config, at /home/mmaction2/configs/detection/ava/my_slowfast_kinetics_pretrained_r50_4x16x1_20e_ava_rgb2.py:

I added start_index=1 as shown below.
[screenshot: the config with start_index=1 added]
After that, running the training code worked without problems.

[screenshot: training running normally]

These three files come from Step 5 (Fetch Proposal Files) of the AVA data preparation in mmaction2: https://github.com/open-mmlab/mmaction2/blob/master/tools/data/ava/README.md
fetch_ava_proposals.sh:

#!/usr/bin/env bash

set -e

DATA_DIR="../../../data/ava/annotations"

wget https://download.openmmlab.com/mmaction/dataset/ava/ava_dense_proposals_train.FAIR.recall_93.9.pkl -P ${DATA_DIR}
wget https://download.openmmlab.com/mmaction/dataset/ava/ava_dense_proposals_val.FAIR.recall_93.9.pkl -P ${DATA_DIR}
wget https://download.openmmlab.com/mmaction/dataset/ava/ava_dense_proposals_test.FAIR.recall_93.9.pkl -P ${DATA_DIR}

So, to find out how these three files were produced, we need to look to Long-Term Feature Banks for the answer.

I will explore Long-Term Feature Banks in a separate post.
xxxxxxxxx

The links below describe the role of these three files:
https://github.com/open-mmlab/mmaction2/issues/1336
https://github.com/open-mmlab/mmaction2/issues/729
