Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
https://arxiv.org/abs/1608.00859
https://github.com/yjxiong/temporal-segment-networks
https://github.com/yjxiong/tsn-pytorch
Temporal Segment Networks (TSN, two-stream)
TSN combines a sparse temporal sampling strategy with video-level supervision, so it can learn efficiently from the entire action video rather than from a single clip. At test time, the mainstream two-stream evaluation procedure is followed.
Winner of the ActivityNet 2016 challenge (93.2% mAP); it also reaches 69.4% on HMDB51 and 94.2% on UCF101.
Paper
Method
Long-range temporal dependencies in videos.
Handling motion information and fusing appearance and motion cues are the keys to solving video understanding tasks.
The two dominant approaches to action recognition are 3D convolution and two-stream networks, but both capture only short-range temporal dependencies. To cover long-range structure, they typically have to sample video clips densely (in temporal action localization, the video is decoded into frames and multi-scale sliding windows are applied: with a window of 64, for example, every 64 frames form one clip, so the video is split into many clips).
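As an illustration, this dense sliding-window clip generation can be sketched in a few lines (a hypothetical helper, not from the paper, with window = stride = 64 as in the example above):

def dense_clips(num_frames, window=64, stride=64):
    """Return [start, end) frame ranges that cover the video densely."""
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]

print(dense_clips(300))  # [(0, 64), (64, 128), (128, 192), (192, 256)]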
TSN instead samples sparsely and exploits information from the entire video (adjacent frames are highly redundant).
TSN divides the video into K = 3 segments and uniformly samples one random snippet from each segment. Each snippet is fed through the two-stream network to obtain per-class scores (the values before the softmax); the scores of the different snippets are then averaged, and a final softmax produces the prediction. In the architecture diagram, the K spatial ConvNets share their parameters, and so do the K temporal ConvNets.
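A minimal sketch of the sparse snippet sampling and the average consensus (the helper names are ours, not the repo's):

import random
import torch

def sample_snippet_indices(num_frames, num_segments=3):
    """Split the video into equal segments; pick one random frame index per segment."""
    seg_len = num_frames // num_segments  # assumes num_frames >= num_segments
    return [i * seg_len + random.randrange(seg_len) for i in range(num_segments)]

def video_score(snippet_scores):
    """Average the pre-softmax snippet scores (segmental consensus), then softmax."""
    return torch.softmax(snippet_scores.mean(dim=0), dim=-1)

scores = torch.randn(3, 101)   # 3 snippets, 101 classes (e.g. UCF101)
probs = video_score(scores)    # video-level prediction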
Details
Input
- RGB: a single frame of the video
- RGB difference: the difference between adjacent frames, a cheap way to express motion (see the sketch after this list)
- optical flow
- warped optical flow
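A minimal sketch of the RGB-difference modality (illustrative, not repo code):

import torch

def rgb_diff(frames):
    """frames: (T, C, H, W) tensor -> (T-1, C, H, W) adjacent-frame differences."""
    return frames[1:] - frames[:-1]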
Modality
Score fusion weights when combining modality predictions (a sketch of the weighted fusion follows):
- RGB / optical flow: 1 : 1.5
- RGB / optical flow / warped optical flow: 1 : 1 : 0.5
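A minimal sketch of this weighted late fusion, assuming `rgb`, `flow`, and `warped` are per-class score tensors of the same shape (the helper name is ours):

def fuse(rgb, flow, warped=None):
    if warped is None:
        return 1.0 * rgb + 1.5 * flow              # RGB : flow = 1 : 1.5
    return 1.0 * rgb + 1.0 * flow + 0.5 * warped   # 1 : 1 : 0.5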
Data augmentation
- Random cropping
- Horizontal flipping
- Corner cropping: crop from the four corners plus the center, so the network does not attend only to the central region (see the offset sketch after this list)
- Scale and ratio jittering
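A minimal sketch of the five corner-cropping offsets (top-left coordinates), assuming a square crop of size c that fits inside an h × w frame:

def corner_offsets(h, w, c):
    """Top-left (y, x) offsets for the four corner crops plus the center crop."""
    return [(0, 0), (0, w - c), (h - c, 0), (h - c, w - c),
            ((h - c) // 2, (w - c) // 2)]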
Other techniques
- Cross-modality pre-training (see the conversion sketch after this list)
  - spatial ConvNet: initialized with an ImageNet pre-trained model
  - temporal ConvNet: cross-modality pre-training, i.e. transferring the ImageNet pre-trained model to the optical-flow domain
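A minimal standalone sketch of this conversion: average the first conv layer's RGB kernels and replicate them across the flow input channels (the repo implements this in TSN._construct_flow_model; the names here are ours):

import torch
from torch import nn

def convert_conv1_to_flow(conv1, flow_channels):
    """Adapt an ImageNet-initialized first conv layer to flow input."""
    w = conv1.weight.data                    # (out, 3, kH, kW)
    w_flow = w.mean(dim=1, keepdim=True).repeat(1, flow_channels, 1, 1)
    new_conv = nn.Conv2d(flow_channels, conv1.out_channels,
                         conv1.kernel_size, conv1.stride, conv1.padding,
                         bias=conv1.bias is not None)
    new_conv.weight.data = w_flow            # averaged, replicated kernels
    if conv1.bias is not None:
        new_conv.bias.data = conv1.bias.data
    return new_conv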
- Partial BN with dropout: freeze the mean and variance parameters of all BN layers except the first (a sketch follows)
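A minimal sketch of partial BN, assuming a ResNet-style model with nn.BatchNorm2d layers (the repo does this inside TSN.train(); this standalone helper is illustrative):

from torch import nn

def freeze_bn_except_first(model):
    """Put every BatchNorm2d except the first into eval mode and stop its updates."""
    count = 0
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            count += 1
            if count > 1:  # only the first BN layer keeps adapting
                m.eval()                          # freeze running mean/var
                m.weight.requires_grad = False    # freeze affine parameters
                m.bias.requires_grad = False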
Results
PyTorch
from torch import nn

from ops.basic_ops import ConsensusModule, Identity
from transforms import *
# note: `normal`/`constant` are pre-0.4 initializers; newer PyTorch
# versions use `normal_`/`constant_` instead
from torch.nn.init import normal, constant


class TSN(nn.Module):
    def __init__(self, num_class, num_segments, modality,
                 base_model='resnet101', new_length=None,
                 consensus_type='avg', before_softmax=True,
                 dropout=0.8,
                 crop_num=1, partial_bn=True):
        super(TSN, self).__init__()
        self.modality = modality
        self.num_segments = num_segments
        self.reshape = True
        self.before_softmax = before_softmax
        self.dropout = dropout
        self.crop_num = crop_num
        self.consensus_type = consensus_type
        if not before_softmax and consensus_type != 'avg':
            raise ValueError("Only avg consensus can be used after Softmax")

        # number of consecutive frames fed to the network per snippet:
        # 1 for RGB, 5 (stacked) for optical flow / RGB difference
        if new_length is None:
            self.new_length = 1 if modality == "RGB" else 5
        else:
            self.new_length = new_length

        print(("""
Initializing TSN with base model: {}.
TSN Configurations:
    input_modality:     {}
    num_segments:       {}
    new_length:         {}
    consensus_module:   {}
    dropout_ratio:      {}
        """.format(base_model, self.modality, self.num_segments, self.new_length, consensus_type, self.dropout)))

        self._prepare_base_model(base_model)

        feature_dim = self._prepare_tsn(num_class)

        # cross-modality pre-training: adapt the ImageNet-initialized model
        # to the input channels of the chosen modality
        if self.modality == 'Flow':
            print("Converting the ImageNet model to a flow init model")
            self.base_model = self._construct_flow_model(self.base_model)
            print("Done. Flow model ready...")
        elif self.modality == 'RGBDiff':
            print("Converting the ImageNet model to RGB+Diff init model")
            self.base_model = self._construct_diff_model(self.base_model)
            print("Done. RGBDiff model ready.")

        # segmental consensus: aggregates the per-snippet scores (e.g. average)
        self.consensus = ConsensusModule(consensus_type)

        if not self.before_softmax:
            self.softmax = nn.Softmax()

        self._enable_pbn = partial_bn
        if partial_bn:
            self.partialBN(True)

    def _prepare_tsn(self, num_class):
        # replace the base model's classifier head with a new num_class-way
        # linear layer; with dropout > 0, a Dropout layer takes the old head's
        # place and the new layer is kept separately in self.new_fc
        feature_dim = getattr(self.base_model, self.base_model.last_layer_name).in_features
        if self.dropout == 0:
            setattr(self.base_model, self.base_model.last_layer_name, nn.Linear(feature_dim, num_class))
            self.new_fc = None
        else:
            setattr(self.base_model, self.base_model.last_layer_name, nn.Dropout(p=self.dropout))
            self.new_fc = nn.Linear(feature_dim, num_class)

        # initialize the new classifier with small Gaussian weights, zero bias
        std = 0.001
        if self.new_fc is None:
            normal(getattr(self.base_model, self.base_model.last_layer_name).weight, 0, std)
            constant(getattr(self.base_model, self.base_model.last_layer_name).bias, 0)
        else:
            normal(self.new_fc.weight, 0, std)
            constant(self.new_fc.bias, 0)

        return feature_dim
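A hypothetical usage sketch (the forward pass and the remaining methods such as _prepare_base_model and _construct_flow_model live further down in models.py and are omitted here):

model = TSN(num_class=101, num_segments=3, modality='RGB',
            base_model='resnet101', consensus_type='avg')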