Spatiotemporal Residual Networks for Video Action Recognition

paranoid_CNN

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/paranoid_CNN/article/details/77838574

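Summary: The paper proposes an action recognition method based on a 3D ResNet. It extends ResNet into the spatiotemporal domain and adds residual connections to strengthen feature learning. Experiments show strong results on the UCF101 and HMDB51 datasets.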

This paper appeared at NIPS 2016; the author is Feichtenhofer of Graz University of Technology.

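Background: almost every approach that currently works well for action recognition is built on the two-stream network. Appearance and motion are learned by two separate networks, and the outputs of the two networks are fused to produce the prediction.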

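Contributions of the paper:
a. Part of the 2D spatial ResNet is extended into the temporal domain, i.e. the original 2D spatial network W*H*C is mapped to W*H*T*C.
The initialization is defined as follows: [Figure: initialization formula]

This has two benefits:
a1) Spatial and temporal features are learned together in a single network.
a2) The weights of the 2D model can be averaged over the temporal dimension and used to initialize ST-ResNet, so the model can rely on transfer learning instead of learning from scratch.
b. Residual connections are introduced from the motion network into the appearance network, and the whole network is fine-tuned jointly to learn spatiotemporal features.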
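The initialization in a2 can be sketched roughly as below. This is a minimal NumPy illustration, not the paper's exact formula (which was in the missing figure): it assumes a kernel layout of (height, width, in-channels, out-channels), and the helper name `inflate_2d_kernel` is hypothetical.

```python
import numpy as np

def inflate_2d_kernel(w2d: np.ndarray, t: int) -> np.ndarray:
    """Replicate a 2D kernel over t time steps and divide by t.

    Assumed layout (hypothetical): w2d is (kh, kw, c_in, c_out),
    and the result is (kh, kw, t, c_in, c_out).
    """
    if t < 1:
        raise ValueError("t must be >= 1")
    # Insert a time axis at position 2 and copy the 2D weights t times.
    w3d = np.repeat(w2d[:, :, np.newaxis, :, :], t, axis=2)
    # Dividing by t keeps the sum over time equal to the 2D kernel.
    return w3d / t

rng = np.random.default_rng(0)
w2d = rng.standard_normal((3, 3, 4, 8))
w3d = inflate_2d_kernel(w2d, t=3)

# Summing the inflated kernel over time recovers the pretrained 2D kernel.
assert np.allclose(w3d.sum(axis=2), w2d)
```

Because the temporal slices sum to the original kernel, a clip of identical frames produces the same response as the 2D network, which is why the pretrained weights remain a useful starting point.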

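Network architecture:

[Figure: network architecture]

The detailed schematic is as follows:
[Figure: detailed schematic of the two streams]
The red arrows are the residual connections that link the appearance stream and the motion stream. ResNet is used because its skip connections (the black skip arrows) help avoid the vanishing-gradient problem. It is also deeper than VGG, which correspondingly increases the representational power of the network.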
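The idea of a cross-stream residual connection can be illustrated with a toy example. The sketch below is an assumption-laden simplification: the residual branch is a single linear map over channels followed by ReLU, and the motion features are simply added to the output of the appearance unit. The exact form used in the paper may differ, and the names `residual_unit` and `two_stream_step` are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_unit(x, w):
    """Toy residual unit: identity skip plus a ReLU(linear) branch over channels."""
    return x + relu(x @ w)

def two_stream_step(x_app, x_mot, w_app, w_mot):
    """One layer of both streams with an additive motion -> appearance link."""
    # Motion stream: an ordinary residual unit.
    y_mot = residual_unit(x_mot, w_mot)
    # Appearance stream: its own residual unit plus the motion features
    # (assumed additive injection; both streams must have the same shape).
    y_app = residual_unit(x_app, w_app) + x_mot
    return y_app, y_mot

rng = np.random.default_rng(0)
x_app = rng.standard_normal((7, 7, 16))   # (height, width, channels)
x_mot = rng.standard_normal((7, 7, 16))
w_app = rng.standard_normal((16, 16)) * 0.1
w_mot = rng.standard_normal((16, 16)) * 0.1
y_app, y_mot = two_stream_step(x_app, x_mot, w_app, w_mot)
```

With this additive form the gradient of the appearance loss also flows into the motion stream, which is what makes joint fine-tuning of the two streams meaningful.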

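Refinements:
1. For the input data, skip-frame sampling is used. From the paper: to train our spatiotemporal ResNet we sample 5 inputs from a video with random temporal stride between 5 and 15 frames.
That is, each input consists of 5 snippets. For the RGB network a snippet is 1 frame; for the flow network a snippet is L frames, where L is the number of stacked optical-flow frames.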
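The sampling scheme above might look roughly like the following sketch. It assumes a single random stride is drawn per video and shared by all 5 inputs, and that the start position is chosen uniformly so every snippet fits inside the video. The function name and its parameters are hypothetical, not taken from the paper's code.

```python
import random

def sample_snippet_starts(num_frames, num_inputs=5, min_stride=5,
                          max_stride=15, snippet_len=1, rng=random):
    """Return the start index of each snippet.

    snippet_len is 1 for the RGB network and L for the flow network.
    """
    # Largest stride for which all snippets still fit in the video.
    max_feasible = (num_frames - snippet_len) // (num_inputs - 1)
    if max_feasible < min_stride:
        raise ValueError("video too short for the requested sampling")
    # random.randint is inclusive on both ends.
    stride = rng.randint(min_stride, min(max_stride, max_feasible))
    span = (num_inputs - 1) * stride
    start = rng.randint(0, num_frames - snippet_len - span)
    return [start + i * stride for i in range(num_inputs)]

starts = sample_snippet_starts(100, rng=random.Random(0))
```

For the flow network, calling the same function with `snippet_len=L` yields start indices from which L consecutive flow frames can be stacked.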

Experimental results:
[Figure: results table]
Overall, our 94.6% on UCF101 and 70.3% HMDB51 clearly sets a new state-of-the-art on these widely used action recognition datasets.

Conclusions:
1. we demonstrate that injecting residual connections between the two streams and jointly fine-tuning the resulting model achieves improved performance over the two-stream architecture.
(Introducing residual connections from the motion stream into the appearance stream improves the recognition accuracy of the original two-stream framework for action recognition.)

2. We convert convolutional dimensionality mapping filters to temporal filters that provide the network with learnable residual connections over time. By stacking several of these temporal filters and sampling the input sequence at large temporal strides (i.e. skipping frames), we enable the network to operate over large temporal extents of the input.
(First, the input video is sampled with frame skipping. Second, 2D convolutions are converted to 3D convolutions. Together these two steps enlarge the receptive field along the temporal dimension.)

3. we directly convert image ConvNets into 3D architectures and show greatly improved performance over the two-stream baseline.
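The effect described in conclusion 2 can be checked with simple receptive-field arithmetic. The sketch below assumes stride-1, undilated temporal filters of equal size; the number of layers and the kernel size in the example are illustrative values, not figures from the paper.

```python
def temporal_extent(num_layers, kernel_size, frame_stride):
    """Temporal receptive field of stacked temporal filters.

    Returns (extent in sampled inputs, extent in raw video frames).
    """
    # Each stride-1 filter of size k widens the field by k - 1 inputs.
    rf_inputs = 1 + num_layers * (kernel_size - 1)
    # Neighbouring inputs are frame_stride raw frames apart.
    rf_frames = (rf_inputs - 1) * frame_stride + 1
    return rf_inputs, rf_frames

# Two stacked size-3 temporal filters on inputs sampled 10 frames apart.
print(temporal_extent(num_layers=2, kernel_size=3, frame_stride=10))  # (5, 41)
```

Under these assumptions, two temporal filters already cover all 5 sampled inputs, and the frame skipping stretches that coverage to 41 raw frames.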

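Personal view:
In effect, the paper does two things. First, it builds on earlier 3D CNN work to extend the 2D ResNet into a 3D ResNet that learns spatiotemporal features. Second, it carries the ResNet residual unit across streams, from the motion stream into the appearance stream. The experiments report high accuracy, so the approach is effective.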