Introduction
This paper combines two ways of extracting features, hand-crafted features and deep convolutional features. The method is called the trajectory-pooled deep-convolutional descriptor, TDD for short.
The pipeline has two stages:
Learn convolutional features with a CNN.
Aggregate those convolutional features with a trajectory-constrained sampling and pooling strategy to form the descriptors (the TDDs).
Then a Fisher vector aggregates the local TDDs into one global super vector for the whole video, and a linear SVM does the classification (a minimal sketch of this last stage follows the quote below).
The paper's own wording: Based on convolutional feature maps and improved trajectories, we pool the local ConvNet responses over the spatiotemporal tubes centered at the trajectories, where the resulting descriptor is called TDD. Finally, we choose Fisher vector representation to aggregate these local TDDs over the whole video into a global super vector, and use linear SVM as the classifier to perform action recognition.
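To make that last stage concrete, here is a minimal, hedged sketch of Fisher vector encoding plus a linear SVM in Python. This is not the authors' code: the GMM size, the descriptor dimension, the random stand-in data, and the omission of the mixture-weight gradient terms are my assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import LinearSVC

def fisher_vector(descs, gmm):
    """Encode local descriptors (num x dim) as gradients w.r.t. GMM means/variances."""
    q = gmm.predict_proba(descs)                               # (num, K) soft assignments
    pi, mu, var = gmm.weights_, gmm.means_, gmm.covariances_   # diagonal covariances
    n = descs.shape[0]
    d = (descs[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
    g_mu = (q[:, :, None] * d).sum(axis=0) / (n * np.sqrt(pi)[:, None])
    g_var = (q[:, :, None] * (d ** 2 - 1)).sum(axis=0) / (n * np.sqrt(2 * pi)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                     # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                   # L2 normalization

# toy usage: random stand-ins for the per-video local TDDs
rng = np.random.default_rng(0)
gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
gmm.fit(rng.normal(size=(1000, 64)))
videos = [rng.normal(size=(200, 64)) for _ in range(10)]
X = np.stack([fisher_vector(v, gmm) for v in videos])          # one super vector per video
y = np.arange(10) % 2                                          # fake labels
clf = LinearSVC().fit(X, y)
```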
Advantages:
It learns video features automatically, avoiding heavy hand-crafted feature engineering.
It accounts for the temporal characteristics of video by aggregating the CNN-learned features with trajectory-constrained sampling and pooling.
The experiments also show that TDD is complementary to the various hand-crafted features, and that fusing them boosts performance further. The original wording:
our results demonstrate that our TDDs are complementary to those hand-crafted features (HOG, HOF, and MBH) and the fusion of them is able to further boost the recognition performance.
Framework
Improved Trajectories
See the paper.
Unlike improved trajectories, the TDD pipeline performs dense sampling and tracking at a single scale only, which makes it fast. For a video V, this yields a set of trajectories T(V) = {T_1, T_2, ..., T_K}, where K is the number of trajectories, and each trajectory is

T_k = {(x_1^k, y_1^k, z_1^k), (x_2^k, y_2^k, z_2^k), ..., (x_P^k, y_P^k, z_P^k)},

where (x_p^k, y_p^k, z_p^k) is the pixel position of the p-th point of the k-th trajectory, and the trajectory length is P = 15.
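To give the tracking step some intuition, here is a hedged Python sketch of the dense-trajectory tracking rule, P_{t+1} = P_t + (M * ω)|_{P_t}: each point is displaced by the median-filtered optical flow at its current position. The random flow fields, the filter size, and the nearest-pixel rounding are placeholder assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import median_filter

def track_point(x, y, flows, P=15):
    """Track one point through a list of (h, w, 2) flow fields for P-1 steps."""
    h, w = flows[0].shape[:2]
    traj = [(x, y, 0)]
    for z, flow in enumerate(flows[:P - 1], start=1):
        fx = median_filter(flow[..., 0], size=3)     # median-filtered flow field
        fy = median_filter(flow[..., 1], size=3)
        r = int(np.clip(round(y), 0, h - 1))         # nearest-pixel lookup, clamped
        c = int(np.clip(round(x), 0, w - 1))
        x, y = x + fx[r, c], y + fy[r, c]
        traj.append((x, y, z))
    return traj                                      # P points: (x_p, y_p, z_p)

flows = [np.random.randn(48, 64, 2) for _ in range(20)]  # toy flow fields
print(track_point(32.0, 24.0, flows))
```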
Convolutional feature maps
Three points to pin down:
1. Understanding this passage from the paper:
In order to make the feature maps with equal temporal duration with input video, we pad the optical flow fields at the beginning with F −1 copies of the optical flow field from the first frame, where F is the number of stacking optical flows.
This had me stumped for the better part of half a year; I kept rereading it thinking my IQ was the problem, but the wording really is loose, so let me venture a clarification. The F − 1 padded copies of the first frame's optical flow are there so that the feature maps match the input video in the temporal dimension. At first that made no sense to me: are the feature maps temporally related? How many frames does the input video even have? Surely 512 feature maps at conv5 doesn't mean the video has 512 frames (512 is the channel count N, not time). What the padding actually guarantees is that the third dimension L of the feature-map volume (w × h × L × N) equals the number of frames of the video: each temporal-net input stacks F flow fields, which would otherwise leave only L − F + 1 stacks, and padding the beginning with F − 1 copies restores one stack per frame, so every trajectory point's frame index z_p can be looked up directly in the feature maps. If this reading is itself wrong, please point it out.
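A tiny numpy sketch of the counting argument (toy sizes, my own illustration, not the paper's code): stacking F consecutive flow fields shrinks the number of stacks from T to T − F + 1, and padding the front with F − 1 copies of the first field restores exactly T stacks, one per frame position.

```python
import numpy as np

F, T, h, w = 10, 30, 8, 8                         # stack size and toy dimensions
flows = np.random.rand(T, h, w, 2)                # one flow field per frame position
padded = np.concatenate([np.repeat(flows[:1], F - 1, axis=0), flows])
stacks = [padded[t:t + F] for t in range(T)]      # F-long window ending at frame t
assert len(stacks) == T                           # without padding: only T - F + 1
```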
2. Network modification one:
All layers after the target layer are removed.
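In a modern framework, this modification is a one-liner. A hedged PyTorch-style sketch; the torchvision VGG-16 model and the layer index are my stand-ins for illustration, not the paper's actual (Caffe-era) network.

```python
import torch.nn as nn
from torchvision.models import vgg16

model = vgg16(weights=None)
# keep everything up to (and including) the target layer; drop the rest
target_index = 29                 # assumed: index of the ReLU after the last conv in vgg16.features
truncated = nn.Sequential(*list(model.features.children())[:target_index + 1])
```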
3. Network modification two:
This also puzzled me for a while (there really is a lot to digest in this paper). The original text, quoted to avoid mistranslation:
The second modification is that before each convolutional or pooling layer, with kernel size k, we conduct zero padding of the layer's input with size ⌊k/2⌋. This padding allows the input and output maps of these layers to have the same spatial extent. With this padding, it will be straightforward to map the positions of trajectory points in video to the coordinates of convolutional feature maps. A trajectory point with video coordinates (xp, yp, zp) in Equation (3) will be centered on (r × xp, r × yp, zp) in convolutional map, where r is map size ratio with respect to input size.
The paper's accompanying figure (not reproduced here) tabulates, for each layer, two new concepts:
receptive field and map size ratio.
The receptive field of a layer is the region of the input image that each element of that layer corresponds to (the best description I can manage). But how are those per-layer receptive-field numbers computed? The author doesn't say, and gives no reference, so I fumbled around for quite a while before figuring it out:
* The receptive field of the first layer is just its kernel size.
* For a convolutional or pooling layer i with kernel size k_i: RF_i = (k_i − 1) × S_{i−1} + RF_{i−1}, where S_{i−1} is the product of the strides of all preceding i − 1 layers. In words: the i-th layer's receptive field is its kernel size minus one, multiplied by the accumulated stride, plus the previous layer's receptive field (see the sketch after this list).
* With the receptive field finally sorted out, what about map size ratio? After more head-scratching it turns out to be simple: r is the spatial size of the feature map divided by the spatial size of the input. With the zero padding from modification two, only strided layers shrink the maps, so r = 1 / S_i, the reciprocal of the accumulated stride. That is exactly why "A trajectory point with video coordinates (xp, yp, zp) will be centered on (r × xp, r × yp, zp) in convolutional map, where r is map size ratio with respect to input size."
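Here is a short sketch that applies the recursion above. The kernel/stride values follow a VGG-M-2048-style stack, which is close to the two-stream network the paper builds on, but treat the exact numbers as illustrative; for this stack the sketch prints a conv5 receptive field of 139 × 139 and a map size ratio of 1/16.

```python
# (name, kernel, stride) for a VGG-M-2048-style stack -- illustrative values
layers = [("conv1", 7, 2), ("pool1", 3, 2), ("conv2", 5, 2), ("pool2", 3, 2),
          ("conv3", 3, 1), ("conv4", 3, 1), ("conv5", 3, 1)]

rf, acc_stride = 1, 1                   # RF of a raw input pixel; product of strides so far
for name, k, s in layers:
    rf += (k - 1) * acc_stride          # RF_i = RF_{i-1} + (k_i - 1) * S_{i-1}
    acc_stride *= s
    print(f"{name}: receptive field {rf} x {rf}, map size ratio r = 1/{acc_stride}")
```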
TDD
I. Two normalizations:
Normalization proves to be an effective strategy in designing features partially because it can reduce the influence of illumination.
1. Spatiotemporal normalization:

C_st(x, y, z, n) = C(x, y, z, n) / max over (x', y', z') of C(x', y', z', n),

where the denominator is the largest value of the n-th feature map over its whole h × w × L volume. As the formula shows, spatiotemporal normalization guarantees that the values of each convolutional feature map channel, over all spatial and temporal positions, range in the same interval (0 to 1).
2. Channel normalization:

C_ch(x, y, z, n) = C(x, y, z, n) / max over n' of C(x, y, z, n'),

where the denominator is the largest value across the N channels at position (x, y, z). I tried to phrase this in my own words several times and never liked the result, so here is the paper's wording: "This channel normalization is able to make sure that the feature value of each pixel range in the same interval."
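Both normalizations are one-liners in numpy. A hedged sketch on a toy feature volume C of shape (h, w, L, N); the epsilon guard against division by zero is my addition, not something the paper specifies.

```python
import numpy as np

C = np.random.rand(6, 6, 4, 5)        # toy volume: h, w, L, N
eps = 1e-12
# spatiotemporal: divide each channel by its max over the whole h*w*L volume
C_st = C / (C.max(axis=(0, 1, 2), keepdims=True) + eps)
# channel: divide each position by its max across the N channels
C_ch = C / (C.max(axis=3, keepdims=True) + eps)
```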
II. TDD descriptor
By this point the basic principle is much clearer: the TDD of a trajectory is obtained by trajectory-constrained sum-pooling of a normalized feature map over the trajectory's P points,

D(T_k, C) = sum over p = 1..P of C(r × x_p^k, r × y_p^k, z_p^k),

where C is one of the normalized feature maps and r is its map size ratio.
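A hedged numpy sketch of that sum-pooling; the coordinate rounding and the clipping at the map border are my assumptions, not details the paper spells out.

```python
import numpy as np

def tdd(trajectory, C_norm, r):
    """trajectory: P points (x, y, z) in video coords; C_norm: (h, w, L, N); r: map size ratio."""
    h, w, L, N = C_norm.shape
    desc = np.zeros(N)
    for x, y, z in trajectory:
        col = min(int(round(r * x)), w - 1)   # (r*x, r*y, z): video -> feature-map coords
        row = min(int(round(r * y)), h - 1)
        desc += C_norm[row, col, int(z), :]   # accumulate the N-dim response
    return desc

# toy usage: a 2-point trajectory pooled on a random normalized volume at r = 1/16
print(tdd([(10, 20, 0), (11, 21, 1)], np.random.rand(24, 32, 4, 5), 1 / 16))
```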
A few closing words: this post was written to deepen my own understanding. My level is limited and mistakes are hard to avoid, so corrections are welcome. Please ask for my permission before reposting.