Introduction
This paper combines two ways of extracting features, hand-crafted features and deep convolutional features. The method is called the trajectory-pooled deep-convolutional descriptor, TDD for short.
The pipeline has two stages:
Learn convolutional features with a CNN.
Aggregate those convolutional features with a trajectory-constrained sampling and pooling strategy to form the descriptors (the TDDs).
Then a Fisher vector aggregates the local TDDs into one global super vector for the whole video, and a linear SVM does the classification (a minimal sketch of this last stage follows the quote below).
The paper's own wording: Based on convolutional feature maps and improved trajectories, we pool the local ConvNet responses over the spatiotemporal tubes centered at the trajectories, where the resulting descriptor is called TDD. Finally, we choose Fisher vector representation to aggregate these local TDDs over the whole video into a global super vector, and use linear SVM as the classifier to perform action recognition.
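To make that last stage concrete, here is a minimal, hedged sketch of Fisher vector encoding plus a linear SVM in Python. This is not the authors' code: the GMM size, the descriptor dimension, the random stand-in data, and the omission of the mixture-weight gradient terms are my assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import LinearSVC

def fisher_vector(descs, gmm):
    """Encode local descriptors (num x dim) as gradients w.r.t. GMM means/variances."""
    q = gmm.predict_proba(descs)                               # (num, K) soft assignments
    pi, mu, var = gmm.weights_, gmm.means_, gmm.covariances_   # diagonal covariances
    n = descs.shape[0]
    d = (descs[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
    g_mu = (q[:, :, None] * d).sum(axis=0) / (n * np.sqrt(pi)[:, None])
    g_var = (q[:, :, None] * (d ** 2 - 1)).sum(axis=0) / (n * np.sqrt(2 * pi)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                     # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                   # L2 normalization

# toy usage: random stand-ins for the per-video local TDDs
rng = np.random.default_rng(0)
gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
gmm.fit(rng.normal(size=(1000, 64)))
videos = [rng.normal(size=(200, 64)) for _ in range(10)]
X = np.stack([fisher_vector(v, gmm) for v in videos])          # one super vector per video
y = np.arange(10) % 2                                          # fake labels
clf = LinearSVC().fit(X, y)
```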
Advantages:
It learns video features automatically, avoiding heavy hand-crafted feature engineering.
It accounts for the temporal characteristics of video by aggregating the CNN-learned features with trajectory-constrained sampling and pooling.
The experiments also show that TDD is complementary to the various hand-crafted features, and that fusing them boosts performance further. The original wording:
our results demonstrate that our TDDs are complementary to those hand-crafted features (HOG, HOF, and MBH) and the fusion of them is able to further boost the recognition performance.
Framework
Improved Trajectories
See the paper.
Unlike improved trajectories, the TDD pipeline performs dense sampling and tracking at a single scale only, which makes it fast. For a video V, this yields a set of trajectories T(V) = {T_1, T_2, ..., T_K}, where K is the number of trajectories, and each trajectory is

T_k = {(x_1^k, y_1^k, z_1^k), (x_2^k, y_2^k, z_2^k), ..., (x_P^k, y_P^k, z_P^k)},

where (x_p^k, y_p^k, z_p^k) is the pixel position of the p-th point of the k-th trajectory, and the trajectory length is P = 15.
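To give the tracking step some intuition, here is a hedged Python sketch of the dense-trajectory tracking rule, P_{t+1} = P_t + (M * ω)|_{P_t}: each point is displaced by the median-filtered optical flow at its current position. The random flow fields, the filter size, and the nearest-pixel rounding are placeholder assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import median_filter

def track_point(x, y, flows, P=15):
    """Track one point through a list of (h, w, 2) flow fields for P-1 steps."""
    h, w = flows[0].shape[:2]
    traj = [(x, y, 0)]
    for z, flow in enumerate(flows[:P - 1], start=1):
        fx = median_filter(flow[..., 0], size=3)     # median-filtered flow field
        fy = median_filter(flow[..., 1], size=3)
        r = int(np.clip(round(y), 0, h - 1))         # nearest-pixel lookup, clamped
        c = int(np.clip(round(x), 0, w - 1))
        x, y = x + fx[r, c], y + fy[r, c]
        traj.append((x, y, z))
    return traj                                      # P points: (x_p, y_p, z_p)

flows = [np.random.randn(48, 64, 2) for _ in range(20)]  # toy flow fields
print(track_point(32.0, 24.0, flows))
```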
Convolutional feature maps
Three points to pin down:
1. Understanding this passage from the paper:
In order to make the feature maps with equal temporal duration with input video, we pad the optical flow fields at the beginning with F −1 copies of the optical flow field from the first frame, where F is the number of stacking optical flows.
This had me stumped for the better part of half a year; I kept rereading it thinking my IQ was the problem, but the wording really is loose, so let me venture a clarification. The F − 1 padded copies of the first frame's optical flow are there so that the feature maps match the input video in the temporal dimension. At first that made no sense to me: are the feature maps temporally related? How many frames does the input video even have? Surely 512 feature maps at conv5 doesn't mean the video has 512 frames (512 is the channel count N, not time). What the padding actually guarantees is that the third dimension L of the feature-map volume (w × h × L × N) equals the number of frames of the video: each temporal-net input stacks F flow fields, which would otherwise leave only L − F + 1 stacks, and padding the beginning with F − 1 copies restores one stack per frame, so every trajectory point's frame index z_p can be looked up directly in the feature maps. If this reading is itself wrong, please point it out.
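A tiny numpy sketch of the counting argument (toy sizes, my own illustration, not the paper's code): stacking F consecutive flow fields shrinks the number of stacks from T to T − F + 1, and padding the front with F − 1 copies of the first field restores exactly T stacks, one per frame position.

```python
import numpy as np

F, T, h, w = 10, 30, 8, 8                         # stack size and toy dimensions
flows = np.random.rand(T, h, w, 2)                # one flow field per frame position
padded = np.concatenate([np.repeat(flows[:1], F - 1, axis=0), flows])
stacks = [padded[t:t + F] for t in range(T)]      # F-long window ending at frame t
assert len(stacks) == T                           # without padding: only T - F + 1
```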
2. Network modification one:
All layers after the target layer are removed.
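In a modern framework, this modification is a one-liner. A hedged PyTorch-style sketch; the torchvision VGG-16 model and the layer index are my stand-ins for illustration, not the paper's actual (Caffe-era) network.

```python
import torch.nn as nn
from torchvision.models import vgg16

model = vgg16(weights=None)
# keep everything up to (and including) the target layer; drop the rest
target_index = 29                 # assumed: index of the ReLU after the last conv in vgg16.features
truncated = nn.Sequential(*list(model.features.children())[:target_index + 1])
```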
3. Network modification two:
This also puzzled me for a while (there really is a lot to digest in this paper). The original text, quoted to avoid mistranslation:
The second modification is that before each convolutional or pooling layer, with kernel size k, we conduct zero padding of the layer's input with size ⌊k/2⌋. This padding allows the input and output maps of these layers to have the same spatial extent. With this padding, it will be straightforward to map the positions of trajectory points in video to the coordinates of convolutional feature maps. A trajectory point with video coordinates (xp, yp, zp) in Equation (3) will be centered on (r × xp, r × yp, zp) in convolutional map, where r is map size ratio with respect to input size.
The paper's accompanying figure (not reproduced here) tabulates, for each layer, two new concepts:
receptive field and map size ratio.
The receptive field of a layer is the region of the input image that each element of that layer corresponds to (the best description I can manage). But how are those per-layer receptive-field numbers computed? The author doesn't say, and gives no reference, so I fumbled around for quite a while before figuring it out:
* The receptive field of the first layer is just its kernel size.
* For a convolutional or pooling layer i with kernel size k_i: RF_i = (k_i − 1) × S_{i−1} + RF_{i−1}, where S_{i−1} is the product of the strides of all preceding i − 1 layers. In words: the i-th layer's receptive field is its kernel size minus one, multiplied by the accumulated stride, plus the previous layer's receptive field (see the sketch after this list).
* With the receptive field finally sorted out, what about map size ratio? After more head-scratching it turns out to be simple: r is the spatial size of the feature map divided by the spatial size of the input. With the zero padding from modification two, only strided layers shrink the maps, so r = 1 / S_i, the reciprocal of the accumulated stride. That is exactly why "A trajectory point with video coordinates (xp, yp, zp) will be centered on (r × xp, r × yp, zp) in convolutional map, where r is map size ratio with respect to input size."
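Here is a short sketch that applies the recursion above. The kernel/stride values follow a VGG-M-2048-style stack, which is close to the two-stream network the paper builds on, but treat the exact numbers as illustrative; for this stack the sketch prints a conv5 receptive field of 139 × 139 and a map size ratio of 1/16.

```python
# (name, kernel, stride) for a VGG-M-2048-style stack -- illustrative values
layers = [("conv1", 7, 2), ("pool1", 3, 2), ("conv2", 5, 2), ("pool2", 3, 2),
          ("conv3", 3, 1), ("conv4", 3, 1), ("conv5", 3, 1)]

rf, acc_stride = 1, 1                   # RF of a raw input pixel; product of strides so far
for name, k, s in layers:
    rf += (k - 1) * acc_stride          # RF_i = RF_{i-1} + (k_i - 1) * S_{i-1}
    acc_stride *= s
    print(f"{name}: receptive field {rf} x {rf}, map size ratio r = 1/{acc_stride}")
```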
TDD
I. Two normalizations:
Normalization proves to be an effective strategy in designing features partially because it can reduce the influence of illumination.
1. Spatiotemporal normalization:

C_st(x, y, z, n) = C(x, y, z, n) / max over (x', y', z') of C(x', y', z', n),

where the denominator is the largest value of the n-th feature map over its whole h × w × L volume. As the formula shows, spatiotemporal normalization guarantees that the values of each convolutional feature map channel, over all spatial and temporal positions, range in the same interval (0 to 1).
2. Channel normalization:

C_ch(x, y, z, n) = C(x, y, z, n) / max over n' of C(x, y, z, n'),

where the denominator is the largest value across the N channels at position (x, y, z). I tried to phrase this in my own words several times and never liked the result, so here is the paper's wording: "This channel normalization is able to make sure that the feature value of each pixel range in the same interval."
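Both normalizations are one-liners in numpy. A hedged sketch on a toy feature volume C of shape (h, w, L, N); the epsilon guard against division by zero is my addition, not something the paper specifies.

```python
import numpy as np

C = np.random.rand(6, 6, 4, 5)        # toy volume: h, w, L, N
eps = 1e-12
# spatiotemporal: divide each channel by its max over the whole h*w*L volume
C_st = C / (C.max(axis=(0, 1, 2), keepdims=True) + eps)
# channel: divide each position by its max across the N channels
C_ch = C / (C.max(axis=3, keepdims=True) + eps)
```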
II. TDD descriptor
By this point the basic principle is much clearer: the TDD of a trajectory is obtained by trajectory-constrained sum-pooling of a normalized feature map over the trajectory's P points,

D(T_k, C) = sum over p = 1..P of C(r × x_p^k, r × y_p^k, z_p^k),

where C is one of the normalized feature maps and r is its map size ratio.
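A hedged numpy sketch of that sum-pooling; the coordinate rounding and the clipping at the map border are my assumptions, not details the paper spells out.

```python
import numpy as np

def tdd(trajectory, C_norm, r):
    """trajectory: P points (x, y, z) in video coords; C_norm: (h, w, L, N); r: map size ratio."""
    h, w, L, N = C_norm.shape
    desc = np.zeros(N)
    for x, y, z in trajectory:
        col = min(int(round(r * x)), w - 1)   # (r*x, r*y, z): video -> feature-map coords
        row = min(int(round(r * y)), h - 1)
        desc += C_norm[row, col, int(z), :]   # accumulate the N-dim response
    return desc

# toy usage: a 2-point trajectory pooled on a random normalized volume at r = 1/16
print(tdd([(10, 20, 0), (11, 21, 1)], np.random.rand(24, 32, 4, 5), 1 / 16))
```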
A few closing words: this post was written to deepen my own understanding. My level is limited and mistakes are hard to avoid, so corrections are welcome. Please ask for my permission before reposting.