A Key Volume Mining Deep Framework for Action Recognition: Paper Notes

Latest recommended article published on 2019-01-08 09:31:21

code_Rocker


Categories: ML papers reading, Machine Learning. Tags: action recognitio

Article link: https://blog.youkuaiyun.com/u014381600/article/details/54314590


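This post introduces a key volume mining deep framework for action recognition. Recognition accuracy is improved by mining the decisive key regions of a video. The paper proposes two novel techniques: Stochastic out for selecting key volumes, and an unsupervised key volume proposal algorithm.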


Zhu W, Hu J, Sun G, et al. A Key Volume Mining Deep Framework for Action Recognition[C]// IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2016: 1991-1999.
[Figure: video frames with the key action regions boxed]

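1. Idea:
Action recognition accuracy is improved by mining the regions of a video that are decisive for the action. The boxed regions in the figure above are the key regions where the action actually takes place. A CNN is trained to learn these regions.
Main contributions: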

First, we propose Stochastic out to select key volumes from multiple modalities;
Second, we design an effective yet simple unsupervised key volume proposal algorithm to improve the probability that an input bag contains key volumes.
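First, stochastic selection is used to pick out candidate action regions.
Second, a simple unsupervised algorithm raises the probability that the input contains key regions.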

Final result: 93.1% on the UCF101 dataset

The main contributions of this paper can be summarized as follows:
1) We propose an end-to-end deep framework to simultaneously identify key volumes and do action classification. And we integrate the alternative optimization into forward and backward stages of SGD training.
2) We propose two novel techniques, i.e., Stochastic out and unsupervised key volume proposal to benefit the deep framework.

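Unlike conventional neural network training, multiple 3D volumes are cropped from one video and used as the network input. After the CNN, each volume gets a prediction vector that gives the probability of the volume belonging to each action class.
[Figure not available]
The probability that at least one volume from a class-i video scores highly on the class-i classifier increases as K grows.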
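One way to see why a larger K helps, under a simplifying assumption that is not taken from the paper: suppose each sampled volume is a key volume independently with probability p (a hypothetical symbol introduced here). Then a bag of K volumes contains at least one key volume with probability

```latex
P(\text{bag contains a key volume}) = 1 - (1 - p)^{K}
```

For example, with p = 0.2 this is 0.2 for K = 1 and about 0.74 for K = 6, since 0.8^6 ≈ 0.262.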

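Borrowing from Multiple Instance Learning, the network's loss function requires every volume of a class-i video to have a low response on the classifiers of all classes other than i, while encouraging at least one volume of the class-i video to have a high response on the class-i classifier.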
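The Multiple Instance Learning idea can be sketched roughly as follows. This is a minimal illustration, not the paper's loss: the function name `bag_loss`, the use of the maximum over volumes as the bag-level score for every class, and the binary cross-entropy form are all assumptions made for this example.

```python
import math

def bag_loss(S, label, eps=1e-12):
    """MIL-style loss for one bag (hypothetical sketch).

    S     : K x N list of lists, S[k][c] = score in [0, 1] of volume k for class c
    label : index of the video's true class
    """
    n_classes = len(S[0])
    loss = 0.0
    for c in range(n_classes):
        # Bag-level score: the strongest response among the K volumes (assumption).
        top = max(row[c] for row in S)
        if c == label:
            # At least one volume should respond strongly on the true class.
            loss += -math.log(top + eps)
        else:
            # Pushing the maximum down pushes every volume down on other classes.
            loss += -math.log(1.0 - top + eps)
    return loss

good_bag = [[0.9, 0.1, 0.1], [0.2, 0.1, 0.2]]  # one volume fires on class 0
bad_bag = [[0.2, 0.1, 0.1], [0.3, 0.1, 0.2]]   # no volume fires on class 0
print(bag_loss(good_bag, 0) < bag_loss(bad_bag, 0))
```

Penalising the maximum response on the non-label classes is what makes "all volumes respond weakly" hold, because no volume can exceed the maximum.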

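Once the network has been trained to a certain degree, the scores that the Forward stage assigns to each volume can be used to pick key volumes; these key volumes then determine how the network parameters are adjusted in the Backward stage. Updating the parameters with key volumes avoids the noise introduced by random volumes, which yields better network parameters.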

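A: How to select key volumes
[Figure not available]
As the assumption states, each 3D volume input is passed through the network and then through N binary classifiers, giving one probability for each of the N classes. S is a K×N matrix. Suppose there are N = 3 classes: every input produces a probability for each of the three classes, and if the first class is the most probable, a row of S could be (0.8, 0.1, 0.1) for classes 1, 2 and 3.
Next, to compute the multi-class loss, the authors propose a formulation that is also fairly basic.
[Figure not available]
The method is simple: for an input vector X, the probability of selecting one of its elements X_i is:
[Formula not available]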
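The selection formula is not recoverable from this copy of the post. A minimal sketch follows, assuming the selection probability is proportional to the element's value, as in stochastic pooling; this proportional rule and the name `stochastic_out` are assumptions for illustration, not something the post confirms.

```python
import random

def stochastic_out(x, rng=random):
    """Pick one element of x at random (hypothetical sketch).

    Assumes non-negative responses and a selection probability
    p_i = x_i / sum_j x_j, which is an assumption, not the confirmed formula.
    Returns the chosen index and the probability vector.
    """
    total = sum(x)
    if total <= 0 or min(x) < 0:
        raise ValueError("responses must be non-negative with a positive sum")
    probs = [v / total for v in x]
    index = rng.choices(range(len(x)), weights=probs, k=1)[0]
    return index, probs

rng = random.Random(0)
index, probs = stochastic_out([0.8, 0.1, 0.1], rng)
print(index, probs)
```

Compared with always taking the maximum, sampling lets volumes with lower but non-zero scores be chosen occasionally, which matters early in training when the scores are still unreliable.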

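The overall framework is shown in the figure above. In the forward pass, a score is computed for every volume; Stochastic out is then applied on the node corresponding to the current video's label, and max out on the other nodes.
In backpropagation, the network parameters are updated according to the key volumes selected in the forward pass.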
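The forward selection and the backward routing described above can be sketched as follows. The function names, the proportional sampling rule on the label node, and the 0/1 gradient mask are assumptions made for illustration; the real framework operates inside SGD training of a CNN.

```python
import random

def select_volumes(S, label, rng):
    """For each class node, pick the volume whose score is propagated.

    S is a K x N list of non-negative scores. The label node uses stochastic
    selection (assumed proportional to the score); other nodes use max out.
    """
    K, N = len(S), len(S[0])
    selected = []
    for c in range(N):
        column = [S[k][c] for k in range(K)]
        if c == label:
            k = rng.choices(range(K), weights=column, k=1)[0]
        else:
            k = max(range(K), key=lambda i: column[i])
        selected.append(k)
    return selected

def gradient_mask(selected, K):
    """mask[k][c] is 1 only where volume k was selected for class c,
    so only the selected volumes receive gradient in the backward pass."""
    return [[1 if selected[c] == k else 0 for c in range(len(selected))]
            for k in range(K)]

S = [[0.0, 0.7, 0.1],
     [1.0, 0.2, 0.9]]
selected = select_volumes(S, label=0, rng=random.Random(0))
mask = gradient_mask(selected, K=len(S))
print(selected, mask)
```

In this toy bag the label column is (0.0, 1.0), so the second volume is always chosen for the label node, while the other two nodes take their highest-scoring volume.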

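Network training: in the experiments the bag size is fixed at K = 6, and it is still assumed that a bag of K volumes always contains a key volume.
The selected key volumes can be visualized as follows:
[Figure not available]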

Each volume extends over T consecutive frames in the temporal dimension, where T is the fixed temporal size.
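To make the volume layout concrete, here is a minimal sketch that builds a bag of K volumes, each covering T consecutive frame indices. It samples start positions uniformly at random, which is a stand-in assumption rather than the paper's unsupervised key volume proposal; `sample_volumes` and its parameters are hypothetical names.

```python
import random

def sample_volumes(num_frames, T, K, rng):
    """Return K volumes, each a list of T consecutive frame indices.

    Start positions are drawn uniformly at random (an assumption for this
    sketch; the paper uses an unsupervised proposal step instead).
    """
    if T <= 0 or num_frames < T:
        raise ValueError("video must contain at least T frames")
    starts = [rng.randrange(num_frames - T + 1) for _ in range(K)]
    return [list(range(s, s + T)) for s in starts]

bag = sample_volumes(num_frames=100, T=10, K=6, rng=random.Random(0))
for volume in bag:
    print(volume[0], volume[-1])
```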

Final performance: [Figure not available]

Partly based on http://www.toutiao.com/i6297136389775426050/