READING NOTE: Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Tracking

Joshua_Li_

License: CC 4.0 BY-SA

Category: Computer Vision

Article link: https://blog.youkuaiyun.com/joshua_1988/article/details/51988522

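This paper proposes a spatially supervised recurrent convolutional neural network for visual object tracking. The method uses YOLO to collect rich visual features and make preliminary location inferences, and uses an LSTM to regress the location of the object in video frames. It considers the history of both location and appearance, and is trained end to end to form a unified tracking system.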

TITLE: Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Tracking

AUTHOR: Guanghan Ning, Zhi Zhang, Chen Huang, Zhihai He, Xiaobo Ren, Haohong Wang

AFFILIATION: University of Missouri, TCL Research America

FROM: arXiv:1607.05781

Contributions

The LSTM's ability to interpret high-level visual features and regress from them is explored.
Neural network analysis is extended into the spatiotemporal domain for efficient visual object tracking.

The main steps of the method are as follows:

YOLO is used to collect rich and robust visual features, as well as preliminary location inferences.
LSTM is used to regress the location of the object in video frames.
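
The two steps above can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: `fake_detector` is a hypothetical stand-in for YOLO, `TinyLSTMTracker` and all dimensions are made up for the example, and the weights are random and untrained, so only the data flow (per-frame features and box in, one regressed box per frame out) is meaningful.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class TinyLSTMTracker:
    """Single-layer LSTM with a linear read-out to a box (x, y, w, h)."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One stacked weight matrix for the input, forget, candidate and output gates.
        self.W = rng.normal(0.0, 0.1, size=(4 * hidden_dim, input_dim + hidden_dim))
        self.b = np.zeros(4 * hidden_dim)
        # Linear layer mapping the hidden state to 4 box coordinates.
        self.W_out = rng.normal(0.0, 0.1, size=(4, hidden_dim))
        self.hidden_dim = hidden_dim

    def step(self, x, h, c):
        """Run one time step; return the regressed box and the new state."""
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return self.W_out @ h, h, c

    def track(self, frame_inputs):
        """Regress one box per frame, carrying the LSTM state across frames."""
        h = np.zeros(self.hidden_dim)
        c = np.zeros(self.hidden_dim)
        boxes = []
        for x in frame_inputs:
            box, h, c = self.step(x, h, c)
            boxes.append(box)
        return np.stack(boxes)


def fake_detector(frame, feature_dim=32):
    """Hypothetical stand-in for YOLO: returns (features, preliminary box)."""
    features = np.resize(frame.ravel(), feature_dim).astype(float)
    box = np.array([0.5, 0.5, 0.2, 0.2])  # placeholder (x, y, w, h)
    return features, box


# Step 1: per-frame features and preliminary boxes; step 2: LSTM regression.
frames = [np.full((8, 8), t / 10.0) for t in range(5)]
inputs = [np.concatenate(fake_detector(f)) for f in frames]
tracker = TinyLSTMTracker(input_dim=36, hidden_dim=16)
boxes = tracker.track(inputs)
print(boxes.shape)
```

Because the hidden and cell states are carried from frame to frame, each regressed box depends on the earlier frames as well as the current one.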

Some Details

There are two streams of data flowing into the LSTMs:

the visual feature representation $X_{t}$ of the frame. At each time step $t$, a feature vector of length 4096 is extracted, for example the 4096-d feature from the fully-connected layer of VGG.
the detection information $B_{t,i}$ from the fully connected layers.
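
One plausible way to feed both streams to the LSTM is to concatenate them per frame, as sketched below. The 4096-d length of $X_{t}$ comes from the notes; the 4-value `(x, y, w, h)` encoding of the detection, the constant `BOX_DIM`, and the helper names `build_lstm_input` and `build_sequence` are assumptions for illustration and may differ from the paper's exact encoding.

```python
import numpy as np

FEATURE_DIM = 4096  # length of X_t, as stated in the notes
BOX_DIM = 4         # assumed encoding of the detection: (x, y, w, h) in [0, 1]


def build_lstm_input(x_t, b_t):
    """Concatenate the visual feature X_t and the detection B_t for one frame."""
    x_t = np.asarray(x_t, dtype=np.float32)
    b_t = np.asarray(b_t, dtype=np.float32)
    if x_t.shape != (FEATURE_DIM,):
        raise ValueError(f"expected feature of shape ({FEATURE_DIM},), got {x_t.shape}")
    if b_t.shape != (BOX_DIM,):
        raise ValueError(f"expected box of shape ({BOX_DIM},), got {b_t.shape}")
    return np.concatenate([x_t, b_t])


def build_sequence(features, boxes):
    """Stack per-frame inputs into an array of shape (T, FEATURE_DIM + BOX_DIM)."""
    return np.stack([build_lstm_input(x, b) for x, b in zip(features, boxes)])


# Three dummy frames: zero features and the same placeholder box.
T = 3
features = np.zeros((T, FEATURE_DIM))
boxes = np.tile([0.5, 0.5, 0.2, 0.3], (T, 1))
seq = build_sequence(features, boxes)
print(seq.shape)  # (3, 4100)
```

Keeping the box in the last `BOX_DIM` positions makes it easy to check that the location stream reaches the LSTM unchanged.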

Advantages

Processing is fast because detection relies on the YOLO algorithm.
The history of both location and appearance is considered.
Tracking is trained end to end, so the whole pipeline forms a single unified system.

Open Questions

The 4096-d feature represents the whole image; what about a local representation?
Is there a temporal fully convolutional operation for 3D video, analogous to the fully convolutional operation on 2D images?