Spatial Temporal Transformer Network for Skeleton-based Action Recognition
Author and Department
Chiara et al., Politecnico di Milano, Italy; published at ICPR, 2020.
The paper comes with code, but my reproduction is not yet correct; I will keep tracking this.
Abstract
The abstract breaks down into four parts: 1. background, 2. motivation, 3. method, 4. conclusion.
- Background: Skeleton data has been demonstrated to be robust to illumination changes etc. Nevertheless, an effective encoding of the latent information underlying the 3D skeleton is still an open problem.
- Motivation: in my view, riding the wave of the Transformer's popularity; in addition, existing methods ignore the correlations between joint pairs.
- Method: Spatial-Temporal Transformer network (ST-TR), built from two modules:
  - Spatial Self-Attention module (SSA): models intra-frame interactions between different body parts;
  - Temporal Self-Attention module (TSA): models inter-frame correlations.
- Conclusion: a two-stream network that outperforms state-of-the-art models on both NTU-RGB+D 60 and NTU-RGB+D 120.
Summary
To be filled in last, after the notes are finished: an overview of the paper's content, to read first when revisiting these notes. Note: the summary must come from your own thinking and be written in your own words; never just Ctrl+C the original text.
Research Objective(s)/Motivation
The authors' goal is to use the Spatial Self-Attention module (SSA) and the Temporal Self-Attention module (TSA) to extract adaptive low-level features and model the interactions within human actions.
Contribution
- The authors propose a novel two-stream Transformer-based model (covering both the temporal and spatial dimensions).
- Spatial Self-Attention (SSA) & Temporal Self-Attention (TSA):
  - The SSA module dynamically builds links between skeleton joints, capturing relations between body parts that depend on the action rather than strictly following the natural structure of the human skeleton.
  - The TSA module studies the dynamics of joints along time.
Background / Problem Statement (Introduction)
Problem Statement
- The topology of the graph representing the human body is fixed across all layers and actions, preventing the extraction of rich representations.
- Spatial and temporal convolutions are both based on 2D convolution, so they are limited to features from a local neighborhood;
- Correlations exist between body joints that are not linked in the human skeleton.
Method(s)
Spatial Self-Attention (SSA)
As shown in Fig. 1(a), first calculate $\mathbf{q}_i^t \in \mathbb{R}^{d_q}$, $\mathbf{k}_i^t \in \mathbb{R}^{d_k}$ and $\mathbf{v}_i^t \in \mathbb{R}^{d_v}$ for each node; then a query-key dot product gives the weights $\alpha_{i,j}^t \in \mathbb{R}$ (each weight represents the strength of the correlation between two nodes). Finally, a weighted sum is computed to obtain a new embedding for node $i^t$ (the $\sum$ aggregates the new node embedding):
$$a_{i,j}^t=\mathbf{q}_i^t\cdot{\mathbf{k}_j^t}^{\mathsf T},\ \forall t\in T,\qquad \mathbf{z}_i^t=\sum_j \mathrm{softmax}_j\!\left(\frac{a_{i,j}^t}{\sqrt{d_k}}\right)\mathbf{v}_j^t \tag{1}$$
Multi-head self-attention repeats this embedding-extraction process $H$ times, each time with a different set of learned parameters, obtaining node embeddings $z_{i_1}^t, \dots, z_{i_H}^t$, all referring to node $i^t$. These are combined as $\mathrm{concat}(z_{i_1}^t, \dots, z_{i_H}^t)\cdot W_O$ and constitute the output features of SSA.
In short, this module aggregates, for each node, spatial features from all the other nodes.
Therefore, as shown in Fig. 1(a), the relations between nodes (the $a_{i,j}^t$ scores) are predicted dynamically: the relational structure is not fixed across actions but adapts to each sample. The SSA operation is similar to graph convolution on a fully-connected graph, except that the kernel values (the $a_{i,j}^t$ scores) are predicted dynamically from the skeleton action.
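To make Eq. (1) concrete, here is a minimal PyTorch sketch of an SSA layer. This is my own illustration, not the authors' released code: the tensor layout (N, T, V, C), the shared qkv projection, and all layer names are assumptions.

```python
import torch
import torch.nn as nn


class SpatialSelfAttention(nn.Module):
    """Minimal SSA sketch: multi-head self-attention over the V joints
    of each frame, applied independently at every time step t (Eq. 1)."""

    def __init__(self, in_channels, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.h = num_heads
        self.d_k = d_model // num_heads
        # One shared projection produces q_i^t, k_i^t, v_i^t for every joint.
        self.to_qkv = nn.Linear(in_channels, 3 * d_model)
        self.w_o = nn.Linear(d_model, d_model)  # W_O applied after head concat

    def forward(self, x):
        # x: (N, T, V, C) = batch, frames, joints, channels
        n, t, v, _ = x.shape
        q, k, val = self.to_qkv(x).chunk(3, dim=-1)

        # Split heads: (N, T, V, d_model) -> (N, T, H, V, d_k)
        def split(z):
            return z.view(n, t, v, self.h, self.d_k).transpose(2, 3)

        q, k, val = split(q), split(k), split(val)
        # Eq. (1): a_{i,j}^t = q_i^t . (k_j^t)^T, softmax over joints j
        attn = (q @ k.transpose(-2, -1)) / self.d_k ** 0.5  # (N, T, H, V, V)
        z = attn.softmax(dim=-1) @ val                      # (N, T, H, V, d_k)
        z = z.transpose(2, 3).reshape(n, t, v, -1)          # concat the H heads
        return self.w_o(z)
```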
Temporal Self-Attention (TSA)
$$a_{i,j}^v=\mathbf{q}_i^v\cdot\mathbf{k}_j^v,\ \forall v\in V,\qquad \mathbf{z}_i^v=\sum_j \mathrm{softmax}_j\!\left(\frac{a_{i,j}^v}{\sqrt{d_k}}\right)\mathbf{v}_j^v \tag{2}$$
Here $i^v$ and $j^v$ denote node $v$ at time steps $i$ and $j$, respectively; everything else works as in SSA.
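Since Eq. (2) is Eq. (1) with the roles of joints and frames exchanged, a TSA sketch can simply reuse the SSA module above by swapping axes. This reuse is my own simplification for illustration, not how the official code is organized:

```python
class TemporalSelfAttention(nn.Module):
    """TSA sketch: the same attention, but over the T frames of each joint,
    obtained by swapping the time and joint axes before and after SSA."""

    def __init__(self, in_channels, d_model, num_heads):
        super().__init__()
        self.attn = SpatialSelfAttention(in_channels, d_model, num_heads)

    def forward(self, x):
        # x: (N, T, V, C); make frames the attended-over dimension
        z = self.attn(x.transpose(1, 2))  # Eq. (2): softmax over time steps j
        return z.transpose(1, 2)          # back to (N, T, V, d_model)
```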
Two-Stream Spatial Temporal Transformer Network
With SSA and TSA in hand, the next step is to combine them. The authors replace the GCN and TCN sub-modules of ST-GCN with SSA and TSA, respectively.
Spatial Transformer Stream (S-TR)
$$\mathbf{S\text{-}TR}(x)=\mathrm{Conv}_{2D(1\times K_t)}(\mathbf{SSA}(x))$$
Following the original Transformer structure, a Batch Normalization layer and skip connections are used.
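A sketch of one S-TR block under this formula: SSA followed by a 2D convolution whose kernel spans $K_t$ steps along time and 1 along joints, with batch normalization and a skip connection as stated above. The kernel size (9) and the equal in/out channel count are my assumptions:

```python
class STRBlock(nn.Module):
    """One S-TR block: S-TR(x) = Conv2D_(1 x Kt)(SSA(x)),
    plus batch norm and a residual (skip) connection."""

    def __init__(self, channels, num_heads=8, kernel_t=9):
        super().__init__()
        self.ssa = SpatialSelfAttention(channels, channels, num_heads)
        # Temporal convolution: kernel K_t along time, 1 along joints
        self.tcn = nn.Conv2d(channels, channels, kernel_size=(kernel_t, 1),
                             padding=(kernel_t // 2, 0))
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        # x: (N, T, V, C)
        z = self.ssa(x).permute(0, 3, 1, 2)           # (N, C, T, V) for Conv2d
        z = self.bn(self.tcn(z)).permute(0, 2, 3, 1)  # back to (N, T, V, C)
        return torch.relu(z + x)                      # skip connection
```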
Temporal Transformer Stream (T-TR)
$$\mathbf{T\text{-}TR}(x)=\mathbf{TSA}(\mathrm{GCN}(x))$$
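Correspondingly, a T-TR block keeps a graph convolution for the spatial step and applies TSA along time. In the sketch below, a plain linear layer stands in for the GCN purely to keep the example self-contained; it is a placeholder of mine, not the paper's graph convolution:

```python
class TTRBlock(nn.Module):
    """One T-TR block: T-TR(x) = TSA(GCN(x))."""

    def __init__(self, channels, num_heads=8):
        super().__init__()
        self.gcn = nn.Linear(channels, channels)  # placeholder for a real GCN
        self.tsa = TemporalSelfAttention(channels, channels, num_heads)

    def forward(self, x):
        # x: (N, T, V, C)
        return self.tsa(self.gcn(x))
```

The two streams run in parallel, and their outputs are combined to produce the final two-stream prediction.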
Experiments
++Datasets++: NTU RGB+D 60 and NTU RGB+D 120.
Ablation Study
The S-TR stream achieves slightly better performance (+0.4%) than the T-TR stream. Likely reason: SSA in S-TR attends over only 25 joints, whereas temporal correlations span a large number of frames. The S-TR stream also uses fewer parameters.
Actions such as “playing with phone”, “typing”, and “cross hands” benefit most on S-TR, while actions with long temporal dependencies or involving two people, such as “hugging”, “point finger”, and “pat on back”, benefit most on T-TR.
Next step: code analysis. The code reproduction currently has problems and is still being adjusted.