Problems:
1,local redundancy
the large spatiotemporal redundancy within short video clips
2,global dependencies
the complex spatiotemporal dependencies among long contexts.
(CNNs struggle with problem 2; ViTs struggle with problem 1.)
Contributions:
1,Sensitivity for recognizing short-term actions, even those with fine-grained motion differences
(i.e., sensitive to motion changes, even very small ones)
More importantly, it is also suitable for masked modeling, which further enhances its temporal sensitivity.
2,Superiority in long-term video understanding
(handles long-range dependencies, an inherent strength of Mamba itself)
3,Scalability in the visual domain without extensive dataset pretraining, thanks to a novel self-distillation technique
To counteract overfitting, a self-distillation strategy is used: a smaller, well-trained model serves as the "teacher" to guide the training of the larger "student" model (see the first sketch after this list).
4,Compatibility with other modalities
(Modalities include e.g. audio, text, and video; multimodal tasks include e.g. video-to-text, speech-to-text, and text-to-speech.)
To augment VideoMamba's temporal sensitivity and verify its adaptability with text modalities, we adopt a masked alignment approach inspired by UMT.
Firstly, VideoMamba is trained from scratch on video data alone, aligning unmasked tokens with those from CLIP-ViT. Subsequently, it is integrated with a text encoder and a cross-modal decoder for pretraining on both image-text and video-text datasets (see the second sketch after this list).
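A minimal sketch of the self-distillation idea from contribution 3, assuming a frozen smaller teacher, a simple feature-level L2 loss, and that both models expose `forward_features` / `head` with matching feature sizes (all assumptions of this sketch, not the paper's exact recipe):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def self_distillation_step(student: nn.Module,
                           teacher: nn.Module,
                           video: torch.Tensor,
                           labels: torch.Tensor,
                           alpha: float = 1.0) -> torch.Tensor:
    """One training step with self-distillation (illustrative only).

    `student` is the larger model being trained; `teacher` is a smaller,
    already well-trained model kept frozen. Both are assumed to expose
    `forward_features` (returning (B, D) features of the same size) and a
    classification `head` -- these method names are assumptions of this sketch.
    """
    with torch.no_grad():
        teacher_feat = teacher.forward_features(video)   # frozen guidance

    student_feat = student.forward_features(video)
    logits = student.head(student_feat)

    # Supervised loss plus a feature-alignment term that pulls the larger
    # student's features toward the smaller teacher's, which is the
    # overfitting counter-measure described above.
    cls_loss = F.cross_entropy(logits, labels)
    distill_loss = F.mse_loss(student_feat, teacher_feat)
    return cls_loss + alpha * distill_loss
```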
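A rough sketch of the UMT-inspired stage-1 masked alignment from contribution 4. For simplicity both encoders are treated as functions returning per-token features over the same tokenization; in the real pipeline only the unmasked tokens are fed to the student, and stage 2 adds the text encoder and cross-modal decoder. Function and variable names are assumptions of this sketch:

```python
import torch
import torch.nn.functional as F

def masked_alignment_loss(videomamba, clip_vit, video, mask_ratio=0.8):
    """Align VideoMamba token features with frozen CLIP-ViT token features
    on the unmasked positions only (simplified illustration)."""
    with torch.no_grad():
        target = clip_vit(video)             # (B, N, D) frozen teacher tokens
    pred = videomamba(video)                 # (B, N, D) student tokens

    B, N, D = pred.shape
    num_keep = max(1, int(N * (1 - mask_ratio)))
    # Randomly choose the unmasked positions per sample.
    keep_idx = torch.rand(B, N, device=pred.device).argsort(dim=1)[:, :num_keep]
    gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, D)

    # Align only the unmasked tokens with their CLIP-ViT counterparts.
    return F.mse_loss(torch.gather(pred, 1, gather_idx),
                      torch.gather(target, 1, gather_idx))
```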
Overall framework of this paper (VideoMamba):
(additionally adds a spatial position embedding p_s and a temporal position embedding p_t to the patch tokens; see the sketch below)
Overall framework of Vision Mamba (Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model):
(additionally adds only a positional embedding E_pos)
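A small sketch of the difference noted above, assuming patch tokens already arranged as (B, T, L, D) (T frames, L patches per frame); the exact cls-token placement in the released VideoMamba code differs slightly, so treat this as illustrative:

```python
import torch
import torch.nn as nn

class VideoMambaTokens(nn.Module):
    """VideoMamba adds a spatial position embedding p_s plus a temporal
    position embedding p_t, whereas Vision Mamba adds a single positional
    embedding E_pos over its image token sequence."""

    def __init__(self, num_frames: int, patches_per_frame: int, dim: int):
        super().__init__()
        self.pos_embed_s = nn.Parameter(torch.zeros(1, 1, patches_per_frame, dim))  # p_s
        self.pos_embed_t = nn.Parameter(torch.zeros(1, num_frames, 1, dim))          # p_t
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, L, D) patch tokens (T frames, L patches per frame).
        B, T, L, D = x.shape
        x = x + self.pos_embed_s + self.pos_embed_t   # broadcast over frames / patches
        x = x.reshape(B, T * L, D)                    # flatten spatiotemporal tokens
        cls = self.cls_token.expand(B, -1, -1)
        return torch.cat([cls, x], dim=1)             # (B, 1 + T*L, D)
```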
My questions:
1,Unlike VMamba, which incorporates additional depthwise convolution, VideoMamba strictly follows the ViT design without downsampling layers.
So, as a possible improvement, could depthwise convolution be added on top of this paper's design to downsample and reduce the computational cost? (A hypothetical sketch follows below.)
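To make the question concrete, here is a hypothetical depthwise-convolution downsampling block (not from the paper): a stride-2 depthwise conv over each frame's spatial token grid, reducing the token count per frame by roughly 4x before the remaining Mamba blocks. Whether this trades accuracy for speed favorably would have to be verified experimentally.

```python
import torch
import torch.nn as nn

class DepthwiseDownsample(nn.Module):
    """Hypothetical downsampling stage: a stride-2 depthwise conv applied to
    each frame's spatial grid of tokens, halving H and W."""

    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, stride=2,
                                padding=1, groups=dim)   # depthwise: groups == channels
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, T*h*w, D) spatiotemporal tokens (cls token excluded for simplicity).
        B, N, D = x.shape
        t = N // (h * w)
        x = x.reshape(B * t, h, w, D).permute(0, 3, 1, 2)   # (B*T, D, H, W)
        x = self.dwconv(x)                                  # (B*T, D, H/2, W/2)
        x = x.flatten(2).transpose(1, 2)                    # (B*T, H/2*W/2, D)
        x = self.norm(x)
        return x.reshape(B, -1, D)                          # 4x fewer tokens per frame
```

For example, such a block could be inserted between two groups of bidirectional Mamba blocks, so later blocks operate on the shorter token sequence; the position embeddings would then need to be adapted to the reduced spatial resolution.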