Problems:
1,local redundancy
the large spatiotemporal redundancy within short video clips
2,global dependencies
the complex spatiotemporal dependencies among long contexts.
(CNNs struggle with problem 2; ViTs struggle with problem 1.)
Contributions:
1,Sensitivity for recognizing short-term actions, even those with fine-grained motion differences
(i.e., sensitive to motion changes, even very small ones)
More importantly, it is also suitable for masked modeling, which further enhances its temporal sensitivity.
2,Superiority in long-term video understanding
(handles long-range dependencies, an inherent strength of Mamba itself)
3,Scalability in the visual domain without extensive dataset pretraining, thanks to a novel self-distillation technique
To counteract overfitting, a self-distillation strategy is used: a smaller, well-trained model serves as the "teacher" to guide the training of the larger "student" model (see the first sketch after this list).
4,Compatibility with other modalities
(Modalities include e.g. audio, text, and video; multimodal tasks include e.g. video-to-text, speech-to-text, and text-to-speech.)
To augment VideoMamba's temporal sensitivity and verify its adaptability with text modalities, we adopt a masked alignment approach inspired by UMT.
Firstly, VideoMamba is trained from scratch on video data alone, aligning unmasked tokens with those from CLIP-ViT. Subsequently, it is integrated with a text encoder and a cross-modal decoder for pretraining on both image-text and video-text datasets (see the second sketch after this list).
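A minimal sketch of the self-distillation idea from contribution 3, assuming a frozen smaller teacher, a simple feature-level L2 loss, and that both models expose `forward_features` / `head` with matching feature sizes (all assumptions of this sketch, not the paper's exact recipe):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def self_distillation_step(student: nn.Module,
                           teacher: nn.Module,
                           video: torch.Tensor,
                           labels: torch.Tensor,
                           alpha: float = 1.0) -> torch.Tensor:
    """One training step with self-distillation (illustrative only).

    `student` is the larger model being trained; `teacher` is a smaller,
    already well-trained model kept frozen. Both are assumed to expose
    `forward_features` (returning (B, D) features of the same size) and a
    classification `head` -- these method names are assumptions of this sketch.
    """
    with torch.no_grad():
        teacher_feat = teacher.forward_features(video)   # frozen guidance

    student_feat = student.forward_features(video)
    logits = student.head(student_feat)

    # Supervised loss plus a feature-alignment term that pulls the larger
    # student's features toward the smaller teacher's, which is the
    # overfitting counter-measure described above.
    cls_loss = F.cross_entropy(logits, labels)
    distill_loss = F.mse_loss(student_feat, teacher_feat)
    return cls_loss + alpha * distill_loss
```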
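A rough sketch of the UMT-inspired stage-1 masked alignment from contribution 4. For simplicity both encoders are treated as functions returning per-token features over the same tokenization; in the real pipeline only the unmasked tokens are fed to the student, and stage 2 adds the text encoder and cross-modal decoder. Function and variable names are assumptions of this sketch:

```python
import torch
import torch.nn.functional as F

def masked_alignment_loss(videomamba, clip_vit, video, mask_ratio=0.8):
    """Align VideoMamba token features with frozen CLIP-ViT token features
    on the unmasked positions only (simplified illustration)."""
    with torch.no_grad():
        target = clip_vit(video)             # (B, N, D) frozen teacher tokens
    pred = videomamba(video)                 # (B, N, D) student tokens

    B, N, D = pred.shape
    num_keep = max(1, int(N * (1 - mask_ratio)))
    # Randomly choose the unmasked positions per sample.
    keep_idx = torch.rand(B, N, device=pred.device).argsort(dim=1)[:, :num_keep]
    gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, D)

    # Align only the unmasked tokens with their CLIP-ViT counterparts.
    return F.mse_loss(torch.gather(pred, 1, gather_idx),
                      torch.gather(target, 1, gather_idx))
```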
Overall framework of this paper (VideoMamba):
(additionally adds a spatial position embedding p_s and a temporal position embedding p_t to the patch tokens; see the sketch below)
Overall framework of Vision Mamba (Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model):
(additionally adds only a positional embedding E_pos)
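A small sketch of the difference noted above, assuming patch tokens already arranged as (B, T, L, D) (T frames, L patches per frame); the exact cls-token placement in the released VideoMamba code differs slightly, so treat this as illustrative:

```python
import torch
import torch.nn as nn

class VideoMambaTokens(nn.Module):
    """VideoMamba adds a spatial position embedding p_s plus a temporal
    position embedding p_t, whereas Vision Mamba adds a single positional
    embedding E_pos over its image token sequence."""

    def __init__(self, num_frames: int, patches_per_frame: int, dim: int):
        super().__init__()
        self.pos_embed_s = nn.Parameter(torch.zeros(1, 1, patches_per_frame, dim))  # p_s
        self.pos_embed_t = nn.Parameter(torch.zeros(1, num_frames, 1, dim))          # p_t
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, L, D) patch tokens (T frames, L patches per frame).
        B, T, L, D = x.shape
        x = x + self.pos_embed_s + self.pos_embed_t   # broadcast over frames / patches
        x = x.reshape(B, T * L, D)                    # flatten spatiotemporal tokens
        cls = self.cls_token.expand(B, -1, -1)
        return torch.cat([cls, x], dim=1)             # (B, 1 + T*L, D)
```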
My questions:
1,Unlike VMamba, which incorporates additional depthwise convolution, VideoMamba strictly follows the ViT design without downsampling layers.
So, as a possible improvement, could depthwise convolution be added on top of this paper's design to downsample and reduce the computational cost? (A hypothetical sketch follows below.)
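To make the question concrete, here is a hypothetical depthwise-convolution downsampling block (not from the paper): a stride-2 depthwise conv over each frame's spatial token grid, reducing the token count per frame by roughly 4x before the remaining Mamba blocks. Whether this trades accuracy for speed favorably would have to be verified experimentally.

```python
import torch
import torch.nn as nn

class DepthwiseDownsample(nn.Module):
    """Hypothetical downsampling stage: a stride-2 depthwise conv applied to
    each frame's spatial grid of tokens, halving H and W."""

    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, stride=2,
                                padding=1, groups=dim)   # depthwise: groups == channels
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, T*h*w, D) spatiotemporal tokens (cls token excluded for simplicity).
        B, N, D = x.shape
        t = N // (h * w)
        x = x.reshape(B * t, h, w, D).permute(0, 3, 1, 2)   # (B*T, D, H, W)
        x = self.dwconv(x)                                  # (B*T, D, H/2, W/2)
        x = x.flatten(2).transpose(1, 2)                    # (B*T, H/2*W/2, D)
        x = self.norm(x)
        return x.reshape(B, -1, D)                          # 4x fewer tokens per frame
```

For example, such a block could be inserted between two groups of bidirectional Mamba blocks, so later blocks operate on the shorter token sequence; the position embeddings would then need to be adapted to the reduced spatial resolution.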