Sinkhorn Transformations for Single-Query Postprocessing in Text-Video Retrieval
Scalable Deep Multimodal Learning for Cross-Modal Retrieval
Predefined common space; the mapping from each modality into the common space is learned separately.
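A minimal sketch of this setup (illustrative dimensions and loss, not the paper's exact architecture): each modality gets its own encoder into a shared common space, trained with an InfoNCE-style matching loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Per-modality mapping into the shared common space, learned separately."""
    def __init__(self, in_dim, common_dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                  nn.Linear(512, common_dim))

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)   # unit vectors in the common space

image_enc = ModalityEncoder(in_dim=2048)   # e.g. CNN image features
text_enc = ModalityEncoder(in_dim=768)     # e.g. BERT text features

img = image_enc(torch.randn(32, 2048))
txt = text_enc(torch.randn(32, 768))
logits = img @ txt.t() / 0.07                      # pairwise similarity matrix
loss = F.cross_entropy(logits, torch.arange(32))   # pull matched pairs together
```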
Retrieval-Augmented Multimodal Language Modeling

retrieve and generate both text and images
REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory
Can only generate text.
MR2: A Benchmark for Multimodal Retrieval-Augmented Rumor Detection in Social Media
Deep Multimodal Learning for Information Retrieval
A workshop citation.
MAKE: Vision-Language Pre-training based Product Retrieval in Taobao Search
text-to-multimodal retrieval
Taobao e-commerce search.


Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training


Two-stage recall: coarse retrieval first, then fine-grained reranking.
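A hedged sketch of the coarse-to-fine pattern; `fine_scorer` is a hypothetical stand-in for the paper's token-guided reranker, returning a scalar score per pair.

```python
import torch

def two_stage_retrieve(query_vec, gallery_vecs, fine_scorer, k_coarse=100, k_final=10):
    # Stage 1 (coarse): cheap dot-product recall over the whole gallery.
    cand = (gallery_vecs @ query_vec).topk(k_coarse).indices
    # Stage 2 (fine): expensive scoring only on the shortlist.
    # fine_scorer(query, doc) -> scalar tensor (hypothetical reranker)
    fine = torch.stack([fine_scorer(query_vec, gallery_vecs[i]) for i in cand])
    return cand[fine.argsort(descending=True)[:k_final]]
```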
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
Addresses the lack of high-quality image-text data by contributing a caption dataset.
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
modality collaboration phenomenon

key motivation:
However, this strategy can cause a mismatch in granularity, where image features often contain fruitful semantic information compared to the discrete semantic information within text embedding features. Those methods disregard the unique characteristics of visual and textual information, thus potentially limiting the model's performance.

Processes the image modality and the text modality with two separate sets of parameters.
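A minimal sketch of that idea (in the spirit of the note, not mPLUG-Owl2's exact module): shared attention, but modality-specific parameters for normalization and projection.

```python
import torch
import torch.nn as nn

class ModalityAdaptiveLayer(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # shared
        self.norm = nn.ModuleDict({"image": nn.LayerNorm(dim),
                                   "text": nn.LayerNorm(dim)})           # separate
        self.proj = nn.ModuleDict({"image": nn.Linear(dim, dim),
                                   "text": nn.Linear(dim, dim)})         # separate

    def forward(self, x, modality):
        h = self.proj[modality](self.norm[modality](x))  # modality-specific path
        out, _ = self.attn(h, h, h)                      # shared attention weights
        return x + out

layer = ModalityAdaptiveLayer()
img_out = layer(torch.randn(2, 16, 512), "image")
txt_out = layer(torch.randn(2, 16, 512), "text")
```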
MiniGPT-v2: Large Language Model As a Unified Interface for Vision-Language Multi-task Learning
motivation:
To alleviate this issue and move toward a unified approach, they propose:
- a task-oriented instruction training scheme to reduce multi-modal instructional ambiguity, and
- a vision-language model, MiniGPT-v2.

Visual Instruction Tuning
LLaVA.


Flamingo: a Visual Language Model for Few-Shot Learning


Combines image information via cross-attention.
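A rough sketch of the mechanism (dimensions illustrative, not Flamingo's actual block): text tokens cross-attend to visual features, with a tanh gate initialized to zero so image information is blended in gradually.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.xattn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # starts closed: pure LM behavior

    def forward(self, text_tokens, visual_tokens):
        attended, _ = self.xattn(text_tokens, visual_tokens, visual_tokens)
        return text_tokens + torch.tanh(self.gate) * attended

block = GatedCrossAttention()
fused = block(torch.randn(2, 32, 512), torch.randn(2, 64, 512))
```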
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts


An Empirical Study of Training End-to-End Vision-and-Language Transformers

A combinatorial study that fairly comprehensively investigates the best way to combine the components of a VL model.






Scaling Vision-Language Models with Sparse Mixture of Experts




Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception




Adding more modalities hurts single-tower (dense) encoder accuracy.


Image as a Foreign Language: BEIT Pretraining for All Vision and Vision-Language Tasks


CoCa: Contrastive Captioners are Image-Text Foundation Models


Mixture-of-Experts with Expert Choice Routing
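As I understand the title's scheme, experts select tokens rather than tokens selecting experts, which balances expert load by construction. A simplified sketch (not the paper's code):

```python
import torch

def expert_choice_route(tokens, router_w, capacity):
    # tokens: (n, d); router_w: (d, num_experts)
    probs = (tokens @ router_w).softmax(dim=-1)   # token-to-expert affinities
    gate, idx = probs.topk(capacity, dim=0)       # each expert (column) picks its top tokens
    return idx, gate                              # both (capacity, num_experts)

idx, gate = expert_choice_route(torch.randn(128, 64), torch.randn(64, 8), capacity=16)
# expert e would process tokens[idx[:, e]], weighted by gate[:, e]
```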

MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering


MOELoRA: An MOE-based Parameter Efficient Fine-Tuning Method for Multi-task Medical Applications



Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning




Multimodal Representation Learning by Alternating Unimodal Adaptation
to solve multimodal laziness

Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features








Prompt-based and weak-modality enhanced multimodal recommendation

Zero-Shot Composed Image Retrieval with Textual Inversion


Multimodal Prompt Retrieval for Generative Visual Question Answering

Align and Prompt: Video-and-Language Pre-training with Entity Prompts


Grounding Language Models to Images for Multimodal Inputs and Outputs

Generates and retrieves at the same time.
Bootstrapping Contrastive Learning Enhanced Music Cold-Start Matching
Uses the BPR loss together with contrastive learning on sampled hard negatives; everything else is standard.
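A hedged sketch of that combination (the encoders and the candidate pool are assumptions; the pool may contain the positive itself, which a real implementation would mask out):

```python
import torch
import torch.nn.functional as F

def bpr_loss(user, pos_item, neg_item):
    # BPR: the positive item should outscore the sampled negative.
    return -F.logsigmoid((user * pos_item).sum(-1) - (user * neg_item).sum(-1)).mean()

def hard_negative_infonce(anchor, positive, pool, k=5, tau=0.07):
    a, p = F.normalize(anchor, dim=-1), F.normalize(positive, dim=-1)
    c = F.normalize(pool, dim=-1)                    # (N, d) candidate negatives
    hard = c[(a @ c.t()).topk(k, dim=-1).indices]    # (B, k, d) hardest candidates
    pos_sim = (a * p).sum(-1, keepdim=True)          # (B, 1)
    neg_sim = torch.einsum('bd,bkd->bk', a, hard)    # (B, k)
    logits = torch.cat([pos_sim, neg_sim], dim=-1) / tau
    return F.cross_entropy(logits, torch.zeros(len(a), dtype=torch.long))
```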
Bootstrap Latent Representations for Multi-modal Recommendation
Graph contrastive learning; designs many contrastive losses to expand representation capability.
CB2CF: a neural multiview content-to-collaborative filtering model for completely cold item
Learns by pulling the content vectors and the collaborative-filtering (CF) vectors closer together.
SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs
Generates the image modality.
Idea: convert the image into words, let the LLM generate suitable new content as text, then decode that content back into an image.
Exploits the LLM's in-context learning ability.
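Pseudocode of that pipeline; `encode_to_words`, `llm_generate`, and `decode_from_words` are hypothetical stand-ins for the semantic-pyramid tokenizer, the frozen LLM, and the decoder.

```python
def edit_image_via_llm(image, instruction, encode_to_words, llm_generate, decode_from_words):
    words = encode_to_words(image)                 # image -> interpretable word tokens
    prompt = f"{instruction}\n{' '.join(words)}"   # in-context prompt for the frozen LLM
    new_words = llm_generate(prompt)               # LLM writes the edited content as text
    return decode_from_words(new_words)            # text tokens -> output image
```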

VILE: Block-Aware Visual Enhanced Document Retrieval

Models a document using both its whole-document and block-level information.
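A tiny sketch of fusing whole-document and block-level evidence (the max-over-blocks fusion is my assumption, not necessarily VILE's):

```python
import torch

def block_aware_score(query, doc_global, block_vecs, alpha=0.5):
    # query, doc_global: (d,); block_vecs: (num_blocks, d)
    global_score = (query * doc_global).sum()
    block_score = (block_vecs @ query).max()   # best-matching block
    return alpha * global_score + (1 - alpha) * block_score
```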



Query Difficulty Estimation for Image Search With Query Reconstruction Error

Studies query difficulty, judged from the returned results: the query is reconstructed from the results, and the reconstruction error indicates how hard the query is.
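A sketch of the reconstruction-error idea (reconstructing as the mean of top-k result embeddings is an illustrative assumption):

```python
import torch

def query_difficulty(query_vec, topk_result_vecs):
    reconstructed = topk_result_vecs.mean(dim=0)          # rebuild the query from results
    return torch.norm(query_vec - reconstructed).item()   # larger error => harder query

print(query_difficulty(torch.randn(256), torch.randn(10, 256)))
```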
From Region to Patch: Attribute-Aware Foreground-Background Contrastive Learning for Fine-Grained Fashion Retrieval

Considers foreground vs. background and regions vs. patches simultaneously.
Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation

To address the false-negative problem in contrastive learning, it differentiates degrees of similarity: a model computes similarity scores, which are used to weight all pairs.
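A sketch of similarity-weighted InfoNCE in that spirit (the exact weighting scheme is an assumption, not the paper's formulation):

```python
import torch
import torch.nn.functional as F

def similarity_weighted_infonce(img, txt, ref_sim, tau=0.07):
    # ref_sim: (B, B) similarities from a reference model in [0, 1]; a high
    # off-diagonal value flags a likely false negative.
    logits = (F.normalize(img, dim=-1) @ F.normalize(txt, dim=-1).t()) / tau
    neg_w = 1.0 - ref_sim            # down-weight likely false negatives
    neg_w.fill_diagonal_(1.0)        # true positives keep full weight
    weighted = logits.exp() * neg_w
    return -(weighted.diagonal() / weighted.sum(dim=-1)).log().mean()
```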
Planting a SEED of Vision in Large Language Model



BEIT V2: Masked Image Modeling with Vector-Quantized Visual Tokenizers
I3 Retriever: Incorporating Implicit Interaction in Pre-trained Language Models for Passage Retrieval



Unsupervised Multi-Modal Representation Learning for High Quality Retrieval of Similar Products at E-commerce Scale
Samples negatives from within the same category.
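A sketch of that sampling step (the data structures are assumptions):

```python
import random

def sample_same_category_negative(anchor_id, category_of, items_by_category):
    # Negatives from the anchor's own category are harder than random ones.
    cat = category_of[anchor_id]
    return random.choice([i for i in items_by_category[cat] if i != anchor_id])
```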


How to Bridge the Gap between Modalities: A Comprehensive Survey on Multi-modal Large Language Model
It is worth noting that the method of transferring the capabilities of LLMs to multi-modal scenarios remains unclear.
1. multimodal converter: map multi-modal features into a feature space that aligns with language
(1) direct mapping (see the sketch after this list)
the LLM learns to understand the mapped embeddings directly
drawback: over-reliance on the learning capabilities of Large Language Models
(2) textual conversion
transforming images into textual descriptions,
alleviating the deficiency of LLMs in understanding images, but relying on an expert model to convert images to text
(3) adapter-based adjustment
uses parameter-efficient tuning to make the LLM vision-capable
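Sketch for (1) above: a learned projector maps frozen vision-encoder features into the LLM's embedding space, so image patches act as pseudo word tokens (dimensions illustrative).

```python
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096
projector = nn.Linear(vision_dim, llm_dim)      # often the only newly trained part

patch_feats = torch.randn(1, 256, vision_dim)   # from a frozen ViT
visual_tokens = projector(patch_feats)          # now live in the LLM's token space
# these would be prepended/interleaved with text token embeddings for the LLM
```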


2. multimodal perceiver
(1) VAE perceiver (see the sketch after this list)
maps images to embeddings through a codebook
(2) Q-Former perceiver
(3) customization perceiver
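Sketch for (1) above: continuous image features are snapped to their nearest codebook entries, yielding discrete tokens (the codebook here is random for illustration; in practice it is learned).

```python
import torch

codebook = torch.randn(8192, 256)      # (vocab_size, dim)
feats = torch.randn(64, 256)           # per-patch image features
token_ids = torch.cdist(feats, codebook).argmin(dim=-1)  # discrete image tokens
quantized = codebook[token_ids]        # embeddings the LLM actually consumes
```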

3. tool assistance
(1) natural language assisted
LLMs as coordinators

LLMs as controllers
comprehend user intent and respond to the user directly based on the results obtained from the tools
(2) code assisted
code has better interpretability and can express tasks more precisely
(3) both code and natural language assisted
combines code and natural language
4. data-driven MLLMs
adapting a general-purpose model to specific domains
(1) disciplinary expertise
(2) spatial comprehension
(3) enhanced image comprehension
curated new high-quality data
hallucinations are attributed to training data containing no contradictory (negative) examples

(4) complex modalities
e.g., point clouds
(5) instruction-following
VPGTrans: Transfer Visual Prompt Generator across LLMs



Some conclusions:
- Inheriting the trained VPG can accelerate training.
- Warming up the linear projector can prevent performance drop and expedite VPG training
- Initializing LLM_tgt's projector with the help of the word converter can accelerate the linear projector warm-up
- Linear projector warm-up enables faster convergence with an extremely large learning rate.


Other findings:
- Merely tuning the projector cannot achieve the best performance.
- A word embedding converter cannot replace a trained linear projector.
- The projector warm-up is robust to a larger learning rate, while the VPG is not.
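A sketch of the two-stage recipe these conclusions and findings suggest (`projector`, `vpg`, `align_loss`, and the batch iterators are hypothetical placeholders; the learning rates are illustrative):

```python
import torch

def train_stage(params, loss_fn, batches, lr):
    opt = torch.optim.AdamW(params, lr=lr)
    for batch in batches:
        loss = loss_fn(batch)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Stage 1: warm up only the linear projector at an extremely large LR.
#   train_stage(projector.parameters(), align_loss, warmup_batches, lr=1e-3)
# Stage 2: then train the full VPG (+ projector) at a normal LR.
#   train_stage([*vpg.parameters(), *projector.parameters()],
#               align_loss, main_batches, lr=2e-5)
```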
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs



Multimodal Foundation Models: From Specialists to General-Purpose Assistants
Large Multimodal Models: Training with LLM
its objective is totally different from that of our models.
Image-to-Text Generative Models
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training

Is decoupling end-to-end training into two stages crucial? This question carries over to the modality-imbalance problem: decoupling can preserve the image-side encoder's capability.

May cause catastrophic forgetting.
BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval

The retrieval process requires reasoning.
5. future directions
(1) a more refined way of bridging
a. a more sophisticated multimodal perceiver
b. a perceiver adaptive to LLMs
c. unifying multi-modalities
(2) higher-quality datasets
(3) more comprehensive benchmarks
(4) multimodal agents
(5) green MLLMs
(6) applications
(7) safety of MLLMs
Open problems:
6. For different items, the image/text ratio should differ: there is image-image similarity and image-text similarity, and how should the importance of each similarity be assessed? Current methods all compute an expectation under a fixed combination ratio, giving the same weights to different notes. Idea: replace the similarity function with a learnable one (see the sketch after this list).
7. Explore how to fuse image representations; look at how multimodal LLMs handle this.
8. How to determine the importance of the image information.
9. Image-intent analysis: why does a post need an image at all?
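A sketch of the learnable-similarity idea from item 6 (the weigher architecture is an illustrative assumption):

```python
import torch
import torch.nn as nn

class LearnableFusionSim(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # predicts per-item weights over [img-img, img-txt] similarities
        self.weigher = nn.Sequential(nn.Linear(2 * dim, 64), nn.ReLU(),
                                     nn.Linear(64, 2), nn.Softmax(dim=-1))

    def forward(self, q_img, d_img, d_txt):
        sims = torch.stack([(q_img * d_img).sum(-1),    # image-image similarity
                            (q_img * d_txt).sum(-1)],   # image-text similarity
                           dim=-1)
        w = self.weigher(torch.cat([d_img, d_txt], dim=-1))  # item-dependent weights
        return (w * sims).sum(-1)                             # fused score

sim = LearnableFusionSim()
score = sim(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256))
```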




