Sinkhorn Transformations for Single-Query Postprocessing in Text-Video Retrieval
Scalable Deep Multimodal Learning for Cross-Modal Retrieval
Predefined common space; the mapping from each modality into the common space is learned separately.
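A minimal sketch of this setup (illustrative dimensions and loss, not the paper's exact architecture): each modality gets its own encoder into a shared common space, trained with an InfoNCE-style matching loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Per-modality mapping into the shared common space, learned separately."""
    def __init__(self, in_dim, common_dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                  nn.Linear(512, common_dim))

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)   # unit vectors in the common space

image_enc = ModalityEncoder(in_dim=2048)   # e.g. CNN image features
text_enc = ModalityEncoder(in_dim=768)     # e.g. BERT text features

img = image_enc(torch.randn(32, 2048))
txt = text_enc(torch.randn(32, 768))
logits = img @ txt.t() / 0.07                      # pairwise similarity matrix
loss = F.cross_entropy(logits, torch.arange(32))   # pull matched pairs together
```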
Retrieval-Augmented Multimodal Language Modeling

retrieve and generate both text and images
REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory
Can only generate text.
MR2: A Benchmark for Multimodal Retrieval-Augmented Rumor Detection in Social Media
Deep Multimodal Learning for Information Retrieval
A workshop citation.
MAKE: Vision-Language Pre-training based Product Retrieval in Taobao Search
text-to-multimodal retrieval
Taobao e-commerce search.


Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training


Two-stage recall: coarse retrieval first, then fine-grained reranking.
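A hedged sketch of the coarse-to-fine pattern; `fine_scorer` is a hypothetical stand-in for the paper's token-guided reranker, returning a scalar score per pair.

```python
import torch

def two_stage_retrieve(query_vec, gallery_vecs, fine_scorer, k_coarse=100, k_final=10):
    # Stage 1 (coarse): cheap dot-product recall over the whole gallery.
    cand = (gallery_vecs @ query_vec).topk(k_coarse).indices
    # Stage 2 (fine): expensive scoring only on the shortlist.
    # fine_scorer(query, doc) -> scalar tensor (hypothetical reranker)
    fine = torch.stack([fine_scorer(query_vec, gallery_vecs[i]) for i in cand])
    return cand[fine.argsort(descending=True)[:k_final]]
```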
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
Addresses the lack of high-quality image-text data by contributing a caption dataset.
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
modality collaboration phenomenon

key motivation:
However, this strategy can cause a mismatch in granularity, where image features often contain fruitful semantic information compared to the discrete semantic information within text embedding features. Those methods disregard the unique characteristics of visual and textual information, thus potentially limiting the model's performance.

Processes the image modality and the text modality with two separate sets of parameters.
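A minimal sketch of that idea (in the spirit of the note, not mPLUG-Owl2's exact module): shared attention, but modality-specific parameters for normalization and projection.

```python
import torch
import torch.nn as nn

class ModalityAdaptiveLayer(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # shared
        self.norm = nn.ModuleDict({"image": nn.LayerNorm(dim),
                                   "text": nn.LayerNorm(dim)})           # separate
        self.proj = nn.ModuleDict({"image": nn.Linear(dim, dim),
                                   "text": nn.Linear(dim, dim)})         # separate

    def forward(self, x, modality):
        h = self.proj[modality](self.norm[modality](x))  # modality-specific path
        out, _ = self.attn(h, h, h)                      # shared attention weights
        return x + out

layer = ModalityAdaptiveLayer()
img_out = layer(torch.randn(2, 16, 512), "image")
txt_out = layer(torch.randn(2, 16, 512), "text")
```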
MiniGPT-v2: Large Language Model As a Unified Interface for Vision-Language Multi-task Learning
motivation:
To alleviate this issue and move toward a unified approach, they propose:
- a task-oriented instruction training scheme to reduce multi-modal instructional ambiguity, and
- a vision-language model, MiniGPT-v2.

Visual Instruction Tuning
LLaVA.


Flamingo: a Visual Language Model for Few-Shot Learning


Combines image information via cross-attention.
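A rough sketch of the mechanism (dimensions illustrative, not Flamingo's actual block): text tokens cross-attend to visual features, with a tanh gate initialized to zero so image information is blended in gradually.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.xattn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # starts closed: pure LM behavior

    def forward(self, text_tokens, visual_tokens):
        attended, _ = self.xattn(text_tokens, visual_tokens, visual_tokens)
        return text_tokens + torch.tanh(self.gate) * attended

block = GatedCrossAttention()
fused = block(torch.randn(2, 32, 512), torch.randn(2, 64, 512))
```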
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts


An Empirical Study of Training End-to-End Vision-and-Language Transformers

A combinatorial study that fairly comprehensively investigates the best way to combine the components of a VL model.






Scaling Vision-Language Models with Sparse Mixture of Experts




Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception




Adding more modalities hurts single-tower (dense) encoder accuracy.


Image as a Foreign Language: BEIT Pretraining for All Vision and Vision-Language Tasks


CoCa: Contrastive Captioners are Image-Text Foundation Models


Mixture-of-Experts with Expert Choice Routing
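As I understand the title's scheme, experts select tokens rather than tokens selecting experts, which balances expert load by construction. A simplified sketch (not the paper's code):

```python
import torch

def expert_choice_route(tokens, router_w, capacity):
    # tokens: (n, d); router_w: (d, num_experts)
    probs = (tokens @ router_w).softmax(dim=-1)   # token-to-expert affinities
    gate, idx = probs.topk(capacity, dim=0)       # each expert (column) picks its top tokens
    return idx, gate                              # both (capacity, num_experts)

idx, gate = expert_choice_route(torch.randn(128, 64), torch.randn(64, 8), capacity=16)
# expert e would process tokens[idx[:, e]], weighted by gate[:, e]
```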

MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering


MOELoRA: An MOE-based Parameter Efficient Fine-Tuning Method for Multi-task Medical Applications



Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning




Multimodal Representation Learning by Alternating Unimodal Adaptation
to solve multimodal laziness

Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features








Prompt-based and weak-modality enhanced multimodal recommendation

Zero-Shot Composed Image Retrieval with Textual Inversion


Multimodal Prompt Retrieval for Generative Visual Question Answering

Align and Prompt: Video-and-Language Pre-training with Entity Prompts


Grounding Language Models to Images for Multimodal Inputs and Outputs

Generates and retrieves at the same time.
Bootstrapping Contrastive Learning Enhanced Music Cold-Start Matching
Uses the BPR loss together with contrastive learning on sampled hard negatives; everything else is standard.
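A hedged sketch of that combination (the encoders and the candidate pool are assumptions; the pool may contain the positive itself, which a real implementation would mask out):

```python
import torch
import torch.nn.functional as F

def bpr_loss(user, pos_item, neg_item):
    # BPR: the positive item should outscore the sampled negative.
    return -F.logsigmoid((user * pos_item).sum(-1) - (user * neg_item).sum(-1)).mean()

def hard_negative_infonce(anchor, positive, pool, k=5, tau=0.07):
    a, p = F.normalize(anchor, dim=-1), F.normalize(positive, dim=-1)
    c = F.normalize(pool, dim=-1)                    # (N, d) candidate negatives
    hard = c[(a @ c.t()).topk(k, dim=-1).indices]    # (B, k, d) hardest candidates
    pos_sim = (a * p).sum(-1, keepdim=True)          # (B, 1)
    neg_sim = torch.einsum('bd,bkd->bk', a, hard)    # (B, k)
    logits = torch.cat([pos_sim, neg_sim], dim=-1) / tau
    return F.cross_entropy(logits, torch.zeros(len(a), dtype=torch.long))
```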
Bootstrap Latent Representations for Multi-modal Recommendation
Graph contrastive learning; designs many contrastive losses to expand representation capability.
CB2CF: a neural multiview content-to-collaborative filtering model for completely cold item
Learns by pulling the content vectors and the collaborative-filtering (CF) vectors closer together.
SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs
Generates the image modality.
Idea: convert the image into words, let the LLM generate suitable new content as text, then decode that content back into an image.
Exploits the LLM's in-context learning ability.
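Pseudocode of that pipeline; `encode_to_words`, `llm_generate`, and `decode_from_words` are hypothetical stand-ins for the semantic-pyramid tokenizer, the frozen LLM, and the decoder.

```python
def edit_image_via_llm(image, instruction, encode_to_words, llm_generate, decode_from_words):
    words = encode_to_words(image)                 # image -> interpretable word tokens
    prompt = f"{instruction}\n{' '.join(words)}"   # in-context prompt for the frozen LLM
    new_words = llm_generate(prompt)               # LLM writes the edited content as text
    return decode_from_words(new_words)            # text tokens -> output image
```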

VILE: Block-Aware Visual Enhanced Document Retrieval

Models a document using both its whole-document and block-level information.
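A tiny sketch of fusing whole-document and block-level evidence (the max-over-blocks fusion is my assumption, not necessarily VILE's):

```python
import torch

def block_aware_score(query, doc_global, block_vecs, alpha=0.5):
    # query, doc_global: (d,); block_vecs: (num_blocks, d)
    global_score = (query * doc_global).sum()
    block_score = (block_vecs @ query).max()   # best-matching block
    return alpha * global_score + (1 - alpha) * block_score
```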



Query Difficulty Estimation for Image Search With Query Reconstruction Error

Studies query difficulty, judged from the returned results: the query is reconstructed from the results, and the reconstruction error indicates how hard the query is.
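A sketch of the reconstruction-error idea (reconstructing as the mean of top-k result embeddings is an illustrative assumption):

```python
import torch

def query_difficulty(query_vec, topk_result_vecs):
    reconstructed = topk_result_vecs.mean(dim=0)          # rebuild the query from results
    return torch.norm(query_vec - reconstructed).item()   # larger error => harder query

print(query_difficulty(torch.randn(256), torch.randn(10, 256)))
```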
From Region to Patch: Attribute-Aware Foreground-Background Contrastive Learning for Fine-Grained Fashion Retrieval

Considers foreground vs. background and regions vs. patches simultaneously.
Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation

To address the false-negative problem in contrastive learning, it differentiates degrees of similarity: a model computes similarity scores, which are used to weight all pairs.
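A sketch of similarity-weighted InfoNCE in that spirit (the exact weighting scheme is an assumption, not the paper's formulation):

```python
import torch
import torch.nn.functional as F

def similarity_weighted_infonce(img, txt, ref_sim, tau=0.07):
    # ref_sim: (B, B) similarities from a reference model in [0, 1]; a high
    # off-diagonal value flags a likely false negative.
    logits = (F.normalize(img, dim=-1) @ F.normalize(txt, dim=-1).t()) / tau
    neg_w = 1.0 - ref_sim            # down-weight likely false negatives
    neg_w.fill_diagonal_(1.0)        # true positives keep full weight
    weighted = logits.exp() * neg_w
    return -(weighted.diagonal() / weighted.sum(dim=-1)).log().mean()
```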
Planting a SEED of Vision in Large Language Model



BEIT V2: Masked Image Modeling with Vector-Quantized Visual Tokenizers
I3 Retriever: Incorporating Implicit Interaction in Pre-trained Language Models for Passage Retrieval



Unsupervised Multi-Modal Representation Learning for High Quality Retrieval of Similar Products at E-commerce Scale
Samples negatives from within the same category.
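A sketch of that sampling step (the data structures are assumptions):

```python
import random

def sample_same_category_negative(anchor_id, category_of, items_by_category):
    # Negatives from the anchor's own category are harder than random ones.
    cat = category_of[anchor_id]
    return random.choice([i for i in items_by_category[cat] if i != anchor_id])
```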


How to Bridge the Gap between Modalities: A Comprehensive Survey on Multi-modal Large Language Model
It is worth noting that the method of transferring the capabilities of LLMs to multi-modal scenarios remains unclear.
1. multimodal converter: map multi-modal features into a feature space that aligns with language
(1) direct mapping (see the sketch after this list)
the LLM learns to understand the mapped embeddings directly
drawback: over-reliance on the learning capabilities of Large Language Models
(2) textual conversion
transforming images into textual descriptions,
alleviating the deficiency of LLMs in understanding images, but relying on an expert model to convert images to text
(3) adapter-based adjustment
uses parameter-efficient tuning to make the LLM vision-capable
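Sketch for (1) above: a learned projector maps frozen vision-encoder features into the LLM's embedding space, so image patches act as pseudo word tokens (dimensions illustrative).

```python
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096
projector = nn.Linear(vision_dim, llm_dim)      # often the only newly trained part

patch_feats = torch.randn(1, 256, vision_dim)   # from a frozen ViT
visual_tokens = projector(patch_feats)          # now live in the LLM's token space
# these would be prepended/interleaved with text token embeddings for the LLM
```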


2. multimodal perceiver
(1) VAE perceiver (see the sketch after this list)
maps images to embeddings through a codebook
(2) Q-Former perceiver
(3) customization perceiver
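Sketch for (1) above: continuous image features are snapped to their nearest codebook entries, yielding discrete tokens (the codebook here is random for illustration; in practice it is learned).

```python
import torch

codebook = torch.randn(8192, 256)      # (vocab_size, dim)
feats = torch.randn(64, 256)           # per-patch image features
token_ids = torch.cdist(feats, codebook).argmin(dim=-1)  # discrete image tokens
quantized = codebook[token_ids]        # embeddings the LLM actually consumes
```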

3. tool assistance
(1) natural language assisted
LLMs as coordinators

LLMs as controllers
comprehend user intent and respond to the user directly based on the results obtained from the tools
(2) code assisted
code has better interpretability and can express tasks more precisely
(3) both code and natural language assisted
combines code and natural language
4. data-driven MLLMs
adapting a general-purpose model to specific domains
(1) disciplinary expertise
(2) spatial comprehension
(3) enhanced image comprehension
curated new high-quality data
hallucinations are attributed to training data containing no contradictory (negative) examples

(4) complex modalities
e.g., point clouds
(5) instruction-following
VPGTrans: Transfer Visual Prompt Generator across LLMs



Some conclusions:
- Inheriting the trained VPG can accelerate training.
- Warming up the linear projector can prevent performance drop and expedite VPG training
- Initializing LLM_tgt's projector with the help of the word converter can accelerate the linear projector warm-up
- Linear projector warm-up enables faster convergence with an extremely large learning rate.


Other findings:
- Merely tuning the projector cannot achieve the best performance.
- A word embedding converter cannot replace a trained linear projector.
- The projector warm-up is robust to a larger learning rate, while the VPG is not.
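A sketch of the two-stage recipe these conclusions and findings suggest (`projector`, `vpg`, `align_loss`, and the batch iterators are hypothetical placeholders; the learning rates are illustrative):

```python
import torch

def train_stage(params, loss_fn, batches, lr):
    opt = torch.optim.AdamW(params, lr=lr)
    for batch in batches:
        loss = loss_fn(batch)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Stage 1: warm up only the linear projector at an extremely large LR.
#   train_stage(projector.parameters(), align_loss, warmup_batches, lr=1e-3)
# Stage 2: then train the full VPG (+ projector) at a normal LR.
#   train_stage([*vpg.parameters(), *projector.parameters()],
#               align_loss, main_batches, lr=2e-5)
```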
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs



Multimodal Foundation Models: From Specialists to General-Purpose Assistants
Large Multimodal Models: Training with LLM
its objective is totally different from that of our models.
Image-to-Text Generative Models
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training

Is decoupling end-to-end training into two stages crucial? This question carries over to the modality-imbalance problem: decoupling can preserve the image-side encoder's capability.

May cause catastrophic forgetting.
BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval

The retrieval process requires reasoning.
5. future directions
(1) a more refined way of bridging
a. a more sophisticated multimodal perceiver
b. a perceiver adaptive to LLMs
c. unifying multi-modalities
(2) higher-quality datasets
(3) more comprehensive benchmarks
(4) multimodal agents
(5) green MLLMs
(6) applications
(7) safety of MLLMs
Open problems:
6. For different items, the image/text ratio should differ: there is image-image similarity and image-text similarity, and how should the importance of each similarity be assessed? Current methods all compute an expectation under a fixed combination ratio, giving the same weights to different notes. Idea: replace the similarity function with a learnable one (see the sketch after this list).
7. Explore how to fuse image representations; look at how multimodal LLMs handle this.
8. How to determine the importance of the image information.
9. Image-intent analysis: why does a post need an image at all?
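A sketch of the learnable-similarity idea from item 6 (the weigher architecture is an illustrative assumption):

```python
import torch
import torch.nn as nn

class LearnableFusionSim(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # predicts per-item weights over [img-img, img-txt] similarities
        self.weigher = nn.Sequential(nn.Linear(2 * dim, 64), nn.ReLU(),
                                     nn.Linear(64, 2), nn.Softmax(dim=-1))

    def forward(self, q_img, d_img, d_txt):
        sims = torch.stack([(q_img * d_img).sum(-1),    # image-image similarity
                            (q_img * d_txt).sum(-1)],   # image-text similarity
                           dim=-1)
        w = self.weigher(torch.cat([d_img, d_txt], dim=-1))  # item-dependent weights
        return (w * sims).sum(-1)                             # fused score

sim = LearnableFusionSim()
score = sim(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256))
```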




