Evolver: Chain-of-Evolution Prompting to Boost Large Multimodal Modelsfor Hateful Meme Detection


Paper link: [2407.21004] Evolver: Chain-of-Evolution Prompting to Boost Large Multimodal Models for Hateful Meme Detection

Venue: ACL

Abstract

Hateful memes keep evolving by fusing emerging cultural ideas, and new memes constantly appear, rendering existing methods that rely on extensive training outdated or ineffective. In this work, we propose Evolver, which boosts Large Multimodal Models (LMMs) with Chain-of-Evolution (CoE) prompting by integrating the evolutionary attributes and contextual information of memes. Specifically, Evolver uses an evolutionary pair mining module, an evolutionary information extractor, and a contextual relevance amplifier to progressively model, through LMMs, how memes evolve and how their meaning is expressed. Extensive experiments on the public FHM, MAMI, and HarM datasets show that CoE prompting can be incorporated into existing LMMs to improve their performance. More encouragingly, it can serve as an interpretive tool to promote the understanding of meme evolution.

Fig. 1a: Fusion and evolution of hateful memes.

Memes evolve by fusing new cultural concepts. The Trump meme is influenced by the sad-frog meme in both image and text symbols, creating a new hateful meme.

Fig. 1b: Conventional two-stream methods. Drawbacks: poor interpretability, over-fitting on training sets, and diminished effectiveness in the dynamic and evolving meme landscape.

Fig. 1c: Our evolution-based approach: captures the evolution and context of memes, utilizing them as prompts for large multimodal models to obtain a comprehensive understanding of memes.

Evolver is an innovative approach that significantly advances zero-shot hateful meme detection.

Main contributions:

- We establish an LMM-based zero-shot hateful meme detection benchmark, which provides a comprehensive evaluation of the application of LMMs in social media.
- We propose a simple yet effective Evolver framework, which advances LMMs with Chain-of-Evolution prompting. It expands LMMs with an evolution reasoning ability while offering good interpretability.
- Extensive experiments on commonly used zero-shot hateful meme detection datasets with superior performance validate the efficacy and generalization of our method.

Method

Problem Definition

Hateful meme detection is cast as a binary classification problem.

A meme, in our case, consists of an image-text pair. The prediction is

ŷ_j = g(X_v^j, X_t^j)

where ŷ_j ∈ {0, 1} denotes the j-th prediction, indicating whether the target image-text pair is hateful or not; g(·) is the multimodal model; and {X_v^j, X_t^j} is the j-th image-text paired input.
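As a concrete reading of this formulation, here is a minimal Python sketch of the detection interface; the `Meme` class and the `toy_g` stand-in are hypothetical, purely for illustration of the ŷ = g(X_v, X_t) signature:

```python
from dataclasses import dataclass

@dataclass
class Meme:
    image_path: str  # X_v: the meme image
    text: str        # X_t: the overlaid caption

def detect(g, meme: Meme) -> int:
    """Binary hateful-meme detection: y_hat = g(X_v, X_t) in {0, 1}."""
    return g(meme.image_path, meme.text)

# Toy stand-in for g(.): a real g is a multimodal model; this one only
# keys on a trigger word, purely for illustration.
toy_g = lambda image, text: int("hateful" in text.lower())
pred = detect(toy_g, Meme("meme.png", "a hateful caption"))
```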

Large Multimodal Models

Vision Encoder

The vision encoder helps language models understand visual content. It leverages frozen pretrained vision models such as CLIP (Radford et al., 2021) and ViT (Dosovitskiy et al., 2020) to encode visual content.

h_v = W · Enc_vis(X_v)

where Enc_vis(X_v) denotes the visual features extracted by the pretrained model, W is the projection that transforms visual features into language embedding tokens, and h_v is the resulting sequence of language embedding tokens. In short, the image is first encoded into visual features, which are then projected into the language embedding space.
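This two-step mapping can be sketched as follows, assuming a stand-in encoder and a random projection matrix W (both illustrative; the real encoder is a frozen CLIP/ViT and W is learned):

```python
import numpy as np

rng = np.random.default_rng(0)

def enc_vis(image: np.ndarray, d_vis: int = 1024) -> np.ndarray:
    """Stand-in for a frozen vision encoder (e.g., CLIP/ViT): maps an
    image to a sequence of patch features of shape (n_patches, d_vis)."""
    n_patches = 16
    return rng.standard_normal((n_patches, d_vis))

# W projects visual features into the language embedding space:
# h_v = Enc_vis(X_v) @ W
d_vis, d_lang = 1024, 768
W = rng.standard_normal((d_vis, d_lang)) * 0.02

h_v = enc_vis(np.zeros((224, 224, 3))) @ W  # shape (16, 768)
```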

Large Language Model Decoder

This is the standard decoder generation logic, extended with multimodal input (text and image).

The decoder generates a sentence given tokens:

h_t = Tokenize(X_t),    p(w) = ∏_i p(w_i | w_<i, h_t, h_v)

where X_t is the input text, Tokenize(·) transforms text into tokens h_t, h_v are the image tokens, p(w) is the probability of the language model generating the sentence, and p(w_i | w_<i, h_t, h_v) is the probability of generating the token at position i given the previously generated tokens, the input text, and the image tokens.

This is a typical autoregressive language model, generating each token step by step.

The visual tokens are incorporated with textual tokens and then fed to the language model.
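The autoregressive decoding loop can be sketched as below; `next_token_probs` is a random stand-in for the real conditional distribution p(w_i | w_<i, h_t, h_v), which an actual LMM computes via attention over the text and visual tokens:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["<eos>", "hateful", "not", "meme", "this", "is"]

def next_token_probs(prefix, h_t, h_v):
    """Stand-in for p(w_i | w_<i, h_t, h_v): a real LMM conditions on
    the text tokens h_t and visual tokens h_v; here we return a random
    distribution for illustration."""
    logits = rng.standard_normal(len(VOCAB))
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def generate(h_t, h_v, max_len=10):
    """Greedy autoregressive decoding: pick the argmax token until <eos>."""
    out = []
    for _ in range(max_len):
        probs = next_token_probs(out, h_t, h_v)
        tok = VOCAB[int(np.argmax(probs))]
        if tok == "<eos>":
            break
        out.append(tok)
    return out

sentence = generate(h_t=np.zeros((4, 768)), h_v=np.zeros((16, 768)))
```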

Evolver: Chain-of-Evolution Prompting

Pipeline: retrieve similar memes → extract hateful features → contextual enhancement → LMM inference.

We design a novel Chain-of-Evolution (CoE) prompting scheme:

Evolutionary pair mining aims to find the K memes closest to the target meme and to identify, from them, which features relate to hatefulness. The extracted content serves as evolution information and helps the model understand how these memes evolved.

This evolutionary information is then fed into the Evolution Information Extractor to distill the key hatefulness-related features. Next, the Contextual Relevance Amplifier enhances the extracted information, reshaping it into a form more helpful for the current task.

Finally, during inference, the LMM combines this amplified information to make the final judgment. The amplifier can be understood as improving accuracy by strengthening the information and adding explicit instructions.

In short, the amplifier acts as an enhancement between the second and third stages; by introducing background information more relevant to the current task, the final judgment becomes more accurate.

Evolutionary Pair Mining

We use cosine similarity to find the K memes this meme may derive from or share origins with.

Motivated by Qu et al. (2023), meme evolution is defined as new memes emerging through the fusion of other memes or cultural ideas. An evolved meme therefore shares textual and visual semantic regularities with the older memes it derives from. We exploit this property to identify those older memes from the evolved one, and thereby use the evolution of hateful memes to strengthen LMMs' understanding of them.

1.  CLIP: generate the textual and visual embeddings from an external meme pool and target memes.

The external meme pool should: (1) not overlap with the test set, and (2) contain enough evolutionary information.

We use the training set as the carefully curated meme pool rather than any other dataset, since its memes follow the same definition of hatefulness/harm/misogyny.

2. Fuse the textual and visual embeddings with a fixed ratio.

* Existing techniques struggle to identify the exact contextual sources of an evolved meme.

Instead, we use multiple evolutionary memes to find the common characteristics of this evolution:

we retrieve the top-K similar memes using cosine similarity:

indices = TopK(cos(A, B))

where A ∈ R^(n×d) is the embedding matrix of the n candidate memes, B ∈ R^d is the d-dimensional embedding vector of the target meme, cos(·) returns a similarity vector, and TopK(·) returns the K highest values of the input vector. Each target meme is thus paired with the K memes from which it is plausibly derived.
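The retrieval step can be sketched with NumPy; the pool and target vectors below are toy data, and the fused text+image embeddings are assumed to have been built upstream:

```python
import numpy as np

def top_k_similar(A: np.ndarray, B: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k candidate memes most similar to the target.

    A: (n, d) fused text+image embeddings of the candidate pool.
    B: (d,)   fused embedding of the target meme.
    """
    # Cosine similarity: normalize rows of A and the target vector B.
    A_norm = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_norm = B / np.linalg.norm(B)
    sims = A_norm @ B_norm              # (n,) similarity vector
    return np.argsort(sims)[::-1][:k]   # indices of the K highest values

pool = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]])
target = np.array([1.0, 0.2])
idx = top_k_similar(pool, target, k=2)  # -> [0, 2]
```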

Evolution Information Extractor

To extract the information we are interested in (e.g., hateful component), we summarize paired memes with the help of a large multimodal model.

Info = LMM(memes_TopK, X_extract)

where Info stands for the extracted evolutionary information, LMM indicates the large multimodal model, memes_TopK are the Top-K memes retrieved in the previous step, and X_extract is the instruction guiding the LMM to extract information.

We present the detailed instruction of Xextract as shown in Table 1.

More definitions of hateful memes from different datasets are shown in Appendix D.
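A hedged sketch of this extraction step follows; `call_lmm` and the instruction wording `X_EXTRACT` are hypothetical stand-ins, not the paper's exact prompt (which is given in its Table 1):

```python
# Sketch of Info = LMM(memes_TopK, X_extract). The instruction text and
# call signature below are assumptions, purely for illustration.
X_EXTRACT = (
    "Summarize what these related memes have in common, "
    "focusing on any hateful components."
)

def extract_evolution_info(call_lmm, topk_memes: list) -> str:
    """Summarize the Top-K retrieved memes into evolutionary information."""
    return call_lmm(images=[m["image"] for m in topk_memes],
                    texts=[m["text"] for m in topk_memes],
                    instruction=X_EXTRACT)

# Toy LMM stand-in that just echoes a summary string.
fake_lmm = lambda images, texts, instruction: f"Common theme across {len(texts)} memes"
info = extract_evolution_info(fake_lmm, [{"image": None, "text": "a"},
                                         {"image": None, "text": "b"}])
```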

Contextual Relevance Amplifier

To enhance the in-context hatefulness information, we add a contextual relevance amplifier to the LMM during both evolution information extraction and final prediction, intensifying the search for hateful components.

The contextual relevance amplifier is the definition of a hateful meme given by the dataset we use.

we combine the extracted information and contextual relevance amplifier as the in-context enhancement and feed them to the model:

ŷ = LMM(meme_T, Info, Amp, X_D)

where ŷ is the final prediction, Info stands for the information extracted previously, meme_T is the target meme we want to detect, X_D is the instruction asking the large multimodal model to detect hateful memes, and Amp refers to the contextual relevance amplifier. An example of the amplifier is the blue part in Table 1.

Can this be understood as: first using the K closest memes to derive info about what is hateful, then feeding this info into the final judgment, with the amplifier added as an instruction-based enhancement in stages 2 and 3?
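A sketch of the final prediction call; the amplifier text, instruction wording, and concatenation order below are assumptions rather than the paper's exact template:

```python
# Sketch of y_hat = LMM(meme_T, Info, Amp, X_D). AMP and X_D are
# illustrative placeholders, not the paper's exact prompt.
AMP = "Definition: a hateful meme attacks people based on protected attributes."
X_D = "Given the meme and the context above, answer: is this meme hateful? (yes/no)"

def predict(call_lmm, target_meme: dict, info: str) -> int:
    """Map the LMM's textual answer to a binary prediction in {0, 1}."""
    prompt = "\n".join([AMP, f"Evolution information: {info}", X_D])
    answer = call_lmm(image=target_meme["image"],
                      text=target_meme["text"],
                      instruction=prompt)
    return int("yes" in answer.lower())

# Toy LMM stand-in returning a fixed answer, purely for illustration.
fake_lmm = lambda image, text, instruction: "Yes, it is hateful."
y_hat = predict(fake_lmm, {"image": None, "text": "caption"}, "targets disability")
```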

The Principles of Prompts

Principles:

(1) include the hateful meme definition and (2) limit the prompt to 30 words, addressing LMMs’ challenges with long-text comprehension. 

We directly applied definitions from the FHM, MAMI, and HarM datasets, summarized with GPT-4 so that the whole prompt meets this length requirement. Beyond this, we do not use any specialized prompt design, highlighting the robustness of our method.

Experiment

Datasets 

Facebook Hateful Meme dataset (FHM) (Kiela et al., 2020)

Harmful Meme dataset (HarM) (Pramanick et al., 2021)

Multimedia Automatic Misogyny Identification (MAMI) (Fersini et al., 2022)

Implementation

We implement our Evolver on two backbones.

MMICL:
- Evolution Information Extractor: minimum generation length 50, maximum generation length 80, temperature 0.2.
- Final prediction: minimum generation length 1, maximum generation length 50.

LLaVA-1.5:
- Both stages: temperature 0.2, maximum generated tokens 1024.

The textual and visual embeddings have size N × 768, where N is the number of memes. In practice, we fuse the textual and visual embeddings at a fixed 4:1 ratio via element-wise addition.
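The 4:1 fusion can be sketched as a weighted element-wise addition; note that normalizing the weights to sum to 1 is one plausible reading of "fixed ratio", not necessarily the paper's exact formula:

```python
import numpy as np

def fuse_embeddings(text_emb: np.ndarray, vis_emb: np.ndarray,
                    ratio: float = 4.0) -> np.ndarray:
    """Fuse text and image embeddings at a fixed text:image ratio
    (4:1 here) via element-wise weighted addition.

    Assumption: weights ratio/(ratio+1) and 1/(ratio+1); the paper's
    exact weighting scheme may differ.
    """
    w_text = ratio / (ratio + 1.0)
    w_vis = 1.0 / (ratio + 1.0)
    return w_text * text_emb + w_vis * vis_emb

text_emb = np.ones((3, 768))   # N x 768 textual embeddings
vis_emb = np.zeros((3, 768))   # N x 768 visual embeddings
fused = fuse_embeddings(text_emb, vis_emb)  # each entry = 0.8
```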

Baselines

The detailed description of the baseline models, including API-based (e.g., GPT-4V) and open-source LMMs (e.g., LLaVA), can be found in Appendix A.

Metrics

We adopt ACC (accuracy) and AUC (area under the ROC curve) as evaluation metrics.

LMM Backbones

We implement our Evolver based on MMICL (Zhao et al., 2023) and LLaVA-1.5 (Liu et al., 2023a). Please refer to Appendix B for more details.

Result

There is still considerable room for improvement.

The original model (MMICL), without considering the meme's evolution, predicts non-hateful because it is not easy to detect the boy's disability or genetic conditions. After obtaining evolution information like "discriminates against individuals with disabilities or genetic conditions," our Evolver rectifies the prediction, which supports the rationale of our method.

A larger model size does not necessarily lead to a better understanding of hateful memes.

Closed-source models still outperform open-source LMMs in zero-shot results, suggesting a significant gap for open-source models to close (i.e., open-source models such as Qwen lag behind closed-source ones such as GPT).

Typical existing training-based models use fully supervised settings, mainly based on CLIP (Radford et al., 2021) and BERT (Devlin, 2018), and cannot provide interpretability.

Ablation Study

As shown in Figure 3, we study the effect of the number K of evolutionary memes. We set K = 5, which achieves the best performance.

Limitation

First, the effectiveness of our approach relies heavily on the quality and diversity of the curated meme pool used for seeking evolutionary memes.

Furthermore, biases inherent in these datasets could potentially affect the model's ability to generalize across different cultural contexts and meme evolution patterns not represented in the related data.

Conclusion

We present Evolver to seamlessly boost LMMs for hateful meme detection via Chain-of-Evolution prompting. By integrating evolution of memes, our method can adapt to unseen memes. Experimental results demonstrate the effectiveness of Evolver.
