CrossAttentionControl: Text-Guided Image Editing and Semantic Strengthening

Article link: https://blog.youkuaiyun.com/wangyunpeng33/article/details/129772170

Contents

Cross Attention Control
Replacement
Refinement
Re-weight


Cross Attention Control

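We can edit images by injecting cross-attention maps during the diffusion process, controlling which pixels attend to which tokens of the prompt text at which diffusion steps. The attention maps of the source image are used to control the spatial layout and geometry of the generated image. When a word in the prompt is swapped, we inject the source image maps M_t, overriding the target image maps M*_t, to preserve the spatial layout. When a new phrase is added, we inject only the maps that correspond to the unchanged part of the prompt. The semantic effect of a word is amplified or attenuated by re-weighting its attention map.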
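To make "which pixels attend to which tokens" concrete, here is a minimal sketch of how a cross-attention map can be computed with standard scaled dot-product attention. The function name `cross_attention_map` and the toy vectors are illustrative assumptions, not code from the article; a real diffusion model computes this with learned projections inside each attention layer.

```python
import math


def cross_attention_map(queries, keys):
    """Return M where M[i][j] is how strongly pixel i attends to token j.

    queries: one feature vector per pixel (from the noisy image).
    keys:    one feature vector per prompt token (from the text embedding).
    Hypothetical helper for illustration only.
    """
    dim = len(keys[0])
    attention = []
    for q in queries:
        # Scaled dot product between this pixel and every token.
        logits = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        # Numerically stable softmax over the token axis.
        peak = max(logits)
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        attention.append([e / total for e in exps])
    return attention


# Two pixels, three tokens, feature dimension 2 (toy values).
pixels = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
attention_map = cross_attention_map(pixels, tokens)
```

Each row of `attention_map` sums to 1, so a row can be read as one pixel's distribution over the prompt tokens. The edits described in this article all operate on maps of this shape.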


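[Figure: average attention masks for each word in the text prompt used to generate the image.]
[Figure: attention maps for the tokens "bear" and "bird" at different diffusion steps.]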

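To apply our method to a variety of creative editing applications, we show several ways to control the cross-attention maps through a simple, semantic interface: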

Replacement

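The first is to change a single token's value in the prompt (e.g., "dog" to "cat") while keeping the cross-attention maps fixed, to preserve the scene composition.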

In this case, the user swaps tokens of the original prompt with others, e.g., editing the prompt “A painting of a squirrel eating a burger” to “A painting of a squirrel eating a lasagna” or “A painting of a lion eating a burger”. For this we define the class AttentionReplace.
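The word-swap rule can be sketched as a per-step choice between the source and target maps. This is an illustrative assumption about how such an edit might be written, not the AttentionReplace class itself: `replace_edit` is a hypothetical name, steps are assumed to be counted from the start of sampling (step 0 is the noisiest), and the source maps are injected only during an initial fraction of the steps.

```python
def replace_edit(source_map, target_map, step, num_steps, replace_fraction=0.8):
    """Pick the attention map to use at one sampling step of a word-swap edit.

    For the first `replace_fraction` of the steps, the source map overrides the
    target map so the layout of the source image is kept. Afterwards the target
    map is used so the new word can shape the details.
    Hypothetical helper; names and defaults are assumptions.
    """
    inject_until = round(num_steps * replace_fraction)
    return source_map if step < inject_until else target_map


# Toy maps: one pixel, two tokens.
source = [[0.9, 0.1]]
target = [[0.3, 0.7]]
schedule = [replace_edit(source, target, s, num_steps=10, replace_fraction=0.5)
            for s in range(10)]
```

Stopping the injection before the last step is a design choice: injecting at every step can over-constrain the geometry when the new word describes an object with a different shape.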

Refinement

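The second is to edit the image globally, for example changing its style, by adding new words to the prompt and freezing the attention on the previous tokens while allowing new attention to flow to the new tokens.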

In this case, the user adds new tokens to the prompt, e.g., editing the prompt “A painting of a squirrel eating a burger” to “A watercolor painting of a squirrel eating a burger”. For this we define the class AttentionEditRefine.
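One way to picture the refinement edit: align each token of the new prompt with a token of the original prompt, copy the source attention for aligned tokens, and keep the target attention for the newly added ones. The sketch below is an assumption-laden illustration, not the AttentionEditRefine class: `align_tokens` and `refine_edit` are hypothetical names, and prompts are split on whitespace instead of using the model's real tokenizer.

```python
from difflib import SequenceMatcher


def align_tokens(source_prompt, target_prompt):
    """Map each target word index to its source word index, or None if new.

    Assumption: whitespace splitting stands in for the real tokenizer.
    """
    src = source_prompt.split()
    tgt = target_prompt.split()
    alignment = [None] * len(tgt)
    matcher = SequenceMatcher(None, src, tgt, autojunk=False)
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            alignment[block.b + offset] = block.a + offset
    return alignment


def refine_edit(source_map, target_map, alignment):
    """Inject source attention for unchanged tokens, keep target attention for new ones."""
    edited = []
    for src_row, tgt_row in zip(source_map, target_map):
        edited.append([
            tgt_row[j] if alignment[j] is None else src_row[alignment[j]]
            for j in range(len(tgt_row))
        ])
    return edited


alignment = align_tokens(
    "A painting of a squirrel",
    "A watercolor painting of a squirrel",
)
```

Here `alignment` marks "watercolor" as new (`None`) and maps every other word back to its position in the original prompt, so only the attention for "watercolor" is left free to change.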

Re-weight

The third is to amplify or attenuate the semantic effect of a word in the generated image.

In this case, the user changes the weight of certain tokens in the prompt, e.g., for the prompt “A photo of a poppy field at night”, strengthening or weakening the extent to which the word “night” affects the resulting image. For this we define the class AttentionReweight.
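Re-weighting can be sketched as scaling one token's column of the attention map while leaving the other columns untouched. This is a minimal illustration under stated assumptions, not the AttentionReweight class: `reweight_edit` is a hypothetical name and the map is a plain nested list of shape pixels × tokens.

```python
def reweight_edit(attention_map, token_index, scale):
    """Multiply the attention that every pixel pays to one token by `scale`.

    scale > 1 strengthens the word's effect, 0 <= scale < 1 weakens it.
    Hypothetical helper for illustration only.
    """
    return [
        [weight * scale if j == token_index else weight
         for j, weight in enumerate(row)]
        for row in attention_map
    ]


# Two pixels, three tokens; strengthen the last token (e.g. "night").
attention_map = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]
stronger = reweight_edit(attention_map, token_index=2, scale=2.0)
```

Note that the scaled rows no longer sum to 1. That is intended in this sketch: the goal is to change how much the chosen token contributes relative to the others.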


Attention Control Options

cross_replace_steps: specifies the fraction of steps to edit the cross attention maps. Can also be set to a dictionary [str:float] which specifies fractions for different words in the prompt.
self_replace_steps: specifies the fraction of steps to replace the self attention maps.
local_blend (optional): LocalBlend object which is used to make local edits. LocalBlend is initialized with the words from each prompt that correspond with the region in the image we want to edit.
equalizer: used for attention Re-weighting only. A vector of coefficients, one per prompt token, by which each cross-attention weight is multiplied.
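The sketch below shows one way these options could be interpreted: turning a step fraction (global or per word) into a number of sampling steps, and building an equalizer vector with one coefficient per word. Both helpers, `injection_steps` and `make_equalizer`, are hypothetical and written only for illustration; they are not the repository's API, and whitespace splitting again stands in for the real tokenizer.

```python
def injection_steps(cross_replace_steps, word, num_steps, default=1.0):
    """Convert a step fraction into a count of sampling steps for one word.

    cross_replace_steps may be a single float or a dict of per-word fractions.
    Words missing from the dict fall back to `default` (an assumption here).
    """
    if isinstance(cross_replace_steps, dict):
        fraction = cross_replace_steps.get(word, default)
    else:
        fraction = cross_replace_steps
    return round(fraction * num_steps)


def make_equalizer(prompt, word_scales):
    """Build one coefficient per word: the given scale, or 1.0 if unspecified."""
    return [word_scales.get(word, 1.0) for word in prompt.split()]


steps_global = injection_steps(0.8, "burger", num_steps=50)
steps_word = injection_steps({"lasagna": 0.2}, "lasagna", num_steps=50)
equalizer = make_equalizer("A photo of a poppy field at night", {"night": 4.0})
```

A per-word fraction lets a swapped word such as "lasagna" escape the source attention early, while the rest of the prompt stays tied to the source layout for longer.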