IP-Adapter 和 InstantStyle 代码笔记_ip-adapter代码-优快云博客

本文链接：https://blog.youkuaiyun.com/zhilaizhiwang/article/details/144938798

IP-Adapter：Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
InstantStyle：Github
IP-Adapter主要包含两个部分：Image Encoder 和 Decoupled Cross-Attention

Image Encoder

To effectively decompose the global image embedding, we use a small trainable projection network to project the image
embedding into a sequence of features with length N (we use N = 4 in this study), the dimension of the image features
is the same as the dimension of the text features in the pretrained diffusion model. The projection network we used in
this study consists of a linear layer and a Layer Normalization

class ImageProjModel(torch.nn.Module):
    """Projection Model"""

    def __init__(self, cross_attention_dim=1024, clip_embeddings_dim=1024, clip_extra_context_tokens=4):
        super().__init__()

        self.generator = None
        self.cross_attention_dim = cross_attention_dim
        self.clip_extra_context_tokens = clip_extra_context_tokens
        self.proj = torch.nn.Linear(clip_embeddings_dim, self.clip_extra_context_tokens * cross_attention_dim)
        self.norm = torch.nn.LayerNorm(cross_attention_dim)

    def forward(self, image_embeds):
        embeds = image_embeds
        clip_extra_context_tokens = self.proj(embeds).reshape(
            -1, self.clip_extra_context_tokens, self.cross_attention_dim
        )
        clip_extra_context_tokens = self.norm(clip_extra_context_tokens)
        return clip_extra_context_tokens

Decoupled Cross-Attention

IP-Adapter 主要是在stable diffusion模型的unet部分插入图像特征，unet可以参考stable diffusion中的UNet2DConditionModel代码解读

在SD模型中，给定查询特征Z和文本特征 $c_{t}$ ,corss-attention 的计算公式如下，其中，K和V都由 $c_{t}$ 计算得到。

在这里插入图片描述

插入图像特征的一种直接方法就是把图像特征和文本特征拼在一起，输入cross-attention中，但是作者发现这种方法不够有效。（但是在图像理解方面却很有效，LLAVA就是这么做的。）然后设计了一种解耦的交叉注意力机制，把文本特征和图像特征分别计算cross-attention，然后再把结果相加。
图像cross-attention计算的时候，Q和之前一样，但是 $K^{'}$ 和 $V^{'}$ 由图像特征 $c_{i}$ 计算得到。
$K^{'}$ 和 $V^{'}$ 使用K和V初始化。训练的时候也只训练 $K^{'}$ 和 $V^{'}$ 。

在这里插入图片描述

在推理的时候，使用下面的公式，可以设置图像条件的权重。
在这里插入图片描述
权重大小对于生成图像的影响，权重越小，图片对最终结果的影响也越小。

InstantStyle

InstantStyle 研究了 IP-Adapter （SDXL）不同的attention 层对生成结果的影响，发现up_locks.0.attentions.1 层影响style，down_blocks.2.attentions.1 层影响layout。

To be more specific, we find up blocks.0.attentions.1 and down blocks.2.attentions.1 capture style (color, material, atmosphere) and
spatial layout (structure, composition) respectively, the consideration of layout as a stylistic element is subjective and can vary among individuals.

在这里插入图片描述

这样就可以把style和layout分离，更准确的控制生成结果。
控制方法也很简单，给IP-Adapter不同的层添加不同的权重，其他层的默认权重则为0。

scale = {
    "down": {"block_2": [0.0, 1.0]},
    "up": {"block_0": [0.0, 1.0, 0.0]},
}

Text-to-image代码（Style & layout control）

from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")

scale = {
    "down": {"block_2": [0.0, 1.0]},
    "up": {"block_0": [0.0, 1.0, 0.0]},
}
#scale = {"up": {"block_0": [0.0, 1.0, 0.0]},}
pipeline.set_ip_adapter_scale(scale)
#pipeline.set_ip_adapter_scale(0.4)

style_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg")

generator = torch.Generator(device="cpu").manual_seed(26)
image = pipeline(
    prompt="a cat, masterpiece, best quality, high quality",
    ip_adapter_image=style_image,
    negative_prompt="text, watermark, lowres, low quality, worst quality, deformed, glitch, low contrast, noisy, saturation, blurry",
    guidance_scale=5,
    num_inference_steps=30,
    generator=generator,
).images[0]
image.save("ip.jpg")

最佳实践

如果只使用图像prompt，可以设置 scale=1.0，把text prompt 设置为 “” 或者 “best quality”。
一般设置 scale=0.5，可以得到一个较好的效果。
CLIP默认使用了 center cropped 处理图片，那么IP-Adapter在正方形的图片上表现较好。对于非正方形，可能会丢失中间之外的信息。因此，建议把图片resize到224*224.

IPAdapterAttnProcessor2_0

IPAdapterAttnProcessor2_0的部分代码，主要是ip-adapter部分，即前面公式的后半部分。
这里使用了for循环，可以让ip_adapter_image 传入多张图片，形成混合效果，同时pipeline.load_ip_adapter也需要加载多个权重。

for current_ip_hidden_states, scale, to_k_ip, to_v_ip, mask in zip(
    ip_hidden_states, self.scale, self.to_k_ip, self.to_v_ip, ip_adapter_masks
):
     ip_key = to_k_ip(current_ip_hidden_states)
     ip_value = to_v_ip(current_ip_hidden_states)

     ip_key = ip_key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
     ip_value = ip_value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)

     # the output of sdp = (batch, num_heads, seq_len, head_dim)
     # TODO: add support for attn.scale when we move to Torch 2.1
     current_ip_hidden_states = F.scaled_dot_product_attention(
         query, ip_key, ip_value, attn_mask=None, dropout_p=0.0, is_causal=False
     )

     current_ip_hidden_states = current_ip_hidden_states.transpose(1, 2).reshape(
         batch_size, -1, attn.heads * head_dim
     )
     current_ip_hidden_states = current_ip_hidden_states.to(query.dtype)

     hidden_states = hidden_states + scale * current_ip_hidden_states