IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Problem introduction

  • There is a need for models that take an image as the prompt, but directly fine-tuning the base model is computationally expensive, must be redone whenever the base model changes, and is incompatible with plugins such as ControlNet;
  • SD Image Variations and Stable unCLIP replace the original text prompt with the image embedding from a CLIP image encoder. This requires substantial extra training, the resulting model loses its text-conditioning ability, cannot be transferred directly to other models fine-tuned from the same base, and does not work with ControlNet;
  • This paper proposes IP-Adapter, which gives a base text-to-image model the ability to condition generation on an image. The core of the method is decoupled cross-attention: separate cross-attention layers for text and image;
  • Advantages: no fine-tuning of the original diffusion weights, a small number of trainable parameters, compatibility with models fine-tuned from the same base model, simultaneous support for image and text prompts, and combinability with ControlNet;

methods

[Figure: overview of the IP-Adapter architecture]

  • IP-Adapter consists of two parts: an image encoder and decoupled cross-attention modules;
  • Image encoder: a CLIP image encoder whose parameters are frozen during training, plus a trainable projection layer that maps the image embedding to features of shape $N \times d$, with $N = 4$ and $d$ equal to the text feature dimension;
  • Decoupled cross-attention: the text cross-attention is computed as $Z' = \mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$, with $Q = ZW_q$, $K = c_tW_k$, $V = c_tW_v$, where $Z$ and $Z'$ are the input and output features and $c_t$ is the output of the text encoder. Previous methods concatenate the image features with the text features before cross-attention; this paper instead keeps text and image separate and adds a new cross-attention module for the image: $Z'' = \mathrm{Attention}(Q, K', V') = \mathrm{Softmax}\left(\frac{Q(K')^{\top}}{\sqrt{d}}\right)V'$, with $Q = ZW_q$, $K' = c_iW'_k$, $V' = c_iW'_v$. The query projection $W_q$ is shared, so the new module adds only $W'_k$ and $W'_v$; these two matrices and the projection network are the only trainable parameters, and they are initialized from $W_k$ and $W_v$ to speed up training. The final output is the sum of the text and image cross-attention results: $Z^{\mathrm{new}} = Z' + Z''$;
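The projection layer and decoupled cross-attention above can be sketched in PyTorch as follows. Class names and the default dimensions (a 1024-dim CLIP image embedding, 768-dim text/image context, 320-dim UNet hidden features, single-head attention) are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class ImageProjection(nn.Module):
    """Maps a global CLIP image embedding to N = 4 context tokens of width d."""
    def __init__(self, clip_dim=1024, d=768, num_tokens=4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(clip_dim, d * num_tokens)
        self.norm = nn.LayerNorm(d)

    def forward(self, image_embeds):                      # (B, clip_dim)
        x = self.proj(image_embeds)                       # (B, N * d)
        x = x.reshape(x.shape[0], self.num_tokens, -1)    # (B, N, d)
        return self.norm(x)

class DecoupledCrossAttention(nn.Module):
    """Shared query; separate K/V projections for text (frozen) and image (trainable)."""
    def __init__(self, dim=320, ctx_dim=768):
        super().__init__()
        self.scale = dim ** -0.5
        self.to_q = nn.Linear(ctx_dim if False else dim, dim, bias=False)  # W_q, shared
        self.to_k = nn.Linear(ctx_dim, dim, bias=False)       # W_k  (frozen)
        self.to_v = nn.Linear(ctx_dim, dim, bias=False)       # W_v  (frozen)
        self.to_k_ip = nn.Linear(ctx_dim, dim, bias=False)    # W'_k (trainable)
        self.to_v_ip = nn.Linear(ctx_dim, dim, bias=False)    # W'_v (trainable)
        # initialize the new image branch from the text weights to speed up training
        self.to_k_ip.load_state_dict(self.to_k.state_dict())
        self.to_v_ip.load_state_dict(self.to_v.state_dict())

    def attn(self, q, k, v):
        w = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return w @ v

    def forward(self, z, c_t, c_i):
        q = self.to_q(z)                                      # Q = Z W_q (shared)
        z_text = self.attn(q, self.to_k(c_t), self.to_v(c_t))     # Z'
        z_image = self.attn(q, self.to_k_ip(c_i), self.to_v_ip(c_i))  # Z''
        return z_text + z_image                               # Z_new = Z' + Z''
```

Because $W_q$ is shared with the frozen text branch, each attention layer contributes only the two new K/V matrices to the trainable parameter count, which together with the small projection network is what keeps the adapter lightweight.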