GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

尔呦
Published 2024-10-18 22:33:15
Article link: https://blog.youkuaiyun.com/weixin_44994838/article/details/143062409
Paper: https://arxiv.org/pdf/2112.10741
 
Problem statement
 
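The paper studies how to guide a diffusion model with text. Of the two strategies compared, CLIP guidance and classifier-free guidance, the latter gives better results. The model can also be fine-tuned to perform image inpainting;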
 
Methods
 
classifier guidance: the diffusion model predicts a mean $\mu_\theta(x_t|y)$ and a variance $\Sigma_\theta(x_t|y)$, and a separate classifier provides the log probability $\log p_\phi(y|x_t)$. After guidance the mean becomes $\hat{\mu}_\theta(x_t|y) = \mu_\theta(x_t|y) + s\cdot\Sigma_\theta(x_t|y)\nabla_{x_t}\log p_\phi(y|x_t)$, where $s$ is the guidance scale; increasing $s$ gives sample quality $\uparrow$, diversity $\downarrow$;
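A minimal sketch of the guided-mean update, under loud assumptions: the classifier is a toy linear-softmax model (so the gradient of $\log p_\phi(y|x_t)$ has a closed form), $\Sigma_\theta$ is diagonal, and the names `classifier_log_prob_grad` and `guided_mean` are made up for illustration. A real setup would use a noise-aware neural classifier and automatic differentiation.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def classifier_log_prob_grad(W, x, y):
    """Gradient of log p(y | x) w.r.t. x for a toy classifier with logits = W x.

    For this model: d/dx log p(y|x) = W[y] - sum_k p(k|x) * W[k].
    """
    logits = [sum(w * v for w, v in zip(row, x)) for row in W]
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    return [W[y][j] - sum(p * row[j] for p, row in zip(probs, W))
            for j in range(len(x))]

def guided_mean(mu, sigma_diag, grad, s):
    # mu_hat = mu + s * Sigma * grad, with a diagonal Sigma stored as a list.
    return [m + s * v * g for m, v, g in zip(mu, sigma_diag, grad)]

# Toy two-class example (all values are illustrative assumptions).
W = [[1.0, 0.0], [0.0, 1.0]]
x_t = [0.0, 0.0]
grad = classifier_log_prob_grad(W, x_t, y=0)
mu_hat = guided_mean(mu=[0.0, 0.0], sigma_diag=[0.5, 0.5], grad=grad, s=2.0)
```

The mean is pushed along the direction that raises the classifier's confidence in class `y`; a larger `s` pushes further.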
classifier-free guidance: no separate classifier model needs to be trained. During training the condition label is replaced with an empty label with some probability; during sampling the model output is an extrapolation between the conditional and unconditional predictions, $\hat\epsilon_\theta(x_t|y)=\epsilon_\theta(x_t|\emptyset) + s\cdot(\epsilon_\theta(x_t|y)-\epsilon_\theta(x_t|\emptyset))$, where $s \geq 1$ is the guidance scale;
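The sampling-time combination can be sketched in a few lines. Here the two noise predictions are plain lists standing in for the model outputs $\epsilon_\theta(x_t|y)$ and $\epsilon_\theta(x_t|\emptyset)$; the function name `cfg_eps` is an assumption for illustration.

```python
def cfg_eps(eps_cond, eps_uncond, s):
    """Classifier-free guidance: eps_uncond + s * (eps_cond - eps_uncond)."""
    return [u + s * (c - u) for c, u in zip(eps_cond, eps_uncond)]

# Stand-ins for the conditional and unconditional model outputs.
eps_cond = [1.0, 2.0]
eps_uncond = [0.0, 1.0]

plain = cfg_eps(eps_cond, eps_uncond, s=1.0)   # s = 1 recovers eps_cond
guided = cfg_eps(eps_cond, eps_uncond, s=3.0)  # s > 1 moves further away from eps_uncond
```

With `s = 1` the result is just the conditional prediction; larger values extrapolate past it along the direction from unconditional to conditional.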
CLIP guidance: CLIP consists of an image encoder $f(x)$ and a text encoder $g(c)$. It is trained with a contrastive cross-entropy objective, so that the dot product $f(x)\cdot g(c)$ is large for matched image-text pairs and small for unmatched ones. Existing methods already use CLIP to steer a GAN toward a caption; applied to a diffusion model, the classifier is replaced by CLIP, giving $\hat{\mu}_\theta(x_t|c) = \mu_\theta(x_t|c) + s\cdot\Sigma_\theta(x_t|c)\nabla_{x_t}(f(x_t)\cdot g(c))$. For this to work, CLIP has to be trained on noised image-text pairs;
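A rough sketch of the CLIP-guided mean, assuming a toy linear image encoder $f(x) = Ax$ so that $\nabla_x (f(x)\cdot g(c)) = A^\top g(c)$ in closed form. The matrix `A`, the vector `g_c`, and both function names are hypothetical; a real CLIP encoder is a deep network and the gradient would come from automatic differentiation.

```python
def clip_score_grad(A, g):
    """Gradient of (A x) . g with respect to x, i.e. A^T g.

    For a linear encoder the gradient does not depend on x.
    """
    return [sum(A[i][j] * g[i] for i in range(len(A)))
            for j in range(len(A[0]))]

def clip_guided_mean(mu, sigma_diag, A, g, s):
    # mu_hat = mu + s * Sigma * grad_x (f(x) . g(c)), with diagonal Sigma.
    grad = clip_score_grad(A, g)
    return [m + s * v * d for m, v, d in zip(mu, sigma_diag, grad)]

# Illustrative values only.
A = [[1.0, 2.0], [0.0, 1.0]]   # toy image encoder
g_c = [1.0, -1.0]              # toy text embedding g(c)
grad = clip_score_grad(A, g_c)
mu_hat = clip_guided_mean([0.0, 0.0], [0.25, 0.25], A, g_c, s=2.0)
```

The structure matches classifier guidance, with the classifier log probability swapped for the image-text similarity score.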
The paper trains a 3.5B-parameter $64\times64$ text-conditional diffusion model and a 1.5B-parameter 4x upsampling diffusion model, and also trains a noised $64\times64$ ViT-L CLIP model;
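Text-Conditional Diffusion Models: extends ADM with text conditioning information. The base model is trained for 2.5M iterations at batch size 2048, and the upsampling model for 1.6M iterations at batch size 512. The model is then fine-tuned with the text replaced by an empty string with probability 0.2;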
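The caption-dropping step used in fine-tuning might look like the following. This is a sketch under assumptions: the helper name `maybe_drop_caption` is made up, and a real pipeline would apply it per example inside the data loader before tokenization.

```python
import random

def maybe_drop_caption(caption, rng, p_drop=0.2):
    """Return an empty caption with probability p_drop, else the original."""
    return "" if rng.random() < p_drop else caption

# Seeded generator so the sketch is reproducible.
rng = random.Random(0)
captions = [maybe_drop_caption("a corgi", rng) for _ in range(10000)]
drop_rate = captions.count("") / len(captions)  # close to 0.2
```

Training on a mix of captioned and empty inputs is what lets a single network supply both the conditional and the unconditional prediction needed for classifier-free guidance.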
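Earlier work already performs inpainting without training a dedicated inpainting model; the basic idea is similar to a masked version of SDEdit. This paper instead fine-tunes the model specifically for inpainting, adding four extra input channels (three RGB channels and one mask channel); the newly added parameters are zero-initialized.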
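Why zero initialization matters can be shown with a toy example. The sketch below assumes a pointwise (1x1) input layer stored as an `[out][in]` weight matrix; the function names and numbers are illustrative, not taken from the paper's code.

```python
def expand_input_weights(weight, extra_in):
    """Append `extra_in` zero-initialized input columns to each output row."""
    return [row + [0.0] * extra_in for row in weight]

def apply_pointwise(weight, pixel):
    # One output value per row: dot product of the row with the pixel's channels.
    return [sum(w * v for w, v in zip(row, pixel)) for row in weight]

# Pretrained weights that see only the 3 noisy RGB channels.
w_rgb = [[0.5, -1.0, 2.0], [1.0, 0.0, 0.25]]
# Inpainting variant: 3 extra RGB channels + 1 mask channel, all zero-initialized.
w_inpaint = expand_input_weights(w_rgb, extra_in=4)

noisy_rgb = [0.2, 0.4, 0.6]
context_rgb_and_mask = [0.9, 0.8, 0.7, 1.0]

out_before = apply_pointwise(w_rgb, noisy_rgb)
out_after = apply_pointwise(w_inpaint, noisy_rgb + context_rgb_and_mask)
```

Because the new columns start at zero, the expanded layer initially ignores the extra channels and reproduces the pretrained output, so fine-tuning starts from the original model's behavior.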