A Deep Dive into HuggingFace Diffusion Models: From Theory to PyTorch Implementation
Introduction: An Overview of Diffusion Models
Diffusion models are among the most prominent techniques in generative AI today: they generate high-quality data samples from pure noise through a step-by-step denoising process. These models have achieved breakthrough results in image, audio, and video generation; well-known systems such as OpenAI's DALL-E 2 and Google's Imagen are built on diffusion techniques.
This article uses HuggingFace's annotated_diffusion example as its basis to dissect the core principles of diffusion models and implement a complete diffusion model step by step. We focus on Denoising Diffusion Probabilistic Models (DDPM), the classic diffusion architecture.
Fundamentals of Diffusion Models
The core idea of diffusion models is to generate data through two complementary processes:
- Forward diffusion process: a fixed process that gradually adds Gaussian noise to the data until it is turned entirely into noise
- Reverse denoising process: a learned process in which a neural network learns how to gradually recover the original data from noise
Mathematical Description of the Forward Process
The forward diffusion process is defined as a Markov chain in which every step adds a small amount of Gaussian noise to the data. Given an original data point x₀, the forward process generates progressively noisier data x₁, ..., x_T over T timesteps:
q(xₜ|xₜ₋₁) = N(xₜ; √(1-βₜ)xₜ₋₁, βₜI)
where βₜ is a predefined noise-schedule parameter controlling how much noise is added at each timestep.
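A useful property of this Markov chain is that xₜ can be sampled directly from x₀ in closed form. Defining αₜ = 1 - βₜ and ᾱₜ = α₁α₂···αₜ, it follows that:
q(xₜ|x₀) = N(xₜ; √ᾱₜ·x₀, (1-ᾱₜ)I)
Equivalently, by the reparameterization trick, xₜ = √ᾱₜ·x₀ + √(1-ᾱₜ)·ε with ε ~ N(0, I). The forward_diffusion function implemented later relies on exactly this identity, which lets us train on any timestep without simulating the whole chain.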
Reverse Denoising Process
The goal of the reverse process is to start from the noise x_T and gradually recover the original data x₀. Since directly modeling the true reverse conditional p(xₜ₋₁|xₜ) is intractable, we approximate it with a neural network:
pθ(xₜ₋₁|xₜ) = N(xₜ₋₁; μθ(xₜ,t), Σθ(xₜ,t))
In DDPM, the authors fix the variance Σθ to untrained time-dependent constants and let the neural network learn only the mean μθ.
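Concretely, rather than predicting μθ directly, DDPM trains a network εθ(xₜ, t) to predict the noise ε that was added to x₀, from which the mean follows as:
μθ(xₜ, t) = (1/√αₜ)·(xₜ - βₜ/√(1-ᾱₜ)·εθ(xₜ, t))
This reduces the training objective to a simple mean-squared error between the true and predicted noise, ‖ε - εθ(xₜ, t)‖², which is exactly the loss used in the training loop below.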
Setting Up the Implementation
Before we begin, we install and import the required Python libraries:
!pip install -q -U einops datasets matplotlib tqdm
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import einsum
from einops import rearrange
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
Implementing the Core Components
1. Timestep Embeddings
Since the network must handle different noise levels at different timesteps, we encode the timestep in a format the network can use. Following the Transformer positional-encoding scheme, we use sinusoidal embeddings:
class SinusoidalPositionEmbeddings(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, time):
        # time: (batch,) tensor of integer timesteps
        device = time.device
        half_dim = self.dim // 2
        embeddings = math.log(10000) / (half_dim - 1)
        embeddings = torch.exp(torch.arange(half_dim, device=device) * -embeddings)
        embeddings = time[:, None] * embeddings[None, :]
        # concatenate sine and cosine components -> (batch, dim)
        embeddings = torch.cat((embeddings.sin(), embeddings.cos()), dim=-1)
        return embeddings
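As a quick sanity check (a minimal sketch; the batch size and dimension here are arbitrary), the embedding maps a batch of integer timesteps to fixed-size vectors:
emb = SinusoidalPositionEmbeddings(dim=32)
t = torch.tensor([0, 10, 500])  # three example timesteps
print(emb(t).shape)             # torch.Size([3, 32])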
2. Basic Convolutional Block
The basic building block of our U-Net chains a convolution, group normalization, and a SiLU activation, and can optionally be conditioned on the time embedding through a learned per-channel scale and shift. (The skip connections that help gradients flow are added in the U-Net itself, where feature maps from the downsampling path are concatenated into the upsampling path.)
class Block(nn.Module):
    def __init__(self, dim_in, dim_out, time_emb_dim=None, groups=8):
        super().__init__()
        # optional MLP that turns the time embedding into per-channel scale and shift
        self.mlp = nn.Linear(time_emb_dim, dim_out * 2) if time_emb_dim is not None else None
        self.proj = nn.Conv2d(dim_in, dim_out, 3, padding=1)
        self.norm = nn.GroupNorm(groups, dim_out)
        self.act = nn.SiLU()

    def forward(self, x, time_emb=None):
        x = self.proj(x)
        x = self.norm(x)
        if self.mlp is not None and time_emb is not None:
            # condition on the timestep: (batch, 2*dim_out) -> two (batch, dim_out, 1, 1) tensors
            scale, shift = self.mlp(time_emb)[:, :, None, None].chunk(2, dim=1)
            x = x * (scale + 1) + shift
        x = self.act(x)
        return x
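A brief shape check (a minimal sketch with arbitrary sizes) illustrates both the unconditioned and the time-conditioned paths:
block = Block(dim_in=16, dim_out=32, time_emb_dim=128)
x = torch.randn(4, 16, 28, 28)
t_emb = torch.randn(4, 128)       # stand-in for the time MLP output
print(block(x).shape)             # torch.Size([4, 32, 28, 28])
print(block(x, t_emb).shape)      # torch.Size([4, 32, 28, 28])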
3. The U-Net Architecture
The complete U-Net model consists of a downsampling path and an upsampling path, joined by skip connections at matching resolutions:
class Unet(nn.Module):
    def __init__(self, dim=32, init_dim=None, dim_mults=(1, 2, 4, 8), channels=3):
        super().__init__()
        init_dim = init_dim or dim
        dims = [init_dim, *[dim * m for m in dim_mults]]
        in_out = list(zip(dims[:-1], dims[1:]))
        num_resolutions = len(in_out)

        # initial convolution: map the input image to init_dim feature channels
        self.init_conv = nn.Conv2d(channels, init_dim, 7, padding=3)

        # time embedding
        time_dim = dim * 4
        self.time_mlp = nn.Sequential(
            SinusoidalPositionEmbeddings(dim),
            nn.Linear(dim, time_dim),
            nn.GELU(),
            nn.Linear(time_dim, time_dim),
        )

        # downsampling path (the last stage keeps the resolution unchanged)
        self.downs = nn.ModuleList([])
        for ind, (dim_in, dim_out) in enumerate(in_out):
            is_last = ind >= num_resolutions - 1
            self.downs.append(nn.ModuleList([
                Block(dim_in, dim_in),
                Block(dim_in, dim_in, time_emb_dim=time_dim),
                nn.Conv2d(dim_in, dim_out, 3, 2, 1) if not is_last
                else nn.Conv2d(dim_in, dim_out, 3, padding=1),
            ]))

        # bottleneck
        mid_dim = dims[-1]
        self.mid_block1 = Block(mid_dim, mid_dim)
        self.mid_block2 = Block(mid_dim, mid_dim)

        # upsampling path (mirrors the downsampling path)
        self.ups = nn.ModuleList([])
        for ind, (dim_in, dim_out) in enumerate(reversed(in_out)):
            is_last = ind >= num_resolutions - 1
            self.ups.append(nn.ModuleList([
                Block(dim_out + dim_in, dim_out),
                Block(dim_out + dim_in, dim_out, time_emb_dim=time_dim),
                nn.ConvTranspose2d(dim_out, dim_in, 4, 2, 1) if not is_last
                else nn.Conv2d(dim_out, dim_in, 3, padding=1),
            ]))

        self.final_conv = nn.Conv2d(init_dim, channels, 1)

    def forward(self, x, time):
        x = self.init_conv(x)
        t = self.time_mlp(time)
        h = []  # stack of feature maps for the skip connections
        for block1, block2, downsample in self.downs:
            x = block1(x)
            h.append(x)
            x = block2(x, t)
            h.append(x)
            x = downsample(x)
        x = self.mid_block1(x)
        x = self.mid_block2(x)
        for block1, block2, upsample in self.ups:
            x = torch.cat((x, h.pop()), dim=1)
            x = block1(x)
            x = torch.cat((x, h.pop()), dim=1)
            x = block2(x, t)
            x = upsample(x)
        return self.final_conv(x)
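A quick smoke test (a minimal sketch; the image side length just needs to be divisible by 8, the network's total downsampling factor) confirms that the model maps a batch of noisy images and a batch of timesteps back to an image-shaped tensor:
model = Unet(dim=32)
x = torch.randn(4, 3, 32, 32)            # a batch of "noisy images"
t = torch.randint(0, 1000, (4,)).long()  # one random timestep per sample
print(model(x, t).shape)                 # torch.Size([4, 3, 32, 32])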
Implementing the Diffusion Process
1. The Forward Diffusion Process
Using the closed-form property derived earlier, we can compute xₜ at any timestep directly from x₀:
def linear_beta_schedule(timesteps):
    scale = 1000 / timesteps
    beta_start = scale * 0.0001
    beta_end = scale * 0.02
    return torch.linspace(beta_start, beta_end, timesteps)

def forward_diffusion(x0, t, sqrt_alphas_cumprod, sqrt_one_minus_alphas_cumprod):
    noise = torch.randn_like(x0)
    # gather the per-sample coefficients and reshape for broadcasting over (B, C, H, W)
    sqrt_alpha = sqrt_alphas_cumprod.to(x0.device)[t][:, None, None, None]
    sqrt_one_minus_alpha = sqrt_one_minus_alphas_cumprod.to(x0.device)[t][:, None, None, None]
    # reparameterization trick: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
    x_t = sqrt_alpha * x0 + sqrt_one_minus_alpha * noise
    return x_t, noise
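The sketch below (assumptions: 1000 timesteps and a random stand-in for a normalized image) checks the schedule end to end; at small t the output stays close to x₀, while at t close to T its statistics approach a standard Gaussian:
timesteps = 1000
betas = linear_beta_schedule(timesteps)
alphas_cumprod = torch.cumprod(1. - betas, axis=0)
sqrt_ac = torch.sqrt(alphas_cumprod)
sqrt_omac = torch.sqrt(1. - alphas_cumprod)

x0 = torch.randn(1, 3, 32, 32)  # stand-in for an image normalized to [-1, 1]
for step in [0, 250, 999]:
    t = torch.tensor([step])
    x_t, _ = forward_diffusion(x0, t, sqrt_ac, sqrt_omac)
    print(step, x_t.mean().item(), x_t.std().item())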
2. The Training Loop
The training procedure for the diffusion model:
def train(model, dataloader, optimizer, timesteps, device):
    # define the noise schedule
    betas = linear_beta_schedule(timesteps)
    alphas = 1. - betas
    alphas_cumprod = torch.cumprod(alphas, axis=0)
    sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)
    sqrt_one_minus_alphas_cumprod = torch.sqrt(1. - alphas_cumprod)

    model.train()
    for batch in dataloader:
        optimizer.zero_grad()
        batch = batch.to(device)
        # sample a random timestep for every image in the batch
        t = torch.randint(0, timesteps, (batch.size(0),), device=device).long()
        # forward diffusion: corrupt the batch with the corresponding amount of noise
        x_noisy, noise = forward_diffusion(
            batch, t, sqrt_alphas_cumprod, sqrt_one_minus_alphas_cumprod)
        # predict the noise that was added
        noise_pred = model(x_noisy, t)
        # simple DDPM loss: MSE between true and predicted noise
        loss = F.mse_loss(noise_pred, noise)
        loss.backward()
        optimizer.step()
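To put the loop to work, here is a minimal training setup sketch. The dataset choice and all hyperparameters are illustrative assumptions: it uses Fashion-MNIST via the datasets library installed earlier, pads the 28×28 images to 32×32 so the side length is divisible by the network's downsampling factor of 8, and normalizes them to [-1, 1] as DDPM expects:
import numpy as np
from datasets import load_dataset
from torch.utils.data import DataLoader

ds = load_dataset("fashion_mnist", split="train")

def collate(batch):
    # stack PIL images into (B, 1, 28, 28), pad to 32x32, scale to [-1, 1]
    imgs = torch.stack([torch.from_numpy(np.array(b["image"])) for b in batch])
    imgs = imgs.unsqueeze(1).float() / 255.
    imgs = F.pad(imgs, (2, 2, 2, 2))
    return imgs * 2 - 1

dataloader = DataLoader(ds, batch_size=128, shuffle=True, collate_fn=collate)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = Unet(dim=32, channels=1).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(5):  # a few epochs as a demonstration
    train(model, dataloader, optimizer, timesteps=1000, device=device)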
Implementing the Sampling Process
Once training is complete, we can use the model to generate new samples from pure noise, walking the reverse chain from t = T-1 down to t = 0:
@torch.no_grad()
def sample(model, image_size, batch_size=16, channels=3, timesteps=1000):
    device = next(model.parameters()).device
    # prepare the noise-schedule parameters
    betas = linear_beta_schedule(timesteps).to(device)
    alphas = 1. - betas
    alphas_cumprod = torch.cumprod(alphas, axis=0)
    alphas_cumprod_prev = F.pad(alphas_cumprod[:-1], (1, 0), value=1.0)
    sqrt_recip_alphas = torch.sqrt(1.0 / alphas)
    sqrt_one_minus_alphas_cumprod = torch.sqrt(1. - alphas_cumprod)

    # start from pure noise
    img = torch.randn((batch_size, channels, image_size, image_size), device=device)
    for t in reversed(range(timesteps)):
        # the same timestep for every image in the batch
        t_tensor = torch.full((batch_size,), t, dtype=torch.long, device=device)
        # predict the noise in the current image
        pred_noise = model(img, t_tensor)
        alpha_t_cumprod = alphas_cumprod[t]
        alpha_t_cumprod_prev = alphas_cumprod_prev[t]
        beta_t = betas[t]
        # mean of p(x_{t-1} | x_t), computed from the predicted noise
        model_mean = sqrt_recip_alphas[t] * (
            img - beta_t / sqrt_one_minus_alphas_cumprod[t] * pred_noise
        )
        if t == 0:
            img = model_mean
        else:
            # variance of the true posterior q(x_{t-1} | x_t, x_0)
            posterior_variance = beta_t * (1. - alpha_t_cumprod_prev) / (1. - alpha_t_cumprod)
            noise = torch.randn_like(img)
            img = model_mean + torch.sqrt(posterior_variance) * noise
    return img
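Finally, a usage sketch (assuming the single-channel model trained above and images normalized to [-1, 1]) that generates a batch of samples and displays them in a grid:
model.eval()
samples = sample(model, image_size=32, batch_size=16, channels=1)
# map from [-1, 1] back to [0, 1] for display
samples = (samples.clamp(-1, 1) + 1) / 2

fig, axes = plt.subplots(4, 4, figsize=(6, 6))
for ax, img in zip(axes.flatten(), samples.cpu()):
    ax.imshow(img.squeeze(0), cmap="gray")
    ax.axis("off")
plt.show()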
Summary and Outlook
This article has dissected the principles behind diffusion models and implemented a complete DDPM. By generating through gradual denoising, diffusion models achieve state-of-the-art results on many generative tasks. Future directions include:
- Improved noise-schedule strategies
- Faster sampling (reducing the number of required timesteps)
- Combining the strengths of other generative models (such as GANs)
- Scaling to higher-resolution generation tasks
Diffusion models have opened up new possibilities for generative AI, and understanding their core principles is essential for anyone doing research or building applications in this area.