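The theory behind U-net was covered in an earlier post; this one mainly records the code implementation.
Earlier U-net post: https://blog.youkuaiyun.com/qq_19841133/article/details/126927383
This is an attention-based U-net (it uses self-attention).
The code comes from the IDDPM project: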
https://github.com/openai/improved-diffusion/blob/main/improved_diffusion/unet.py
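The neural network in IDDPM is an attention-based U-net.
The U-net itself is familiar: besides the encoder and decoder (the input and output blocks) there is also a middle block, and the building blocks are ResBlock and MHSA (multi-head self-attention) blocks.
input block: Res (takes the input x and emb, where the timestep and the condition are each represented as an embedding), MHSA (pixel-to-pixel attention), Downsample
mid block: Res, MHSA, Res
output block: Res (its input is concatenated with the output of the corresponding input-block layer), MHSA, Upsample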
U-net
The first module, time_embed, transforms the embedding of the incoming timestep from model_channels to time_embed_dim dimensions:
time_embed_dim = model_channels * 4
self.time_embed = nn.Sequential(
    linear(model_channels, time_embed_dim),
    SiLU(),
    linear(time_embed_dim, time_embed_dim),
)
For conditional generation there is also a label_emb, which embeds the class label so that the condition is fed in together with x:
if self.num_classes is not None:
    self.label_emb = nn.Embedding(num_classes, time_embed_dim)
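input_blocks is the left half of the U-net. All modules go into a ModuleList, which is normally constructed from a list whose elements are Modules. The first Module is a TimestepEmbedSequential; at this point the list holds only that one Module, and more are appended later. Inside it is a conv_nd: dims=1 gives a 1D convolution, dims=2 a 2D one, and here the kernel size is 3 with padding=1. (TimestepEmbedSequential is shown further down. It is just a wrapper around nn.Sequential whose job is to decide whether emb is passed to each layer: conv_nd does not take it, only ResBlock does.)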
self.input_blocks = nn.ModuleList(
    [
        TimestepEmbedSequential(
            conv_nd(dims, in_channels, model_channels, 3, padding=1)
        )
    ]
)
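Next the left half of the U-net is built. The outer loop runs over channel_mult (the channel multipliers). The multipliers define how many levels the U-net has, and they usually grow level by level: the output channel count at a level is mult * model_channels, so the channel count increases with depth.
Each level contains num_res_blocks ResBlocks, which the inner loop iterates over.
ResBlock inherits from TimestepBlock, so it takes an emb as an extra input; the time_embed_dim passed to its constructor is the dimension of that embedding.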
for level, mult in enumerate(channel_mult):
    for _ in range(num_res_blocks):
        layers = [
            ResBlock(
                ch,
                time_embed_dim,
                dropout,
                out_channels=mult * model_channels,
                dims=dims,
                use_checkpoint=use_checkpoint,
                use_scale_shift_norm=use_scale_shift_norm,
            )
        ]
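We are still inside the num_res_blocks loop. The next step creates the attention block. ds is the current downsampling rate; if ds is in the attention_resolutions list, an AttentionBlock is appended, so where attention gets inserted depends entirely on attention_resolutions.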
        ch = mult * model_channels
        if ds in attention_resolutions:
            layers.append(
                AttentionBlock(
                    ch, use_checkpoint=use_checkpoint, num_heads=num_heads
                )
            )
        self.input_blocks.append(TimestepEmbedSequential(*layers))
        input_block_chans.append(ch)
Once the num_res_blocks loop for a level has finished, a downsampling layer follows, except at the last level:
    if level != len(channel_mult) - 1:
        self.input_blocks.append(
            TimestepEmbedSequential(Downsample(ch, conv_resample, dims=dims))
        )
        input_block_chans.append(ch)
        ds *= 2
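To make the bookkeeping of the input side concrete, here is a small sketch that replays only the channel and resolution arithmetic, without building any layers. The function name `trace_input_blocks` and the tiny configuration are hypothetical, and the starting values (`ch = model_channels`, `ds = 1`, and a skip list that already holds the first convolution's channels) are assumptions about the part of `__init__` that is not quoted in this post.

```python
def trace_input_blocks(model_channels, channel_mult, num_res_blocks, attention_resolutions):
    """Replay the channel / downsample bookkeeping of the input side (no layers are built)."""
    ch = model_channels                    # assumed starting channel count
    ds = 1                                 # assumed starting downsample rate
    input_block_chans = [model_channels]   # assumed entry for the first conv
    blocks = ["conv"]
    for level, mult in enumerate(channel_mult):
        for _ in range(num_res_blocks):
            ch = mult * model_channels     # ResBlock output channels
            name = f"res->{ch}"
            if ds in attention_resolutions:
                name += "+attn"            # AttentionBlock appended at this resolution
            blocks.append(name)
            input_block_chans.append(ch)
        if level != len(channel_mult) - 1:
            blocks.append("down")          # Downsample keeps ch, halves the resolution
            input_block_chans.append(ch)
            ds *= 2
    return blocks, input_block_chans, ch, ds


blocks, chans, ch, ds = trace_input_blocks(64, (1, 2), 2, {2})
print(blocks)  # one entry per element of input_blocks
print(chans)   # channels saved for the skip connections
```

With two levels and two ResBlocks per level this gives six input blocks, and attention appears only at the second level, where ds equals 2.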
The middle of the U-net consists of two ResBlocks with one AttentionBlock between them.
Neither the channel count nor the spatial size changes here.
self.middle_block = TimestepEmbedSequential(
    ResBlock(
        ch,
        time_embed_dim,
        dropout,
        dims=dims,
        use_checkpoint=use_checkpoint,
        use_scale_shift_norm=use_scale_shift_norm,
    ),
    AttentionBlock(ch, use_checkpoint=use_checkpoint, num_heads=num_heads),
    ResBlock(
        ch,
        time_embed_dim,
        dropout,
        dims=dims,
        use_checkpoint=use_checkpoint,
        use_scale_shift_norm=use_scale_shift_norm,
    ),
)
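The right half of the U-net mirrors the left half and is built from ResBlock, AttentionBlock, and Upsample.
Note that the ResBlock's input channel count is ch + input_block_chans.pop(): the two halves of the U-net are joined by skip connections whose features are concatenated along the channel dimension, so the channel count is the sum of the two. The inner loop runs num_res_blocks + 1 times per level so that every saved input-side output gets consumed.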
self.output_blocks = nn.ModuleList([])
for level, mult in list(enumerate(channel_mult))[::-1]:
    for i in range(num_res_blocks + 1):
        layers = [
            ResBlock(
                ch + input_block_chans.pop(),
                time_embed_dim,
                dropout,
                out_channels=model_channels * mult,
                dims=dims,
                use_checkpoint=use_checkpoint,
                use_scale_shift_norm=use_scale_shift_norm,
            )
        ]
        ch = model_channels * mult
        if ds in attention_resolutions:
            layers.append(
                AttentionBlock(
                    ch,
                    use_checkpoint=use_checkpoint,
                    num_heads=num_heads_upsample,
                )
            )
        if level and i == num_res_blocks:
            layers.append(Upsample(ch, conv_resample, dims=dims))
            ds //= 2
        self.output_blocks.append(TimestepEmbedSequential(*layers))
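The pop-and-concatenate arithmetic of the output side is easy to get wrong, so here is a rough sketch that replays it with plain integers. `trace_output_blocks` is a hypothetical helper; it first rebuilds the skip-channel stack the input side would have produced, again assuming that the stack starts with one entry for the first convolution.

```python
def trace_output_blocks(model_channels, channel_mult, num_res_blocks):
    """Replay how each output-side ResBlock gets its input channel count."""
    # Rebuild the stack of saved channel counts from the input side (assumed initial entry).
    skip_chans = [model_channels]
    ch = model_channels
    for level, mult in enumerate(channel_mult):
        for _ in range(num_res_blocks):
            ch = mult * model_channels
            skip_chans.append(ch)
        if level != len(channel_mult) - 1:
            skip_chans.append(ch)          # entry pushed by the Downsample block
    # The middle block leaves ch unchanged, so the output side starts from it.
    res_in = []
    for level, mult in list(enumerate(channel_mult))[::-1]:
        for i in range(num_res_blocks + 1):
            res_in.append(ch + skip_chans.pop())   # concatenated channels
            ch = model_channels * mult             # ResBlock output channels
    return res_in, skip_chans, ch


res_in, leftover, final_ch = trace_output_blocks(64, (1, 2), 2)
print(res_in)     # input channels of each output-side ResBlock
print(leftover)   # the stack should be fully consumed
print(final_ch)   # channels entering the final output module
```

The stack is emptied exactly when the loops end, which is why the inner loop uses num_res_blocks + 1 iterations.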
Last comes the output module, one final transform that produces the output through a convolution:
self.out = nn.Sequential(
    normalization(ch),
    SiLU(),
    zero_module(conv_nd(dims, model_channels, out_channels, 3, padding=1)),
)
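The `zero_module` helper is not shown in this post. The sketch below assumes it does what its name suggests, namely set every parameter of the wrapped module to zero, so that the wrapped convolution outputs zeros at initialization. Treat the body as an assumption rather than the project's exact code.

```python
import torch as th
import torch.nn as nn


def zero_module(module):
    """Zero out all parameters of a module and return it (assumed behavior)."""
    for p in module.parameters():
        p.detach().zero_()   # in-place zeroing without recording a gradient
    return module


conv = zero_module(nn.Conv2d(8, 3, 3, padding=1))
out = conv(th.randn(2, 8, 16, 16))
print(out.shape, out.abs().sum().item())
```

Under that assumption, a zero-initialized last layer means the residual branches and the output head contribute nothing at the start of training and are learned gradually.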
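The forward function takes x, timesteps, and y, where y is the class label used for conditional generation.
timesteps goes through the timestep embedding to produce emb. A sinusoidal (sine/cosine) embedding is used here; all that matters is that different timesteps get distinguishable representations.
y likewise has to be turned into a condition embedding, which is added to emb.
Then the input blocks are traversed, each module taking the previous module's output as its input; the middle block follows, and finally the output blocks. The list hs exists because the outputs of the input blocks have to be saved for the output blocks to use.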
def forward(self, x, timesteps, y=None):
    """
    Apply the model to an input batch.
    :param x: an [N x C x ...] Tensor of inputs.
    :param timesteps: a 1-D batch of timesteps.
    :param y: an [N] Tensor of labels, if class-conditional.
    :return: an [N x C x ...] Tensor of outputs.
    """
    assert (y is not None) == (
        self.num_classes is not None
    ), "must specify y if and only if the model is class-conditional"
    hs = []
    emb = self.time_embed(timestep_embedding(timesteps, self.model_channels))
    if self.num_classes is not None:
        assert y.shape == (x.shape[0],)
        emb = emb + self.label_emb(y)
    h = x.type(self.inner_dtype)
    for module in self.input_blocks:
        h = module(h, emb)
        hs.append(h)
    h = self.middle_block(h, emb)
    for module in self.output_blocks:
        cat_in = th.cat([h, hs.pop()], dim=1)
        h = module(cat_in, emb)
    h = h.type(x.dtype)
    return self.out(h)
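The `timestep_embedding` function called in forward is not quoted in this post. As a rough illustration of a sinusoidal timestep embedding, here is a minimal sketch; the name `sinusoidal_embedding`, the `max_period=10000` default, and the cosine-then-sine layout are assumptions and may differ from the project's version.

```python
import math

import torch as th


def sinusoidal_embedding(timesteps, dim, max_period=10000):
    """Map a 1-D tensor of N timesteps to an [N x dim] sinusoidal embedding (sketch)."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to roughly 1 / max_period.
    freqs = th.exp(-math.log(max_period) * th.arange(half, dtype=th.float32) / half)
    args = timesteps[:, None].float() * freqs[None]
    emb = th.cat([th.cos(args), th.sin(args)], dim=-1)
    if dim % 2:
        # Pad with one zero column when dim is odd.
        emb = th.cat([emb, th.zeros_like(emb[:, :1])], dim=-1)
    return emb


t = th.tensor([0, 1, 500, 999])
emb = sinusoidal_embedding(t, 128)
print(emb.shape)
```

Each timestep gets its own row, and the result has model_channels columns, which is exactly the input size that time_embed expects.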
conv_nd
This is just a wrapper that picks nn.Conv1d, nn.Conv2d, or nn.Conv3d according to dims.
def conv_nd(dims, *args, **kwargs):
    """
    Create a 1D, 2D, or 3D convolution module.
    """
    if dims == 1:
        return nn.Conv1d(*args, **kwargs)
    elif dims == 2:
        return nn.Conv2d(*args, **kwargs)
    elif dims == 3:
        return nn.Conv3d(*args, **kwargs)
    raise ValueError(f"unsupported dimensions: {dims}")
TimestepEmbedSequential: the layer that passes emb in
TimestepEmbedSequential passes emb, in addition to x, to those children that are subclasses of TimestepBlock.
class TimestepEmbedSequential(nn.Sequential, TimestepBlock):
    """
    A sequential module that passes timestep embeddings to the children that
    support it as an extra input.
    """
    def forward(self, x, emb):
        for layer in self:
            if isinstance(layer, TimestepBlock):
                x = layer(x, emb)
            else:
                x = layer(x)
        return x
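So which classes subclass TimestepBlock? Apart from TimestepEmbedSequential itself, only ResBlock does. In other words, among the actual layers only ResBlock receives emb; the upsampling, downsampling, and attention layers do not.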
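A toy run shows the dispatch in action. In this sketch `TimestepBlock` is reduced to a marker base class and `AddEmb` is a made-up stand-in for ResBlock; both are assumptions for illustration only.

```python
import torch as th
import torch.nn as nn


class TimestepBlock(nn.Module):
    """Marker base class: forward takes (x, emb). Simplified stand-in."""
    def forward(self, x, emb):
        raise NotImplementedError


class TimestepEmbedSequential(nn.Sequential, TimestepBlock):
    def forward(self, x, emb):
        for layer in self:
            if isinstance(layer, TimestepBlock):
                x = layer(x, emb)   # layers that understand emb get it
            else:
                x = layer(x)        # ordinary layers only get x
        return x


class AddEmb(TimestepBlock):
    """Hypothetical toy block: adds emb to every spatial position."""
    def forward(self, x, emb):
        return x + emb[:, :, None, None]


seq = TimestepEmbedSequential(nn.Identity(), AddEmb(), nn.ReLU())
x = th.zeros(2, 3, 4, 4)
emb = th.tensor([[1.0, -1.0, 2.0], [0.5, 0.0, -3.0]])
y = seq(x, emb)
print(y[:, :, 0, 0])
```

nn.Identity and nn.ReLU are called with x alone, while AddEmb also receives emb, which is the same pattern as conv_nd versus ResBlock in the real model.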
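Downsample: the downsampling layer
The downsampling layer simply calls self.op, which is either a strided convolution or plain average pooling. For 2D images stride=2 (for 3D, stride=(1, 2, 2)); stride=2 halves the spatial size, so h, w become h/2, w/2.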
class Downsample(nn.Module):
    """
    A downsampling layer with an optional convolution.
    :param channels: channels in the inputs and outputs.
    :param use_conv: a bool determining if a convolution is applied.
    :param dims: determines if the signal is 1D, 2D, or 3D. If 3D, then
                 downsampling occurs in the inner-two dimensions.
    """
    def __init__(self, channels, use_conv, dims=2):
        super().__init__()
        self.channels = channels
        self.use_conv = use_conv
        self.dims = dims
        stride = 2 if dims != 3 else (1, 2, 2)
        if use_conv:
            self.op = conv_nd(dims, channels, channels, 3, stride=stride, padding=1)
        else:
            self.op = avg_pool_nd(stride)
    def forward(self, x):
        assert x.shape[1] == self.channels
        return self.op(x)
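The "halves the spatial size" claim can be checked with the usual output-size formulas for convolution and pooling. The helper names below are hypothetical, and the pooling window of 2 with stride 2 is an assumption about what `avg_pool_nd` ends up building.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Output length of a convolution along one axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def avg_pool_out_size(size, kernel=2, stride=2):
    """Output length of average pooling along one axis (no padding, floor mode)."""
    return (size - kernel) // stride + 1


for s in (64, 32, 7):
    print(s, conv_out_size(s), avg_pool_out_size(s))
```

For even sizes both variants give exactly half. For odd sizes they differ by one (the convolution rounds up, the pooling rounds down), which is one reason to keep the input resolution divisible by 2 for every downsampling level.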
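Upsample: the upsampling layer
Nearest-neighbor interpolation (F.interpolate) doubles the spatial size h, w; if use_conv is set, a 3×3 convolution that keeps the channel count is applied afterwards.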
class Upsample(nn.Module):
    """
    An upsampling layer with an optional convolution.
    :param channels: channels in the inputs and outputs.
    :param use_conv: a bool determining if a convolution is applied.
    :param dims: determines if the signal is 1D, 2D, or 3D. If 3D, then
                 upsampling occurs in the inner-two dimensions.
    """
    def __init__(self, channels, use_conv, dims=2):
        super().__init__()
        self.channels = channels
        self.use_conv = use_conv
        self.dims = dims
        if use_conv:
            self.conv = conv_nd(dims, channels, channels, 3, padding=1)
    def forward(self, x):
        assert x.shape[1] == self.channels
        if self.dims == 3:
            x = F.interpolate(
                x, (x.shape[2], x.shape[3] * 2, x.shape[4] * 2), mode="nearest"
            )
        else:
            x = F.interpolate(x, scale_factor=2, mode="nearest")
        if self.use_conv:
            x = self.conv(x)
        return x
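AttentionBlock: the attention layer
Look directly at _forward. First x is reshaped to 3 dimensions, [batch_size, channels, -1]; it is then normalized with norm and fed into the qkv layer, which produces q, k, and v stacked along the channel dimension.
qkv is then reshaped to [batch_size × num_heads, -1, qkv.shape[2]]: the middle dimension holds each head's q, k, and v channels (3 × channels / num_heads), and the last dimension, qkv.shape[2], is the sequence length (the number of spatial positions).
qkv is sent to the QKVAttention class to get h, the result of attention. h is reshaped, passed through the projection layer, and added back to x, so this is attention with a residual connection.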
class AttentionBlock(nn.Module):
    """
    An attention block that allows spatial positions to attend to each other.
    Originally ported from here, but adapted to the N-d case.
    https://github.com/hojonathanho/diffusion/blob/1e0dceb3b3495bbe19116a5e1b3596cd0706c543/diffusion_tf/models/unet.py#L66.
    """
    def __init__(self, channels, num_heads=1, use_checkpoint=False):
        super().__init__()
        self.channels = channels
        self.num_heads = num_heads
        self.use_checkpoint = use_checkpoint
        self.norm = normalization(channels)
        self.qkv = conv_nd(1, channels, channels * 3, 1)
        self.attention = QKVAttention()
        self.proj_out = zero_module(conv_nd(1, channels, channels, 1))
    def forward(self, x):
        return checkpoint(self._forward, (x,), self.parameters(), self.use_checkpoint)
    def _forward(self, x):
        b, c, *spatial = x.shape
        x = x.reshape(b, c, -1)
        qkv = self.qkv(self.norm(x))
        qkv = qkv.reshape(b * self.num_heads, -1, qkv.shape[2])
        h = self.attention(qkv)
        h = h.reshape(b, -1, h.shape[-1])
        h = self.proj_out(h)
        return (x + h).reshape(b, c, *spatial)
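The reshapes in _forward are easier to follow with concrete numbers. This sketch only traces shapes: random tensors stand in for the outputs of the qkv layer and of the attention itself, and the sizes (batch 2, 64 channels, 8×8 feature map, 4 heads) are arbitrary choices.

```python
import torch as th

b, c, height, width, num_heads = 2, 64, 8, 8, 4
x = th.randn(b, c, height, width)

x_flat = x.reshape(b, c, -1)                               # [2, 64, 64]: positions flattened
qkv = th.randn(b, 3 * c, x_flat.shape[-1])                 # stand-in for self.qkv(self.norm(x))
qkv_heads = qkv.reshape(b * num_heads, -1, qkv.shape[2])   # [8, 48, 64]: heads folded into batch
ch = qkv_heads.shape[1] // 3                               # 16 channels per head
q, k, v = th.split(qkv_heads, ch, dim=1)                   # each [8, 16, 64]
h = v                                                      # stand-in for the attention output
h = h.reshape(b, -1, h.shape[-1])                          # [2, 64, 64]: heads merged back

print(x_flat.shape, qkv_heads.shape, q.shape, h.shape)
```

The last dimension stays at 64 (the 8×8 positions) throughout, which confirms that it is the sequence length and the middle dimension is the feature dimension.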
QKVAttention
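This is standard scaled dot-product attention. q and k are each multiplied by 1 / sqrt(sqrt(ch)) before the product, which is equivalent to dividing q·k by sqrt(ch) afterwards.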
class QKVAttention(nn.Module):
    """
    A module which performs QKV attention.
    """
    def forward(self, qkv):
        """
        Apply QKV attention.
        :param qkv: an [N x (C * 3) x T] tensor of Qs, Ks, and Vs.
        :return: an [N x C x T] tensor after attention.
        """
        ch = qkv.shape[1] // 3
        q, k, v = th.split(qkv, ch, dim=1)
        scale = 1 / math.sqrt(math.sqrt(ch))
        weight = th.einsum(
            "bct,bcs->bts", q * scale, k * scale
        )  # More stable with f16 than dividing afterwards
        weight = th.softmax(weight.float(), dim=-1).type(weight.dtype)
        return th.einsum("bts,bcs->bct", weight, v)
    @staticmethod
    def count_flops(model, _x, y):
        """
        A counter for the `thop` package to count the operations in an
        attention operation.
        Meant to be used like:
            macs, params = thop.profile(
                model,
                inputs=(inputs, timestamps),
                custom_ops={QKVAttention: QKVAttention.count_flops},
            )
        """
        b, c, *spatial = y[0].shape
        num_spatial = int(np.prod(spatial))
        # We perform two matmuls with the same number of ops.
        # The first computes the weight matrix, the second computes
        # the combination of the value vectors.
        matmul_ops = 2 * b * (num_spatial ** 2) * c
        model.total_ops += th.DoubleTensor([matmul_ops])
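As a sanity check that the einsum version really is ordinary softmax(QKᵀ / sqrt(d)) V, the sketch below compares it with a plain matmul formulation on random data. `qkv_attention` copies the forward logic above into a free function, and `reference_attention` is a hypothetical textbook-style rewrite in [N, T, C] layout.

```python
import math

import torch as th


def qkv_attention(qkv):
    """Same computation as QKVAttention.forward, written as a function."""
    ch = qkv.shape[1] // 3
    q, k, v = th.split(qkv, ch, dim=1)
    scale = 1 / math.sqrt(math.sqrt(ch))
    weight = th.einsum("bct,bcs->bts", q * scale, k * scale)
    weight = th.softmax(weight.float(), dim=-1).type(weight.dtype)
    return th.einsum("bts,bcs->bct", weight, v)


def reference_attention(qkv):
    """Textbook form: softmax(Q K^T / sqrt(d)) V with tensors laid out as [N, T, C]."""
    ch = qkv.shape[1] // 3
    q, k, v = (t.transpose(1, 2) for t in th.split(qkv, ch, dim=1))
    scores = q @ k.transpose(1, 2) / math.sqrt(ch)   # [N, T, T]
    out = th.softmax(scores, dim=-1) @ v             # [N, T, C]
    return out.transpose(1, 2)                       # back to [N, C, T]


th.manual_seed(0)
qkv = th.randn(8, 48, 64)   # [N, 3 * C, T] with C = 16 and T = 64
a = qkv_attention(qkv)
b = reference_attention(qkv)
print(a.shape, th.allclose(a, b, atol=1e-5))
```

Splitting the scale between q and k keeps the intermediate products smaller, which is what the "more stable with f16" comment in the original code refers to.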
ResBlock
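It has several parts: in_layers, emb_layers, out_layers, and a skip_connection. If the input and output channel counts match, the skip connection is just an identity; if they differ, either a 3×3 convolution that keeps the spatial size or a 1×1 convolution changes the channel count.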
self.in_layers = nn.Sequential(
    normalization(channels),
    SiLU(),
    conv_nd(dims, channels, self.out_channels, 3, padding=1),
)
self.emb_layers = nn.Sequential(
    SiLU(),
    linear(
        emb_channels,
        2 * self.out_channels if use_scale_shift_norm else self.out_channels,
    ),
)
self.out_layers = nn.Sequential(
    normalization(self.out_channels),
    SiLU(),
    nn.Dropout(p=dropout),
    zero_module(
        conv_nd(dims, self.out_channels, self.out_channels, 3, padding=1)
    ),
)
if self.out_channels == channels:
    self.skip_connection = nn.Identity()
elif use_conv:
    self.skip_connection = conv_nd(
        dims, channels, self.out_channels, 3, padding=1
    )
else:
    self.skip_connection = conv_nd(dims, channels, self.out_channels, 1)
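The forward function takes x and emb. x goes through in_layers to give h, and emb goes through emb_layers to give emb_out. In the default path h + emb_out is fed into out_layers; with use_scale_shift_norm, emb_out is instead split into a scale and a shift that modulate the normalized h. Finally skip_connection(x) is added to h, so overall this is a residual connection between x and h.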
def _forward(self, x, emb):
    h = self.in_layers(x)
    emb_out = self.emb_layers(emb).type(h.dtype)
    while len(emb_out.shape) < len(h.shape):
        emb_out = emb_out[..., None]
    if self.use_scale_shift_norm:
        out_norm, out_rest = self.out_layers[0], self.out_layers[1:]
        scale, shift = th.chunk(emb_out, 2, dim=1)
        h = out_norm(h) * (1 + scale) + shift
        h = out_rest(h)
    else:
        h = h + emb_out
        h = self.out_layers(h)
    return self.skip_connection(x) + h
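The use_scale_shift_norm branch relies on broadcasting, which this small sketch spells out. Random tensors stand in for h and for the output of emb_layers, and `nn.GroupNorm(32, c)` is an assumption about what `normalization` builds.

```python
import torch as th
import torch.nn as nn

n, c = 2, 32
h = th.randn(n, c, 8, 8)
emb_out = th.randn(n, 2 * c)                 # stand-in for self.emb_layers(emb)

# Append trailing singleton axes until emb_out has as many dims as h.
while len(emb_out.shape) < len(h.shape):
    emb_out = emb_out[..., None]             # ends up as [2, 64, 1, 1]

scale, shift = th.chunk(emb_out, 2, dim=1)   # each [2, 32, 1, 1]
norm = nn.GroupNorm(32, c)                   # assumed normalization layer
modulated = norm(h) * (1 + scale) + shift    # broadcast over the 8x8 positions

print(emb_out.shape, scale.shape, modulated.shape)
```

Because the formula uses (1 + scale), a scale and shift of zero leave the normalized features unchanged, so the embedding acts as a per-channel modulation around the identity.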
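Afterword
The idea behind the implementation is actually simple; writing it as modules just makes it look somewhat more complicated, and that is exactly the part worth learning from. When I have time I will try writing one modeled on this…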