Diffused Heads

Contributions

  • First work to apply diffusion models to talking face generation
  • Motion frames are introduced for autoregressive prediction, to keep the generated frames temporally consistent
  • Strong generalization ability

Method

  • The diffusion model follows the improved diffusion model (Improved DDPM)
  • Inputs: motion frames (the frames at times t-2 and t-1), an identity frame (one randomly selected frame), and the noisy target (the current frame with noise added); see the sketch after this list
  • The audio encoder is mel-spectrogram features plus a 1-layer convolution (pretrained); the audio condition is injected like a label embedding, via adaptive instance normalization (AdaIN)
  • Lip sync loss: an additional constraint applied only to the mouth region
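
A minimal PyTorch sketch of how these inputs might be assembled and how the audio condition might be injected via AdaIN. The names `build_model_input` and `AudioAdaIN` are illustrative, not taken from the paper's code:

```python
import torch
import torch.nn as nn

def build_model_input(motion_frames, identity_frame, noisy_target):
    """Channel-concatenate the three conditioning streams.

    motion_frames:  (B, 2, C, H, W) frames at t-2 and t-1 (grayscaled, per the note below)
    identity_frame: (B, C, H, W)    one randomly selected frame of the speaker
    noisy_target:   (B, C, H, W)    the current frame with diffusion noise added
    """
    b, n, c, h, w = motion_frames.shape
    motion = motion_frames.reshape(b, n * c, h, w)  # stack motion frames along channels
    return torch.cat([noisy_target, identity_frame, motion], dim=1)

class AudioAdaIN(nn.Module):
    """Inject the audio embedding like a label embedding: the audio vector
    predicts a per-channel scale and shift applied after instance normalization."""
    def __init__(self, num_channels, audio_dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        self.to_scale_shift = nn.Linear(audio_dim, 2 * num_channels)

    def forward(self, h, audio_emb):
        # audio_emb: (B, audio_dim) -> scale, shift: (B, C) each
        scale, shift = self.to_scale_shift(audio_emb).chunk(2, dim=1)
        h = self.norm(h)
        return h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
```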


  • Loss function: the improved diffusion loss + the lip sync loss, sketched below
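
A hedged sketch of this combined objective. The improved-diffusion hybrid loss is reduced to its L_simple noise-prediction term for brevity (the variational L_vlb term is omitted), and `mouth_mask` and the weight `lambda_ls` are assumptions for illustration, not values from the paper:

```python
import torch.nn.functional as F

def total_loss(eps_pred, eps_true, x0_pred, x0_true, mouth_mask, lambda_ls=0.2):
    """Total objective = improved-diffusion term + lip sync term.

    eps_pred / eps_true: predicted vs. sampled noise, (B, C, H, W)
    x0_pred / x0_true:   reconstructed vs. ground-truth frame, (B, C, H, W)
    mouth_mask:          (B, 1, H, W) binary mask over the lip region (assumed given)
    """
    # L_simple from improved diffusion; the variational L_vlb term is omitted in this sketch
    l_diff = F.mse_loss(eps_pred, eps_true)
    # Lip sync loss: reconstruction error counted only inside the mouth region
    l_lip = F.mse_loss(x0_pred * mouth_mask, x0_true * mouth_mask)
    return l_diff + lambda_ls * l_lip
```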

  • DDIM is used to speed up sampling; to force the model to take the subject's appearance from the identity frame as much as possible, every motion frame is converted to a grayscale image (see the inference sketch below)
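
Combining the two points above, inference could look like the autoregressive loop below: each frame is denoised with DDIM, then grayscaled before entering the motion-frame buffer. Here `ddim_sample` stands in for a full DDIM sampler and is assumed rather than shown:

```python
import torch

def to_grayscale(frame):
    # Convert an RGB frame (B, 3, H, W) to 3-channel grayscale, so motion frames
    # carry structure and motion but little appearance information
    gray = (0.299 * frame[:, 0] + 0.587 * frame[:, 1] + 0.114 * frame[:, 2]).unsqueeze(1)
    return gray.repeat(1, 3, 1, 1)

@torch.no_grad()
def generate_video(model, ddim_sample, identity_frame, audio_embs, num_frames):
    """Autoregressive generation: the two most recent (grayscaled) outputs
    condition the next frame. `ddim_sample` is an assumed DDIM sampling routine."""
    motion = [to_grayscale(identity_frame), to_grayscale(identity_frame)]  # bootstrap with the identity frame
    video = []
    for t in range(num_frames):
        frame = ddim_sample(model,
                            motion_frames=torch.stack(motion[-2:], dim=1),
                            identity_frame=identity_frame,
                            audio_emb=audio_embs[:, t])
        video.append(frame)
        motion.append(to_grayscale(frame))  # newest output becomes a motion frame
    return torch.stack(video, dim=1)  # (B, T, C, H, W)
```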

Experiments

  • Datasets
    • CREMA
    • LRW
  • Implementation details (collected in the config sketch after this list)
    • 128×128 resolution
    • UNet with 256-512-768 channels, attention used only in the middle block, 4 heads with 64 channels per head
  • Qualitative results
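
For quick reference, the reported implementation details can be collected into a hyperparameter summary. The dict below is a hypothetical summary written in the spirit of improved-diffusion configs, not the actual arguments of that codebase; note that 256-512-768 corresponds to a base width of 256 with per-stage multipliers (1, 2, 3):

```python
# Hypothetical hyperparameter summary of the reported setup
unet_config = {
    "image_size": 128,          # 128x128 resolution
    "model_channels": 256,      # base channel width
    "channel_mult": (1, 2, 3),  # yields 256, 512, 768 channels per stage
    "attention_in_middle_block_only": True,
    "num_heads": 4,
    "num_head_channels": 64,
}
```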


  • Quantitative results (quoting the paper):
    • "Moreover, as explained in [20], PSNR favors blurry images and is not a perfect metric in our task, although used commonly."
    • "We do not provide anything but a single frame and audio, allowing the model to generate anything it wants. For that reason, our synthesized videos are not consistent with the reference ones and get worse measures in the standard metrics."


  • Ablation study

  • Demonstration of generalization
