
TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting

Jiahe
 
Abstract

Radiance fields have demonstrated impressive performance in synthesizing lifelike 3D talking heads. However, due to the difficulty of fitting steep appearance changes, the prevailing paradigm that represents facial motions by directly modifying point appearance may lead to distortions in dynamic regions. To tackle this challenge, we introduce TalkingGaussian, a deformation-based radiance fields framework for high-fidelity talking head synthesis. Leveraging point-based Gaussian Splatting, facial motions can be represented in our method by applying smooth and continuous deformations to persistent Gaussian primitives, without requiring the difficult appearance changes to be learned as in previous methods. Thanks to this simplification, precise facial motions can be synthesized while keeping facial features highly intact. Under this deformation paradigm, we further identify a face-mouth motion inconsistency that would affect the learning of detailed speaking motions. To address this conflict, we decompose the model into two branches, one for the face and one for the inside-mouth area, thereby simplifying the learning tasks and helping reconstruct more accurate motion and structure in the mouth region. Extensive experiments demonstrate that our method renders high-quality lip-synchronized talking head videos with better facial fidelity and higher efficiency than previous methods.

Keywords: talking head synthesis, 3D Gaussian Splatting


Figure 1: Inaccurate predictions of the rapidly changing appearance often produce distorted facial features in previous NeRF-based methods. By keeping a persistent head structure and predicting deformation to represent facial motion, our TalkingGaussian outperforms previous methods in synthesizing more precise and clear talking heads.

1 Introduction

Synthesizing audio-driven talking head videos is valuable to a wide range of digital applications such as virtual reality, film-making, and human-computer interaction. Recently, radiance fields like Neural Radiance Fields (NeRF) [31] have been adopted by many methods [15, 43, 24, 52, 36, 40, 5] to improve the stability of the 3D head structure while providing photo-realistic rendering, which has achieved great success in synthesizing high-fidelity talking head videos.

Most of these NeRF-based approaches [15, 43, 24, 52, 36] synthesize different face motions by directly modifying color and density with neural networks, predicting a temporary condition-dependent appearance for each spatial point in the radiance fields whenever a condition feature is received. This appearance-modification paradigm enables previous methods to achieve dynamic lip-audio synchronization in a fixed space representation. However, since even neighboring regions of a human face can show significantly different colors and structures, it is challenging for these continuous, smooth neural fields to accurately fit the rapidly changing appearance needed to represent facial motions, which may lead to heavy distortions of facial features such as a messy mouth and transparent eyelids, as shown in Fig. 1.
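
As a rough illustrative contrast (not taken from any particular method's code), the appearance-modification paradigm can be sketched as a radiance field whose color and density outputs are re-predicted for every condition feature; the class and layer sizes below are hypothetical:

```python
import torch

class ConditionedRadianceField(torch.nn.Module):
    """Illustrative sketch of the appearance-modification paradigm:
    color and density at every 3D point are re-predicted whenever a new
    condition feature (e.g., an audio embedding) arrives, so the network
    must fit steep appearance changes in order to express motion."""

    def __init__(self, cond_dim=64, hidden=128):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(3 + cond_dim, hidden),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden, 4),  # RGB + density
        )

    def forward(self, x, cond):
        # x: (N, 3) sample positions; cond: (cond_dim,) driving feature
        n = x.shape[0]
        out = self.net(torch.cat([x, cond.unsqueeze(0).expand(n, -1)], dim=-1))
        rgb = torch.sigmoid(out[:, :3])   # condition-dependent color
        sigma = torch.relu(out[:, 3])     # condition-dependent density
        return rgb, sigma
```

Because the same point must emit sharply different colors and densities across conditions, this mapping is exactly the steep function the paragraph above argues is hard for smooth neural fields to fit.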

In this paper, we propose TalkingGaussian, a deformation-based talking head synthesis framework that utilizes the recent 3D Gaussian Splatting (3DGS) [20] to address the facial distortion problem in existing radiance-field-based methods. The core idea of our method is to represent complex and fine-grained facial motions with several individual smooth deformations to simplify the learning task. To achieve this goal, we first obtain a persistent head structure that keeps an unchangeable appearance and stable geometry with 3DGS. Motions can then be precisely represented solely by the deformation applied to this head structure, eliminating the distortions produced by inaccurately predicted appearance and yielding better facial fidelity while synthesizing high-quality talking heads.

Specifically, we represent the dynamic talking head with a 3DGS-based Deformable Gaussian Field, consisting of a static Persistent Gaussian Field and a neural Grid-based Motion Field that decouple the persistent head structure from the dynamic facial motions. Unlike previous continuous neural backbones [31, 32, 24], 3DGS provides an explicit space representation via a definite set of Gaussian primitives, enabling us to obtain a more stable head structure and accurate control of spatial points. Based on this, we apply a point-wise deformation, which changes the position and shape of each primitive while persisting its color and opacity, to represent facial motions via the motion fields. The deformed primitives are then fed into the 3DGS rasterizer to render the target images. To facilitate smooth learning of a target facial motion, we introduce an incremental sampling strategy that utilizes face action priors to schedule the optimization of the deformation.
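
The point-wise deformation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the module name, layer sizes, and the choice of additive offsets for rotation and scale are assumptions.

```python
import torch

class GaussianDeformation(torch.nn.Module):
    """Sketch of the point-wise deformation paradigm: a small network
    predicts offsets for each Gaussian primitive's position and shape
    (rotation, scale), while color and opacity stay fixed, so appearance
    never needs to be re-learned per frame."""

    def __init__(self, cond_dim=64, hidden=128):
        super().__init__()
        # input: primitive position (3) + motion condition feature (cond_dim)
        # output: delta position (3) + delta rotation quaternion (4) + delta scale (3)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(3 + cond_dim, hidden),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden, 3 + 4 + 3),
        )

    def forward(self, xyz, rot, scale, cond):
        # xyz: (N, 3), rot: (N, 4), scale: (N, 3); cond: (cond_dim,) per-frame feature
        n = xyz.shape[0]
        inp = torch.cat([xyz, cond.unsqueeze(0).expand(n, -1)], dim=-1)
        d_xyz, d_rot, d_scale = self.mlp(inp).split([3, 4, 3], dim=-1)
        # deform geometry only; each primitive's color and opacity persist
        return xyz + d_xyz, rot + d_rot, scale + d_scale
```

In such a pipeline, the deformed positions, rotations, and scales, together with the unchanged colors and opacities, would be handed to the 3DGS rasterizer to render the frame.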

In the Deformable Gaussian Fields, we further decompose the entire head into a face branch and an inside-mouth branch to resolve the motion inconsistency between these two regions, which greatly improves the synthesis quality in both static structure and dynamic performance. Since the motions of the face and the inside of the mouth are not always tightly coupled and may sometimes differ considerably, it is hard to accurately represent these delicate but conflicting motions with a single motion field. To simplify the learning of these two distinct motions, we separate the two regions in the 2D input images with a semantic mask and build two model branches to represent them individually. As the motion in each branch is simplified and becomes smooth, our method achieves better visual-audio synchronization and reconstructs a more accurate mouth structure.
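
One plausible way to recombine the two branch renders at the pixel level is alpha compositing, assuming the face branch outputs a per-pixel opacity; the function name and blending rule here are illustrative, not the paper's exact formulation:

```python
import numpy as np

def composite_branches(face_rgb, face_alpha, mouth_rgb):
    """Blend the two branch renders: where the face branch is opaque
    (alpha near 1) its pixels dominate; where it is transparent
    (e.g., the open mouth cavity) the inside-mouth render shows through.
    face_rgb, mouth_rgb: (H, W, 3); face_alpha: (H, W, 1) in [0, 1]."""
    return face_alpha * face_rgb + (1.0 - face_alpha) * mouth_rgb
```

The key property is that each branch only needs to learn its own smooth motion; the mask-driven composition reassembles the full head without either field having to fit the other region's conflicting motion.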
