FaceFormer: Speech-Driven 3D Facial Animation with Transformers

Abstract

Speech-driven 3D facial animation is challenging due to the complex geometry of human faces and the limited availability of 3D audio-visual data. Prior works typically focus on learning phoneme-level features of short audio windows with limited context, occasionally resulting in inaccurate lip movements. To tackle this limitation, we propose a Transformer-based autoregressive model, FaceFormer, which encodes the long-term audio context and autoregressively predicts a sequence of animated 3D face meshes. To cope with the data scarcity issue, we integrate the self-supervised pre-trained speech representations. Also, we devise two biased attention mechanisms well suited to this specific task, including the biased cross-modal multi-head (MH) attention and the biased causal MH self-attention with a periodic positional encoding strategy. The former effectively aligns the audio-motion modalities, whereas the latter offers abilities to generalize to longer audio sequences. Extensive experiments and a perceptual user study show that our approach outperforms the existing state-of-the-art methods. We encourage watching the video.¹

¹The supplementary video and code are available at: https://evelynfan.github.io/audio2face/

∗ Corresponding author. † Work done at HKUST.

1 Introduction

Speech-driven 3D facial animation has become an increasingly attractive research area in both academia and industry. It is potentially beneficial to a broad range of applications such as virtual reality, film production, games and education. Realistic speech-driven 3D facial animation aims to automatically animate vivid facial expressions of the 3D avatar from an arbitrary speech signal.


Figure 1: Concept diagram of FaceFormer. Given the raw audio input and a neutral 3D face mesh, our proposed end-to-end Transformer-based architecture, dubbed FaceFormer, can autoregressively synthesize a sequence of realistic 3D facial motions with accurate lip movements.

We focus on animating the 3D geometry rather than the 2D pixel values, e.g. photorealistic talking-head animation [15, 52, 63, 12, 69, 42, 67]. The majority of existing works aim to produce 2D videos of talking heads, given the availability of massive 2D video datasets. However, the generated 2D videos are not directly applicable to applications like 3D games and VR, which need to animate 3D models in a 3D environment. Several methods [47, 27, 60] harness 2D monocular videos to obtain 3D facial parameters, which might lead to unreliable results. This is because the quality of the synthetic 3D data is bounded by the accuracy of 3D reconstruction techniques, which cannot capture the subtle changes in 3D. In speech-driven 3D facial animation, most 3D mesh-based works [17, 8, 39] formulate the input as short audio windows, which might result in ambiguities in variations of facial expressions. As pointed out by Karras et al. [31], a longer-term audio context is required for realistically animating the whole face. While MeshTalk [51] has considered a longer audio context by modeling the audio sequence, training the model with Mel spectral audio features fails to synthesize accurate lip motions in data-scarce settings. Collecting 3D motion capture data is also considerably expensive and time-consuming.

To address the issues about long-term context and lack of 3D audio-visual data, we propose a transformer-based autoregressive model (Fig. 1) which (1) captures longer-term audio context to enable highly realistic animation of the entire face, i.e. both upper and lower face expressions, (2) effectively utilizes the self-supervised pre-trained speech representations to handle the data scarcity issue, and (3) considers the history of face motions for producing temporally stable facial animation.

Transformer [58] has achieved remarkable performance in both natural language processing [20, 58] and computer vision [13, 44, 21] tasks. The sequential models like LSTM have a bottleneck that hinders the ability to learn longer-term context effectively [45]. Compared to RNN-based models, transformer can better capture long-range context dependencies based solely on attention mechanisms [58]. Recently, transformer has also made encouraging progress in body motion synthesis [1, 46, 4] and dance generation [36, 37, 57]. The success of transformer is mainly attributed to its design incorporating the self-attention mechanism, which is effective in modeling both the short- and long-range relations by explicitly attending to all parts of the representation. Speech-driven 3D facial animation has not been explored in this direction.

Direct application of a vanilla transformer architecture to audio sequences does not perform well on the task of speech-driven 3D facial animation, and we thus need to address these issues. First, transformer is data-hungry in nature, requiring sufficiently large datasets for training [32]. Given the limited availability of 3D audio-visual data, we explore the use of the self-supervised pre-trained speech model wav2vec 2.0 [2]. Wav2vec 2.0 has learned rich phoneme information, since it has been trained on a large-scale corpus [43] of unlabeled speech. While the limited 3D audio-visual data might not cover enough phonemes, we expect the pre-trained speech representations can benefit the speech-driven 3D facial animation task in data-scarce settings. Second, the default encoder-decoder attention of transformer can not handle modality alignment, and thus we add an alignment bias for audio-motion alignment. Third, we argue that modeling the correlation between speech and face motions needs to consider long-term audio context dependencies [31]. Accordingly, we do not restrict the attention scope of the encoder self-attention, thus maintaining its ability to capture long-range audio context dependencies. Fourth, transformer with the sinusoidal position encoding has weak abilities to generalize to sequence lengths longer than the ones seen during training [50, 19]. Inspired by Attention with Linear Biases (ALiBi) [50], we add a temporal bias to the query-key attention score and design a periodic positional encoding strategy to improve the model’s generalization ability to longer audio sequences.

The main contributions of our work are as follows:

  • An autoregressive transformer-based architecture for speech-driven 3D facial animation. FaceFormer encodes the long-term audio context and the history of face motions to autoregressively predict a sequence of animated 3D face meshes. It achieves highly realistic and temporally stable animation of the whole face including both the upper face and the lower face.

  • The biased attention modules and a periodic positional encoding strategy. We carefully design the biased cross-modal MH attention to align the different modalities, and the biased causal MH self-attention with a periodic positional encoding strategy to improve the generalization to longer audio sequences.

  • Effective utilization of the self-supervised pre-trained speech model. Incorporating the self-supervised pre-trained speech model in our end-to-end architecture can not only handle the data limitation problem, but also notably improve the accuracy of mouth movements for the difficult cases, e.g., the lips are fully closed on the /b/, /m/, /p/ phonemes.

  • Extensive experiments and a user study to assess the quality of synthesized face motions. The results demonstrate the superiority of FaceFormer over existing state-of-the-art methods in terms of realistic facial animation and lip sync on two 3D datasets [24, 17].

2 Related Work

2.1 Speech-Driven 3D Facial Animation

Facial animation [62, 35, 5, 72, 33, 25, 55, 34] has attracted considerable attention over the years. While aware of extensive 2D-based approaches [23, 16, 11, 49, 59, 18, 66, 10, 29, 70], we focus on animating a 3D model in this work. Typically, the procedural methods [41, 54, 65, 22] establish a set of explicit rules for animating the talking mouth. For example, the dominance functions [41] are used to characterize the speech control parameters. The dynamic viseme model proposed by Taylor et al. [54] exploits the one-to-many mapping of phonemes to lip motions. Xu et al. [65] construct a canonical set for modeling coarticulation effects. The state-of-the-art procedural approach JALI [22] utilizes two anatomical actions to animate a 3D facial rig.

One appealing strength of the above procedural methods is the explicit control of the system to ensure the accuracy of the mouth movements. However, they require a lot of manual effort in parameter tuning. Alternatively, a wide variety of data-driven approaches [6, 40, 31, 53, 48, 17, 28, 51] has been proposed to produce 3D facial animation. Cao et al. [6] synthesize 3D facial animation based on the proposed Anime Graph structure and a search-based technique. The sliding window approach [53] requires the transcribed phoneme sequences as input and can re-target the output to other animation rigs. An end-to-end convolutional network elaborated by Karras et al. [31] leverages the linear predictive coding method to encode audio and designs a latent code to disambiguate the variations in facial expression. Zhou et al. [71] employ a three-stage network that combines phoneme groups, landmarks and audio features to predict viseme animation curves. VOCA [17] is a speaker-independent 3D facial animation method that captures a variety of speaking styles, yet the generated face motions are mostly present in the lower face. Recently, MeshTalk [51] learns a categorical latent space that successfully disentangles audio-correlated and audio-uncorrelated face motions.

Most related to our work are methods [31, 17, 51] whereby the high-resolution 3D data are used for training and the output is represented as the high-dimensional vector in 3D vertex space. The former two models [31, 17] are trained using short audio windows, thus ignoring the long-term audio context. Despite the highly realistic facial animation achieved by the latter method [51], it requires large amounts of high-fidelity 3D facial data to ensure the animation quality and the generalization to unseen identities.


Figure 2: Overview of FaceFormer. An encoder-decoder model with Transformer architecture takes raw audio as input and autoregressively generates a sequence of animated 3D face meshes. Layer normalizations and residual connections are omitted for simplicity. The overall design of the FaceFormer encoder follows wav2vec 2.0 [2]. In addition, a linear interpolation layer is added after TCN for resampling the audio features. We initialize the encoder with the corresponding pre-trained wav2vec 2.0 weights. The FaceFormer decoder consists of two main modules: a biased causal MH self-attention with a periodic positional encoding for generalizing to longer input sequences, and a biased cross-modal multi-head (MH) attention for aligning audio-motion modalities. During training, the parameters of TCN are fixed, whereas the other parts of the model are learnable.

2.2 Transformers in Vision and Graphics

Transformer [58] has emerged as a strong alternative to both RNN and CNN. In contrast to RNNs that process sequence tokens recursively, transformers can attend to all tokens in the input sequence in parallel, thereby modeling the long-range contextual information effectively. Vision Transformer (ViT) [21] is the first work that explores the direct application of transformers to the task of image classification. Following ViT, some follow-up works [56, 14, 9] have been introduced to boost performance for image recognition problems. Besides, transformer-based models and the variants have also been proposed in object detection [7], semantic segmentation [64], image generation [30], etc. In computer graphics, transformers have been exploited for 3D point cloud representations and 3D mesh, such as Point Transformer [68], Point Cloud Transformer [26] and Mesh Transformer [38]. We refer readers to the comprehensive survey [32] for further information.

Some of the most recent works on 3D body motion synthesis [1, 46, 4] and 3D dance generation [36, 37, 57] have explored the power of transformer in modeling sequential data and produced impressive results. Different from dance generation where the output motion is highly unconstrained, the task of speech-driven 3D facial animation inherently requires the alignment between audio and face motions to ensure the accuracy of lip motions. Meanwhile, the long-term audio context is expected to be considered, which is important for animating the whole face [31]. Consequently, we present FaceFormer that incorporates the desirable properties for the speech-driven 3D facial animation problem.

3 Our Approach: FaceFormer

We formulate speech-driven 3D facial animation as a sequence-to-sequence (seq2seq) learning problem and propose a novel seq2seq architecture (Fig. 2) to autoregressively predict facial movements conditioned on both audio context and past facial movement sequence. Suppose that there is a sequence of ground-truth 3D face movements 𝐘𝐓=(𝐲𝟏,…,𝐲𝐓), where 𝐓 is the number of visual frames, and the corresponding raw audio 𝒳. The goal here is to produce a model that can synthesize facial movements 𝐘^𝐓 that is similar to 𝐘𝐓 given the raw audio 𝒳. In the encoder-decoder framework (Fig. 2), the encoder first transforms 𝒳 into speech representations 𝐀𝐓′=(𝐚𝟏,…,𝐚𝐓′), where 𝐓′ is the frame length of speech representations. The style embedding layer contains a set of learnable embeddings that represents speaker identities 𝐒=(𝐬𝟏,…,𝐬𝐍). Then, the decoder autoregressively predicts facial movements 𝐘^𝐓=(𝐲^𝟏,…,𝐲^𝐓) conditioned on 𝐀𝐓′, the style embedding 𝐬𝐧 of speaker 𝐧, and the past facial movements. Formally,

$$\hat{\mathbf{y}}_t = \mathrm{FaceFormer}_\theta\big(\hat{\mathbf{y}}_{<t}, \mathbf{s}_n, \mathcal{X}\big), \qquad (1)$$

where θ denotes the model parameters, 𝐭 is the current time-step in the sequence and 𝐲^𝐭∈𝐘^𝐓. For the remainder of this section, we describe each component of the FaceFormer architecture in detail.
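To make Eq. (1) concrete, the following is a minimal sketch of the autoregressive inference loop in PyTorch. The attribute names `audio_encoder`, `style_embedding` and `decode_step` are illustrative placeholders for the components described in Sec. 3.1 and 3.2, not the released API.

```python
import torch

@torch.no_grad()
def autoregressive_inference(model, raw_audio, speaker_id, num_frames):
    """Sketch of Eq. (1): predict y_t conditioned on y_<t, the style s_n and audio X.
    `audio_encoder`, `style_embedding` and `decode_step` are assumed, illustrative names."""
    speech_repr = model.audio_encoder(raw_audio)                # (1, kT, d) audio context
    style = model.style_embedding(torch.tensor([speaker_id]))   # s_n for one training identity

    predictions = []                                            # history of face motions y_<t
    for _ in range(num_frames):
        past = torch.stack(predictions, dim=1) if predictions else None
        y_t = model.decode_step(past, style, speech_repr)       # (1, V*3) for the current frame
        predictions.append(y_t)
    return torch.stack(predictions, dim=1)                      # (1, T, V*3)
```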

3.1 FaceFormer Encoder

3.1.1 Self-Supervised Pre-Trained Speech Model

The design of our FaceFormer encoder follows the state-of-the-art self-supervised pre-trained speech model, wav2vec 2.0 [2]. Specifically, the encoder is composed of an audio feature extractor and a multi-layer transformer encoder [58]. The audio feature extractor, which consists of several temporal convolution layers (TCN), transforms the raw waveform input into feature vectors with frequency f_a. The transformer encoder is a stack of multi-head self-attention and feed-forward layers, converting the audio feature vectors into contextualized speech representations. The outputs of the temporal convolutions are discretized to a finite set of speech units via a quantization module. Similar to masked language modeling [20], wav2vec 2.0 uses the context surrounding a masked time step to identify the true quantized speech unit by solving a contrastive task.

We initialize our encoder (Fig. 2) with the pre-trained wav2vec 2.0 weights, and add a randomly initialized linear projection layer on top. Since the facial motion data might be captured with a frequency f_m that is different from f_a (e.g., f_a = 49 Hz, while for the BIWI dataset [24] f_m = 25 fps), we add a linear interpolation layer after the temporal convolutions for resampling the audio features, which results in the output length kT, where k = ⌈f_a / f_m⌉. Therefore, the outputs of the linear projection layer can be represented as A_{kT} = (a_1, ..., a_{kT}). In this way, the audio and motion modalities can be aligned by the biased cross-modal multi-head attention (Sec. 3.2.3).
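For instance, on BIWI the speech features arrive at roughly f_a = 49 Hz while the meshes are captured at f_m = 25 fps, so k = 2 audio tokens correspond to each motion frame. A minimal sketch of this resampling step is shown below; the exact placement of the interpolation inside the feature extractor and the feature dimension (768) are assumptions for illustration.

```python
import math
import torch
import torch.nn.functional as F

def resample_audio_features(hidden_states, num_motion_frames, fa=49.0, fm=25.0):
    """Linearly interpolate wav2vec-style features (B, T', d) to length k*T.
    fa: frame rate of the speech features, fm: frame rate of the motion capture."""
    k = math.ceil(fa / fm)                       # k = ceil(f_a / f_m), e.g. 2 for BIWI
    target_len = k * num_motion_frames           # kT output frames
    x = hidden_states.transpose(1, 2)            # (B, d, T') layout for F.interpolate
    x = F.interpolate(x, size=target_len, mode='linear', align_corners=True)
    return x.transpose(1, 2)                     # (B, kT, d)

# Example: 200 feature frames resampled for a 100-frame (4 s @ 25 fps) BIWI clip.
feats = torch.randn(1, 200, 768)
audio_tokens = resample_audio_features(feats, num_motion_frames=100)
print(audio_tokens.shape)                        # torch.Size([1, 200, 768])
```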

3.2 FaceFormer Decoder

3.2.1 Periodic Positional Encoding

In practice, transformer has very limited generalization abilities for longer sequences due to the sinusoidal positional encoding method [19, 50]. The Attention with Linear Biases (ALiBi) [50] method is proposed to improve generalization abilities by adding a constant bias to the query-key attention score. In our experiments, we notice that directly replacing the sinusoidal positional encoding with ALiBi would lead to a static facial expression during inference. This is because ALiBi does not add any position information to the input representation, which might influence the robustness of the temporal order information, especially for our case where the training sequences have subtle motion variations among adjacent frames. To alleviate this issue, we devise a periodic positional encoding (PPE) for injecting the temporal order information, while being compatible with ALiBi. Specifically, we modify the original sinusoidal positional encoding method [58] to make it periodic with respect to a hyper-parameter 𝐩 that indicates the period:

$$PPE(t, 2i) = \sin\big((t \bmod p) / 10000^{2i/d}\big) \qquad (2)$$
$$PPE(t, 2i+1) = \cos\big((t \bmod p) / 10000^{2i/d}\big)$$

where 𝐭 denotes the token position or the current time-step in the sequence, 𝐝 is the model dimension, and 𝐢 is the dimension index. Rather than assigning a unique position identifier for each token [58], the proposed PPE strategy recurrently injects the position information within each period 𝐩 (as shown in Section 3.2.2). Before PPE, we first project the face motion 𝐲^𝐭 into a 𝐝-dimensional space via a motion encoder. To model the speaking style, we embed the one-hot speaker identity to a 𝐝-dimensional vector 𝐬𝐧 via a style embedding layer and add it to the facial motion representation:

$$\mathbf{f}_t = \begin{cases} (\mathbf{W}_f \cdot \hat{\mathbf{y}}_{t-1} + \mathbf{b}_f) + \mathbf{s}_n, & 1 < t \le T, \\ \mathbf{s}_n, & t = 1, \end{cases} \qquad (3)$$

where 𝐖𝐟 is the weight, 𝐛𝐟 is the bias and 𝐲^𝐭−𝟏 is the prediction from the last time step. Then PPE is applied to 𝐟𝐭 to provide the temporal order information periodically:

$$\hat{\mathbf{f}}_t = \mathbf{f}_t + PPE(t). \qquad (4)$$
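As a minimal sketch of Eqs. (2)-(4), assuming a model dimension d, a period p, and illustrative module sizes (d = 128, 6 training identities, BIWI's 23370 vertices), the PPE and the decoder input can be written as:

```python
import torch

def periodic_positional_encoding(seq_len, d_model, period):
    """Eq. (2): the sinusoidal encoding of [58], computed on (t mod p) instead of t."""
    t = torch.arange(seq_len, dtype=torch.float32) % period                      # t mod p
    div = torch.pow(10000.0, torch.arange(0, d_model, 2, dtype=torch.float32) / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(t[:, None] / div)                                    # even dims
    pe[:, 1::2] = torch.cos(t[:, None] / div)                                    # odd dims
    return pe                                                                    # (T, d)

# Eqs. (3)-(4): project the previous prediction, add the style embedding s_n,
# then add the periodic positional encoding. Module sizes below are assumptions.
d_model, period = 128, 25
motion_encoder = torch.nn.Linear(23370 * 3, d_model)
style_embedding = torch.nn.Embedding(6, d_model)
prev_motions = torch.randn(1, 10, 23370 * 3)                                     # y_<t (10 past frames)
f = motion_encoder(prev_motions) + style_embedding(torch.tensor([0]))[:, None, :]
f_hat = f + periodic_positional_encoding(10, d_model, period)                    # (1, 10, d)
```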
3.2.2 Biased Causal Multi-Head Self-Attention

We design a biased causal multi-head (MH) self-attention mechanism based on ALiBi [50], which is reported to be beneficial for generalizing to longer sequences in language modeling. Given the temporally encoded facial motion representation sequence 𝐅^𝐭=(𝐟^𝟏,…,𝐟^𝐭), biased causal MH self-attention first linearly projects 𝐅^𝐭 into queries 𝐐𝐅^ and keys 𝐊𝐅^ of dimension 𝐝𝐤, and values 𝐕𝐅^ of dimension 𝐝𝐯. To learn the dependencies between each frame in the context of the past facial motion sequence, a weighted contextual representation is calculated by performing the scaled dot-product attention [58]:

$$\mathrm{Att}(\mathbf{Q}^{\hat{\mathbf{F}}}, \mathbf{K}^{\hat{\mathbf{F}}}, \mathbf{V}^{\hat{\mathbf{F}}}, \mathbf{B}^{\hat{\mathbf{F}}}) = \mathrm{softmax}\Big(\frac{\mathbf{Q}^{\hat{\mathbf{F}}} (\mathbf{K}^{\hat{\mathbf{F}}})^{\mathsf{T}}}{\sqrt{d_k}} + \mathbf{B}^{\hat{\mathbf{F}}}\Big) \mathbf{V}^{\hat{\mathbf{F}}}, \qquad (5)$$

where 𝐁𝐅^ is the temporal bias we add to ensure causality and to improve the ability to generalize to longer sequences.

More specifically, 𝐁𝐅^ is a matrix that has negative infinity in the upper triangle to avoid looking at future frames to make current predictions. For the generalization ability, we add static and non-learned biases to the lower triangle of 𝐁𝐅^. Different from ALiBi [50], we introduce the period 𝐩 and inject the temporal bias to each period ([1:𝐩],[𝐩+1:𝟐𝐩],…). Let us define i and j as the indices of 𝐁𝐅^ (1≤i≤𝐭, 1≤j≤𝐭). Then the temporal bias 𝐁𝐅^ is formulated as:

$$\mathbf{B}^{\hat{\mathbf{F}}}(i, j) = \begin{cases} \lfloor (i - j) / p \rfloor, & j \le i, \\ -\infty, & \text{otherwise}. \end{cases} \qquad (6)$$

In this way, we bias the causal attention by assigning higher attention weights to the closer period. Intuitively, the closest period of facial frames (𝐲^𝐭−𝐩,…,𝐲^𝐭−𝟏) is most likely to affect the current prediction of 𝐲^𝐭. Thus, our proposed temporal bias can be considered as a generalized form of ALiBi, and ALiBi becomes a special case when 𝐩=1.

The MH attention mechanism, which consists of 𝐇 parallel scaled dot-product attentions, is applied to jointly extract the complementary information from multiple representation subspaces. The outputs of 𝐇 heads are concatenated together and projected forward by a parameter matrix 𝐖𝐅^:

$$\mathrm{MH}(\mathbf{Q}^{\hat{\mathbf{F}}}, \mathbf{K}^{\hat{\mathbf{F}}}, \mathbf{V}^{\hat{\mathbf{F}}}, \mathbf{B}^{\hat{\mathbf{F}}}) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\,\mathbf{W}^{\hat{\mathbf{F}}}, \qquad (7)$$
where $\mathrm{head}_h = \mathrm{Att}(\mathbf{Q}_h^{\hat{\mathbf{F}}}, \mathbf{K}_h^{\hat{\mathbf{F}}}, \mathbf{V}_h^{\hat{\mathbf{F}}}, \mathbf{B}_h^{\hat{\mathbf{F}}})$.

Similar to ALiBi [50], we add a head-specific scalar 𝐦 for the MH setting. For each head 𝐡, the temporal bias is defined as 𝐁𝐡𝐅^ = 𝐁𝐅^⋅𝐦. The scalar 𝐦 is a head-specific slope and is not learned during training. For 𝐇 heads, 𝐦 starts at 2^{−2^{(−log₂ 𝐇 + 3)}}, and each subsequent slope is obtained by multiplying the previous one by the same ratio. Concretely, if the model has 4 heads, the corresponding slopes are 2^{−2}, 2^{−4}, 2^{−6} and 2^{−8}.
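A minimal sketch of the temporal bias of Eq. (6) together with the head-specific slopes, assuming 𝐇 heads and period 𝐩. Note that the finite entries are negated here (following the ALiBi convention of penalizing distance) so that closer periods receive larger scores, consistent with the intuition above; this sign convention and the tensor layout are interpretations for illustration, not the exact released implementation.

```python
import torch

def alibi_slopes(num_heads):
    """Head-specific, non-learned slopes m: the first is 2^(-2^(-log2(H)+3)) = 2^(-8/H),
    and each subsequent head multiplies by the same ratio (2^-2, 2^-4, 2^-6, 2^-8 for H = 4)."""
    ratio = 2.0 ** (-8.0 / num_heads)
    return torch.tensor([ratio ** (h + 1) for h in range(num_heads)])

def biased_causal_mask(seq_len, period, num_heads):
    """Eq. (6): floor((i - j) / p) for j <= i, -inf above the diagonal (causality).
    The finite part is applied with a negative sign (ALiBi convention); p = 1 recovers ALiBi."""
    i = torch.arange(seq_len).unsqueeze(1)                            # query index
    j = torch.arange(seq_len).unsqueeze(0)                            # key index
    dist = torch.div(i - j, period, rounding_mode='floor').float()    # floor((i - j) / p)
    bias = (-dist).masked_fill(j > i, float('-inf'))                  # no peeking at future frames
    return alibi_slopes(num_heads).view(-1, 1, 1) * bias              # (H, T, T)

# The (H, T, T) mask is added to Q K^T / sqrt(d_k) before the softmax in Eq. (5).
mask = biased_causal_mask(seq_len=6, period=2, num_heads=4)
```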

3.2.3 Biased Cross-Modal Multi-Head Attention

The biased cross-modal multi-head attention aims to combine the outputs of the FaceFormer encoder (speech features) and the biased causal MH self-attention (motion features) to align the audio and motion modalities (see Fig. 2). For this purpose, we add an alignment bias to the query-key attention score, which is simple and effective. The alignment bias 𝐁𝐀 (1≤i≤𝐭, 1≤j≤k𝐓) is represented as:

$$\mathbf{B}^{\mathbf{A}}(i, j) = \begin{cases} 0, & ki \le j < k(i+1), \\ -\infty, & \text{otherwise}, \end{cases} \qquad (8)$$

Each token in 𝐀k𝐓 has captured the long-term audio context due to the self-attention mechanism. On the other hand, assuming the outputs of the biased causal MH self-attention are 𝐅~𝐭=(𝐟~𝟏,…,𝐟~𝐭), each token in 𝐅~𝐭 has encoded the history context of face motions. Both 𝐀k𝐓 and 𝐅~𝐭 are fed into the biased cross-modal MH attention. Likewise, 𝐀k𝐓 is transformed into two separate matrices: keys 𝐊𝐀 and values 𝐕𝐀, whereas 𝐅~𝐭 is transformed into queries 𝐐𝐅~. The output is calculated as a weighted sum of 𝐕𝐀,

$$\mathrm{Att}(\mathbf{Q}^{\tilde{\mathbf{F}}}, \mathbf{K}^{\mathbf{A}}, \mathbf{V}^{\mathbf{A}}, \mathbf{B}^{\mathbf{A}}) = \mathrm{softmax}\Big(\frac{\mathbf{Q}^{\tilde{\mathbf{F}}} (\mathbf{K}^{\mathbf{A}})^{\mathsf{T}}}{\sqrt{d_k}} + \mathbf{B}^{\mathbf{A}}\Big) \mathbf{V}^{\mathbf{A}}. \qquad (9)$$

To explore different subspaces, we also extend Eq. 9 to 𝐇 heads as in Eq. 7. Finally, the predicted face motion 𝐲^𝐭 is obtained by projecting the 𝐝-dimensional hidden state back to the 𝐕-dimensional 3D vertex space via a motion decoder.
为了探索不同的子空间,我们还扩展了Eq。9到 𝐇 头,如等式所示。7.最后,通过经由运动解码器将 𝐝 维隐藏状态投影回 𝐕 维3D顶点空间来获得预测面部运动 𝐲^𝐭 。

3.3 Training and Testing

During the training phase, we adopt an autoregressive scheme instead of a teacher-forcing scheme. In our experiments, we observe that training FaceFormer with a less guided scheme works better than a fully guided one. Once the complete 3D facial motion sequence is produced, the model is trained by minimizing the Mean Squared Error (MSE) between the decoder outputs 𝐘^𝐓=(𝐲^𝟏,…,𝐲^𝐓) and the ground truth 𝐘𝐓=(𝐲𝟏,…,𝐲𝐓):

$$\mathcal{L}_{\mathrm{MSE}} = \sum_{t=1}^{T} \sum_{v=1}^{V} \big\| \hat{\mathbf{y}}_{t,v} - \mathbf{y}_{t,v} \big\|^2, \qquad (10)$$

where 𝐕 represents the number of vertices of the 3D face mesh.
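A minimal sketch of Eq. (10) for a single training sequence; the (T, V, 3) per-vertex layout and the example sizes are assumptions for illustration.

```python
import torch

def motion_loss(pred, target):
    """Eq. (10): sum over frames t and vertices v of the squared per-vertex error.
    pred, target: (T, V, 3) predicted and ground-truth vertex positions."""
    return ((pred - target) ** 2).sum(dim=-1).sum()

# The loss is applied to the full sequence rolled out autoregressively from the
# model's own past predictions (no teacher forcing), as described above.
pred = torch.randn(100, 23370, 3)       # e.g. a 4 s BIWI clip at 25 fps, V = 23370
target = torch.randn(100, 23370, 3)
loss = motion_loss(pred, target)
```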

At inference time, FaceFormer autoregressively predicts a sequence of animated 3D face meshes. More specifically, at each time-step, it predicts the face motion 𝐲^𝐭 conditioned on the raw audio 𝒳, the history of face motions 𝐲^<𝐭 and the style representations 𝐬𝐧 as in Eq. 1. 𝐬𝐧 is determined by the speaker identity, and thus altering the one-hot identity vector can manipulate the output in different styles.

4 Experiments and Results

4.1 Experimental Settings

We use two publicly available 3D datasets, BIWI [24] and VOCASET [17] for training and testing. Both datasets provide the audio-3D scan pairs of English spoken utterances. BIWI contains 40 unique sentences shared across all speakers. VOCASET contains 255 unique sentences, some of which are shared across speakers. Comparatively, BIWI represents a more challenging dataset for lip sync as it covers fewer phonemes.

BIWI Dataset. BIWI is a corpus of affective speech and corresponding dense dynamic 3D face geometries. 14 human subjects are asked to read 40 English sentences, each of which is recorded twice: in a neutral or emotional context. The 3D face geometries are captured at 25fps, each with 23370 vertices. Each sequence is 4.67 seconds long on average. For our experiments, we use the subset where the sentences are recorded in the emotional context. Specifically, we split the data into a training set (BIWI-Train) of 192 sentences spoken by six subjects (each subject speaks 32 sentences), a validation set (BIWI-Val) of 24 sentences spoken by six subjects (each subject speaks 4 sentences), and two testing sets (BIWI-Test-A and BIWI-Test-B). BIWI-Test-A includes 24 sentences spoken by six seen subjects (each speaks 4 sentences), and BIWI-Test-B includes 32 sentences spoken by eight unseen subjects (each speaks 4 sentences).

VOCASET Dataset. VOCASET is composed of 480 facial motion sequences from 12 subjects. Each sequence is captured at 60fps and is between 3 and 4 seconds long. Each 3D face mesh has 5023 vertices. For a fair comparison, we use the same training, validation and testing splits as VOCA [17], which we refer to VOCA-Train, VOCA-Val and VOCA-Test, respectively.

Baseline Methods. We compare FaceFormer with two state-of-the-art methods, VOCA [17] and MeshTalk [51], on both BIWI and VOCASET. Among the three methods, FaceFormer and VOCA require conditioning on a training speaker identity during inference. For unseen subjects, we obtain the predictions of FaceFormer and VOCA by conditioning on all training identities. The implementation details of FaceFormer and the baseline methods are provided in the supplementary material (Sec. 1 and Sec. 2).

4.2 Evaluation Results

Table 1: Comparison of lip-sync errors. We compare FaceFormer with two state-of-the-art methods [17, 51] on BIWI-Test-A. The average lip error [51] is used for lip synchronization evaluation.

Methods              Lip Vertex Error (×10⁻⁴ mm)
VOCA                 7.6427
MeshTalk             6.7436
FaceFormer (Ours)    5.3742

Table 2: User study results on BIWI-Test-B. We use A/B testing and report the percentage of answers where A is preferred over B.

Ours vs. Competitor    Realism (%)       Lip Sync (%)
Ours vs. VOCA          83.85 ± 3.76      82.64 ± 3.77
Ours vs. MeshTalk      83.33 ± 4.07      80.56 ± 5.22
Ours vs. GT            35.24 ± 2.87      36.98 ± 1.38

Table 3: User study results on VOCA-Test.

Ours vs. Competitor    Realism (%)       Lip Sync (%)
Ours vs. VOCA          77.92 ± 7.94      77.08 ± 7.32
Ours vs. MeshTalk      82.92 ± 2.60      82.08 ± 3.15
Ours vs. GT            29.17 ± 10.41     30.42 ± 8.04

Lip-sync Evaluation. We follow the lip-sync metric employed in MeshTalk [51] for evaluating the quality of lip movements. The maximal L2 error of all lip vertices is defined as the lip error for each frame. The error is calculated by comparing the predictions and the captured 3D face geometry data. We report the computed average over all testing sequences of BIWI-test-A for VOCA, MeshTalk and FaceFormer in Tab. 1. The lower average lip error achieved by FaceFormer suggests it can produce more accurate lip movements compared to the other two methods.
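As a sketch of this metric (the lip-region vertex indices are dataset-specific and assumed to be given; averaging over frames and then over test sequences follows the description above):

```python
import torch

def lip_vertex_error(pred, target, lip_idx):
    """Maximal L2 error over the lip vertices per frame, averaged over the sequence.
    pred, target: (T, V, 3); lip_idx: dataset-specific lip-region vertex indices."""
    diff = pred[:, lip_idx, :] - target[:, lip_idx, :]     # (T, L, 3)
    per_frame_max = diff.norm(dim=-1).max(dim=-1).values   # maximal lip error per frame
    return per_frame_max.mean()                            # average over frames

# The reported number is this value averaged over all BIWI-Test-A sequences.
```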

Qualitative Evaluation. Given the many-to-many mappings between upper face motions and the speech utterance, it is suggested that qualitative evaluations and user studies are more proper for evaluating the quality of speech-driven facial animation than using quantitative metrics [17, 31]. We refer the readers to our supplementary video for the assessment of the motion quality. The video compares the results of our approach, those by the previous methods [53, 31, 17, 51] and the ground truth. Specifically, we test our model using (1) audio sequences from BIWI and VOCASET test sets, (2) audio clips extracted from supplementary videos of previous methods and (3) audio clips extracted from TED videos on YouTube. For the last two cases, the results are predicted from the model trained on BIWI. The video shows that FaceFormer produces realistic and natural-looking facial animation with accurate lip synchronization. Compared to VOCA and MeshTalk, it is notable that FaceFormer produces more realistic facial motions and better lip sync with proper mouth closures in many situations, e.g., the lips are fully closed when pronouncing /b/, /m/, /p/. We also show that our system can produce animation of talking in different styles and different languages.

4.3 Perceptual Evaluation

User Study on BIWI. We conduct user studies on Amazon Mechanical Turk (AMT) to evaluate the animation quality of FaceFormer, compared with the ground truth, VOCA and MeshTalk. For BIWI, we obtain the results of three methods using all test audio sequences of BIWI-Test-B (32 sentences). The results of FaceFormer and VOCA are produced by conditioning on all training speaker identities, which results in 192 videos (32 sentences × 6 identities) for each method. Therefore, 576 A vs. B pairs (192 videos × 3 comparisons) are created for BIWI-Test-B. For each HIT (human intelligence task), the AMT interface shows four video pairs including the qualification test in randomized order, and the Turker is instructed to judge the videos in terms of realistic facial animation and lip sync. Each video pair is evaluated by three Turkers. In particular, Turkers must pass the qualification test otherwise they are not allowed to submit HITs. Finally, we collect 576 HITs for the user study on BIWI. More details about the user study are described in our supplementary material (Sec. 3).

Tab. 2 shows the percentage of A/B testing in terms of realism and lip sync. Turkers favor FaceFormer over VOCA in terms of realistic facial animation and lip sync. We believe this is mainly due to two reasons: (1) face motions synthesized by VOCA are mostly present in the lower face; (2) VOCA sometimes fails to fully close the mouth at the phonemes /b/,/m/,/p/. FaceFormer also outperforms MeshTalk and we attribute this to the results produced by FaceFormer having more expressive facial motions and more accurate mouth movements. Not surprisingly, Turkers perceive the ground truth more realistic than FaceFormer.

User Study on VOCASET. In the second user study, we compare the results of three methods on VOCA-Test. We randomly select 10 sentences from VOCA-Test and obtain the results of FaceFormer and VOCA conditioned on all training speaker identities, which results in 80 videos (10 sentences × 8 identities) for each method. In total, 240 A vs. B pairs (80 videos × 3 comparisons) are created for VOCA-Test. Similarly, for each pair, Turkers make the choice between two videos in terms of realism and lip sync. Since VOCASET has very few upper face motions, movements are present mostly in the lower face for all three methods. In this case, well-synchronized lip motions are important for generating perceptually realistic results. Tab. 3 shows that FaceFormer achieves higher percentages over VOCA and MeshTalk. We believe this is because our results have better synchronized mouth shapes and closures. Similarly, there is still a certain gap between our results and the ground truth.

4.4 Visualization Analysis


Figure 3: Attention Weights Visualization. Attention weights of the (a) MH self-attention of the encoder and (b) biased causal MH self-attention of the decoder.

To provide insights into the underlying attention mechanism, we visualize the attention weights for the MH self-attention of the encoder, as well as the biased causal MH self-attention of the decoder. We consider 100 frames of a test sequence from BIWI and examine the attention weights that are used to predict the last frame. Fig. 3 visualizes the average attention weights across all heads. We observe that the encoder self-attention (Fig. 3 (a)) not only focuses on the nearby audio frames (as reflected by the diagonal line) but also attends to some farther future and past frames. This indicates that the self-attention mechanism of transformer is able to capture both the short- and long-range audio context dependencies. The attended audio frames may contain more informative context features that influence the current face motion. For the decoder self-attention (Fig. 3 (b)), the visualization corresponds to the causal attention incorporated with the temporal bias (Eq. 6). There is a clear pattern that the face motion frames in a closer period are assigned with higher weights, as those frames are more likely to influence the current face motion. For example, there is a high probability that people will keep smiling if they have been smiling over the past frames.

4.5 Ablation Study

The visual results of ablation study are included in the supplementary video. Please watch the supplementary video for the dynamic comparison.

4.5.1 Ablation on FaceFormer Encoder

Effect of the encoder self-attention module. To investigate the effect of the MH self-attention module in FaceFormer encoder, we directly remove it from the whole architecture, with the pre-trained TCN retained to extract the speech representations. We refer to this variant as “TCN+FaceFormer Decoder” and conduct the comparison experiments on BIWI. The results show that “TCN+FaceFormer Decoder” often fails to close the mouth, resulting in out-of-sync lip motions. Besides, the produced results have a temporal jitter effect around the mouth region, as shown in the supplementary video.

Effect of the wav2vec weights initialization. We also perform an ablation study of the wav2vec weights initialization by comparing FaceFormer trained with and without wav2vec weights initialization (denoted as “FaceFormer w/o wav2vec”). Without wav2vec weights initialization, we observe a degradation in the quality of face movements. “FaceFormer w/o wav2vec” cannot produce synchronized mouth motions, and a temporal jitter effect can be observed. This suggests that simply training FaceFormer with randomly initialized weights might converge to a poor solution. Hence, the wav2vec weights initialization is necessary for the FaceFormer encoder.

4.5.2 Ablation on FaceFormer Decoder

Choices of the decoder architecture. We explore whether the transformer-based architecture has advantages over a fully-connected layer or LSTM by training and testing two alternative variants: “FaceFormer Encoder+FC” and “FaceFormer Encoder+LSTM”. As shown in the supplemental video, FaceFormer yields more stable mouth motions and more accurate lip sync compared to the two variants. Compared to the FC decoder, the autoregressive mechanism of the FaceFormer decoder can stabilize the predicted lip motions by modeling the history motions. On the other hand, the self-attention mechanism of the FaceFormer decoder might model the context cues in history motions better than LSTM, thus having more temporally coherent lip motions.

Effect of the alignment bias. We examine the effect of the alignment bias (Eq. 8) by removing it from the biased cross-modal MH attention module. The model without the alignment bias (denoted as “FaceFormer w/o AB”) tends to generate muted facial expressions across all frames. Hence, the alignment bias is indispensable for the cross-modal attention in aligning the audio-motion modalities correctly.


Figure 4: Illustration of different positional encoding strategies.

Effect of the proposed positional encoding strategy. In the FaceFormer decoder, the proposed positional encoding strategy is adding a temporal bias to the attention score and making the original sinusoidal position embedding [58] periodic. We refer to this strategy as “TB+PPE”. We compare “TB+PPE” with the original sinusoidal position encoding [58] (“Original PE”) and “ALiBi” [50]. The differences of three different positional encoding strategies are visualized in Fig. 4. The results show “Original PE” can still produce well-synchronized mouth motions with proper lip closures, yet has a temporal jitter effect around the lips during silent frames, especially as the test audio sequence exceeds the average length of training audio sequences. While “ALiBi” does not influence the results on BIWI, it quickly freezes to a static facial expression when training and testing on VOCASET. This happens because the original ALiBi does not add any position information to the input representation, influencing the robustness of the temporal order information. This influence is more obvious when motion data have subtle variations among adjacent frames.

Since “TB+PPE” is a key component for improving the ability to generalize to longer audio sequences, we additionally study its influence by conducting the perceptual evaluation on AMT. Specifically, we download the TED videos shared under the “CC BY-NC-ND 4.0 International License”, and extract 15 representative audio clips for the user study. The audio sequences are around 20 seconds long, more than four times the average length of training audio sequences. For the comparison to “Original PE”, we randomly sample a training identity and use it as the condition for both methods. Similar to the user study in Sec. 4.3, each video pair is evaluated by three judges. Overall, Turkers perceive the facial animation results of FaceFormer more realistic (57.78%±16.78%) and the generated lip motions of FaceFormer more in sync with audio (62.22%±10.18%) than “Original PE”. That indicates that FaceFormer generalizes better to longer audio clips than “Original PE”. The likely explanation is that “Original PE” tends to generate unstable lip motions during silent frames when testing on longer audio sequences.

5 Discussion and Conclusion

In this work, we propose an autoregressive transformer-based architecture for speech-driven 3D facial animation. The encoder effectively leverages the self-supervised pre-trained speech representations, and its self-attention can capture long-range audio context dependencies. The decoder attention modules with a periodic positional encoding strategy are tailored for cross-modal alignment and generalization to longer sequences. Overall, FaceFormer demonstrates higher quality for lip synchronization and realistic facial animation compared to the state of the art. However, the main bottleneck in our model is the quadratic memory and time complexity of the self-attention mechanism, making it not suitable for real-time applications. One future work is to address this problem using advanced techniques [61, 3] that improve the efficiency of self-attention.

Ethics Considerations: We should use technology responsibly and be careful about the synthesized content. Since our technique requires 3D scan data collected from actors, it is important to obtain consent from the actors during data acquisition. Our method can animate a realistic 3D talking face from an arbitrary audio signal. However, there is a risk that such techniques could potentially be misused to cause embarrassment. Thus, we hope to raise the public’s awareness about the risks of the potential misuse and encourage research efforts on the responsible use of technology.

Acknowledgement. This research is partly supported by New Energy and Industrial Technology Development Organization (NEDO) (ref:JPNP21004).

References

  • [1]Emre Aksan, Peng Cao, Manuel Kaufmann, and Otmar Hilliges.A spatio-temporal transformer for 3d human motion prediction.arXiv preprint arXiv:2004.08692, 2020.
  • [2]Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli.wav2vec 2.0: A framework for self-supervised learning of speech representations.arXiv preprint arXiv:2006.11477, 2020.
  • [3]Iz Beltagy, Matthew E Peters, and Arman Cohan.Longformer: The long-document transformer.arXiv preprint arXiv:2004.05150, 2020.
  • [4]Uttaran Bhattacharya, Nicholas Rewkowski, Abhishek Banerjee, Pooja Guhan, Aniket Bera, and Dinesh Manocha.Text2gestures: A transformer-based network for generating emotive body gestures for virtual agents.In 2021 IEEE Virtual Reality and 3D User Interfaces (VR), pages 1–10. IEEE, 2021.
  • [5]Chen Cao, Hongzhi Wu, Yanlin Weng, Tianjia Shao, and Kun Zhou.Real-time facial animation with image-based dynamic avatars.ACM Transactions on Graphics, 35(4), 2016.
  • [6]Yong Cao, Wen C Tien, Petros Faloutsos, and Frédéric Pighin.Expressive speech-driven facial animation.ACM Transactions on Graphics, 24(4):1283–1302, 2005.
  • [7]Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko.End-to-end object detection with transformers.In Proceedings of the European Conference on Computer Vision, pages 213–229. Springer, 2020.
  • [8]Yujin Chai, Yanlin Weng, Lvdi Wang, and Kun Zhou.Speech-driven facial animation with spectral gathering and temporal attention.Frontiers of Computer Science, 2020.
  • [9]Chun-Fu Chen, Quanfu Fan, and Rameswar Panda.Crossvit: Cross-attention multi-scale vision transformer for image classification.arXiv preprint arXiv:2103.14899, 2021.
  • [10]Lele Chen, Guofeng Cui, Celong Liu, Zhong Li, Ziyi Kou, Yi Xu, and Chenliang Xu.Talking-head generation with rhythmic head motion.In Proceedings of the European Conference on Computer Vision, pages 35–51. Springer, 2020.
  • [11]Lele Chen, Zhiheng Li, Ross K Maddox, Zhiyao Duan, and Chenliang Xu.Lip movements generation at a glance.In Proceedings of the European Conference on Computer Vision, pages 520–535, 2018.
  • [12]Lele Chen, Ross K Maddox, Zhiyao Duan, and Chenliang Xu.Hierarchical cross-modal talking face generation with dynamic pixel-wise loss.In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7832–7841, 2019.
  • [13] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning, pages 1691–1703, 2020.
  • [14] Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882, 2021.
  • [15] Joon Son Chung, Amir Jamaludin, and Andrew Zisserman. You said that? arXiv preprint arXiv:1705.02966, 2017.
  • [16] Joon Son Chung and Andrew Zisserman. Out of time: automated lip sync in the wild. In Asian Conference on Computer Vision, pages 251–263. Springer, 2016.
  • [17] Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael J Black. Capture, learning, and synthesis of 3d speaking styles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10101–10111, 2019.
  • [18] Dipanjan Das, Sandika Biswas, Sanjana Sinha, and Brojeshwar Bhowmick. Speech-driven facial animation using cascaded gans for learning of motion and texture. In European Conference on Computer Vision, pages 408–424. Springer, 2020.
  • [19] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018.
  • [20] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [21] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [22] Pif Edwards, Chris Landreth, Eugene Fiume, and Karan Singh. Jali: an animator-centric viseme model for expressive lip synchronization. ACM Transactions on Graphics, 35(4):1–11, 2016.
  • [23] Bo Fan, Lijuan Wang, Frank K Soong, and Lei Xie. Photo-real talking head with deep bidirectional lstm. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4884–4888. IEEE, 2015.
  • [24] Gabriele Fanelli, Juergen Gall, Harald Romsdorfer, Thibaut Weise, and Luc Van Gool. A 3-d audio-visual corpus of affective communication. IEEE Transactions on Multimedia, 12(6):591–598, 2010.
  • [25] Ohad Fried, Ayush Tewari, Michael Zollhöfer, Adam Finkelstein, Eli Shechtman, Dan B Goldman, Kyle Genova, Zeyu Jin, Christian Theobalt, and Maneesh Agrawala. Text-based editing of talking-head video. ACM Transactions on Graphics, 38(4):1–14, 2019.
  • [26] Meng-Hao Guo, Jun-Xiong Cai, Zheng-Ning Liu, Tai-Jiang Mu, Ralph R Martin, and Shi-Min Hu. Pct: Point cloud transformer. Computational Visual Media, 7(2):187–199, 2021.
  • [27] Ikhsanul Habibie, Weipeng Xu, Dushyant Mehta, Lingjie Liu, Hans-Peter Seidel, Gerard Pons-Moll, Mohamed Elgharib, and Christian Theobalt. Learning speech-driven 3d conversational gestures from video. arXiv preprint arXiv:2102.06837, 2021.
  • [28] Ahmed Hussen Abdelaziz, Barry-John Theobald, Paul Dixon, Reinhard Knothe, Nicholas Apostoloff, and Sachin Kajareker. Modality dropout for improved performance-driven talking faces. In Proceedings of the 2020 International Conference on Multimodal Interaction, pages 378–386, 2020.
  • [29] Xinya Ji, Hang Zhou, Kaisiyuan Wang, Wayne Wu, Chen Change Loy, Xun Cao, and Feng Xu. Audio-driven emotional video portraits. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 14080–14089, 2021.
  • [30] Yifan Jiang, Shiyu Chang, and Zhangyang Wang. Transgan: Two transformers can make one strong gan. arXiv preprint arXiv:2102.07074, 2021.
  • [31] Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics, 36(4):1–12, 2017.
  • [32] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. arXiv preprint arXiv:2101.01169, 2021.
  • [33] Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias Niessner, Patrick Pérez, Christian Richardt, Michael Zollhöfer, and Christian Theobalt. Deep video portraits. ACM Transactions on Graphics, 37(4):1–14, 2018.
  • [34] Avisek Lahiri, Vivek Kwatra, Christian Frueh, John Lewis, and Chris Bregler. Lipsync3d: Data-efficient learning of personalized 3d talking faces from video using pose and lighting normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2755–2764, 2021.
  • [35] Hao Li, Jihun Yu, Yuting Ye, and Chris Bregler. Realtime facial animation with on-the-fly correctives. ACM Transactions on Graphics, 32(4):42–1, 2013.
  • [36] Jiaman Li, Yihang Yin, Hang Chu, Yi Zhou, Tingwu Wang, Sanja Fidler, and Hao Li. Learning to generate diverse dance motions with transformer. arXiv preprint arXiv:2008.08171, 2020.
  • [37] Ruilong Li, Shan Yang, David Ross, and Angjoo Kanazawa. Learn to dance with aist++: Music conditioned 3d dance generation. arXiv preprint arXiv:2101.08779, 2021.
  • [38] Kevin Lin, Lijuan Wang, and Zicheng Liu. End-to-end human pose and mesh reconstruction with transformers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1954–1963, 2021.
  • [39] Jingying Liu, Binyuan Hui, Kun Li, Yunke Liu, Yu-Kun Lai, Yuxiang Zhang, Yebin Liu, and Jingyu Yang. Geometry-guided dense perspective network for speech-driven facial animation. IEEE Transactions on Visualization and Computer Graphics, 2021.
  • [40] Yilong Liu, Feng Xu, Jinxiang Chai, Xin Tong, Lijuan Wang, and Qiang Huo. Video-audio driven real-time facial animation. ACM Transactions on Graphics, 34(6):1–10, 2015.
  • [41] DW Massaro, MM Cohen, M Tabain, J Beskow, and R Clark. Animated speech: research progress and applications. Audiovisual Speech Processing, pages 309–345, 2012.
  • [42] Gaurav Mittal and Baoyuan Wang. Animating face using disentangled audio representations. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pages 3290–3298, 2020.
  • [43] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5206–5210, 2015.
  • [44] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In International Conference on Machine Learning, pages 4055–4064. PMLR, 2018.
  • [45] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.
  • [46] Mathis Petrovich, Michael J Black, and Gül Varol. Action-conditioned 3d human motion synthesis with transformer vae. arXiv preprint arXiv:2104.05670, 2021.
  • [47] Hai X Pham, Samuel Cheung, and Vladimir Pavlovic. Speech-driven 3d facial animation with implicit emotional awareness: a deep learning approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 80–88, 2017.
  • [48] Hai Xuan Pham, Yuting Wang, and Vladimir Pavlovic. End-to-end learning for 3d facial animation from speech. In Proceedings of the ACM International Conference on Multimodal Interaction, pages 361–365, 2018.
  • [49] KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the 28th ACM International Conference on Multimedia, pages 484–492, 2020.
  • [50] Ofir Press, Noah A Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409, 2021.
  • [51] Alexander Richard, Michael Zollhöfer, Yandong Wen, Fernando de la Torre, and Yaser Sheikh. Meshtalk: 3d face animation from speech using cross-modality disentanglement. In Proceedings of the IEEE International Conference on Computer Vision, pages 1173–1182, 2021.
  • [52] Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics, 36(4):1–13, 2017.
  • [53] Sarah Taylor, Taehwan Kim, Yisong Yue, Moshe Mahler, James Krahe, Anastasio Garcia Rodriguez, Jessica Hodgins, and Iain Matthews. A deep learning approach for generalized speech animation. ACM Transactions on Graphics, 36(4):1–11, 2017.
  • [54] Sarah L Taylor, Moshe Mahler, Barry-John Theobald, and Iain Matthews. Dynamic units of visual speech. In Proceedings of the ACM SIGGRAPH/Eurographics conference on Computer Animation, pages 275–284, 2012.
  • [55] Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, and Matthias Nießner. Neural voice puppetry: Audio-driven facial reenactment. In Proceedings of European Conference on Computer Vision, pages 716–731. Springer, 2020.
  • [56] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357, 2021.
  • [57] Guillermo Valle-Pérez, Gustav Eje Henter, Jonas Beskow, André Holzapfel, Pierre-Yves Oudeyer, and Simon Alexanderson. Transflower: probabilistic autoregressive dance generation with multimodal attention. arXiv preprint arXiv:2106.13871, 2021.
  • [58] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  • [59] Konstantinos Vougioukas, Stavros Petridis, and Maja Pantic. Realistic speech-driven facial animation with gans. International Journal of Computer Vision, 128(5):1398–1413, 2020.
  • [60] Qianyun Wang, Zhenfeng Fan, and Shihong Xia. 3d-talkemo: Learning to synthesize 3d emotional talking head. arXiv preprint arXiv:2104.12051, 2021.
  • [61] Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
  • [62] Thibaut Weise, Sofien Bouaziz, Hao Li, and Mark Pauly. Realtime performance-based facial animation. ACM Transactions on Graphics, 30(4):1–10, 2011.
  • [63] Olivia Wiles, A Koepke, and Andrew Zisserman. X2face: A network for controlling face generation using images, audio, and pose codes. In Proceedings of the European Conference on Computer Vision, pages 670–686, 2018.
  • [64] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. arXiv preprint arXiv:2105.15203, 2021.
  • [65] Yuyu Xu, Andrew W Feng, Stacy Marsella, and Ari Shapiro. A practical and configurable lip sync method for games. In Proceedings of Motion on Games, pages 131–140, 2013.
  • [66] Ran Yi, Zipeng Ye, Juyong Zhang, Hujun Bao, and Yong-Jin Liu. Audio-driven talking face video generation with learning-based personalized head pose. arXiv preprint arXiv:2002.10137, 2020.
  • [67] Dan Zeng, Han Liu, Hui Lin, and Shiming Ge. Talking face generation with expression-tailored generative adversarial network. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1716–1724, 2020.
  • [68] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip Torr, and Vladlen Koltun. Point transformer. arXiv preprint arXiv:2012.09164, 2020.
  • [69] Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, and Xiaogang Wang. Talking face generation by adversarially disentangled audio-visual representation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9299–9306, 2019.
  • [70] Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, and Ziwei Liu. Pose-controllable talking face generation by implicitly modularized audio-visual representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4176–4186, 2021.
  • [71] Yang Zhou, Zhan Xu, Chris Landreth, Evangelos Kalogerakis, Subhransu Maji, and Karan Singh. Visemenet: Audio-driven animator-centric speech animation. ACM Transactions on Graphics, 37(4):1–10, 2018.
  • [72] Michael Zollhöfer, Justus Thies, Pablo Garrido, Derek Bradley, Thabo Beeler, Patrick Pérez, Marc Stamminger, Matthias Nießner, and Christian Theobalt. State of the art on monocular 3d face reconstruction, tracking, and applications. In Computer Graphics Forum, pages 523–550, 2018.

Supplementary Material

In this supplementary material, we provide further information about FaceFormer, including a detailed explanation of the FaceFormer architecture and the training details (Sec. 6), the implementations of the baseline methods (Sec. 7), and additional information about the user study (Sec. 8).
