MobilePortrait: Real-Time One-Shot Neural Head Avatars on Mobile Devices

Existing neural head avatar methods have achieved significant progress in the image quality and motion range of portrait animation. However, these methods neglect the computational overhead, and to the best of our knowledge, none is designed to run on mobile devices. This paper presents MobilePortrait, a lightweight one-shot neural head avatar method that reduces learning complexity by integrating external knowledge into both motion modeling and image synthesis, enabling real-time inference on mobile devices. Specifically, we introduce a mixed representation of explicit and implicit keypoints for precise motion modeling and precomputed visual features for enhanced foreground and background synthesis. With these two key designs and simple U-Nets as backbones, our method achieves state-of-the-art performance with less than one-tenth the computational demand. It has been validated to reach speeds of over 100 FPS on mobile devices and to support both video- and audio-driven inputs. Video samples are available on the project page.
Keywords: Neural Head Avatar, Face Reenactment, Talking Head Generation

Figure 1: The provided examples (on the left) demonstrate that our method can achieve results comparable to or even better than those of current high-computation state-of-the-art methods, but with less than one-tenth of the computational cost. On the right, a bubble chart compares various methods, with the size of each bubble representing the model's parameter size. This further confirms that our method can produce high-quality results while offering a significant advantage in computational efficiency.
1 Introduction
One-Shot Neural head avatar (NHA) is a technology that animates a single facial image according to a specified driving signal, which could be video or audio, to synthesize a portrait video. In recent years, significant improvements [4, 34, 29, 38, 9, 32, 17, 33, 30] have been made in the quality of the generated image and the range of motion. However, existing NHA approaches concentrate on achieving realism and robustness in image synthesis with models that are increasingly complex, typically surpassing 100 GFLOPs [32, 29, 38, 9], leading to the under-exploration of lightweight NHA. With the swift progress in large language models (LLMs) and the widespread use of smartphones, avatars on mobile devices are poised to become a crucial interface for AI interaction. This prospect has driven us to develop an efficient one-shot neural head avatar model optimized for performance on mobile platforms.
Initially, we attempted to convert existing SOTA models into ones that could be deployed on mobile devices. However, we found that these models incorporate many complex modules in their structural design, such as memory modules, dynamic convolutions [9], attention [9, 15], multiscale feature warping [9, 29, 38], or image-to-plane [32] methods. Reducing computational complexity is a challenging task, and the complexity of these models increases the difficulty and development workload of deploying them on mobile devices. Therefore, we began to reflect on the rationale underlying these methods and aimed to construct a lightweight NHA model through the most essential and straightforward design.
In fact, Real3D [32] and MCNet [9] represent two distinct categories of motion modeling: explicit facial movement modeling and implicit global motion modeling. Explicit modeling [34, 32, 17] methods often involve predefined facial keypoints or 3D face representation to capture motion driven by facial movements. This results in undefined motion in regions beyond the face, necessitating a powerful motion network to extrapolate and fill in the motion for these areas based solely on facial movements. Implicit modeling methods [30, 38, 21, 20, 9] use an encoder to extract global and image-level motion from inputs without facial priors, representing motion with neural keypoints or latents, which requires a powerful motion network to define facial and background movements.
From Figure 1, we can observe that the explicit modeling method, Real3D, produces poor results in areas not defined by the 3DMM, such as inside the mouth and around the neck. Meanwhile, the implicit modeling method, MCNet, produces notable blurriness at the boundary between the person and the background, possibly due to the lack of an explicit facial region prior. This observation inspired us to develop a more holistic and efficient approach to motion modeling that combines face-specific knowledge with global motion representations, complementing each other.
The results shown in Figure 1 not only reveal issues with motion capturing but also indicate inadequate appearance synthesis capability of the model. Many recent advancements [19, 11, 23, 27, 1] in 3D-related fields have been driven by the integration of multiview designs into network architectures. The rationale is intuitive: a network exposed to more appearances can learn more effectively. This concept has led us to explore a parallel avenue: if facial knowledge can strengthen our motion network, might the incorporation of appearance knowledge similarly strengthen our synthesis network? The integration of appearance knowledge may redefine the synthesis network’s task, transforming it from generating all content with a powerful network to efficiently completing content with the provided appearance knowledge, akin to shifting from a closed-book to an open-book exam. Importantly, this can be achieved with virtually no increase in computational load during runtime because appearance knowledge can be prepared in advance.
Building on the aforementioned observations and considerations, we meticulously design MobilePortrait, our lightweight one-shot neural head avatar method. First, we utilize lightweight U-Nets with conventional convolutional layers as the backbones for the motion and synthesis networks, significantly reducing computational requirements compared to existing methods while remaining easy to implement on mobile devices. Second, to compensate for potential losses in motion accuracy due to the reduced computation, we combine implicit global motion modeling with explicit facial motion modeling, introducing mixed keypoints to capture motion. We also design facial knowledge losses to ensure the incorporation of facial knowledge. Lastly, in the image synthesis phase, we incorporate appearance knowledge, utilizing pseudo multiview features and pseudo backgrounds to enhance the synthesis of the foreground and background, respectively. With these proposed designs, MobilePortrait achieves performance on par with or exceeding state-of-the-art methods with far less computational demand, as shown in Figure 1. Our contributions are succinctly outlined as follows:
- We introduce MobilePortrait, which, to the best of our knowledge, is the first one-shot mobile neural head avatar method capable of real-time performance.
- We streamline the task by leveraging external facial and appearance knowledge, merging explicit and implicit keypoints for comprehensive motion capture, and including features like pseudo multiview and background for improved synthesis. This approach allows MobilePortrait to efficiently create neural head avatars with lightweight U-Net [18] backbones.
- Extensive testing across various datasets confirms MobilePortrait's effectiveness. It achieves state-of-the-art performance while requiring significantly fewer FLOPs and parameters. Moreover, we have also verified that MobilePortrait can render at speeds of up to 100+ FPS on mobile devices and support both video and audio driving inputs.
2 Related Works
Neural head avatar generation can be categorized into video-driven and cross-modal driven approaches. Video-driven neural head avatar methods mainly consist of two important parts: motion modeling and image synthesis. The former captures the motion between the source and driving images, while the latter generates the animated pixels. For motion modeling, some methods [20, 21, 38, 9, 29, 30] propose frameworks for decoupling appearance and motion representation in an unsupervised manner. For instance, [20, 21, 38, 9] involve learning to detect 2D implicit keypoints from images and further predict an explicit warping flow. FaceV2V [29] expands the network architecture dimension and learns 3D implicit keypoints for motion modeling. LIA [30] constructs a latent motion space and represents motion as a linear displacement of the latent code. In contrast, other works [34, 17, 32] rely on explicit motion representation, such as pre-defined facial landmarks and blendshapes. For example, MetaPortrait [34] uses facial landmarks as input to predict a warp flow, while PIRenderer [17] and Real3D [32] employ the 3DMM model [22] to facilitate decoupling of the control of facial rotation, translation, and expression. Although significant progress has been made, implicit and explicit modeling still remain largely independent of each other. For image synthesis, some methods [20, 21, 38] predict motion flow to directly warp the source image, utilizing more original pixel information, while other approaches [29, 34, 9, 4] opt to warp features, offering greater flexibility for subsequent generative networks. StyleHeat [33] explores using a pretrained StyleGAN [12] as a generator, achieving neural head avatars through latent edits on the powerful generator.
Current cross-modal methods, mainly audio-driven, aim to generate motion signals from audio for natural talking head videos with accurate lip-sync and expressive facial animations. They typically produce driving signals as output and use separately trained video-driven models as video renderers. SadTalker [36] uses a PoseVAE and ExpNet to generate head pose and expression as motion descriptors derived from audio and adopts FaceV2V [29] as the renderer. VividTalk [24] designs an Audio-To-Mesh module to predict 3DMM expression coefficients and lip-related vertex offsets based on an input audio signal and a reference image, while utilizing another mesh-to-video model as a renderer. Some recent works [8, 14, 35] explore the use of diffusion models as audio-to-motion modules, employing VAEs [13] or existing video-driven models as renderers to enhance the accuracy and expressiveness of motion signals. EMO [26] designs an end-to-end diffusion model and can generate highly realistic portrait videos based on audio input. Although the inference process is end-to-end, the multi-stage training procedure for the network can, to some extent, correspond to the audio-to-motion and render modules. Image quality in audio-driven methods largely hinges on the rendering model or the module managing image quality. Given the driving signal's low transmission cost, it is suitable for server-side deployment. Thus, an efficient renderer is key for audio-driven neural head avatars on edge devices.
In general, existing works typically pursue complex network architectures and high computational complexity to achieve high-quality animation but rarely consider scenarios with limited computational resources, leaving on-device neural head avatar generation largely unexplored.
Figure 2: The video-driven pipeline of MobilePortrait. MobilePortrait processes source and driving image to generate mixed keypoints that are merged from detected neural and facial keypoints. These mixed keypoints, along with precomputed source masks, are used to create optical flow for image warping via a dense motion network. The synthesis network generates the final image by combining the warped image with precomputed pseudo background and multiview foreground features. Since facial and appearance knowledge is precomputed just once, the two simple U-Net backbones account for nearly all of the computational load during inference. In audio-driven mode, an audio-to-keypoints module supplies the driving keypoints.
3 Method
This section first provides an overview of MobilePortrait’s architecture, shown in Figure 2, comprising two primary modules: motion generation and image synthesis. Then in Section 3.2 we describe the hybrid motion modeling designed within the motion generation module, which utilizes both explicit and implicit facial keypoints. Next, we introduce techniques that enhance image synthesis through precomputed appearance knowledge in Section 3.3. Subsequently, in Section 3.4, we present the audio-to-motion module, which allows MobilePortrait to be driven by audio input. Finally, we outline the loss functions employed during training in Section 3.5.
3.1 Overview of MobilePortrait
As depicted in Figure 2, with video-driven animation as an example, MobilePortrait processes the source image 𝐒 and each driving frame 𝐃 from the driving video, generating target images frame by frame. Specifically, within the motion generation module, the Keypoint Detectors first produce a set of keypoints for both 𝐒 and 𝐃, i.e. $\{x_{s,i}, y_{s,i}\}_{i=1}^{N_{mk}}$ and $\{x_{d,i}, y_{d,i}\}_{i=1}^{N_{mk}}$, which are our proposed mixed keypoints in MobilePortrait. The subsequent warping and generation process is similar to previous works [38, 9, 20, 21]. Based on these keypoints, we follow TPS [38] to generate the initial transformations. Keypoints represented as heatmaps are input into the dense motion network and combined with the initial transformations to generate the motion field 𝐌, delineating the pixel displacement from 𝐒 to 𝐃, or in other words, the optical flow. Based on the source image and the optical flow, a warp operation is performed to obtain the initial warped image, which is then multiplied by another output of the dense motion network, the occlusion maps, to produce the final warped image 𝐒w. Subsequently, the Image Synthesis module leverages 𝐒w and auxiliary appearance knowledge features derived from 𝐒 to create the final target image through a synthesis network. For efficient computation and mobile deployment friendliness, we retain simple U-Nets, without the additions from prior work [9, 38, 29] such as multi-scale feature warping, dynamic convolution, and attention modules, as backbones for both the dense motion network and the synthesis network.
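To make the data flow concrete, below is a minimal PyTorch-style sketch of one video-driven inference step as described above. It is an illustration under assumptions, not the released implementation: `dense_motion_net` and `synthesis_net` stand for the two U-Net backbones, their exact input and output layouts are guesses, and the Gaussian heatmap rendering is one common choice.

```python
import torch
import torch.nn.functional as F


def keypoints_to_heatmaps(kp, size, sigma=0.1):
    """Render keypoints (B, N, 2), normalized to [-1, 1], as Gaussian heatmaps (B, N, H, W)."""
    H, W = size
    ys = torch.linspace(-1, 1, H).view(1, 1, H, 1)
    xs = torch.linspace(-1, 1, W).view(1, 1, 1, W)
    kx = kp[..., 0].view(*kp.shape[:2], 1, 1)
    ky = kp[..., 1].view(*kp.shape[:2], 1, 1)
    return torch.exp(-((xs - kx) ** 2 + (ys - ky) ** 2) / (2 * sigma ** 2))


@torch.no_grad()
def animate_frame(S, S_kp, D_kp, S_masks, appearance_feats,
                  dense_motion_net, synthesis_net):
    """One video-driven inference step (sketch). S: (1, 3, H, W) source image;
    S_kp / D_kp: mixed keypoints of the source / driving frame; S_masks: precomputed
    foreground and landmark masks of S; appearance_feats: precomputed pseudo
    background and multiview foreground features."""
    # 1) Keypoint heatmaps for source and driving frame feed the dense motion network.
    hm = torch.cat([keypoints_to_heatmaps(S_kp, S.shape[-2:]),
                    keypoints_to_heatmaps(D_kp, S.shape[-2:])], dim=1)
    flow, occlusion = dense_motion_net(torch.cat([S, hm, S_masks], dim=1))

    # 2) Warp the source with the predicted flow (assumed to be a (1, H, W, 2)
    #    sampling grid) and attenuate occluded regions with the occlusion map.
    S_w = F.grid_sample(S, flow, align_corners=True) * occlusion

    # 3) Complete the frame from the warped source plus precomputed appearance knowledge.
    return synthesis_net(S_w, appearance_feats)
```

Because the masks and appearance features depend only on the source image, they are computed once and reused for every driving frame, which is why the two U-Nets dominate the per-frame cost.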
Figure 3: The motion generation process of MobilePortrait. (a) represents the optical flow generation method adopted by our MobilePortrait, where NK and FK denote the neural and facial keypoints, respectively. (b) is the method used in prior works [20, 38, 9, 21]; (c) is similar to [34], which directly obtains optical flow through a CNN. For brevity, we omit the heatmap generation and occlusion process.
3.2 Motion Generation with Facial Knowledge
Mixed Keypoint Representation. In the Motion Generation module, prior works such as FOMM [20], TPS [38], and MCNet [9] employ similar network design structures. They utilize a neural keypoint predictor, denoted as NK detector, to separately predict a pair of keypoints for 𝐒 and 𝐃, and based on these keypoints, construct an initial collection of transformations. Dense motion network (DMN) then predicts local weights for this collection of transformations and occlusion maps for warped image. The optical flow field is obtained through a weighted summation of these elements. This process is similar to part (b) described in Figure 3.
Neural keypoints enable the network to learn global motion information, as well as facial movements, from the driving video. However, as the computational load of the dense motion network decreases, the network struggles to distinguish between the motion of the face and that of the background, leading to severe artifacts akin to a "liquefaction" effect, or even an inability to drive the synthesized video, as shown in the visualization results in Figure 4. To address this, we introduce a pretrained face keypoint detector to extract facial landmarks from 𝐒 and 𝐃 respectively. A mixed keypoint predictor, the merger shown in Figure 2, then merges the neural keypoints and the face keypoints to create mixed keypoints. As shown in the left part of Figure 3, once the mixed keypoints are calculated, we proceed to compute the optical flow based on these keypoints, replacing the neural keypoints used in previous methods [9, 38, 20, 21]. Our experiments indicate that integrating implicit and explicit keypoints effectively reduces global liquefaction artifacts and enhances motion precision in the generated videos, and also performs better than other ways of incorporating facial information. Additionally, inspired by MetaPortrait [34] and the ResNet [7] architecture, we add two extra output channels to the last layer of our dense motion network. This modification enables the network to produce a residual optical flow, enhancing the expressiveness of the generated optical flow.
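As a concrete illustration of the merger, the sketch below follows the implementation details in Section 4 (keypoints are concatenated and processed by an MLP, fusing 50 neural keypoints and 106 facial landmarks into 50 mixed keypoints); the hidden width and the flatten-based fusion are assumptions.

```python
import torch
import torch.nn as nn


class MixedKeypointPredictor(nn.Module):
    """Merge implicit neural keypoints (NK) and explicit facial landmarks (FK)
    into mixed keypoints; the hidden layer width is an assumption."""

    def __init__(self, n_nk=50, n_fk=106, n_out=50, hidden=256):
        super().__init__()
        self.n_out = n_out
        self.mlp = nn.Sequential(
            nn.Linear((n_nk + n_fk) * 2, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, n_out * 2),
        )

    def forward(self, nk, fk):
        # nk: (B, 50, 2) neural keypoints, fk: (B, 106, 2) facial landmarks,
        # both normalized to [-1, 1].
        x = torch.cat([nk, fk], dim=1).flatten(1)       # (B, (50 + 106) * 2)
        return self.mlp(x).view(-1, self.n_out, 2)      # (B, 50, 2) mixed keypoints
```

The predictor is applied to the source and the driving frame separately, and the resulting mixed keypoints replace the purely neural keypoints in the heatmap and initial-transformation steps.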
Face-Aware Motion Generation. In addition to incorporating facial priors into the keypoints representation, we enrich the input to the dense motion network with a foreground mask and facial landmarks mask from the source image. These only need to be computed once, preserving real-time inference capabilities. As shown in Figure 2, we further design a facial knowledge loss. Specifically, we add two predictors for these masks to the last feature layer of the DMN, which are trained with L1 losses to predict the foreground and landmarks mask for the driving image. These predictors, existing only during training, help the model to better understand portrait integrity, facilitating improved face-aware motion generation.
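A hedged sketch of this training-only branch is given below; the paper specifies two predictors attached to the last DMN feature layer and trained with L1 losses, while the 1×1 convolution heads, the sigmoid activations, and the channel width are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FacialKnowledgeHeads(nn.Module):
    """Training-only heads on the dense motion network's last feature layer,
    supervised with L1 losses; both heads are discarded at inference."""

    def __init__(self, feat_channels=64):  # feature width is an assumption
        super().__init__()
        self.fg_head = nn.Conv2d(feat_channels, 1, kernel_size=1)
        self.lmk_head = nn.Conv2d(feat_channels, 1, kernel_size=1)

    def forward(self, dmn_features, fg_mask_gt, lmk_mask_gt):
        # Resize the ground-truth masks of the driving frame to the feature resolution.
        size = dmn_features.shape[-2:]
        fg_gt = F.interpolate(fg_mask_gt, size=size, mode="bilinear", align_corners=False)
        lmk_gt = F.interpolate(lmk_mask_gt, size=size, mode="bilinear", align_corners=False)

        fg_pred = torch.sigmoid(self.fg_head(dmn_features))
        lmk_pred = torch.sigmoid(self.lmk_head(dmn_features))
        return F.l1_loss(fg_pred, fg_gt), F.l1_loss(lmk_pred, lmk_gt)  # L_mask, L_landmark
```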
With these enhancements, our motion generation module leverages external facial knowledge to perform motion capture at both the face level and the video level, with virtually no increase in computational cost. This enables the model to generate a plausible optical flow 𝐌 even when the computational load of dense motion network is reduced.
3.3 Image Synthesis with Appearance Knowledge
Image synthesis based on the warped source image capitalizes on the original pixel information, but since warping itself does not create new pixel data, relying solely on the warped source may lead to diminished synthesis quality when pose angles change, as shown in Figure 1. To compensate for the decrease in synthesis quality due to reduced complexity, we use the warped source image as the input to a U-Net-based synthesis network and introduce precomputed visual features from the source image to reduce the burden on the Image Synthesis module.
Enhanced Foreground Synthesis. We sample T frames uniformly from the driving video and, together with the source image, generate T newly warped images using our motion generation module. As depicted in the top-right part of Figure 2, to ensure efficient feature extraction and fusion, we opt for the final downblocks of the U-Net, corresponding to the lowest spatial resolution. The early layers of the U-Net, up to the last downblock, are used to extract features from the newly warped images to obtain multiview features. An additional convolution layer merges the multiview features with those of the current frame within the corresponding downblock. Apart from this, there are no further differences or additional computational burdens imposed on the synthesis network. These pseudo multiview image features offer appearance information from different poses to help enhance the quality of synthesis and can be precomputed, thus not hindering inference efficiency.
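The following sketch illustrates how such a fusion could look; `unet_down` denotes the synthesis U-Net's encoder up to its last downblock, and the channel width, number of views, and 3×3 fusion kernel are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn


class MultiviewFusion(nn.Module):
    """Merge precomputed pseudo multiview features into the current frame's
    lowest-resolution U-Net features with a single extra convolution."""

    def __init__(self, channels=256, n_views=4):  # channel width and T are assumptions
        super().__init__()
        self.fuse = nn.Conv2d(channels * (n_views + 1), channels, kernel_size=3, padding=1)

    @torch.no_grad()
    def precompute_views(self, unet_down, warped_refs):
        # warped_refs: T reference images warped offline by the motion module;
        # their features are extracted once per source image and cached.
        return [unet_down(img) for img in warped_refs]

    def forward(self, current_feat, view_feats):
        # current_feat: (B, C, h, w) last-downblock features of the warped source;
        # view_feats: list of T cached (B, C, h, w) multiview features.
        return self.fuse(torch.cat([current_feat] + view_feats, dim=1))
```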
Enhanced Background Synthesis. We employ an offline inpainting model to fill in the source image after foreground removal, creating a complete background picture as shown in the top-right part of Figure 2. This inpainted background, along with a mask of the foreground, serves as extra inputs to the synthesis network. To ensure that the Image Synthesis module can effectively utilize this background information, we perform inpainting on the driving image during training, which has proven crucial in our experiments.
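A sketch of this one-off preparation step is shown below; `segment_foreground` and `inpaint` are placeholders for the external foreground segmenter [2] and the LaMa [25] inpainting model, whose real APIs are not reproduced here.

```python
def precompute_background(S, segment_foreground, inpaint):
    """One-off preparation of background knowledge for a source image S of
    shape (1, 3, H, W); the two callables wrap the external models."""
    fg_mask = segment_foreground(S)           # (1, 1, H, W), 1 = person
    bg_only = S * (1.0 - fg_mask)             # remove the foreground
    pseudo_bg = inpaint(bg_only, fg_mask)     # completed background picture

    # Both tensors are appended to the synthesis network's input channels for
    # every frame, adding no per-frame computation of their own.
    return pseudo_bg, fg_mask
```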
With these improvements, our image synthesis can rely on a simple yet efficient U-Net backbone while maintaining high-quality synthesis results during inference, with negligible additional computational cost.
3.4 Audio-Driven Functionality
In this section, we first introduce a baseline solution that enables MobilePortrait to support audio-driven functionality. To enable MobilePortrait to process audio-driven signals, we need to extract neural keypoints and facial keypoints from the audio input. Inspired by the audio-to-motion designed in SadTalker [36] and VividTalk [24], we train an audio-to-motion model that includes two modules: audio-to-mesh and mesh-to-neural keypoints. The former uses LSTM to convert audio signals into 3D Morphable Model (3DMM) coefficients to acquire facial meshes, whereas the latter employs a ResNet18 [7] to predict neural keypoints from images sketched with sampled mesh vertices and edges. Facial keypoints are directly extracted from the mesh. With this setup, we capture the necessary motion signals, including neural and facial keypoints, for driving MobilePortrait with audio input. Thanks to the trained mesh-to-neural keypoints module, MobilePortrait can also be driven by 3DMM. This not only facilitates expression editing via 3DMM but also enhances results in cross-identity scenarios when driven by 3DMM. It is important to note that we provide merely a baseline solution here, enabling audio-driven capability for MobilePortrait. MobilePortrait can accommodate more sophisticated designs [24, 32, 35, 14] to achieve improved results.
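The sketch below outlines this baseline audio-to-motion pipeline; the LSTM and coefficient dimensions, the mel-feature input, and the use of torchvision's ResNet-18 as the mesh-to-keypoints backbone are assumptions consistent with, but not copied from, the description above.

```python
import torch.nn as nn
from torchvision.models import resnet18


class AudioToMesh(nn.Module):
    """Audio features -> per-frame 3DMM expression coefficients (sketch);
    feature and coefficient dimensions are assumptions."""

    def __init__(self, audio_dim=80, hidden=256, n_coeffs=64):
        super().__init__()
        self.lstm = nn.LSTM(audio_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_coeffs)

    def forward(self, audio_feats):              # (B, T, audio_dim), e.g. mel features
        h, _ = self.lstm(audio_feats)
        return self.head(h)                      # (B, T, n_coeffs)


class MeshToNeuralKeypoints(nn.Module):
    """Rasterized mesh sketch (sampled vertices and edges) -> 50 neural keypoints."""

    def __init__(self, n_kp=50):
        super().__init__()
        self.n_kp = n_kp
        self.backbone = resnet18(weights=None)
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, n_kp * 2)

    def forward(self, mesh_sketch):              # (B, 3, H, W) image of the mesh
        return self.backbone(mesh_sketch).view(-1, self.n_kp, 2)
```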
3.5 Training Losses
Following previous works [20, 38, 9], we employ a perceptual loss $\mathcal{L}_{percep}$ and an L1 loss $\mathcal{L}_{L1}$ to optimize feature and pixel distances, a keypoint distance loss $\mathcal{L}_{kp}$ for facial keypoint accuracy, and an equivariance loss [38] $\mathcal{L}_{eq}$ for neural keypoint stability. Additionally, we add two proposed facial knowledge loss terms (shown in the top-left part of Figure 2), implemented as L1 losses, to make the dense motion network aware of the landmark mask ($\mathcal{L}_{landmark}$) and the foreground mask ($\mathcal{L}_{mask}$). The final loss can be written as follows:
$$\mathcal{L} = \mathcal{L}_{percep} + \mathcal{L}_{L1} + \mathcal{L}_{kp} + \mathcal{L}_{eq} + \mathcal{L}_{landmark} + \mathcal{L}_{mask} \tag{1}$$
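Read literally, Eq. (1) combines the terms with unit weights; the sketch below reflects that reading (in practice, per-term weights and the exact perceptual and equivariance implementations follow [20, 38] and are not specified here).

```python
import torch.nn.functional as F


def total_loss(pred, gt, kp_pred, kp_gt, eq_loss, masks_pred, masks_gt, feat_extractor):
    """Sum the terms of Eq. (1) with unit weights (as written). feat_extractor is a
    frozen perceptual network returning a list of feature maps; eq_loss is the
    equivariance term computed as in [38]; masks_* hold the predicted and
    ground-truth landmark / foreground masks."""
    l_percep = sum(F.l1_loss(pf, gf)
                   for pf, gf in zip(feat_extractor(pred), feat_extractor(gt)))
    l_l1 = F.l1_loss(pred, gt)
    l_kp = F.l1_loss(kp_pred, kp_gt)                       # facial keypoint distance
    l_landmark = F.l1_loss(masks_pred[0], masks_gt[0])     # landmark mask
    l_mask = F.l1_loss(masks_pred[1], masks_gt[1])         # foreground mask
    return l_percep + l_l1 + l_kp + eq_loss + l_landmark + l_mask
```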
4 Experiments
Experimental Setup. To rigorously assess method effectiveness, we trained and tested our approach using various datasets. For training, we leveraged the VFHQ [31], VoxCeleb2 [3], and CelebvHQ [40] datasets, which together comprise 16,827 high-resolution portrait clips from VFHQ, 35,666 clips at 512×512 from CelebvHQ, and 150,480 clips at 256×256 from VoxCeleb2, collectively representing more than 21,000 distinct identities. To evaluate generalization, we constructed a test set drawn from multiple datasets, including Talking Head 1K [29], CCv2 [6], and HDTF [37], from which we randomly sampled 38, 137, and 100 videos, respectively, according to the dataset proportions. Videos were processed at 25 FPS, square-cropped based on face detection, and resized to 512 px.
Implementation Details. For the facial keypoints FK, we adopt the 106-landmark protocol, and for the neural keypoints NK, we select 50 points. The mixed keypoint predictor is realized by concatenating the keypoints and processing them through an MLP. By fusing FK and NK, we obtain 50 mixed keypoints. A foreground segmenter [2] is employed for mask extraction, and LaMa [25] is used for background inpainting. MobilePortrait is trained on 8 NVIDIA A100 GPUs with a learning rate of 0.002 for 60 epochs.
Metrics. To comprehensively evaluate the efficacy of our method, we employ multiple metrics and assess both same-identity and cross-identity reenactment. To evaluate the quality of the generated images, we use common image quality metrics [9, 32], including reference-based indices such as FID, SSIM, and PSNR, as well as the identity preservation indicator CSIM. To measure the accuracy and stability of the synthesized motion, we evaluate the average keypoint distance (AKD), head pose distance (HPD), and expression error (AED). Additionally, referencing recent text-to-video evaluation metrics [10], we add a background consistency index (BCI).
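As an example of how these motion metrics are typically computed, AKD can be taken as the mean distance between landmarks detected on generated and ground-truth frames; the sketch below follows that common definition, with `detect_landmarks` standing in for the evaluation landmark detector (the exact protocols of [9, 32] may differ).

```python
import torch


def average_keypoint_distance(gen_frames, gt_frames, detect_landmarks):
    """AKD: mean landmark distance between generated and ground-truth frames.
    detect_landmarks returns an (N, 2) tensor of landmarks per frame."""
    dists = []
    for gen, gt in zip(gen_frames, gt_frames):
        kp_gen = detect_landmarks(gen)
        kp_gt = detect_landmarks(gt)
        dists.append(torch.linalg.norm(kp_gen - kp_gt, dim=-1).mean())
    return torch.stack(dists).mean()
```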
Compared methods. To validate the superiority and effectiveness of our method, we conducted comparative tests with recent top-performing methods, including latent-driven TPS, MCNet and FaceV2V, as well as approaches that use landmarks and 3DMM like Real3D, PIRender. For fair comparisons, we trained all methods on the same datasets as previously described, with the exception of Real3D. Due to its complex training requirements, we use the official release model trained on CelebVHQ. To account for this, we also included the performance of our method under the same conditions in Table 5. Notably, these methods were not initially designed for computational efficiency, which often results in them having higher FLOPs and parameter counts, making them challenging to deploy on mobile devices.
Table 1: Comparisons with SOTA methods in video-driven same/cross-identity reenactment. Bold means best scores and underline means top-3 scores. Metric columns marked (C) are for cross-identity reenactment; the remaining metric columns are for same-identity reenactment.
Method | FID↓ | PSNR↑ | SSIM↑ | AKD↓ | HPD↓ | AED↓ | BCI↑ | CSIM(C)↑ | HPD(C)↓ | AED(C)↓ | BCI(C)↑ | FLOPs(G)
---|---|---|---|---|---|---|---|---|---|---|---|---
PIRender [17] | 39.1 | 22.7 | 77.8 | 2.14 | 0.99 | 0.09 | 96.9 | 45.7 | 4.50 | 0.15 | 96.7 | 131
FaceV2V [29] | 29.3 | 22.5 | 85.3 | 1.96 | 2.52 | 0.06 | 97.2 | 46.0 | 5.45 | 0.15 | 97.2 | 629
TPS [38] | 29.8 | 27.3 | 87.7 | 1.43 | 0.71 | 0.06 | 97.9 | 38.9 | 4.61 | 0.15 | 97.6 | 140
MCNet [9] | 27.2 | 28.5 | 88.7 | 1.33 | 0.81 | 0.05 | 97.8 | 27.6 | 6.69 | 0.16 | 97.5 | 200
Real3D [32] | 50.8 | 23.1 | 80.6 | 1.63 | 0.82 | 0.08 | 97.6 | 47.8 | 3.74 | 0.17 | 97.5 | 610
Ours | 29.2 | 26.1 | 85.9 | 1.30 | 0.40 | 0.05 | 98.2 | 39.2 | 2.74 | 0.13 | 97.9 | 16
Table 2: Experimental results of models with different FLOPs.
FLOPs | Device | #Param. | Latency
---|---|---|---
16G | iPhone14 Pro | 67.7M | 15.8ms |
7G | iPhone14 Pro | 40.8M | 6.4ms |
4G | iPhone14 Pro | 25.5M | 5.9ms |
16G | iPhone12 | 67.7M | 25.5ms |
7G | iPhone12 | 40.8M | 10.9ms |
4G | iPhone12 | 25.5M | 8.9ms |
Figure 4: Visualization comparisons among models with different FLOPs.

Table 3: Ablation studies of motion generation.
Method | FID | AKD | AED(C) | HPD(C)
---|---|---|---|---
Mixed Keypoint | 29.2 | 1.30 | 3.0 | 0.13
NK-Only | 48.3 | 2.62 | 3.9 | 0.17
FK-Only | 33.2 | 1.61 | 10.5 | 0.13
No Proposed Loss | 29.1 | 1.45 | 4.19 | 0.13

Method | FID | AKD | AED(C) | HPD(C)
---|---|---|---|---
Ours | 29.2 | 1.30 | 3.0 | 0.13
No Residual O.F. | 28.5 | 1.45 | 6.2 | 0.13
Conv. Motion | 43.9 | 1.49 | 3.7 | 0.15
Sparse Motion | 44.7 | 3.14 | 9.5 | 0.14
Table 4: Ablation studies of image synthesis.
Inp. BG | FG Comp. | FID | AKD | AED(C) | HPD(C)
---|---|---|---|---|---
 |  | 30.1 | 1.54 | 7.3 | 0.14
✓ |  | 29.2 | 1.30 | 2.74 | 0.13
 | ✓ | 30.0 | 1.52 | 5.70 | 0.13
✓ | ✓ | 29.7 | 1.47 | 10.0 | 0.13
#Views | FID | AKD | AED(C) | HPD(C)
---|---|---|---|---
No Ref. | 34.2 | 2.53 | 3.21 | 0.13
2 | 31.3 | 1.53 | 3.07 | 0.13
4 | 29.2 | 1.30 | 2.7 | 0.13
8 | 30.2 | 1.31 | 2.1 | 0.13
Table 5: Ablation studies of training datasets.
Method | FID↓ | PSNR↑ | SSIM↑ | AKD↓ | HPD(C)↓ | AED(C)↓ | CSIM(C)↑
---|---|---|---|---|---|---|---
Full Datasets | 29.2 | 26.1 | 85.9 | 1.30 | 2.7 | 0.13 | 39.2
remove VoxCeleb2 | 32.5 | 26.0 | 85.8 | 1.43 | 2.9 | 0.13 | 37.7
remove VFHQ | 37.1 | 25.4 | 84.3 | 1.49 | 3.6 | 0.13 | 38.5
4.1 Comparisons with SOTA methods
In this section, we contrast MobilePortrait’s video-driven performance with other techniques in Table 1, and will later include an audio-driven comparison. Given that audio-driven methods often use video-driven approaches for rendering, video-driven analysis serves as a reliable measure of synthesis quality. In same-id scenarios, the source image is sampled from the driving video, meaning there exists ground truth video for reference. In cross-id scenarios, the source image is not derived from the driving video; instead, we randomly select and sample a frame from another video in datasets, so there is no GT video for direct comparison, and we do not assess reference-based image quality metrics.
It can be discerned that MobilePortrait, despite employing a smaller computational load, achieves outcomes comparable to those with greater computational resources and excels in key metrics, leading in AKD and BCI and ranking second in FID, which assess motion and image quality. Furthermore, during cross-identity reenactment, the lead in HPD, AED and BCI metrics also demonstrates the effectiveness of MobilePortrait. While MobilePortrait does not achieve the best results in the CSIM, later visualization results show that it can yield satisfactory outcomes.
4.2 Comparisons among Different Computational Loads
Here, a comparative analysis of performance across various computational scales (FLOPs) is provided in the right part of Table 2. By reducing the number of channels and layers, we obtain models of different sizes. MobilePortrait remarkably maintains satisfactory performance on key metrics such as FID and AKD, as well as cross-identity motion accuracy such as HPD and AED, even when computational resources are limited to just 4 GFLOPs, marking a significant improvement over the baseline, which does not incorporate the external facial and appearance knowledge demonstrated in Figure 2. Moreover, the visualization results in Figure 4 showcase our approach's effectiveness. Concurrently, the left part of Table 2 details the computational resource consumption of MobilePortrait on mobile devices, underscoring its efficient viability on mobile platforms.
4.3 Ablation Studies
Motion Generation. We conduct ablation experiments to validate the effectiveness of the proposed components. We assess key metrics for image quality and motion, such as FID and AKD, along with AED and HPD specifically in cross-identity scenarios, denoted as AED(C) and HPD(C). Table 3 presents comparisons among the mixed-keypoint, neural-keypoint-only, and face-keypoint-only settings, where mixed keypoints demonstrate significant performance improvements. Additionally, excluding our proposed facial knowledge losses degrades the results. We also explored alternative approaches to integrating NK and FK beyond the mixed keypoint predictor. For instance, as shown in (b) of Figure 3 and drawing inspiration from prior works [20, 29, 21], we perform fusion on the initial transformations: we convert FK into sparse motions to generate transformations, which, when concatenated with NK's transformations, yield a combined set of transformations. Alternatively, in (c) of the figure, both NK and FK are converted into heatmaps, which are then directly fed into a convolutional network to generate optical flow. However, these methods did not achieve better motion accuracy than mixed keypoints. Additionally, we experimented with removing the residual optical flow and observed that it indeed decreased motion accuracy, although it also introduced some perturbation to the FID. The experimental results demonstrate that integrating explicit and implicit information significantly improves the generated outcomes in terms of image quality and motion, and that the fusion form of mixed keypoints is a simple and effective design.
Table 6: Comparisons with audio-driven methods.
Method | Sync-C↑ | Sync-D↓ | BSI↑ | Training Data
---|---|---|---|---
MakeItTalk [39] | 4.77 | 10.19 | 98.0 | 109 ID
SadTalker [36] | 7.32 | 7.87 | 98.2 | 1890 Videos, 46 ID
Real3D [32] | 7.06 | 7.77 | 98.0 | 200 Hours, 6000 ID
MobilePortrait | 6.01 | 9.02 | 98.5 | 16 Hours, 20 ID
Enhanced Background Synthesis. In this section, we assess the effectiveness and usage of the pseudo background in the synthesis network and list the experimental results in the left part of Table 4, where the setting employed in our method corresponds to the second row. We investigated four configurations, namely whether the pseudo background should be input into the synthesis network (abbreviated as Inp. BG) and whether the model should synthesize the background (given the presence of pseudo backgrounds, we have the option to generate only the foreground and then composite it onto the background by additionally predicting an alpha channel, abbreviated as FG Comp.). We find that pseudo-background integration indeed enhances performance by transforming the task from full generation to knowledge-aided synthesis. Table 4 shows that separately generating and merging the foreground and background does not significantly improve performance. Our method, the second row, enhances synthesis through end-to-end training with pseudo backgrounds derived from the driving images, enabling effective utilization of this knowledge in creating the final image. Real3D, which employs volume rendering, attempts to integrate the rendered head with the background using a split-and-merge-like strategy, but this can lead to inconsistent motion and visible discrepancies, as depicted in Figure 1 and Figure 5.
Enhanced Foreground Synthesis. For the pseudo multi-view inputs, experiments are conducted to examine the influence of different numbers of pseudo multi-view inputs on synthesis quality. We examine configurations with 0 (indicating the absence of multiview inputs), 2, 4, and 8 multiview inputs, and observe that each incremental addition of multiview inputs proportionately enhances the synthesis outcomes. In the right part of Table 4, as the number of images increases, the improvement in results begins to saturate, an observation consistent with phenomena encountered in video classification [28, 5] and 3D-related tasks [23]. This correlation is striking. Indeed, as multiview inputs near the frame count of the driving video, the synthesis process naturally evolves from generating static images to creating dynamic video sequences. Adding temporal data improves stability, yielding predictable and notable enhancements.
The experiments confirm our method’s effectiveness and our premise that external knowledge boosts the motion and synthesis networks with minimal extra computational cost, allowing model to deliver satisfactory results with reduced overhead.
Training data. In addition to the method, we are also interested in whether the data plays a significant role in achieving satisfactory performance for lightweight neural head avatar methods. We sequentially remove training datasets, starting with the relatively lower-quality VoxCeleb2 [3], followed by the removal of the high-quality VFHQ [31], leaving only CelebvHQ [40]. Table 5 shows that removing datasets leads to a decline in performance. However, this decline is mainly reflected in the image quality index FID, with a relatively smaller impact on motion accuracy. These results demonstrate the robust motion modeling capability of our motion generation module, which, even with less data, maintains superior motion accuracy compared to some methods listed in Table 1. The VFHQ dataset, due to its higher clarity, has a more pronounced impact on FID, aligning with expectations.
4.4 Experimental Results on More Application Scenarios
Comparisons among Audio-to-Motion. We compared MobilePortrait with some audio-driven methods. We used the 100 videos from HDTF for testing and measured lip synchronization using the Sync-D and Sync-C metrics generated by SyncNet [16] and evaluated background consistency using BSI. The results in Table 6 indicate that MobilePortrait achieves comparable performance to some audio-driven methods, outperforms one of them, and exhibits superior visual stability. It is noteworthy that MobilePortrait’s primary focus is to achieve a real-time neural head avatar method on mobile devices. Our audio-to-motion module, trained on a limited set of our own speaking videos, serves as a baseline to showcase its adaptability to audio inputs. We provide audio-driven video samples in the supplementary materials, demonstrating that MobilePortrait can achieve satisfactory results.
Figure 5: Visualization Results. To compare with other methods visually, we selected various styles of input images and rich motions to demonstrate the robustness of MobilePortrait (more results are shown on the project page). The video results are provided in the supplementary materials.
Robustness to Motion and Appearance. To further demonstrate the utility of MobilePortrait, in Figure 5, we provide a visual analysis validating its robustness in cross-identity reenactment, with challenging scenarios like non-real images, complex backgrounds and large motions. Current state-of-the-art methods, when confronted with these challenging cases, reveal numerous unfavorable results. With the help of external knowledge, MobilePortrait achieves satisfactory results with significantly less computational effort.
5 Conclusion
In this work, we address the overlooked challenge of creating lightweight one-shot neural head avatars and introduce MobilePortrait, to the best of our knowledge, the first real-time solution for mobile devices. By employing a mixed representation of explicit and implicit keypoints, along with pseudo multiview and background, we enhance the network’s motion generation and synthesis capabilities with external knowledge, enabling MobilePortrait to achieve neural head avatars with simple lightweight U-Nets. Extensive experiments confirm that MobilePortrait achieves state-of-the-art performance in synthesis quality and motion accuracy, and supports both video and audio driving inputs.
References
- [1] Chen, X., Ma, H., Wan, J., Li, B., Xia, T.: Multi-view 3d object detection network for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1907–1915 (2017)
- [2] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR (2022)
- [3] Chung, J.S., Nagrani, A., Zisserman, A.: Voxceleb2: Deep speaker recognition. In: INTERSPEECH (2018)
- [4] Drobyshev, N., Chelishev, J., Khakhulin, T., Ivakhnenko, A., Lempitsky, V., Zakharov, E.: Megaportraits: One-shot megapixel neural head avatars. In: Proceedings of the 30th ACM International Conference on Multimedia. pp. 2663–2671 (2022)
- [5] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6202–6211 (2019)
- [6] Hazirbas, C., Bitton, J., Dolhansky, B., Pan, J., Gordo, A., Ferrer, C.C.: Towards measuring fairness in ai: the casual conversations dataset. IEEE Transactions on Biometrics, Behavior, and Identity Science 4(3), 324–332 (2021)