StyleNeRF: A Style-based 3D-Aware Generator for High-resolution Image Synthesis
Jiatao Gu†, Lingjie Liu‡, Peng Wang⋄, Christian Theobalt‡
†Facebook AI ‡Max Planck Institute for Informatics ⋄The University of Hong Kong
†jgu@fb.com ‡{lliu,theobalt}@mpi-inf.mpg.de ⋄pwang3@cs.hku.hk
corresponding author.
Abstract
We propose StyleNeRF, a 3D-aware generative model for photo-realistic high-resolution image synthesis with high multi-view consistency, which can be trained on unstructured 2D images. Existing approaches either cannot synthesize high-resolution images with fine details or yield noticeable 3D-inconsistent artifacts. In addition, many of them lack control over style attributes and explicit 3D camera poses. StyleNeRF integrates the neural radiance field (NeRF) into a style-based generator to tackle the aforementioned challenges, i.e., improving rendering efficiency and 3D consistency for high-resolution image generation. To address the first issue, we perform volume rendering only to produce a low-resolution feature map and progressively apply 2D upsampling. To mitigate the inconsistencies caused by 2D upsampling, we propose multiple designs, including a better upsampler and a new regularization loss. With these designs, StyleNeRF can synthesize high-resolution images at interactive rates while preserving 3D consistency at high quality. StyleNeRF also enables control of camera poses and of different levels of style, which generalizes to unseen views. It further supports challenging tasks, including zoom-in and zoom-out, style mixing, inversion, and semantic editing.¹

¹Please check our video at: http://jiataogu.me/style_nerf/.
1 Introduction
Figure 1: Synthesized 1024² images by StyleNeRF, with the corresponding low-resolution feature maps. StyleNeRF can generate photo-realistic high-resolution images from novel views at interactive rates while preserving high 3D consistency. None of the existing methods achieves both.
Photo-realistic free-view image synthesis of real-world scenes is a long-standing problem in computer vision and computer graphics. The traditional graphics pipeline requires production-quality 3D models, computationally expensive rendering, and manual work, making it difficult to scale to image synthesis for a wide range of real-world scenes. Meanwhile, Generative Adversarial Networks (GANs, Goodfellow et al., 2014) can be trained on large collections of unstructured images to synthesize high-quality images. However, most GAN models operate in 2D space and therefore lack a 3D understanding of the training images; as a result, they cannot synthesize images of the same 3D scene with multi-view consistency, nor do they offer direct 3D camera control over the generated images.
Natural images are the 2D projection of the 3D world. Hence, recent works on generative models (Schwarz et al., 2020; Chan et al., 2021) enforce 3D structures by incorporating a neural radiance field (NeRF, Mildenhall et al., 2020). However, these methods cannot synthesize high-resolution images with delicate details due to the computationally expensive rendering process of NeRF. Furthermore, the slow rendering process leads to inefficient training and makes these models unsuitable for interactive applications. GIRAFFE (Niemeyer & Geiger, 2021b) combines NeRF with a CNN-based renderer, which has the potential to synthesize high-resolution images. However, this method falls short of 3D-consistent image generation and so far has not shown high-resolution results.
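To make this cost concrete, recall the standard volume rendering quadrature from Mildenhall et al. (2020), which estimates each pixel's color by sampling N points along its camera ray:

```latex
\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i,
\qquad
T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right),
```

where σᵢ and cᵢ are the predicted density and color of the i-th sample and δᵢ is the spacing between adjacent samples. An H×W image thus requires H·W·N network queries per frame; at 1024² with, say, N = 64 samples per ray (an illustrative setting), that is roughly 6.7×10⁷ queries, which is what makes direct high-resolution NeRF rendering impractical for GAN training.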
We propose StyleNeRF, a new 3D-aware generative model for high-resolution, 3D-consistent image synthesis at interactive rates. It also allows control of the 3D camera pose and of specific style attributes. StyleNeRF incorporates 3D scene representations into a style-based generative model. To avoid the expensive direct color-image rendering of the original NeRF approach, we use NeRF only to produce a low-resolution feature map and upsample it progressively to high resolution. To improve 3D consistency, we propose several designs: an upsampler that achieves high consistency while mitigating artifacts in the outputs, a novel regularization term that forces the output to match the rendering result of the original NeRF, and fixes to the view-direction conditioning and noise injection. StyleNeRF is trained on unstructured real-world images, and a progressive training strategy significantly improves the stability of learning real geometry.
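The pipeline described above can be summarized with a minimal PyTorch sketch. All module names and layer sizes here (FeatureNeRF, UpBlock, the 32² feature resolution) are illustrative placeholders, not the authors' implementation; the point is that volume rendering happens only at low resolution, with 2D upsampling producing the final 1024² image.

```python
# Minimal sketch of the two-stage generator described above (illustrative
# placeholders, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureNeRF(nn.Module):
    """Stand-in for the style-conditioned NeRF: returns a low-resolution
    feature map (B, C, 32, 32) instead of an RGB image, keeping the
    volume-rendering cost manageable. Real point sampling and alpha
    compositing are omitted; only the interface matters here."""
    def __init__(self, style_dim=512, feat_dim=64, res=32):
        super().__init__()
        self.res = res
        self.proj = nn.Linear(style_dim + 16, feat_dim)  # 16 = flattened 4x4 pose

    def forward(self, w, pose):
        h = self.proj(torch.cat([w, pose.flatten(1)], dim=1))
        return h[:, :, None, None].expand(-1, -1, self.res, self.res)

class UpBlock(nn.Module):
    """One 2x upsampling stage: bilinear interpolation followed by a conv.
    A plain conv stands in for StyleNeRF's carefully designed upsampler."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
        return F.leaky_relu(self.conv(x), 0.2)

class Generator(nn.Module):
    def __init__(self, feat_dim=64, n_up=5):
        super().__init__()
        self.nerf = FeatureNeRF(feat_dim=feat_dim)
        self.ups = nn.ModuleList(UpBlock(feat_dim) for _ in range(n_up))
        self.to_rgb = nn.Conv2d(feat_dim, 3, 1)

    def forward(self, w, pose):
        x = self.nerf(w, pose)    # (B, 64, 32, 32): cheap volume rendering
        for up in self.ups:       # 32 -> 1024 through five 2x stages
            x = up(x)
        return self.to_rgb(x)     # (B, 3, 1024, 1024)

img = Generator()(torch.randn(1, 512), torch.eye(4).unsqueeze(0))
```

The upsampler design matters: a naive bilinear-plus-conv stage as above is exactly the kind of 2D operation that introduces view-dependent artifacts, which motivates the dedicated upsampler and the NeRF-matching regularization term mentioned above.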
We evaluate StyleNeRF on various challenging datasets. StyleNeRF can synthesize photo-realistic 1024² images at interactive rates while achieving high multi-view consistency; none of the existing methods achieves both. Additionally, StyleNeRF enables direct control over styles and 3D camera poses, even for poses starkly different from those seen during training, and supports applications including style mixing, interpolation, inversion, and semantic editing.
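To illustrate what explicit camera control and latent manipulation look like with such a generator, here is a hypothetical usage sketch building on the Generator above; camera_pose is a toy parameterization invented for illustration, not the paper's API.

```python
# Hypothetical usage of the Generator sketch above: free-view synthesis
# under explicit camera control, and latent interpolation between styles.
import math
import torch

def camera_pose(yaw_deg):
    """Toy camera-to-world matrix rotating the camera about the y-axis;
    stands in for whatever pose parameterization the model is trained with."""
    y = math.radians(yaw_deg)
    pose = torch.eye(4)
    pose[0, 0], pose[0, 2] = math.cos(y), math.sin(y)
    pose[2, 0], pose[2, 2] = -math.sin(y), math.cos(y)
    return pose.unsqueeze(0)

g = Generator()
w_a, w_b = torch.randn(1, 512), torch.randn(1, 512)

# Free-view synthesis: fix the latent (identity), sweep the camera.
frames = [g(w_a, camera_pose(deg)) for deg in range(-30, 31, 10)]

# Latent interpolation between two styles under a fixed camera.
mixed = [g((1 - t) * w_a + t * w_b, camera_pose(0)) for t in (0.0, 0.5, 1.0)]
```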
2 Related Work
Neural Implicit Fields
Representing 3D scenes as neural implicit fields has gained increasing attention. Michalkiewicz et al. (2019), Mescheder et al. (2019), Park et al. (2019), and Peng et al. (2020) predict neural implicit fields with 3D supervision. Some works (Sitzmann et al., 2019; Niemeyer et al., 2019) assume that the ray color lies only on the geometry surface and propose differentiable renderers to learn a neural implicit surface representation. NeRF and its variants (Mildenhall et al., 2020; Liu et al., 2020; Zhang et al., 2020) utilize volume rendering to render neural implicit volume representations for novel-view synthesis. In this work, we propose a generative variant of NeRF (Mildenhall et al., 2020). Unlike the methods discussed above, which require posed multi-view images, our approach needs only unstructured single-view images for training.
Image Synthesis with GANs
Starting from Goodfellow et al. (2014), GANs have demonstrated high-quality results (Durugkar et al., 2017; Mordido et al., 2018; Doan et al., 2019; Zhang et al., 2019; Brock et al., 2018; Karras et al., 2018). StyleGANs (Karras et al., 2019; 2020b) achieve SOTA quality and support different levels of style control. Karras et al. (2021) solve the “texture sticking” problem of 2D GANs in generating animations with 2D transformations. Some methods (Härkönen et al., 2020; Tewari et al., 2020a; Shen et al., 2020; Abdal et al., 2020; Tewari et al., 2020b; Leimkühler & Drettakis, 2021; Shoshan et al., 2021) leverage disentangled properties in the latent space to enable explicit controls, most of which focus on faces. While these methods can synthesize face poses parameterized by two angles, extending them to general o