[Non-convolutional 5D] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

This post introduces NeRF, a new method that achieves high-quality view synthesis of complex scenes by optimizing a continuous 5D neural radiance field representation. The method represents the scene with a fully-connected deep network and renders it by ray marching, demonstrating excellent performance in photorealistic view synthesis.

No 3D modeling is used: the method trains on static photographs, represents the scene as a continuous 5D field with a (non-convolutional) deep network, and then renders it by ray marching.
Paper: "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis"
The paper is from arxiv.org; these notes are for study purposes only.

[Non-convolutional 5D translation and study notes] Neural Radiance Fields, NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

NeRF: Representing Scenes as
Neural Radiance Fields for View Synthesis

Ben Mildenhall¹  Pratul P. Srinivasan¹*  Matthew Tancik¹*
Jonathan T. Barron²  Ravi Ramamoorthi³  Ren Ng¹
¹UC Berkeley  ²Google Research  ³UC San Diego

*Authors contributed equally to this work.

Abstract

We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x,y,z) and viewing direction (θ,ϕ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location. We synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses. We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis. View synthesis results are best viewed as videos, so we urge readers to view our supplementary video for convincing comparisons.

Keywords:
scene representation, view synthesis, image-based rendering, volume rendering, 3D deep learning

1 Introduction

In this work, we address the long-standing problem of view synthesis in a new way by directly optimizing parameters of a continuous 5D scene representation to minimize the error of rendering a set of captured images. We represent a scene as a continuous 5D function that outputs the radiance emitted in each direction (θ,ϕ) at each point (x,y,z) in space, and a density at each point which acts like a differential opacity controlling how much radiance is accumulated by a ray passing through (x,y,z). Our method optimizes a deep fully-connected neural network without any convolutional layers (often referred to as a multilayer perceptron or MLP) to represent this function by regressing from a single 5D coordinate (x,y,z,θ,ϕ) to a single volume density and view-dependent RGB color. To render this neural radiance field (NeRF) from a particular viewpoint we: 1) march camera rays through the scene to generate a sampled set of 3D points, 2) use those points and their corresponding 2D viewing directions as input to the neural network to produce an output set of colors and densities, and 3) use classical volume rendering techniques to accumulate those colors and densities into a 2D image. Because this process is naturally differentiable, we can use gradient descent to optimize this model to represent a complex scene by minimizing the error between each observed image and the corresponding views rendered from our representation. Minimizing this error across multiple views encourages the network to predict a coherent model of the scene by assigning high volume densities and accurate colors to the locations that contain the true underlying scene content. Figure 2 visualizes this overall pipeline.
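
To make the three rendering steps concrete, here is a minimal Python sketch of the per-ray pipeline. This is an illustrative sketch, not the authors' code: `query_fn` stands in for the trained MLP, and the near/far bounds and uniform sampling are simplifying assumptions.

```python
import numpy as np

def render_ray(query_fn, ray_o, ray_d, near, far, n_samples=64):
    """Render one camera ray: sample 3D points, query the field, composite.

    query_fn(points, view_dir) -> (rgb [N, 3], sigma [N]) stands in for the
    trained NeRF MLP. Compositing follows the standard volume rendering
    quadrature (alpha compositing of densities and colors along the ray).
    """
    # 1) March the ray: take N sample points between the near and far bounds.
    t = np.linspace(near, far, n_samples)
    points = ray_o[None, :] + t[:, None] * ray_d[None, :]          # [N, 3]

    # 2) Query the network at each point with the ray's viewing direction.
    rgb, sigma = query_fn(points, ray_d)                            # [N, 3], [N]

    # 3) Composite colors and densities into a single pixel color.
    delta = np.diff(t, append=1e10)                                 # spacing between samples
    alpha = 1.0 - np.exp(-sigma * delta)                            # opacity of each segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))   # accumulated transmittance
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(axis=0)                     # [3] pixel color
```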

We find that the basic implementation of optimizing a neural radiance field representation for a complex scene does not converge to a sufficiently high-resolution representation and is inefficient in the required number of samples per camera ray. We address these issues by transforming input 5D coordinates with a positional encoding that enables the MLP to represent higher frequency functions, and we propose a hierarchical sampling procedure to reduce the number of queries required to adequately sample this high-frequency scene representation.
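
As an illustration of the positional encoding idea, here is a minimal sketch (assuming coordinates are normalized to lie roughly in [-1, 1]; the frequency count `num_freqs` is a free parameter):

```python
import numpy as np

def positional_encoding(p, num_freqs=10):
    """Map each coordinate to a stack of sines and cosines at increasing
    frequencies: sin(2^k * pi * p), cos(2^k * pi * p) for k = 0..L-1.

    Feeding this higher-dimensional encoding to the MLP, instead of the raw
    5D coordinate, lets it represent much higher frequency functions.
    """
    p = np.asarray(p, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi                 # [L]
    angles = p[..., None] * freqs                                 # [..., D, L]
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*p.shape[:-1], -1)                         # [..., D * 2L]
```

With `num_freqs=10`, for example, a 3D position maps to a 60-dimensional encoded vector.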

Figure 1: We present a method that optimizes a continuous 5D neural radiance field representation (volume density and view-dependent color at any continuous location) of a scene from a set of input images. We use techniques from volume rendering to accumulate samples of this scene representation along rays to render the scene from any viewpoint. Here, we visualize the set of 100 input views of the synthetic Drums scene randomly captured on a surrounding hemisphere, and we show two novel views rendered from our optimized NeRF representation.

Our approach inherits the benefits of volumetric representations: both can represent complex real-world geometry and appearance and are well suited for gradient-based optimization using projected images. Crucially, our method is designed to overcome the prohibitive storage costs of discretized voxel grids when modeling complex scenes at high resolutions.

In summary, our key technical contributions are:

  1. An approach for representing continuous scenes with complex geometry and materials as 5D neural radiance fields, parameterized as basic MLP networks.

  2. A differentiable rendering procedure based on classical volume rendering techniques, which we use to optimize these representations from standard RGB images. This includes a hierarchical sampling strategy to allocate the MLP’s capacity towards space with visible scene content (see the sampling sketch after this list).

  3. A positional encoding to map each input 5D coordinate into a higher dimensional space, which enables us to successfully optimize neural radiance fields to represent high-frequency scene content.
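
The hierarchical sampling strategy from contribution 2 can be illustrated with a small sketch: treat the compositing weights of a first (coarse) pass along a ray as a piecewise-constant probability distribution over depth, and draw additional samples from it by inverse transform sampling, so that more network queries land where visible content is. The function below is an illustrative assumption about that idea, not the authors' implementation.

```python
import numpy as np

def sample_fine_depths(t_coarse, weights, n_fine=128, rng=None):
    """Draw extra sample depths where the coarse pass placed high weight.

    t_coarse: [N] depths of the coarse samples along the ray.
    weights:  [N] compositing weights from the coarse rendering pass.
    Returns [n_fine] new depths concentrated in visible scene content.
    """
    rng = np.random.default_rng() if rng is None else rng
    pdf = weights + 1e-5                            # avoid a degenerate all-zero distribution
    pdf = pdf / pdf.sum()
    cdf = np.cumsum(pdf)
    u = rng.random(n_fine)                          # uniform samples in [0, 1)
    # Invert the CDF: place each u between the neighboring coarse depths.
    idx = np.clip(np.searchsorted(cdf, u), 1, len(t_coarse) - 1)
    lo, hi = t_coarse[idx - 1], t_coarse[idx]
    frac = np.clip((u - cdf[idx - 1]) / np.maximum(cdf[idx] - cdf[idx - 1], 1e-8), 0.0, 1.0)
    return lo + frac * (hi - lo)
```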

We demonstrate that our resulting neural radiance field method quantitatively and qualitatively outperforms state-of-the-art view synthesis methods, including works that fit neural 3D representations to scenes as well as works that train deep convolutional networks to predict sampled volumetric representations. As far as we know, this paper presents the first continuous neural scene representation that is able to render high-resolution photorealistic novel views of real objects and scenes from RGB images captured in natural settings.

2 Related Work

A promising recent direction in computer vision is encoding objects and scenes in the weights of an MLP that directly maps from a 3D spatial location to an implicit representation of the shape, such as the signed distance [5] at that location. However, these methods have so far been unable to reproduce realistic scenes with complex geometry with the same fidelity as techniques that represent scenes using discrete representations such as triangle meshes or voxel grids. In this section, we review these two lines of work and contrast them with our approach, which enhances the capabilities of neural scene representations to produce state-of-the-art results for rendering complex realistic scenes.

Neural 3D shape representations

Recent work has investigated the implicit representation of continuous 3D shapes as level sets by optimizing deep networks that map xyz coordinates to a signed distance function [23] or to an occupancy field [19]. However, these models are limited by their requirement of access to ground truth 3D geometry, typically obtained from synthetic 3D shape datasets such as ShapeNet [3]. Subsequent work has relaxed this requirement of ground truth 3D shapes by formulating differentiable rendering functions that allow neural implicit shape representations to be optimized using only 2D images. Niemeyer et al. [21] represent surfaces as 3D occupancy fields and use a numerical method to find the surface intersection for each ray, then calculate an exact derivative using implicit differentiation. Each ray intersection location is then provided as the input to a neural 3D texture field that predicts a diffuse color for that point. Sitzmann et al. [30] use a less direct neural 3D representation that simply outputs a feature vector and RGB color at each continuous 3D coordinate, and propose a differentiable rendering function consisting of a recurrent neural network that marches along each ray to decide where the surface is located. Though these techniques can potentially represent arbitrarily complicated and high-resolution scene geometries, they have so far been limited to simple shapes with low geometric complexity, resulting in oversmoothed rendered views. We show that an alternate strategy of optimizing networks to encode 5D radiance fields (3D volumes with 2D view-dependent appearance) can represent higher-resolution geometry and appearance to render photorealistic novel views of complex scenes.

View synthesis and image-based rendering

The computer graphics community has made significant progress in photorealistic novel view synthesis by predicting traditional geometry and appearance representations from observed images. One popular class of approaches uses mesh-based representations of scenes with either diffuse [34] or view-dependent [2, 6, 35] appearance. Differentiable rasterizers [4, 8, 15, 17] or pathtracers [14, 22] can directly optimize mesh representations to reproduce a set of input images using gradient descent. However, gradient-based mesh optimization based on image reprojection is often difficult, likely because of local minima or poor conditioning of the loss landscape. Furthermore, this strategy requires a template mesh with fixed topology to be provided as an initialization before optimization [14], which is typically unavailable for unconstrained real-world scenes.

Another class of methods use volumetric representations to specifically address the task of high-quality photorealistic view synthesis from a set of input RGB images. Volumetric approaches are able to realistically represent complex shapes and materials, are well-suited for gradient-based optimization, and tend to produce less visually distracting artifacts than mesh-based methods. Early volumetric approaches used observed images to directly color voxel grids [12, 28, 32]. More recently, several methods [7, 20, 24, 31, 37] have used large datasets of multiple scenes to train deep networks that predict a sampled volumetric representation from a set of input images, and then use alpha-compositing [25] along rays to render novel views at test time. Other works have optimized a combination of convolutional networks (CNNs) and sampled voxel grids for each specific scene, such that the CNN can compensate for discretization artifacts from low resolution voxel grids [29] or allow the predicted voxel grids to vary based on input time or animation controls [16]. While these volumetric techniques have achieved impressive results for novel view synthesis, their ability to scale to higher resolution imagery is fundamentally limited by poor time and space complexity due to their discrete sampling — rendering higher resolution images requires a finer sampling of 3D space. We circumvent this problem by instead encoding a continuous volume within the parameters of a deep fully-connected neural network, which not only produces significantly higher quality renderings than prior volumetric approaches, but also requires just a fraction of the storage cost of those sampled volumetric representations.

Figure 2: An overview of our neural radiance field scene representation and differentiable rendering procedure. We synthesize images by sampling 5D coordinates (location and viewing direction) along camera rays (a), feeding those locations into an MLP to produce a color and volume density (b), and using volume rendering techniques to composite these values into an image (c). This rendering function is differentiable, so we can optimize our scene representation by minimizing the residual between synthesized and ground truth observed images (d).

3 Neural Radiance Field Scene Representation

We represent a continuous scene as a 5D vector-valued function whose input is a 3D location x = (x, y, z) and 2D viewing direction (θ, ϕ), and whose output is an emitted color c = (r, g, b) and volume density σ. In practice, we express direction as a 3D Cartesian unit vector d. We approximate this continuous 5D scene representation with an MLP network FΘ: (x, d) → (c, σ) and optimize its weights Θ to map each input 5D coordinate to its corresponding volume density and directional emitted color.
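
For example, the (θ, ϕ) viewing direction can be converted to the Cartesian unit vector d used in practice as follows (a small sketch assuming the common convention of θ as the polar angle from the z-axis and ϕ as the azimuth):

```python
import numpy as np

def direction_to_unit_vector(theta, phi):
    """Convert a viewing direction (theta, phi) to a 3D Cartesian unit vector d.

    theta is taken as the polar angle from the z-axis and phi as the azimuth
    in the xy-plane, so the returned vector always has unit length.
    """
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])
```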

We encourage the representation to be multiview consistent by restricting the network to predict the volume density σ as a function of only the location x, while allowing the RGB color c to be predicted as a function of both location and viewing direction. To accomplish this, the MLP FΘ first processes the input 3D coordinate x with 8 fully-connected layers (using ReLU activations and 256 channels per layer), and outputs σ and a 256-dimensional feature vector. This feature vector is then concatenated with the camera ray’s viewing direction and passed to 4 additional fully-connected layers (using ReLU activations and 128 channels per layer) that output the view-dependent RGB color.
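
To make this architecture concrete, here is a minimal PyTorch sketch following the layer counts given above. It is a sketch, not the authors' implementation: the positional encoding of the inputs and the skip connection used in the paper's full architecture are omitted, and the raw 3D inputs are an assumption for brevity.

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    """Sketch of the MLP F_Theta: density depends only on position x, while
    color depends on both x and the viewing direction d, as described above."""

    def __init__(self, x_dim=3, d_dim=3, width=256, dir_width=128):
        super().__init__()
        trunk = [nn.Linear(x_dim, width), nn.ReLU()]
        for _ in range(7):                                  # 8 fully-connected layers in total
            trunk += [nn.Linear(width, width), nn.ReLU()]
        self.trunk = nn.Sequential(*trunk)
        self.sigma_head = nn.Linear(width, 1)               # volume density, view-independent
        self.feature = nn.Linear(width, width)              # 256-d feature fed to the color branch
        branch = [nn.Linear(width + d_dim, dir_width), nn.ReLU()]
        for _ in range(3):                                  # 4 direction-conditioned layers per the text
            branch += [nn.Linear(dir_width, dir_width), nn.ReLU()]
        self.dir_branch = nn.Sequential(*branch)
        self.rgb_head = nn.Linear(dir_width, 3)

    def forward(self, x, d):
        h = self.trunk(x)
        sigma = torch.relu(self.sigma_head(h)).squeeze(-1)  # keep density non-negative
        h = torch.cat([self.feature(h), d], dim=-1)
        rgb = torch.sigmoid(self.rgb_head(self.dir_branch(h)))
        return rgb, sigma
```

Restricting σ to the position-only trunk is what enforces the multiview-consistency property described above, since density cannot change with viewing direction.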

See Fig. 3 for an example of how our method uses the input viewing direction to represent non-Lambertian effects. As shown in Fig. 4, a model trained without view dependence (only x as input) has difficulty representing specularities.

4 Volume Rendering with Radiance Fields

Figure 3: A visualization of view-dependent emitted radiance. Our neural radiance field representation outputs RGB color as a 5D function of both spatial position x and viewing direction d.