【计算机视觉】Gaussian Splatting源码解读补充（二）

原创

已于 2024-05-23 21:12:12 修改 · 1w 阅读

118 ·

CC 4.0 BY-SA版权

文章标签：

#计算机视觉 #人工智能

于 2024-03-20 19:13:57 首次发布

本文是3D Gaussian Splatting源码解读的补充。先介绍相机相关内容，重点讲解前向传播（渲染）。包括CUDA编程基础，如kernel函数、线程与内存操作等。还分析了CUDA文件引用的Cooperative Groups、CUB、GLM库，以及前向传播各步骤和相关函数的作用。

第一部分

本文是对学习笔记之——3D Gaussian Splatting源码解读的补充，并订正了一些错误。

三、相机相关

`scene/cameras.py`：`class Camera`

class Camera(nn.Module):
    def __init__(self, colmap_id, R, T, FoVx, FoVy, image, gt_alpha_mask,
                 image_name, uid,
                 trans=np.array([0.0, 0.0, 0.0]), scale=1.0, data_device = "cuda"
                 ):
        super(Camera, self).__init__()

        self.uid = uid
        self.colmap_id = colmap_id
        self.R = R # 旋转矩阵
        self.T = T # 平移向量
        self.FoVx = FoVx # x方向视场角
        self.FoVy = FoVy # y方向视场角
        self.image_name = image_name

        try:
            self.data_device = torch.device(data_device)
        except Exception as e:
            print(e)
            print(f"[Warning] Custom device {
     
     data_device} failed, fallback to default cuda device" )
            self.data_device = torch.device("cuda")

        self.original_image = image.clamp(0.0, 1.0).to(self.data_device) # 原始图像
        self.image_width = self.original_image.shape[2] # 图像宽度
        self.image_height = self.original_image.shape[1] # 图像高度

        if gt_alpha_mask is not None:
            self.original_image *= gt_alpha_mask.to(self.data_device)
        else:
            self.original_image *= torch.ones((1, self.image_height, self.image_width), device=self.data_device)

		# 距离相机平面znear和zfar之间且在视锥内的物体才会被渲染
        self.zfar = 100.0 # 最远能看到多远
        self.znear = 0.01 # 最近能看到多近

        self.trans = trans # 相机中心的平移
        self.scale = scale # 相机中心坐标的缩放

        self.world_view_transform = torch.tensor(getWorld2View2(R, T, trans, scale)).transpose(0, 1).cuda() # 世界到相机坐标系的变换矩阵，4×4
        self.projection_matrix = getProjectionMatrix(znear=self.znear, zfar=self.zfar, fovX=self.FoVx, fovY=self.FoVy).transpose(0,1).cuda() # 投影矩阵
        self.full_proj_transform = (self.world_view_transform.unsqueeze(0).bmm(self.projection_matrix.unsqueeze(0))).squeeze(0) # 从世界坐标系到图像的变换矩阵
        self.camera_center = self.world_view_transform.inverse()[3, :3] # 相机在世界坐标系下的坐标

其中出现的辅助函数：

# utils/graphic_utils.py
def getWorld2View2(R, t, translate=np.array([.0, .0, .0]), scale=1.0):
    Rt = np.zeros((4, 4)) # 按理来说是世界到相机的变换矩阵，但没有加平移和缩放
    Rt[:3, :3] = R.transpose()
    Rt[:3, 3] = t
    Rt[3, 3] = 1.0

    C2W = np.linalg.inv(Rt) # 相机到世界的变换矩阵
    cam_center = C2W[:3, 3] # 相机中心在世界中的坐标，即C2W矩阵第四列的三维平移向量
    cam_center = (cam_center + translate) * scale # 相机中心坐标需要平移和缩放处理
    C2W[:3, 3] = cam_center # 重新填入C2W矩阵
    Rt = np.linalg.inv(C2W) # 再取逆获得W2C矩阵
    return np.float32(Rt)

四、前向传播（渲染）：`submodules/diff-gaussian-rasterization/cuda_rasterizer/forward.cu`

预备知识：CUDA编程基础

这部分的参考资料：

[1] CUDA Tutorial
[2] An Even Easier Introduction to CUDA
[3] Introduction to CUDA Programming

CUDA是一个为支持CUDA的GPU提供的平台和编程模型。该平台使GPU能够进行通用计算。CUDA提供了C/C++语言扩展和API，以便用户利用GPU进行高效计算。一般称CPU为host，GPU为device。

CUDA C++语言中有一个加在函数前面的关键字__global__，用于表明该函数是运行在GPU上的，并且可以被CPU调用。这种函数称为kernel。

当我们调用kernel的时候，需要在函数名和括号之间加上<<<M, T>>>，其中M是block的个数，T是每个block中线程的个数。这些线程都是并行执行的，每个线程中都在执行该函数。

根据参考资料[3]，GPU在同一时刻运行一个kernel（也就是一组任务），kernel运行在grid上，每个grid由多个block组成（他们是独立的ALU组），每个block有多个线程。

同一block中的线程一般合作完成任务，它们可以共享内存（这部分内存速度极快，用__shared__关键字声明）。每个线程“知道”它在哪个block（通过访问内置变量blockIdx.x）和它是第几个线程（通过访问变量threadIdx.x），以及每个block有多少个线程（blockDim.x），从而确定它应该完成什么任务。实际上线程和block的索引是三维的，这里只举了一个一维的例子。

注意GPU和CPU的内存是隔离的，想要操作显存或者在显存和CPU内存之间进行交流必须显示的声明这些操作。指针也是不一样的，有可能虽然都是int*，但表示的含义却不同：device指针指向显存，host指针指向CPU内存。CUDA提供了操作内存的内置函数：cudaMalloc、cudaFree、cudaMemcpy等，它们分别类似于C函数malloc、free和memcpy。

关于同步方面，内置函数 __syncthreads()可以同步一个block中的所有线程。在CPU中调用内置函数cudaDeviceSynchronize()可以可以阻塞CPU，直到所有先前发出的CUDA调用都完成为止。

另外还有__host__关键字和__device__关键字，前者表示该函数只编译成CPU版本（这是默认状态），后者表示只编译成GPU版本。二者同时使用表示编译CPU和GPU两个版本。从CPU调用__device__函数和从GPU调用__host__函数都会报错。

以下是一个CUDA并行计算向量加法的示例：

#include <stdio.h>

int a[10] = {
   
   0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
int b[10] = {
   
   0, 10, 20, 30, 40, 50, 60, 70, 80, 90};
int c[10];

__global__ void vec_add(int* A, int* B, int* C)
{
   
   
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
   
   
    int* gpua, *gpub, *gpuc;
    int sz = 10 * sizeof(int);
    cudaMalloc((void**)&gpua, sz); // 在GPU中分配内存
    cudaMalloc((void**)&gpub, sz);
    cudaMalloc((void**)&gpuc, sz);
    cudaMemcpy(gpua, a, sz, cudaMemcpyHostToDevice); // 拷贝CPU中的数组到GPU
    cudaMemcpy(gpub, b, sz, cudaMemcpyHostToDevice);
    vec_add<<<2, 5>>>(gpua, gpub, gpuc); // 调用kernel
    cudaDeviceSynchronize(); // 等GPU执行完（这里其实不是很有必要）
    cudaMemcpy(c, gpuc, sz, cudaMemcpyDeviceToHost); // 把GPU的计算结果拷贝到CPU
    for(int i = 0; i < 10; i++)
    {
   
   
        printf("%d\n", c[i]); // 输出结果，结果为0、11、22、……、99
    }
    return 0;
}

有了这些预备知识之后，我们就可以开始看代码了。在看CUDA代码之前，我们先看看gaussian_renderer/__init__.py中的render函数。

`def render(viewpoint_camera, pc : GaussianModel, pipe, bg_color : torch.Tensor, scaling_modifier = 1.0, override_color = None)`

def render(viewpoint_camera, pc : GaussianModel, pipe, bg_color : torch.Tensor, scaling_modifier = 1.0, override_color = None):
	'''
	viewpoint_camera: scene.cameras.Camera类的实例
	pipe: 流水线，规定了要干什么
	'''
    # Create zero tensor. We will use it to make pytorch return gradients of the 2D (screen-space) means
    screenspace_points = torch.zeros_like(pc.get_xyz, dtype=pc.get_xyz.dtype, requires_grad=True, device="cuda") + 0
    try:
        screenspace_points.retain_grad() # 让screenspace_points在反向传播时接受梯度
    except:
        pass

    # Set up rasterization configuration
    tanfovx = math.tan(viewpoint_camera.FoVx * 0.5)
    tanfovy = math.tan(viewpoint_camera.FoVy * 0.5)

    raster_settings = GaussianRasterizationSettings(
        image_height=int(viewpoint_camera.image_height),
        image_width=int(viewpoint_camera.image_width),
        tanfovx=tanfovx,
        tanfovy=tanfovy,
        bg=bg_color, # 背景颜色
        scale_modifier=scaling_modifier,
        viewmatrix=viewpoint_camera.world_view_transform, # W2C矩阵
        projmatrix=viewpoint_camera.full_proj_transform, # 世界到图像的投影矩阵
        sh_degree=pc.active_sh_degree, # 目前的球谐阶数
        campos=viewpoint_camera.camera_center, # CAMera center POSition，相机中心文职
        prefiltered=False,
        debug=pipe.debug
    )

    rasterizer = GaussianRasterizer(raster_settings=raster_settings)

    means3D = pc.get_xyz
    means2D = screenspace_points # 疑似各个Gaussian的中心投影到在图像中的坐标
    opacity = pc.get_opacity # 不透明度

    # If precomputed 3d covariance is provided, use it. If not, then it will be computed from
    # scaling / rotation by the rasterizer.
    scales = None
    rotations = None
    cov3D_precomp = None
    if pipe.compute_cov3D_python:
        cov3D_precomp = pc.get_covariance(scaling_modifier)
    else:
        scales = pc.get_scaling
        rotations = pc.get_rotation

    # If precomputed colors are provided, use them. Otherwise, if it is desired to precompute colors
    # from SHs in Python, do it. If not, then SH -> RGB conversion will be done by rasterizer.
    shs = None
    colors_precomp = None
    if override_color is None:
        if pipe.convert_SHs_python:
            shs_view = pc.get_features.transpose(1, 2).view(-1, 3, (pc.max_sh_degree+1)**2)
            dir_pp = (pc.get_xyz - viewpoint_camera.camera_center.repeat(pc.get_features.shape[0], 1))
            dir_pp_normalized = dir_pp/dir_pp.norm(dim=1, keepdim=True)
            	# 找到高斯中心到相机的光线在单位球体上的坐标
            sh2rgb = eval_sh(pc.active_sh_degree, shs_view, dir_pp_normalized)
            	# 球谐的傅里叶系数转成RGB颜色
            colors_precomp = torch.clamp_min(sh2rgb + 0.5, 0.0)
        else:
            shs = pc.get_features
    else:
        colors_precomp = override_color

    # Rasterize visible Gaussians to image, obtain their radii (on screen). 
    rendered_image, radii = rasterizer(
        means3D = means3D,
        means2D = means2D,
        shs = shs,
        colors_precomp = colors_precomp,
        opacities = opacity,
        scales = scales,
        rotations = rotations,
        cov3D_precomp = cov3D_precomp)

    # Those Gaussians that were frustum culled or had a radius of 0 were not visible.
    # They will be excluded from value updates used in the splitting criteria.
    return {
   
   "render": rendered_image,
            "viewspace_points": screenspace_points,
            "visibility_filter" : radii > 0,
            "radii": radii}

CUDA文件`rasterizer_impl.cu`: `submodules/diff-gaussian-rasterization/cuda_rasterizer`

1. 引用头文件

#include "rasterizer_impl.h"
#include <iostream>
#include <fstream>
#include <algorithm>
#include <numeric>
#include <cuda.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cub/cub.cuh> // CUDA的CUB库
#include <cub/device/device_radix_sort.cuh>
#define GLM_FORCE_CUDA
#include <glm/glm.hpp> // GLM (OpenGL Mathematics)库

#include <cooperative_groups.h> // CUDA 9引入的Cooperative Groups库
#include <cooperative_groups/reduce.h>
namespace cg = cooperative_groups;

#include "auxiliary.h"
#include "forward.h"
#include "backward.h"

有一些库需要我们讲解一下。

(1) Cooperative Groups

参考资料：

我们直到 __syncthreads()函数提供了在一个block内同步各线程的方法，但有时要同步block内的一部分线程或者多个block的线程，这时候就需要Cooperative Groups库了。这个库定义了划分和同步一组线程的方法。

在Gaussian Splatting的所有CUDA代码中，这个库仅以两种方式被调用：

🥕(a)

auto idx = cg::this_grid().thread_rank();

cg::this_grid()返回一个cg::grid_group实例，表示当前线程所处的grid。他有一个方法thread_rank()返回当前线程在该grid中排第几。

🥕(b)

auto block = cg::this_thread_block();

cg::this_thread_block返回一个cg::thread_block实例。代码中用到的成员函数有：

block.sync()：同步该block中的所有线程（等价于__syncthreads()）。
block.thread_rank()：返回非负整数，表示当前线程在该block中排第几。
block.group_index()：返回一个cg::dim3实例，表示该block在grid中的三维索引。
block.thread_index()：返回一个cg::dim3实例，表示当前线程在block中的三维索引。

(2) CUB

参考资料：CUB API

CUB provides layered algorithms that correspond to the thread/warp/block/device hierarchy of threads in CUDA. There are distinct algorithms for each layer and higher-level layers build on top of those below.

换言之，CUB就是针对不同的计算等级：线程、wap、block、device等设计了并行算法。例如，reduce函数有四个版本：ThreadReduce、WarpReduce、BlockReduce、DeviceReduce。

diff-gaussian-rasterization模块调用了CUB库的两个函数：

⭐️ (a) cub::DeviceScan::InclusiveSum

这个函数是算前缀和的。所谓"Inclusive"就是第 $i$ 个数被计入第 $i$ 个和中。函数定义如下：

template<typename InputIteratorT, typename OutputIteratorT>
static inline cudaError_t InclusiveSum(
	void *d_temp_storage, // 额外需要的临时显存空间
	size_t &temp_storage_bytes, // 临时显存空间的大小
	InputIteratorT d_in, // 输入指针
	OutputIteratorT d_out, // 输出指针
	int num_items, // 元素个数
	cudaStream_t stream = 0)

⭐️ (b) cub::DeviceRadixSort::SortPairs：device级别的并行基数排序

该函数根据key将(key, value)对进行升序排序。这是一种稳定排序。

函数定义如下：

template<typename KeyT, typename ValueT, typename NumItemsT>
static inline cudaError_t SortPairs(
	void *d_temp_storage, // 排序时用到的临时显存空间
	size_t &temp_storage_bytes, // 临时显存空间的大小
	const KeyT *d_keys_in,         KeyT *d_keys_out, // key的输入和输出指针
	const ValueT *d_values_in,     ValueT *d_values_out, // value的输入和输出指针
	NumItemsT num_items, // 对多少个条目进行排序
	int begin_bit = 0, // 低位
	int end_bit = sizeof(KeyT) * 8, // 高位
	cudaStream_t stream = 0)
	// 按照[begin_bit, end_bit)内的位进行排序

示例代码：

#include <stdio.h>
#include <cub/cub.cuh>
// or equivalently <cub/device/device_radix_sort.cuh>

int main()
{
   
   
    // Declare, allocate, and initialize device-accessible pointers
    // for sorting data
    int  num_items = 7;
    int  keys_in[7] = {
   
   8, 6, 7, 5, 3, 0, 9};
    int* d_keys_in, *d_keys_out;
    int  values_in[7] = {
   
   0, 1, 2, 3, 4, 5, 6};
    int* d_values_in, *d_values_out;

    int keys_out[7], values_out[7];

    // 下面把数据搬到显存中
    int sz = 7 * sizeof(int);
    cudaMalloc((void**)&d_values_in, sz);
    cudaMalloc((void**)&d_values_out, sz);
    cudaMalloc((void**)&d_keys_in, sz);
    cudaMalloc((void**)&d_keys_out, sz);
    cudaMemcpy(d_keys_in, keys_in, sz, cudaMemcpyHostToDevice);
    cudaMemcpy(d_values_in, values_in, sz, cudaMemcpyHostToDevice);

    // Determine temporary device storage requirements
    void     *d_temp_storage = NULL;
    size_t   temp_storage_bytes = 0;
    cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
        d_keys_in, d_keys_out, d_values_in, d_values_out, num_items);

    // Allocate temporary storage
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Run sorting operation
    cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
        d_keys_in, d_keys_out, d_values_in, d_values_out, num_items);

    // 结果：
    // d_keys_out            <-- [0, 3, 5, 6, 7, 8, 9]
    // d_values_out          <-- [5, 4, 3, 1, 2, 0, 6]

    // 把算好的数据搬回来
    cudaMemcpy(keys_out, d_keys_out, sz, cudaMemcpyDeviceToHost);
    cudaMemcpy(values_out, d_values_out, sz, cudaMemcpyDeviceToHost);

    puts("keys");
    for(int i = 0; i < 7; i++)
    {
   
   
        printf("%d ", keys_out[i]);
    }
    puts("\nvalues");
    for(int i = 0; i < 7; i++)
    {
   
   
        printf("%d ", values_out[i]);
    }

    cudaFree(d_keys_in);
    cudaFree(d_keys_out);
    cudaFree(d_values_in);
    cudaFree(d_values_out);

    return 0;
}

(3) GLM

参考资料：GLM的Github页面

GLM意为“OpenGL Mathematics”，是一个专为图形学设计的只有头文件的C++数学库。

Gaussian Splatting的代码中用到了glm::vec3（三维向量）, glm::vec4（四维向量）, glm::mat3（3×3矩阵）和glm::dot（向量点积）。

2. `getHigherMsb`

// Helper function to find the next-highest bit of the MSB
// on the CPU.
uint32_t getHigherMsb(uint32_t n)
{
   
   
	uint32_t msb = sizeof(n) * 4;
	uint32_t step = msb;
	while (step > 1)
	{
   
   
		step /= 2;
		if (n >> msb)
			msb += step;
		else
			msb -= step;
	}
	if (n >> msb)
		msb++;
	return msb;
}

这个函数似乎是用二分法检测n表示成二进制数除去前导零有几位（n=0时返回1）。我写了一个测试程序推测它的功能：

#include <iostream>
uint32_t getHigherMsb(uint32_t n);

int main()
{
   
   
    uint32_t n[] = {
   
   0b0, 0b1, 0b10, 0b11, 0b111, 0b10101, 0b1110001};
    for(int i = 0; i < sizeof(n) / sizeof(uint32_t); i++)
        std::cout

最低0.47元/天解锁文章

22 条评论

mzm461382789 2024.12.17
请问computeCov2D最后为什么是J^T*W^T*W*J的形式？和原文中不同

淀粉死了孤独登月 2024.09.03
太厉害了太详细了，我的大脑宕机了

xiaomingzaihuake 2024.08.01
厉害，代码解读非常详细。问下在相机模型的getWorld2View2函数中，将C2W的中心进行平移和缩放的原因是什么？有没有相关的解释？

CarpeMnn 2024.07.02
关于getProjectMatrix，P[0,2] P[1,2] 不为0，原代码是对的，这是因为，getProjectMatrix 这里除了透视投影以外，还乘上了scale&平移到ndc的 deformation
- 西山点子王回复CarpeMnn 2024.07.16
  为0啊，在相机坐标系里相机就是原点，这段函数变量定义那里也说了，left=-right;bottom=-top,下面再加起来可不就是0吗

ytyyr 2024.06.29
你好，请问在computeCov2D中雅可比矩阵的最后一行为什么都是0，EWA论文里不是这样啊
- wobushoulian回复ytyyr 2024.07.12
  可能因为在映射的时候需要截取前两维就直接设为0了

ytyyr 2024.06.27
你好， Q1:在renderCUDA代码中，每个高斯渲染前都会判断透光率和不透明度是否很小，如果很小就不渲染，那这样在last_contrib之前的高斯也有可能是没渲染的，请问这在代码中是如何解决的呢 Q2:请问为什么需要判断二维高斯指数部分的正负，为什么是正就直接跳过了

m0_57942192 2024.06.06
特别感谢,写的真好

Self Revolution 2024.05.24
请问colors_precomp = torch.clamp_min(sh2rgb + 0.5, 0.0)，为啥sh2RGB RGB2SH都有加减一个0.5呢？
- Self Revolution回复Self Revolution 2024.05.24
  如果RGB小于0了，

weixin_48521688 2024.05.23
请问在render时经历identifyTilerange后，同一tile内各个pixel需要处理的高斯球的起始和数目都是一致的，只是对于相同的这些高斯球不同pixel的坐标不同，生成的d不同从而alpha不同，这样说对吗？
- seh_sjlj回复weixin_48521688 2024.05.23
  大体上没问题，就是不太清楚你说的“起始”是什么意思