
CVPR-2020
https://github.com/alibaba/cascade-stereo
1、Background and Motivation
In recent years, learning-based multi-view stereo (MVS) and binocular stereo matching have made remarkable progress. Mainstream methods generally adopt a 3D cost volume: feature maps are warped onto a series of hypothesis depth (or disparity) planes to build a 3D tensor, which is then regularized with 3D convolutions to regress the final depth/disparity map.
However, this design has a fundamental bottleneck: computation and memory consumption grow cubically with the cost volume resolution. This severely limits the ability to handle high-resolution images; most methods have to heavily downsample the feature maps (e.g., to 1/4 or 1/8), build cost volumes at a lower resolution, and rely on upsampling or post-refinement to output a high-resolution result, which sacrifices accuracy.
Since a single cost volume can hardly be both high-resolution and efficient, can we borrow the classic coarse-to-fine strategy from traditional computer vision?
The authors propose CasStereo, a memory- and time-efficient cost volume formulation. The core idea is to split a single expensive cost volume into several cascade stages: each stage narrows the depth (or disparity) range based on the prediction from the previous stage, refines the plane interval, and increases the spatial resolution, achieving higher accuracy at lower cost.

2、Related Work
(1)Stereo Matching
Steps of the traditional pipeline
- matching cost calculation
- matching cost aggregation
- disparity calculation
- disparity refinement
Categories of traditional methods
- Global methods(energy function)
- Local methods
Subsequent CNN-based methods
- GC-Net
- PSMNet
- GwcNet
- HSM
- EMCUA
- GANet
These CNN-based methods are limited to downsampled cost volumes and rely on interpolation operations to generate high-resolution disparity.
(2)Multi-View Stereo
- volumetric methods
- point cloud based methods
- depth map reconstruction methods
(3)High-Resolution Output in Stereo and MVS
MVSNet
Yao Y, Luo Z, Li S, et al. MVSNet: Depth Inference for Unstructured Multi-View Stereo. In Proceedings of the European Conference on Computer Vision (ECCV), 2018: 767-783.
3、Advantages / Contributions
Proposes the cascade cost volume, a general, plug-and-play, and efficient cost volume construction paradigm;
Enables genuinely high-resolution inference: achieves state-of-the-art results on DTU with 1152×864 inputs (and also ranks first on the Tanks and Temples benchmark);
Significantly reduces resource consumption: compared with MVSNet, GPU memory drops by 50.6% and run-time by 59.3%;
Broad compatibility: successfully applied as a plug-and-play module to MVSNet, PSMNet, GwcNet, GANet, and other architectures.
4、Method

Coarse to fine: this design ensures the computation and memory resources are spent on more meaningful regions.
4.1、Cost volume Formulation
Constructing a 3D cost volume requires three major steps:
- Feature Extraction
- Warping / Feature Alignment
- Cost Aggregation
Here, warping means projecting a source view's feature map onto the reference view at a hypothesized depth plane, using a homography derived from the camera intrinsics and extrinsics and the hypothesis depth value. The warp is differentiable (usually implemented with bilinear interpolation), so it can be embedded in an end-to-end trainable network.
we warp the extracted 2D features of each view to the hypothesis planes and construct the feature volumes, which are finally fused together to build the 3D cost volume.
(1)Cost Volume Formulation in MVS
The projection relation from the reference view to the $i$-th source view can be described by the homography matrix $H_i(d)$:
Reference image: the image being reconstructed; the final depth map (or disparity map) is defined with respect to this image.
Source images: the remaining input images, which provide additional viewpoints to help infer the depth of each pixel in the reference image.
The corresponding formula is:
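(The original formula image is missing here; the following is the standard MVSNet-style differentiable homography, reconstructed from the symbol definitions below, so treat it as a sketch rather than a verbatim copy of the paper's equation.)

$$H_i(d) = K_i \cdot R_i \cdot \left( I - \frac{(t_1 - t_i)\, n_1^{T}}{d} \right) \cdot R_1^{T} \cdot K_1^{-1}$$

where the subscript $1$ denotes the reference camera and the subscript $i$ the $i$-th source camera.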

- $i^{th}$: index of the source view.
- $n_1$ denotes the principal axis of the reference camera; $n_1 = [0, 0, 1]^{T}$ is the normal of the fronto-parallel hypothesis planes (the planes are assumed parallel to the image plane).
- $K$: intrinsics; $R$: rotation (extrinsics); $t$: translation (extrinsics). The relative translation between views produces parallax, and that parallax is normalized by the depth $d$, which is why $d$ sits in the denominator.
In short: using the camera geometry (intrinsics and extrinsics) and a hypothesis depth $d$, the differentiable homography aligns the source-view feature map onto the reference view's plane at depth $d$, providing geometrically consistent input for the subsequent multi-view feature fusion (cost volume construction).
(2)3D Cost Volumes in Stereo Matching

Under a hypothesized disparity $d$, the pixel in the right image that corresponds to position $x_l$ in the left image has $x$-coordinate $x_l - d$.
For rectified stereo pairs, this linear relationship between disparity and horizontal displacement gives the left-right feature correspondence directly, so the 3D cost volume can be built efficiently.
Unlike MVS, no full homography is needed: the warp is simply a horizontal shift by the disparity.
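When the disparity hypotheses are the uniform integers $0, 1, \dots, D-1$, this horizontal-shift construction reduces to the classic PSMNet-style concatenation cost volume. Below is a minimal sketch of that uniform case (function name and shapes are illustrative, not from the repo), shown here for contrast with the grid_sample-based construction CasStereo needs once the hypotheses become per-pixel and non-uniform (Section 4.4):

```python
import torch

def build_shift_cost_volume(left_fea, right_fea, max_disp):
    """Classic concatenation cost volume for uniform integer disparities 0..max_disp-1."""
    B, C, H, W = left_fea.shape
    cost = left_fea.new_zeros(B, 2 * C, max_disp, H, W)
    for d in range(max_disp):
        if d > 0:
            # left pixel x corresponds to right pixel x - d, so the right features are shifted by d
            cost[:, :C, d, :, d:] = left_fea[:, :, :, d:]
            cost[:, C:, d, :, d:] = right_fea[:, :, :, :-d]
        else:
            cost[:, :C, d] = left_fea
            cost[:, C:, d] = right_fea
    return cost  # (B, 2C, D, H, W)
```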
4.2、Feature Pyramid
The feature pyramid produces features at {1/16, 1/4, 1} of the input resolution (the three-stage setting in the paper).
Take Scene Flow as an example: the left and right input images have size [1, 3, 256, 512].
Note that the stereo code uses only two stages.
refimg_msfea = self.feature_extraction(left) # torch.Size([1, 3, 256, 512])
targetimg_msfea = self.feature_extraction(right)
The feature pyramid feature_extraction that extracts these features is as follows:
def forward(self, x):
    output_s1 = self.firstconv_a(x)  # torch.Size([1, 3, 256, 512]) -> torch.Size([1, 32, 256, 512])
    output = self.firstconv_b(output_s1)  # 1/2 torch.Size([1, 32, 128, 256])
    output_s2 = self.layer1(output)  # 1/2 torch.Size([1, 32, 128, 256])
    output_raw = self.layer2(output_s2)  # 1/4 torch.Size([1, 64, 64, 128])
    output = self.layer3(output_raw)  # 1/4 torch.Size([1, 128, 64, 128])
    output_skip = self.layer4(output)  # 1/4 torch.Size([1, 128, 64, 128])

    output_branch1 = self.branch1(output_skip)  # torch.Size([1, 32, 1, 2])
    output_branch1 = F.upsample(output_branch1, (output_skip.size()[2], output_skip.size()[3]), mode='bilinear', align_corners=Align_Corners)  # torch.Size([1, 32, 64, 128])

    output_branch2 = self.branch2(output_skip)  # torch.Size([1, 32, 2, 4])
    output_branch2 = F.upsample(output_branch2, (output_skip.size()[2], output_skip.size()[3]), mode='bilinear', align_corners=Align_Corners)  # torch.Size([1, 32, 64, 128])

    output_branch3 = self.branch3(output_skip)  # torch.Size([1, 32, 4, 8])
    output_branch3 = F.upsample(output_branch3, (output_skip.size()[2], output_skip.size()[3]), mode='bilinear', align_corners=Align_Corners)  # torch.Size([1, 32, 64, 128])

    output_branch4 = self.branch4(output_skip)  # torch.Size([1, 32, 8, 16])
    output_branch4 = F.upsample(output_branch4, (output_skip.size()[2], output_skip.size()[3]), mode='bilinear', align_corners=Align_Corners)  # torch.Size([1, 32, 64, 128])

    output_feature = torch.cat((output_raw, output_skip, output_branch4, output_branch3, output_branch2, output_branch1), 1)  # torch.Size([1, 320, 64, 128])

    output_msfeat = {}
    output_feature = self.inner0(output_feature)  # torch.Size([1, 32, 64, 128])
    out = self.lastconv(output_feature)  # torch.Size([1, 32, 64, 128])
    output_msfeat["stage1"] = out

    intra_feat = output_feature
    if self.arch_mode == "fpn":
        if self.num_stage == 3:
            intra_feat = F.interpolate(intra_feat, scale_factor=2, mode="nearest") + self.inner1(output_s2)
            out = self.out2(intra_feat)
            output_msfeat["stage2"] = out

            intra_feat = F.interpolate(intra_feat, scale_factor=2, mode="nearest") + self.inner2(output_s1)
            out = self.out3(intra_feat)
            output_msfeat["stage3"] = out
        elif self.num_stage == 2:
            intra_feat = F.interpolate(intra_feat, scale_factor=2, mode="nearest") + self.inner1(output_s2)  # 1/2 torch.Size([1, 32, 128, 256])
            out = self.out2(intra_feat)  # torch.Size([1, 16, 128, 256])
            output_msfeat["stage2"] = out
    return output_msfeat
After the feature pyramid is built, all the features are concatenated together.
The extracted stage1 feature: torch.Size([1, 32, 64, 128]), i.e. 1/4 resolution.
The extracted stage2 feature: torch.Size([1, 16, 128, 256]), i.e. 1/2 resolution.
4.3、Cascade Cost Volume

A finer plane interval (a smaller depth/disparity sampling step, i.e. more hypothesis planes within the same range) is likely to improve the reconstruction accuracy.
For example:
Depth range: [200 mm, 1000 mm]
With 8 sampled planes, the plane interval ≈ (1000 − 200) / 7 ≈ 114 mm
With 64 sampled planes, the plane interval ≈ (1000 − 200) / 63 ≈ 12.7 mm
A smaller plane interval → denser sampling → finer search.
(1)Hypothesis Range

e.g., 0–192
(2)Hypothesis Plane Interval
The depth (disparity) interval is set to 4, 2, and 1 times the finest interval at the three stages.
In the code, the two-stage default is parser.add_argument('--disp_inter_r', type=str, default="4,1", help='disp_intervals_ratio').
At stage 1, 48 × 4 = 192 covers the full disparity range; a neat design, truly coarse to fine.
(3)Number of Hypothesis Planes
This is the number of hypothesis planes D.
The hypothesis range divided by the hypothesis plane interval gives the number of hypothesis planes.

With 3 stages, the number of depth hypotheses is 48, 32, and 8.
In the code, the two-stage default is parser.add_argument('--ndisps', type=str, default="48,24", help='ndisps').
(4)Spatial Resolution
The spatial resolution gradually increases and is set to 1/16, 1/4, and 1 of the original input image size.
(5)Warping Operation
MVS

SM

At each stage, get_disp_range_samples is called to obtain the new disparity range:
for stage_idx in range(self.num_stage):
    # print("*********************stage{}*********************".format(stage_idx + 1))
    if pred is not None:
        if self.grad_method == "detach":
            cur_disp = pred.detach()
        else:
            cur_disp = pred
    disp_range_samples = get_disp_range_samples(cur_disp=cur_disp, ndisp=self.ndisps[stage_idx],
                                                disp_inteval_pixel=self.disp_interval_pixel[stage_idx],
                                                dtype=left.dtype,
                                                device=left.device,
                                                shape=[left.shape[0], left.shape[2], left.shape[3]],
                                                max_disp=self.maxdisp,
                                                using_ns=self.using_ns,
                                                ns_size=self.ns_size)  # torch.Size([1, 48, 256, 512])

    stage_scale = self.stage_infos["stage{}".format(stage_idx + 1)]["scale"]
    refimg_fea, targetimg_fea = refimg_msfea["stage{}".format(stage_idx + 1)], \
                                targetimg_msfea["stage{}".format(stage_idx + 1)]
The details of get_disp_range_samples are as follows:
def get_disp_range_samples(cur_disp, ndisp, disp_inteval_pixel, device, dtype, shape, using_ns, ns_size, max_disp=192.0):
    # shape: (B, H, W)
    # cur_disp: (B, H, W) or float
    # return disp_range_values: (B, D, H, W)
    # with torch.no_grad():
    if cur_disp is None:
        # First stage: no previous prediction, so create an all-zero disparity map of
        # shape (B, H, W), i.e. take disparity 0 as the center hypothesis at every pixel.
        cur_disp = torch.tensor(0, device=device, dtype=dtype, requires_grad=False).reshape(1, 1, 1).repeat(*shape)  # torch.Size([1, 256, 512])

        # Per-pixel minimum disparity: expand ndisp/2 intervals below cur_disp.
        cur_disp_min = (cur_disp - ndisp / 2 * disp_inteval_pixel).clamp(min=0.0)  # (B, H, W)
        # Per-pixel maximum disparity: (ndisp - 1) intervals above cur_disp_min.
        cur_disp_max = (cur_disp_min + (ndisp - 1) * disp_inteval_pixel).clamp(max=max_disp)  # (B, H, W)

        # Recompute the interval actually used (it can be slightly smaller than
        # disp_inteval_pixel because of the clamping) so that ndisp samples
        # evenly cover [min, max].
        new_interval = (cur_disp_max - cur_disp_min) / (ndisp - 1)  # (B, H, W)

        disp_range_volume = cur_disp_min.unsqueeze(1) + (torch.arange(0, ndisp, device=cur_disp.device,
                                                                      dtype=cur_disp.dtype,
                                                                      requires_grad=False).reshape(1, -1, 1, 1) * new_interval.unsqueeze(1))
        # (B, 1, H, W) + (1, D, 1, 1) * (B, 1, H, W) = (B, D, H, W)
    else:
        disp_range_volume = get_cur_disp_range_samples(cur_disp, ndisp, disp_inteval_pixel, shape, ns_size, using_ns, max_disp)

    return disp_range_volume
torch.arange(0, ndisp) → generates the indices [0, 1, ..., ndisp-1]
reshape(1, -1, 1, 1) → reshapes them to (1, D, 1, 1) for broadcasting
new_interval.unsqueeze(1) → (B, 1, H, W)
Their product gives the offsets, which are added to cur_disp_min.
The final disp_range_volume has shape (B, D, H, W), where disp_range_volume[:, d, h, w] is the value of the d-th hypothesized disparity at position (h, w).
Stage 1: D = 48, with a plane interval of 4.
Stage 2: D = 24, with a plane interval of 1.
From the second stage on, get_cur_disp_range_samples is called to obtain the new disparity range around the previous prediction:
def get_cur_disp_range_samples(cur_disp, ndisp, disp_inteval_pixel, shape, ns_size, using_ns=False, max_disp=192.0):
    # shape: (B, H, W)
    # cur_disp: (B, H, W)
    # return disp_range_samples: (B, D, H, W)
    if not using_ns:
        cur_disp_min = (cur_disp - ndisp / 2 * disp_inteval_pixel)  # (B, H, W)
        cur_disp_max = (cur_disp + ndisp / 2 * disp_inteval_pixel)
        # cur_disp_min = (cur_disp - ndisp / 2 * disp_inteval_pixel).clamp(min=0.0) #(B, H, W)
        # cur_disp_max = (cur_disp_min + (ndisp - 1) * disp_inteval_pixel).clamp(max=max_disp)

        assert cur_disp.shape == torch.Size(shape), "cur_disp:{}, input shape:{}".format(cur_disp.shape, shape)
        new_interval = (cur_disp_max - cur_disp_min) / (ndisp - 1)  # (B, H, W)

        disp_range_samples = cur_disp_min.unsqueeze(1) + (torch.arange(0, ndisp, device=cur_disp.device,
                                                                       dtype=cur_disp.dtype,
                                                                       requires_grad=False).reshape(1, -1, 1, 1) * new_interval.unsqueeze(1))
    # ... (the neighbour-sampling branch and the return statement are omitted in this excerpt)
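As a quick numeric sanity check (toy values, not from the repo), here is what the non-neighbour-sampling branch above produces for a single pixel whose stage-1 prediction is 30.5, using the stage-2 settings (24 hypotheses, interval ratio 1):

```python
import torch

cur_disp = torch.tensor(30.5)      # stage-1 prediction at one pixel
ndisp, interval = 24, 1.0          # stage-2 settings

disp_min = cur_disp - ndisp / 2 * interval               # 18.5
disp_max = cur_disp + ndisp / 2 * interval               # 42.5
new_interval = (disp_max - disp_min) / (ndisp - 1)       # ~1.04
samples = disp_min + torch.arange(ndisp) * new_interval  # 18.5, 19.54, ..., 42.5
print(samples)  # 24 hypotheses tightly centered on the stage-1 prediction
```

The search range shrinks from the full 192 pixels at stage 1 to roughly 24 pixels around the previous prediction, which is exactly where the saved computation goes.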
4.4、Cost Volume Construction
The cost volume is constructed at every stage; the core function is self.get_cv:
cost = self.get_cv(refimg_fea, targetimg_fea,
                   disp_range_samples=F.interpolate((disp_range_samples / stage_scale).unsqueeze(1),
                                                    [self.ndisps[stage_idx] // int(stage_scale),
                                                     left.size()[2] // int(stage_scale),
                                                     left.size()[3] // int(stage_scale)],
                                                    mode='trilinear',
                                                    align_corners=Align_Corners_Range).squeeze(1),
                   ndisp=self.ndisps[stage_idx] // int(stage_scale))  # torch.Size([1, 64, 12, 64, 128])
The core idea of self.get_cv is to warp (resample) the right-image features according to the hypothesized disparities so that they align with the left-image features, and then concatenate the two to form the cost volume.
The code:
class GetCostVolume(nn.Module):
    def __init__(self):
        super(GetCostVolume, self).__init__()

    def forward(self, x, y, disp_range_samples, ndisp):
        assert (x.is_contiguous() == True)
        bs, channels, height, width = x.size()  # torch.Size([1, 32, 64, 128])
        cost = x.new().resize_(bs, channels * 2, ndisp, height, width).zero_()  # torch.Size([1, 64, 12, 64, 128])
        # cost = y.unsqueeze(2).repeat(1, 2, ndisp, 1, 1) #(B, 2C, D, H, W)

        # Build a standard image coordinate grid of size (H, W)
        mh, mw = torch.meshgrid([torch.arange(0, height, dtype=x.dtype, device=x.device),
                                 torch.arange(0, width, dtype=x.dtype, device=x.device)])
        # Expand the grid to the batch and disparity dimensions: (B, D, H, W)
        mh = mh.reshape(1, 1, height, width).repeat(bs, ndisp, 1, 1)  # torch.Size([1, 12, 64, 128])
        mw = mw.reshape(1, 1, height, width).repeat(bs, ndisp, 1, 1)  # torch.Size([1, 12, 64, 128])

        cur_disp_coords_y = mh  # torch.Size([1, 12, 64, 128])
        # Sampling x-coordinates in the right image: shift left by the hypothesized disparity
        cur_disp_coords_x = mw - disp_range_samples  # torch.Size([1, 12, 64, 128])

        # Normalize to [-1, 1], the coordinate format required by F.grid_sample
        # (top-left = (-1, -1), bottom-right = (1, 1))
        coords_x = cur_disp_coords_x / ((width - 1.0) / 2.0) - 1.0  # torch.Size([1, 12, 64, 128])
        coords_y = cur_disp_coords_y / ((height - 1.0) / 2.0) - 1.0  # torch.Size([1, 12, 64, 128])
        grid = torch.stack([coords_x, coords_y], dim=4)  # (B, D, H, W, 2) torch.Size([1, 12, 64, 128, 2])

        # Differentiably sample the right-image features and store them in the
        # second half of the cost volume ([:, C:, ...])
        cost[:, x.size()[1]:, :, :, :] = F.grid_sample(y, grid.view(bs, ndisp * height, width, 2), mode='bilinear',
                                                       padding_mode='zeros').view(bs, channels, ndisp, height, width)
        # a little difference, no zeros filling
        tmp = x.unsqueeze(2).repeat(1, 1, ndisp, 1, 1)  # (B, C, D, H, W)
        # tmp = tmp.transpose(0, 1) #(C, B, D, H, W)
        # #x1 = x2 + d >= d
        # tmp[:, mw < disp_range_samples] = 0
        # tmp = tmp.transpose(0, 1) #(B, C, D, H, W)
        # Store the replicated left-image features in the first half ([:, :C, ...])
        cost[:, :x.size()[1], :, :, :] = tmp

        return cost
Essentially this is a concatenation-based construction: the output cost volume contains the left-image features plus the warped right-image features. To decide whether disparity d matches at left position (h, w), the right-image feature at (h, w − d) is compared against the left-image feature at (h, w).
The warped right features are obtained by resampling the right feature map according to the disparity range samples, which is why Figure 2 applies the warp right at the start, unlike PCW-Net.
Also note how the cost volume is built: because the disparity range is no longer uniform, the construction is implemented with grid_sample rather than slicing.

4.5、Cost Volume Aggregation
The paper barely describes the details of this part; only after reading the code does it become clear that it also consists of a pre-hourglass block followed by three hourglass modules.
pred0, pred1, pred2, pred3 = self.cost_agg[stage_idx](cost,
                                                      FineD=self.ndisps[stage_idx],  # 48
                                                      FineH=left.shape[2],  # 256
                                                      FineW=left.shape[3],  # 512
                                                      disp_range_samples=disp_range_samples)
The implementation of self.cost_agg is as follows:
class CostAggregation(nn.Module):
    def __init__(self, in_channels, base_channels=32):
        super(CostAggregation, self).__init__()
        self.dres0 = nn.Sequential(convbn_3d(in_channels, base_channels, 3, 1, 1),
                                   nn.ReLU(inplace=True),
                                   convbn_3d(base_channels, base_channels, 3, 1, 1),
                                   nn.ReLU(inplace=True))

        self.dres1 = nn.Sequential(convbn_3d(base_channels, base_channels, 3, 1, 1),
                                   nn.ReLU(inplace=True),
                                   convbn_3d(base_channels, base_channels, 3, 1, 1))

        self.dres2 = hourglass(base_channels)
        self.dres3 = hourglass(base_channels)
        self.dres4 = hourglass(base_channels)

        self.classif0 = nn.Sequential(convbn_3d(base_channels, base_channels, 3, 1, 1),
                                      nn.ReLU(inplace=True),
                                      nn.Conv3d(base_channels, 1, kernel_size=3, padding=1, stride=1, bias=False))
        self.classif1 = nn.Sequential(convbn_3d(base_channels, base_channels, 3, 1, 1),
                                      nn.ReLU(inplace=True),
                                      nn.Conv3d(base_channels, 1, kernel_size=3, padding=1, stride=1, bias=False))
        self.classif2 = nn.Sequential(convbn_3d(base_channels, base_channels, 3, 1, 1),
                                      nn.ReLU(inplace=True),
                                      nn.Conv3d(base_channels, 1, kernel_size=3, padding=1, stride=1, bias=False))
        self.classif3 = nn.Sequential(convbn_3d(base_channels, base_channels, 3, 1, 1),
                                      nn.ReLU(inplace=True),
                                      nn.Conv3d(base_channels, 1, kernel_size=3, padding=1, stride=1, bias=False))

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.Conv3d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.kernel_size[2] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm3d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
            elif isinstance(m, nn.Linear):
                m.bias.data.zero_()

    def forward(self, cost, FineD, FineH, FineW, disp_range_samples):
        cost0 = self.dres0(cost)  # torch.Size([1, 64, 12, 64, 128]) -> torch.Size([1, 32, 12, 64, 128])
        cost0 = self.dres1(cost0) + cost0  # torch.Size([1, 32, 12, 64, 128])

        out1 = self.dres2(cost0)  # torch.Size([1, 32, 12, 64, 128])
        out2 = self.dres3(out1)  # torch.Size([1, 32, 12, 64, 128])
        out3 = self.dres4(out2)  # torch.Size([1, 32, 12, 64, 128])

        cost3 = self.classif3(out3)  # torch.Size([1, 1, 12, 64, 128])

        if self.training:
            cost0 = self.classif0(cost0)  # torch.Size([1, 1, 12, 64, 128])
            cost1 = self.classif1(out1)  # torch.Size([1, 1, 12, 64, 128])
            cost2 = self.classif2(out2)  # torch.Size([1, 1, 12, 64, 128])

            cost0 = F.upsample(cost0, [FineD, FineH, FineW], mode='trilinear',
                               align_corners=Align_Corners)  # torch.Size([1, 1, 48, 256, 512])
            cost1 = F.upsample(cost1, [FineD, FineH, FineW], mode='trilinear',
                               align_corners=Align_Corners)  # torch.Size([1, 1, 48, 256, 512])
            cost2 = F.upsample(cost2, [FineD, FineH, FineW], mode='trilinear',
                               align_corners=Align_Corners)  # torch.Size([1, 1, 48, 256, 512])

            cost0 = torch.squeeze(cost0, 1)
            pred0 = F.softmax(cost0, dim=1)
            # Note: the disparities used for regression are disp_range_samples,
            # not simply range(max_disparity)
            pred0 = disparity_regression(pred0, disp_range_samples)

            cost1 = torch.squeeze(cost1, 1)
            pred1 = F.softmax(cost1, dim=1)
            pred1 = disparity_regression(pred1, disp_range_samples)

            cost2 = torch.squeeze(cost2, 1)
            pred2 = F.softmax(cost2, dim=1)
            pred2 = disparity_regression(pred2, disp_range_samples)

        cost3 = F.upsample(cost3, [FineD, FineH, FineW], mode='trilinear', align_corners=Align_Corners)
        cost3 = torch.squeeze(cost3, 1)
        pred3_prob = F.softmax(cost3, dim=1)
        # For your information: This formulation 'softmax(c)' learned "similarity"
        # while 'softmax(-c)' learned 'matching cost' as mentioned in the paper.
        # However, 'c' or '-c' do not affect the performance because feature-based cost volume provided flexibility.
        pred3 = disparity_regression(pred3_prob, disp_range_samples)

        if self.training:
            return pred0, pred1, pred2, pred3
        else:
            return pred3
Note in particular that in disparity_regression the disparity values d are not a uniform range but the new disparity range samples.
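disparity_regression itself is not shown in the excerpt; a minimal sketch of what it must compute, a soft-argmin over the per-pixel hypothesis values rather than over a fixed 0..D-1 range (assumed here, not copied from the repo), would be:

```python
import torch

def disparity_regression(prob, disp_range_samples):
    # prob:               (B, D, H, W), softmax weights over the D hypotheses
    # disp_range_samples: (B, D, H, W), per-pixel disparity value of each hypothesis
    # Expected (soft-argmin) disparity, shape (B, H, W)
    return torch.sum(prob * disp_range_samples, dim=1)
```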
4.6、Loss Function

There are N = 3 stages in total, and every stage is supervised.
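(The loss figure is missing here; per the paper, the total loss is simply a weighted sum of the per-stage supervised losses, roughly:)

$$L_{total} = \sum_{k=1}^{N} \lambda_k \, L_k$$

where $L_k$ is the loss of stage $k$ and $\lambda_k$ its weight (dlossw in the code below).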
def stereo_psmnet_loss(inputs, target, mask, **kwargs):
    disp_loss_weights = kwargs.get("dlossw", None)

    total_loss = torch.tensor(0.0, dtype=target.dtype, device=target.device, requires_grad=False)

    for (stage_inputs, stage_key) in [(inputs[k], k) for k in inputs.keys() if "stage" in k]:
        disp0, disp1, disp2, disp3 = stage_inputs["pred0"], stage_inputs["pred1"], stage_inputs["pred2"], stage_inputs["pred3"]

        loss = 0.5 * F.smooth_l1_loss(disp0[mask], target[mask], reduction='mean') + \
               0.5 * F.smooth_l1_loss(disp1[mask], target[mask], reduction='mean') + \
               0.7 * F.smooth_l1_loss(disp2[mask], target[mask], reduction='mean') + \
               1.0 * F.smooth_l1_loss(disp3[mask], target[mask], reduction='mean')

        if disp_loss_weights is not None:
            stage_idx = int(stage_key.replace("stage", "")) - 1
            total_loss += disp_loss_weights[stage_idx] * loss
        else:
            total_loss += loss

    return total_loss
Iterate over every stage and compute the smooth L1 loss.
Each stage has 4 predictions, corresponding to:
- the output of the pre-hourglass block in the cost aggregation module
- the disparity map predicted after the first hourglass
- the disparity map predicted after the second hourglass
- the disparity map predicted after the third hourglass
5、Experiments
MVSNet+Ours.
8 Nvidia GTX 1080Ti GPUs, batch size 16
5.1、Datasets and Metrics
- Middlebury
- KITTI 2015
- Scene Flow
Evaluation metrics
- MVS: Acc. (mm), Comp. (mm), Overall (mm), Rank, Mean
- SM: EPE, D1
5.2、Multi-view stereo
Benchmark Performance

The method's results are not the absolute best, but as an improvement built on MVSNet the gain is clear.

generates more complete point clouds with finer details




The example in the top row is not well chosen; are the predictions on the left not all wrong?
5.3、Stereo Matching
Benchmark Performance

Plugging the proposed method into PSMNet, GwcNet, and GANet improves all of their results, while using less memory.

After adopting the proposed method, GwcNet rises from 29th to 17th on the leaderboard (as of Nov. 5, 2019).

On the Middlebury benchmark, PSMNet+Ours ranks 37th on the avgerr metric (as of Feb. 7, 2020).
5.4、Ablation Study
Cascade Stage Number and Parameter Sharing in Cost Volume Regularization

separate parameter learning of the cascade cost volumes at different stages further improves the accuracy.
Spatial Resolution and Feature Pyramid

5.5、Runtime and GPU Memory


6、Conclusion(own) / Future work
- GPU memory and computationally efficient cascade cost volume
- coarse to fine,ensures the computation and memory resources are spent on more meaningful regions.
- The proposed cascade structure plugs into existing methods directly, without changing their backbones.
- Hypothesis plane interval: 48 × 4 = 192, so stage 1 still covers the full range.
- Left features + warped right features explicitly encode the geometric constraint.
- To sample an image/feature map at arbitrary coordinates in a differentiable way, so that the operation can be embedded in an end-to-end trainable network, use torch.nn.functional.grid_sample; typical uses include optical-flow warping, right-image alignment in stereo matching, STN, and NeRF rendering. A minimal sketch follows below.
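A minimal, self-contained illustration (toy tensors, not from the repo) of differentiable sampling with grid_sample:

```python
import torch
import torch.nn.functional as F

feat = torch.randn(1, 32, 64, 128, requires_grad=True)  # (B, C, H, W) feature map
# Sampling locations in normalized coordinates: (-1, -1) = top-left, (1, 1) = bottom-right
grid = torch.rand(1, 64, 128, 2) * 2 - 1                 # (B, H_out, W_out, 2)

warped = F.grid_sample(feat, grid, mode='bilinear',
                       padding_mode='zeros', align_corners=True)  # (B, C, H_out, W_out)
warped.sum().backward()  # gradients flow back through the bilinear sampling to feat
```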
For more paper-reading notes, see 【Paper Reading】.




