
CVPR-2020
https://github.com/alibaba/cascade-stereo
1、Background and Motivation
In recent years, learning-based multi-view stereo (MVS) and binocular stereo matching have made remarkable progress. Mainstream methods generally adopt a 3D cost volume: feature maps are warped onto a series of hypothesis depth (or disparity) planes to build a 3D tensor, which is then regularized with 3D convolutions to regress the final depth/disparity map.
However, this design has a fundamental bottleneck: computation and memory consumption grow cubically with the cost volume resolution. This severely limits the ability to handle high-resolution images; most methods have to heavily downsample the feature maps (e.g., to 1/4 or 1/8), build cost volumes at a lower resolution, and rely on upsampling or post-refinement to output a high-resolution result, which sacrifices accuracy.
Since a single cost volume can hardly be both high-resolution and efficient, can we borrow the classic coarse-to-fine strategy from traditional computer vision?
The authors propose CasStereo, a memory- and time-efficient cost volume formulation. The core idea is to split a single expensive cost volume into several cascade stages: each stage narrows the depth (or disparity) range based on the prediction from the previous stage, refines the plane interval, and increases the spatial resolution, achieving higher accuracy at lower cost.

2、Related Work
(1)Stereo Matching
Steps of the traditional pipeline
- matching cost calculation
- matching cost aggregation
- disparity calculation
- disparity refinement
Categories of traditional methods
- Global methods(energy function)
- Local methods
Subsequent CNN-based methods
- GC-Net
- PSMNet
- GwcNet
- HSM
- EMCUA
- GANet
These CNN-based methods are limited to downsampled cost volumes and rely on interpolation operations to generate high-resolution disparity.
(2)Multi-View Stereo
- volumetric methods
- point cloud based methods
- depth map reconstruction methods
(3)High-Resolution Output in Stereo and MVS
MVSNet
Yao Y, Luo Z, Li S, et al. MVSNet: Depth Inference for Unstructured Multi-View Stereo. In Proceedings of the European Conference on Computer Vision (ECCV), 2018: 767-783.
3、Advantages / Contributions
Proposes the cascade cost volume, a general, plug-and-play, and efficient cost volume construction paradigm;
Enables genuinely high-resolution inference: achieves state-of-the-art results on DTU with 1152×864 inputs (and also ranks first on the Tanks and Temples benchmark);
Significantly reduces resource consumption: compared with MVSNet, GPU memory drops by 50.6% and run-time by 59.3%;
Broad compatibility: successfully applied as a plug-and-play module to MVSNet, PSMNet, GwcNet, GANet, and other architectures.
4、Method

Coarse to fine: this design ensures the computation and memory resources are spent on more meaningful regions.
4.1、Cost volume Formulation
Constructing a 3D cost volume requires three major steps:
- Feature Extraction
- Warping / Feature Alignment
- Cost Aggregation
Here, warping means projecting a source view's feature map onto the reference view at a hypothesized depth plane, using a homography derived from the camera intrinsics and extrinsics and the hypothesis depth value. The warp is differentiable (usually implemented with bilinear interpolation), so it can be embedded in an end-to-end trainable network.
we warp the extracted 2D features of each view to the hypothesis planes and construct the feature volumes, which are finally fused together to build the 3D cost volume.
(1)Cost Volume Formulation in MVS
The projection relation from the reference view to the $i$-th source view can be described by the homography matrix $H_i(d)$:
Reference image: the image being reconstructed; the final depth map (or disparity map) is defined with respect to this image.
Source images: the remaining input images, which provide additional viewpoints to help infer the depth of each pixel in the reference image.
The corresponding formula is:
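(The original formula image is missing here; the following is the standard MVSNet-style differentiable homography, reconstructed from the symbol definitions below, so treat it as a sketch rather than a verbatim copy of the paper's equation.)

$$H_i(d) = K_i \cdot R_i \cdot \left( I - \frac{(t_1 - t_i)\, n_1^{T}}{d} \right) \cdot R_1^{T} \cdot K_1^{-1}$$

where the subscript $1$ denotes the reference camera and the subscript $i$ the $i$-th source camera.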

- $i^{th}$: index of the source view.
- $n_1$ denotes the principal axis of the reference camera; $n_1 = [0, 0, 1]^{T}$ is the normal of the fronto-parallel hypothesis planes (the planes are assumed parallel to the image plane).
- $K$: intrinsics; $R$: rotation (extrinsics); $t$: translation (extrinsics). The relative translation between views produces parallax, and that parallax is normalized by the depth $d$, which is why $d$ sits in the denominator.
In short: using the camera geometry (intrinsics and extrinsics) and a hypothesis depth $d$, the differentiable homography aligns the source-view feature map onto the reference view's plane at depth $d$, providing geometrically consistent input for the subsequent multi-view feature fusion (cost volume construction).
(2)3D Cost Volumes in Stereo Matching

Under a hypothesized disparity $d$, the pixel in the right image that corresponds to position $x_l$ in the left image has $x$-coordinate $x_l - d$.
For rectified stereo pairs, this linear relationship between disparity and horizontal displacement gives the left-right feature correspondence directly, so the 3D cost volume can be built efficiently.
Unlike MVS, no full homography is needed: the warp is simply a horizontal shift by the disparity.
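When the disparity hypotheses are the uniform integers $0, 1, \dots, D-1$, this horizontal-shift construction reduces to the classic PSMNet-style concatenation cost volume. Below is a minimal sketch of that uniform case (function name and shapes are illustrative, not from the repo), shown here for contrast with the grid_sample-based construction CasStereo needs once the hypotheses become per-pixel and non-uniform (Section 4.4):

```python
import torch

def build_shift_cost_volume(left_fea, right_fea, max_disp):
    """Classic concatenation cost volume for uniform integer disparities 0..max_disp-1."""
    B, C, H, W = left_fea.shape
    cost = left_fea.new_zeros(B, 2 * C, max_disp, H, W)
    for d in range(max_disp):
        if d > 0:
            # left pixel x corresponds to right pixel x - d, so the right features are shifted by d
            cost[:, :C, d, :, d:] = left_fea[:, :, :, d:]
            cost[:, C:, d, :, d:] = right_fea[:, :, :, :-d]
        else:
            cost[:, :C, d] = left_fea
            cost[:, C:, d] = right_fea
    return cost  # (B, 2C, D, H, W)
```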
4.2、Feature Pyramid
The feature pyramid produces features at {1/16, 1/4, 1} of the input resolution (the three-stage setting in the paper).
Take Scene Flow as an example: the left and right input images have size [1, 3, 256, 512].
Note that the stereo code uses only two stages.
refimg_msfea = self.feature_extraction(left) # torch.Size([1, 3, 256, 512])
targetimg_msfea = self.feature_extraction(right)
The feature pyramid feature_extraction that extracts these features is as follows:
def forward(self, x):
    output_s1 = self.firstconv_a(x)  # torch.Size([1, 3, 256, 512]) -> torch.Size([1, 32, 256, 512])
    output = self.firstconv_b(output_s1)  # 1/2 torch.Size([1, 32, 128, 256])
    output_s2 = self.layer1(output)  # 1/2 torch.Size([1, 32, 128, 256])
    output_raw = self.layer2(output_s2)  # 1/4 torch.Size([1, 64, 64, 128])
    output = self.layer3(output_raw)  # 1/4 torch.Size([1, 128, 64, 128])
    output_skip = self.layer4(output)  # 1/4 torch.Size([1, 128, 64, 128])

    output_branch1 = self.branch1(output_skip)  # torch.Size([1, 32, 1, 2])
    output_branch1 = F.upsample(output_branch1, (output_skip.size()[2], output_skip.size()[3]), mode='bilinear', align_corners=Align_Corners)  # torch.Size([1, 32, 64, 128])

    output_branch2 = self.branch2(output_skip)  # torch.Size([1, 32, 2, 4])
    output_branch2 = F.upsample(output_branch2, (output_skip.size()[2], output_skip.size()[3]), mode='bilinear', align_corners=Align_Corners)  # torch.Size([1, 32, 64, 128])

    output_branch3 = self.branch3(output_skip)  # torch.Size([1, 32, 4, 8])
    output_branch3 = F.upsample(output_branch3, (output_skip.size()[2], output_skip.size()[3]), mode='bilinear', align_corners=Align_Corners)  # torch.Size([1, 32, 64, 128])

    output_branch4 = self.branch4(output_skip)  # torch.Size([1, 32, 8, 16])
    output_branch4 = F.upsample(output_branch4, (output_skip.size()[2], output_skip.size()[3]), mode='bilinear', align_corners=Align_Corners)  # torch.Size([1, 32, 64, 128])

    output_feature = torch.cat((output_raw, output_skip, output_branch4, output_branch3, output_branch2, output_branch1), 1)  # torch.Size([1, 320, 64, 128])

    output_msfeat = {}
    output_feature = self.inner0(output_feature)  # torch.Size([1, 32, 64, 128])
    out = self.lastconv(output_feature)  # torch.Size([1, 32, 64, 128])
    output_msfeat["stage1"] = out

    intra_feat = output_feature
    if self.arch_mode == "fpn":
        if self.num_stage == 3:
            intra_feat = F.interpolate(intra_feat, scale_factor=2, mode="nearest") + self.inner1(output_s2)
            out = self.out2(intra_feat)
            output_msfeat["stage2"] = out

            intra_feat = F.interpolate(intra_feat, scale_factor=2, mode="nearest") + self.inner2(output_s1)
            out = self.out3(intra_feat)
            output_msfeat["stage3"] = out
        elif self.num_stage == 2:
            intra_feat = F.interpolate(intra_feat, scale_factor=2, mode="nearest") + self.inner1(output_s2)  # 1/2 torch.Size([1, 32, 128, 256])
            out = self.out2(intra_feat)  # torch.Size([1, 16, 128, 256])
            output_msfeat["stage2"] = out
    return output_msfeat
After the feature pyramid is built, all the features are concatenated together.
The extracted stage1 feature: torch.Size([1, 32, 64, 128]), i.e. 1/4 resolution.
The extracted stage2 feature: torch.Size([1, 16, 128, 256]), i.e. 1/2 resolution.
4.3、Cascade Cost Volume

A finer plane interval (a smaller depth/disparity sampling step, i.e. more hypothesis planes within the same range) is likely to improve the reconstruction accuracy.
For example:
Depth range: [200 mm, 1000 mm]
With 8 sampled planes, the plane interval ≈ (1000 − 200) / 7 ≈ 114 mm
With 64 sampled planes, the plane interval ≈ (1000 − 200) / 63 ≈ 12.7 mm
A smaller plane interval → denser sampling → finer search.
(1)Hypothesis Range

e.g., 0–192
(2)Hypothesis Plane Interval
The depth (disparity) interval is set to 4, 2, and 1 times the finest interval at the three stages.
In the code, the two-stage default is parser.add_argument('--disp_inter_r', type=str, default="4,1", help='disp_intervals_ratio').
At stage 1, 48 × 4 = 192 covers the full disparity range; a neat design, truly coarse to fine.
(3)Number of Hypothesis Planes
This is the number of hypothesis planes D.
The hypothesis range divided by the hypothesis plane interval gives the number of hypothesis planes.

With 3 stages, the number of depth hypotheses is 48, 32, and 8.
In the code, the two-stage default is parser.add_argument('--ndisps', type=str, default="48,24", help='ndisps').
(4)Spatial Resolution
The spatial resolution gradually increases and is set to 1/16, 1/4, and 1 of the original input image size.
(5)Warping Operation
MVS

SM

At each stage, get_disp_range_samples is called to obtain the new disparity range:
for stage_idx in range(self.num_stage):
    # print("*********************stage{}*********************".format(stage_idx + 1))
    if pred is not None:
        if self.grad_method == "detach":
            cur_disp = pred.detach()
        else:
            cur_disp = pred
    disp_range_samples = get_disp_range_samples(cur_disp=cur_disp, ndisp=self.ndisps[stage_idx],
                                                disp_inteval_pixel=self.disp_interval_pixel[stage_idx],
                                                dtype=left.dtype,
                                                device=left.device,
                                                shape=[left.shape[0], left.shape[2], left.shape[3]],
                                                max_disp=self.maxdisp,
                                                using_ns=self.using_ns,
                                                ns_size=self.ns_size)  # torch.Size([1, 48, 256, 512])

    stage_scale = self.stage_infos["stage{}".format(stage_idx + 1)]["scale"]
    refimg_fea, targetimg_fea = refimg_msfea["stage{}".format(stage_idx + 1)], \
                                targetimg_msfea["stage{}".format(stage_idx + 1)]
The details of get_disp_range_samples are as follows:
def get_disp_range_samples(cur_disp, ndisp, disp_inteval_pixel, device, dtype, shape, using_ns, ns_size, max_disp=192.0):
    # shape: (B, H, W)
    # cur_disp: (B, H, W) or float
    # return disp_range_values: (B, D, H, W)
    # with torch.no_grad():
    if cur_disp is None:
        # First stage: no previous prediction, so create an all-zero disparity map of
        # shape (B, H, W), i.e. take disparity 0 as the center hypothesis at every pixel.
        cur_disp = torch.tensor(0, device=device, dtype=dtype, requires_grad=False).reshape(1, 1, 1).repeat(*shape)  # torch.Size([1, 256, 512])

        # Per-pixel minimum disparity: expand ndisp/2 intervals below cur_disp.
        cur_disp_min = (cur_disp - ndisp / 2 * disp_inteval_pixel).clamp(min=0.0)  # (B, H, W)
        # Per-pixel maximum disparity: (ndisp - 1) intervals above cur_disp_min.
        cur_disp_max = (cur_disp_min + (ndisp - 1) * disp_inteval_pixel).clamp(max=max_disp)  # (B, H, W)

        # Recompute the interval actually used (it can be slightly smaller than
        # disp_inteval_pixel because of the clamping) so that ndisp samples
        # evenly cover [min, max].
        new_interval = (cur_disp_max - cur_disp_min) / (ndisp - 1)  # (B, H, W)

        disp_range_volume = cur_disp_min.unsqueeze(1) + (torch.arange(0, ndisp, device=cur_disp.device,
                                                                      dtype=cur_disp.dtype,
                                                                      requires_grad=False).reshape(1, -1, 1, 1) * new_interval.unsqueeze(1))
        # (B, 1, H, W) + (1, D, 1, 1) * (B, 1, H, W) = (B, D, H, W)
    else:
        disp_range_volume = get_cur_disp_range_samples(cur_disp, ndisp, disp_inteval_pixel, shape, ns_size, using_ns, max_disp)

    return disp_range_volume
torch.arange(0, ndisp) → generates the indices [0, 1, ..., ndisp-1]
reshape(1, -1, 1, 1) → reshapes them to (1, D, 1, 1) for broadcasting
new_interval.unsqueeze(1) → (B, 1, H, W)
Their product gives the offsets, which are added to cur_disp_min.
The final disp_range_volume has shape (B, D, H, W), where disp_range_volume[:, d, h, w] is the value of the d-th hypothesized disparity at position (h, w).
Stage 1: D = 48, with a plane interval of 4.
Stage 2: D = 24, with a plane interval of 1.
From the second stage on, get_cur_disp_range_samples is called to obtain the new disparity range around the previous prediction:
def get_cur_disp_range_samples(cur_disp, ndisp, disp_inteval_pixel, shape, ns_size, using_ns=False, max_disp=192.0):
    # shape: (B, H, W)
    # cur_disp: (B, H, W)
    # return disp_range_samples: (B, D, H, W)
    if not using_ns:
        cur_disp_min = (cur_disp - ndisp / 2 * disp_inteval_pixel)  # (B, H, W)
        cur_disp_max = (cur_disp + ndisp / 2 * disp_inteval_pixel)
        # cur_disp_min = (cur_disp - ndisp / 2 * disp_inteval_pixel).clamp(min=0.0) #(B, H, W)
        # cur_disp_max = (cur_disp_min + (ndisp - 1) * disp_inteval_pixel).clamp(max=max_disp)

        assert cur_disp.shape == torch.Size(shape), "cur_disp:{}, input shape:{}".format(cur_disp.shape, shape)
        new_interval = (cur_disp_max - cur_disp_min) / (ndisp - 1)  # (B, H, W)

        disp_range_samples = cur_disp_min.unsqueeze(1) + (torch.arange(0, ndisp, device=cur_disp.device,
                                                                       dtype=cur_disp.dtype,
                                                                       requires_grad=False).reshape(1, -1, 1, 1) * new_interval.unsqueeze(1))
    # ... (the neighbour-sampling branch and the return statement are omitted in this excerpt)
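As a quick numeric sanity check (toy values, not from the repo), here is what the non-neighbour-sampling branch above produces for a single pixel whose stage-1 prediction is 30.5, using the stage-2 settings (24 hypotheses, interval ratio 1):

```python
import torch

cur_disp = torch.tensor(30.5)      # stage-1 prediction at one pixel
ndisp, interval = 24, 1.0          # stage-2 settings

disp_min = cur_disp - ndisp / 2 * interval               # 18.5
disp_max = cur_disp + ndisp / 2 * interval               # 42.5
new_interval = (disp_max - disp_min) / (ndisp - 1)       # ~1.04
samples = disp_min + torch.arange(ndisp) * new_interval  # 18.5, 19.54, ..., 42.5
print(samples)  # 24 hypotheses tightly centered on the stage-1 prediction
```

The search range shrinks from the full 192 pixels at stage 1 to roughly 24 pixels around the previous prediction, which is exactly where the saved computation goes.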
4.4、Cost Volume Construction
The cost volume is constructed at every stage; the core function is self.get_cv:
cost = self.get_cv(refimg_fea, targetimg_fea,
                   disp_range_samples=F.interpolate((disp_range_samples / stage_scale).unsqueeze(1),
                                                    [self.ndisps[stage_idx] // int(stage_scale),
                                                     left.size()[2] // int(stage_scale),
                                                     left.size()[3] // int(stage_scale)],
                                                    mode='trilinear',
                                                    align_corners=Align_Corners_Range).squeeze(1),
                   ndisp=self.ndisps[stage_idx] // int(stage_scale))  # torch.Size([1, 64, 12, 64, 128])
The core idea of self.get_cv is to warp (resample) the right-image features according to the hypothesized disparities so that they align with the left-image features, and then concatenate the two to form the cost volume.
The code:
class GetCostVolume(nn.Module):
    def __init__(self):
        super(GetCostVolume, self).__init__()

    def forward(self, x, y, disp_range_samples, ndisp):
        assert (x.is_contiguous() == True)
        bs, channels, height, width = x.size()  # torch.Size([1, 32, 64, 128])
        cost = x.new().resize_(bs, channels * 2, ndisp, height, width).zero_()  # torch.Size([1, 64, 12, 64, 128])
        # cost = y.unsqueeze(2).repeat(1, 2, ndisp, 1, 1) #(B, 2C, D, H, W)

        # Build a standard image coordinate grid of size (H, W)
        mh, mw = torch.meshgrid([torch.arange(0, height, dtype=x.dtype, device=x.device),
                                 torch.arange(0, width, dtype=x.dtype, device=x.device)])
        # Expand the grid to the batch and disparity dimensions: (B, D, H, W)
        mh = mh.reshape(1, 1, height, width).repeat(bs, ndisp, 1, 1)  # torch.Size([1, 12, 64, 128])
        mw = mw.reshape(1, 1, height, width).repeat(bs, ndisp, 1, 1)  # torch.Size([1, 12, 64, 128])

        cur_disp_coords_y = mh  # torch.Size([1, 12, 64, 128])
        # Sampling x-coordinates in the right image: shift left by the hypothesized disparity
        cur_disp_coords_x = mw - disp_range_samples  # torch.Size([1, 12, 64, 128])

        # Normalize to [-1, 1], the coordinate format required by F.grid_sample
        # (top-left = (-1, -1), bottom-right = (1, 1))
        coords_x = cur_disp_coords_x / ((width - 1.0) / 2.0) - 1.0  # torch.Size([1, 12, 64, 128])
        coords_y = cur_disp_coords_y / ((height - 1.0) / 2.0) - 1.0  # torch.Size([1, 12, 64, 128])
        grid = torch.stack([coords_x, coords_y], dim=4)  # (B, D, H, W, 2) torch.Size([1, 12, 64, 128, 2])

        # Differentiably sample the right-image features and store them in the
        # second half of the cost volume ([:, C:, ...])
        cost[:, x.size()[1]:, :, :, :] = F.grid_sample(y, grid.view(bs, ndisp * height, width, 2), mode='bilinear',
                                                       padding_mode='zeros').view(bs, channels, ndisp, height, width)
        # a little difference, no zeros filling
        tmp = x.unsqueeze(2).repeat(1, 1, ndisp, 1, 1)  # (B, C, D, H, W)
        # tmp = tmp.transpose(0, 1) #(C, B, D, H, W)
        # #x1 = x2 + d >= d
        # tmp[:, mw < disp_range_samples] = 0
        # tmp = tmp.transpose(0, 1) #(B, C, D, H, W)
        # Store the replicated left-image features in the first half ([:, :C, ...])
        cost[:, :x.size()[1], :, :, :] = tmp

        return cost
Essentially this is a concatenation-based construction: the output cost volume contains the left-image features plus the warped right-image features. To decide whether disparity d matches at left position (h, w), the right-image feature at (h, w − d) is compared against the left-image feature at (h, w).
The warped right features are obtained by resampling the right feature map according to the disparity range samples, which is why Figure 2 applies the warp right at the start, unlike PCW-Net.
Also note how the cost volume is built: because the disparity range is no longer uniform, the construction is implemented with grid_sample rather than slicing.

4.5、Cost Volume Aggregation
The paper barely describes the details of this part; only after reading the code does it become clear that it also consists of a pre-hourglass block followed by three hourglass modules.
pred0, pred1, pred2, pred3 = self.cost_agg[stage_idx](cost,
                                                      FineD=self.ndisps[stage_idx],  # 48
                                                      FineH=left.shape[2],  # 256
                                                      FineW=left.shape[3],  # 512
                                                      disp_range_samples=disp_range_samples)
The implementation of self.cost_agg is as follows:
class CostAggregation(nn.Module):
    def __init__(self, in_channels, base_channels=32):
        super(CostAggregation, self).__init__()
        self.dres0 = nn.Sequential(convbn_3d(in_channels, base_channels, 3, 1, 1),
                                   nn.ReLU(inplace=True),
                                   convbn_3d(base_channels, base_channels, 3, 1, 1),
                                   nn.ReLU(inplace=True))

        self.dres1 = nn.Sequential(convbn_3d(base_channels, base_channels, 3, 1, 1),
                                   nn.ReLU(inplace=True),
                                   convbn_3d(base_channels, base_channels, 3, 1, 1))

        self.dres2 = hourglass(base_channels)
        self.dres3 = hourglass(base_channels)
        self.dres4 = hourglass(base_channels)

        self.classif0 = nn.Sequential(convbn_3d(base_channels, base_channels, 3, 1, 1),
                                      nn.ReLU(inplace=True),
                                      nn.Conv3d(base_channels, 1, kernel_size=3, padding=1, stride=1, bias=False))
        self.classif1 = nn.Sequential(convbn_3d(base_channels, base_channels, 3, 1, 1),
                                      nn.ReLU(inplace=True),
                                      nn.Conv3d(base_channels, 1, kernel_size=3, padding=1, stride=1, bias=False))
        self.classif2 = nn.Sequential(convbn_3d(base_channels, base_channels, 3, 1, 1),
                                      nn.ReLU(inplace=True),
                                      nn.Conv3d(base_channels, 1, kernel_size=3, padding=1, stride=1, bias=False))
        self.classif3 = nn.Sequential(convbn_3d(base_channels, base_channels, 3, 1, 1),
                                      nn.ReLU(inplace=True),
                                      nn.Conv3d(base_channels, 1, kernel_size=3, padding=1, stride=1, bias=False))

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.Conv3d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.kernel_size[2] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm3d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
            elif isinstance(m, nn.Linear):
                m.bias.data.zero_()

    def forward(self, cost, FineD, FineH, FineW, disp_range_samples):
        cost0 = self.dres0(cost)  # torch.Size([1, 64, 12, 64, 128]) -> torch.Size([1, 32, 12, 64, 128])
        cost0 = self.dres1(cost0) + cost0  # torch.Size([1, 32, 12, 64, 128])

        out1 = self.dres2(cost0)  # torch.Size([1, 32, 12, 64, 128])
        out2 = self.dres3(out1)  # torch.Size([1, 32, 12, 64, 128])
        out3 = self.dres4(out2)  # torch.Size([1, 32, 12, 64, 128])

        cost3 = self.classif3(out3)  # torch.Size([1, 1, 12, 64, 128])

        if self.training:
            cost0 = self.classif0(cost0)  # torch.Size([1, 1, 12, 64, 128])
            cost1 = self.classif1(out1)  # torch.Size([1, 1, 12, 64, 128])
            cost2 = self.classif2(out2)  # torch.Size([1, 1, 12, 64, 128])

            cost0 = F.upsample(cost0, [FineD, FineH, FineW], mode='trilinear',
                               align_corners=Align_Corners)  # torch.Size([1, 1, 48, 256, 512])
            cost1 = F.upsample(cost1, [FineD, FineH, FineW], mode='trilinear',
                               align_corners=Align_Corners)  # torch.Size([1, 1, 48, 256, 512])
            cost2 = F.upsample(cost2, [FineD, FineH, FineW], mode='trilinear',
                               align_corners=Align_Corners)  # torch.Size([1, 1, 48, 256, 512])

            cost0 = torch.squeeze(cost0, 1)
            pred0 = F.softmax(cost0, dim=1)
            # Note: the disparities used for regression are disp_range_samples,
            # not simply range(max_disparity)
            pred0 = disparity_regression(pred0, disp_range_samples)

            cost1 = torch.squeeze(cost1, 1)
            pred1 = F.softmax(cost1, dim=1)
            pred1 = disparity_regression(pred1, disp_range_samples)

            cost2 = torch.squeeze(cost2, 1)
            pred2 = F.softmax(cost2, dim=1)
            pred2 = disparity_regression(pred2, disp_range_samples)

        cost3 = F.upsample(cost3, [FineD, FineH, FineW], mode='trilinear', align_corners=Align_Corners)
        cost3 = torch.squeeze(cost3, 1)
        pred3_prob = F.softmax(cost3, dim=1)
        # For your information: This formulation 'softmax(c)' learned "similarity"
        # while 'softmax(-c)' learned 'matching cost' as mentioned in the paper.
        # However, 'c' or '-c' do not affect the performance because feature-based cost volume provided flexibility.
        pred3 = disparity_regression(pred3_prob, disp_range_samples)

        if self.training:
            return pred0, pred1, pred2, pred3
        else:
            return pred3
Note in particular that in disparity_regression the disparity values d are not a uniform range but the new disparity range samples.
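disparity_regression itself is not shown in the excerpt; a minimal sketch of what it must compute, a soft-argmin over the per-pixel hypothesis values rather than over a fixed 0..D-1 range (assumed here, not copied from the repo), would be:

```python
import torch

def disparity_regression(prob, disp_range_samples):
    # prob:               (B, D, H, W), softmax weights over the D hypotheses
    # disp_range_samples: (B, D, H, W), per-pixel disparity value of each hypothesis
    # Expected (soft-argmin) disparity, shape (B, H, W)
    return torch.sum(prob * disp_range_samples, dim=1)
```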
4.6、Loss Function

There are N = 3 stages in total, and every stage is supervised.
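(The loss figure is missing here; per the paper, the total loss is simply a weighted sum of the per-stage supervised losses, roughly:)

$$L_{total} = \sum_{k=1}^{N} \lambda_k \, L_k$$

where $L_k$ is the loss of stage $k$ and $\lambda_k$ its weight (dlossw in the code below).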
def stereo_psmnet_loss(inputs, target, mask, **kwargs):
    disp_loss_weights = kwargs.get("dlossw", None)

    total_loss = torch.tensor(0.0, dtype=target.dtype, device=target.device, requires_grad=False)

    for (stage_inputs, stage_key) in [(inputs[k], k) for k in inputs.keys() if "stage" in k]:
        disp0, disp1, disp2, disp3 = stage_inputs["pred0"], stage_inputs["pred1"], stage_inputs["pred2"], stage_inputs["pred3"]

        loss = 0.5 * F.smooth_l1_loss(disp0[mask], target[mask], reduction='mean') + \
               0.5 * F.smooth_l1_loss(disp1[mask], target[mask], reduction='mean') + \
               0.7 * F.smooth_l1_loss(disp2[mask], target[mask], reduction='mean') + \
               1.0 * F.smooth_l1_loss(disp3[mask], target[mask], reduction='mean')

        if disp_loss_weights is not None:
            stage_idx = int(stage_key.replace("stage", "")) - 1
            total_loss += disp_loss_weights[stage_idx] * loss
        else:
            total_loss += loss

    return total_loss
Iterate over every stage and compute the smooth L1 loss.
Each stage has 4 predictions, corresponding to:
- the output of the pre-hourglass block in the cost aggregation module
- the disparity map predicted after the first hourglass
- the disparity map predicted after the second hourglass
- the disparity map predicted after the third hourglass
5、Experiments
MVSNet+Ours.
8 Nvidia GTX 1080Ti GPUs, batch size 16
5.1、Datasets and Metrics
- Middlebury
- KITTI 2015
- Scene Flow
Evaluation metrics
- MVS: Acc. (mm), Comp. (mm), Overall (mm), Rank, Mean
- SM: EPE, D1
5.2、Multi-view stereo
Benchmark Performance

The method's results are not the absolute best, but as an improvement built on MVSNet the gain is clear.

generates more complete point clouds with finer details




The example in the top row is not well chosen; are the predictions on the left not all wrong?
5.3、Stereo Matching
Benchmark Performance

Plugging the proposed method into PSMNet, GwcNet, and GANet improves all of their results, while using less memory.

After adopting the proposed method, GwcNet rises from 29th to 17th on the leaderboard (as of Nov. 5, 2019).

On the Middlebury benchmark, PSMNet+Ours ranks 37th on the avgerr metric (as of Feb. 7, 2020).
5.4、Ablation Study
Cascade Stage Number and Parameter Sharing in Cost Volume Regularization

separate parameter learning of the cascade cost volumes at different stages further improves the accuracy.
Spatial Resolution and Feature Pyramid

5.5、Runtime and GPU Memory


6、Conclusion(own) / Future work
- GPU memory and computationally efficient cascade cost volume
- coarse to fine,ensures the computation and memory resources are spent on more meaningful regions.
- The proposed cascade structure plugs into existing methods directly, without changing their backbones.
- Hypothesis plane interval: 48 × 4 = 192, so stage 1 still covers the full range.
- Left features + warped right features explicitly encode the geometric constraint.
- To sample an image/feature map at arbitrary coordinates in a differentiable way, so that the operation can be embedded in an end-to-end trainable network, use torch.nn.functional.grid_sample; typical uses include optical-flow warping, right-image alignment in stereo matching, STN, and NeRF rendering. A minimal sketch follows below.
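A minimal, self-contained illustration (toy tensors, not from the repo) of differentiable sampling with grid_sample:

```python
import torch
import torch.nn.functional as F

feat = torch.randn(1, 32, 64, 128, requires_grad=True)  # (B, C, H, W) feature map
# Sampling locations in normalized coordinates: (-1, -1) = top-left, (1, 1) = bottom-right
grid = torch.rand(1, 64, 128, 2) * 2 - 1                 # (B, H_out, W_out, 2)

warped = F.grid_sample(feat, grid, mode='bilinear',
                       padding_mode='zeros', align_corners=True)  # (B, C, H_out, W_out)
warped.sum().backward()  # gradients flow back through the bilinear sampling to feat
```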
For more paper-reading notes, see 【Paper Reading】.




