MapTRv2 Code Reading

qq_41131535

Last modified 2024-07-31 17:40:24


Category: Code Reading. Tags: artificial intelligence, machine learning, deep learning

First published 2024-06-14 18:34:08

Article link: https://blog.youkuaiyun.com/qq_41131535/article/details/139676227


1 Data Processing (to be added later)

2 Model Structure

2.1 Backbone+Neck

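The input here is a single frame without temporal information: six images with input shape $B \times 6 \times 3 \times 480 \times 800$ ($B$ is the batch size). The images first go through grid_mask data augmentation (see https://blog.youkuaiyun.com/u013685264/article/details/122667456). A basic resnet50 serves as the backbone and yields the final 32x downsampled feature $B \times 6 \times 2048 \times 15 \times 25$. The neck (mainly two Conv2d layers for channel reduction) then produces the output $B \times 6 \times 256 \times 15 \times 25$.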
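The shape bookkeeping above can be checked with a few lines of plain Python. This is only a sketch of the arithmetic; the function name and its parameters are illustrative and do not come from the MapTRv2 code.

```python
def backbone_neck_shapes(batch, num_cams=6, height=480, width=800,
                         stride=32, backbone_channels=2048, neck_channels=256):
    """Return the (backbone_out, neck_out) shapes for the sizes quoted in the text."""
    # A 32x downsampling turns 480 x 800 into 15 x 25.
    feat_h, feat_w = height // stride, width // stride
    backbone_out = (batch, num_cams, backbone_channels, feat_h, feat_w)
    # The neck only reduces the channel dimension; the spatial size is unchanged.
    neck_out = (batch, num_cams, neck_channels, feat_h, feat_w)
    return backbone_out, neck_out


backbone_out, neck_out = backbone_neck_shapes(batch=2)
print(backbone_out)  # (2, 6, 2048, 15, 25)
print(neck_out)      # (2, 6, 256, 15, 25)
```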

2.2 BEV Features

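The mainstream ways to generate BEV features are currently bevformer and LSS; more detail on both will be added later. Either one produces a BEV feature of shape $B \times 20000 \times 256$ (20000 corresponds to the $200 \times 100$ BEV grid, $h \times w$). LSS additionally produces a depth feature $B \times 6 \times 68 \times 15 \times 25$ that is used later for depth supervision.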

2.2.1 LSS (based on BEVDepth)

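The LSS method takes all 2D images and predicts, for each of them, spatial (context) features and a discrete depth-distribution probability. Their outer product gives the frustum features together with their spatial positions. Using the intrinsics and extrinsics, the points are then projected into BEV space, and sum pooling yields the BEV feature (see https://zhuanlan.zhihu.com/p/567880155).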
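For a single pixel, the "lift" step amounts to an outer product between the depth distribution and the context vector. The sketch below uses plain Python lists and toy sizes (3 depth bins, 2 channels); `lift_pixel` is an illustrative name, not a function from the code base.

```python
def lift_pixel(depth_probs, context):
    """Outer product of a depth distribution (D,) and a context vector (C,).

    Returns a D x C nested list: one scaled copy of the context per depth bin.
    """
    return [[p * c for c in context] for p in depth_probs]


depth_probs = [0.1, 0.7, 0.2]  # toy distribution over 3 depth bins, sums to 1
context = [1.0, -2.0]          # toy 2-channel context feature
frustum = lift_pixel(depth_probs, context)

# Summing over the depth bins recovers the context, since the probabilities sum to 1.
recovered = [sum(row[c] for row in frustum) for c in range(len(context))]
```

Most of the feature mass lands in the bin with the highest probability, which is what lets the later pooling place each pixel's feature at a plausible depth.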

2.2.1.1 Example: the BEVDepth-based LSS module code

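The inputs are the image features $B \times 6 \times 256 \times 15 \times 25$ and the corresponding transform matrices: the translation and rotation produced by image augmentation (post_trans, post_rots), the camera intrinsics and extrinsics (cam2ego_trans, cam2ego_rots), and lidar2ego_trans, lidar2ego_rots.
The first step generates the spatial point positions with depth. With the original input IH=480, IW=800 and a depth range of 1 m to 35 m at 0.5 m intervals, this gives the frustum coordinates in the image coordinate system (shape $68 \times 15 \times 25 \times 3$, sampled with a stride of 32). Applying the transform matrices above and expanding over batch and cameras gives the points in BEV space (shape $B \times 6 \times 68 \times 15 \times 25 \times 3$).
Depth estimation depends on the camera intrinsics, so the intrinsics, extrinsics and image-augmentation transforms are used as input parameters to build mlp_input (shape $B \times 6 \times 22$). It goes through a bn layer and then fc+relu+fc+relu to produce context_se (shape $B \times 6 \times 256$), which is applied to the image features below as channel attention (SENet-style channel attention).
The input image features go through one conv+bn+relu and are multiplied by context_se after a sigmoid, giving the camera-aware spatial semantic feature context $B \times 6 \times 256 \times 15 \times 25$.
Depth generation starts the same way: the depth feature derived from the image features is also modulated by the camera parameters, giving the input to depth_conv, $B \times 6 \times 256 \times 15 \times 25$.
The depth feature from the previous step goes through an ASPP module plus one conv to produce the depth-distribution feature $B \times 6 \times 68 \times 15 \times 25$, and a softmax gives the final depth $B \times 6 \times 68 \times 15 \times 25$, which is used later for the loss.
Multiplying depth by context gives the final spatial feature $B \times 6 \times 68 \times 15 \times 25 \times 256$, located at the BEV-space points. Because the actual point_cloud_range is [-15.0, -30.0, -10.0, 15.0, 30.0, 10.0] and the initial BEV grid is $200 \times 400$, the point coordinates are shifted by subtracting [-15.0, -30.0, -10.0] and divided by voxel_size = [0.15, 0.15, 20.0]. Only points whose converted coordinates fall within [0, 0, 0, 200, 400, 1] are kept. quickCumsum then merges the features, i.e. the features of points with the same x, y coordinates are summed, giving the BEV feature $B \times 256 \times 200 \times 400$. Three conv+bn+relu blocks then downsample it into the final BEV feature $B \times 256 \times 200 \times 400$.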
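The coordinate conversion and sum pooling in the last step can be sketched as follows. This is a plain-Python illustration using the point_cloud_range and voxel_size quoted above; a dictionary replaces quickCumsum here, so it shows the result of the pooling and not the actual cumulative-sum trick. The function names are illustrative.

```python
import math

PC_RANGE = [-15.0, -30.0, -10.0, 15.0, 30.0, 10.0]
VOXEL_SIZE = [0.15, 0.15, 20.0]
GRID = [200, 400, 1]  # number of voxels along x, y, z


def voxel_index(point):
    """Map an (x, y, z) point in metres to integer voxel indices, or None if outside."""
    idx = tuple(math.floor((v - lo) / size)
                for v, lo, size in zip(point, PC_RANGE[:3], VOXEL_SIZE))
    inside = all(0 <= i < n for i, n in zip(idx, GRID))
    return idx if inside else None


def voxel_sum_pool(points, feats):
    """Sum the features of all points that fall into the same (x, y) BEV cell."""
    bev = {}
    for point, feat in zip(points, feats):
        idx = voxel_index(point)
        if idx is None:
            continue  # drop points outside the BEV range
        cell = idx[:2]
        if cell in bev:
            bev[cell] = [a + b for a, b in zip(bev[cell], feat)]
        else:
            bev[cell] = list(feat)
    return bev


points = [(0.07, 0.07, 0.0), (0.1, 0.1, 1.0), (20.0, 0.0, 0.0), (-14.9, -29.9, 0.0)]
feats = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0], [5.0, 5.0]]
bev = voxel_sum_pool(points, feats)
# The first two points share cell (100, 200); the third is out of range and dropped.
```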

2.2.1.2 Depth loss computation

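The depth predicted in the previous step, after softmax, has shape $B \times 6 \times 68 \times 15 \times 25$.
To build the depth ground truth, the corresponding point cloud is loaded and transformed into the image coordinate system to get 3D coordinates. Only points inside the image and within the 1 m to 35 m depth range are kept, and when several depth values land on the same pixel the smallest one is kept. This gives depth_map $B \times 6 \times 480 \times 800$. It is then downsampled by 32, taking the smallest depth within each $32 \times 32$ window as the depth of the downsampled location. The depth is then mapped onto the 68 bins (1 m to 35 m at 0.5 m intervals gives 68 values) and one-hot encoded to produce the final gt depth, which is compared with the prediction above using binary_cross_entropy.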
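A minimal sketch of the ground-truth construction: min-pooling over windows, then binning and one-hot encoding. It assumes that 0 marks "no lidar point" and uses the simple bin index floor((d - 1.0) / 0.5); the exact binning in the BEVDepth code may differ in detail. The example uses a 4 x 4 map with stride 2 in place of 480 x 800 with stride 32.

```python
import math

D_MIN, D_MAX, D_STEP = 1.0, 35.0, 0.5
NUM_BINS = int((D_MAX - D_MIN) / D_STEP)  # 68 depth bins


def min_pool(depth_map, stride):
    """Keep the smallest valid (> 0) depth in each stride x stride window; 0 if none."""
    h, w = len(depth_map), len(depth_map[0])
    pooled = []
    for i in range(0, h, stride):
        row = []
        for j in range(0, w, stride):
            vals = [depth_map[y][x]
                    for y in range(i, min(i + stride, h))
                    for x in range(j, min(j + stride, w))
                    if depth_map[y][x] > 0]
            row.append(min(vals) if vals else 0.0)
        pooled.append(row)
    return pooled


def one_hot_depth(depth):
    """One-hot vector over the depth bins; all zeros when there is no valid depth."""
    onehot = [0.0] * NUM_BINS
    idx = math.floor((depth - D_MIN) / D_STEP)
    if depth > 0 and 0 <= idx < NUM_BINS:
        onehot[idx] = 1.0
    return onehot


depth_map = [
    [0.0, 0.0, 5.2, 0.0],
    [0.0, 3.0, 9.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 34.9],
]
pooled = min_pool(depth_map, stride=2)
gt_depth = [[one_hot_depth(d) for d in row] for row in pooled]
```

Locations without any lidar point end up with an all-zero target, so they carry no positive label in the binary cross entropy.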

2.2.2 BevFormer

bevformer generates the BEV feature through transformer modules; for details see https://zhuanlan.zhihu.com/p/543335939

2.2.2.1 Code example

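The inputs are mlvl_feats (a single level of image features, $B \times 6 \times 256 \times 15 \times 25$), together with the generated bev_query ($20000 \times B \times 256$) and bev_pos ($20000 \times B \times 256$).
Parameters such as can_bus go through Linear+relu+Linear+relu+LayerNorm, and the resulting feature is added to bev_query to give bev_queries ($20000 \times B \times 256$).
mlvl_feats is combined with cams_embeds and level_embeds to give feat_flatten ($6 \times 375 \times B \times 256$) and spatial_shapes ($15 \times 25$).
ref_3d is generated as normalized points within the H, W, Z (100, 200, 20) range, with 4 points sampled uniformly along Z ($B \times 4 \times 20000 \times 3$).
ref_2d is generated as normalized points over H and W; it is ref_3d without the Z direction ($B \times 20000 \times 1 \times 2$).
ref_3d is transformed and projected into the image coordinate system, keeping points whose depth in the camera coordinate system is greater than 0 and that fall inside the image. This gives reference_points_cam ($6 \times B \times 20000 \times 4 \times 2$) and bev_mask ($6 \times B \times 20000 \times 4$).
bevformer has a temporal module, but maptrv2 does not use temporal information, so prev_bev is None. hybird_ref_2d is built by stacking ref_2d with itself ($B \times 2 \times 20000 \times 1 \times 2$).
The inputs that finally enter the layers are bev_query, feat_flatten as key and value, hybird_ref_2d as ref_2d, ref_3d, bev_pos, and reference_points_cam.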
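To make the ref_2d and ref_3d shapes concrete, here is a rough sketch that places one normalized point at the centre of every BEV cell and repeats it at 4 heights. The even spacing along Z and the helper names are assumptions for illustration; the real code may space the Z samples differently.

```python
def cell_centers(n):
    """n normalized cell centres in (0, 1): 0.5/n, 1.5/n, ..."""
    return [(i + 0.5) / n for i in range(n)]


def get_ref_2d(bev_h, bev_w):
    """H*W normalized (x, y) points in row-major order."""
    return [(x, y) for y in cell_centers(bev_h) for x in cell_centers(bev_w)]


def get_ref_3d(bev_h, bev_w, num_points_in_pillar=4):
    """num_points_in_pillar height levels, each holding H*W (x, y, z) points."""
    ref_2d = get_ref_2d(bev_h, bev_w)
    return [[(x, y, z) for (x, y) in ref_2d]
            for z in cell_centers(num_points_in_pillar)]


ref_2d = get_ref_2d(bev_h=200, bev_w=100)
ref_3d = get_ref_3d(bev_h=200, bev_w=100)
print(len(ref_2d))                  # 20000 BEV queries
print(len(ref_3d), len(ref_3d[0]))  # 4 20000
```

Each BEV query therefore owns a pillar of 4 points, and each of those points is projected into the 6 cameras to produce reference_points_cam and bev_mask.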

2.2.2.2 Layer module

ret_dict = self.encoder(
    bev_queries,    # query
    feat_flatten,   # key
    feat_flatten,   # value (same tensor as key)
    mlvl_feats=mlvl_feats,
    bev_h=bev_h,
    bev_w=bev_w,
    bev_pos=bev_pos,
    spatial_shapes=spatial_shapes,
    level_start_index=level_start_index,
    prev_bev=prev_bev,  # None here, since maptrv2 uses no temporal input
    shift=shift,
    **kwargs
)

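Each layer consists of self-attention, layernorm, cross_attention, layernorm, ffn, layernorm.
The self-attention is Temporal Self-Attention. Since there is no prev_bev here, two copies of bev_query are stacked to form the query ($B \times 2 \times 20000 \times 256$), with value of the same size. They go through deformable-attention (the computation is similar to the decoder described below), and the two results are averaged to give the output query ($B \times 20000 \times 256$).
The cross-attention is Spatial Cross-Attention. The query is the one from above, and key and value are the input feat_flatten. Because only a small fraction of the BEV-space points project onto any given image, there is an optimization: bev_mask selects the queries that are actually hit. Across the batch, the largest hit count is taken as max_len, and the query and reference_point slots of samples with fewer hits are padded with 0, giving queries_rebatch ($B \times 6 \times \mathrm{max\_len} \times 256$) and reference_points_rebatch ($B \times 6 \times \mathrm{max\_len} \times 4 \times 2$). After deformable-attention, the resulting queries ($B \times 6 \times \mathrm{max\_len} \times 256$) are scattered back onto the original 20000 queries (queries that hit no image stay 0), summed over the 6 cameras and normalized by the number of cameras that hit each query, giving the final query ($B \times 20000 \times 256$).
After 6 layers the query is the final bev_features ($B \times 20000 \times 256$).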
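The rebatch, scatter-back and normalize logic of the Spatial Cross-Attention can be sketched for a single sample as below. The per-camera deformable attention is replaced by a caller-supplied `attend` function, which is purely a stand-in, and max_len is taken over cameras only since there is no batch dimension in this toy version.

```python
def spatial_cross_attention_merge(queries, bev_mask, attend):
    """queries: N feature vectors; bev_mask[cam][n]: True if query n hits camera cam;
    attend(cam, vectors): stand-in for the per-camera deformable attention."""
    num_queries, dim = len(queries), len(queries[0])
    hit_indices = [[i for i in range(num_queries) if cam_mask[i]] for cam_mask in bev_mask]
    max_len = max(len(idx) for idx in hit_indices)

    slots = [[0.0] * dim for _ in range(num_queries)]
    counts = [0] * num_queries
    for cam, idx in enumerate(hit_indices):
        # Rebatch: gather the hit queries and pad with zeros up to max_len.
        rebatch = [queries[i] for i in idx] + [[0.0] * dim] * (max_len - len(idx))
        out = attend(cam, rebatch)
        # Scatter back only the real entries; the padding is ignored.
        for j, i in enumerate(idx):
            slots[i] = [s + o for s, o in zip(slots[i], out[j])]
            counts[i] += 1
    # Normalize by the number of cameras that saw each query (at least 1).
    return [[s / max(c, 1) for s in slot] for slot, c in zip(slots, counts)]


queries = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
bev_mask = [[True, True, False],   # camera 0 sees queries 0 and 1
            [True, False, False]]  # camera 1 sees query 0 only


def toy_attend(cam, vectors):
    """Toy attention: scale every vector by (cam + 1)."""
    return [[v * (cam + 1) for v in vec] for vec in vectors]


merged = spatial_cross_attention_merge(queries, bev_mask, toy_attend)
print(merged)  # [[1.5, 1.5], [2.0, 2.0], [0.0, 0.0]]
```

Query 2 is seen by no camera and stays zero, which matches the note above that queries outside every image remain 0.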

2.3 Decoder module

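The input query uses the instance_pts form: the instances (350 in total, 50 + 300, where 50 are the one2one queries and 300 are the one2many queries, a 6x expansion) and the 20 points of each instance are initialized separately. This gives object_query_embeds of shape $7000 \times 512$ (7000 is $350 \times 20$; 512 is query and query_pos concatenated, i.e. query and query_pos are each $350 \times 20 \times 256$).
A self_attn_mask of size $350 \times 350$ is set here: the top-left $50 \times 50$ block and the bottom-right $300 \times 300$ block are False. It separates the one2one and one2many queries so that they do not interfere with each other.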
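The mask layout is easy to reproduce. The sketch below builds it with nested lists, with True meaning "blocked"; the function name is illustrative.

```python
def build_self_attn_mask(num_one2one=50, num_one2many=300):
    """Boolean mask where True blocks attention between the two query groups."""
    total = num_one2one + num_one2many
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            # Block the pair when exactly one of the two queries is a one2one query.
            mask[i][j] = (i < num_one2one) != (j < num_one2one)
    return mask


mask = build_self_attn_mask()
blocked = sum(sum(row) for row in mask)
print(len(mask), len(mask[0]))  # 350 350
print(blocked)                  # 30000 blocked pairs (2 * 50 * 300)
```

The two diagonal blocks stay False, so each group attends freely within itself, while the two off-diagonal blocks are True and cut all interaction between one2one and one2many queries.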

2.3.1 Decoder process, mainly based on deformable attention

MapTRDecoder(
  (layers): ModuleList(
    (0): DecoupledDetrTransformerDecoderLayer(
      (attentions): ModuleList(
        (0): MultiheadAttention(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (proj_drop): Dropout(p=0.0, inplace=False)
          (dropout_layer): Dropout(p=0.1, inplace=False)
        )
        (1): MultiheadAttention(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (proj_drop): Dropout(p=0.0, inplace=False)
          (dropout_layer): Dropout(p=0.1, inplace=False)
        )
        (2): CustomMSDeformableAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (sampling_offsets): Linear(in_features=256, out_features=64, bias=True)
          (attention_weights): Linear(in_features=256, out_features=32, bias=True)
          (value_proj): Linear(in_features=256, out_features=256, bias=True)
          (output_proj): Linear(in_features=256, out_features=256, bias=True)
        )
      )
      (ffns): ModuleList(
        (0): FFN(
          (activate): ReLU(inplace=True)
          (layers): Sequential(
            (0): Sequential(
              (0): Linear(in_features=256, out_features=512, bias=True)
              (1): ReLU(inplace=True)
              (2): Dropout(p=0.1, inplace=False)
            )
            (1): Linear(in_features=512, out_features=256, bias=True)
            (2): Dropout(p=0.1, inplace=False)
          )
          (dropout_layer): Identity()
        )
      )
      (norms): ModuleList(
        (0): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      )
    )
    (1): DecoupledDetrTransformerDecoderLayer(
      (attentions): ModuleList(
        (0): MultiheadAttention(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (proj_drop): Dropout(p=0.0, inplace=False)
          (dropout_layer): Dropout(p=0.1, inplace=False)
        )
        (1): MultiheadAttention(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (proj_drop): Dropout(p=0.0, inplace=False)
          (dropout_layer): Dropout(p=0.1, inplace=False)
        )
        (2): CustomMSDeformableAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (sampling_offsets): Linear(in_features=256, out_features=64, bias=True)
          (attention_weights): Linear(in_features=256, out_features=32, bias=True)
          (value_proj): Linear(in_features=256, out_features=256, bias=True)
          (output_proj): Linear(in_features=256, out_features=256, bias=True)
        )
      )
      (ffns): ModuleList(
        (0): FFN(
          (activate): ReLU(inplace=True)
          (layers): Sequential(
            (0): Sequential(
              (0): Linear(in_features=256, out_features=512, bias=True)
              (1): ReLU(inplace=True)
              (2): Dropout(p=0.1, inplace=False)
            )
            (1): Linear(in_features=512, out_features=256, bias=True)
            (2): Dropout(p=0.1, inplace=False)
          )
          (dropout_layer): Identity()
        )
      )
      (norms): ModuleList(
        (0): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      )
    )
    (2): DecoupledDetrTransformerDecoderLayer(
      (attentions): ModuleList(
        (0): MultiheadAttention(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (proj_drop): Dropout(p=0.0, inplace=False)
          (dropout_layer): Dropout(p=0.1, inplace=False)
        )
        (1): MultiheadAttention(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (proj_drop): Dropout(p=0.0, inplace=False)
          (dropout_layer): Dropout(p=0.1, inplace=False)
        )
        (2): CustomMSDeformableAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (sampling_offsets): Linear(in_features=256, out_features=64, bias=True)
          (attention_weights): Linear(in_features=256, out_features=32, bias=True)
          (value_proj): Linear(in_features=256, out_features=256, bias=True)
          (output_proj): Linear(in_features=256, out_features=256, bias=True)
        )
      )
      (ffns): ModuleList(
        (0): FFN(
          (activate): ReLU(inplace=True)
          (layers): Sequential(
            (0): Sequential(
              (0): Linear(in_features=256, out_features=512, bias=True)
              (1): ReLU(inplace=True)
              (2): Dropout(p=0.1, inplace=False)
            )
            (1): Linear(in_features=512, out_features=256, bias=True)
            (2): Dropout(p=0.1, inplace=False)
          )
          (dropout_layer): Identity()
        )
      )
      (norms): ModuleList(
        (0): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      )
    )
    (3): DecoupledDetrTransformerDecoderLayer(
      (attentions): ModuleList(
        (0): MultiheadAttention(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (proj_drop): Dropout(p=0.0, inplace=False)
          (dropout_layer): Dropout(p=0.1, inplace=False)
        )
        (1): MultiheadAttention(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (proj_drop): Dropout(p=0.0, inplace=False)
          (dropout_layer): Dropout(p=0.1, inplace=False)
        )
        (2): CustomMSDeformableAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (sampling_offsets): Linear(in_features=256, out_features=64, bias=True)
          (attention_weights): Linear(in_features=256, out_features=32, bias=True)
          (value_proj): Linear(in_features=256, out_features=256, bias=True)
          (output_proj): Linear(in_features=256, out_features=256, bias=True)
        )
      )
      (ffns): ModuleList(
        (0): FFN(
          (activate): ReLU(inplace=True)
          (layers): Sequential(
            (0): Sequential(
              (0): Linear(in_features=256, out_features=512, bias=True)
              (1): ReLU(inplace=True)
              (2): Dropout(p=0.1, inplace=False)
            )
            (1): Linear(in_features=512, out_features=256, bias=True)
            (2): Dropout(p=0.1, inplace=False)
          )
          (dropout_layer): Identity()
        )
      )
      (norms): ModuleList(
        (0): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      )
    )
    (4): DecoupledDetrTransformerDecoderLayer(
      (attentions): ModuleList(
        (0): MultiheadAttention(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (proj_drop): Dropout(p=0.0, inplace=False)
          (dropout_layer): Dropout(p=0.1, inplace=False)
        )
        (1): MultiheadAttention(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (proj_drop): Dropout(p=0.0, inplace=False)
          (dropout_layer): Dropout(p=0.1, inplace=False)
        )
        (2): CustomMSDeformableAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (sampling_offsets): Linear(in_features=256, out_features=64, bias=True)
          (attention_weights): Linear(in_features=256, out_features=32, bias=True)
          (value_proj): Linear(in_features=256, out_features=256, bias=True)
          (output_proj): Linear(in_features=256, out_features=256, bias=True)
        )
      )
      (ffns): ModuleList(
        (0): FFN(
          (activate): ReLU(inplace=True)
          (layers): Sequential(
            (0): Sequential(
              (0): Linear(in_features=256, out_features=512, bias=True)
              (1): ReLU(inplace=True)
              (2): Dropout(p=0.1, inplace=False)
            )
            (1): Linear(in_features=512, out_features=256, bias=True)
            (2): Dropout(p=0.1, inplace=False)
          )
          (dropout_layer): Identity()
        )
      )
      (norms): ModuleList(
        (0): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      )
    )
    (5): DecoupledDetrTransformerDecoderLayer(
      (attentions): ModuleList(
        (0): MultiheadAttention(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (proj_drop): Dropout(p=0.0, inplace=False)
          (dropout_layer): Dropout(p=0.1, inplace=False)
        )
        (1): MultiheadAttention(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (proj_drop): Dropout(p=0.0, inplace=False)
          (dropout_layer): Dropout(p=0.1, inplace=False)
        )
        (2): CustomMSDeformableAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (sampling_offsets): Linear(in_features=256, out_features=64, bias=True)
          (attention_weights): Linear(in_features=256, out_features=32, bias=True)
          (value_proj): Linear(in_features=256, out_features=256, bias=True)
          (output_proj): Linear(in_features=256, out_features=256, bias=True)
        )
      )
      (ffns): ModuleList(
        (0): FFN(
          (activate): ReLU(inplace=True)
          (layers): Sequential(
            (0): Sequential(
              (0): Linear(in_features=256, out_features=512, bias=True)
              (1): ReLU(inplace=True)
              (2): Dropout(p=0.1, inplace=False)
            )
            (1): Linear(in_features=512, out_features=256, bias=True)
            (2): Dropout(p=0.1, inplace=False)
          )
          (dropout_layer): Identity()
        )
      )
      (norms): ModuleList(
        (0): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      )
    )
  )
)
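The Linear sizes of CustomMSDeformableAttention in the printout follow directly from the attention layout, which uses 8 heads, 1 feature level and 4 sampling points. A small arithmetic check, with an illustrative function name:

```python
def deformable_attention_dims(num_heads=8, num_levels=1, num_points=4):
    """Output sizes of the two Linear layers that drive the sampling."""
    return {
        # one (x, y) offset per head, level and sampling point
        "sampling_offsets": num_heads * num_levels * num_points * 2,
        # one scalar weight per head, level and sampling point
        "attention_weights": num_heads * num_levels * num_points,
    }


dims = deformable_attention_dims()
print(dims)  # {'sampling_offsets': 64, 'attention_weights': 32}
```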

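A linear transform of query_pos gives reference_points $B \times 7000 \times 2$, and a sigmoid then gives the initial init_reference_out $B \times 7000 \times 2$.
The input img_neck features plus cams_embeds and level_embeds give feat_flatten $6 \times 375 \times B \times 256$ (375 is $15 \times 25$).
The decoder then runs.
There are 6 decoder layers, each with self-attention, layer_norm, self-attention, layer_norm, cross-attention, layer_norm, FFN, layer_norm.
The first self-attention is nn.MultiheadAttention with query and query_pos as input. This is where the self_attn_mask from above is used: in nn.MultiheadAttention, positions where the boolean attn_mask is True are blocked (the effective keep-mask is 1 - attn_mask), which matches the setting above.
The second self-attention has attn_mask set to None.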
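The cross-attention uses CustomMSDeformableAttention with input query, key=None, and value set to bev_embed. value goes through a Linear to give the final value input. query goes through a Linear to produce the multi-head sampling_offsets $B \times 7000 \times 8 \times 1 \times 4 \times 2$ (7000 is the $350 \times 20$ point queries, 8 is the number of heads, 1 means a single level, 4 is the number of sampled points, 2 is the x, y offset). query goes through another Linear to produce the multi-head attention_weights $B \times 7000 \times 8 \times 1 \times 4$ (with the same meaning for each dimension), followed by a softmax. The sampling_locations are computed as reference_points + sampling_offsets / shape. In short, 4 offsets are added to each reference point to get 4 sampling positions, features are bilinearly interpolated from value at those positions, multiplied by attention_weights and summed to give the output $B \times 7000 \times 256$. A Linear and a residual connection with the input query then give the final cross-attention output $7000 \times B \times 256$.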
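FFN: its structure is the one shown in the module printout above (Linear 256 to 512, ReLU, Dropout, Linear 512 to 256, Dropout).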
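The resulting output $B \times 7000 \times 256$ becomes the query input of the next layer. output also goes through reg_branches (Linear+Relu+Linear+Relu+Linear) to predict an offset for the reference points. This offset is added to the input reference_points (after inverse sigmoid) to give new_reference_points, which after a sigmoid become the reference_points input of the next layer.
After all 6 layers, the output and reference_points of every layer are kept for the loss computation.
Each layer's output $B \times 7000 \times 256$ is reshaped to $B \times 350 \times 20 \times 256$ and averaged over the third dimension (the 20 points) to give $B \times 350 \times 256$, which goes through cls_branches (Linear+LayerNorm+Relu+Linear+LayerNorm+Relu+Linear) to give the classification result $B \times 350 \times 3$; there are only 3 classes. The code regenerates reference_points identical to the ones produced above; this is redundant and could be removed. The final point coordinates are $B \times 7000 \times 2$ (i.e. $B \times 350 \times 20 \times 2$: 350 instances with 20 point coordinates each), from which the enclosing bounding box and the 20 point coordinates are produced.
Auxiliary segmentation is used here. First, seg_head (Conv2d+Relu+Conv2d) on bev_embed gives the semantic segmentation in BEV, $B \times 1 \times 200 \times 100$. Second, pv_seg_head (Conv2d+Relu+Conv2d) on img_neck $B \times 6 \times 256 \times 15 \times 25$ gives the semantic segmentation on the 6 original pv images, $B \times 6 \times 1 \times 15 \times 25$.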
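To make the cross-attention sampling and the reference-point refinement concrete, here is a single-query, single-head sketch in plain Python. It assumes the bilinear sampling follows the `align_corners=False` convention (normalized location times size, minus 0.5) with zero padding outside the map, and all function names are illustrative and not the library's API.

```python
import math


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def bilinear_sample(value, x, y):
    """Sample an H x W x C nested list at normalized (x, y); zero outside the map."""
    h, w, c = len(value), len(value[0]), len(value[0][0])
    px, py = x * w - 0.5, y * h - 0.5
    x0, y0 = math.floor(px), math.floor(py)
    out = [0.0] * c
    for yy, wy in ((y0, 1 - (py - y0)), (y0 + 1, py - y0)):
        for xx, wx in ((x0, 1 - (px - x0)), (x0 + 1, px - x0)):
            if 0 <= xx < w and 0 <= yy < h:
                out = [o + wx * wy * v for o, v in zip(out, value[yy][xx])]
    return out


def deform_attend_one_head(value, ref_point, offsets, logits):
    """Weighted sum of features sampled around one reference point."""
    h, w = len(value), len(value[0])
    weights = softmax(logits)
    out = [0.0] * len(value[0][0])
    for (dx, dy), weight in zip(offsets, weights):
        # sampling_locations = reference_points + sampling_offsets / shape
        feat = bilinear_sample(value, ref_point[0] + dx / w, ref_point[1] + dy / h)
        out = [o + weight * f for o, f in zip(out, feat)]
    return out


def inverse_sigmoid(p, eps=1e-5):
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def refine_reference(ref_point, predicted_offset):
    """new_reference_points = sigmoid(offset + inverse_sigmoid(reference_points))."""
    return [1 / (1 + math.exp(-(inverse_sigmoid(r) + d)))
            for r, d in zip(ref_point, predicted_offset)]


value = [[[0.0], [1.0]], [[2.0], [3.0]]]  # toy 2 x 2 BEV map with 1 channel
offsets = [(-0.5, -0.5), (0.5, -0.5), (-0.5, 0.5), (0.5, 0.5)]
out = deform_attend_one_head(value, (0.5, 0.5), offsets, [0.0] * 4)
new_ref = refine_reference((0.5, 0.5), (0.0, 1.0))
```

With equal logits the four samples land exactly on the four cells and are averaged. In the real module this happens for 8 heads and 7000 queries at once, and the per-head results are concatenated before the output Linear.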

2.4 Loss computation

depth loss: computed for the LSS branch; to be added later.
Of the 350 output instances, 50 are one2one and 300 are one2many; the gt_label for one2many is correspondingly replicated 6 times.
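The split between the two branches can be sketched as below. It assumes the first 50 predictions belong to the one2one branch and that the ground truth is tiled as a whole 6 times; the names are illustrative.

```python
def split_one2one_one2many(preds, gt_labels, num_one2one=50, k_one2many=6):
    """Split predictions into the two branches and tile the gt for one2many."""
    one2one_preds = preds[:num_one2one]
    one2many_preds = preds[num_one2one:]
    one2many_gt = list(gt_labels) * k_one2many  # whole gt list repeated 6 times
    return (one2one_preds, list(gt_labels)), (one2many_preds, one2many_gt)


preds = list(range(350))  # stand-ins for the 350 predicted instances
gt_labels = [0, 2, 1]     # toy class labels of 3 gt instances
(o2o_preds, o2o_gt), (o2m_preds, o2m_gt) = split_one2one_one2many(preds, gt_labels)
print(len(o2o_preds), len(o2o_gt))  # 50 3
print(len(o2m_preds), len(o2m_gt))  # 300 18
```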

2.4.1 maptr_assigner

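Take the one2one computation as an example. For gt processing there are currently three classes: lane lines, boundary lines and pedestrian crossings. For the first two, the forward and reversed point orders are added. A pedestrian crossing is a closed ring, so 19 instances are generated by cyclic shifts. The first two classes have fewer than 19 versions, so the rest is padded with -1. This gives gt_shifts_pts_list $N \times 19 \times 20 \times 2$ ($N$ is the number of gt in one input).
The costs include cls_loss (focal_loss), box_reg_loss (L1 loss), pts_loss (L1 loss) and iou_loss (giou loss), but the weights of the box-based box_reg_loss and iou_loss are set to 0.
pts_loss is computed differently: the loss between each of the 50 instances and each of the 19 shifted versions of a gt is computed, and the smallest of the 19 is used as the final cost.
The flow is: compute all costs, run Hungarian matching to select one-to-one gt and pred pairs, then compute the final loss.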
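The shift generation and the "minimum over shifts" cost can be sketched as follows. Two simplifications are assumed here: the padding of open polylines up to 19 versions is skipped, and a brute-force search over permutations stands in for Hungarian matching, which is only viable for toy sizes. All names are illustrative.

```python
from itertools import permutations


def gt_shifts(points, is_closed):
    """All equivalent orderings of one gt instance."""
    if not is_closed:
        return [list(points), list(reversed(points))]  # forward and reversed
    ring = list(points[:-1])  # drop the repeated closing point
    shifts = []
    for k in range(len(ring)):
        rolled = ring[k:] + ring[:k]
        shifts.append(rolled + [rolled[0]])  # close the ring again
    return shifts


def l1(a, b):
    return sum(abs(ax - bx) + abs(ay - by) for (ax, ay), (bx, by) in zip(a, b))


def pts_cost(pred, gt_points, is_closed):
    """L1 cost against the best-matching ordering of the gt."""
    return min(l1(pred, shifted) for shifted in gt_shifts(gt_points, is_closed))


def match(preds, gts):
    """Brute-force one-to-one assignment; result[g] is the pred matched to gt g."""
    cost = [[pts_cost(p, g, closed) for g, closed in gts] for p in preds]
    best = min(permutations(range(len(preds)), len(gts)),
               key=lambda perm: sum(cost[p][g] for g, p in enumerate(perm)))
    return list(best)


line = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
square = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
preds = [
    [(1, 1), (0, 1), (0, 0), (1, 0), (1, 1)],  # the square, started at another corner
    [(4, 0), (3, 0), (2, 0), (1, 0), (0, 0)],  # the line, in reversed order
    [(9, 9)] * 5,                              # an unrelated prediction
]
assignment = match(preds, [(line, False), (square, True)])
print(assignment)  # [1, 0]
```

A closed gt with 20 points (the last repeating the first) yields 19 cyclic shifts, which is where the 19 in gt_shifts_pts_list comes from.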

2.4.2 Final loss

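The final loss includes cls_loss (focal_loss), box_reg_loss (L1 loss), pts_loss (L1 loss), iou_loss (giou loss) and dir_loss (a direction loss based on nn.CosineEmbeddingLoss), but the weights of the box-based box_reg_loss and iou_loss are set to 0.
The other losses are as described above; the main addition is dir_loss. It takes the 20 points (actual values, not normalized) and subtracts each point from the next one to get the differences, giving $N \times 50 \times 19$ ($N$ is the number of gt in one input), and then computes a cosine-similarity loss against the gt.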
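A sketch of the direction loss for one matched pair of polylines. It uses the fact that nn.CosineEmbeddingLoss with a target of 1 reduces to 1 - cos(x1, x2); any masking of padded or degenerate segments in the real code is left out, and the function names are illustrative.

```python
import math


def directions(points):
    """Differences between consecutive points: 20 points give 19 direction vectors."""
    return [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(points, points[1:])]


def dir_loss(pred_points, gt_points, eps=1e-8):
    """Mean of 1 - cos(pred_dir, gt_dir) over all consecutive segments."""
    losses = []
    for (px, py), (gx, gy) in zip(directions(pred_points), directions(gt_points)):
        dot = px * gx + py * gy
        norm = math.hypot(px, py) * math.hypot(gx, gy)
        losses.append(1.0 - dot / max(norm, eps))
    return sum(losses) / len(losses)


gt = [(float(i), 0.0) for i in range(20)]
same = dir_loss(gt, gt)                      # same direction everywhere
flipped = dir_loss(list(reversed(gt)), gt)   # opposite direction everywhere
```

A prediction that traces the same geometry in the opposite order gets the maximum loss of 2, so dir_loss penalizes wrong point ordering even when the point positions themselves are close.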