Computing the loss with LabelSmoother

For an LLM, the loss is computed for each batch.

from dataclasses import dataclass

import torch
from torch import nn


@dataclass
class LabelSmoother:
    """
    Adds label-smoothing on a pre-computed output from a Transformers model.

    Args:
        epsilon (`float`, *optional*, defaults to 0.1):
            The label smoothing factor.
        ignore_index (`int`, *optional*, defaults to -100):
            The index in the labels to ignore when computing the loss.
    """

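The docstring of the LabelSmoother dataclass already states its purpose: it adds label smoothing to the pre-computed output of a Transformers model. epsilon is the label smoothing factor, 0.1 by default; ignore_index is the label value that is skipped when computing the loss, -100 by default.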

The rest of the class source is as follows:

    epsilon: float = 0.1
    ignore_index: int = -100

    def __call__(self, model_output, labels, shift_labels=False):
        logits = model_output["logits"] if isinstance(model_output, dict) else model_output[0]
        if shift_labels:
            logits = logits[..., :-1, :].contiguous()
            labels = labels[..., 1:].contiguous()

        log_probs = -nn.functional.log_softmax(logits, dim=-1)
        if labels.dim() == log_probs.dim() - 1:
            labels = labels.unsqueeze(-1)

        padding_mask = labels.eq(self.ignore_index)
        # In case the ignore_index is -100, the gather will fail, so we replace labels by 0. The padding_mask
        # will ignore them in any case.
        labels = torch.clamp(labels, min=0)
        nll_loss = log_probs.gather(dim=-1, index=labels)
        # works for fp16 input tensor too, by internally upcasting it to fp32
        smoothed_loss = log_probs.sum(dim=-1, keepdim=True, dtype=torch.float32)

        nll_loss.masked_fill_(padding_mask, 0.0)
        smoothed_loss.masked_fill_(padding_mask, 0.0)

        # Take the mean over the label dimensions, then divide by the number of active elements (i.e. not-padded):
        num_active_elements = padding_mask.numel() - padding_mask.long().sum()
        nll_loss = nll_loss.sum() / num_active_elements
        smoothed_loss = smoothed_loss.sum() / (num_active_elements * log_probs.shape[-1])
        return (1 - self.epsilon) * nll_loss + self.epsilon * smoothed_loss
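
Written out as a formula, the `__call__` method above appears to compute the following. The symbols are notation introduced here, not taken from the source: $p_t$ is the softmax of the logits at position $t$, $y_t$ is the label at that position, $V$ is the vocabulary size (`log_probs.shape[-1]`), $A$ is the set of positions whose label is not `ignore_index`, and $\epsilon$ is `epsilon`.

```latex
\begin{aligned}
\mathrm{nll\_loss} &= \frac{1}{|A|} \sum_{t \in A} -\log p_t[y_t] \\
\mathrm{smoothed\_loss} &= \frac{1}{|A|\,V} \sum_{t \in A} \sum_{k=1}^{V} -\log p_t[k] \\
\mathrm{loss} &= (1 - \epsilon)\,\mathrm{nll\_loss} + \epsilon\,\mathrm{smoothed\_loss}
\end{aligned}
```

With $\epsilon = 0$ this reduces to the ordinary mean negative log-likelihood over the active positions.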

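In an autoregressive model such as the GPT series, each prediction uses the context up to the current position to predict the token at the next position. During training, the model therefore learns the task of generating the next token from the sequence seen so far.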

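shift_labels is the flag that controls label shifting: False leaves logits and labels unchanged, True shifts them (logits drops the last position along its second-to-last, i.e. sequence, dimension; labels drops the first position along its last dimension). Why handle it this way? The logits at position i predict the token at position i+1, so the last position has no target and the first label has no prediction. The effect is as follows: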

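Original pre-training text: 我爱中国 ("I love China", tokenized as [100, 101, 102, 103])

Then input_ids = [我, 爱, 中, 国] = [100, 101, 102, 103]

logits (last position dropped) = predictions made at [我, 爱, 中], whose targets are [爱, 中, 国] = [101, 102, 103]

labels (first position dropped) = [爱, 中, 国] = [101, 102, 103]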
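
The alignment above can be sketched with plain Python lists. This is only an illustration of the slicing: `prediction_positions` is a hypothetical stand-in for the sequence axis of logits (it stores the id of the token each prediction was made from), and labels is assumed to start out as a copy of input_ids, as is common in causal language model training.

```python
# Token ids from the example: 我 爱 中 国
input_ids = [100, 101, 102, 103]

# Hypothetical stand-in for the sequence axis of logits: one entry per
# input position, holding the id of the token the prediction was made from.
prediction_positions = list(input_ids)
# Assumption: labels start out as a copy of input_ids.
labels = list(input_ids)

# Same slicing as logits[..., :-1, :] and labels[..., 1:] in the source.
kept_positions = prediction_positions[:-1]  # drop the last position
shifted_labels = labels[1:]                 # drop the first label

# Each remaining prediction is scored against the token that follows it.
pairs = list(zip(kept_positions, shifted_labels))
for context_token, target_token in pairs:
    print(f"prediction at token {context_token} -> target {target_token}")
```

After the shift both sequences have the same length (3 here), which is what lets the loss compare them position by position.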

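Other notes:

1. padding_mask records whether each position in labels holds the ignore value (ignore_index, -100 by default).

2. nll_loss is -log(softmax(logits)[c]), where c is the label at that position. Note that log_probs in the code already holds the negated log-softmax, so the gathered value is non-negative.

3. num_active_elements is the number of valid positions, i.e. positions in labels whose value is not -100.

4. return (1 - self.epsilon) * nll_loss + self.epsilon * smoothed_loss returns the final loss value.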
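
To make the four notes concrete, here is a minimal dependency-free sketch that mirrors the same steps on nested Python lists instead of tensors. It is not the library code: the helper names `log_softmax` and `label_smoothed_loss` and the sample numbers are made up for illustration, and it assumes 2-D logits (positions by vocabulary) with at least one non-ignored label.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of one list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def label_smoothed_loss(logits, labels, epsilon=0.1, ignore_index=-100):
    """Illustrative re-implementation of the steps in LabelSmoother.__call__."""
    vocab_size = len(logits[0])
    nll_sum = 0.0
    smoothed_sum = 0.0
    num_active = 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            # Plays the role of padding_mask: ignored positions add nothing.
            continue
        neg_log_probs = [-lp for lp in log_softmax(row)]
        nll_sum += neg_log_probs[label]   # -log p[label]
        smoothed_sum += sum(neg_log_probs)  # sum over the whole vocabulary
        num_active += 1
    # Assumes num_active > 0; otherwise this raises ZeroDivisionError.
    nll_loss = nll_sum / num_active
    smoothed_loss = smoothed_sum / (num_active * vocab_size)
    return (1 - epsilon) * nll_loss + epsilon * smoothed_loss


# Made-up example: 3 positions, vocabulary of 3, last position is padding.
logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [5.0, 5.0, 5.0]]
labels = [0, 2, -100]

loss = label_smoothed_loss(logits, labels)
loss_without_padding = label_smoothed_loss(logits[:2], labels[:2])
print(loss, loss_without_padding)
```

The two printed values agree, which shows that a position labelled -100 affects neither the sums nor the count of active elements.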