When Do You Need a DataCollator, and Some Common DataCollator Classes

This article covers several DataCollator classes from the Hugging Face Transformers library, such as collators that dynamically pad the inputs and labels, along with the concrete variants for sequence-to-sequence, token classification, and language modeling tasks. It highlights when a collator actually matters, namely when the data has not been padded during preprocessing and per-batch dynamic padding is needed, and how to pair the language-modeling collator with a PreTrainedTokenizer's `return_special_tokens_mask` argument for best performance.


DataCollator: if you do not specify one, a default collator is used; its job is simply to convert the features into tensors. The common case where you need to pass a collator explicitly is when the data has not been padded yet and you want dynamic padding per batch. In other words, if padding was already applied in your preprocessing step and you have no special batching needs, you may not need a custom collator at all.
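To make dynamic padding concrete, here is a minimal hand-rolled sketch (the name `pad_batch` is illustrative, not the Transformers API): each batch is padded only to its own longest sequence, and a matching attention mask is built, which is exactly what saves compute compared with padding everything to a global maximum length.

```python
# Minimal sketch of what a dynamic-padding collator does (not the actual
# Transformers implementation): pad every sequence to the batch maximum
# and build the corresponding attention mask.
def pad_batch(features, pad_token_id=0):
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        ids = f["input_ids"]
        pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * pad)
    return batch

batch = pad_batch([{"input_ids": [5, 6, 7]}, {"input_ids": [8, 9]}])
```

Note that the padded length here depends only on the two sequences in this batch; a different batch with shorter sequences would produce shorter tensors.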

DataCollatorForSeq2Seq: data collator that will dynamically pad the inputs received, as well as the labels (it handles inputs and labels separately).
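A rough sketch of why inputs and labels are handled separately (the helper name `seq2seq_pad` is assumed; the real collator also supports tokenizer-driven padding and can prepare `decoder_input_ids`): in seq2seq the target length generally differs from the source length, so the two are padded to independent maxima, and labels are padded with -100 so the loss ignores those positions.

```python
# Sketch only: pad inputs with the pad token and labels with -100,
# each to its own per-batch maximum length.
def seq2seq_pad(features, pad_token_id=0, label_pad_token_id=-100):
    max_in = max(len(f["input_ids"]) for f in features)
    max_lb = max(len(f["labels"]) for f in features)
    out = {"input_ids": [], "labels": []}
    for f in features:
        out["input_ids"].append(
            f["input_ids"] + [pad_token_id] * (max_in - len(f["input_ids"]))
        )
        out["labels"].append(
            f["labels"] + [label_pad_token_id] * (max_lb - len(f["labels"]))
        )
    return out

batch = seq2seq_pad([
    {"input_ids": [1, 2, 3], "labels": [7]},
    {"input_ids": [4], "labels": [8, 9]},
])
```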

DataCollatorWithPadding: data collator that will dynamically pad the inputs received.

DataCollatorForTokenClassification: data collator that will dynamically pad the inputs received, as well as the labels.

DataCollatorForLanguageModeling: data collator used for language modeling. Inputs are dynamically padded to the maximum length of the batch if they are not all of the same length.

  Arguments:
  - `mlm` (`bool`, *optional*, defaults to `True`): whether or not to use masked language modeling. If set to `False`, the labels are the same as the inputs, with the padding tokens ignored (by setting them to -100). Otherwise, the labels are -100 for non-masked tokens and the value to predict for the masked token.
  - `mlm_probability` (`float`, *optional*, defaults to 0.15): the probability with which to (randomly) mask tokens in the input, when `mlm` is set to `True`.
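The `mlm=False` case can be illustrated with a small sketch (the helper `clm_labels` is hypothetical, not the library code): labels are a copy of the inputs, with every padding position replaced by -100 so the cross-entropy loss skips it.

```python
# Sketch of mlm=False label construction: labels mirror the inputs;
# padding tokens become -100 so they are ignored by the loss.
def clm_labels(input_ids, pad_token_id=0):
    return [tok if tok != pad_token_id else -100 for tok in input_ids]

labels = clm_labels([5, 6, 7, 0, 0])
```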

Tip:  For best performance, this data collator should be used with a dataset having items that are dictionaries or BatchEncoding, with the `"special_tokens_mask"` key, as returned by a [`PreTrainedTokenizer`] or a [`PreTrainedTokenizerFast`] with the argument `return_special_tokens_mask=True`.
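The role of the special-tokens mask can be sketched as follows (the `mask_tokens` helper is illustrative; the real collator works on tensors and also replaces some selected positions with random tokens or keeps them unchanged, rather than always inserting the mask token): the mask tells the collator which positions, such as [CLS], [SEP], and padding, must never be selected for masking.

```python
import random

# Simplified sketch of MLM masking: every selected position becomes the
# mask token; the real collator uses an 80/10/10 mask/random/keep split.
def mask_tokens(input_ids, special_tokens_mask, mask_token_id,
                mlm_probability=0.15, seed=0):
    rng = random.Random(seed)
    masked, labels = [], []
    for tok, is_special in zip(input_ids, special_tokens_mask):
        if not is_special and rng.random() < mlm_probability:
            masked.append(mask_token_id)  # model must predict this token
            labels.append(tok)
        else:
            masked.append(tok)            # position excluded from the loss
            labels.append(-100)
    return masked, labels

ids = [101, 5, 6, 7, 102]      # 101/102 stand in for [CLS]/[SEP] ids
special = [1, 0, 0, 0, 1]      # as returned via return_special_tokens_mask
masked, labels = mask_tokens(ids, special, mask_token_id=103,
                             mlm_probability=0.5)
```

Without the mask the collator would have to re-derive which tokens are special from the tokenizer on every batch, which is why supplying `"special_tokens_mask"` up front is faster.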
