When Do You Need a DataCollator, and Some Common DataCollator Classes

This article covers several DataCollator classes from the Hugging Face Transformers library, such as collators that dynamically pad the inputs and labels, along with the concrete variants for sequence-to-sequence, token classification, and language modeling tasks. It highlights when a collator actually matters, namely when the data has not been padded during preprocessing and per-batch dynamic padding is needed, and how to pair the language-modeling collator with a PreTrainedTokenizer's `return_special_tokens_mask` argument for best performance.


DataCollator: if you do not specify one, a default collator is used; its job is simply to convert the features into tensors. The common case where you need to pass a collator explicitly is when the data has not been padded yet and you want dynamic padding per batch. In other words, if padding was already applied in your preprocessing step and you have no special batching needs, you may not need a custom collator at all.
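To make dynamic padding concrete, here is a minimal hand-rolled sketch (the name `pad_batch` is illustrative, not the Transformers API): each batch is padded only to its own longest sequence, and a matching attention mask is built, which is exactly what saves compute compared with padding everything to a global maximum length.

```python
# Minimal sketch of what a dynamic-padding collator does (not the actual
# Transformers implementation): pad every sequence to the batch maximum
# and build the corresponding attention mask.
def pad_batch(features, pad_token_id=0):
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        ids = f["input_ids"]
        pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * pad)
    return batch

batch = pad_batch([{"input_ids": [5, 6, 7]}, {"input_ids": [8, 9]}])
```

Note that the padded length here depends only on the two sequences in this batch; a different batch with shorter sequences would produce shorter tensors.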

DataCollatorForSeq2Seq: data collator that will dynamically pad the inputs received, as well as the labels (it handles inputs and labels separately).
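A rough sketch of why inputs and labels are handled separately (the helper name `seq2seq_pad` is assumed; the real collator also supports tokenizer-driven padding and can prepare `decoder_input_ids`): in seq2seq the target length generally differs from the source length, so the two are padded to independent maxima, and labels are padded with -100 so the loss ignores those positions.

```python
# Sketch only: pad inputs with the pad token and labels with -100,
# each to its own per-batch maximum length.
def seq2seq_pad(features, pad_token_id=0, label_pad_token_id=-100):
    max_in = max(len(f["input_ids"]) for f in features)
    max_lb = max(len(f["labels"]) for f in features)
    out = {"input_ids": [], "labels": []}
    for f in features:
        out["input_ids"].append(
            f["input_ids"] + [pad_token_id] * (max_in - len(f["input_ids"]))
        )
        out["labels"].append(
            f["labels"] + [label_pad_token_id] * (max_lb - len(f["labels"]))
        )
    return out

batch = seq2seq_pad([
    {"input_ids": [1, 2, 3], "labels": [7]},
    {"input_ids": [4], "labels": [8, 9]},
])
```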

DataCollatorWithPadding: data collator that will dynamically pad the inputs received.

DataCollatorForTokenClassification: data collator that will dynamically pad the inputs received, as well as the labels.

DataCollatorForLanguageModeling: data collator used for language modeling. Inputs are dynamically padded to the maximum length of the batch if they are not all of the same length.

  Arguments:
  - `mlm` (`bool`, *optional*, defaults to `True`): whether or not to use masked language modeling. If set to `False`, the labels are the same as the inputs, with the padding tokens ignored (by setting them to -100). Otherwise, the labels are -100 for non-masked tokens and the value to predict for the masked token.
  - `mlm_probability` (`float`, *optional*, defaults to 0.15): the probability with which to (randomly) mask tokens in the input, when `mlm` is set to `True`.
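The `mlm=False` case can be illustrated with a small sketch (the helper `clm_labels` is hypothetical, not the library code): labels are a copy of the inputs, with every padding position replaced by -100 so the cross-entropy loss skips it.

```python
# Sketch of mlm=False label construction: labels mirror the inputs;
# padding tokens become -100 so they are ignored by the loss.
def clm_labels(input_ids, pad_token_id=0):
    return [tok if tok != pad_token_id else -100 for tok in input_ids]

labels = clm_labels([5, 6, 7, 0, 0])
```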

Tip:  For best performance, this data collator should be used with a dataset having items that are dictionaries or BatchEncoding, with the `"special_tokens_mask"` key, as returned by a [`PreTrainedTokenizer`] or a [`PreTrainedTokenizerFast`] with the argument `return_special_tokens_mask=True`.
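The role of the special-tokens mask can be sketched as follows (the `mask_tokens` helper is illustrative; the real collator works on tensors and also replaces some selected positions with random tokens or keeps them unchanged, rather than always inserting the mask token): the mask tells the collator which positions, such as [CLS], [SEP], and padding, must never be selected for masking.

```python
import random

# Simplified sketch of MLM masking: every selected position becomes the
# mask token; the real collator uses an 80/10/10 mask/random/keep split.
def mask_tokens(input_ids, special_tokens_mask, mask_token_id,
                mlm_probability=0.15, seed=0):
    rng = random.Random(seed)
    masked, labels = [], []
    for tok, is_special in zip(input_ids, special_tokens_mask):
        if not is_special and rng.random() < mlm_probability:
            masked.append(mask_token_id)  # model must predict this token
            labels.append(tok)
        else:
            masked.append(tok)            # position excluded from the loss
            labels.append(-100)
    return masked, labels

ids = [101, 5, 6, 7, 102]      # 101/102 stand in for [CLS]/[SEP] ids
special = [1, 0, 0, 0, 1]      # as returned via return_special_tokens_mask
masked, labels = mask_tokens(ids, special, mask_token_id=103,
                             mlm_probability=0.5)
```

Without the mask the collator would have to re-derive which tokens are special from the tokenizer on every batch, which is why supplying `"special_tokens_mask"` up front is faster.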
