ms-swift official Chinese documentation
https://swift.readthedocs.io/zh-cn/latest/BestPractices/Reranker.html
Original text
By default, MAX_POSITIVE_SAMPLES positive samples and MAX_NEGATIVE_SAMPLES negative samples are taken from each data record. Each positive sample is grouped with MAX_NEGATIVE_SAMPLES negative samples, so each record expands into MAX_POSITIVE_SAMPLES x (1 + MAX_NEGATIVE_SAMPLES) records. If a record has too few positives/negatives, all of them are used; if it has more than MAX_POSITIVE_SAMPLES positives or MAX_NEGATIVE_SAMPLES negatives, they are sampled randomly. IMPORTANT: the expanded data is placed in the same batch, so the effective batch size on each device will be per_device_train_batch_size × MAX_POSITIVE_SAMPLES × (1 + MAX_NEGATIVE_SAMPLES). Adjust per_device_train_batch_size accordingly to avoid running out of GPU memory.
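For instance, with per_device_train_batch_size = 2, MAX_POSITIVE_SAMPLES = 1 and MAX_NEGATIVE_SAMPLES = 7 (illustrative numbers, not the defaults), each device actually processes 2 × 1 × (1 + 7) = 16 query-doc pairs per step.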
MAX_POSITIVE_SAMPLES x (1 + MAX_NEGATIVE_SAMPLES): why the "1 +" here, instead of just MAX_POSITIVE_SAMPLES x MAX_NEGATIVE_SAMPLES?
This comes down to the training paradigm: the reranker is point2point (pointwise) at its core; only the loss function differs.
What is point2point? Just show the prompts.
For example:
You are an excellent data expert. Please select from refer_doc the data most relevant to the user's query.
<query>
In what year did Ultraman Tiga air?
</query>
<refer_doc>
a. Ultraman Tiga premiered in Japan in 1996
b. Ultraman Tiga's human host is Daigo
c. Ultraman Gaia is the Destroyer of the Earth
</refer_doc>
point2point:
You are an excellent data expert. Please select from refer_doc the data most relevant to the user's query.
<query>
In what year did Ultraman Tiga air?
</query>
<refer_doc>
a. Ultraman Tiga premiered in Japan in 1996
</refer_doc>
You are an excellent data expert. Please select from refer_doc the data most relevant to the user's query.
<query>
In what year did Ultraman Tiga air?
</query>
<refer_doc>
a. Ultraman Tiga's human host is Daigo
</refer_doc>
You are an excellent data expert. Please select from refer_doc the data most relevant to the user's query.
<query>
In what year did Ultraman Tiga air?
</query>
<refer_doc>
a. Ultraman Gaia is the Destroyer of the Earth
</refer_doc>
MAX_POSITIVE_SAMPLES x (1 + MAX_NEGATIVE_SAMPLES)
Why 1 + MAX_NEGATIVE_SAMPLES?
Example:
MAX_POSITIVE_SAMPLES = 1
MAX_NEGATIVE_SAMPLES = 2
{
query: a
pos: [A]
neg: [B, C]
}
a-A a-B a-C
1 x (1 + 2) = 3 // plug into the official formula, it matches
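A minimal Python sketch of that expansion, assuming a record shaped like the one above (my own reconstruction for illustration, not ms-swift's actual implementation; max_pos and max_neg stand in for the two environment variables):

import random

def expand(record, max_pos, max_neg):
    # Keep all positives/negatives if there are too few; sample randomly
    # if there are too many (as the docs describe).
    pos = record["pos"] if len(record["pos"]) <= max_pos else random.sample(record["pos"], max_pos)
    neg = record["neg"] if len(record["neg"]) <= max_neg else random.sample(record["neg"], max_neg)
    pairs = []
    for p in pos:                              # one group per positive
        pairs.append((record["query"], p, 1))  # the "1 +" is the positive itself
        for n in neg:
            pairs.append((record["query"], n, 0))
    return pairs                               # len(pos) * (1 + len(neg)) pairs

print(expand({"query": "a", "pos": ["A"], "neg": ["B", "C"]}, 1, 2))
# [('a', 'A', 1), ('a', 'B', 0), ('a', 'C', 0)]  ->  1 x (1 + 2) = 3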
Because in the end it is all point2point.
https://huggingface.co/Qwen/Qwen3-Reranker-0.6B
You can see that even when you send one query with multiple docs to the reranker,
it is still solved by running inference on each query-doc pair.
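A rough sketch of what that pointwise inference looks like with transformers. The model card describes scoring a pair by comparing the logits of the "yes" and "no" tokens; the prompt below is a simplified stand-in for the official template, so treat the details as assumptions:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Qwen/Qwen3-Reranker-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def score(query, doc):
    # Simplified prompt for illustration; the official template in the
    # model card is longer (system prompt, instruction, chat markers).
    prompt = (f"<Query>: {query}\n<Document>: {doc}\n"
              "Does the document answer the query? Answer yes or no.")
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    yes_no = torch.stack([logits[tokenizer.convert_tokens_to_ids("no")],
                          logits[tokenizer.convert_tokens_to_ids("yes")]])
    return torch.softmax(yes_no, dim=0)[1].item()  # P("yes") as relevance

docs = [
    "Ultraman Tiga premiered in Japan in 1996",
    "Ultraman Tiga's human host is Daigo",
    "Ultraman Gaia is the Destroyer of the Earth",
]
# One query + three docs still means three independent query-doc passes:
scores = [score("In what year did Ultraman Tiga air?", d) for d in docs]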
As for listwise, it just takes the point2point losses and arranges them; the whole question is how to arrange them so that point2point performs better.
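For example, one common way to "arrange" them is a softmax cross-entropy over each group of one positive plus k negatives (a sketch of the general idea, not necessarily the exact loss ms-swift uses):

import torch
import torch.nn.functional as F

# Scores for one expanded group, in the order [pos, neg1, neg2, ...];
# each score came from an independent point2point forward pass.
scores = torch.tensor([[2.1, 0.3, -0.5]])               # shape (num_groups, 1 + k)
labels = torch.zeros(scores.size(0), dtype=torch.long)  # the positive sits at index 0
# Softmax over each group: a listwise objective built from pointwise scores.
loss = F.cross_entropy(scores, labels)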