LLaMA-Factory：无代码微调大模型，小白也能上手（以Qwen3为例，包括lora、dpo、ppo等）

原创于 2025-06-13 15:28:54 发布 · 1.4k 阅读

26 ·

CC 4.0 BY-SA版权

文章标签：

#llama #深度学习 #AIGC #自然语言处理

大语言模型同时被 3 个专栏收录

28 篇文章

订阅专栏

多模态大模型

21 篇文章

订阅专栏

人工智能入门课

8 篇文章

订阅专栏

1.简介

LLaMA Factory 是一个用于训练和微调大型语言模型的高效平台，支持多种模型和训练方法。它提供了零代码的命令行界面（CLI）和 Web UI，用户无需编写代码即可在本地微调超过100种预训练模型。

在模型选择上，它提供了LLaMA、LLaVA、Mistral等众多预训练模型，用户可以根据自己的需求进行选择。训练方法方面，它支持多种方式，如预训练、指令监督微调、奖励模型训练等，为用户提供了丰富的选择。

在计算精度和优化算法方面，LLaMA Factory也表现出色。它支持多种精度的微调方式，包括16位全参数微调、LoRA微调以及多种量化微调方法，能够满足不同精度需求的训练任务。优化算法的选择也非常丰富，如GaLore、BAdam、DoRA等，这些算法能够帮助用户更高效地进行模型训练。此外，它还提供了加速操作FlashAttention-2和Unsloth，以及支持Transformers和vLLM的推理引擎，进一步提升了训练和推理的效率。

LLaMA Factory提供了零代码的命令行界面和Web UI，用户无需编写代码即可在本地微调超过100种预训练模型。Web UI界面方便用户进行模型微调和测试，用户可以通过简单的参数调整来启动训练过程，并在训练完成后将模型导出到Hugging Face或保存到本地。它还支持多种数据集选项，用户可以根据需要选择合适的数据集来微调模型。

源码地址：GitHub - hiyouga/LLaMA-Factory: Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)

官方文档：LLaMA Factory

2.使用

安装

首先下载代码：GitHub - hiyouga/LLaMA-Factory: Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)

或者使用git命令：

git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git

然后安装命令

cd LLaMA-Factory
pip install -e ".[torch,metrics]"

如果出现环境冲突，请尝试使用以下命令解决：

pip install --no-deps -e .

完成安装后，可以通过使用 llamafactory-cli version 来快速校验安装是否成功。如果您能成功看到类似下面的界面，就说明安装成功了。

如果您想在 Windows 上启用量化 LoRA（QLoRA），请根据您的 CUDA 版本选择适当的 bitsandbytes 发行版本。如：

pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.2.post2-py3-none-win_amd64.whl

如果您要在 Windows 平台上启用 FlashAttention-2，请根据您的 CUDA 版本选择适当的 flash-attention 发行版本。

数据处理

首先，不管进行什么训练方法，都必须遵循以下原则：

使用的数据应当保存在LLaMA-Factory-main/data文件夹下
使用json(主要格式)/jsonl/parquet格式，后文会针对不同方法介绍不同格式
如果使用自定义的数据集，务必在 LLaMA-Factory-main/data/dataset_info.json 文件中添加对数据集及其内容的定义。

LLaMA-Factory-main/data文件夹下提供了大量的示例，读者如有不明白之处可以自行查看。

简单构建数据集

以sft方法为例，我们可以简单的构建一个数据集：

我们首先构建一个json文件，命名为"data.json"，其中每一条数据如下所示：

其中：

instruction 列对应的内容为人类指令，
input 列对应的内容为人类输入（选填）
output 列对应的内容为模型回答。

{
  "instruction": "计算这些物品的总费用。 ",
  "input": "输入：汽车 - $3000，衣服 - $100，书 - $20。",
  "output": "汽车、衣服和书的总费用为 $3000 + $100 + $20 = $3120。"
},

你可以仿照上述案例生成大批数据，用于训练。所有数据要用[ ]包裹起来，如下：

[
  {
    "instruction": "计算这些物品的总费用。 ",
    "input": "输入：汽车 - $3000，衣服 - $100，书 - $20。",
    "output": "汽车、衣服和书的总费用为 $3000 + $100 + $20 = $3120。"
  },
    ...
]

然后在dataset_info.json添加对数据集及其内容的定义。如下：

"数据集名称": {
    "file_name": "数据集名称及后缀"
  },

列如
"data": {
    "file_name": "data.json"
  },

以官方提供的identity.json为例，它在dataset_info.json里的信息如下：

详细构建数据集

当然，你也可以写的更详细一些，以应对各种情况，如下：

instruction 列对应的内容为人类指令，
input 列对应的内容为人类输入，
output 列对应的内容为模型回答。
system 列对应的内容将被作为系统提示词。
history 列是由多个字符串二元组构成的列表，分别代表历史消息中每轮对话的指令和回答。

[
  {
    "instruction": "人类指令（必填）",
    "input": "人类输入（选填）",
    "output": "模型回答（必填）",
    "system": "系统提示词（选填）",
    "history": [
      ["第一轮指令（选填）", "第一轮回答（选填）"],
      ["第二轮指令（选填）", "第二轮回答（选填）"]
    ]
  }
]

如：

[
  {
    "instruction": "今天的天气怎么样？",
    "input": "",
    "output": "今天的天气不错，是晴天。",
    "history": [
      [
        "今天会下雨吗？",
        "今天不会下雨，是个好天气。"
      ],
      [
        "今天适合出去玩吗？",
        "非常适合，空气质量很好。"
      ]
    ]
  }
]

对应的dataset_info.json写法如下：

"data": {
  "file_name": "data.json",
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "response": "output",
    "system": "system",
    "history": "history"
  }
}

定义完数据集后在dataset_info.json添加对数据集及其内容的定义。

我们现在在此处先暂时介绍sft的数据格式，后续会介绍其他微调方法所需的数据集格式。

Web-UI

在命令行输入以下命令即可启动

llamafactory-cli webui

web-ui的操作过于简单，web-ui只需调整相应参数，并配置好数据集即可。

后续我们只介绍纯代码/命令行的使用方法。

SFT-lora微调

指令监督微调(Instruct Tuning)通过让模型学习详细的指令以及对应的回答来优化模型在特定指令下的表现。这里我们使用lora方法进行微调。

数据集构建

可参考LLaMA-Factory-main/data/alpaca_zh_demo.json进行构建。官方提供了三个样本：alpaca_en_demo.json、alpaca_zh_demo.json、identity.json。我们接下来直接使用官方提供的数据集进行讲解，其数据格式如下：

[
  {
    "instruction": "识别并解释给定列表中的两个科学理论：细胞理论和日心说。",
    "input": "",
    "output": "细胞理论是生物科学的一个理论，它认为所有生命体都是由微小的基本单元——细胞所构成。这是生物学的一个基础理论，认为细胞是所有生物的基本结构和功能单位，所有的生物都是由一个或多个细胞组成，细胞只能通过细胞分裂产生新的细胞。这一理论由薛定谔、施瓦内和雪莱于1839年首次提出。\n\n日心说是指太阳是太阳系的中心，也就是说，行星围绕太阳旋转的理论。这个理论打破了传统的地心说观点，认为地球并不是宇宙的中心。日心说的提出者是尼古拉·哥白尼，他在16世纪初发表了他的著作《天体运行论》，阐述了太阳系行星围绕太阳运行的模型，为天文学的发展做出了巨大贡献。"
  },

  ......,

  {
    "instruction": "比较以下两位著名人物，使用特征列表并以句子形式提供输出。",
    "input": "人物1：阿尔伯特·爱因斯坦\n人物2：斯蒂芬·霍金\n特征：智力、教育、对世界的影响",
    "output": "阿尔伯特·爱因斯坦和斯蒂芬·霍金都是拥有极其出色智力的人物。两人都取得过非常高的教育成就，他们推进了科学发展并在世界范围内产生了深远的影响。爱因斯坦以其相对论和质能关系公式而闻名，而霍金以其关于黑洞和宇宙的发现而著称。两位科学家都以其深厚的学识和非凡的贡献影响了世界。"
  },
]

该数据集同样可以使用上文提到的构建方法进行更详细的构建。

lora微调训练

我们需要首先配置yaml文件，所有yaml文件位于LLaMA-Factory-main/examples/下，如dpo、ppo、lora对应lora文件夹，而qlora对应qlora文件夹。

我自己写的qwen3_lora_sft.yaml配置信息如下，读者也可参考LLaMA-Factory-main/examples/train_qlora或LLaMA-Factory-main/examples/train_lora下其他文件进行配置。这里面最关键的修改配置如下：

模型路径model_name_or_path：模型所在位置，可以使用绝对地址
数据集dataset：训练所需数据集，注意一定要在dataset_info.json定义。这里直接写名字就可以了，不需要后缀
template：模板改成模型的名字，如qwen
max_samples：最大数据量，如果数据集有10w条数据，我只想要用1w条，可以在这里设置；不足则为数据集大小。
output_dir：模型lora权重保存位置
save_steps：每隔多少步保存一次
per_device_train_batch_size：batchsize
num_train_epochs：总共训练多少轮
val_size：测试集比例。比如我的数据集有1w条，0.1表示9000条为训练集，1000条为测试集

更多参数请参考文末更多参数部分

### model
model_name_or_path: /media/good/4TB/mn/model/llm/qwen/Qwen3-4B

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: identity,alpaca_en_demo    # 数据集可以写多个，用逗号隔开
template: qwen
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/qwen3-4b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

编辑完配置信息后，将其放在LLaMA-Factory-main/examples/train_lora/目录下，然后我们在命令行使用如下命令，即可开始lora微调（注意要在llama_factory目录下使用，后面的yaml是配置信息的路径）

llamafactory-cli train examples/train_lora/qwen3_lora_sft.yaml

使用单机多卡训练需加：

FORCE_TORCHRUN=1 llamafactory-cli train examples/train_lora/qwen3_lora_sft.yaml

出现下面的界面，说明正在划分数据集（若数据集太大请稍等片刻）

Generating train split: 91 examples [00:00, 5903.18 examples/s]
Converting format of dataset (num_proc=16): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 91/91 [00:00<00:00, 356.64 examples/s]
[INFO|2025-05-21 20:31:11] llamafactory.data.loader:143 >> Loading dataset alpaca_en_demo.json...
Setting num_proc from 16 back to 1 for the train split to disable multiprocessing as it only contains one shard.
Generating train split: 1000 examples [00:00, 42687.08 examples/s]
Converting format of dataset (num_proc=16): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 3822.04 examples/s]
Running tokenizer on dataset (num_proc=16): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1091/1091 [00:02<00:00, 402.19 examples/s]

出现下面的界面，说明开始训练了

Loading checkpoint shards:   0%|                                                                                                                                                                     | 0/3 [00:00<?, ?it/s]

出现下面的界面，说明训练完成了

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 110/110 [00:08<00:00, 13.35it/s]
***** eval metrics *****
  epoch                   =        3.0
  eval_loss               =     1.1679
  eval_runtime            = 0:00:08.31
  eval_samples_per_second =     13.226
  eval_steps_per_second   =     13.226
[INFO|modelcard.py:450] 2025-05-21 20:54:16,634 >> Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}

训练完成后可在LLaMA_Factory/saves/qwen3-4b/lora/sft文件夹下找到保存的结果，包括模型权重和运行信息。其中：

checkpoint-xxx：第xxx轮时，保存的模型lora权重
sft文件下：最终保存的权重及训练时的信息（如loss曲线、log信息等）

更多sft配置信息：

model_name_or_path	模型名称或路径
stage	训练阶段，可选: rm(reward modeling), pt(pretrain), sft(Supervised Fine-Tuning), PPO, DPO, KTO, ORPO，lora选择sft
do_train	true用于训练, false用于评估
finetuning_type	微调方式。可选: freeze, lora, full
lora_target	采取LoRA方法的目标模块，默认值为 `all`。
dataset	使用的数据集，使用”,”分隔多个数据集
template	数据集模板，请保证数据集模板与模型相对应。
output_dir	输出路径
logging_steps	日志输出步数间隔
save_steps	模型断点保存间隔
overwrite_output_dir	是否允许覆盖输出目录
per_device_train_batch_size	每个设备上训练的批次大小
gradient_accumulation_steps	梯度积累步数
max_grad_norm	梯度裁剪阈值
learning_rate	学习率
lr_scheduler_type	学习率曲线，可选 `linear`, `cosine`, `polynomial`, `constant` 等。
num_train_epochs	训练周期数，一般几轮就够了
bf16	是否使用 bf16 格式
warmup_ratio	学习率预热比例
warmup_steps	学习率预热步数
push_to_hub	是否推送模型到 Huggingface

lora配置信息

参数名称	类型	介绍
additional_target[非必须]	[str,]	除 LoRA 层之外设置为可训练并保存在最终检查点中的模块名称。使用逗号分隔多个模块。默认值为 `None`
lora_alpha[非必须]	int	LoRA 缩放系数。一般情况下为 lora_rank * 2, 默认值为 `None`
lora_dropout	float	LoRA 微调中的 dropout 率。默认值为 `0`
lora_rank	int	LoRA 微调的本征维数 `r`， `r` 越大可训练的参数越多。默认值为 `8`
lora_target	str	应用 LoRA 方法的模块名称。使用逗号分隔多个模块，使用 `all` 指定所有模块。默认值为 `all`
loraplus_lr_ratio[非必须]	float	LoRA+ 学习率比例(`λ = ηB/ηA`)。 `ηA, ηB` 分别是 adapter matrices A 与 B 的学习率。LoRA+ 的理想取值与所选择的模型和任务有关。默认值为 `None`
loraplus_lr_embedding[非必须]	float	LoRA+ 嵌入层的学习率, 默认值为 `1e-6`
use_rslora	bool	是否使用秩稳定 LoRA(Rank-Stabilized LoRA)，默认值为 `False`。
use_dora	bool	是否使用权重分解 LoRA（Weight-Decomposed LoRA），默认值为 `False`
pissa_init	bool	是否初始化 PiSSA 适配器，默认值为 `False`
pissa_iter	int	PiSSA 中 FSVD 执行的迭代步数。使用 `-1` 将其禁用，默认值为 `16`
pissa_convert	bool	是否将 PiSSA 适配器转换为正常的 LoRA 适配器，默认值为 `False`
create_new_adapter	bool	是否创建一个具有随机初始化权重的新适配器，默认值为 `False`

用户可根据以上信息自行配置参数

合并lora矩阵

在终端运行：

llamafactory-cli export examples/merge_lora/qwen3_lora_sft.yaml

其中qwen3_lora_sft.yaml文件如下。用户也可参考LLaMA-Factory-main/examples/merge_lora下其他文件进行配置，具体来说：

model_name_or_path（重要）: 原模型的名称或路径
adapter_name_or_path（重要）：保存的lora矩阵的位置
template: 模型模板
export_dir（重要）: 导出路径
export_size: 最大导出模型文件大小
export_device: 导出设备
export_legacy_format: 是否使用旧格式导出

### Note: DO NOT use quantized model or quantization_bit when merging lora adapters

### model
model_name_or_path: /media/good/4TB/mn/model/llm/qwen/Qwen3-4B
adapter_name_or_path: saves/qwen3-4b/lora/sft
template: qwen3
trust_remote_code: true

### export
export_dir: output/qwen3_lora_sft
export_size: 5
export_device: cpu  # choices: [cpu, auto]
export_legacy_format: false

结束后看到下面的就说明运行成功了

[INFO|2025-05-21 21:00:03] llamafactory.model.model_utils.attention:143 >> Using torch SDPA for faster training and inference.
[INFO|2025-05-21 21:00:13] llamafactory.model.adapter:143 >> Merged 1 adapter(s).
[INFO|2025-05-21 21:00:13] llamafactory.model.adapter:143 >> Loaded adapter(s): saves/qwen3-4b/lora/sft
[INFO|2025-05-21 21:00:13] llamafactory.model.loader:143 >> all params: 4,022,468,096
[INFO|2025-05-21 21:00:13] llamafactory.train.tuner:143 >> Convert model dtype to: torch.bfloat16.
[INFO|configuration_utils.py:424] 2025-05-21 21:00:13,481 >> Configuration saved in output/qwen3_lora_sft/config.json
[INFO|configuration_utils.py:904] 2025-05-21 21:00:13,481 >> Configuration saved in output/qwen3_lora_sft/generation_config.json
[INFO|modeling_utils.py:3730] 2025-05-21 21:00:22,103 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 2 checkpoint shards. You can find where each parameters has been saved in the index located at output/qwen3_lora_sft/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2356] 2025-05-21 21:00:22,103 >> chat template saved in output/qwen3_lora_sft/chat_template.jinja
[INFO|tokenization_utils_base.py:2525] 2025-05-21 21:00:22,104 >> tokenizer config file saved in output/qwen3_lora_sft/tokenizer_config.json
[INFO|tokenization_utils_base.py:2534] 2025-05-21 21:00:22,104 >> Special tokens file saved in output/qwen3_lora_sft/special_tokens_map.json
[INFO|2025-05-21 21:00:22] llamafactory.train.tuner:143 >> Ollama modelfile saved in output/qwen3_lora_sft/Modelfile

模型导出时的更多相关参数如下：

参数名称	类型	介绍	默认值
export_dir	Optional[str]	导出模型保存目录的路径。	None
export_size	int	导出模型的文件分片大小（以GB为单位）。	5
export_device	Literal[“cpu”, “auto”]	导出模型时使用的设备，auto 可自动加速导出。	cpu
export_quantization_bit	Optional[int]	量化导出模型时使用的位数。	None
export_quantization_dataset	Optional[str]	用于量化导出模型的数据集路径或数据集名称。	None
export_quantization_nsamples	int	量化时使用的样本数量。	128
export_quantization_maxlen	int	用于量化的模型输入的最大长度。	1024
export_legacy_format	bool	True： .bin 格式保存。 False： .safetensors 格式保存。	False
export_hub_model_id	Optional[str]	模型上传至 Huggingface 的仓库名称。	None

需要注意的是，微调后推理时不需要合并矩阵也能运行，在推理代码中，只需要使用保存的lora文件的位置即可，因为保存的文件中指明了原模型的位置。如果你还想进一步微调，那就需要合并矩阵了。

直接使用lora矩阵的位置，如：

from transformers import AutoModelForCausalLM, AutoTokenizer


model_name = "saves/qwen3-4b/lora/sft/checkpoint-2000"        # 路径是lora矩阵权重路径
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt},
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

参考官方文档(lora)：调优算法 - LLaMA Factory

参考官方文档(sft)：SFT 训练 - LLaMA Factory

多模态SFT-lora(qwen2.5-vl)

数据集构建

llama-factory支持多模态图像数据集、视频数据集以及音频数据集的输入。

多模态图像数据集需要额外添加一个 images 列，包含输入图像的路径。注意图片的数量必须与文本中所有 <image> 标记的数量严格一致。可参考LLaMA-Factory-main/data/mllm_demo.json进行构建。图片位于data/mllm_demo_data下。

数据集示例如下，其中<image>代表图片

[
  {
    "messages": [
      {
        "content": "<image>Who are they?",
        "role": "user"
      },
      {
        "content": "They're Kane and Gretzka from Bayern Munich.",
        "role": "assistant"
      },
      {
        "content": "What are they doing?<image>",
        "role": "user"
      },
      {
        "content": "They are celebrating on the soccer field.",
        "role": "assistant"
      }
    ],
    "images": [
      "mllm_demo_data/1.jpg",
      "mllm_demo_data/1.jpg"
    ]
  },

    ...

  {
    "messages": [
      {
        "content": "<image>Who is he?",
        "role": "user"
      },
      {
        "content": "He's Thomas Muller from Bayern Munich.",
        "role": "assistant"
      },
      {
        "content": "Why is he on the ground?",
        "role": "user"
      },
      {
        "content": "Because he's sliding on his knees to celebrate.",
        "role": "assistant"
      }
    ],
    "images": [
      "mllm_demo_data/2.jpg"
    ]
  }
]

在dataset_info.json 中的 数据集描述 应为：

"mllm_demo": {
    "file_name": "mllm_demo.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages",
      "images": "images"            # 重要
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant"
    }
  },

其位置关系如下：

多模态视频数据集需要额外添加一个 videos 列，包含输入视频的路径。注意视频的数量必须与文本中所有 <video> 标记的数量严格一致。可参考LLaMA-Factory-main/data/mllm_video_demo.json进行构建。视频位于data/mllm_demo_data下。

数据集示例如下，其中<video>代表视频

[
  {
    "messages": [
      {
        "content": "<video>Why is this video funny?",
        "role": "user"
      },
      {
        "content": "Because a baby is reading, and he is so cute!",
        "role": "assistant"
      }
    ],
    "videos": [
      "mllm_demo_data/1.mp4"
    ]
  },

    ...

  {
    "messages": [
      {
        "content": "<video>What is she doing?",
        "role": "user"
      },
      {
        "content": "She is cooking.",
        "role": "assistant"
      }
    ],
    "videos": [
      "mllm_demo_data/2.avi"
    ]
  }
]

在dataset_info.json 中的 数据集描述 应为：

  "mllm_video_demo": {
    "file_name": "mllm_video_demo.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages",
      "videos": "videos"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant"
    }
  },

多模态音频数据集需要额外添加一个 audio 列，包含输入图像的路径。注意音频的数量必须与文本中所有 <audio> 标记的数量严格一致。

数据集示例如下，其中<audio>代表音频

[
  {
    "messages": [
      {
        "content": "<video><audio>What is the video describing?",
        "role": "user"
      },
      {
        "content": "A girl who is drawing a picture of a guitar and feel nervous.",
        "role": "assistant"
      }
    ],
    "videos": [
      "mllm_demo_data/4.mp4"
    ],
    "audios": [
      "mllm_demo_data/4.mp3"
    ]
  },

    ...

  {
    "messages": [
      {
        "content": "<video><audio>What does this girl say?",
        "role": "user"
      },
      {
        "content": "She says: 'Hello! Take a look at what am I drawing!'",
        "role": "assistant"
      }
    ],
    "videos": [
      "mllm_demo_data/4.mp4"
    ],
    "audios": [
      "mllm_demo_data/4.mp3"
    ]
  }
]

在dataset_info.json 中的 数据集描述 应为：

  "mllm_video_audio_demo": {
    "file_name": "mllm_video_audio_demo.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages",
      "videos": "videos",
      "audios": "audios"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant"
    }
  },

lora微调训练

我们需要首先配置yaml文件，所有yaml文件位于LLaMA-Factory-main/examples/下，我们可参考examples/train_lora/qwen2_5vl_lora_sft.yaml

### model
model_name_or_path:  /media/good/4TB/mn/model/llm/qwen/Qwen2.5-VL-7B-Instruct
image_max_pixels: 262144    # 图像输入的最大像素数。
video_max_pixels: 16384
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all

### dataset
dataset: mllm_demo,identity,alpaca_en_demo  # 视频微调数据: mllm_video_demo
template: qwen2_vl
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: saves/qwen2_5vl-7b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null

### eval
# val_size: 0.1
# per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 500

参数名称	类型	介绍	默认值
image_max_pixels	int	图像输入的最大像素数。	768 x 768
image_min_pixels	int	图像输入的最小像素数。	32 x 32
video_max_pixels	int	视频输入的最大像素数。	256 x 256
video_min_pixels	int	视频输入的最小像素数。	16 x 16
video_fps	float	视频输入的采样帧率（每秒采样帧数）。	2.0
video_maxlen	int	视频输入的最大采样帧数。	128

出现下面的界面，说明训练完成了

[INFO|tokenization_utils_base.py:2356] 2025-06-03 15:40:13,851 >> chat template saved in saves/qwen2_5vl-7b/lora/sft/chat_template.jinja
[INFO|tokenization_utils_base.py:2525] 2025-06-03 15:40:13,851 >> tokenizer config file saved in saves/qwen2_5vl-7b/lora/sft/tokenizer_config.json
[INFO|tokenization_utils_base.py:2534] 2025-06-03 15:40:13,851 >> Special tokens file saved in saves/qwen2_5vl-7b/lora/sft/special_tokens_map.json
***** train metrics *****
  epoch                    =        3.0
  total_flos               = 26151461GF
  train_loss               =      0.811
  train_runtime            = 0:18:16.41
  train_samples_per_second =      3.002
  train_steps_per_second   =      0.378
Figure saved at: saves/qwen2_5vl-7b/lora/sft/training_loss.png

合并lora矩阵

在终端运行：

llamafactory-cli export examples/merge_lora/qwen2_5vl_lora_sft.yaml

其中qwen2_5vl_lora_sft.yaml文件如下。用户也可参考LLaMA-Factory-main/examples/merge_lora下其他文件进行配置

### Note: DO NOT use quantized model or quantization_bit when merging lora adapters

### model
model_name_or_path: /media/good/4TB/mn/model/llm/qwen/Qwen2.5-VL-7B-Instruct
adapter_name_or_path: saves/qwen2_5vl-7b/lora/sft
template: qwen2_vl
trust_remote_code: true

### export
export_dir: output/qwen2_5vl_lora_sft
export_size: 5
export_device: cpu  # choices: [cpu, auto]
export_legacy_format: false

结束后看到下面的就说明运行成功了

[INFO|tokenization_utils_base.py:2356] 2025-06-03 20:30:17,853 >> chat template saved in output/qwen2_5vl_lora_sft/chat_template.jinja
[INFO|tokenization_utils_base.py:2525] 2025-06-03 20:30:17,853 >> tokenizer config file saved in output/qwen2_5vl_lora_sft/tokenizer_config.json
[INFO|tokenization_utils_base.py:2534] 2025-06-03 20:30:17,853 >> Special tokens file saved in output/qwen2_5vl_lora_sft/special_tokens_map.json
[INFO|image_processing_base.py:260] 2025-06-03 20:30:18,017 >> Image processor saved in output/qwen2_5vl_lora_sft/preprocessor_config.json
[INFO|tokenization_utils_base.py:2356] 2025-06-03 20:30:18,192 >> chat template saved in output/qwen2_5vl_lora_sft/chat_template.jinja
[INFO|tokenization_utils_base.py:2525] 2025-06-03 20:30:18,219 >> tokenizer config file saved in output/qwen2_5vl_lora_sft/tokenizer_config.json
[INFO|tokenization_utils_base.py:2534] 2025-06-03 20:30:18,220 >> Special tokens file saved in output/qwen2_5vl_lora_sft/special_tokens_map.json
[INFO|video_processing_utils.py:491] 2025-06-03 20:30:18,704 >> Video processor saved in output/qwen2_5vl_lora_sft/video_preprocessor_config.json
[INFO|processing_utils.py:674] 2025-06-03 20:30:19,218 >> chat template saved in output/qwen2_5vl_lora_sft/chat_template.jinja
[INFO|2025-06-03 20:30:19] llamafactory.train.tuner:143 >> Ollama modelfile saved in output/qwen2_5vl_lora_sft/Modelfile

DPO

DPO（Direct Preference Optimization）是一种用于训练语言模型的方法，旨在通过直接优化模型的输出以更好地满足人类偏好。这种方法的核心思想是利用人类的偏好反馈来指导模型的训练过程，从而生成更符合人类期望的文本。

数据集构建

偏好数据集用于DPO 训练。对于系统指令和人类输入，偏好数据集给出了一个更优的回答和一个更差的回答。偏好数据集需要在 chosen 列中提供更优的回答，并在 rejected 列中提供更差的回答。

可参考LLaMA-Factory-main/data/dpo_zh_demo.json构建数据集，每条数据用{ }包住，使用数据用[ ]包住，每条数据由三部分组成：

conversations：用户指令
chosen：模型需学习的优质回答
rejected：模型需拒绝的劣质回答

我们自己在做数据集的时候，可以直接套下面的模板，只需改value里面的值即可！

数据集示例如下：

[
  {
    "conversations": [
      {
        "from": "human",
        "value": "国会的转发\n美国国会由众议院和参议院组成，每两年换届一次（参议员任期为6年，但参议院选举是错位的，使得国会的组成仍然每两年变化一次）。这两年期间按顺序标记，第115届国会发生在2017-2018年。\n\n密歇根大学信息学院的研究人员在这段时间内收集了现任国会议员（我们将“国会议员”缩写为MoC）的Twitter帖子，并对它们进行编码，标记为原创声明或其他用户提交的转发。我们将重点关注转发数据。这些发布的数据不仅包括转发的文本，还包括国会议员的信息和原始推文的帐户。\n#python:\n\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sb\nimport statsmodels.api as sm\nimport os\nfrom pandas.plotting import register\\_matplotlib\\_converters\nregister\\_matplotlib\\_converters()\n\npd.options.mode.chained\\_assignment = None\n\n# 在接下来的内容中，我们将加载数据，但出于速度原因，我们将排除推文本身的文本。\n\ntweets = pd.read\\_csv(os.path.join(base, \"retweets\\_congress\\_115th.csv.gz\"), \n usecols = lambda x: x != \"full\\_text\",\n dtype = {\"tweet\\_id\\_str\": \"str\"},\n index\\_col = \"tweet\\_id\\_str\")\ntweets.shape\nout: (209856, 21)\n\ntweets.columns\nout:Index(['user\\_id\\_str', 'user\\_screen\\_name', 'user\\_followers\\_count', 'created\\_at',\n 'retweet\\_count', 'retweeted\\_status', 'year', 'bioguide', 'birthday',\n 'gender', 'name', 'state', 'chamber', 'party', 'region', 'rt\\_user\\_id',\n 'rt\\_screen\\_name', 'rt\\_followers\\_count', 'rt\\_bio', 'age', 'age\\_bucket'],\n dtype='object')\n\n第（a）部分\n国会主要由两个政党（民主党和共和党）组成，还有一些独立派。哪个政党负责大部分的转发？\n\n计算推文的政党的边际概率，并将结果保存到party\\_marg。显示这个结果。\n\n第（b）部分\n我们可能会问的一个问题是，“国会成员彼此转发的频率是多少？”\n\n创建一个名为“is\\_congress\\_retweet”的新列，指示转发是否来自MoC。首先使用.unique()方法获取唯一的“user\\_id\\_str”值集合。然后使用这组id，在“rt\\_user\\_id”列上使用.isin(values)方法创建新的布尔列“is\\_congress\\_retweet”。\n\n找到转发MoC的边际概率，并将结果保存到“congress\\_retweet\\_marg”。显示这些结果。是更常见地转发国会议员还是来自国会外部？"
      }
    ],
    "chosen": {
      "from": "gpt",
      "value": "计算推文的政党边际概率，我们可以使用以下代码：\n```python\ntweets['party'] = tweets['party'].replace(0, 'Democrat')\ntweets['party'] = tweets['party'].replace(1, 'Republican')\ntweets['party'] = tweets['party'].replace(2, 'Independent')\n\nparty_counts = tweets.groupby('party').size().sort_values(ascending=False)\n\nparty_marg = party_counts / party_counts.sum()\n\nprint(\"政党的边际概率：\")\nprint(party_marg)\n```\n这段代码将'party'列中的数字值替换为相应的政党名称，然后按政党对推文进行分组并计算每个组的大小。最后，通过将每个组的大小除以总推文数来计算每个政党的边际概率。\n\n输出应该类似于：\n```markdown\n政党的边际概率：\n0.522152    民主党\n0.445541    共和党\n0.032307    独立派\n```\n这表明大多数转发来自民主党，其次是共和党，然后是独立派。\n\n要创建一个名为\"is\\_congress\\_retweet\"的新列，指示转发是否来自国会议员，我们可以使用以下代码：\n```python\ncongress_ids = tweets['user_id_str'].unique()\nis_congress_retweet = tweets['rt_user_id'].isin(congress_ids)\ntweets['is_congress_retweet'] = is_congress_retweet\n\ncongress_retweet_marg = tweets['is_congress_retweet'].mean()\nprint(\"转发国会议员的边际概率：\")\nprint(congress_retweet_marg)\n```\n这段代码首先获取唯一的\"user\\_id\\_str\"值集合，对应于国会议员的集合。然后，在\"rt\\_user\\_id\"列上使用.isin(values)方法创建一个新的布尔列\"is\\_congress\\_retweet\"，指示转发是否来自国会议员。最后，通过取\"is\\_congress\\_retweet\"列的平均值来计算转发国会议员的边际概率。\n\n输出应该类似于：\n```markdown\n转发国会议员的边际概率：\n0.775151\n```\n这表明大多数转发不是来自国会议员，而是来自国会之外。"
    },
    "rejected": {
      "from": "gpt",
      "value": "回答问题的第（a）部分，您可以通过首先使用groupby（）函数按政党对数据进行分组，然后计算每个组的转发数来计算政党的边际概率。具体来说，您可以使用以下代码：\n```scss\nparty_count = tweets.groupby('party').count()\nparty_count.loc['Democrats','retweet_count'] / party_count.loc['Republicans','retweet_count']\n```\n回答问题的第（b）部分，您可以创建一个新列来指示转发是否来自国会议员，方法是使用“bioguide”列上的.unique（）方法获取唯一的国会议员用户ID集合，然后使用该集合创建一个布尔列，指示转发是否来自国会议员，使用“rt_user_id”列上的.isin（）方法。最后，使用.sum（）方法来计算转发国会议员的边际概率，然后将其与转发来自国会以外的人的概率进行比较。以下是相关代码：\n```makefile\nimport pandas as pd\ntweets = pd.read_csv('retweets_congress_115th.csv')\n\n# 获取唯一的国会议员ID集合\nmovers_ids = tweets['bioguide'].unique()\n\n# 创建一个新的布尔列，指示转发是否来自国会议员\ntweets['is_congress_retweet'] = tweets['rt_user_id'].isin(movers_ids)\n\n# 计算转发国会议员或来自国会以外的人的边际概率\nmovers_retweets = tweets[tweets['is_congress_retweet'] == True].shape[0]\nnon_movers_retweets = tweets[tweets['is_congress_retweet'] == False].shape[0]\nproportion_movers_retweets = movers_retweets / (movers_retweets + non_movers_retweets)\nprint('Proportion of retweets from MoCs:', proportion_movers_retweets)\n```"
    }
  }

    ......

  {
    "conversations": [
      {
        "from": "human",
        "value": "请提供一下对香港银行业的分析以及目前面临的挑战。"
      }
    ],
    "chosen": {
      "from": "gpt",
      "value": "香港银行业长期以来一直是该城市金融服务业和整体经济的重要组成部分。其特点是健全的监管框架、大量银行和资本流动的便利。以下分析涵盖了该行业面临的一些关键因素和当前挑战：\n\n1. 市场结构：香港银行业包括三类机构：持牌银行、受限牌照银行和接受存款公司。截至2021年4月，香港有160家持牌银行、19家受限牌照银行和13家接受存款公司。该行业由几家主要银行主导，汇丰银行、恒生银行、中国银行（香港）和渣打银行占据最大的市场份额。\n\n2. 监管框架：香港金融管理局负责监督银行体系的正常运作、稳定性和完整性。它通过实施法规和指导方针、监督持牌机构，并确保符合巴塞尔III等国际标准。\n\n3. 资本和流动性：香港银行保持强大的资本和流动性状况，资本比率远高于巴塞尔III标准设定的最低要求。这有助于该行业对经济冲击的稳定性和抵御能力。\n\n尽管具有这些优势，香港银行业目前面临着几个挑战：\n\n1. 经济放缓和中美贸易紧张局势：香港经济高度依赖贸易，受到中国内地经济放缓和中美贸易冲突的负面影响。这可能影响银行的贷款组合、信用质量和整体盈利能力。\n\n2. 低利率环境：由于全球经济衰退，世界各国央行一直采取宽松的货币政策，降低利率。香港银行受到影响，因为该市与美元挂钩的汇率制度导致净利差收窄和盈利能力降低。\n\n3. 地缘政治风险：香港实施国家安全法和美中紧张局势加剧引发了对该市银行业潜在影响的担忧，如制裁或国际银行撤离。\n\n4. 技术颠覆：金融科技和数字银行的兴起为香港银行业带来机遇和挑战。传统银行需要增强数字能力，投资创新技术以与新兴金融科技公司和虚拟银行竞争。\n\n5. 合规和监管变化：随着国际法规的演变，香港银行必须不断适应并遵守新规定，这可能耗时且成本高昂。此外，他们需要管理与反洗钱（AML）和打击恐怖融资（CFT）法规相关的风险。\n\n总之，香港银行业仍具有强大的基本面，但需要解决上述挑战，以在不断变化的全球格局中保持竞争力和稳定性。"
    },
    "rejected": {
      "from": "gpt",
      "value": "香港的银行业面临着诸多挑战，如低利率、高房价和经济疲软。新冠疫情也给该行业带来了压力，企业和消费者都在艰难度日。一些分析师建议政府可能需要采取措施，如降低贷款利率或向陷入困境的企业提供财政援助。\n\n尽管面临这些挑战，香港的银行业仍受到良好监管，消费者和企业对其信任度高。该行业还以其强调创新和采用新技术，如移动银行和数字支付而闻名。总体而言，香港银行业的前景仍然积极，但需要谨慎管理和关注持续发展。"
    }
  }
]

对于上述格式的数据，dataset_info.json 中的 数据集描述 应为：

"dpo_zh_demo": {
    "file_name": "dpo_zh_demo.json",
    "ranking": true,
    "formatting": "sharegpt",
    "columns": {
      "messages": "conversations",     # 重要   
      "chosen": "chosen",             # 重要   
      "rejected": "rejected"         # 重要   
    }
},

微调训练

在使用 DPO 时，请将 stage 设置为 dpo，其他设置和lora一致，确保使用的数据集符合偏好数据集格式并且设置偏好优化相关参数。以下是一个示例：

### model
model_name_or_path: /media/good/4TB/mn/model/llm/qwen/Qwen3-4B
trust_remote_code: true

### method
stage: dpo
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
pref_beta: 0.1
pref_loss: sigmoid  # choices: [sigmoid (dpo), orpo, simpo]

### dataset
dataset: dpo_en_demo
template: qwen3
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: aves/qwen3-4b/lora/dpo
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-6
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null

### eval
# eval_dataset: dpo_en_demo
# val_size: 0.1
# per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 500

编辑完配置信息后，将其放在LLaMA-Factory-main/examples/train_lora目录下，然后我们在命令行使用如下命令，即可开始dpo微调（注意要在llama_factory目录下使用，后面的yaml是配置信息的路径）

llamafactory-cli train examples/train_lora/qwen3_lora_dpo.yaml

单机多卡训练需加：

FORCE_TORCHRUN=1 llamafactory-cli train examples/train_lora/qwen3_lora_sft.yaml

DPO相关参数如下：

参数名称	类型	介绍	默认值
pref_beta	float	偏好损失中的 beta 参数。	0.1
pref_ftx	float	DPO 训练中的 sft loss 系数。	0.0
pref_loss	Literal[“sigmoid”, “hinge”, “ipo”, “kto_pair”, “orpo”, “simpo”]	DPO 训练中使用的偏好损失类型。可选值为： sigmoid, hinge, ipo, kto_pair, orpo, simpo。	sigmoid
dpo_label_smoothing	float	标签平滑系数，取值范围为 [0,0.5]。	0.0

其他信息和lora微调一致，读者可自行参考lora微调部分。

多模态DPO(qwen2.5-vl)

数据集构建

方法一

你可以直接使用rlhf-v进行训练，代码会自动下载数据集，详见examples/train_lora/qwen2_5vl_lora_dpo.yaml

dataset: rlhf_v

方法二

数据集网址：https://huggingface.co/datasets/llamafactory/RLHF-V

也可以手动将数据集下载下来后，在dataset_info.json里指明路径：

"rlhf_v-local": {
    "file_name": "rlhf-v/rlhf-v.parquet",
    "ranking": true,
    "formatting": "sharegpt",
    "columns": {
      "messages": "conversations",
      "chosen": "chosen",
      "rejected": "rejected",
      "images": "images"
    }
  },

然后将dataset定义为rlhf_v-local

方法三

也可以使用json文件构建数据集：

[
  {
    "conversations": [
      {
        "from": "human",
        "value": "<image>What are the key features you observe in the image?"
      }
    ],
    "chosen": {
      "from": "gpt",
      "value": "A young man standing on stage wearing a white shirt and black pants."
    },
    "rejected": {
      "from": "gpt",
      "value": "A young man standing on stage wearing white pants and shoes."
    },
    "images": [
      "mllm_demo_data/1.jpg"
    ]
  }
]

在dataset_info.json里指明路径：

"rlhf_v-json": {
    "file_name": "rlhf-v-json.json",
    "ranking": true,
    "formatting": "sharegpt",
    "columns": {
      "messages": "conversations",
      "chosen": "chosen",
      "rejected": "rejected",
      "images": "images"
    }
  },

微调训练

### model
model_name_or_path: /media/good/4TB/mn/model/llm/qwen/Qwen2.5-VL-7B-Instruct
image_max_pixels: 262144
video_max_pixels: 16384
trust_remote_code: true

### method
stage: dpo
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
pref_beta: 0.1
pref_loss: sigmoid  # choices: [sigmoid (dpo), orpo, simpo]

### dataset
dataset: rlhf_v-json
template: qwen2_vl
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: saves/qwen2_5vl-7b/lora/dpo
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-6
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null

### eval
# val_size: 0.1
# per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 500

出现下面的界面，说明训练完成了

[INFO|tokenization_utils_base.py:2356] 2025-06-03 16:09:19,382 >> chat template saved in saves/qwen2_5vl-7b/lora/dpo/chat_template.jinja
[INFO|tokenization_utils_base.py:2525] 2025-06-03 16:09:19,383 >> tokenizer config file saved in saves/qwen2_5vl-7b/lora/dpo/tokenizer_config.json
[INFO|tokenization_utils_base.py:2534] 2025-06-03 16:09:19,383 >> Special tokens file saved in saves/qwen2_5vl-7b/lora/dpo/special_tokens_map.json
***** train metrics *****
  epoch                    =        3.0
  total_flos               =    31250GF
  train_loss               =     0.0874
  train_runtime            = 0:00:03.96
  train_samples_per_second =      0.757
  train_steps_per_second   =      0.757

PPO

PPO（Proximal Policy Optimization，近端策略优化）是一种在强化学习领域广泛应用的算法，它通过优化策略网络来最大化累积奖励，同时保持策略更新的稳定性。PPO的核心思想是通过限制策略更新的幅度，避免因更新过大而导致的训练不稳定问题，从而实现更高效、更稳定的策略学习。

奖励模型数据集构建

数据集格式和dpo格式一样，可直接使用dpo的数据。

可参考LLaMA-Factory-main/data/dpo_zh_demo.json构建数据集，数据集示例如下：

[
  {
    "conversations": [
      {
        "from": "human",
        "value": "国会的转发\n美国国会由众议院和参议院组成，每两年换届一次（参议员任期为6年，但参议院选举是错位的，使得国会的组成仍然每两年变化一次）。这两年期间按顺序标记，第115届国会发生在2017-2018年。\n\n密歇根大学信息学院的研究人员在这段时间内收集了现任国会议员（我们将“国会议员”缩写为MoC）的Twitter帖子，并对它们进行编码，标记为原创声明或其他用户提交的转发。我们将重点关注转发数据。这些发布的数据不仅包括转发的文本，还包括国会议员的信息和原始推文的帐户。\n#python:\n\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sb\nimport statsmodels.api as sm\nimport os\nfrom pandas.plotting import register\\_matplotlib\\_converters\nregister\\_matplotlib\\_converters()\n\npd.options.mode.chained\\_assignment = None\n\n# 在接下来的内容中，我们将加载数据，但出于速度原因，我们将排除推文本身的文本。\n\ntweets = pd.read\\_csv(os.path.join(base, \"retweets\\_congress\\_115th.csv.gz\"), \n usecols = lambda x: x != \"full\\_text\",\n dtype = {\"tweet\\_id\\_str\": \"str\"},\n index\\_col = \"tweet\\_id\\_str\")\ntweets.shape\nout: (209856, 21)\n\ntweets.columns\nout:Index(['user\\_id\\_str', 'user\\_screen\\_name', 'user\\_followers\\_count', 'created\\_at',\n 'retweet\\_count', 'retweeted\\_status', 'year', 'bioguide', 'birthday',\n 'gender', 'name', 'state', 'chamber', 'party', 'region', 'rt\\_user\\_id',\n 'rt\\_screen\\_name', 'rt\\_followers\\_count', 'rt\\_bio', 'age', 'age\\_bucket'],\n dtype='object')\n\n第（a）部分\n国会主要由两个政党（民主党和共和党）组成，还有一些独立派。哪个政党负责大部分的转发？\n\n计算推文的政党的边际概率，并将结果保存到party\\_marg。显示这个结果。\n\n第（b）部分\n我们可能会问的一个问题是，“国会成员彼此转发的频率是多少？”\n\n创建一个名为“is\\_congress\\_retweet”的新列，指示转发是否来自MoC。首先使用.unique()方法获取唯一的“user\\_id\\_str”值集合。然后使用这组id，在“rt\\_user\\_id”列上使用.isin(values)方法创建新的布尔列“is\\_congress\\_retweet”。\n\n找到转发MoC的边际概率，并将结果保存到“congress\\_retweet\\_marg”。显示这些结果。是更常见地转发国会议员还是来自国会外部？"
      }
    ],
    "chosen": {
      "from": "gpt",
      "value": "计算推文的政党边际概率，我们可以使用以下代码：\n```python\ntweets['party'] = tweets['party'].replace(0, 'Democrat')\ntweets['party'] = tweets['party'].replace(1, 'Republican')\ntweets['party'] = tweets['party'].replace(2, 'Independent')\n\nparty_counts = tweets.groupby('party').size().sort_values(ascending=False)\n\nparty_marg = party_counts / party_counts.sum()\n\nprint(\"政党的边际概率：\")\nprint(party_marg)\n```\n这段代码将'party'列中的数字值替换为相应的政党名称，然后按政党对推文进行分组并计算每个组的大小。最后，通过将每个组的大小除以总推文数来计算每个政党的边际概率。\n\n输出应该类似于：\n```markdown\n政党的边际概率：\n0.522152    民主党\n0.445541    共和党\n0.032307    独立派\n```\n这表明大多数转发来自民主党，其次是共和党，然后是独立派。\n\n要创建一个名为\"is\\_congress\\_retweet\"的新列，指示转发是否来自国会议员，我们可以使用以下代码：\n```python\ncongress_ids = tweets['user_id_str'].unique()\nis_congress_retweet = tweets['rt_user_id'].isin(congress_ids)\ntweets['is_congress_retweet'] = is_congress_retweet\n\ncongress_retweet_marg = tweets['is_congress_retweet'].mean()\nprint(\"转发国会议员的边际概率：\")\nprint(congress_retweet_marg)\n```\n这段代码首先获取唯一的\"user\\_id\\_str\"值集合，对应于国会议员的集合。然后，在\"rt\\_user\\_id\"列上使用.isin(values)方法创建一个新的布尔列\"is\\_congress\\_retweet\"，指示转发是否来自国会议员。最后，通过取\"is\\_congress\\_retweet\"列的平均值来计算转发国会议员的边际概率。\n\n输出应该类似于：\n```markdown\n转发国会议员的边际概率：\n0.775151\n```\n这表明大多数转发不是来自国会议员，而是来自国会之外。"
    },
    "rejected": {
      "from": "gpt",
      "value": "回答问题的第（a）部分，您可以通过首先使用groupby（）函数按政党对数据进行分组，然后计算每个组的转发数来计算政党的边际概率。具体来说，您可以使用以下代码：\n```scss\nparty_count = tweets.groupby('party').count()\nparty_count.loc['Democrats','retweet_count'] / party_count.loc['Republicans','retweet_count']\n```\n回答问题的第（b）部分，您可以创建一个新列来指示转发是否来自国会议员，方法是使用“bioguide”列上的.unique（）方法获取唯一的国会议员用户ID集合，然后使用该集合创建一个布尔列，指示转发是否来自国会议员，使用“rt_user_id”列上的.isin（）方法。最后，使用.sum（）方法来计算转发国会议员的边际概率，然后将其与转发来自国会以外的人的概率进行比较。以下是相关代码：\n```makefile\nimport pandas as pd\ntweets = pd.read_csv('retweets_congress_115th.csv')\n\n# 获取唯一的国会议员ID集合\nmovers_ids = tweets['bioguide'].unique()\n\n# 创建一个新的布尔列，指示转发是否来自国会议员\ntweets['is_congress_retweet'] = tweets['rt_user_id'].isin(movers_ids)\n\n# 计算转发国会议员或来自国会以外的人的边际概率\nmovers_retweets = tweets[tweets['is_congress_retweet'] == True].shape[0]\nnon_movers_retweets = tweets[tweets['is_congress_retweet'] == False].shape[0]\nproportion_movers_retweets = movers_retweets / (movers_retweets + non_movers_retweets)\nprint('Proportion of retweets from MoCs:', proportion_movers_retweets)\n```"
    }
  }

    ......

  {
    "conversations": [
      {
        "from": "human",
        "value": "请提供一下对香港银行业的分析以及目前面临的挑战。"
      }
    ],
    "chosen": {
      "from": "gpt",
      "value": "香港银行业长期以来一直是该城市金融服务业和整体经济的重要组成部分。其特点是健全的监管框架、大量银行和资本流动的便利。以下分析涵盖了该行业面临的一些关键因素和当前挑战：\n\n1. 市场结构：香港银行业包括三类机构：持牌银行、受限牌照银行和接受存款公司。截至2021年4月，香港有160家持牌银行、19家受限牌照银行和13家接受存款公司。该行业由几家主要银行主导，汇丰银行、恒生银行、中国银行（香港）和渣打银行占据最大的市场份额。\n\n2. 监管框架：香港金融管理局负责监督银行体系的正常运作、稳定性和完整性。它通过实施法规和指导方针、监督持牌机构，并确保符合巴塞尔III等国际标准。\n\n3. 资本和流动性：香港银行保持强大的资本和流动性状况，资本比率远高于巴塞尔III标准设定的最低要求。这有助于该行业对经济冲击的稳定性和抵御能力。\n\n尽管具有这些优势，香港银行业目前面临着几个挑战：\n\n1. 经济放缓和中美贸易紧张局势：香港经济高度依赖贸易，受到中国内地经济放缓和中美贸易冲突的负面影响。这可能影响银行的贷款组合、信用质量和整体盈利能力。\n\n2. 低利率环境：由于全球经济衰退，世界各国央行一直采取宽松的货币政策，降低利率。香港银行受到影响，因为该市与美元挂钩的汇率制度导致净利差收窄和盈利能力降低。\n\n3. 地缘政治风险：香港实施国家安全法和美中紧张局势加剧引发了对该市银行业潜在影响的担忧，如制裁或国际银行撤离。\n\n4. 技术颠覆：金融科技和数字银行的兴起为香港银行业带来机遇和挑战。传统银行需要增强数字能力，投资创新技术以与新兴金融科技公司和虚拟银行竞争。\n\n5. 合规和监管变化：随着国际法规的演变，香港银行必须不断适应并遵守新规定，这可能耗时且成本高昂。此外，他们需要管理与反洗钱（AML）和打击恐怖融资（CFT）法规相关的风险。\n\n总之，香港银行业仍具有强大的基本面，但需要解决上述挑战，以在不断变化的全球格局中保持竞争力和稳定性。"
    },
    "rejected": {
      "from": "gpt",
      "value": "香港的银行业面临着诸多挑战，如低利率、高房价和经济疲软。新冠疫情也给该行业带来了压力，企业和消费者都在艰难度日。一些分析师建议政府可能需要采取措施，如降低贷款利率或向陷入困境的企业提供财政援助。\n\n尽管面临这些挑战，香港的银行业仍受到良好监管，消费者和企业对其信任度高。该行业还以其强调创新和采用新技术，如移动银行和数字支付而闻名。总体而言，香港银行业的前景仍然积极，但需要谨慎管理和关注持续发展。"
    }
  }
]

奖励模型训练

奖励模型在偏好数据集上训练，从而可以更符合人类偏好地评价语言模型的输出。

在训练奖励模型时，请将 stage 设置为 rm，确保使用的数据集符合偏好数据集格式并且指定奖励模型的保存路径。以下提供一个示例：

### model
model_name_or_path: /media/good/4TB/mn/model/llm/qwen/Qwen3-4B
trust_remote_code: true

### method
stage: rm
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all

### dataset
dataset: dpo_en_demo
template: qwen3
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: saves/qwen3-4b/lora/reward
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null

### eval
# eval_dataset: dpo_en_demo
# val_size: 0.1
# per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 500

编辑完配置信息后，将其放在LLaMA-Factory-main/examples/train_lora目录下，然后我们在命令行使用如下命令，即可开始奖励模型的训练（注意要在llama_factory目录下使用，后面的yaml是配置信息的路径）

llamafactory-cli train examples/train_lora/qwen3_lora_reward.yaml

PPO数据集构建

PPO的训练数据可直接使用lora的数据。

[
  {
    "instruction": "识别并解释给定列表中的两个科学理论：细胞理论和日心说。",
    "input": "",
    "output": "细胞理论是生物科学的一个理论，它认为所有生命体都是由微小的基本单元——细胞所构成。这是生物学的一个基础理论，认为细胞是所有生物的基本结构和功能单位，所有的生物都是由一个或多个细胞组成，细胞只能通过细胞分裂产生新的细胞。这一理论由薛定谔、施瓦内和雪莱于1839年首次提出。\n\n日心说是指太阳是太阳系的中心，也就是说，行星围绕太阳旋转的理论。这个理论打破了传统的地心说观点，认为地球并不是宇宙的中心。日心说的提出者是尼古拉·哥白尼，他在16世纪初发表了他的著作《天体运行论》，阐述了太阳系行星围绕太阳运行的模型，为天文学的发展做出了巨大贡献。"
  },

  ......,

  {
    "instruction": "比较以下两位著名人物，使用特征列表并以句子形式提供输出。",
    "input": "人物1：阿尔伯特·爱因斯坦\n人物2：斯蒂芬·霍金\n特征：智力、教育、对世界的影响",
    "output": "阿尔伯特·爱因斯坦和斯蒂芬·霍金都是拥有极其出色智力的人物。两人都取得过非常高的教育成就，他们推进了科学发展并在世界范围内产生了深远的影响。爱因斯坦以其相对论和质能关系公式而闻名，而霍金以其关于黑洞和宇宙的发现而著称。两位科学家都以其深厚的学识和非凡的贡献影响了世界。"
  },
]

PPO 训练

在训练奖励完模型之后，我们可以开始进行模型的强化学习部分。

语言模型接受prompt作为输入，其输出作为奖励模型的输入。奖励模型评价语言模型的输出，并将评价返回给语言模型。确保两个模型都能良好运行是一个具有挑战性的任务。一种实现方式是使用近端策略优化（PPO，Proximal Policy Optimization）。其主要思想是：我们既希望语言模型的输出能够尽可能地获得奖励模型的高评价，又不希望语言模型的变化过于“激进”。通过这种方法，我们可以使得模型在学习趋近人类偏好的同时不过多地丢失其原有的解决问题的能力。

在使用 PPO 进行强化学习时，请将 stage 设置为 ppo，并且指定所使用奖励模型的路径。下面是一个示例：

### model
model_name_or_path: /media/good/4TB/mn/model/llm/qwen/Qwen3-4B
reward_model: saves/qwen3-4b/lora/reward
trust_remote_code: true

### method
stage: ppo
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all

### dataset
dataset: identity,alpaca_en_demo
template: qwen3
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: saves/qwen3-4b/lora/ppo
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### generate
max_new_tokens: 512
top_k: 0
top_p: 0.9

llamafactory-cli train examples/train_lora/qwen3_lora_ppo.yaml

PPO的更多参数如下：

参数名称	类型	介绍	默认值
pref_beta	float	偏好损失中的 beta 参数。	0.1
ppo_buffer_size	int	PPO 训练中的 mini-batch 大小。	1
ppo_epochs	int	PPO 训练迭代次数。	4
ppo_score_norm	bool	是否在 PPO 训练中使用归一化分数。	False
ppo_target	float	PPO 训练中自适应 KL 控制的目标 KL 值。	6.0
ppo_whiten_rewards	bool	PPO 训练中是否对奖励进行归一化。	False
ref_model	Optional[str]	PPO 或 DPO 训练中使用的参考模型路径。	None
ref_model_adapters	Optional[str]	参考模型的适配器路径。	None
ref_model_quantization_bit	Optional[int]	参考模型的量化位数，支持 4 位或 8 位量化。	None
reward_model	Optional[str]	PPO 训练中使用的奖励模型路径。	None
reward_model_adapters	Optional[str]	奖励模型的适配器路径。	None
reward_model_quantization_bit	Optional[int]	奖励模型的量化位数。	None
reward_model_type	Literal[“lora”, “full”, “api”]	PPO 训练中使用的奖励模型类型。可选值为： lora, full, api。	lora

KTO

KTO（Kahneman-Tversky Optimization）是一种基于前景理论（Prospect Theory）的优化方法，用于对大型语言模型（LLM）进行对齐。这种方法的核心思想是通过优化模型的输出，使其更符合人类的偏好和决策过程。KTO算法在训练过程中，特别关注负样本的敏感性，这意味着它能够更有效地处理那些不符合人类偏好的输出。

KTO的出现是为了解决成对的偏好数据难以获得的问题。 KTO使用了一种新的损失函数使其只需二元的标记数据，即只需标注回答的好坏即可训练，并取得与 DPO 相似甚至更好的效果。

数据集构建

KTO数据集与偏好数据集类似，但除了 instruction 以及 input 组成的人类最终输入和模型回答 output ，KTO 数据集还需要额外添加一个 kto_tag 列（label=true/false）来表示人类的反馈，true表示接受，false表示不接受。示例如下：

[
  {
    "messages": [
      {
        "content": "The Federal Trade Commission is going after spyware, bits of computer code that install themselves on the computers of Internet users to track or disrupt their online activities.\nQ: Which is the best summary of this article?\nPick your answer from:\n(A). World\n(B). Sports\n(C). Business\n(D). Science/Tech\nI think the answer is",
        "role": "user"
      },
      {
        "content": "Ooh ooh ah ah! *giggle* Let me think... Ooh ooh ah ah! *giggle* The best summary of this article is... *drumroll*... D! Science/Tech! Yay! *confetti* The Federal Trade Commission is talking about spyware, which is like a little computer bug that can watch what you do on your computer without you knowing. It's like a secret agent, but not a nice one! *gasp* The FTC wants to stop the spyware from doing its thing, so they're going after it! *cheer* Woohoo!",
        "role": "assistant"
      }
    ],
    "label": true
  },
  {
    "messages": [
      {
        "content": "Compare and contrast the roles of the hippocampus and the prefrontal cortex in the human brain, discussing their involvement in memory processing and decision making.",
        "role": "user"
      },
      {
        "content": "The human brain is a highly complex organ, responsible for a myriad of cognitive functions, including memory processing and decision making. Two critical regions implicated in these processes are the hippocampus and the prefrontal cortex. While both structures play significant roles in these cognitive domains, they each contribute distinct functions that work in concert to facilitate our ability to navigate the world and make informed choices.\n\nThe hippocampus, located within the medial temporal lobe, is predominantly associated with the formation, consolidation, and retrieval of memories. Specifically, it plays a crucial role in episodic memory, which encompasses the storage and recollection of autobiographical events, as well as spatial memory, which pertains to the encoding and retrieval of spatial information. The hippocampus is vital for the formation of new memories and the integration of these memories into existing knowledge structures. It has been demonstrated that damage to the hippocampus can result in anterograde amnesia, a condition characterized by an inability to form new memories, while leaving previously consolidated memories intact.\n\nIn contrast, the prefrontal cortex, situated in the anterior portion of the frontal lobes, is primarily involved in executive functions, which include higher-order cognitive processes such as decision making, planning, reasoning, and inhibitory control. The prefrontal cortex is responsible for integrating information from various sources, evaluating potential outcomes, and selecting the most advantageous course of action. It is also implicated in working memory, which refers to the temporary storage and manipulation of information necessary for complex cognitive tasks. Damage to the prefrontal cortex can result in deficits in decision making, impulsivity, and poor judgment.\n\nWhile the hippocampus and prefrontal cortex have distinct roles, they are interconnected and work synergistically to facilitate memory processing and decision making. The hippocampus provides the prefrontal cortex with contextual information and episodic memories that inform decision-making processes. Conversely, the prefrontal cortex can modulate the hippocampus's activity during memory encoding and retrieval, influencing the salience and organization of memories. This dynamic interplay between the two regions enables individuals to adapt their decision-making strategies based on past experiences and contextual information.\n\nIn summary, the hippocampus and prefrontal cortex are integral to memory processing and decision making, albeit with distinct contributions. The hippocampus is primarily implicated in the formation, consolidation, and retrieval of episodic and spatial memories, while the prefrontal cortex is responsible for executive functions, including decision making, planning, and working memory. The interplay between these two regions allows for the integration of past experiences and contextual information in guiding informed decision making and shaping our behavior.",
        "role": "assistant"
      },
      {
        "content": "Discuss the mechanisms through which the prefrontal cortex modulates the hippocampus's activity during memory encoding and retrieval.",
        "role": "user"
      },
      {
        "content": "The prefrontal cortex (PFC) and the hippocampus have a dynamic relationship that allows for the modulation of memory encoding and retrieval. This interaction is facilitated by direct and indirect neural pathways that connect the two regions, allowing for bidirectional communication.\n\nDuring memory encoding, the PFC is thought to modulate hippocampal activity through top-down control. The PFC can selectively enhance or suppress the processing of sensory information in the hippocampus based on its relevance to the task at hand. This process, known as 'attentional modulation', allows the PFC to guide the hippocampus in encoding memories that are relevant and suppressing those that are not. This is crucial for the formation of goal-directed behaviors and for the efficient use of cognitive resources.\n\nIn addition to attentional modulation, the PFC also influences the hippocampus through 'working memory' operations. Working memory, a temporary storage system for information, is largely mediated by the PFC. During memory encoding, the PFC can use this working memory to provide the hippocampus with a 'preview' of the information that is to be encoded, allowing the hippocampus to prepare for the incoming information and encode it more efficiently.\n\nDuring memory retrieval, the PFC again plays a modulatory role. It is thought that the PFC can initiate the retrieval process by sending a 'retrieval cue' to the hippocampus. This cue triggers the reactivation of the neural patterns associated with the memory, allowing for its retrieval. The PFC can also influence the focus of retrieval, determining whether the retrieval is broad (i.e., recalling the general gist of an event) or specific (i.e., recalling specific details).\n\nFurthermore, the PFC can modulate the emotional intensity of retrieved memories through its connections with the amygdala, a region involved in emotional processing. This can influence the subjective experience of the memory, affecting how it is perceived and responded to.\n\nIn summary, the PFC modulates hippocampal activity during memory encoding and retrieval through a variety of mechanisms, including attentional modulation, working memory operations, retrieval initiation, and emotional modulation. These processes allow the PFC to guide the hippocampus in encoding and retrieving memories in a way that is adaptive and efficient.",
        "role": "assistant"
      },
      {
        "content": "Can you elaborate on the role of the amygdala in modulating the emotional intensity of retrieved memories, and how this interaction with the prefrontal cortex influences our perception and response to these memories?",
        "role": "user"
      },
      {
        "content": "The amygdala plays a crucial role in the emotional processing of stored memories. It is a small almond-shaped structure situated deep within the medial temporal lobes that consists of multiple nuclei involved in different aspects of emotional processing, including the establishment of emotional associations and hedonic reactions to stimuli. The amygdala interacts extensively with both the hippocampus and the prefrontal cortex (PFC) to modulate emotional responses to retrieved memories.\n\nThe emotional component of a memory is largely encoded through the amygdala's interaction with the hippocampus during memory formation. When an emotionally significant event occurs, the hippocampus provides contextual and spatial aspects of the memory, while the amygdala assigns an emotional value or salience to the memory. The solidification of these emotional associations ensures that emotional information associated with a given memory is readily reactivated during memory retrieval.\n\nDuring memory retrieval, the PFC receives information about the emotional intensity of the memory through reciprocal connections with the amygdala. The PFC can then modulate the impact of emotional memories on behavior and cognitive processes by using top-down control. This regulation can influence how we experience and respond to retrieved memories.\n\nFor example, the PFC can enhance or diminish the emotional intensity associated with a retrieved memory. This modulation can result in increased or decreased arousal, as well as altered subjective feelings of pleasure, discomfort, or despair. Additionally, PFC involvement can affect the context in which a memory is reactivated, leading to changes in the emotional response. The PFC can prioritize and filter information based on the current context or goal, which may result in more suitable emotional responses or the suppression of inappropriate emotional reactions.\n\nMoreover, the amygdala, PFC, and hippocampus work together in the process of emotion regulation. The anterior cingulate cortex (ACC), a region that lies within the PFC, often interacts with the amygdala to create a \"circuit breaker\" for emotional responses. The ACC receives input from the amygdala indicating the emotional intensity and salience of a memory, and can then engage with the amygdala to reduce the emotional response. This mechanism allows an individual to react in a more rational and appropriate manner when faced with emotionally charged situations.\n\nIn summary, the amygdala's role in modulating the emotional intensity of retrieved memories is crucial for engaging with the world in an adaptive and meaningful way. Through its interactions with the hippocampus and, most significantly, the prefrontal cortex, the amygdala influences our perception and response to memories, contributing to our emotional experience and behavior.",
        "role": "assistant"
      }
    ],
    "label": true
  },

微调训练

在使用 KTO 时，请将 stage 设置为 kto ，设置偏好优化相关参数并使用 KTO 数据集。

### model
model_name_or_path: /media/good/4TB/mn/model/llm/qwen/Qwen3-4B
trust_remote_code: true

### method
stage: kto
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
pref_beta: 0.1

### dataset
dataset: kto_en_demo
template: qwen3
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: saves/qwen3-4b/lora/kto
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-6
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
# val_size: 0.1
# per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 500

llamafactory-cli train examples/train_lora/qwen3_lora_kto.yaml

KTO的更多参数配置如下：

参数名称	类型	介绍	默认值
pref_beta	float	偏好损失中的 beta 参数。	0.1
kto_chosen_weight	float	KTO 训练中 chosen 标签 loss 的权重。	1.0
kto_rejected_weight	float	KTO 训练中 rejected 标签 loss 的权重。	1.0

Pre-training

预训练是一种在机器学习和深度学习中广泛采用的策略，特别是在自然语言处理（NLP）领域。其核心思想是先在一个大规模的通用数据集上训练一个模型，让模型学习到语言的基本规律和特征，然后再将这个预训练好的模型应用到具体的下游任务上，通过微调（fine-tuning）等方式使其适应特定的任务需求。

预训练的过程通常涉及海量的数据，这些数据涵盖了各种语言现象和知识，使得模型能够学习到丰富的语言模式和语义信息。例如，在自然语言处理中，预训练模型会接触到大量的文本数据，包括新闻文章、书籍、网页内容等，通过这些数据学习单词之间的关系、句子的结构以及文本的语义含义。这种广泛的学习使得模型在后续面对具体的任务时，已经具备了对语言的深刻理解，能够更快速、更有效地适应新任务。

数据集构建

大语言模型通过学习未被标记的文本进行预训练，从而学习语言的表征。通常，预训练数据集从互联网上获得，因为互联网上提供了大量的不同领域的文本信息，有助于提升模型的泛化能力。预训练数据集文本描述格式如下：

{"text": "Don’t think you need all the bells and whistles? No problem. McKinley Heating Service Experts Heating & Air Conditioning offers basic air cleaners that work to improve the quality of the air in your home without breaking the bank. It is a low-cost solution that will ensure you and your family are living comfortably.\nIt’s a good idea to understand the efficiency rate of the filters, which measures what size of molecules can get through the filter. Basic air cleaners can filter some of the dust, dander and pollen that need to be removed. They are 85% efficient, and usually have a 6-inch cleaning surface.\nBasic air cleaners are not too expensive and do the job well. If you do want to hear more about upgrading from a basic air cleaner, let the NATE-certified experts at McKinley Heating Service Experts in Edmonton talk to you about their selection.\nEither way, now’s a perfect time to enhance and protect the indoor air quality in your home, for you and your loved ones.\nIf you want expert advice and quality service in Edmonton, give McKinley Heating Service Experts a call at 780-800-7092 to get your questions or concerns related to your HVAC system addressed."}
{"text": "To the apparent surprise of everyone, the Walt Disney Company has announced a deal to purchase Lucasfilm Ltd. According to the official press release, Disney has agreed to fork over $4.05 billion in cash and stock for George Lucas’ studio in a deal that brings together two of the world’s most important intellectual property libraries.\nAs you might expect, Disney is itching to take advantage of its new toys. “This transaction combines a world-class portfolio of content including Star Wars, one of the greatest family entertainment franchises of all time, with Disney’s unique and unparalleled creativity across multiple platforms, businesses, and markets to generate sustained growth and drive significant long-term value,” said Disney CEO Robert Iger in this afternoon’s announcement.\nUnder the terms of this agreement Disney will acquire control over all Lucasfilm iterations. This includes both its traditional film-making studio facilities, as well as the various technologies Lucasfilm has created over the years to further its various media properties. Thus, the gigantic Disney family now includes Lucasfilm itself, special effects house Industrial Light & Magic, Skywalker Sound and LucasArts, the company’s video game creation division.\nThis acquisition alone would be huge news, but as if to pre-empt fan speculation on the future of Star Wars the same announcement also mentions that a new Star Wars movie is scheduled to appear in 2015. Though the vast majority of recent Star Wars media has been focused on the property’s various animated iterations and LEGO crossovers, this new film will be the first official cinematic continuation of George Lucas’ original Star Wars trilogy. Though very few details are offered on this film, it has officially been dubbed Star Wars: Episode VII, and barring any major catastrophes it should hit theaters at some point in 2015 (if we had to guess, we’d assume an early summer release in keeping with the tradition established by its predecessors).\nPerhaps even more intriguing however, is the announcement’s claim that Episode VII’s release will herald a new era in which new Star Wars movies hit theaters “every two to three years.” It specifically mentions Episodes VIII and IX by name, though offers no solid details on either film.\nWhile the effects of the move won’t be fully known for at least a few months, we can think of a number of a things this new union might change. For instance, currently Dark Horse Comics publishes all Star Wars comic books, but with Disney owning Marvel Comics we can’t see that agreement lasting for long. Likewise, both Disney and Lucasfilm have sizable divisions dedicated to creating video games based on their various media properties. Normally these companies have had to seek outside publishing agreements, but now that they’ve joined forces and massively expanded the number of games either company is capable of releasing in any given year, it makes a lot of sense for Disney to invest in its own games publishing wing.\nFinally, this agreement almost certainly heralds future crossovers between Disney and Lucasfilm characters. We don’t know any specifics, but it’s only a matter of time before we see toys depicting Mickey Mouse dressed as Darth Vader. Whether that sounds awesome or stomach-churningly disgusting is entirely up to your rapidly waning sense of childhood whimsy.\nUpdate: Scratch that last prediction. Apparently Disney characters dressed as Star Wars characters is already a thing.\nOur partnership with LucasFilm has produced over 20 yrs worth of stories. We have Star Wars for the near future, and hope for years to come."}

可参考c3_demo.jsonl文件

训练

大语言模型通过在一个大型的通用数据集上通过无监督学习的方式进行预训练来学习语言的表征/初始化模型权重/学习概率分布。我们期望在预训练后模型能够处理大量、多种类的数据集，进而可以通过监督学习的方式来微调模型使其适应特定的任务。

预训练时，请将 stage 设置为 pt ，并确保使用的数据集符合预训练数据集格式。

### model
model_name_or_path: /media/good/4TB/mn/model/llm/qwen/Qwen3-4B
trust_remote_code: true

### method
stage: pt
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all

### dataset
dataset: c4_demo
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: saves/qwen3-4b/lora/pretrain
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null

### eval
# eval_dataset: c4_demo
# val_size: 0.1
# per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 500

llamafactory-cli train examples/train_lora/qwen3_lora_pretrain.yaml

全量微调

我们之前所有的设置均为lora微调，这使得对模型的参数量需求较小。

全参微调指的是在训练过程中对于预训练模型的所有权重都进行更新，但其对显存的要求是巨大的。如果我们需要进行全参微调，可以将 finetuning_type 设置为 full 。

可以参考examples/train_full/llama3_full_sft_ds3.yaml

其中的关键部分如下：

finetuning_type: full
# 如果需要使用deepspeed:
deepspeed: examples/deepspeed/ds_z3_config.json

Freeze

Freeze(冻结微调)指的是在训练过程中只对模型的小部分权重进行更新，这样可以降低对显存的要求。

如果需要进行冻结微调，可以将 finetuning_type 设置为 freeze 并且设置相关参数, 例如冻结的层数 freeze_trainable_layers 、可训练的模块名称 freeze_trainable_modules 等。

参数名称	类型	介绍
freeze_trainable_layers	int	可训练层的数量。正数表示最后 n 层被设置为可训练的，负数表示前 n 层被设置为可训练的。默认值为 `2`
freeze_trainable_modules	str	可训练层的名称。使用 `all` 来指定所有模块。默认值为 `all。其他可设置参数包括：self_attn, mlp, input_layernorm, post_attention_layernorm`
freeze_extra_modules[非必须]	str	除了隐藏层外可以被训练的模块名称，被指定的模块将会被设置为可训练的。使用逗号分隔多个模块。默认值为 `None`

示例：

### model
model_name_or_path: /media/good/4TB/mn/model/llm/qwen/Qwen3-4B

### method
stage: sft
do_train: true
# finetuning_type: lora
finetuning_type: freeze
freeze_trainable_layers: 1
freeze_trainable_modules: self_attn    # self_attn, mlp, input_layernorm, post_attention_layernorm
lora_target: all

### dataset
dataset: identity,alpaca_en_demo    # 数据集可以写多个，用逗号隔开
template: qwen
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/qwen3-4b/freeze/sftr
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

评估

通用能力评估

在完成模型训练后，可以利用llama-factory来评估模型效果。

配置示例文件 examples/train_lora/qwen3_lora_eval.yaml 具体如下：

### model
model_name_or_path: /media/good/4TB/mn/model/llm/qwen/Qwen3-4B
adapter_name_or_path: saves/qwen3-4b/lora/sft
trust_remote_code: true

### method
finetuning_type: lora

### dataset
task: mmlu_test  # choices: [mmlu_test, ceval_validation, cmmlu_test]
template: fewshot
lang: en
n_shot: 5

### output
save_dir: saves/qwen3-4b/lora/eval

### eval
batch_size: 4

其中的参数如下：


参数名称	类型	介绍
task	str	评估任务的名称，可选项有 mmlu_test, ceval_validation, cmmlu_test，其路径位于evaluation下
task_dir	str	包含评估数据集的文件夹路径，默认值为 `evaluation`。
lang	str	评估使用的语言，可选值为 `en`、 `zh`。默认值为 `en`。
n_shot	int	few-shot 的示例数量，默认值为 `5`。
save_dir	str	保存评估结果的路径，默认值为 `None`。如果该路径已经存在则会抛出错误。
download_mode	str	评估数据集的下载模式，默认值为 `DownloadMode.REUSE_DATASET_IF_EXISTS`。如果数据集已经存在则重复使用，否则则下载。

然后在命令行使用如下命令：

llamafactory-cli eval examples/train_lora/qwen3_lora_eval.yaml

出现下面的界面表示开始评估：

[INFO|tokenization_utils_base.py:2021] 2025-06-03 14:32:41,639 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2021] 2025-06-03 14:32:41,639 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2021] 2025-06-03 14:32:41,639 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2021] 2025-06-03 14:32:41,639 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2021] 2025-06-03 14:32:41,639 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2021] 2025-06-03 14:32:41,640 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2021] 2025-06-03 14:32:41,640 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2299] 2025-06-03 14:32:41,967 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:696] 2025-06-03 14:32:41,968 >> loading configuration file /media/good/4TB/mn/model/llm/qwen/Qwen3-4B/config.json
[INFO|configuration_utils.py:770] 2025-06-03 14:32:41,969 >> Model config Qwen3Config {

其中MMLU（Massive Multitask Language Understanding）是一个大规模多任务语言理解基准测试，旨在全面评估语言模型的知识广度和深度。它包含57个学科，涵盖从基础数学到专业法律、从历史到伦理学等多个领域。测试题目难度跨度极大，既有小学生级别的计算题，也有需要通过专业考试（如GRE、司法考试）的题目。其路径位于evaluation/mmlu。

出现下面的界面，表示运行完成：

Generating train split: 5 examples [00:00, 2289.72 examples/s]
Processing subjects: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 57/57 [32:17<00:00, 33.98s/it, world religions]
        Average: 68.39                                                                                                                                                                                                     
           STEM: 70.28
Social Sciences: 78.71
     Humanities: 58.47
          Other: 71.25

结果保存至saves/qwen3-4b/lora/eval，其中results.json保存了模型的回答结果，而results.log保存了模型的评估指标结果

NLG 评估

我们还可以通过计算模型的 BLEU 和 ROUGE 分数以评价模型生成质量。

评估数据集可直接使用类似于sft的数据集进行评估

配置示例文件 examples/extras/nlg_eval/qwen3_lora_predict.yaml 具体如下：

# The batch generation can be SLOW using this config.
# For faster inference, we recommend to use `scripts/vllm_infer.py`.

### model
model_name_or_path: /media/good/4TB/mn/model/llm/qwen/Qwen3-4B
adapter_name_or_path: saves/qwen3-4b/lora/sft
trust_remote_code: true

### method
stage: sft
do_predict: true
finetuning_type: lora

### dataset
eval_dataset: identity,alpaca_en_demo
template: qwen
cutoff_len: 2048
max_samples: 50
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: saves/qwen3-4b/lora/predict
overwrite_output_dir: true
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### eval
per_device_eval_batch_size: 1
predict_with_generate: true
ddp_timeout: 180000000

然后在命令行使用如下命令：

llamafactory-cli train examples/extras/nlg_eval/qwen3_lora_predict.yaml

出现下面的界面，表示运行完成：

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [05:00<00:00,  3.09s/it]Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.446 seconds.
Prefix dict has been built successfully.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [05:02<00:00,  3.02s/it]
***** predict metrics *****
  predict_bleu-4                 =    62.5695
  predict_model_preparation_time =     0.0039
  predict_rouge-1                =    71.6423
  predict_rouge-2                =    54.1658
  predict_rouge-l                =    65.4101
  predict_runtime                = 0:05:04.02
  predict_samples_per_second     =      0.329
  predict_steps_per_second       =      0.329
[INFO|2025-06-03 15:15:56] llamafactory.train.sft.trainer:143 >> Saving prediction results to saves/qwen3-4b/lora/predict/generated_predictions.jsonl

结果保存至saves/qwen3-4b/lora/predict

加速

FlashAttention 能够加快注意力机制的运算速度，同时减少对内存的使用。

首先下载安装包：https://github.com/Dao-AILab/flash-attention/releases/

例如python3.8 torch2.3 cuda12需下载flash_attn-x.x.x+cu122torch2.3cxx11abiFALSE-cp38-cp38-linux_x86_64.whl

然后安装whl包

pip install flash_attn-2.5.8+cu122torch2.3cxx11abiFALSE-cp38-cp38-linux_x86_64.whl

在启动训练时在训练配置文件中添加以下参数：

flash_attn: fa2

其他参数

数据参数

参数名称	类型	介绍	默认值
template	Optional[str]	训练和推理时构造 prompt 的模板。	None
dataset	Optional[str]	用于训练的数据集名称。使用逗号分隔多个数据集。	None
eval_dataset	Optional[str]	用于评估的数据集名称。使用逗号分隔多个数据集。	None
eval_on_each_dataset	Optional[bool]	是否在每个评估数据集上分开计算loss，默认concate后为整体计算。	False
dataset_dir	str	存储数据集的文件夹路径。	“data”
media_dir	Optional[str]	存储图像、视频或音频的文件夹路径。如果未指定，默认为 dataset_dir。	None
data_shared_file_system	Optional[bool]	多机多卡时，不同机器存放数据集的路径是否是共享文件系统。数据集处理在该值为true时只在第一个node发生，为false时在每个node都处理一次。	false
cutoff_len	int	输入的最大 token 数，超过该长度会被截断。	2048
train_on_prompt	bool	是否在输入 prompt 上进行训练。	False
mask_history	bool	是否仅使用当前对话轮次进行训练。	False
streaming	bool	是否启用数据流模式。	False
buffer_size	int	启用 streaming 时用于随机选择样本的 buffer 大小。	16384
mix_strategy	Literal[“concat”, “interleave_under”, “interleave_over”]	数据集混合策略，支持 concat、 interleave_under、 interleave_over。	concat
interleave_probs	Optional[str]	使用 interleave 策略时，指定从多个数据集中采样的概率。多个数据集的概率用逗号分隔。	None
overwrite_cache	bool	是否覆盖缓存的训练和评估数据集。	False
preprocessing_batch_size	int	预处理时每批次的示例数量。	1000
preprocessing_num_workers	Optional[int]	预处理时使用的进程数量。	None
max_samples	Optional[int]	每个数据集的最大样本数：设置后，每个数据集的样本数将被截断至指定的 max_samples。	None
eval_num_beams	Optional[int]	模型评估时的 num_beams 参数。	None
ignore_pad_token_for_loss	bool	计算 loss 时是否忽略 pad token。	True
val_size	float	验证集相对所使用的训练数据集的大小。取值在 [0,1) 之间。启用 streaming 时 val_size 应是整数。	0.0
packing	Optional[bool]	是否启用 sequences packing。预训练时默认启用。	None
neat_packing	bool	是否启用不使用 cross-attention 的 sequences packing。	False
tool_format	Optional[str]	用于构造函数调用示例的格式。	None
tokenized_path	Optional[str]	Tokenized datasets的保存或加载路径。如果路径存在，会加载已有的 tokenized datasets；如果路径不存在，则会在分词后将 tokenized datasets 保存在此路径中。	None

模型参数

参数名称	类型	介绍	默认值
model_name_or_path	Optional[str]	模型路径（本地路径或 Huggingface/ModelScope 路径）。	None
adapter_name_or_path	Optional[str]	适配器路径（本地路径或 Huggingface/ModelScope 路径）。使用逗号分隔多个适配器路径。	None
adapter_folder	Optional[str]	包含适配器权重的文件夹路径。	None
cache_dir	Optional[str]	保存从 Hugging Face 或 ModelScope 下载的模型的本地路径。	None
use_fast_tokenizer	bool	是否使用 fast_tokenizer 。	True
resize_vocab	bool	是否调整词表和嵌入层的大小。	False
split_special_tokens	bool	是否在分词时将 special token 分割。	False
new_special_tokens	Optional[str]	要添加到 tokenizer 中的 special token。多个 special token 用逗号分隔。	None
model_revision	str	所使用的特定模型版本。	main
low_cpu_mem_usage	bool	是否使用节省内存的模型加载方式。	True
rope_scaling	Optional[Literal[“linear”, “dynamic”, “yarn”, “llama3”]]	RoPE Embedding 的缩放策略，支持 linear、dynamic、yarn 或 llama3。	None
flash_attn	Literal[“auto”, “disabled”, “sdpa”, “fa2”]	是否启用 FlashAttention 来加速训练和推理。可选值为 auto, disabled, sdpa, fa2。	auto
shift_attn	bool	是否启用 Shift Short Attention (S^2-Attn)。	False
mixture_of_depths	Optional[Literal[“convert”, “load”]]	需要将模型转换为 mixture_of_depths（MoD）模型时指定： convert 需要加载 mixture_of_depths（MoD）模型时指定： load。	None
use_unsloth	bool	是否使用 unsloth 优化 LoRA 微调。	False
use_unsloth_gc	bool	是否使用 unsloth 的梯度检查点。	False
enable_liger_kernel	bool	是否启用 liger 内核以加速训练。	False
moe_aux_loss_coef	Optional[float]	MoE 架构中 aux_loss 系数。数值越大，各个专家负载越均衡。	None
disable_gradient_checkpointing	bool	是否禁用梯度检查点。	False
use_reentrant_gc	bool	是否启用可重入梯度检查点	True
upcast_layernorm	bool	是否将 layernorm 层权重精度提高至 fp32。	False
upcast_lmhead_output	bool	是否将 lm_head 输出精度提高至 fp32。	False
train_from_scratch	bool	是否随机初始化模型权重。	False
infer_backend	Literal[“huggingface”, “vllm”]	推理时使用的后端引擎，支持 huggingface 或 vllm。	huggingface
offload_folder	str	卸载模型权重的路径。	offload
use_cache	bool	是否在生成时使用 KV 缓存。	True
infer_dtype	Literal[“auto”, “float16”, “bfloat16”, “float32”]	推理时使用的模型权重和激活值的数据类型。支持 auto, float16, bfloat16, float32。	auto
hf_hub_token	Optional[str]	用于登录 HuggingFace 的验证 token。	None
ms_hub_token	Optional[str]	用于登录 ModelScope Hub 的验证 token。	None
om_hub_token	Optional[str]	用于登录 Modelers Hub 的验证 token。	None
print_param_status	bool	是否打印模型参数的状态。	False
trust_remote_code	bool	是否信任来自 Hub 上数据集/模型的代码执行。	False
compute_dtype	Optional[torch.dtype]	用于计算模型输出的数据类型，无需手动指定。	None
device_map	Optional[Union[str, Dict[str, Any]]]	模型分配的设备映射，无需手动指定。	None
model_max_length	Optional[int]	模型的最大输入长度，无需手动指定。	None
block_diag_attn	bool	是否使用块对角注意力，无需手动指定。	False