[Hands-On Tutorial] Fine-Tuning Llama3-8B with LoRA and a Custom Loss Function

Preparation

  • Prepare domain-specific fine-tuning data: roughly 1,000 samples as the training set and roughly 300 as the test set (English data works best). Two format examples are shown below, followed by a quick validation sketch:
// Discriminative classification task
[
	{
		"instruction": "# Goal\nYou are a senior python programmer, please check is there any risk of endless loop in the code snippet.\n#code\n```python\nint n = 0\nwhile n < 100:\ns = get_token();\n```",
		"input": "",
		"output": "{\"label\":1}"
	},
	//...
]
// Text generation task
[
	{
		"instruction": "# Goal\nYou are a senior python programmer, please write a binary search algorithm.",
		"input": "",
		"output": "def binary_search(arr, target):\n    left = 0\n    right = len(arr) - 1\n    while left <= right:\n        mid = left + (right - left) // 2\n        if arr[mid] == target:\n            return mid\n        elif arr[mid] > target:\n            right = mid - 1\n        else:\n            left = mid + 1\n    return -1"
	},
	//...
]
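Before training, it is worth checking that every record carries the three Alpaca-style keys used by the scripts below (note that the real files must be valid JSON, so the // comments in the examples above are only ellipsis markers). A minimal check, where the test-set file name is a placeholder and the train file matches the default path used later in cover_alpaca2jsonl.py:

import json

# Hypothetical file names -- adjust to your own ~1000-sample train / ~300-sample test split
for path in ["data/train_data/samples_train_llama.json",
             "data/train_data/samples_test_llama.json"]:
    with open(path) as f:
        samples = json.load(f)
    # Every record must expose the Alpaca-style keys expected by the scripts below
    assert all({"instruction", "input", "output"} <= set(s) for s in samples), path
    print(path, len(samples), "samples")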
# unsloth is a relatively lightweight framework, well suited for building your own fine-tuning pipeline from scratch
# This hands-on tutorial is based on the unsloth framework
https://github.com/unslothai/unsloth
# The repo also collects many fine-tuning tutorials, quantized models, and datasets you can reuse

# Other mainstream fine-tuning frameworks, more feature-complete and with better visualization:
# https://github.com/huggingface/trl
# https://github.com/OpenAccess-AI-Collective/axolotl
# https://github.com/hiyouga/LLaMA-Factory
# https://github.com/modelscope/swift
# ... and many more
  • Python environment setup:
    Check the GPU compute capability of your machine:
import torch

# The major compute capability decides which install command to use below
major_version, minor_version = torch.cuda.get_device_capability()
print(major_version, minor_version)

Install the packages required for your GPU according to the compute capability reported above:

# If major_version >= 8 (Ampere / Hopper GPUs: RTX 30xx, RTX 40xx, A100, H100, L40, etc.), install:
pip3 install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
# If major_version < 8 (older GPUs such as V100, Tesla T4, RTX 20xx), install:
pip3 install --no-deps xformers trl peft accelerate bitsandbytes

If imports fail afterwards, it is most likely a package version conflict. If reinstalling the conflicting packages does not resolve it, create a fresh conda environment and install the packages one by one as prompted; a quick import check is sketched below.
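The sketch below simply tries to import each package in the training stack and prints its version, so a broken install surfaces here instead of in the middle of training:

import importlib

# Report the version of each package, or the import error that needs fixing
for pkg in ["torch", "xformers", "trl", "peft", "accelerate", "bitsandbytes"]:
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg:<14} {getattr(mod, '__version__', 'unknown')}")
    except Exception as e:
        print(f"{pkg:<14} import failed -> {e}")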

LoRA Fine-Tuning

  • Step 1: Convert the samples from JSON to JSONL with the script cover_alpaca2jsonl.py (the resulting record format is shown after the script).
import argparse
import json
from tqdm import tqdm


def format_example(example: dict) -> dict:
    context = f"Instruction: {example['instruction']}\n"
    if example.get("input"):
        context += f"Input: {example['input']}\n"
    context += "Answer: "
    target = example["output"]
    return {"context": context, "target": target}


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_path", type=str, default="data/train_data/samples_train_llama.json")
    parser.add_argument("--save_path", type=str, default="data/train_data/samples_train_llama.jsonl")

    args = parser.parse_args()
    with open(args.data_path) as f:
        examples = json.load(f)

    with open(args.save_path, 'w') as f:
        for example in tqdm(examples, desc="formatting.."):
            f.write(json.dumps(format_example(example)) + '\n')


if __name__ == "__main__":
    main()
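Running the script (for example, python cover_alpaca2jsonl.py --data_path ... --save_path ...) writes one JSON object per line: the prompt goes under "context" and the expected completion under "target". A quick way to inspect the output:

import json

# Peek at the first converted record; "context" is "Instruction: ...\n[Input: ...\n]Answer: "
with open("data/train_data/samples_train_llama.jsonl") as f:
    record = json.loads(f.readline())

print(record["context"])
print(record["target"][:80])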
  • Step 2: Generate the tokenized dataset folder with the script tokenize_dataset_rows.py; this prepares the data for the custom loss function later. Sketches of a driver and of a loss-masking collator follow the script.
import argparse
import json
from tqdm import tqdm

import datasets
import transformers
import os


def preprocess(tokenizer, config, example, max_seq_length, version):
    if version == 'v1':
        prompt = example["context"]
        target = example["target"]
        prompt_ids = tokenizer.encode(prompt, max_length=max_seq_length, truncation=True)
        target_ids = tokenizer.encode(
            target,
            max_length=max_seq_length,
            truncation=True,
            add_special_tokens=False)
        input_ids = prompt_ids + target_ids + [config.eos_token_id]
        return {"input_ids": input_ids, "seq_len": len(prompt_ids)}

    if version == 'v2':
        query = example["context"]
        target = example["target"]
        history = None
        prompt = tokenizer.build_prompt(query, history)

        a_ids = tokenizer.encode(text=prompt, add_special_tokens=True, truncation=True,
                                 max_length=max_seq_length)
        b_ids = tokenizer.encode(text=target, add_special_tokens=False, truncation=True,
                                 max_length=max_seq_length)

        input_ids = a_ids + b_ids + [tokenizer.eos_token_id]

        return {"input_ids": input_ids, "seq_len": len(a_ids)}


def read_jsonl(path, max_seq_length, base_model_path, version='v1', skip_overlength=False):

    tokenizer = transformers.AutoTokenizer.from_pretrained(
        base_model_path, trust_remote_code=True)
    config = transformers.AutoConfig.from_pretrained(
        base_model_path, trust_remote_code=True, device_map='auto')
    with open(path, "r") as f:
        for line in tqdm(f.readlines()):
            example = json.loads(line)
            # feature = preprocess(tokenizer, config, example, max_seq_length)
            feature = preprocess(tokenizer, config, example, max_seq_length, version)
            if skip_overlength and len(feature["input_ids"]) > max_seq_length:
                continue
            feature["input_ids"] = feature["input_ids"][:max_seq_length]
            yield feature
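read_jsonl is a generator, so it still needs a small driver that materializes the features into a Hugging Face datasets folder on disk (the "tokenize folder" mentioned in the step title). Below is a minimal sketch of such a driver, reusing the imports and read_jsonl defined above; the argument names, defaults, and output path are assumptions, not the article's exact code:

def main():
    parser = argparse.ArgumentParser()
    # Paths and defaults below are illustrative assumptions
    parser.add_argument("--jsonl_path", type=str, default="data/train_data/samples_train_llama.jsonl")
    parser.add_argument("--save_path", type=str, default="data/train_data/tokenized")
    parser.add_argument("--base_model_path", type=str, required=True)
    parser.add_argument("--max_seq_length", type=int, default=1024)
    parser.add_argument("--version", type=str, default="v1")
    parser.add_argument("--skip_overlength", action="store_true")
    args = parser.parse_args()

    # Materialize the generator into an on-disk datasets folder
    dataset = datasets.Dataset.from_generator(
        lambda: read_jsonl(args.jsonl_path, args.max_seq_length,
                           args.base_model_path, args.version, args.skip_overlength)
    )
    dataset.save_to_disk(args.save_path)


if __name__ == "__main__":
    main()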
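The seq_len stored per example is what enables the custom loss later: when batching, the first seq_len tokens (the prompt) can be masked out of the labels with -100 so cross-entropy is computed only on the target tokens. The collator below is a minimal sketch of that idea, not the article's exact implementation; the pad_token_id default is an assumption and should come from your tokenizer:

import torch

def loss_masking_collator(features, pad_token_id=0):
    """Pad a batch and mask prompt tokens so the loss only covers the target.

    Assumes each feature has "input_ids" (prompt + target + eos) and
    "seq_len" (length of the prompt part), as produced by preprocess() above.
    """
    max_len = max(len(f["input_ids"]) for f in features)
    input_ids, attention_mask, labels = [], [], []
    for f in features:
        ids, seq_len = f["input_ids"], f["seq_len"]
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
        # -100 is ignored by the cross-entropy loss in transformers
        labels.append([-100] * seq_len + ids[seq_len:] + [-100] * pad)
    return {
        "input_ids": torch.tensor(input_ids, dtype=torch.long),
        "attention_mask": torch.tensor(attention_mask, dtype=torch.long),
        "labels": torch.tensor(labels, dtype=torch.long),
    }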