Use the reward-model-deberta-v3-large-v2 reward model to score each QA pair and keep only the high-quality ones; then compute sentence embeddings with BERT and run kCenterGreedy over them to select a diverse instruction subset, and fine-tune on that subset. After fine-tuning, score the LLM's own outputs, filter out the low-scoring cases, and add this poorly-handled data back in for another round of SFT.
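A minimal sketch of the QA scoring step, following the usage on the OpenAssistant model card for this reward model (the example QA pair and the idea of a cutoff are illustrative):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "OpenAssistant/reward-model-deberta-v3-large-v2"
rank_model = AutoModelForSequenceClassification.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)

question = "Explain nuclear fusion like I am five."   # illustrative QA pair
answer = "Nuclear fusion is when two small things join into one bigger thing..."
inputs = tokenizer(question, answer, return_tensors="pt")
with torch.no_grad():
    score = rank_model(**inputs).logits[0].item()  # higher = preferred answer
# keep the pair only if the score clears a chosen quality threshold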
PS: it depends on the task, but when I scored data with reward-model-deberta-v3-large-v2 myself, the scores didn't look reliable enough to filter on.
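For the diversity step, a minimal sketch of kCenterGreedy over sentence embeddings; the note above only says BERT-based sentence similarity, so the embedding model and k here are my own illustrative choices:

import numpy as np
from sentence_transformers import SentenceTransformer

def k_center_greedy(emb: np.ndarray, k: int) -> list[int]:
    # Greedily pick the point farthest from the already-selected centers,
    # maximizing coverage (diversity) of the embedding space.
    selected = [0]  # seed with an arbitrary first point
    dists = np.linalg.norm(emb - emb[0], axis=1)  # distance to nearest center
    for _ in range(min(k, len(emb)) - 1):
        idx = int(np.argmax(dists))
        selected.append(idx)
        dists = np.minimum(dists, np.linalg.norm(emb - emb[idx], axis=1))
    return selected

encoder = SentenceTransformer("all-mpnet-base-v2")  # stand-in for "BERT similarity"
instructions = ["..."]  # the QA pairs that survived the reward-model filter
emb = encoder.encode(instructions, normalize_embeddings=True)
diverse = [instructions[i] for i in k_center_greedy(emb, k=1000)]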
Magpie
Because an LLM is trained autoregressively, it can generate user inputs on its own: feed it only the left-hand part of the chat template and it will complete a user instruction; the generated instruction is then fed back into an LLM (local or via API) to produce the response. The generated data is tagged for safety, reward, quality, difficulty, etc. using models such as guard_model_path="meta-llama/Meta-Llama-Guard-2-8B" and reward_model_path="sfairXC/FsfairX-LLaMA3-RM-v0.1", and filtered against those requirements. For deduplication, a Faiss index is built over SentenceTransformer (all-mpnet-base-v2) embeddings; each text's k nearest neighbors are retrieved and cosine distance decides whether two texts are near-duplicates.
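A minimal sketch of this dedup step, assuming normalized embeddings so inner product equals cosine similarity; k and the 0.9 similarity threshold are illustrative, not Magpie's exact settings:

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

texts = ["..."]  # generated instructions to deduplicate
encoder = SentenceTransformer("all-mpnet-base-v2")
emb = encoder.encode(texts, normalize_embeddings=True).astype("float32")

index = faiss.IndexFlatIP(emb.shape[1])  # inner product == cosine on unit vectors
index.add(emb)
k = 5                                    # neighbors to inspect (illustrative)
sims, ids = index.search(emb, min(k + 1, len(texts)))  # +1 for the self-match

removed = set()
for i in range(len(texts)):
    if i in removed:
        continue
    for sim, j in zip(sims[i][1:], ids[i][1:]):  # skip the self-match at rank 0
        if j > i and sim > 0.9:                  # threshold (illustrative)
            removed.add(int(j))
deduped = [t for i, t in enumerate(texts) if i not in removed]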
Responses to the instructions are then scored with model = ArmoRMPipeline("RLHFlow/ArmoRM-Llama3-8B-v0.1", trust_remote_code=True, device_map=f"cuda:{args.device}").
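ArmoRMPipeline is a wrapper in the Magpie codebase; as a sketch, the underlying reward model can be called directly along the lines of the RLHFlow model card (treat the exact output attribute as an assumption of that model's remote code):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

path = "RLHFlow/ArmoRM-Llama3-8B-v0.1"
model = AutoModelForSequenceClassification.from_pretrained(
    path, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(path)

messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model(input_ids)
score = output.score.float().item()  # scalar preference score (per the model card)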
<|begin_of_text|><|start_header_id|>user<|end_header_id|>
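The line above is the Llama-3 pre-query template: it ends exactly where the user's text would begin, so a plain completion call samples an instruction. A minimal sketch with vLLM (model choice, sampling parameters, and stop token are illustrative):

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # model choice illustrative
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"

params = SamplingParams(
    temperature=1.0,       # high temperature for diverse instructions (illustrative)
    top_p=1.0,
    max_tokens=256,
    stop=["<|eot_id|>"],   # end of the sampled user turn
)
# the model "completes" the user turn, i.e. it invents an instruction
outputs = llm.generate([pre_query] * 8, params)
instructions = [o.outputs[0].text.strip() for o in outputs]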
Multi-turn dialogue
"mt_append_template": "<|start_header_id|>user<|end_header_id|>\n\n",
mt_system_prompt = "You are a helpful AI assistant. The user will engage in a multi-round conversation with you, asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and insightful responses to help the user with their queries."
print(f"Generating responses for turn {turn}...")
prompts = []
for item in batch:
if not args.tokenizer_template:
conv = get_conversation_template(MODEL_NAME)
if turn == 2:
conv.append_message(conv.roles[0], item[f'instruction'])
conv.append_message(conv.roles[1], item[f'response'])
else:
conv.append_message(conv.roles[0], item[f'instruction'])
conv.append_message(conv.roles[1], item[f'response'])
for i in range(2, turn):
conv.append_message(conv.roles[0], item[f'instruction_{i}'])
conv.append_message(conv.roles[1], item[f'response_{i}'])
conv.append_message(conv.roles[0], item[f'instruction_{turn}'])
conv.append_message(conv.roles[1], None)
template = conv.get_prompt()
else:
chat = []
if turn == 2:
chat.append({"role": "user", "content": item[f'instruction']})
chat.append({"role": "assistant", "content": item[f'response']})
else:
chat.append({"role": "user", "content": item[f'instruction']})
chat.append({"role": "assistant", "content": item[f'response']})
for i in range(2, turn):
chat.append({"role": "user", "content": item[f'instruction_{i}']})
chat.append({"role": "assistant", "content": item[f'response_{i}']})
chat.append({"role": "user", "content": item[f'instruction_{turn}']})
template = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
prompts.append(template)
outputs = llm.generate(prompts, response_params)
for i, item in enumerate(batch):
item[f'response_{turn}'] = outputs[i].outputs[0].text.strip()
Self-Instruct
Starting from 175 seed tasks (each with one instruction and one example instance), an LLM generates new instructions in context; generation is handled differently depending on whether the task is a classification task (classification tasks are generated output-first so the inputs are not biased toward one label). Responses are then generated for the new instructions, and the results are filtered before being added to the task pool.
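One of those filters is lexical novelty: the Self-Instruct paper only keeps an instruction whose ROUGE-L similarity to everything already in the pool is below 0.7. A minimal sketch of that rule (the surrounding loop and variable names are my own illustration):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def is_novel(candidate, pool, threshold=0.7):
    # keep a candidate only if its ROUGE-L overlap with every pooled
    # instruction stays below the threshold
    return all(
        scorer.score(inst, candidate)["rougeL"].fmeasure < threshold
        for inst in pool
    )

pool = ["Give three tips for staying healthy."]  # seed instructions (illustrative)
new_instructions = ["..."]                       # raw LLM generations (illustrative)
for cand in new_instructions:
    if is_novel(cand, pool):
        pool.append(cand)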