Today, Alibaba Cloud's Tongyi (Qwen) team officially released Qwen2.5-Math-PRM, a new process reward model for mathematical reasoning. The model is available in two sizes, 72B and 7B, and both significantly outperform comparable open-source process reward models, standing out in particular at identifying errors in reasoning.
Remarkably, the 7B version of Qwen2.5-Math-PRM even surpasses the widely used GPT-4o, a milestone in Alibaba Cloud's work on reasoning models. To evaluate mathematical reasoning more comprehensively, the Tongyi team also open-sourced ProcessBench, the first step-level evaluation benchmark. It covers 3,400 math test cases, including problems at International Mathematical Olympiad difficulty, each annotated with a detailed reasoning process by human experts to ensure a rigorous and comprehensive evaluation.
Evaluating Qwen2.5-Math-PRM on ProcessBench, the research team found that both the 72B and the 7B model perform strongly. The 7B version in particular not only surpasses open-source models of the same size but, in some respects, even exceeds the closed-source GPT-4o-0806. This demonstrates the great potential of process reward models (PRMs) for improving the reliability of reasoning and points the way for future work on reasoning process supervision.
This innovative work by the Tongyi team not only advances AI reasoning technology but also offers a valuable reference for other developers in the field. By open-sourcing the models, the team hopes to share its experience with more researchers and to drive technical progress across the industry.
Quick Start
This requires transformers 4.37.0 or later, since the Qwen2.5 code has been integrated into transformers starting from that version.
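If your installed version is older, it can be upgraded with pip, for example (adjust to your environment as needed):

pip install -U "transformers>=4.37.0"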
import torch
from transformers import AutoModel, AutoTokenizer
import torch.nn.functional as F
def make_step_rewards(logits, token_masks):
    # Softmax over the two labels at every token position.
    probabilities = F.softmax(logits, dim=-1)
    # Keep probabilities only at the step-separator positions.
    probabilities = probabilities * token_masks.unsqueeze(-1)  # (bs, seq_len, num_labels)

    all_scores_res = []
    for i in range(probabilities.size(0)):
        sample = probabilities[i]  # (seq_len, num_labels)
        # At each separator, take the probability of the "correct" label,
        # yielding one score per reasoning step.
        positive_probs = sample[sample != 0].view(-1, 2)[:, 1]  # (num_steps,)
        non_zero_elements_list = positive_probs.cpu().tolist()
        all_scores_res.append(non_zero_elements_list)
    return all_scores_res
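# Optional sanity check (an illustrative sketch, not part of the official quick
# start): exercise make_step_rewards on dummy tensors, assuming a batch of one
# sequence of length 6, two labels, and separators at positions 2 and 5.
dummy_logits = torch.randn(1, 6, 2)  # (bs, seq_len, num_labels)
dummy_masks = torch.tensor([[0, 0, 1, 0, 0, 1]], dtype=torch.bool)
print(make_step_rewards(dummy_logits, dummy_masks))  # two scores, one per step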
model_name = "Qwen/Qwen2.5-Math-PRM-7B"
device = "auto"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    device_map=device,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # the PRM head is defined by custom code in the model repo
).eval()
data = {
    "system": "Please reason step by step, and put your final answer within \\boxed{}.",
    "query": "Sue lives in a fun neighborhood. One weekend, the neighbors decided to play a prank on Sue. On Friday morning, the neighbors placed 18 pink plastic flamingos out on Sue's front yard. On Saturday morning, the neighbors took back one third of the flamingos, painted them white, and put these newly painted white flamingos back out on Sue's front yard. Then, on Sunday morning, they added another 18 pink plastic flamingos to the collection. At noon on Sunday, how many more pink plastic flamingos were out than white plastic flamingos?",
    "response": [
        "To find out how many more pink plastic flamingos were out than white plastic flamingos at noon on Sunday, we can break down the problem into steps. First, on Friday, the neighbors start with 18 pink plastic flamingos.",
        "On Saturday, they take back one third of the flamingos. Since there were 18 flamingos, (1/3 \\times 18 = 6) flamingos are taken back. So, they have (18 - 6 = 12) flamingos left in their possession. Then, they paint these 6 flamingos white and put them back out on Sue's front yard. Now, Sue has the original 12 pink flamingos plus the 6 new white ones. Thus, by the end of Saturday, Sue has (12 + 6 = 18) pink flamingos and 6 white flamingos.",
        "On Sunday, the neighbors add another 18 pink plastic flamingos to Sue's front yard. By the end of Sunday morning, Sue has (18 + 18 = 36) pink flamingos and still 6 white flamingos.",
        "To find the difference, subtract the number of white flamingos from the number of pink flamingos: (36 - 6 = 30). Therefore, at noon on Sunday, there were 30 more pink plastic flamingos out than white plastic flamingos. The answer is (\\boxed{30})."
    ]
}
messages = [
    {"role": "system", "content": data['system']},
    {"role": "user", "content": data['query']},
    # Join the solution steps with the step-separator token <extra_0>;
    # the PRM produces one reward at each separator position.
    {"role": "assistant", "content": "<extra_0>".join(data['response']) + "<extra_0>"},
]
conversation_str = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False
)
input_ids = tokenizer.encode(
    conversation_str,
    return_tensors="pt",
).to(model.device)
outputs = model(input_ids=input_ids)

# Locate the <extra_0> separators and read one reward per reasoning step.
step_sep_id = tokenizer.encode("<extra_0>")[0]
token_masks = (input_ids == step_sep_id)
step_reward = make_step_rewards(outputs[0], token_masks)
print(step_reward) # [[1.0, 0.1904296875, 0.9765625, 1.0]]
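Each inner list holds one score per reasoning step, read off at the <extra_0> separators. To rank whole responses (for example, in a best-of-N setup), the step scores can be collapsed into a single value. The helper below is a minimal sketch of that idea, using either the product or the minimum of the step scores; this aggregation choice is an assumption for illustration, not something the quick start prescribes.

# Hypothetical helper: collapse per-step rewards into one score per response.
# "product" penalizes any weak step; "min" scores by the weakest step alone.
def aggregate(step_scores, mode="product"):
    if mode == "product":
        total = 1.0
        for s in step_scores:
            total *= s
        return total
    return min(step_scores)

for scores in step_reward:
    print(aggregate(scores), aggregate(scores, "min"))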