Hi everyone, I'm 烤鸭 (Kaoya):
I hadn't planned to write this one; translation doesn't leave much to write about. The Transformer was originally built for machine translation, and the market is full of mature translation products, each with a free tier that individual users rarely exhaust. Recently, though, Tencent open-sourced a fairly strong translation model, so today I'll focus on open-source translation models.
Model overview
I picked five models to compare.
| Property | facebook/nllb-200-distilled-600M | facebook/m2m100_418M | Hunyuan-MT-7B | Helsinki-NLP/opus-mt-zh-en | ByteDance-Seed/Seed-X-PPO-7B-AWQ-Int4 |
|---|---|---|---|---|---|
| Parameters | 600M | 418M | 7B | - | 7B |
| Architecture | Transformer | Transformer (M2M100) | Transformer | Transformer | Mistral architecture (sparse attention + gated feed-forward) |
| Language coverage | 200+ languages | 100+ languages | 33 languages + 5 Chinese ethnic-minority languages (Cantonese / Tibetan / Kazakh, etc.) | Chinese → English | 28 major languages |
| Training data | Optimized for low-resource languages | 7.5B sentence pairs across 100 languages | Large-scale multilingual parallel corpora | Chinese-English parallel data | Multi-domain professional corpora (biomedicine / finance) |
| Max input length | 512 tokens | - | - | - | - |
| Inference speed | >1500 tokens/s on A100 (batch=32) | - | 182 tokens/s on RTX 4090 (FP8) | - | Hundreds of tokens/s on A100 |
| Quantization | Int8 | - | FP8 (+30% speed) | - | 4-bit / 8-bit |
| Highlights | Deployable in very low-resource environments; covers the languages of 98% of the world's population | First many-to-many translation model that does not pivot through English | 1st place in 30 of 31 language pairs at WMT2025 | Lightweight zh→en model with a small footprint (552 MB) | Excellent domain translation: biomedicine (92.7%) / finance (98%) |
| Typical use | Research / low-resource single-sentence translation | Multilingual translation without an English pivot | General-purpose multilingual translation APIs | Chinese → English translation | Professional document translation, long-text processing |
| Limitations | Not suited to long text; quality below large LLMs | Quality slightly below the larger variants | - | One direction only | - |
| Downloads | High (distilled) | Very high | Very high (open-sourced Sept 2025) | High | High (open-sourced July 2025) |
| License | CC-BY-NC-4.0 | MIT | - | CC-BY-4.0 | - |
Test environment
- Python 3.11 or later
- CUDA 12.8
- Windows: RTX 4070 SUPER, 12 GB
- Linux: RTX 4090, 24 GB
Hands-on experience
Our use case feeds an entire chunk of ASR output to the translation model at once, hoping it can use the surrounding context while translating.
What we really want is a model that can consolidate text and translate it, rather than translate sentence by sentence. We did consider merging the ASR text first, but each segment carries timestamps, and once merged the text is very hard to split back apart.
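For context, here is a minimal sketch of the trade-off described above: translating each segment's `text` field in isolation preserves the timestamps, but throws away cross-segment context. The `translate` function is a hypothetical stand-in for any of the models tested below.

```python
import json

def translate(text: str) -> str:
    # Hypothetical stand-in for a real translation-model call.
    return f"<en>{text}</en>"

def translate_segments(segments_json: str) -> str:
    """Translate each ASR segment's `text` field in isolation, writing the
    result into `translateText` and leaving start/end timestamps untouched."""
    segments = json.loads(segments_json)
    for seg in segments:
        seg["translateText"] = translate(seg["text"])
    return json.dumps(segments, ensure_ascii=False)

demo = '[{"start": 0.0, "end": 1.36, "text": "你别再说你是车评人了", "translateText": ""}]'
print(translate_segments(demo))
```

Timestamps survive because the structure is never touched; the cost is that each call sees one sentence with no surrounding dialogue.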
Test input:
[{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":1.36,"id":1331,"reTranslateText":"","seq":0,"sourceAudioSegment":"","start":0.0,"text":"你别再说你是车评人了","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":2.76,"id":1332,"reTranslateText":"","seq":1,"sourceAudioSegment":"","start":1.36,"text":"连我想要的车你都找不到","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":4.02,"id":1333,"reTranslateText":"","seq":2,"sourceAudioSegment":"","start":2.76,"text":"不是 你想要什么车呀","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":5.84,"id":1334,"reTranslateText":"","seq":3,"sourceAudioSegment":"","start":4.02,"text":"必须得好看 高级","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":6.96,"id":1335,"reTranslateText":"","seq":4,"sourceAudioSegment":"","start":5.84,"text":"还不能千篇一律","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":8.94,"id":1336,"reTranslateText":"","seq":5,"sourceAudioSegment":"","start":6.96,"text":"最主要的是还得智能好开","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":9.7,"id":1337,"reTranslateText":"","seq":6,"sourceAudioSegment":"","start":8.94,"text":"这简单呀","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":10.46,"id":1338,"reTranslateText":"","seq":7,"sourceAudioSegment":"","start":9.7,"text":"我要是找到了","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":12.12,"id":1339,"reTranslateText":"","seq":8,"sourceAudioSegment":"","sta
rt":10.46,"text":"咱们给兄弟们来做人车大片怎么样","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":13.98,"id":1340,"reTranslateText":"","seq":9,"sourceAudioSegment":"","start":12.12,"text":"行 那要是没找到呢","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":14.88,"id":1341,"reTranslateText":"","seq":10,"sourceAudioSegment":"","start":13.98,"text":"没找到","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":15.86,"id":1342,"reTranslateText":"","seq":11,"sourceAudioSegment":"","start":15.20,"text":"可以 来吧","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":18.1,"id":1343,"reTranslateText":"","seq":12,"sourceAudioSegment":"","start":17.06,"text":"就是这辆车了","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":19.66,"id":1344,"reTranslateText":"","seq":13,"sourceAudioSegment":"","start":18.1,"text":"2025款 东风一派","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":20.78,"id":1345,"reTranslateText":"","seq":14,"sourceAudioSegment":"","start":19.66,"text":"一菜 007","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":22.82,"id":1346,"reTranslateText":"","seq":15,"sourceAudioSegment":"","start":20.78,"text":"你刚才跟我说你想要好看的车是吧","translateText":"","updateTime":"2025-11-13T18:03:17"}]
nllb-200-distilled-600M
ModelScope: https://modelscope.cn/models/facebook/nllb-200-distilled-600M/summary
Hugging Face: https://huggingface.co/facebook/nllb-200-distilled-600M
There are plenty of Spaces on Hugging Face for testing; here are two:
Demo: https://huggingface.co/spaces/TiberiuCristianLeon/StreamlitTranslate
https://huggingface.co/spaces/sepioo/facebook-translation
The online demos ran into all kinds of problems, so I deployed locally to try it properly.
Install dependencies
pip install transformers torch sentencepiece
# versions
# transformers==4.57.1
# torch==2.9.1
# sentencepiece==0.2.1
Demo:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
model_name = "D:\\models\\modelscope\\facebook\\nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Translation example: Chinese -> English
src_text = "你好,欢迎使用NLLB模型!"
inputs = tokenizer(src_text, return_tensors="pt")

# Target language code (FLORES-200 style)
target_lang_code = "eng_Latn"

# convert_tokens_to_ids is the most direct and stable way to get the
# language ID; note it maps unknown codes to unk_token_id rather than
# raising KeyError, so check for that instead of wrapping in try/except
forced_bos_token_id = tokenizer.convert_tokens_to_ids(target_lang_code)
if forced_bos_token_id == tokenizer.unk_token_id:
    raise ValueError(f"Target language code '{target_lang_code}' is not in the tokenizer vocabulary.")

# Translate
translated_tokens = model.generate(**inputs, forced_bos_token_id=forced_bos_token_id)

# Decode
translated_text = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print("Translation:", translated_text)
A single sentence translates fine, but passing the JSON payload above returns nothing at all, both online and locally.
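One likely culprit is the 512-token input cap from the table above: the whole JSON payload blows straight past it. A rough character-based chunking sketch follows (the 400-character budget is a heuristic of mine, not NLLB's actual tokenizer limit); each chunk would then go through `model.generate` separately.

```python
def chunk_text(text: str, max_chars: int = 400) -> list[str]:
    """Split text into chunks of at most max_chars characters,
    preferring to break right after sentence-ending punctuation."""
    chunks, buf = [], ""
    # Insert a newline after each full stop so splitlines() yields sentences.
    for piece in text.replace("。", "。\n").splitlines():
        if len(buf) + len(piece) > max_chars and buf:
            chunks.append(buf)
            buf = ""
        buf += piece
    if buf:
        chunks.append(buf)
    return chunks

long_text = "第一句。" * 300
print([len(c) for c in chunk_text(long_text)])
```

A single sentence longer than `max_chars` would still produce an oversized chunk; a production version would need a harder character-level fallback.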


facebook/m2m100_418M
ModelScope: https://modelscope.cn/models/facebook/m2m100_418M
Hugging Face: https://huggingface.co/facebook/m2m100_418M
There are plenty of Spaces on Hugging Face for testing; here is one:
Demo: https://huggingface.co/spaces/fereen5/infinityai-tools
Tried the online demo; it simply crashed.

Hunyuan-MT-7B
ModelScope: https://modelscope.cn/models/Tencent-Hunyuan/Hunyuan-MT-7B/summary
Hugging Face: https://huggingface.co/tencent/Hunyuan-MT-7B
There are plenty of Spaces on Hugging Face for testing; here is one:
Demo: https://huggingface.co/spaces/TiberiuCristianLeon/StreamlitTranslate
The online demo has a length limit, so I tried a shorter input.

The translated sentences themselves were fine, but there is a formatting problem: the original JSON structure is not preserved. At first I assumed this was a quirk of the open-source release, so I also called the Hunyuan model from the Tencent Cloud console. Credit to the Hunyuan team here: the open-source and paid versions share exactly the same bug. I'm not sure whether to praise the open-source release for holding nothing back, or to say the model still has room to improve.
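When a model mangles the JSON wrapper like this, a simple guard is to validate the output and fall back to the untouched input (or queue a retry) when parsing fails. A minimal sketch, with `model_translate` as a hypothetical stand-in for the actual model call:

```python
import json

def model_translate(payload: str) -> str:
    # Hypothetical model call; here it simulates the bug by
    # dropping part of the JSON structure.
    return payload.rstrip("]")

def safe_translate(payload: str) -> str:
    """Return the model output only if it is still valid JSON;
    otherwise fall back to the original payload."""
    out = model_translate(payload)
    try:
        json.loads(out)
        return out
    except json.JSONDecodeError:
        return payload

print(safe_translate('[{"text": "你别再说你是车评人了"}]'))
```

In our pipeline a failed check would trigger a repair step rather than a silent fallback, but the validation gate is the same idea.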

Python invocation:
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "D:\\models\\modelscope\\Tencent-Hunyuan\\Hunyuan-MT-7B-fp8"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
# You may want to use bfloat16 and/or move to GPU here
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto")

# Prompt (in Chinese): "Translate the following Chinese into English; translate
# only the Chinese, keep all other original output, no extra explanation."
messages = [
    {"role": "user", "content": "将以下的中文翻译成英文,仅翻译中文,保留其他原始输出, 不需要额外解释."
        "\n\n"
        "[{\"cloneAudioSegment\":\"\",\"createTime\":\"2025-11-13T18:03:17\",\"dataId\":42,\"end\":1.36,\"id\":1331,\"reTranslateText\":\"\",\"seq\":0,\"sourceAudioSegment\":\"\",\"start\":0,\"text\":\"你别再说你是车评人了\",\"translateText\":\"\",\"updateTime\":\"2025-11-13T18:03:17\"}]"
    },
]
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=False,
    return_tensors="pt",
)
outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=2048)
output_text = tokenizer.decode(outputs[0])
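Note that `outputs[0]` contains the prompt tokens followed by the generated reply, so decoding the whole tensor echoes the prompt back. A common pattern with causal LMs (not specific to Hunyuan) is to slice off the prompt before decoding; sketched here on plain lists so the slicing logic is visible:

```python
def strip_prompt(output_ids: list[int], prompt_len: int) -> list[int]:
    """Drop the echoed prompt tokens, keeping only the newly generated
    continuation (mirrors outputs[0][tokenized_chat.shape[-1]:])."""
    return output_ids[prompt_len:]

prompt = [101, 102, 103]        # stand-in prompt token IDs
generated = prompt + [7, 8, 9]  # model output echoes the prompt first
print(strip_prompt(generated, len(prompt)))
```

With the real tensors this becomes `tokenizer.decode(outputs[0][tokenized_chat.shape[-1]:], skip_special_tokens=True)`.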
opus-mt-zh-en
ModelScope: https://modelscope.cn/models/Helsinki-NLP/opus-mt-zh-en
Hugging Face: https://huggingface.co/Helsinki-NLP/opus-mt-zh-en
There are plenty of Spaces on Hugging Face for testing; here is one:
Demo: https://huggingface.co/spaces/Helsinki-NLP/opus-translate
A 2023 model, the oldest of the five. Same problem as Hunyuan: it drops symbols.

Invocation code:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-zh-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-zh-en")

# Translation example
text = "自然语言处理是人工智能的一个重要分支。"
inputs = tokenizer(text, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=512)
# decode() takes a single sequence, so index the batch first
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translated_text)
ByteDance-Seed/Seed-X-PPO-7B-AWQ-Int4
ModelScope: https://modelscope.cn/models/ByteDance-Seed/Seed-X-PPO-7B-AWQ-Int4/summary
Hugging Face: https://huggingface.co/ByteDance-Seed/Seed-X-PPO-7B-AWQ-Int4
There is no online demo for the quantized build; the available Space runs the full 7B version.
Demo: https://huggingface.co/spaces/ByteDance-Seed/Seed-X

The best performer so far: the translation is correct and the format is preserved. It supports 28 languages, and the 7B version's ~15 GB memory footprint is manageable. I haven't run it on this machine, but the ~8 GB quantized build is worth trying:
https://www.modelscope.cn/models/ByteDance-Seed/Seed-X-PPO-7B-GPTQ-Int8
Demo:
from vllm import LLM, SamplingParams, BeamSearchParams

model_path = "ByteDance-Seed/Seed-X-PPO-7B"
model = LLM(
    model=model_path,
    max_num_seqs=512,
    tensor_parallel_size=8,  # adjust to the number of GPUs available
    enable_prefix_caching=True,
    gpu_memory_utilization=0.95,
)
messages = [
    # without CoT
    "Translate the following English sentence into Chinese:\nMay the force be with you <zh>",
    # with CoT
    "Translate the following English sentence into Chinese and explain it in detail:\nMay the force be with you <zh>",
]
# Beam search (recommended by the model card)
decoding_params = BeamSearchParams(beam_width=4, max_tokens=512)
# Greedy decoding; note this assignment overwrites the beam-search
# params above, so keep only whichever of the two you actually want
decoding_params = SamplingParams(temperature=0, max_tokens=512, skip_special_tokens=True)
results = model.generate(messages, decoding_params)
responses = [res.outputs[0].text.strip() for res in results]
print(responses)
Summary
In our fairly complex scenario, which needs semantic understanding + text consolidation on top of translation, none of the foreign models held up; presumably they were built purely for translation and never saw this kind of workload.
That said, the facebook models are all about being small and fast; for a pure translation workload they should be good enough.
For our use case I would personally recommend ByteDance's model. Tencent's Hunyuan shipped two months later, but newer does not automatically mean stronger. (Our final solution was actually ByteDance-Seed for the first pass + Hunyuan-MT for correction + DeepSeek for format repair.)
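That three-stage pipeline can be sketched as below. All three stage functions are hypothetical stubs standing in for the actual Seed-X, Hunyuan-MT, and DeepSeek calls; only the composition order comes from our real setup.

```python
def seed_x_translate(payload: str) -> str:
    # Stage 1 (hypothetical stub): first-pass translation with Seed-X-PPO-7B.
    return payload

def hunyuan_refine(payload: str) -> str:
    # Stage 2 (hypothetical stub): quality correction with Hunyuan-MT-7B.
    return payload

def deepseek_fix_format(payload: str) -> str:
    # Stage 3 (hypothetical stub): repair the JSON structure with DeepSeek.
    return payload

def translate_pipeline(payload: str) -> str:
    """First pass -> correction -> format repair."""
    return deepseek_fix_format(hunyuan_refine(seed_x_translate(payload)))
```

The format-repair stage runs last so that whatever structure the first two models drop gets rebuilt once, at the end.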
Seed-X-PPO-7B supports 28 languages and Hunyuan-MT-7B supports 31; the gap is small.
Both models' FP8 quantized builds come in at roughly 8 GB, so an ordinary consumer PC can run them.
People keep talking about the AI gap between China and the rest of the world; for this translation scenario, I'd say the open-source Chinese models win outright.
There are also plenty of paid options (Baidu PaddlePaddle, Tencent Hunyuan, Youdao machine translation) that are perfectly adequate for ordinary scenarios, and individual users get free quotas of about a million characters per month. It mostly comes down to the use case.
References
https://blog.youkuaiyun.com/qq_42746084/article/details/154947534