Open-Source AI Translation Models: Lightweight and Locally Deployable


Hi everyone, I'm 烤鸭 (Kaoya):

I originally wasn't planning to write this, since there really isn't much to say about translation. The Transformer was built for machine translation in the first place, and the market is already full of mature translation products, each with a free tier that individual users can barely exhaust. But Tencent recently open-sourced a fairly strong translation model, so today I'll go over the open-source translation models.

Model overview

I picked five of them to cover.

| Spec | facebook/nllb-200-distilled-600M | facebook/m2m100_418M | Hunyuan-MT-7B | Helsinki-NLP/opus-mt-zh-en | ByteDance-Seed/Seed-X-PPO-7B-AWQ-Int4 |
| --- | --- | --- | --- | --- | --- |
| Parameter size | 600M | 418M | 7B | - | 7B |
| Architecture | Transformer | Transformer (M2M100) | Transformer | Transformer | Mistral architecture (sparse attention + gated feed-forward network) |
| Language support | 200+ languages | 100+ languages | 33 languages + 5 Chinese ethnic minority languages (Cantonese / Tibetan / Kazakh, etc.) | Chinese → English | 28 major languages |
| Training data | Optimized for low-resource languages | 7.5B sentence pairs across 100 languages | Large-scale multilingual parallel corpus | Chinese-English parallel data | Multi-domain professional corpora (biomedical / finance) |
| Max input length | 512 tokens | - | - | - | - |
| Inference speed | >1500 tokens/s on A100 (batch=32) | - | 182 tokens/s on RTX 4090 (FP8) | - | Hundreds of tokens/s on A100 |
| Quantization | Int8 | - | FP8 (+30% speed) | - | 4-bit / 8-bit |
| Highlights | Deployable in very low-resource environments; covers languages of 98% of the world's population | First many-to-many translation model that does not pivot through English | Ranked 1st in 30 of 31 language pairs at WMT2025 | Lightweight Chinese-English translation, tiny footprint (552 MB) | Excellent domain translation: biomedical (92.7%) / finance (98%) |
| Use cases | Research / single-sentence translation for low-resource languages | Multilingual translation without an English pivot | General translation, multilingual API services | Chinese → English translation | Professional document translation, long texts |
| Limitations | Not suited to long texts; weaker than large models | Quality slightly below the larger variants | - | One direction only | - |
| Downloads | High (distilled version) | Very high | Very high (open-sourced Sept 2025) | High | Open-sourced July 2025 |
| License | Apache-2.0 | CC-BY-NC 4.0 | - | CC-BY-4.0 | - |

Test environment

  • Python 3.11 or greater
  • CUDA 12.8
  • Windows, RTX 4070 Super (12 GB VRAM)
  • Linux, RTX 4090 (24 GB VRAM)

Hands-on experience

Our use case is to hand an entire ASR transcript to the translation model in one go, so that each line can be translated with its surrounding context.

What we really want is a model that can do text consolidation plus translation, rather than sentence-by-sentence translation. We did consider merging the ASR text first, but because every segment carries timestamps, the merged text is very hard to split back apart afterwards.
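
To make the goal concrete, here is a rough sketch (my own illustration, not code from our actual pipeline) of how the segment list gets packed into a single prompt; the field names match the test JSON below, and the hope is that the model returns the same JSON with only the text translated.

import json

def build_prompt(segments: list[dict]) -> str:
    """Pack the whole ASR segment list into one translation prompt.

    Each segment keeps its seq and start/end timestamps, so ideally the model
    returns the same JSON with the "text" fields translated (e.g. written into
    "translateText"). Whether a model actually honors this is what the tests
    below are about.
    """
    payload = json.dumps(segments, ensure_ascii=False)
    # Instruction (in Chinese): translate the Chinese into English, translate only
    # the Chinese, keep the rest of the original output, no extra explanation.
    return "将以下的中文翻译成英文,仅翻译中文,保留其他原始输出, 不需要额外解释.\n\n" + payload

# Usage:
# segments = json.loads(open("asr_segments.json", encoding="utf-8").read())
# prompt = build_prompt(segments)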

Test input:

[{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":1.36,"id":1331,"reTranslateText":"","seq":0,"sourceAudioSegment":"","start":0.0,"text":"你别再说你是车评人了","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":2.76,"id":1332,"reTranslateText":"","seq":1,"sourceAudioSegment":"","start":1.36,"text":"连我想要的车你都找不到","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":4.02,"id":1333,"reTranslateText":"","seq":2,"sourceAudioSegment":"","start":2.76,"text":"不是 你想要什么车呀","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":5.84,"id":1334,"reTranslateText":"","seq":3,"sourceAudioSegment":"","start":4.02,"text":"必须得好看 高级","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":6.96,"id":1335,"reTranslateText":"","seq":4,"sourceAudioSegment":"","start":5.84,"text":"还不能千篇一律","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":8.94,"id":1336,"reTranslateText":"","seq":5,"sourceAudioSegment":"","start":6.96,"text":"最主要的是还得智能好开","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":9.7,"id":1337,"reTranslateText":"","seq":6,"sourceAudioSegment":"","start":8.94,"text":"这简单呀","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":10.46,"id":1338,"reTranslateText":"","seq":7,"sourceAudioSegment":"","start":9.7,"text":"我要是找到了","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":12.12,"id":1339,"reTranslateText":"","seq":8,"sourceAudioSegment":"","start":10.46,"text":"咱们给兄弟们来做人车大片怎么样","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":13.98,"id":1340,"reTranslateText":"","seq":9,"sourceAudioSegment":"","start":12.12,"text":"行 那要是没找到呢","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":14.88,"id":1341,"reTranslateText":"","seq":10,"sourceAudioSegment":"","start":13.98,"text":"没找到","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":15.86,"id":1342,"reTranslateText":"","seq":11,"sourceAudioSegment":"","start":15.20,"text":"可以 来吧","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":18.1,"id":1343,"reTranslateText":"","seq":12,"sourceAudioSegment":"","start":17.06,"text":"就是这辆车了","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":19.66,"id":1344,"reTranslateText":"","seq":13,"sourceAudioSegment":"","start":18.1,"text":"2025款 东风一派","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":20.78,"id":1345,"reTranslateText":"","seq":14,"sourceAudioSegment":"","start":19.66,"text":"一菜 
007","translateText":"","updateTime":"2025-11-13T18:03:17"},{"cloneAudioSegment":"","createTime":"2025-11-13T18:03:17","dataId":42,"end":22.82,"id":1346,"reTranslateText":"","seq":15,"sourceAudioSegment":"","start":20.78,"text":"你刚才跟我说你想要好看的车是吧","translateText":"","updateTime":"2025-11-13T18:03:17"}]

nllb-200-distilled-600M

ModelScope: https://modelscope.cn/models/facebook/nllb-200-distilled-600M/summary

Hugging Face: https://huggingface.co/facebook/nllb-200-distilled-600M

There are plenty of Spaces on Hugging Face where you can test it; here are a couple picked at random.

Demo: https://huggingface.co/spaces/TiberiuCristianLeon/StreamlitTranslate

https://huggingface.co/spaces/sepioo/facebook-translation

I tried the online demos and ran into all sorts of problems, so let's deploy it locally and run it there instead.

Install dependencies

pip install transformers torch sentencepiece
# Versions used
# transformers==4.57.1
# torch==2.9.1
# sentencepiece==0.2.1

Demo code:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the model and tokenizer; NLLB needs to know the source language,
# otherwise it assumes English input.
model_name = "D:\\models\\modelscope\\facebook\\nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="zho_Hans")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Translation example: Chinese -> English
src_text = "你好,欢迎使用NLLB模型!"
inputs = tokenizer(src_text, return_tensors="pt")

# Target language code
target_lang_code = "eng_Latn"

# Use convert_tokens_to_ids to look up the language token ID;
# this is the most direct and stable approach.
forced_bos_token_id = tokenizer.convert_tokens_to_ids(target_lang_code)

# Check that the ID is valid (i.e. not the unknown-token ID)
if forced_bos_token_id == tokenizer.unk_token_id:
    raise ValueError(f"Target language code '{target_lang_code}' is not in the tokenizer vocabulary.")

# Translate
translated_tokens = model.generate(**inputs, forced_bos_token_id=forced_bos_token_id)

# Decode the output
translated_text = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print("Translation:", translated_text)

A single-sentence call works fine, but when I pass in the JSON payload above, nothing comes back; the online demo and the local run behave the same way.
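
Since NLLB can't handle the full JSON, one workaround (a sketch of my own, with the model name and field names taken from the test data above) is to keep the JSON on our side: batch-translate only each segment's text field and write the result back into translateText. The structure and timestamps survive, but each line is then translated without its surrounding context, which is exactly what we wanted to avoid.

import json
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "facebook/nllb-200-distilled-600M"
# src_lang tells the NLLB tokenizer the input is Simplified Chinese
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="zho_Hans")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate_segments(segments: list[dict]) -> list[dict]:
    """Translate each segment's "text" field and fill in "translateText"."""
    texts = [seg["text"] for seg in segments]
    inputs = tokenizer(texts, return_tensors="pt", padding=True)
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
        max_new_tokens=128,
    )
    translations = tokenizer.batch_decode(generated, skip_special_tokens=True)
    for seg, en in zip(segments, translations):
        seg["translateText"] = en
    return segments

# segments = json.loads(raw_json_string)
# print(json.dumps(translate_segments(segments), ensure_ascii=False))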

facebook/m2m100_418M

ModelScope: https://modelscope.cn/models/facebook/m2m100_418M

Hugging Face: https://huggingface.co/facebook/m2m100_418M

There are plenty of Spaces on Hugging Face where you can test it; here is one picked at random.

Demo: https://huggingface.co/spaces/fereen5/infinityai-tools

I tried the online demo and it simply crashed.
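
Since the online Space kept crashing, here is the standard local usage from the m2m100 model card (a minimal sketch; point model_name at a local download just like with NLLB if needed):

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# M2M100 takes the source language on the tokenizer and the target
# language as forced_bos_token_id.
tokenizer.src_lang = "zh"
inputs = tokenizer("你别再说你是车评人了", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.get_lang_id("en"),
    max_new_tokens=128,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])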


Hunyuan-MT-7B

ModelScope: https://modelscope.cn/models/Tencent-Hunyuan/Hunyuan-MT-7B/summary

Hugging Face: https://huggingface.co/tencent/Hunyuan-MT-7B

There are plenty of Spaces on Hugging Face where you can test it; here is one picked at random.

Demo: https://huggingface.co/spaces/TiberiuCristianLeon/StreamlitTranslate

The online demo has a length limit, so I tried translating a shorter excerpt.


The translated sentences themselves are fine, but there is a formatting problem: the original JSON structure is not preserved. At first I assumed this was a quirk of the open-source release, so I also called the Hunyuan model through the Tencent Cloud console. Credit to the Hunyuan team here: the open-source and the paid versions share exactly the same bug. I'm not sure whether to praise the open-source release for holding nothing back, or to say the model still has room for improvement.


Python call:

from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "D:\\models\\modelscope\\Tencent-Hunyuan\\Hunyuan-MT-7B-fp8"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto")  # You may want to use bfloat16 and/or move to GPU here
# Prompt (in Chinese): translate the following Chinese into English, translate only
# the Chinese, keep the rest of the original output, no extra explanation.
messages = [
    {"role": "user", "content": "将以下的中文翻译成英文,仅翻译中文,保留其他原始输出, 不需要额外解释."
    "\n\n"
    "[{\"cloneAudioSegment\":\"\",\"createTime\":\"2025-11-13T18:03:17\",\"dataId\":42,\"end\":1.36,\"id\":1331,\"reTranslateText\":\"\",\"seq\":0,\"sourceAudioSegment\":\"\",\"start\":0,\"text\":\"你别再说你是车评人了\",\"translateText\":\"\",\"updateTime\":\"2025-11-13T18:03:17\"}]"
    },
]
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=False,
    return_tensors="pt"
)

outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=2048)
output_text = tokenizer.decode(outputs[0])
print(output_text)

opus-mt-zh-en

ModelScope: https://modelscope.cn/models/Helsinki-NLP/opus-mt-zh-en

Hugging Face: https://huggingface.co/Helsinki-NLP/opus-mt-zh-en

There are plenty of Spaces on Hugging Face where you can test it; here is one picked at random.

Demo: https://huggingface.co/spaces/Helsinki-NLP/opus-translate

Released in 2023, this is the oldest model of the bunch. It has the same problem as Hunyuan: it drops symbols and formatting.


Demo code

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-zh-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-zh-en")

# Translation example
text = "自然语言处理是人工智能的一个重要分支。"
inputs = tokenizer(text, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=512)
# decode expects a single sequence, so take outputs[0]
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translated_text)
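
If all you need is a quick one-off translation, the transformers pipeline API wraps the same model in a single call (shown here as an aside, not something from the original test):

from transformers import pipeline

# The translation pipeline bundles tokenization, generation and decoding.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-zh-en")
print(translator("自然语言处理是人工智能的一个重要分支。", max_length=128))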

ByteDance-Seed/Seed-X-PPO-7B-AWQ-Int4

ModelScope: https://modelscope.cn/models/ByteDance-Seed/Seed-X-PPO-7B-AWQ-Int4/summary

Hugging Face: https://huggingface.co/ByteDance-Seed/Seed-X-PPO-7B-AWQ-Int4

There is no online demo for the quantized build; the Space that does exist serves the full 7B version.

Demo: https://huggingface.co/spaces/ByteDance-Seed/Seed-X

This is the best performer so far: the translation is fine and the JSON format is preserved. It supports 28 languages; the full 7B version needs about 15 GB of memory, which is acceptable, though I haven't actually run it on this machine. You can also try the roughly 8 GB quantized build:
https://www.modelscope.cn/models/ByteDance-Seed/Seed-X-PPO-7B-GPTQ-Int8

Demo code

from vllm import LLM, SamplingParams, BeamSearchParams

model_path = "ByteDance-Seed/Seed-X-PPO-7B"

model = LLM(model=model_path,
            max_num_seqs=512,
            tensor_parallel_size=8,
            enable_prefix_caching=True, 
            gpu_memory_utilization=0.95)

messages = [
    # without CoT
    "Translate the following English sentence into Chinese:\nMay the force be with you <zh>",
    # with CoT
    "Translate the following English sentence into Chinese and explain it in detail:\nMay the force be with you <zh>" 
]

# Beam search (We recommend using beam search decoding)
decoding_params = BeamSearchParams(beam_width=4,
                                   max_tokens=512)
# Greedy decoding (this assignment overrides the beam-search params above; keep only the one you want)
decoding_params = SamplingParams(temperature=0,
                                 max_tokens=512,
                                 skip_special_tokens=True)

results = model.generate(messages, decoding_params)
responses = [res.outputs[0].text.strip() for res in results]

print(responses)
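
Note that the official demo above assumes 8 GPUs (tensor_parallel_size=8). For the AWQ-Int4 build this section is about, a single consumer GPU should be enough; below is a hedged single-GPU variant (the quantization flag is my assumption based on vLLM's general AWQ support, and exact behavior may vary across vLLM versions):

from vllm import LLM, SamplingParams

# Single-GPU load of the 4-bit AWQ build; needs far less VRAM than the
# full-precision 7B checkpoint.
model = LLM(model="ByteDance-Seed/Seed-X-PPO-7B-AWQ-Int4",
            quantization="awq",
            tensor_parallel_size=1,
            gpu_memory_utilization=0.90)

params = SamplingParams(temperature=0, max_tokens=512, skip_special_tokens=True)
prompt = "Translate the following English sentence into Chinese:\nMay the force be with you <zh>"
print(model.generate([prompt], params)[0].outputs[0].text.strip())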

Summary

In our fairly complex scenario, which needs semantic understanding plus text consolidation plus translation, the foreign models don't hold up; they were probably built purely for translation and never targeted this kind of use case.

That said, Facebook's models are all about being small and fast; for a pure translation scenario they should be good enough.

For our scenario I'd personally recommend ByteDance's model. Tencent's Hunyuan came out two months later, but newer doesn't automatically mean stronger. (That said, our final solution uses ByteDance-Seed for the first pass, Hunyuan-MT for corrections, and DeepSeek for format repair; see the sketch below.)
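
For reference, that final pipeline looks roughly like the sketch below; the three helpers are placeholders for the model calls shown earlier (plus a DeepSeek API call for format repair), not real library functions:

import json

def seed_x_translate(segments: list[dict]) -> list[dict]:
    """First pass: Seed-X-PPO translates each segment's text field."""
    ...  # batch the segments through the vLLM demo above

def hunyuan_refine(segments: list[dict]) -> list[dict]:
    """Second pass: Hunyuan-MT re-translates segments flagged as low quality."""
    ...  # call the Hunyuan-MT chat-template demo above

def deepseek_fix_format(raw_output: str) -> list[dict]:
    """Third pass: an LLM repairs any output that lost its JSON structure."""
    ...  # call the DeepSeek API with a "return valid JSON only" instruction

def translate_asr(segments: list[dict]) -> list[dict]:
    draft = seed_x_translate(segments)
    refined = hunyuan_refine(draft)
    return deepseek_fix_format(json.dumps(refined, ensure_ascii=False))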

Seed-X-PPO-7B supports 28 languages and Hunyuan-MT-7B supports 33; not a huge difference.

Both models' FP8 quantized builds come in at around 8 GB, so an ordinary personal computer can run them.

People keep talking about the gap between Chinese and Western AI, but for the translation scenario I'd say the open-source Chinese models win outright.

There are also plenty of paid services, such as Baidu PaddlePaddle, Tencent Hunyuan, and Youdao Translate (machine translation), that are good enough for ordinary scenarios, and individual users still get a free monthly quota in the millions of characters. It mainly comes down to the use case.

