Neural Machine Translation (NMT)
Original paper and code: Neural Machine Translation. The goal is to reproduce "Effective Approaches to Attention-based Neural Machine Translation".
1. English-Vietnamese (133k)
The dataset from the original paper (the Vietnamese glosses below are from Microsoft Translator), iwslt15_en_vi, contains eight files: ['vocab.en', 'train.vi', 'tst2013.en', 'vocab.vi', 'tst2012.vi', 'tst2013.vi', 'tst2012.en', 'train.en']
Reading the 'train.en' file directly, many lines like these turn up:
'He 's trebled his crop income'
'because he 's now got toilets'
'I want to end by saying it 's been the actions'
Presumably these stand for 's and similar HTML escapes, so the following cleanup is applied:
lines_X = file_X.read().strip().replace('& amp ; quot ;', '"')\
    .replace('&apos;', "'").replace('&quot;', '"')\
    .replace('&amp;', '&').replace('&#91;', '[')\
    .replace('&#93;', ']').replace('& amp ;', '&').split('\n')
These calls undo the HTML escape characters; empty lines are then skipped and the data is assembled into pairs (a sketch follows the example below):
[...
['it s man made and can be overcome and eradicated by the actions of human beings . ', 'Nó là do con người và có thể ngăn chặn và diệt trừ bởi hành động của con người . '],
...]
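For reference, a minimal sketch of this pair-building step (the file paths and the exact replacement list are assumptions based on the snippet above):

def read_pairs(en_path='train.en', vi_path='train.vi'):
    # Undo the HTML escapes left in the IWSLT files, then split into lines.
    def clean(path):
        with open(path, encoding='utf-8') as f:
            text = f.read().strip()
        for src, dst in [('& amp ; quot ;', '"'), ('&apos;', "'"), ('&quot;', '"'),
                         ('&amp;', '&'), ('&#91;', '['), ('&#93;', ']'), ('& amp ;', '&')]:
            text = text.replace(src, dst)
        return text.split('\n')
    en_lines, vi_lines = clean(en_path), clean(vi_path)
    # Skip pairs where either side is an empty line.
    return [[en, vi] for en, vi in zip(en_lines, vi_lines) if en and vi]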
Even so, translation quality probably won't be great: even the last two entries of the raw dataset are misaligned:
English:
Didier Sornette : How we can predict the next financial crisis
The 2007-2008 financial crisis , you might think , was an unpredictable one-time crash . But Didier Sornette and his Financial Crisis Observatory have plotted a set of early warning signs for unstable , growing systems , tracking the moment when any bubble is about to pop .
Vietnamese:
Paul Pholeros : Làm sao để bớt nghèo khổ ? Hãy sửa nhà # Paul Pholeros: How to reduce poverty? Let's fix the house.
Năm 1985 , kiến trúc sư Paul Pholeros được giao nhiệm vụ " ngăn chặn người dân tiếp tục mắc bệnh " từ người chủ trung tâm y tế là 1 người thổ dân trong cộng đồng người thổ dân ở Nam Úc . Nhận thức cốt lõi : thay vì dùng thuốc , hãy cải thiện môi trường sống địa phương . Trong diễn văn sáng ngời này , Pholeros mô tả các dự án mà Healthabitat - tổ chức mà ông đang quản lý để giúp giảm nghèo - thực hiện bằng những thay đổi trong thiết kế -- ở Úc và nước ngoài . # In 1985, architect Paul Pholeros was tasked with "stopping people from getting sick" by the Aboriginal director of a health centre in an Aboriginal community in South Australia. The core insight: instead of medicine, improve the local living environment. In this bright talk, Pholeros describes projects that Healthabitat, the organization he runs to help reduce poverty, carried out through design changes, in Australia and abroad.
This is exactly how the data ships on the official site (train.en, train.vi), so there is nothing to do but press on.
(1) Approach I
Following [PyTorch] 6: French-English Translation with RNNs in Practice (attention-based seq2seq model, attention visualization), a small baseline was put together:
'Câu chuyện này chưa kết thúc .' # This story is not over.
this one is been . <EOS>
'Ông rút lui vào yên lặng .' # He retreated into silence.
he was at at . . <EOS>
'Ông qua đời , bị lịch sử quật ngã .' # He died, destroyed by history.
KeyError: 'quật'
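The KeyError comes from looking up a test word that never occurred in training. A minimal fix, assuming the tutorial-style Lang class with a word2index dict (the UNK index value is an assumption):

UNK_token = 3  # assuming 0-2 are already taken by SOS/EOS/PAD

def indexes_from_sentence(lang, sentence):
    # Fall back to UNK for out-of-vocabulary words instead of raising KeyError.
    return [lang.word2index.get(word, UNK_token) for word in sentence.split(' ')]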
With an unknown token added as above, MAX_LENGTH and n_iters can be tuned; with MAX_LENGTH=300 and n_iters=133000 × 10, the training run (NLLLoss):
Reading lines...
Read 133168 sentence pairs
Trimmed to 133168 sentence pairs
elapsed 2m 27s (est. remaining 3203m 24s) iter=1000 progress 0.077% Loss:5.5203
elapsed 3m 19s (est. remaining 2158m 21s) iter=2000 progress 0.154% Loss:5.0535
...
elapsed 1262m 17s (est. remaining 0m 58s) iter=1299000 progress 99.923% Loss:3.7528
elapsed 1263m 11s (est. remaining 0m 0s) iter=1300000 progress 100.000% Loss:3.7462
Translations of the following five sentences at this point:
'Câu chuyện này chưa kết thúc .' # This story is not over. (truth:This is not a finished story .)
this is the not . . <EOS>
'Ông rút lui vào yên lặng .' # He retreated into silence.(truth:He retreated into silence .)
he was the he . <EOS>
'Ông qua đời , bị lịch sử quật ngã .' # He died, destroyed by history.(truth:He died broken by history .)
he was through history , history . <EOS>
'Ông là ông của tôi .' # You're my grandfather.(truth:He is my grandfather .)
he was my . <EOS>
'Tôi chưa bao giờ gặp ông ngoài đời .' # I've never met you in real life.(truth:I never knew him in real life .)
i never never met him meet him . <EOS>
A quick look at the metrics: since the cross-entropy cannot be computed directly here, min(word index, sentence length) tokens are scored and their average taken as the per-sentence cross-entropy (a sketch below). Final Loss: 5.3968 (5.3950), PPL: 220.701 (220.305), BLEU: 0.0. The results are very poor.
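A sketch of that evaluation, assuming decoder_outputs holds per-step log-probabilities (log_softmax outputs, as NLLLoss expects) and target_ids the reference token indices:

import math
import torch.nn as nn

criterion = nn.NLLLoss()

def sentence_ce_and_ppl(decoder_outputs, target_ids):
    # Score only the overlapping prefix: min(decoded length, target length).
    n = min(len(decoder_outputs), len(target_ids))
    losses = [criterion(decoder_outputs[i].unsqueeze(0), target_ids[i].view(1))
              for i in range(n)]
    ce = sum(losses).item() / n       # average per-token cross-entropy
    return ce, math.exp(ce)           # (loss, perplexity)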
Trying other settings (Max_len = 50, dropout = 0.2, …) did not help either.
(2) Approach II
To address the unknown-word problem, I followed [PyTorch] 8: Language Translation with Torchtext in Practice (English-German translation, attention model, PyTorch 1.8 installation).
Install torchtext:
conda install torchtext
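The point of moving to torchtext is that its vocabulary maps rare and unseen words to <unk> automatically. A minimal sketch of the setup (legacy Field API, which lives under torchtext.legacy from torchtext 0.9 on; whitespace tokenization and the pair order are assumptions):

from torchtext.legacy.data import Field, Example, Dataset, BucketIterator

SRC = Field(tokenize=str.split, init_token='<sos>', eos_token='<eos>', lower=True)
TRG = Field(tokenize=str.split, init_token='<sos>', eos_token='<eos>', lower=True)
fields = [('src', SRC), ('trg', TRG)]
examples = [Example.fromlist(p, fields) for p in pairs]  # pairs built in section 1
train_data = Dataset(examples, fields)
# Words below min_freq map to <unk>, which is what fixes Approach I's KeyError.
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)
train_iterator = BucketIterator(train_data, batch_size=2, sort_key=lambda x: len(x.src))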
Pitfall: it fails to run with batch=4; with batch=2 the results are:
data process...
10779 12849
training...
53319
evaluate...
Epoch: 01 | Time: 79m 48s
Train Loss: 0.735 | Train PPL: 2.084
Val. Loss: 0.741 | Val. PPL: 2.098
...
Epoch: 10 | Time: 78m 41s
Train Loss: 0.620 | Train PPL: 1.858
Val. Loss: 0.680 | Val. PPL: 1.974
evaluate...
| Test Loss: 0.631 | Test PPL: 1.880 |
The numbers above came from mixing up the English and German tokens; with that fixed, a single epoch gives:
Epoch: 01 | Time: 107m 25s
Train Loss: 5.011 | Train PPL: 150.087
Val. Loss: 5.230 | Val. PPL: 186.728
evaluate...
770
| Test Loss: 5.123 | Test PPL: 167.875 | # model_Translate_4.pth
Ten epochs, batch=3:
Epoch: 01 | Time: 71m 48s
Train Loss: 5.107 | Train PPL: 165.135
Val. Loss: 5.311 | Val. PPL: 202.551
training...
35550
evaluate...
8880
Epoch: 02 | Time: 71m 10s
Train Loss: 4.460 | Train PPL: 86.481
Val. Loss: 5.114 | Val. PPL: 166.268
training...
35550
evaluate...
8880
Epoch: 03 | Time: 69m 9s
Train Loss: 4.209 | Train PPL: 67.264
Val. Loss: 4.970 | Val. PPL: 143.994
training...
35550
evaluate...
8880
Epoch: 04 | Time: 69m 58s
Train Loss: 4.086 | Train PPL: 59.483
Val. Loss: 4.926 | Val. PPL: 137.836
training...
35550
evaluate...
8880
Epoch: 05 | Time: 68m 24s
Train Loss: 4.012 | Train PPL: 55.260
Val. Loss: 4.908 | Val. PPL: 135.361
training...
35550
evaluate...
8880
Epoch: 06 | Time: 68m 40s
Train Loss: 3.963 | Train PPL: 52.638
Val. Loss: 4.844 | Val. PPL: 126.949
training...
35550
evaluate...
8880
Epoch: 07 | Time: 68m 2s
Train Loss: 3.925 | Train PPL: 50.658
Val. Loss: 4.823 | Val. PPL: 124.302
training...
35550
evaluate...
8880
Epoch: 08 | Time: 67m 31s
Train Loss: 3.889 | Train PPL: 48.869
Val. Loss: 4.835 | Val. PPL: 125.825
training...
35550
evaluate...
8880
Epoch: 09 | Time: 68m 35s
Train Loss: 3.867 | Train PPL: 47.806
Val. Loss: 4.860 | Val. PPL: 129.057
training...
35550
evaluate...
8880
Epoch: 10 | Time: 69m 16s
Train Loss: 3.845 | Train PPL: 46.767
Val. Loss: 4.828 | Val. PPL: 124.944
evaluate...
510
| Test Loss: 4.778 | Test PPL: 118.853 | # model_Translate_2.pth
Ten epochs, batch=2:
Epoch: 01 | Time: 105m 17s
Train Loss: 5.043 | Train PPL: 154.898
Val. Loss: 5.228 | Val. PPL: 186.510
training...
53320
evaluate...
13330
Epoch: 02 | Time: 79m 55s
Train Loss: 4.397 | Train PPL: 81.231
Val. Loss: 5.036 | Val. PPL: 153.777
training...
53320
evaluate...
13330
Epoch: 03 | Time: 83m 0s
Train Loss: 4.193 | Train PPL: 66.248
Val. Loss: 4.944 | Val. PPL: 140.263
training...
53320
evaluate...
13330
Epoch: 04 | Time: 82m 14s
Train Loss: 4.099 | Train PPL: 60.289
Val. Loss: 4.909 | Val. PPL: 135.474
training...
53320
evaluate...
13330
Epoch: 05 | Time: 82m 49s
Train Loss: 4.037 | Train PPL: 56.677
Val. Loss: 4.878 | Val. PPL: 131.420
training...
53320
evaluate...
13330
Epoch: 06 | Time: 83m 46s
Train Loss: 3.990 | Train PPL: 54.056
Val. Loss: 4.853 | Val. PPL: 128.156
training...
53320
evaluate...
13330
Epoch: 07 | Time: 85m 3s
Train Loss: 3.942 | Train PPL: 51.510
Val. Loss: 4.850 | Val. PPL: 127.755
training...
53320
evaluate...
13330
Epoch: 08 | Time: 83m 2s
Train Loss: 3.916 | Train PPL: 50.189
Val. Loss: 4.834 | Val. PPL: 125.679
training...
53320
evaluate...
13330
Epoch: 09 | Time: 71m 7s
Train Loss: 3.899 | Train PPL: 49.345
Val. Loss: 4.893 | Val. PPL: 133.352
training...
53320
evaluate...
13330
Epoch: 10 | Time: 70m 45s
Train Loss: 3.891 | Train PPL: 48.953
Val. Loss: 4.847 | Val. PPL: 127.324
evaluate...
770
| Test Loss: 4.781 | Test PPL: 119.255 | # model_Translate_3.pth
Translations of the same five sentences with this model:
'Câu chuyện này chưa kết thúc .' # This story is not over. (truth:This is not a finished story .)
this story is .
'Ông rút lui vào yên lặng .' # He retreated into silence.(truth:He retreated into silence .)
and it &apos s . .
'Ông qua đời , bị lịch sử quật ngã .' # He died, destroyed by history.(truth:He died broken by history .)
and he was the . .
'Ông là ông của tôi .' # You're my grandfather.(truth:He is my grandfather .)
he my grandfather .
'Tôi chưa bao giờ gặp ông ngoài đời .' # I've never met you in real life.(truth:I never knew him in real life .)
i never have the outside .
Retraining on an NVIDIA A100 with batch=128 and with both the embedding and hidden sizes increased:
Epoch: 10 | Time: 4m 22s
Train Loss: 3.702 | Train PPL: 40.528
Val. Loss: 4.826 | Val. PPL: 124.745
evaluate...
10
| Test Loss: 4.832 | Test PPL: 125.438 |
this is very very .
he goes on the wheelchair .
, there &apos s a , , there &apos s a . .
he &apos my grandfather grandfather grandfather grandfather .
i &apos never been never .
# again ...
Epoch: 01 | Time: 4m 49s
Train Loss: 3.695 | Train PPL: 40.254
Val. Loss: 4.585 | Val. PPL: 98.043
...
Epoch: 10 | Time: 4m 22s
Train Loss: 3.171 | Train PPL: 23.821
Val. Loss: 4.619 | Val. PPL: 101.385
| Test Loss: 4.760 | Test PPL: 116.739 |
, this is not very funny .
he &apos s a <unk> .
he died , , , , .
he &apos grandfather grandfather grandfather grandfather ... ('grandfather' repeats until the output length limit)
i never ve never been <unk> .
The model has overfit, and the quality is still poor.
Now training runs out of memory even with a very small batch, so I tried DDP, following this reference.
Problem:
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1631630839582/work/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 2.7.8
Attempted fix: start a fresh tmux session. Still unresolved; I gave up for now and will look at this and this when time allows.
CUDA_VISIBLE_DEVICES="0,6" python -m torch.distributed.launch --nproc_per_node 2 main.py
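For the record, a minimal DDP skeleton matching that launcher (everything here is an assumption sketched for illustration, not the failing script itself):

import argparse, os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def ddp_setup(model, dataset, batch_size):
    # torch.distributed.launch passes --local_rank (or sets LOCAL_RANK with --use_env).
    parser = argparse.ArgumentParser()
    parser.add_argument('--local_rank', type=int,
                        default=int(os.environ.get('LOCAL_RANK', 0)))
    local_rank = parser.parse_known_args()[0].local_rank
    dist.init_process_group(backend='nccl')   # one process per GPU
    torch.cuda.set_device(local_rank)
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    sampler = DistributedSampler(dataset)      # shards the data across ranks
    return model, DataLoader(dataset, batch_size=batch_size, sampler=sampler)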
(3) More…
For global attention, see [PyTorch] 11: Chatbot in Practice (processing the Cornell Movie-Dialogs Corpus, seq2seq with global attention); a minimal sketch of the scoring step is below.
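As a reminder of what global (dot-style Luong) attention computes, a sketch of the scoring step (shapes are assumptions):

import torch
import torch.nn.functional as F

def global_attention(hidden, encoder_outputs):
    # hidden: [batch, 1, hid]; encoder_outputs: [batch, src_len, hid]
    scores = torch.bmm(hidden, encoder_outputs.transpose(1, 2))  # dot score
    weights = F.softmax(scores, dim=-1)             # align over source positions
    context = torch.bmm(weights, encoder_outputs)   # [batch, 1, hid] context vector
    return context, weights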
2. German-English (4.5M)
WMT16_de_en: commoncrawl.de-en (2,399,123 pairs) and europarl-v7.de-en (1,920,201 pairs)
On the A100 (40536 MiB), batch=128 runs out of memory, so only the europarl-v7.de-en portion is used, with min_freq=30 and BATCH_SIZE = 96; results:
len of data:1782890
20950 36141
Epoch: 01 | Time: 156m 37s
Train Loss: 3.676 | Train PPL: 39.477
Val. Loss: 4.821 | Val. PPL: 124.048
Epoch: 02 | Time: 166m 32s
Train Loss: 3.018 | Train PPL: 20.441
Val. Loss: 4.911 | Val. PPL: 135.773
Epoch: 03 | Time: 164m 44s
Train Loss: 2.852 | Train PPL: 17.314
Val. Loss: 4.810 | Val. PPL: 122.768
Epoch: 04 | Time: 173m 0s
Train Loss: 2.751 | Train PPL: 15.665
Val. Loss: 4.800 | Val. PPL: 121.556
Epoch: 05 | Time: 165m 34s
Train Loss: 2.676 | Train PPL: 14.534
Val. Loss: 4.807 | Val. PPL: 122.314
Epoch: 06 | Time: 150m 3s
Train Loss: 2.623 | Train PPL: 13.775
Val. Loss: 4.810 | Val. PPL: 122.765
Epoch: 07 | Time: 151m 46s
Train Loss: 2.576 | Train PPL: 13.144
Val. Loss: 4.852 | Val. PPL: 127.980
Epoch: 08 | Time: 152m 2s
Train Loss: 2.547 | Train PPL: 12.771
Val. Loss: 4.873 | Val. PPL: 130.769
Epoch: 09 | Time: 149m 23s
Train Loss: 2.512 | Train PPL: 12.328
Val. Loss: 4.850 | Val. PPL: 127.678
Epoch: 10 | Time: 162m 51s
Train Loss: 2.489 | Train PPL: 12.044
Val. Loss: 4.841 | Val. PPL: 126.571
evaluate...
920
| Test Loss: 4.834 | Test PPL: 125.668 |
Shrinking the model (ENC_EMB_DIM … ATTN_DIM each set to half the values above) gives:
Epoch: 01 | Time: 163m 24s
Train Loss: 3.893 | Train PPL: 49.077
Val. Loss: 4.880 | Val. PPL: 131.603
Epoch: 02 | Time: 178m 20s
Train Loss: 3.204 | Train PPL: 24.628
Val. Loss: 4.860 | Val. PPL: 129.012
Epoch: 03 | Time: 169m 32s
Train Loss: 3.044 | Train PPL: 20.989
Val. Loss: 4.902 | Val. PPL: 134.551
Epoch: 04 | Time: 180m 45s
Train Loss: 2.956 | Train PPL: 19.223
Val. Loss: 4.854 | Val. PPL: 128.302
Epoch: 05 | Time: 166m 24s
Train Loss: 2.897 | Train PPL: 18.125
Val. Loss: 4.841 | Val. PPL: 126.576
Epoch: 06 | Time: 157m 6s
Train Loss: 2.853 | Train PPL: 17.344
Val. Loss: 4.814 | Val. PPL: 123.206
Epoch: 07 | Time: 162m 6s
Train Loss: 2.820 | Train PPL: 16.777
Val. Loss: 4.848 | Val. PPL: 127.505
Epoch: 08 | Time: 158m 49s
Train Loss: 2.792 | Train PPL: 16.315
Val. Loss: 4.850 | Val. PPL: 127.752
Epoch: 09 | Time: 164m 58s
Train Loss: 2.767 | Train PPL: 15.918
Val. Loss: 4.830 | Val. PPL: 125.264
Epoch: 10 | Time: 165m 25s
Train Loss: 2.751 | Train PPL: 15.664
Val. Loss: 4.830 | Val. PPL: 125.154
In short, it is just bad… I hear the TensorFlow model works well for others, so I plan to try that next.
'a fire restant repair cement for fire places, ovens, open fireplaces etc.' # (truth:feuerfester Reparaturkitt für Feuerungsanlagen, Öfen, offene Feuerstellen etc.)
proposed disposing <unk> <unk> <unk> <unk> the <unk> the patent <unk> fall <unk>
'Construction and repair of highways and...' # (truth:Der Bau und die Reparatur der Autostraßen...)
notably to <unk> to therapies to <unk> the to <unk> , <unk>
'An announcement must be commercial character.' # (truth:die Mitteilungen sollen den geschäftlichen kommerziellen Charakter tragen.)
specific nonsense time extremist Baltic need the to time time extremist 17 ourselves , <unk>
'Goods and services advancement through the P.O.Box system is NOT ALLOWED.' # (truth:der Vertrieb Ihrer Waren und Dienstleistungen durch das Postfach-System WIRD NICHT ZUGELASSEN.)
Dutch to statement important I <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk>
'Deliveries (spam) and other improper information deleted.' # (truth:die Werbeversande (Spam) und andere unkorrekte Informationen werden gelöscht.)
<unk> to particular items deal , <unk>
Some thoughts (Approach II.1)
According to this github, Approach II can apparently reach a very low PPL, plot the attention (a sketch follows the example below), and produce good translations:
Epoch: 10 | Time: 0m 28s
Train Loss: 1.463 | Train PPL: 4.318
Val. Loss: 3.299 | Val. PPL: 27.098
| Test Loss: 3.187 | Test PPL: 24.207 |
src = ['ein', 'schwarzer', 'hund', 'und', 'ein', 'gefleckter', 'hund', 'kämpfen', '.']
trg = ['a', 'black', 'dog', 'and', 'a', 'spotted', 'dog', 'are', 'fighting']
predicted trg = ['a', 'black', 'dog', 'and', 'a', 'spotted', 'dog', 'fighting', '.', '<eos>']
BLEU score = 29.20
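Plotting the attention map is straightforward once the decoder returns its weights; a minimal matplotlib sketch (an attention tensor of shape [trg_len, src_len] is an assumption about the model's return value):

import matplotlib.pyplot as plt

def plot_attention(src_tokens, trg_tokens, attention):
    # attention: [trg_len, src_len] weights from the decoder (assumed).
    fig, ax = plt.subplots()
    ax.matshow(attention.cpu().detach().numpy(), cmap='bone')
    ax.set_xticks(range(len(src_tokens)))
    ax.set_xticklabels(src_tokens, rotation=45)
    ax.set_yticks(range(len(trg_tokens)))
    ax.set_yticklabels(trg_tokens)
    plt.show()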
Running Approach II on this dataset (Multi30k), I get similar numbers…
But even though the Test PPL on Multi30k looks good, the words in ordinary sentences (the five above) never appear in that training set, so the output is all <unk> and the result is meaningless. Model quality really does depend on the dataset; note that roughly 90% of Multi30k sentences begin with A, An, or The.
Epoch: 10 | Time: 0m 35s
Train Loss: 1.571 | Train PPL: 4.811
Val. Loss: 3.388 | Val. PPL: 29.619
evaluate...
10
| Test Loss: 3.432 | Test PPL: 30.929 |
<unk> <unk> sidewalk <unk> sidewalk <unk> on on <unk> <unk> <unk> <unk> <unk> <unk>
<unk> the <unk> the <unk> the <unk> the <unk> a <unk>
<unk> <unk> <unk> <unk> <unk> a <unk>
<unk> the <unk> <unk> child <unk> <unk> a <unk>
<unk> operating <unk> the <unk> <unk> <unk> a <unk>
4. TensorFlow
The official install instructions do not work!!! Version 1.4 is required.
Reinstall TensorFlow:
conda install tensorflow==1.4
Big pitfall: that is the CPU build. Someone else's run finished in half an hour, while mine was not even half done after three hours. It should be:
conda install tensorflow-gpu==1.14
Pitfall: it does not run.
pip install tensorflow-gpu==1.4.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
Still does not run.
Installed Miniconda (official site, Miniconda3-latest-Linux-x86_64); after installing and configuring a few libraries and fixing some paths, it finally ran.
Training was then completed following this github; a reference command is sketched below.
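For reference, the attention-model training command from the tensorflow/nmt tutorial looks roughly like this (paths, sizes, and step counts are the README's placeholders, not my exact run):

python -m nmt.nmt \
    --attention=scaled_luong \
    --src=vi --tgt=en \
    --vocab_prefix=/tmp/nmt_data/vocab \
    --train_prefix=/tmp/nmt_data/train \
    --dev_prefix=/tmp/nmt_data/tst2012 \
    --test_prefix=/tmp/nmt_data/tst2013 \
    --out_dir=/tmp/nmt_attention_model \
    --num_train_steps=12000 --steps_per_stats=100 \
    --num_layers=2 --num_units=128 --dropout=0.2 --metrics=bleu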
Final comparison:
>'Câu chuyện này chưa kết thúc .'
truth:This is not a finished story .
Microsoft Translator:This story is not over.
Model_1_vi_en:this is the not . . <EOS>
Model_2_vi_en:this story is .
Model_3_vi_en:This story isn 't going to end .
>'Ông là ông của tôi .'
truth:He is my grandfather .
Microsoft Translator:You're my grandfather.
Model_1_vi_en:he was my . <EOS>
Model_2_vi_en:he my grandfather .
Model_3_vi_en:He was my grandfather .
>'Tôi chưa bao giờ gặp ông ngoài đời .'
truth:I never knew him in real life .
Microsoft Translator:I've never met you in real life.
Model_1_vi_en:i never never met him meet him . <EOS>
Model_2_vi_en:i never have the outside .
Model_3_vi_en:I never met him outside of life .
5. Evaluation metrics
The original paper reports two types of BLEU: (a) tokenized [1] BLEU, to compare against existing NMT work, and (b) NIST [2] BLEU, to compare against WMT results.
[1] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.
[2] I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.
(1) BLEU
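A minimal way to compute corpus BLEU over tokenized output (NLTK is my choice here, not necessarily what either paper's scripts used):

from nltk.translate.bleu_score import corpus_bleu

# references: one list of reference token lists per hypothesis.
references = [[['this', 'is', 'not', 'a', 'finished', 'story', '.']]]
hypotheses = [['this', 'is', 'not', 'a', 'story', '.']]
print(corpus_bleu(references, hypotheses))  # default: uniform 4-gram BLEU in [0, 1]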
(2) PPL
PPL [1], which derives from [2], is simply the exponential of the cross-entropy:
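With average per-token cross-entropy $H$ over $N$ tokens,

$$\mathrm{PPL} = \exp(H) = \exp\Big(-\frac{1}{N}\sum_{i=1}^{N}\log p(w_i \mid w_{<i})\Big)$$

which is why, for example, the Test Loss of 4.778 above corresponds to Test PPL $\exp(4.778) \approx 118.85$.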
Summary
The code is available on github: https://github.com/YoungSeng/Big_Data_Analytics_B
The TensorFlow model is clearly the one to go with in the end, but why did the earlier attempts do so badly?