Neural Machine Translation (NMT)
Original paper and code: Neural Machine Translation. The goal is to reproduce "Effective Approaches to Attention-based Neural Machine Translation".
1. English-Vietnamese (133k)
The dataset from the original paper (the Vietnamese glosses below are from Microsoft Translator), iwslt15_en_vi, contains eight files: ['vocab.en', 'train.vi', 'tst2013.en', 'vocab.vi', 'tst2012.vi', 'tst2013.vi', 'tst2012.en', 'train.en']
Reading the 'train.en' file directly, many lines like these turn up:
'He 's trebled his crop income'
'because he 's now got toilets'
'I want to end by saying it 's been the actions'
Presumably these stand for 's and similar HTML escapes, so the following cleanup is applied:
lines_X = file_X.read().strip().replace('& amp ; quot ;', '"')\
    .replace('&apos;', "'").replace('&quot;', '"')\
    .replace('&amp;', '&').replace('&#91;', '[')\
    .replace('&#93;', ']').replace('& amp ;', '&').split('\n')
These calls undo the HTML escape characters; empty lines are then skipped and the data is assembled into pairs (a sketch follows the example below):
[...
['it s man made and can be overcome and eradicated by the actions of human beings . ', 'Nó là do con người và có thể ngăn chặn và diệt trừ bởi hành động của con người . '],
...]
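For reference, a minimal sketch of this pair-building step (the file paths and the exact replacement list are assumptions based on the snippet above):

def read_pairs(en_path='train.en', vi_path='train.vi'):
    # Undo the HTML escapes left in the IWSLT files, then split into lines.
    def clean(path):
        with open(path, encoding='utf-8') as f:
            text = f.read().strip()
        for src, dst in [('& amp ; quot ;', '"'), ('&apos;', "'"), ('&quot;', '"'),
                         ('&amp;', '&'), ('&#91;', '['), ('&#93;', ']'), ('& amp ;', '&')]:
            text = text.replace(src, dst)
        return text.split('\n')
    en_lines, vi_lines = clean(en_path), clean(vi_path)
    # Skip pairs where either side is an empty line.
    return [[en, vi] for en, vi in zip(en_lines, vi_lines) if en and vi]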
Even so, translation quality probably won't be great: even the last two entries of the raw dataset are misaligned:
English:
Didier Sornette : How we can predict the next financial crisis
The 2007-2008 financial crisis , you might think , was an unpredictable one-time crash . But Didier Sornette and his Financial Crisis Observatory have plotted a set of early warning signs for unstable , growing systems , tracking the moment when any bubble is about to pop .
Vietnamese:
Paul Pholeros : Làm sao để bớt nghèo khổ ? Hãy sửa nhà # Paul Pholeros: How to reduce poverty? Let's fix the house.
Năm 1985 , kiến trúc sư Paul Pholeros được giao nhiệm vụ " ngăn chặn người dân tiếp tục mắc bệnh " từ người chủ trung tâm y tế là 1 người thổ dân trong cộng đồng người thổ dân ở Nam Úc . Nhận thức cốt lõi : thay vì dùng thuốc , hãy cải thiện môi trường sống địa phương . Trong diễn văn sáng ngời này , Pholeros mô tả các dự án mà Healthabitat - tổ chức mà ông đang quản lý để giúp giảm nghèo - thực hiện bằng những thay đổi trong thiết kế -- ở Úc và nước ngoài . # In 1985, architect Paul Pholeros was tasked with "stopping people from getting sick" by the Aboriginal director of a health centre in an Aboriginal community in South Australia. The core insight: instead of medicine, improve the local living environment. In this bright talk, Pholeros describes projects that Healthabitat, the organization he runs to help reduce poverty, carried out through design changes, in Australia and abroad.
This is exactly how the data ships on the official site (train.en, train.vi), so there is nothing to do but press on.
(1) Approach I
Following [PyTorch] 6: French-English Translation with RNNs in Practice (attention-based seq2seq model, attention visualization), a small baseline was put together:
'Câu chuyện này chưa kết thúc .' # This story is not over.
this one is been . <EOS>
'Ông rút lui vào yên lặng .' # He retreated into silence.
he was at at . . <EOS>
'Ông qua đời , bị lịch sử quật ngã .' # He died, destroyed by history.
KeyError: 'quật'
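The KeyError comes from looking up a test word that never occurred in training. A minimal fix, assuming the tutorial-style Lang class with a word2index dict (the UNK index value is an assumption):

UNK_token = 3  # assuming 0-2 are already taken by SOS/EOS/PAD

def indexes_from_sentence(lang, sentence):
    # Fall back to UNK for out-of-vocabulary words instead of raising KeyError.
    return [lang.word2index.get(word, UNK_token) for word in sentence.split(' ')]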
With an unknown token added as above, MAX_LENGTH and n_iters can be tuned; with MAX_LENGTH=300 and n_iters=133000 × 10, the training run (NLLLoss):
Reading lines...
Read 133168 sentence pairs
Trimmed to 133168 sentence pairs
elapsed 2m 27s (est. remaining 3203m 24s) iter=1000 progress 0.077% Loss:5.5203
elapsed 3m 19s (est. remaining 2158m 21s) iter=2000 progress 0.154% Loss:5.0535
...
elapsed 1262m 17s (est. remaining 0m 58s) iter=1299000 progress 99.923% Loss:3.7528
elapsed 1263m 11s (est. remaining 0m 0s) iter=1300000 progress 100.000% Loss:3.7462
Translations of the following five sentences at this point:
'Câu chuyện này chưa kết thúc .' # This story is not over. (truth:This is not a finished story .)
this is the not . . <EOS>
'Ông rút lui vào yên lặng .' # He retreated into silence.(truth:He retreated into silence .)
he was the he . <EOS>
'Ông qua đời , bị lịch sử quật ngã .' # He died, destroyed by history.(truth:He died broken by history .)
he was through history , history . <EOS>
'Ông là ông của tôi .' # You're my grandfather.(truth:He is my grandfather .)
he was my . <EOS>
'Tôi chưa bao giờ gặp ông ngoài đời .' # I've never met you in real life.(truth:I never knew him in real life .)
i never never met him meet him . <EOS>
A quick look at the metrics: since the cross-entropy cannot be computed directly here, min(word index, sentence length) tokens are scored and their average taken as the per-sentence cross-entropy (a sketch below). Final Loss: 5.3968 (5.3950), PPL: 220.701 (220.305), BLEU: 0.0. The results are very poor.
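A sketch of that evaluation, assuming decoder_outputs holds per-step log-probabilities (log_softmax outputs, as NLLLoss expects) and target_ids the reference token indices:

import math
import torch.nn as nn

criterion = nn.NLLLoss()

def sentence_ce_and_ppl(decoder_outputs, target_ids):
    # Score only the overlapping prefix: min(decoded length, target length).
    n = min(len(decoder_outputs), len(target_ids))
    losses = [criterion(decoder_outputs[i].unsqueeze(0), target_ids[i].view(1))
              for i in range(n)]
    ce = sum(losses).item() / n       # average per-token cross-entropy
    return ce, math.exp(ce)           # (loss, perplexity)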
Trying other settings (Max_len = 50, dropout = 0.2, …) did not help either.
(2) Approach II
To address the unknown-word problem, I followed [PyTorch] 8: Language Translation with Torchtext in Practice (English-German translation, attention model, PyTorch 1.8 installation).
Install torchtext:
conda install torchtext
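The point of moving to torchtext is that its vocabulary maps rare and unseen words to <unk> automatically. A minimal sketch of the setup (legacy Field API, which lives under torchtext.legacy from torchtext 0.9 on; whitespace tokenization and the pair order are assumptions):

from torchtext.legacy.data import Field, Example, Dataset, BucketIterator

SRC = Field(tokenize=str.split, init_token='<sos>', eos_token='<eos>', lower=True)
TRG = Field(tokenize=str.split, init_token='<sos>', eos_token='<eos>', lower=True)
fields = [('src', SRC), ('trg', TRG)]
examples = [Example.fromlist(p, fields) for p in pairs]  # pairs built in section 1
train_data = Dataset(examples, fields)
# Words below min_freq map to <unk>, which is what fixes Approach I's KeyError.
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)
train_iterator = BucketIterator(train_data, batch_size=2, sort_key=lambda x: len(x.src))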
Pitfall: it fails to run with batch=4; with batch=2 the results are:
data process...
10779 12849
training...
53319
evaluate...
Epoch: 01 | Time: 79m 48s
Train Loss: 0.735 | Train PPL: 2.084
Val. Loss: 0.741 | Val. PPL: 2.098
...
Epoch: 10 | Time: 78m 41s
Train Loss: 0.620 | Train PPL: 1.858
Val. Loss: 0.680 | Val. PPL: 1.974
evaluate...
| Test Loss: 0.631 | Test PPL: 1.880 |
The numbers above came from mixing up the English and German tokens; with that fixed, a single epoch gives:
Epoch: 01 | Time: 107m 25s
Train Loss: 5.011 | Train PPL: 150.087
Val. Loss: 5.230 | Val. PPL: 186.728
evaluate...
770
| Test Loss: 5.123 | Test PPL: 167.875 | # model_Translate_4.pth
Ten epochs, batch=3:
Epoch: 01 | Time: 71m 48s
Train Loss: 5.107 | Train PPL: 165.135
Val. Loss: 5.311 | Val. PPL: 202.551
training...
35550
evaluate...
8880
Epoch: 02 | Time: 71m 10s
Train Loss: 4.460 | Train PPL: 86.481
Val. Loss: 5.114 | Val. PPL: 166.268
training...
35550
evaluate...
8880
Epoch: 03 | Time: 69m 9s
Train Loss: 4.209 | Train PPL: 67.264
Val. Loss: 4.970 | Val. PPL: 143.994
training...
35550
evaluate...
8880
Epoch: 04 | Time: 69m 58s
Train Loss: 4.086 | Train PPL: 59.483
Val. Loss: 4.926 | Val. PPL: 137.836
training...
35550
evaluate...
8880
Epoch: 05 | Time: 68m 24s
Train Loss: 4.012 | Train PPL: 55.260
Val. Loss: 4.908 | Val. PPL: 135.361
training...
35550
evaluate...
8880
Epoch: 06 | Time: 68m 40s
Train Loss: 3.963 | Train PPL: 52.638
Val. Loss: 4.844 | Val. PPL: 126.949
training...
35550
evaluate...
8880
Epoch: 07 | Time: 68m 2s
Train Loss: 3.925 | Train PPL: 50.658
Val. Loss: 4.823 | Val. PPL: 124.302
training...
35550
evaluate...
8880
Epoch: 08 | Time: 67m 31s
Train Loss: 3.889 | Train PPL: 48.869
Val. Loss: 4.835 | Val. PPL: 125.825
training...
35550
evaluate...
8880
Epoch: 09 | Time: 68m 35s
Train Loss: 3.867 | Train PPL: 47.806
Val. Loss: 4.860 | Val. PPL: 129.057
training...
35550
evaluate...
8880
Epoch: 10 | Time: 69m 16s
Train Loss: 3.845 | Train PPL: 46.767
Val. Loss: 4.828 | Val. PPL: 124.944
evaluate...
510
| Test Loss: 4.778 | Test PPL: 118.853 | # model_Translate_2.pth
Ten epochs, batch=2:
Epoch: 01 | Time: 105m 17s
Train Loss: 5.043 | Train PPL: 154.898
Val. Loss: 5.228 | Val. PPL: 186.510
training...
53320
evaluate...
13330
Epoch: 02 | Time: 79m 55s
Train Loss: 4.397 | Train PPL: 81.231
Val. Loss: 5.036 | Val. PPL: 153.777
training...
53320
evaluate...
13330
Epoch: 03 | Time: 83m 0s
Train Loss: 4.193 | Train PPL: 66.248
Val. Loss: 4.944 | Val. PPL: 140.263
training...
53320
evaluate...
13330
Epoch: 04 | Time: 82m 14s
Train Loss: 4.099 | Train PPL: 60.289
Val. Loss: 4.909 | Val. PPL: 135.474
training...
53320
evaluate...
13330
Epoch: 05 | Time: 82m 49s
Train Loss: 4.037 | Train PPL: 56.677
Val. Loss: 4.878 | Val. PPL: 131.420
training...
53320
evaluate...
13330
Epoch: 06 | Time: 83m 46s
Train Loss: 3.990 | Train PPL: 54.056
Val. Loss: 4.853 | Val. PPL: 128.156
training...
53320
evaluate...
13330
Epoch: 07 | Time: 85m 3s
Train Loss: 3.942 | Train PPL: 51.510
Val. Loss: 4.850 | Val. PPL: 127.755
training...
53320
evaluate...
13330
Epoch: 08 | Time: 83m 2s
Train Loss: 3.916 | Train PPL: 50.189
Val. Loss: 4.834 | Val. PPL: 125.679
training...
53320
evaluate...
13330
Epoch: 09 | Time: 71m 7s
Train Loss: 3.899 | Train PPL: 49.345
Val. Loss: 4.893 | Val. PPL: 133.352
training...
53320
evaluate...
13330
Epoch: 10 | Time: 70m 45s
Train Loss: 3.891 | Train PPL: 48.953
Val. Loss: 4.847 | Val. PPL: 127.324
evaluate...
770
| Test Loss: 4.781 | Test PPL: 119.255 | # model_Translate_3.pth
Translations of the same five sentences with this model:
'Câu chuyện này chưa kết thúc .' # This story is not over. (truth:This is not a finished story .)
this story is .
'Ông rút lui vào yên lặng .' # He retreated into silence.(truth:He retreated into silence .)
and it &apos s . .
'Ông qua đời , bị lịch sử quật ngã .' # He died, destroyed by history.(truth:He died broken by history .)
and he was the . .
'Ông là ông của tôi .' # You're my grandfather.(truth:He is my grandfather .)
he my grandfather .
'Tôi chưa bao giờ gặp ông ngoài đời .' # I've never met you in real life.(truth:I never knew him in real life .)
i never have the outside .
Retraining on an NVIDIA A100 with batch=128 and with both the embedding and hidden sizes increased:
Epoch: 10 | Time: 4m 22s
Train Loss: 3.702 | Train PPL: 40.528
Val. Loss: 4.826 | Val. PPL: 124.745
evaluate...
10
| Test Loss: 4.832 | Test PPL: 125.438 |
this is very very .
he goes on the wheelchair .
, there &apos s a , , there &apos s a . .
he &apos my grandfather grandfather grandfather grandfather .
i &apos never been never .
# again ...
Epoch: 01 | Time: 4m 49s
Train Loss: 3.695 | Train PPL: 40.254
Val. Loss: 4.585 | Val. PPL: 98.043
...
Epoch: 10 | Time: 4m 22s
Train Loss: 3.171 | Train PPL: 23.821
Val. Loss: 4.619 | Val. PPL: 101.385
| Test Loss: 4.760 | Test PPL: 116.739 |
, this is not very funny .
he &apos s a <unk> .
he died , , , , .
he &apos grandfather grandfather grandfather grandfather ... ('grandfather' repeats until the output length limit)
i never ve never been <unk> .
The model has overfit, and the quality is still poor.
Now training runs out of memory even with a very small batch, so I tried DDP, following this reference.
Problem:
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1631630839582/work/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 2.7.8
Attempted fix: start a fresh tmux session. Still unresolved; I gave up for now and will look at this and this when time allows.
CUDA_VISIBLE_DEVICES="0,6" python -m torch.distributed.launch --nproc_per_node 2 main.py
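For the record, a minimal DDP skeleton matching that launcher (everything here is an assumption sketched for illustration, not the failing script itself):

import argparse, os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def ddp_setup(model, dataset, batch_size):
    # torch.distributed.launch passes --local_rank (or sets LOCAL_RANK with --use_env).
    parser = argparse.ArgumentParser()
    parser.add_argument('--local_rank', type=int,
                        default=int(os.environ.get('LOCAL_RANK', 0)))
    local_rank = parser.parse_known_args()[0].local_rank
    dist.init_process_group(backend='nccl')   # one process per GPU
    torch.cuda.set_device(local_rank)
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    sampler = DistributedSampler(dataset)      # shards the data across ranks
    return model, DataLoader(dataset, batch_size=batch_size, sampler=sampler)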
(3) More…
For global attention, see [PyTorch] 11: Chatbot in Practice (processing the Cornell Movie-Dialogs Corpus, seq2seq with global attention); a minimal sketch of the scoring step is below.
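As a reminder of what global (dot-style Luong) attention computes, a sketch of the scoring step (shapes are assumptions):

import torch
import torch.nn.functional as F

def global_attention(hidden, encoder_outputs):
    # hidden: [batch, 1, hid]; encoder_outputs: [batch, src_len, hid]
    scores = torch.bmm(hidden, encoder_outputs.transpose(1, 2))  # dot score
    weights = F.softmax(scores, dim=-1)             # align over source positions
    context = torch.bmm(weights, encoder_outputs)   # [batch, 1, hid] context vector
    return context, weights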
2. German-English (4.5M)
WMT16_de_en: commoncrawl.de-en (2,399,123 pairs) and europarl-v7.de-en (1,920,201 pairs)
On the A100 (40536 MiB), batch=128 runs out of memory, so only the europarl-v7.de-en portion is used, with min_freq=30 and BATCH_SIZE = 96; results:
len of data:1782890
20950 36141
Epoch: 01 | Time: 156m 37s
Train Loss: 3.676 | Train PPL: 39.477
Val. Loss: 4.821 | Val. PPL: 124.048
Epoch: 02 | Time: 166m 32s
Train Loss: 3.018 | Train PPL: 20.441
Val. Loss: 4.911 | Val. PPL: 135.773
Epoch: 03 | Time: 164m 44s
Train Loss: 2.852 | Train PPL: 17.314
Val. Loss: 4.810 | Val. PPL: 122.768
Epoch: 04 | Time: 173m 0s
Train Loss: 2.751 | Train PPL: 15.665
Val. Loss: 4.800 | Val. PPL: 121.556
Epoch: 05 | Time: 165m 34s
Train Loss: 2.676 | Train PPL: 14.534
Val. Loss: 4.807 | Val. PPL: 122.314
Epoch: 06 | Time: 150m 3s
Train Loss: 2.623 | Train PPL: 13.775
Val. Loss: 4.810 | Val. PPL: 122.765
Epoch: 07 | Time: 151m 46s
Train Loss: 2.576 | Train PPL: 13.144
Val. Loss: 4.852 | Val. PPL: 127.980
Epoch: 08 | Time: 152m 2s
Train Loss: 2.547 | Train PPL: 12.771
Val. Loss: 4.873 | Val. PPL: 130.769
Epoch: 09 | Time: 149m 23s
Train Loss: 2.512 | Train PPL: 12.328
Val. Loss: 4.850 | Val. PPL: 127.678
Epoch: 10 | Time: 162m 51s
Train Loss: 2.489 | Train PPL: 12.044
Val. Loss: 4.841 | Val. PPL: 126.571
evaluate...
920
| Test Loss: 4.834 | Test PPL: 125.668 |
Shrinking the model (ENC_EMB_DIM … ATTN_DIM each set to half the values above) gives:
Epoch: 01 | Time: 163m 24s
Train Loss: 3.893 | Train PPL: 49.077
Val. Loss: 4.880 | Val. PPL: 131.603
Epoch: 02 | Time: 178m 20s
Train Loss: 3.204 | Train PPL: 24.628
Val. Loss: 4.860 | Val. PPL: 129.012
Epoch: 03 | Time: 169m 32s
Train Loss: 3.044 | Train PPL: 20.989
Val. Loss: 4.902 | Val. PPL: 134.551
Epoch: 04 | Time: 180m 45s
Train Loss: 2.956 | Train PPL: 19.223
Val. Loss: 4.854 | Val. PPL: 128.302
Epoch: 05 | Time: 166m 24s
Train Loss: 2.897 | Train PPL: 18.125
Val. Loss: 4.841 | Val. PPL: 126.576
Epoch: 06 | Time: 157m 6s
Train Loss: 2.853 | Train PPL: 17.344
Val. Loss: 4.814 | Val. PPL: 123.206
Epoch: 07 | Time: 162m 6s
Train Loss: 2.820 | Train PPL: 16.777
Val. Loss: 4.848 | Val. PPL: 127.505
Epoch: 08 | Time: 158m 49s
Train Loss: 2.792 | Train PPL: 16.315
Val. Loss: 4.850 | Val. PPL: 127.752
Epoch: 09 | Time: 164m 58s
Train Loss: 2.767 | Train PPL: 15.918
Val. Loss: 4.830 | Val. PPL: 125.264
Epoch: 10 | Time: 165m 25s
Train Loss: 2.751 | Train PPL: 15.664
Val. Loss: 4.830 | Val. PPL: 125.154
In short, it is just bad… I hear the TensorFlow model works well for others, so I plan to try that next.
'a fire restant repair cement for fire places, ovens, open fireplaces etc.' # (truth:feuerfester Reparaturkitt für Feuerungsanlagen, Öfen, offene Feuerstellen etc.)
proposed disposing <unk> <unk> <unk> <unk> the <unk> the patent <unk> fall <unk>
'Construction and repair of highways and...' # (truth:Der Bau und die Reparatur der Autostraßen...)
notably to <unk> to therapies to <unk> the to <unk> , <unk>
'An announcement must be commercial character.' # (truth:die Mitteilungen sollen den geschäftlichen kommerziellen Charakter tragen.)
specific nonsense time extremist Baltic need the to time time extremist 17 ourselves , <unk>
'Goods and services advancement through the P.O.Box system is NOT ALLOWED.' # (truth:der Vertrieb Ihrer Waren und Dienstleistungen durch das Postfach-System WIRD NICHT ZUGELASSEN.)
Dutch to statement important I <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk>
'Deliveries (spam) and other improper information deleted.' # (truth:die Werbeversande (Spam) und andere unkorrekte Informationen werden gelöscht.)
<unk> to particular items deal , <unk>
Some thoughts (Approach II.1)
According to this github, Approach II can apparently reach a very low PPL, plot the attention (a sketch follows the example below), and produce good translations:
Epoch: 10 | Time: 0m 28s
Train Loss: 1.463 | Train PPL: 4.318
Val. Loss: 3.299 | Val. PPL: 27.098
| Test Loss: 3.187 | Test PPL: 24.207 |
src = ['ein', 'schwarzer', 'hund', 'und', 'ein', 'gefleckter', 'hund', 'kämpfen', '.']
trg = ['a', 'black', 'dog', 'and', 'a', 'spotted', 'dog', 'are', 'fighting']
predicted trg = ['a', 'black', 'dog', 'and', 'a', 'spotted', 'dog', 'fighting', '.', '<eos>']
BLEU score = 29.20
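Plotting the attention map is straightforward once the decoder returns its weights; a minimal matplotlib sketch (an attention tensor of shape [trg_len, src_len] is an assumption about the model's return value):

import matplotlib.pyplot as plt

def plot_attention(src_tokens, trg_tokens, attention):
    # attention: [trg_len, src_len] weights from the decoder (assumed).
    fig, ax = plt.subplots()
    ax.matshow(attention.cpu().detach().numpy(), cmap='bone')
    ax.set_xticks(range(len(src_tokens)))
    ax.set_xticklabels(src_tokens, rotation=45)
    ax.set_yticks(range(len(trg_tokens)))
    ax.set_yticklabels(trg_tokens)
    plt.show()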
Running Approach II on this dataset (Multi30k), I get similar numbers…
But even though the Test PPL on Multi30k looks good, the words in ordinary sentences (the five above) never appear in that training set, so the output is all <unk> and the result is meaningless. Model quality really does depend on the dataset; note that roughly 90% of Multi30k sentences begin with A, An, or The.
Epoch: 10 | Time: 0m 35s
Train Loss: 1.571 | Train PPL: 4.811
Val. Loss: 3.388 | Val. PPL: 29.619
evaluate...
10
| Test Loss: 3.432 | Test PPL: 30.929 |
<unk> <unk> sidewalk <unk> sidewalk <unk> on on <unk> <unk> <unk> <unk> <unk> <unk>
<unk> the <unk> the <unk> the <unk> the <unk> a <unk>
<unk> <unk> <unk> <unk> <unk> a <unk>
<unk> the <unk> <unk> child <unk> <unk> a <unk>
<unk> operating <unk> the <unk> <unk> <unk> a <unk>
4. TensorFlow
The official install instructions do not work!!! Version 1.4 is required.
Reinstall TensorFlow:
conda install tensorflow==1.4
Big pitfall: that is the CPU build. Someone else's run finished in half an hour, while mine was not even half done after three hours. It should be:
conda install tensorflow-gpu==1.14
Pitfall: it does not run.
pip install tensorflow-gpu==1.4.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
Still does not run.
Installed Miniconda (official site, Miniconda3-latest-Linux-x86_64); after installing and configuring a few libraries and fixing some paths, it finally ran.
Training was then completed following this github; a reference command is sketched below.
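For reference, the attention-model training command from the tensorflow/nmt tutorial looks roughly like this (paths, sizes, and step counts are the README's placeholders, not my exact run):

python -m nmt.nmt \
    --attention=scaled_luong \
    --src=vi --tgt=en \
    --vocab_prefix=/tmp/nmt_data/vocab \
    --train_prefix=/tmp/nmt_data/train \
    --dev_prefix=/tmp/nmt_data/tst2012 \
    --test_prefix=/tmp/nmt_data/tst2013 \
    --out_dir=/tmp/nmt_attention_model \
    --num_train_steps=12000 --steps_per_stats=100 \
    --num_layers=2 --num_units=128 --dropout=0.2 --metrics=bleu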
Final comparison:
>'Câu chuyện này chưa kết thúc .'
truth:This is not a finished story .
Microsoft Translator:This story is not over.
Model_1_vi_en:this is the not . . <EOS>
Model_2_vi_en:this story is .
Model_3_vi_en:This story isn 't going to end .
>'Ông là ông của tôi .'
truth:He is my grandfather .
Microsoft Translator:You're my grandfather.
Model_1_vi_en:he was my . <EOS>
Model_2_vi_en:he my grandfather .
Model_3_vi_en:He was my grandfather .
>'Tôi chưa bao giờ gặp ông ngoài đời .'
truth:I never knew him in real life .
Microsoft Translator:I've never met you in real life.
Model_1_vi_en:i never never met him meet him . <EOS>
Model_2_vi_en:i never have the outside .
Model_3_vi_en:I never met him outside of life .
5. Evaluation metrics
The original paper reports two types of BLEU: (a) tokenized [1] BLEU, to compare against existing NMT work, and (b) NIST [2] BLEU, to compare against WMT results.
[1] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.
[2] I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.
(1) BLEU
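A minimal way to compute corpus BLEU over tokenized output (NLTK is my choice here, not necessarily what either paper's scripts used):

from nltk.translate.bleu_score import corpus_bleu

# references: one list of reference token lists per hypothesis.
references = [[['this', 'is', 'not', 'a', 'finished', 'story', '.']]]
hypotheses = [['this', 'is', 'not', 'a', 'story', '.']]
print(corpus_bleu(references, hypotheses))  # default: uniform 4-gram BLEU in [0, 1]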
(2) PPL
PPL [1], which derives from [2], is simply the exponential of the cross-entropy:
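With average per-token cross-entropy $H$ over $N$ tokens,

$$\mathrm{PPL} = \exp(H) = \exp\Big(-\frac{1}{N}\sum_{i=1}^{N}\log p(w_i \mid w_{<i})\Big)$$

which is why, for example, the Test Loss of 4.778 above corresponds to Test PPL $\exp(4.778) \approx 118.85$.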
Summary
The code is available on github: https://github.com/YoungSeng/Big_Data_Analytics_B
The TensorFlow model is clearly the one to go with in the end, but why did the earlier attempts do so badly?