Environment: Ubuntu 12.04, Kaldi
Applications of deep learning to NLP (see this article for details: http://licstar.net/archives/328) involve the concept of a word vector (in English: distributed representation, word representation, or word embedding). Mikolov's RNNLM involves training word vectors, and the Kaldi tools include an implementation.
1. Switched to the Kaldi directory /u01/kaldi/tools, but found no rnnlm directory. The installed version may be old, so download the directory directly:
svn co https://svn.code.sf.net/p/kaldi/code/trunk/tools/rnnlm-hs-0.1b
2. Compile:
cd rnnlm-hs-0.1b
make
This produces the rnnlm executable.
jerry@hq:/u01/kaldi/tools/rnnlm-hs-0.1b$ ./rnnlm
RNNLM based on WORD VECTOR estimation toolkit v 0.1b
Options:
Parameters for training:
-train <file>
Use text data from <file> to train the model
-valid <file>
Use text data from <file> to perform validation and control learning rate
-test <file>
Use text data from <file> to compute logprobs with an existing model
-rnnlm <file>
Use <file> to save the resulting language model
-hidden <int>
Set size of hidden layer; default is 100
-bptt <int>
Set length of BPTT unfolding; default is 3; set to 0 to disable truncation
-bptt-block <int>
Set period of BPTT unfolding; default is 10; BPTT is performed each bptt+bptt_block steps
-gen <int>
Sampling mode; number of sentences to sample, default is 0 (off); enter negative number for interactive mode
-threads <int>
Use <int> threads (default 1)
-min-count <int>
This will discard words that appear less than <int> times; default is 0
-alpha <float>
Set the starting learning rate; default is 0.1
-maxent-alpha <float>
Set the starting learning rate for maxent; default is 0.1
-reject-threshold <float>
Reject nnet and reload nnet from previous epoch if the relative entropy improvement on the validation set is below this threshold (default 0.997)
-stop <float>
Stop training when the relative entropy improvement on the validation set is below this threshold (default 1.003); see also -retry
-retry <int>
Stop training iff <int> retries with halving learning rate have failed (default 2)
-debug <int>
Set the debug mode (default = 2 = more info during training)
-direct-size <int>
Set the size of hash for maxent parameters, in millions (default 0 = maxent off)
-direct-order <int>
Set the order of n-gram features to be used in maxent (default 3)
-beta1 <float>
L2 regularisation parameter for RNNLM weights (default 1e-6)
-beta2 <float>
L2 regularisation parameter for maxent weights (default 1e-6)
-recompute-counts <int>
Recompute train words counts, useful for fine-tuning (default = 0 = use counts stored in the vocab file)
Examples:
./rnnlm -train data.txt -valid valid.txt -rnnlm result.rnnlm -debug 2 -hidden 200
3. Try the wsj example from Kaldi.
Download a repository that contains wsj: git clone https://github.com/foundintranslation/Kaldi.git
Copy the example into the tree: cp wsj/s1 /u01/kaldi/egs/wsj/ -Rf
It turns out the wsj data source requires the original DVDs, which are not available here, so this route is a dead end.
4. Download the two files from http://www.fit.vutbr.cz/~imikolov/rnnlm/; they contain the program and the examples. Unpack Basic_examples, which includes a data directory:
jerry@hq:/u01/kaldi/tools/rnnlm-hs-0.1b$ ls /u01/jerry/simple-examples/data
ptb.char.test.txt ptb.char.train.txt ptb.char.valid.txt ptb.test.txt ptb.train.txt ptb.valid.txt README
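Before training, the counts that rnnlm prints at startup ("Vocab size" and "Words in train file") can be sanity-checked directly from ptb.train.txt. The sketch below is an assumption about how the tool counts (it appears to add one end-of-sentence token `</s>` per line, which is what makes the totals come out), not code from rnnlm.c:

```python
from collections import Counter

def corpus_stats(path):
    """Count word types and tokens the way rnnlm appears to:
    whitespace-split tokens plus one </s> per line."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
            counts["</s>"] += 1  # assumed end-of-sentence token per line
    return len(counts), sum(counts.values())
```

Run on ptb.train.txt, this should reproduce the "Vocab size: 10000" and "Words in train file: 929589" lines in the log below.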
Start training the word vectors:
jerry@hq:/u01/kaldi/tools/rnnlm-hs-0.1b$ ./rnnlm -train /u01/jerry/simple-examples/data/ptb.train.txt -valid /u01/jerry/simple-examples/data/ptb.valid.txt -rnnlm result.rnnlm -debug 2 -hidden 100
Vocab size: 10000
Words in train file: 929589
Starting training using file /u01/jerry/simple-examples/data/ptb.train.txt
Iteration 0 Valid Entropy 9.457519
Alpha: 0.100000 ME-alpha: 0.100000 Progress: 99.11% Words/thread/sec: 28.40k Iteration 1 Valid Entropy 8.416857
Alpha: 0.100000 ME-alpha: 0.100000 Progress: 99.11% Words/thread/sec: 28.18k Iteration 2 Valid Entropy 8.203366
Alpha: 0.100000 ME-alpha: 0.100000 Progress: 99.11% Words/thread/sec: 27.98k Iteration 3 Valid Entropy 8.090350
Alpha: 0.100000 ME-alpha: 0.100000 Progress: 99.11% Words/thread/sec: 27.25k Iteration 4 Valid Entropy 8.026399
Alpha: 0.100000 ME-alpha: 0.100000 Progress: 99.11% Words/thread/sec: 27.35k Iteration 5 Valid Entropy 7.979509
Alpha: 0.100000 ME-alpha: 0.100000 Progress: 99.11% Words/thread/sec: 27.43k Iteration 6 Valid Entropy 7.949336
Alpha: 0.100000 ME-alpha: 0.100000 Progress: 99.11% Words/thread/sec: 27.35k Iteration 7 Valid Entropy 7.931067 Decay started
Alpha: 0.050000 ME-alpha: 0.050000 Progress: 99.11% Words/thread/sec: 28.55k Iteration 8 Valid Entropy 7.827513
Alpha: 0.025000 ME-alpha: 0.025000 Progress: 99.11% Words/thread/sec: 28.37k Iteration 9 Valid Entropy 7.759574
Alpha: 0.012500 ME-alpha: 0.012500 Progress: 99.11% Words/thread/sec: 28.45k Iteration 10 Valid Entropy 7.714383
Alpha: 0.006250 ME-alpha: 0.006250 Progress: 99.11% Words/thread/sec: 28.51k Iteration 11 Valid Entropy 7.684731
Alpha: 0.003125 ME-alpha: 0.003125 Progress: 99.11% Words/thread/sec: 28.64k Iteration 12 Valid Entropy 7.668839 Retry 1/2
Alpha: 0.001563 ME-alpha: 0.001563 Progress: 99.11% Words/thread/sec: 28.25k Iteration 13 Valid Entropy 7.668437 Retry 2/2
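The "Decay started" and "Retry n/2" markers in the log follow from the -stop and -retry options described above: once the relative entropy improvement on the validation set drops below the -stop threshold, the learning rate is halved each epoch, and training stops after the configured number of failed retries. A minimal sketch of that schedule (logic inferred from the option descriptions, not taken from rnnlm.c), replayed over the validation entropies above:

```python
def simulate_schedule(valid_entropies, alpha=0.1, stop=1.003, max_retries=2):
    """Replay the per-epoch alpha decisions for a sequence of valid entropies."""
    decay, retries, alphas = False, 0, []
    prev = valid_entropies[0]
    for ent in valid_entropies[1:]:
        alphas.append(alpha)           # alpha used for this epoch
        if prev / ent < stop:          # improvement below the -stop threshold
            if not decay:
                decay = True           # "Decay started": begin halving alpha
            else:
                retries += 1           # "Retry n/2"
                if retries >= max_retries:
                    break              # give up after -retry failed retries
        if decay:
            alpha /= 2
        prev = ent
    return alphas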
jerry@hq:/u01/kaldi/tools/rnnlm-hs-0.1b$ ls -l
total 8184
-rw-rw-r-- 1 jerry jerry 11358 Aug 25 15:08 LICENSE
-rw-rw-r-- 1 jerry jerry 407 Aug 25 15:08 Makefile
-rw-rw-r-- 1 jerry jerry 8325 Aug 25 15:08 README.txt
-rw-rw-r-- 1 jerry jerry 109943 Aug 25 18:05 result.rnnlm
-rw-rw-r-- 1 jerry jerry 8040020 Aug 25 18:05 result.rnnlm.nnet
-rwxrwxr-x 1 jerry jerry 142501 Aug 25 15:08 rnnlm
-rw-rw-r-- 1 jerry jerry 33936 Aug 25 15:08 rnnlm.c
jerry@hq:/u01/kaldi/tools/rnnlm-hs-0.1b$
vi result.rnnlm
</s> 42068
the 50770
<unk> 45020
N 32481
of 24400
to 23638
a 21196
in 18000
and 17474
's 9784
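The head of result.rnnlm is the vocabulary: one word and its training-corpus count per line, including the special tokens `</s>` (sentence end) and `<unk>`. A small sketch for reading those pairs, assuming the two-column format shown above (the file name is the one produced by the training run):

```python
def load_vocab(path):
    """Return (word, count) pairs from the vocabulary section, in file order."""
    vocab = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2 or not parts[1].isdigit():
                break  # stop once past the word/count section
            vocab.append((parts[0], int(parts[1])))
    return vocab
```

For example, `load_vocab("result.rnnlm")[0]` would give `("</s>", 42068)` for the file shown above.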
Source: ITPUB blog, http://blog.itpub.net/16582684/viewspace-1257524/ (please credit the source when reposting).