[sphinx] Training a Chinese Language Model

weixin_33966095

License: CC 4.0 BY-SA

Original link: http://www.cnblogs.com/lijieqiong/p/4810863.html

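This article describes how to train a language model for short phrases, working through the official Sphinx tutorial as an example: preparing the text, generating the word-frequency file, creating the ARPA file, and so on. It then explains how the trained language model is used to recognize specific strings. Finally, it outlines the next steps: training on the full set of word strings with word segmentation, and recording audio to train an acoustic model.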

I. Training a language model for short phrases without word segmentation

Reference: http://cmusphinx.sourceforge.net/wiki/tutoriallm (the official Sphinx tutorial)

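1) Prepare the text

Create a text file with one word per line. Each line starts with <s> and ends with </s>, as shown below, with a space on both sides of the word. The file is UTF-8 encoded and named test.txt.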

<s> 苏菲 </s>
<s> 百事 </s>
<s> 雀巢 </s>
<s> 宝洁 </s>
<s> 壳牌 </s>
<s> 统一 </s>
<s> 高通 </s>
<s> 科勒 </s>
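
If the words start out as a plain list, the sentence markers can be added mechanically. The sketch below assumes a hypothetical input file words.txt (one word per line, UTF-8), which is not part of the original post, and writes test.txt in the format shown above.

```shell
#!/bin/sh
# Hypothetical input for illustration: words.txt, one word per line (UTF-8).
printf '%s\n' 苏菲 百事 雀巢 > words.txt

# Drop blank lines, then wrap each word in sentence markers,
# leaving one space on each side of the word.
sed -e '/^[[:space:]]*$/d' -e 's/^/<s> /' -e 's/$/ <\/s>/' words.txt > test.txt

cat test.txt
```

Blank lines are dropped so that no empty sentences reach text2wfreq.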

2) Upload the file to the server and generate the word-frequency (vocabulary) file

text2wfreq < test.txt | wfreq2vocab > test.vocab

The commands print the following while running:

text2wfreq : Reading text from standard input...
wfreq2vocab : Will generate a vocabulary containing the most
              frequent 20000 words. Reading wfreq stream from stdin...
text2wfreq : Done.
wfreq2vocab : Done.

The result is test.vocab, in the following format:

## Vocab generated by v2 of the CMU-Cambridge Statistcal
## Language Modeling toolkit.
##
## Includes 178 words ##
</s>
<s>
一号店
上好佳
上海滩
丝塔芙
丝芙兰

3) Generate the ARPA file

text2idngram -vocab test.vocab -idngram test.idngram < test.txt
idngram2lm -vocab_type 0 -idngram test.idngram -vocab test.vocab -arpa test.lm

The first command prints:

text2idngram
Vocab                  : test.vocab
Output idngram         : test.idngram
N-gram buffer size     : 100
Hash table size        : 2000000
Temp directory         : cmuclmtk-Mtadbf
Max open files         : 20
FOF size               : 10
n                      : 3
Initialising hash table...
Reading vocabulary... 
Allocating memory for the n-gram buffer...
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.

Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-Mtadbf/1
Merging 1 temporary files...

2-grams occurring:      N times         > N times       Sug. -spec_num value
      0                                             351             364
      1                             348               3              13
      2                               2               1              11
      3                               0               1              11
      4                               0               1              11
      5                               0               1              11
      6                               0               1              11
      7                               0               1              11
      8                               0               1              11
      9                               0               1              11
     10                               0               1              11

3-grams occurring:      N times         > N times       Sug. -spec_num value
      0                                             525             540
      1                             522               3              13
      2                               3               0              10
      3                               0               0              10
      4                               0               0              10
      5                               0               0              10
      6                               0               0              10
      7                               0               0              10
      8                               0               0              10
      9                               0               0              10
     10                               0               0              10
text2idngram : Done.

The result is test.idngram, a binary file; opened in a text editor it looks like this:

^@^@^@^A^@^@^@^B^@^@^@^C^@^@^@^A^@^@^@^A^@^@^@^B^@^@^@^D^@^@^@^A^@^@^@^A^@^@^@^B^@^@^@^E^@^@^@^A^@^@^@^A^@^@^@^B^@^@^@^F^@^@^@^A^@^@^@^A^@^@^@^B^@^@^@^G^@^@^@^A^@^@^@^A^@^@^@^B^@^@^@^H^@^@^@^A^@^@^@^A^@^@^@^B^@^@^@  ^@^@^@^A^@^@^@^A^@^@^@^B^@^@^@
@
@

The second command prints many warnings but still ends with "Done". This suggests that something is wrong with the language model.

Warning : P(2) = 0 (0 / 177)
ncount = 1
Warning : P(2) = 0 (0 / 177)
ncount = 1
Warning : P(2) = 0 (0 / 177)
ncount = 1
Warning : P(2) = 0 (0 / 177)
ncount = 1
......

Writing out language model...
ARPA-style 3-gram will be written to test.lm
idngram2lm : Done.

The result is test.lm. Opening it shows:

This is a CLOSED-vocabulary model
  (OOVs eliminated from training data and are forbidden in test data)
Good-Turing discounting was applied.
1-gram frequency of frequency : 174
2-gram frequency of frequency : 348 2 0 0 0 0 0
3-gram frequency of frequency : 522 3 0 0 0 0 0
1-gram discounting ratios : 0.99
2-gram discounting ratios : 0.00
3-gram discounting ratios : 0.00
This file is in the ARPA-standard format introduced by Doug Paul.

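My reading of this header is that only the 1-grams are usable and that the 2-grams and 3-grams are lacking. In fact, further down in the .lm file, the 2-gram pairs and 3-grams that are listed are delimited by line.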
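
One way to see what the file actually contains is to count the entries under each section header of the ARPA file. The sketch below is an assumption-laden illustration: it builds a tiny hand-written sample.lm (the probabilities are made-up placeholders, not toolkit output) and counts the lines under each `\N-grams:` header. Pointing the awk command at test.lm instead should give the real counts.

```shell
#!/bin/sh
# Tiny hand-written ARPA file used only to demonstrate the check;
# the numbers are placeholders and were not produced by idngram2lm.
cat > sample.lm <<'EOF'
\data\
ngram 1=3
ngram 2=2

\1-grams:
-0.4771 </s> 0.0000
-0.4771 <s> -0.3010
-0.4771 苏菲 -0.3010

\2-grams:
-0.3010 <s> 苏菲 0.0000
-0.3010 苏菲 </s> 0.0000

\end\
EOF

# Count the non-empty lines listed under each \N-grams: header.
awk '
    /^\\[0-9]+-grams:/ { section = substr($0, 2); sub(/:.*/, "", section); next }
    $0 == "\\end\\"     { section = "" }
    section != "" && NF > 0 { count[section]++ }
    END { for (s in count) print s, count[s] }
' sample.lm | sort > ngram_counts.txt

cat ngram_counts.txt
```

If the 2-gram and 3-gram sections turn out to be populated, the zeros in the header are discounting ratios rather than entry counts, so they need a separate explanation.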

II. Using the language model

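I use the Chinese acoustic model and Chinese dictionary provided on the official Sphinx site, together with the language model trained here, to recognize a specific set of strings. There are 160 words, a dictionary built from the pronunciations of those 160 words, and a large, rich acoustic model that covers them. So, logically, once recognition has found each individual character, the way the language model combines characters into words should let it produce the correct phrase.

pocketsphinx is installed on Windows and is run as follows: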

pocketsphinx_continuous.exe -inmic yes -lm test.lm -dict test.dic -hmm zh_broadcastnews_ptm256_8000

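Here -lm is given the .lm file exactly as generated, whereas the "secret manual" (武林秘籍) guide first converts the .lm model to DMP format and then passes that in. I am not sure whether this is where the problem lies.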
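
A separate thing that may be worth ruling out is a mismatch between the vocabulary and the dictionary. The sketch below lists vocabulary words that have no entry in the dictionary. The files sample.vocab and sample.dic are placeholders created for illustration, and the phone symbols in them are invented. It assumes the usual dictionary layout of one word followed by its phones per line.

```shell
#!/bin/sh
# Placeholder files for illustration only; the phone symbols are invented
# and do not come from the real Chinese dictionary.
cat > sample.vocab <<'EOF'
## Includes 5 words ##
</s>
<s>
百事
苏菲
雀巢
EOF
cat > sample.dic <<'EOF'
百事 b ai sh ib
苏菲 s u f ei
EOF

# Vocabulary words, minus comment lines and the sentence markers.
grep -v -e '^##' -e '^<s>$' -e '^</s>$' sample.vocab | LC_ALL=C sort -u > vocab_words.txt
# The first column of each dictionary line is the word itself.
awk '{ print $1 }' sample.dic | LC_ALL=C sort -u > dic_words.txt
# Lines only in the first file: vocabulary words without a pronunciation.
LC_ALL=C comm -23 vocab_words.txt dic_words.txt > missing_words.txt

cat missing_words.txt
```

An empty missing_words.txt means every vocabulary word has at least one pronunciation. If the dictionary marks alternate pronunciations with a suffix such as `(2)`, that suffix would need to be stripped before comparing.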

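III. Next steps

1) Take the full set of word strings, run them all through word segmentation, train a language model on the result, and use it with the existing acoustic model.

Online word-segmentation tools that can be used directly, leaving performance aside for now:

PHP word segmentation demo: http://www.phpbone.com/phpanalysis/demo.php?ac=done

SCWS Chinese word segmentation: http://www.xunsearch.com/scws/demo.php

NLPIR (NLP from the Institute of Computing Technology, Chinese Academy of Sciences): http://ictclas.nlpir.org/nlpir/ (all I can say is that this is what I imagine the fun side of NLP to be)

The output of these tools still needs post-processing, so it is not very practical for now.

2) Record 300 sentences, train an acoustic model, and use it with the corresponding language model.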

Reposted from: https://www.cnblogs.com/lijieqiong/p/4810863.html
