我们使用/icwb2-data.rar/training/msr_training.utf8 用以训练模型,这个词库里包含已分词汇约2000000个。
使用经典的字符标注模型,首先需要确定标注集,在前面的介绍中,我们使用的是{B,E}的二元集合,研究表明基于四类标签的字符标注模型明显优于两类标签,原因是两类标签过于简单而损失了部分信息。四类标签的集合是 {B,E,M,S},其含义如下:
B:一个词的开始
E:一个词的结束
M:一个词的中间
S:单字成词
举例:你S现B在E应B该E去S幼B儿M园E了S
用四类标签为msr_training.utf8做好标记后,就可以开始用统计的方法构建一个HMM。我们打算构建一个2-gram(bigram)语言模型,也即一个1阶HMM,每个字符的标签分类只受前一个字符分类的影响。现在,我们需要求得HMM的状态转移矩阵 A 以及混合矩阵 B。其中:
Aij = P(Cj|Ci) = P(Ci,Cj) / P(Ci) = Count(Ci,Cj) / Count(Ci)
Bij = P(Oj|Ci) = P(Oj,Ci) / P(Ci) = Count(Oj,Ci) / Count(Ci)
公式中C = {B,E,M,S},O = {字符集合},Count代表频率。在计算Bij时,由于数据的稀疏性,很多字符未出现在训练集中,这导致概率为0的结果出现在B中,为了修补这个问题,我们采用加1的数据平滑技术,即:
Bij = P(Oj|Ci) = (Count(Oj,Ci) + 1)/ Count(Ci)
这并不是一种最好的处理技术,因为这有可能低估或高估真实概率,更加科学的方法是使用复杂一点的Good—Turing技术,这项技术的的原始版本是图灵当年和他的助手Good在破解德国密码机时发明的。
我完成的分词器暂时没有用这个技术,只是简单的为没出现的词赋一个很小的值来实现简单的情况模拟,后续会尝试使用Good-Turing技术,试下效果怎么样。
求得的PI跟矩阵A如下:
private static double[] PI = new double[] {0.529918835192331, 0.0, 0.0, 0.470081164807669}; //B, M, E, S
private static double[][] A = new double[][] {
{0.0, 0.17142595344427136, 0.8285740465557286, 0.0}, //B
{0.0, 0.49616531193870117, 0.5038346880612988, 0.0}, //M
{0.4741680643477776, 0.0, 0.0, 0.5258319356522224}, //E
{0.5927662448712697, 0.0, 0.0, 0.4072328569272888} //S
};
相关训练代码:
public static void buildPiAndMatrixA() {
/**
* count matrix:
* ALL B M E S
* B * * * * *
* M * * * * *
* E * * * * *
* S * * * * *
*
* NOTE:
* count[2][0] is the total number of complex words
* count[3][0] is the total number of single words
*/
long[][] count = new long[4][5];
try {
BufferedReader br=new BufferedReader(new InputStreamReader(new FileInputStream("icwb2-data/training/msr_training.utf8"),"UTF-8"));
String line = null;
String last = null;
while ((line = br.readLine()) != null) {
String[] words = line.split(" ");
for (int i=0; i<words.length; i++) {
String word = words[i].trim();
int length = word.length();
if (length < 1)
continue;
if (length == 1) {
count[3][0]++;
if (last != null) {
if (last.length() == 1)
count[3][4]++;
else
count[2][4]++;
}
} else {
count[2][0]++;
count[0][0]++;
if (length > 2) {
count[1][0] += length-2;
count[0][2]++;
if (length-2 > 1) {
count[1][2] += length-3;
}
count[1][3]++;
} else {
count[0][3]++;
}
if (last != null) {
if (last.length() == 1) {
count[3][1]++;
} else {
count[2][1]++;
}
}
}
last = word;
}
//System.out.println("Finish " + words.length + " words ...");
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
for (int i=0; i<count.length; i++)
System.out.println(Arrays.toString(count[i]));
System.out.println(" ===== So Pi array is: ===== ");
double[] pi = new double[4];
long allWordCount = count[2][0] + count[3][0];
pi[0] = (double)count[2][0] / allWordCount;
pi[3] = (double)count[3][0] / allWordCount;
System.out.println(Arrays.toString(pi));
System.out.println(" ===== And A matrix is: ===== ");
double[][] A = new double[4][4];
for (int i=0; i<A.length; i++)
for (int j=0; j<A[i].length; j++)
A[i][j] = (double)count[i][j+1]/ count[i][0];
for (int i=0; i<A.length; i++)
System.out.println(Arrays.toString(A[i]));
}
矩阵中出现的概率为0的元素表明B-B, B-S, M-B, M-S, E-M, E-E, S-M, S-E这8种组合是不可能出现的。这是合乎逻辑的。
矩阵B内容比较大,格式如下:
1.0 3.174041419653506E-6 0.0016687522763828306 1.0633038755839245E-4 4.7610621294802586E-6
1.0 2.313791818894887E-6 5.738203710859319E-4 4.8589628196792625E-5 2.313791818894887E-6
1.0 7.935103549133765E-7 0.005299062150111528 6.348082839307012E-6 7.935103549133765E-7
1.0 1.7881026800082968E-6 0.005061224635763484 4.470256700020742E-6 8.940513400041484E-7
矩阵B训练代码:
public static void buildMatrixB(String charMapFile, String charMapCharset, String matrixBFileName) {
/**
* Chinese Character count => 5167
*
* count matrix:
* ALL C1 C2 C3 CN C5168
* B * * * * * 1/ALL+5168
* M * * * * * 1/ALL+5168
* E * * * * * 1/ALL+5168
* S * * * * * 1/ALL+5168
*
* NOTE:
* count[0][0] is the total number of begin count
* count[0][0] is the total number of middle count
* count[2][0] is the total number of end count
* count[3][0] is the total number of single cound
*
* B row -> 4
* B col -> 5169
*/
long[][] matrixBCount = new long[4][5169];
for (int row = 0; row < matrixBCount.length; row++) {
Arrays.fill(matrixBCount[row], 1);
matrixBCount[row][0] = 5168;
}
Map<Character, Integer> dict = new HashMap<Character, Integer>();
Utils.readDict(charMapFile, charMapCharset, dict, null);
try {
BufferedReader br=new BufferedReader(new InputStreamReader(new FileInputStream("icwb2-data/training/msr_training.utf8"),"UTF-8"));
String line = null;
while ((line = br.readLine()) != null) {
String[] words = line.split(" ");
for (int i=0; i<words.length; i++) {
String word = words[i].trim();
if (word.length() < 1)
continue;
if (word.length() == 1) {
int index = dict.get(word.charAt(0));
matrixBCount[3][0]++;
matrixBCount[3][index]++;
} else {
for (int j=0; j<word.length(); j++) {
int index = dict.get(word.charAt(j));
if (j == 0) {
matrixBCount[0][0]++;
matrixBCount[0][index]++;
} else if (j == word.length()-1) {
matrixBCount[2][0]++;
matrixBCount[2][index]++;
} else {
matrixBCount[1][0]++;
matrixBCount[1][index]++;
}
}
}
}
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
System.out.println(" ===== matrixBCount ===== ");
for (int i=0; i<matrixBCount.length; i++)
System.out.println(Arrays.toString(matrixBCount[i]));
System.out.println(" ========= B matrix =========");
double[][] B = new double[matrixBCount.length][matrixBCount[0].length];
for (int row = 0; row < B.length; row++) {
for (int col = 0; col < B[row].length; col++) {
B[row][col] = (double) matrixBCount[row][col] / matrixBCount[row][0];
if (col < 50) {
System.out.print(B[row][col] + " ");
}
}
System.out.println("");
}
try {
PrintWriter bOut = new PrintWriter(new File(matrixBFileName));
for (int row = 0; row < B.length; row++) {
for (int col = 0; col < B[row].length; col++) {
bOut.print(B[row][col] + " ");
}
bOut.println("");
bOut.flush();
}
bOut.close();
System.out.println("Finish write B to file " + matrixBFileName);
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}
有了矩阵PI,A跟B,我们就可以写入一个观察序列,用隐含马尔可夫模型跟Viterbi算法获得一个隐藏序列(分词结果)了。如果你对隐含马尔可夫模型还有什么疑问,请参考52nlp的博文:
http://www.52nlp.cn/hmm-learn-best-practices-one-introduction
以下是我用Java自己实现的Viterbi算法,有一点需要注意的是,Java的double能表示最小的值约为1E-350,如果一个文本串很长,例如大于200个字符,算得的结果值很有可能会小于double的最小值,而由于精度问题变为0,这样最终的计算结果就失去意义了。当然,如果能保证输入串的长度比较短,可以不care这个,但为了程序的健壮性,我这里在计算某一列的结果值小于1E-250时,将停止使用double而改用Java提供的高精度类BigDecimal,虽然计算速度会比double慢很多(尤其随着串越来越长),但总比变为0,结果失去意义要强一些。但即便是这样,这个函数也期望输入串不长于200,否则就有可能在1S内得不到最终计算结果。
public static String Viterbi(double[] PI, double[][] A, double[][] B, int[] sentences) {
StringBuilder ret = new StringBuilder();
double[][] matrix = new double[PI.length][sentences.length];
int[][] past = new int[PI.length][sentences.length];
int supplementStartColumn = -1;
BigDecimal[][] supplement = null; //new BigDecimal[][];
for (int row=0; row<matrix.length; row++)
matrix[row][0] = PI[row] * B[row][sentences[0]];
for (int col=1; col<sentences.length; col++) {
if (supplementStartColumn > -1) { //Use supplement BigDecimal matrix
for (int row=0; row<matrix.length; row++) {
BigDecimal max = new BigDecimal(0d);
int last = -1;
for (int r=0; r<matrix.length; r++) {
BigDecimal value = supplement[r][col-1-supplementStartColumn].multiply(new BigDecimal(A[r][row])).multiply(new BigDecimal(B[row][sentences[col]]));
if (value.compareTo(max) > 0) {
max = value;
last = r;
}
}
supplement[row][col-supplementStartColumn] = max;
past[row][col] = last;
}
} else {
boolean switchSupplement = false;
for (int row=0; row<matrix.length; row++) {
double max = 0;
int last = -1;
for (int r=0; r<matrix.length; r++) {
double value = matrix[r][col-1] * A[r][row] * B[row][sentences[col]];
if (value > max) {
max = value;
last = r;
}
}
matrix[row][col] = max;
past[row][col] = last;
if (max < 1E-250)
switchSupplement = true;
}
//Really small data, should switch to supplement BigDecimal matrix now, or we will loose accuracy soon
if (switchSupplement) {
supplementStartColumn = col;
supplement = new BigDecimal[PI.length][sentences.length - supplementStartColumn];
for (int row=0; row<matrix.length; row++) {
supplement[row][col - supplementStartColumn] = new BigDecimal(matrix[row][col]);
}
}
}
}
int index = -1;
if (supplementStartColumn > -1) {
BigDecimal max = new BigDecimal(0d);
int column = supplement[0].length-1;
for (int row=0; row<supplement.length; row++) {
if (supplement[row][column].compareTo(max) > 0) {
max = supplement[row][column];
index = row;
}
}
} else {
double max = 0;
for (int row=0; row<matrix.length; row++)
if (matrix[row][sentences.length-1] > max) {
max = matrix[row][sentences.length-1];
index = row;
}
}
/*for (int i=0; i<matrix.length; i++)
System.out.println(Arrays.toString(matrix[i]));*/
ret.append(getDesc(index));
for (int col=sentences.length-1; col>=1; col--) {
index = past[index][col];
ret.append(getDesc(index));
}
return ret.reverse().toString();
}
测试一下,对如下串进行分词:
String str2 = "这并不是一种最好的处理技术,因为这有可能低估或高估真实概率,更加科学的方法是使用复杂一点的Good—Turing技术,这项技术的原始版本是图灵当年和他的助手Good在破解德国密码机时发明的。";
结果为:
Switch to BigDecimal at length = 79
words.length=95 timecost:63
这/并/不是/一种/最好/的/处理/技术/,/因为/这有/可能/低估/或/高估/真实/概率/,/更加/科学/的/方法/是/使用/复杂/一点/的/Good—Turing技术/,/这项/技术/的/原始/版本/是/图灵/当年/和/他/的/助手/Good/在/破解/德国/密码/机时/发明/的/。/
timeCost: 63