Reverse Maximum Matching Algorithm

Latest recommended article published 2024-11-02 17:05:17

iceshirley

License: CC 4.0 BY-SA

Tags: algorithm, input, data structure, string, token, c

Article link: https://blog.youkuaiyun.com/iceshirley/article/details/1475598

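1. Definition

Reverse MM algorithm: let L be the number of characters in the longest dictionary entry. Take L characters from the string to be analyzed and look them up in the dictionary. If there is no match, drop the last character and look up again. Repeat until the candidate is found in the dictionary or only a single character is left.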
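The procedure described above can be sketched on a plain String, independent of Lucene. This is only a minimal illustration: the class name MaxMatchSketch, the method segment, and the tiny dictionary are assumptions made for the example and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MaxMatchSketch {
    // Takes up to maxLen characters at the current position and drops the
    // last one until the candidate is in dict or only one character is left.
    static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(pos + maxLen, text.length());
            while (end - pos > 1 && !dict.contains(text.substring(pos, end))) {
                end--; // drop the last character and try again
            }
            words.add(text.substring(pos, end));
            pos = end; // continue right after the emitted word
        }
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary used only for this demonstration.
        Set<String> dict = new HashSet<>(Arrays.asList("中华人民共和国", "成立"));
        System.out.println(segment("中华人民共和国成立", dict, 7));
    }
}
```

A single character that is not in the dictionary is still emitted as its own word, which matches the length == 1 case in the tokenizer code later in this article.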

2. Implementation

Create an MMChineseAnalyzer class that extends org.apache.lucene.analysis.Analyzer; it must implement public TokenStream tokenStream(String field, Reader reader). Then create an MMChineseTokenizer class that extends org.apache.lucene.analysis.Tokenizer. The constructor of MMChineseAnalyzer is:

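public MMChineseAnalyzer() {
    dic = new HashSet<String>();
    loadStopword();
    loadDictionary();
}

A hash table is used because insertion, deletion, and lookup all take constant time on average. dic holds the dictionary entries. loadStopword() loads the stop words, and loadDictionary() loads the dictionary file into dic.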
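The article does not show the body of loadDictionary(). As a rough sketch, assuming the dictionary file lists one entry per line (an assumption, since the file format is not given), the entries could be read into a HashSet like this. The class name DictionaryLoaderSketch and the method load are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

class DictionaryLoaderSketch {
    // Reads one dictionary entry per line (assumed format) into a HashSet,
    // skipping blank lines and surrounding whitespace.
    static Set<String> load(Reader source) {
        Set<String> dic = new HashSet<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty()) {
                    dic.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return dic;
    }

    public static void main(String[] args) {
        // A StringReader stands in for the real dictionary file.
        Set<String> dic = load(new StringReader("今天\n难忘\n\n日子\n"));
        System.out.println(dic.size() + " entries, contains 日子: " + dic.contains("日子"));
    }
}
```

Taking a Reader rather than a file path keeps the sketch testable; a real analyzer would pass a reader opened on its dictionary file with the correct character encoding.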

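MMChineseTokenizer must implement public Token next(). Its constructor is:

public MMChineseTokenizer(Reader in, HashSet<String> dic) {
    input = in;
    this.dic = dic;
}

dic is the dictionary, held in a HashSet, and input is the text to be analyzed.

In next(), the contents of input are read into ioBuffer. A character c is then taken from the buffer and classified by its Unicode block. If c is a Chinese character: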

if (cUnicodeBlock == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
    tokenType = "chinese";
    int start = bufferIndex;
    char[] temp = new char[7]; // the longest dictionary entry has 7 characters
    temp[0] = c;
    // Collect up to 6 more characters from the same Unicode block.
    for (int i = 1; i < 7; i++, start++) {
        if (start >= ioBuffer.length) { // check bounds before reading
            break;
        }
        char temp1 = ioBuffer[start];
        if (cUnicodeBlock == Character.UnicodeBlock.of(temp1)) {
            temp[i] = temp1;
        } else {
            break;
        }
    }
    // Unused slots in temp are '\0'; trim() strips them.
    String temp2 = new String(temp).trim();
    int length = temp2.length();
    // Core of the algorithm. LABLE is the label of the enclosing loop.
    while (true) {
        if (dic.contains(temp2)) {
            word.append(temp2);
            offset = start;
            bufferIndex = start;
            break LABLE;
        } else {
            if (length == 1) {
                // A single character is emitted even if it is not in the dictionary.
                word.append(temp2);
                offset = start;
                bufferIndex = start;
                break LABLE;
            }
            temp2 = temp2.substring(0, --length); // drop the last character
            start--;                              // and move the end position back
        }
    }
}

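temp holds the candidate taken on each pass, and word is a StringBuffer. The while loop ends as soon as the candidate is found in the dictionary. If the dictionary does not contain it, the last character is dropped with temp2 = temp2.substring(0, --length); and the lookup is repeated. When length reaches 1, the candidate is a single character, so it is appended to word and the loop ends.

If c is a Latin character, isSameUnicodeBlock tells whether the current character and the next one belong to the same Unicode block.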
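One detail of the code above is easy to miss: temp is a fixed seven-slot array, and the slots that were never filled keep the default value '\0'. String.trim() removes leading and trailing characters with a code point of U+0020 or lower, which includes '\0', so the padding disappears. The small sketch below isolates that step; the class name PaddedBufferSketch and the method toWord are made up for the illustration.

```java
class PaddedBufferSketch {
    // Converts a fixed-size buffer whose unused tail is '\0' into the word it holds.
    static String toWord(char[] temp) {
        return new String(temp).trim();
    }

    public static void main(String[] args) {
        char[] temp = new char[7]; // every slot starts as '\0'
        temp[0] = '日';
        temp[1] = '子';
        String word = toWord(temp);
        System.out.println(word + " has length " + word.length());
    }
}
```

Without the trim() call the string would keep all seven characters, and the dictionary lookup would never match.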

else if (cUnicodeBlock == Character.UnicodeBlock.BASIC_LATIN) {
    tokenType = "english";
    if (Character.isWhitespace(c)) {
        // Whitespace ends the current word, if there is one.
        if (word.length() != 0)
            break;
    } else {
        word.append(c);
        // Look at the next character to see whether the Latin run continues.
        nextChar = ioBuffer[bufferIndex];
        nextCharUnicodeBlock = Character.UnicodeBlock.of(nextChar);
        boolean isSameUnicodeBlock = cUnicodeBlock.toString().equalsIgnoreCase(nextCharUnicodeBlock.toString());
        if (word.length() != 0 && (!isSameUnicodeBlock)) {
            break;
        }
    }
}
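The Latin branch groups consecutive Basic Latin characters into one token and ends the token at whitespace or at a character from another block. A standalone sketch of that grouping follows. It is a simplification rather than the tokenizer itself: the class BlockRunSketch and the method latinRuns are hypothetical, and blocks are compared with ==, which works because Character.UnicodeBlock.of returns shared constants.

```java
import java.util.ArrayList;
import java.util.List;

class BlockRunSketch {
    // Collects maximal runs of non-whitespace Basic Latin characters.
    static List<String> latinRuns(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean latin =
                Character.UnicodeBlock.of(c) == Character.UnicodeBlock.BASIC_LATIN;
            if (latin && !Character.isWhitespace(c)) {
                word.append(c);
            } else if (word.length() > 0) {
                // Whitespace or a character from another block ends the run.
                tokens.add(word.toString());
                word.setLength(0);
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString()); // flush the final run
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(latinRuns("成立于1949年10月1号"));
    }
}
```

ASCII digits belong to the Basic Latin block, which is why numbers such as 1949 are grouped the same way as English words.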

That completes the tokenizer.

The test input is "今天是个难忘的日子中华人民共和国成立于1949年10月1号".

The result is:

(今天是,0,3,type=chinese)
(个,3,4,type=chinese)
(难忘,4,6,type=chinese)
(日子,7,9,type=chinese)
(中华人民共和国,9,16,type=chinese)
(成立,16,18,type=chinese)
(1949,19,23,type=latin)
(年,23,24,type=chinese)
(10,24,26,type=latin)
(月,26,27,type=chinese)
(1,27,28,type=latin)
(号,28,29,type=chinese)