Implementing Chinese Word Segmentation in a Search Engine


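This article describes how to extend the Lucene.net standard tokenizer to support Chinese word segmentation. The key steps are loading a dictionary, collecting a run of consecutive Chinese characters, and implementing the segmentation algorithm.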

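The Lucene.net standard tokenizer works very well on English text. For example, it handles email addresses, IP addresses, and symbols cleanly. Unfortunately, it does not support Chinese word segmentation, so I modified its core code to add that support.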

Goal: make the tokenizer segment Chinese words as well.

Result:

Input sentence: “我是中国人！I am chiness!Email:youpeizun126@126.com;IP:172.17.34.168”

Tokens produced:

我/是/中国人/中国/中/国/人/Email/youpeizun126@126.com/IP/172.17.34.168

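Tasks to complete:

1. Load the dictionary.

2. Collect a run of consecutive Chinese characters.

3. Segment that run into words, one token at a time.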

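The design for extending the Lucene.net standard tokenizer to support Chinese word segmentation follows this flowchart:

[Flowchart: extending the Lucene.net standard tokenizer with Chinese word segmentation]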

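Next comes the core code I wrote to extend the Lucene.net standard tokenizer. It consists of three functions, which load the dictionary, collect a run of consecutive Chinese characters, and implement the Chinese word segmentation algorithm.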

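#region Load the Chinese dictionary

// Requires: using System; using System.IO;
// Loads one word per line from the file at `path` into the `directory` hash table.
public void LoadDirectory(string path)
{
    // The original ignored `path` and always opened "words.txt"; pass that name to keep the old behavior.
    if (!File.Exists(path))
        return;

    // Load only once.
    if (directory != null)
        return;

    directory = new System.Collections.Hashtable();
    System.Diagnostics.Debug.WriteLine("begin read words");

    try
    {
        // `using` closes the reader even if reading fails.
        using (TextReader tr_words = new StreamReader(path, System.Text.Encoding.Default))
        {
            string word = null;
            while ((word = tr_words.ReadLine()) != null)
            {
                // Skip duplicates so Add never throws.
                if (!directory.ContainsKey(word))
                {
                    directory.Add(word, word);
                }
            }
        }
    }
    catch (SystemException ex)
    {
        // Report the failure instead of swallowing it silently.
        System.Diagnostics.Debug.WriteLine("failed to read words: " + ex.Message);
    }
}

#endregion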
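The loading step can also be tried outside the tokenizer. The following is only a minimal sketch, not the article's code: the class name `WordListLoader` and its methods are hypothetical, a `HashSet<string>` stands in for the `Hashtable`, and lines are trimmed and empty lines skipped, which `LoadDirectory` does not do. Note also that `new StreamReader(path)` reads UTF-8 by default, whereas the article uses `Encoding.Default`, so the encoding would need to match the actual word list.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, shown for illustration only.
public static class WordListLoader
{
    // Reads one word per line from any TextReader and returns the set of
    // distinct, non-empty words. HashSet.Add ignores duplicates.
    public static HashSet<string> Load(TextReader reader)
    {
        var words = new HashSet<string>(StringComparer.Ordinal);
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string word = line.Trim();
            if (word.Length > 0)
                words.Add(word);
        }
        return words;
    }

    // Convenience overload for a file on disk. Returns an empty set when the
    // file is missing, similar to the early return in LoadDirectory.
    public static HashSet<string> LoadFile(string path)
    {
        if (!File.Exists(path))
            return new HashSet<string>(StringComparer.Ordinal);
        using (var reader = new StreamReader(path))
            return Load(reader);
    }
}
```

Taking a `TextReader` rather than a file name makes the loader easy to exercise with a `StringReader`, for example `WordListLoader.Load(new StringReader("中国\n中国人\n"))`.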

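#region Collect a run of consecutive Chinese characters

// Called when the current token is a single Chinese character. Buffers that character and the
// Chinese characters that follow it (at most 255) in `chinesstext`, and remembers the first
// non-Chinese token in `cn_end_token`.
private void InitChinessText()
{
    textlengh = 0;
    cn_index = 0;
    chinesstext[0] = token.image;
    textlengh++;
    cn_start = token.beginColumn;
    isCnToken = true;

    bool isCN = true;
    while (isCN && textlengh < 255)
    {
        token = token_source.GetNextToken();
        if (token.kind != 0)
        {
            // Chinese characters fall in Unicode category OtherLetter (so do some other scripts).
            isCN = Char.GetUnicodeCategory(token.image, 0).Equals(System.Globalization.UnicodeCategory.OtherLetter);
        }
        else
            isCN = false; // kind 0 is end of input

        if (isCN)
        {
            chinesstext[textlengh] = token.image;
            textlengh++;
        }
        else
        {
            cn_end_token = token;
        }
    }

    // If the buffer filled up before the run ended, cn_end_token would be stale;
    // read the next token so the parser resumes from the right place.
    if (isCN)
    {
        cn_end_token = token_source.GetNextToken();
    }

    // Candidate words are at most 4 characters long.
    if (textlengh >= 4)
    {
        wordlengh = 4;
    }
    else
        wordlengh = textlengh;
}

#endregion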
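To see the run-collection idea in isolation, here is a small standalone sketch that works on a plain string instead of the tokenizer's token stream. It is an illustration under assumptions, not the article's implementation: `ChineseRuns`, `IsCjkLike`, and `Extract` are hypothetical names, and the 255-character cap is taken from the loop condition in `InitChinessText`. It uses the same `UnicodeCategory.OtherLetter` test, which also matches letters of some other scripts, so it is a coarse filter rather than a strict Chinese detector.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper, shown for illustration only.
public static class ChineseRuns
{
    // True for characters in Unicode category Lo (OtherLetter),
    // the same check used in InitChinessText.
    public static bool IsCjkLike(char c)
    {
        return char.GetUnicodeCategory(c) == UnicodeCategory.OtherLetter;
    }

    // Returns every maximal run of consecutive OtherLetter characters.
    // A run longer than maxRunLength is split into several runs.
    public static List<string> Extract(string input, int maxRunLength = 255)
    {
        var runs = new List<string>();
        int i = 0;
        while (i < input.Length)
        {
            if (!IsCjkLike(input[i]))
            {
                i++;
                continue;
            }
            int start = i;
            while (i < input.Length && i - start < maxRunLength && IsCjkLike(input[i]))
                i++;
            runs.Add(input.Substring(start, i - start));
        }
        return runs;
    }
}
```

For the sample sentence, `ChineseRuns.Extract("我是中国人！I am chiness!")` yields the single run `我是中国人`, because the full-width exclamation mark and the Latin letters are not in category `OtherLetter`.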

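#region Chinese word segmentation algorithm

// Returns the next token text from the buffered run. At each position it tries the 4-, 3- and
// 2-character prefixes, returning each one found in the dictionary on successive calls, then
// always returns the single character and moves on to the next position.
// `directory` must already be loaded.
private string GetNextTokenText()
{
    string text = null;

    if (wordlengh == 4)
    {
        text = chinesstext[cn_index] + chinesstext[cn_index + 1] + chinesstext[cn_index + 2] + chinesstext[cn_index + 3];
        wordlengh--;
        if (directory[text] != null)
        {
            return text;
        }
    }

    if (wordlengh == 3)
    {
        text = chinesstext[cn_index] + chinesstext[cn_index + 1] + chinesstext[cn_index + 2];
        wordlengh--;
        if (directory[text] != null)
        {
            return text;
        }
    }

    if (wordlengh == 2)
    {
        text = chinesstext[cn_index] + chinesstext[cn_index + 1];
        wordlengh--;
        if (directory[text] != null)
        {
            return text;
        }
    }

    if (wordlengh == 1)
    {
        // A single character is always emitted, whether or not it is in the dictionary.
        text = chinesstext[cn_index];
        cn_index++;

        if ((textlengh - cn_index) >= 4)
        {
            wordlengh = 4;
        }
        else if ((textlengh - cn_index) == 0)
        {
            // The run is exhausted: hand control back to the standard tokenizer,
            // resuming at the first non-Chinese token.
            isCnToken = false;
            jj_ntk = cn_end_token.kind;
            token = new Token();
            token.next = cn_end_token;
        }
        else
        {
            wordlengh = textlengh - cn_index;
        }
    }

    return text;
}

#endregion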
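The segmentation step can be sketched without the tokenizer state (`cn_index`, `wordlengh`, and so on). The version below is a simplified illustration of the same strategy as I read it from `GetNextTokenText`: at each position, emit every dictionary word among the prefixes of length 4 down to 2, then emit the single character, then advance by one character. The names `PrefixSegmenter` and `Segment` are hypothetical, and it returns all tokens at once instead of one per call.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, shown for illustration only.
public static class PrefixSegmenter
{
    // Segments one run of Chinese characters. For each start position it
    // checks prefixes from the longest (maxWordLength) down to length 2 and
    // emits those present in the dictionary, then always emits the single
    // character at that position.
    public static List<string> Segment(string run, ISet<string> dictionary, int maxWordLength = 4)
    {
        var tokens = new List<string>();
        for (int start = 0; start < run.Length; start++)
        {
            int longest = Math.Min(maxWordLength, run.Length - start);
            for (int len = longest; len >= 2; len--)
            {
                string candidate = run.Substring(start, len);
                if (dictionary.Contains(candidate))
                    tokens.Add(candidate);
            }
            tokens.Add(run.Substring(start, 1));
        }
        return tokens;
    }
}
```

With a dictionary containing only `中国` and `中国人`, `string.Join("/", PrefixSegmenter.Segment("我是中国人", dict))` gives `我/是/中国人/中国/中/国/人`, which matches the Chinese part of the sample output at the top of the article. Emitting overlapping words this way grows the index but lets a query for either `中国` or `中国人` match the same text.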

That's all. Thank you for reading.