java实现中文分词

最新推荐文章于 2025-05-25 09:42:37 发布

转载最新推荐文章于 2025-05-25 09:42:37 发布 · 543 阅读

2 ·

CC 4.0 BY-SA版权

原文链接：http://www.cnblogs.com/chenrenshui/p/7273551.html

文章标签：

#java #数据库

IK Analyzer是基于lucene实现的分词开源框架

下载路径：http://so.youkuaiyun.com/so/search/s.do?q=IKAnalyzer2012.jar&t=doc&o=&s=all&l=null

需要在项目中引入：

IKAnalyzer2012.jar

lucene-core-3.6.0.jar

实现的两种方法：

使用(lucene)实现：

 1 import java.io.IOException;
 2 import java.io.StringReader;
 3 import org.wltea.analyzer.core.IKSegmenter;
 4 import org.wltea.analyzer.core.Lexeme;
 5 
 6 public class Fenci1 {
 7     public static void main(String[] args) throws IOException{
 8         String text="你好，我的世界！";  
 9         StringReader sr=new StringReader(text);  
10         IKSegmenter ik=new IKSegmenter(sr, true);  
11         Lexeme lex=null;  
12         while((lex=ik.next())!=null){  
13             System.out.print(lex.getLexemeText()+"，");  
14         } 
15     }
16 
17 }

使用(IK Analyzer)实现:

 1 import java.io.IOException;
 2 import java.io.StringReader;
 3 import org.apache.lucene.analysis.Analyzer;
 4 import org.apache.lucene.analysis.TokenStream;
 5 import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
 6 import org.wltea.analyzer.lucene.IKAnalyzer;
 7 
 8 public class Fenci {
 9     public static void main(String[] args) throws IOException {
11             String text="你好，我的世界!";  
12             //创建分词对象  
13             Analyzer anal=new IKAnalyzer(true);       
14             StringReader reader=new StringReader(text);  
15             //分词  
16             TokenStream ts=anal.tokenStream("", reader);  
17             CharTermAttribute term=ts.getAttribute(CharTermAttribute.class);  
18             //遍历分词数据  
19             while(ts.incrementToken()){  
20                 System.out.print(term.toString()+"，");  
21             }  
22             reader.close();  
23             System.out.println(); 
24     }
25 
26 }