1. Without the stop words of the built-in StopAnalyzer
Code 1:
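// Package names assume Lucene 5.x/6.x; in 7.x CharArraySet, LowerCaseFilter
// and StopFilter live in org.apache.lucene.analysis instead.
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.util.CharArraySet;

// An analyzer built from a caller-supplied array of stop words
public class MyStopAnalyzer extends Analyzer {
    // The words we want to filter out
    private CharArraySet stopWordsSet;

    // Constructor
    public MyStopAnalyzer(String[] stopWords) {
        // Convert the String array to a CharArraySet (true = ignore case)
        stopWordsSet = StopFilter.makeStopSet(stopWords, true);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Create the tokenizer: splits the text at non-letter characters
        LetterTokenizer letterTokenizer = new LetterTokenizer();
        // Create the chain of token filters: lower-case first, then drop stop words
        LowerCaseFilter lowerCaseFilter = new LowerCaseFilter(letterTokenizer);
        StopFilter stopFilter = new StopFilter(lowerCaseFilter, stopWordsSet);
        // Wrap the tokenizer (source) and the last filter (sink)
        return new TokenStreamComponents(letterTokenizer, stopFilter);
    }
}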
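To make the order of the chain easier to follow, here is a minimal plain-Java sketch of the same three steps: split on non-letters, lower-case, drop stop words. It does not use Lucene at all. The class StopChainSketch and its methods are hypothetical names made up for this illustration, and the real LetterTokenizer, LowerCaseFilter and StopFilter work on a token stream rather than on a list.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the Lucene chain; none of these names come from Lucene.
class StopChainSketch {

    // Plays the role of makeStopSet(stopWords, true): words are stored
    // lower-cased so that matching ignores case.
    static Set<String> makeStopSet(String[] stopWords) {
        Set<String> set = new HashSet<>();
        for (String word : stopWords) {
            set.add(word.toLowerCase(Locale.ROOT));
        }
        return set;
    }

    // Split at non-letters, lower-case each token, then drop stop words.
    static List<String> analyze(String text, Set<String> stopSet) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // Walk one position past the end so the last token is flushed too.
        for (int i = 0; i <= text.length(); i++) {
            boolean isLetter = i < text.length() && Character.isLetter(text.charAt(i));
            if (isLetter) {
                current.append(Character.toLowerCase(text.charAt(i)));
            } else if (current.length() > 0) {
                String token = current.toString();
                if (!stopSet.contains(token)) {
                    tokens.add(token);
                }
                current.setLength(0);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> stopSet = makeStopSet(new String[] {"I", "you", "hate"});
        System.out.println(analyze("How are you I hate you", stopSet)); // prints [how, are]
    }
}
```

Because lower-casing happens before the stop-word check, the stop word "I" still matches the token "i".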
Code 2, test helper method 1:
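// Static helper; Code 3 calls it as AnalyzerUtils.displayToken
public static void displayToken(String str, Analyzer a) {
    // 1. Build a TokenStream from the string; the analyzer passed in does the tokenizing.
    // The first argument is the field name; it has no real effect here.
    // try-with-resources closes the stream when we are done.
    try (TokenStream tokenStream = a.tokenStream("content", new StringReader(str))) {
        // 2. Register an attribute on the stream. Think of it as a net carried along
        // with the stream: every incrementToken() call updates it to the current term.
        CharTermAttribute cta = tokenStream.addAttribute(CharTermAttribute.class);
        // 3. reset() must be called before consuming the stream, otherwise an error occurs
        tokenStream.reset();
        // Iterate over the tokens
        while (tokenStream.incrementToken()) {
            System.out.print("[" + cta + "]");
        }
        // Signal the end of the stream, as the TokenStream workflow expects
        tokenStream.end();
        System.out.println("\n");
    } catch (IOException e) {
        e.printStackTrace();
    }
}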
Code 3, JUnit test method:
@Test
public void testMyStopAnalyzer() {
    String[] stopWords = {"I", "you", "hate"};
    MyStopAnalyzer a1 = new MyStopAnalyzer(stopWords);
    Analyzer a2 = new StopAnalyzer();
    String text = "How are you I hate you";
    AnalyzerUtils.displayToken(text, a1);
    AnalyzerUtils.displayToken(text, a2);
}
2. Default stop words of the built-in StopAnalyzer
System.out.println(StopAnalyzer.ENGLISH_STOP_WORDS_SET);
[but, be, with, such, then, for, no, will, not, are, and, their, if, this, on, into, a, or, there, in, that, they, was, is, it, an, the, as, at, these, by, to, of]
3. Including the stop words of the built-in StopAnalyzer
public MyStopAnalyzer(String[] stopWords) {
    // System.out.println(StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    // Build the set from our own words first (true = ignore case)
    stopWordsSet = StopFilter.makeStopSet(stopWords, true);
    // Then add the default English stop words as well
    stopWordsSet.addAll(StopAnalyzer.ENGLISH_STOP_WORDS_SET);
}
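The effect of merging the two sets can be sketched in plain Java, without Lucene. The default words are copied from the ENGLISH_STOP_WORDS_SET output in section 2. The class MergedStopSetSketch and its methods are hypothetical names for this illustration only, and the split on characters outside a-z is a simplification of what LetterTokenizer does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; this class is not part of Lucene.
class MergedStopSetSketch {

    // The 33 defaults, copied from the ENGLISH_STOP_WORDS_SET output in section 2.
    static final String[] DEFAULT_STOP_WORDS = {
        "but", "be", "with", "such", "then", "for", "no", "will", "not", "are", "and",
        "their", "if", "this", "on", "into", "a", "or", "there", "in", "that", "they",
        "was", "is", "it", "an", "the", "as", "at", "these", "by", "to", "of"
    };

    // Start from the custom words, then add every default (the addAll step).
    static Set<String> mergedStopSet(String[] customWords) {
        Set<String> merged = new HashSet<>();
        for (String word : customWords) {
            merged.add(word.toLowerCase(Locale.ROOT));
        }
        merged.addAll(Arrays.asList(DEFAULT_STOP_WORDS));
        return merged;
    }

    // Lower-case the text, split on anything outside a-z, drop stop words.
    static List<String> filter(String text, Set<String> stopSet) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty() && !stopSet.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> merged = mergedStopSet(new String[] {"I", "you", "hate"});
        System.out.println(merged.size());                            // prints 36
        System.out.println(filter("How are you I hate you", merged)); // prints [how]
    }
}
```

With only the custom words, "are" survives; once the defaults are merged in, "are" is filtered as well and only "how" remains.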