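Short answer: you can't add a CharFilter to `StandardAnalyzer` itself. `StandardAnalyzer` is a final class, so you can't subclass it to override `initReader` (the hook where CharFilters are applied), and its `createComponents` method returns a fixed `Tokenizer + TokenFilter` chain with no entry point for CharFilters.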
---
The right approach: write your own custom Analyzer
```java
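// Import paths assume Lucene 9.x; in 8.x the factory base classes
// live in org.apache.lucene.analysis.util instead.
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharFilterFactory;
import org.apache.lucene.analysis.TokenFilterFactory;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public final class MyAnalyzer extends Analyzer {
    private final CharFilterFactory[] charFilters;
    private final TokenFilterFactory[] tokenFilters;

    public MyAnalyzer(CharFilterFactory[] charFilters,
                      TokenFilterFactory[] tokenFilters) {
        this.charFilters = charFilters;
        this.tokenFilters = tokenFilters;
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // Wrap the Reader with each CharFilter in turn;
        // charFilters[0] sees the raw text first.
        for (CharFilterFactory f : charFilters) {
            reader = f.create(reader);
        }
        return reader;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Standard tokenizer + whichever TokenFilters you want
        StandardTokenizer src = new StandardTokenizer();
        TokenStream tok = src;
        for (TokenFilterFactory f : tokenFilters) {
            tok = f.create(tok);
        }
        return new TokenStreamComponents(src, tok);
    }
}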
```
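To make the wrapping order in `initReader` concrete, here is a minimal Lucene-free sketch built on plain `java.io` readers. `InitReaderOrderSketch`, `ReplaceCharReader`, and the two sample rules are hypothetical stand-ins invented for illustration, not Lucene APIs. The loop in `wrapAll` has the same shape as the `initReader` loop above, so the same ordering rule should carry over: the first entry in the array sits closest to the raw text.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.function.UnaryOperator;

final class InitReaderOrderSketch {

    /** Hypothetical stand-in for a CharFilter: swaps one character as text streams through. */
    static final class ReplaceCharReader extends FilterReader {
        private final char from;
        private final char to;

        ReplaceCharReader(Reader in, char from, char to) {
            super(in);
            this.from = from;
            this.to = to;
        }

        @Override
        public int read() throws IOException {
            int c = super.read();
            return c == from ? to : c;
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            // n is -1 at end of stream, in which case the loop body never runs
            for (int i = off; i < off + n; i++) {
                if (buf[i] == from) {
                    buf[i] = to;
                }
            }
            return n;
        }
    }

    // Two illustrative rules (assumptions, chosen only to make the order visible)
    static final UnaryOperator<Reader> DASH_TO_UNDERSCORE =
            r -> new ReplaceCharReader(r, '-', '_');
    static final UnaryOperator<Reader> UNDERSCORE_TO_SPACE =
            r -> new ReplaceCharReader(r, '_', ' ');

    /** Same loop shape as initReader: each wrapper wraps the result of the previous one. */
    static Reader wrapAll(Reader reader, List<UnaryOperator<Reader>> wrappers) {
        for (UnaryOperator<Reader> w : wrappers) {
            reader = w.apply(reader);
        }
        return reader;
    }

    /** Wraps the text, then drains the outermost reader into a String. */
    static String filter(String text, List<UnaryOperator<Reader>> wrappers) {
        try (Reader r = wrapAll(new StringReader(text), wrappers)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[64];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // First rule runs on the raw text, second rule sees its output
        System.out.println(filter("a-b_c", List.of(DASH_TO_UNDERSCORE, UNDERSCORE_TO_SPACE))); // prints "a b c"
        // Reversed order: the dash is converted too late to be turned into a space
        System.out.println(filter("a-b_c", List.of(UNDERSCORE_TO_SPACE, DASH_TO_UNDERSCORE))); // prints "a_b c"
    }
}
```

The practical takeaway for `MyAnalyzer` is that the order of the `charFilters` array matters: put the filter that must see the raw input (for example, HTML stripping) first.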
---
Usage example
```java
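// Factory constructors remove the keys they consume from the args map, so pass
// a mutable map: an immutable Map.of() throws UnsupportedOperationException.
CharFilterFactory html = new HTMLStripCharFilterFactory(new HashMap<>());
TokenFilterFactory lower = new LowerCaseFilterFactory(new HashMap<>());
Analyzer analyzer = new MyAnalyzer(new CharFilterFactory[]{html},
                                   new TokenFilterFactory[]{lower});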
```
---
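What about Elasticsearch?
Define it directly in the index settings, under `settings.analysis`:
```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_std": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": ["html_strip"],
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```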
Under the hood, ES builds an analyzer that is functionally equivalent to the custom Analyzer above.
---
In one sentence:
> To get `StandardAnalyzer`-style analysis with a CharFilter, extend `Analyzer` and implement `initReader` yourself; `StandardAnalyzer` itself is final and cannot be extended.