1. A custom `TokenFilter` attaches each token's part of speech (POS) as a payload;
2. At index time, store term vectors `with_positions_offsets_payloads` (in Lucene terms, the four `FieldType` term-vector switches);
3. At query time, read the payloads back verbatim through the term vectors API.
The whole thing has no dependencies beyond lucene-core and lucene-analysis-common, and everything runs from a single `main` method.
---
1. Maven dependencies (Lucene 9.8)
```xml
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>9.8.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analysis-common</artifactId>
    <version>9.8.0</version>
</dependency>
```
---
2. Custom TokenFilter: POS payload
```java
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

/**
 * Tags each token with a simple part-of-speech marker as its payload.
 * (A real system would use something like OpenNLP's POSTagger;
 * hard-coded rules are enough for this demo.)
 */
public final class PosPayloadFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);

    public PosPayloadFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) return false;
        String word = termAtt.toString();
        // Toy rule: verb / noun / other
        String pos;
        if (word.endsWith("ing")) pos = "v";
        else if (word.endsWith("s")) pos = "n";
        else pos = "x";
        payAtt.setPayload(new BytesRef(pos.getBytes(StandardCharsets.UTF_8)));
        return true;
    }
}
```
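Before wiring the filter into an index, it is easy to sanity-check on its own by driving the analysis chain directly. A minimal sketch (the `PosPayloadFilterCheck` class name is mine, not part of the demo) that follows the usual `TokenStream` consumer contract of `reset()` / `incrementToken()` / `end()` / `close()`:
```java
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;

import java.io.StringReader;

public class PosPayloadFilterCheck {
    public static void main(String[] args) throws Exception {
        StandardTokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader("running cars fast"));
        try (TokenStream ts = new PosPayloadFilter(tokenizer)) {
            CharTermAttribute term = ts.getAttribute(CharTermAttribute.class);
            PayloadAttribute payload = ts.getAttribute(PayloadAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // Each token should print as e.g. "running -> v"
                System.out.println(term + " -> " + payload.getPayload().utf8ToString());
            }
            ts.end();
        }
    }
}
```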
---
3. Indexing: enable term vectors + payloads
```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;

public class PayloadIndexDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                StandardTokenizer src = new StandardTokenizer();
                TokenStream tok = new PosPayloadFilter(src); // key step: chain the filter after the tokenizer
                return new TokenStreamComponents(src, tok);
            }
        };
        IndexWriterConfig cfg = new IndexWriterConfig(analyzer);
        IndexWriter writer = new IndexWriter(dir, cfg);

        FieldType ft = new FieldType(TextField.TYPE_NOT_STORED);
        ft.setStoreTermVectors(true);
        ft.setStoreTermVectorPositions(true);
        ft.setStoreTermVectorOffsets(true);
        ft.setStoreTermVectorPayloads(true); // <-- the crucial switch
        ft.freeze();

        Document doc = new Document();
        doc.add(new Field("body", "running cars fast", ft));
        writer.addDocument(doc);
        writer.commit();
        writer.close();

        // A ByteBuffersDirectory lives only in this JVM, so read the payloads
        // back from the same instance (PayloadReaderDemo is defined in step 4).
        PayloadReaderDemo.dump(dir);
    }
}
```
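To double-check that those flags actually made it into the index, `FieldInfos` exposes them per field. A small helper sketch (the `FieldFlagsCheck` class is mine, assumed to be called with the Directory written above); note that `hasPayloads()` reports payloads in the postings, which Lucene writes whenever an indexed-with-positions field's token stream emits them, independently of the term-vector switches:
```java
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.FieldInfos;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;

public class FieldFlagsCheck {
    // Prints the indexing options recorded for the "body" field.
    public static void print(Directory dir) throws Exception {
        try (IndexReader reader = DirectoryReader.open(dir)) {
            FieldInfo fi = FieldInfos.getMergedFieldInfos(reader).fieldInfo("body");
            System.out.println("term vectors stored: " + fi.hasVectors());
            System.out.println("payloads in postings: " + fi.hasPayloads());
        }
    }
}
```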
---
4. Query time: reading the payloads back
```java
import org.apache.lucene.index.*;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;

public class PayloadReaderDemo {
    /**
     * Must be handed the same Directory instance the index was written to;
     * a freshly created ByteBuffersDirectory would be empty.
     */
    public static void dump(Directory dir) throws Exception {
        try (IndexReader reader = DirectoryReader.open(dir)) {
            Terms terms = reader.getTermVector(0, "body"); // term vector of doc 0
            if (terms != null) {
                TermsEnum te = terms.iterator();
                PostingsEnum pe = null;
                while (te.next() != null) {
                    String term = te.term().utf8ToString();
                    pe = te.postings(pe, PostingsEnum.ALL); // positions + offsets + payloads
                    if (pe.nextDoc() == DocIdSetIterator.NO_MORE_DOCS) continue;
                    // nextPosition() may be called exactly freq() times per document
                    for (int i = 0; i < pe.freq(); i++) {
                        int pos = pe.nextPosition();
                        BytesRef payload = pe.getPayload();
                        System.out.printf("term=%s pos=%d offset=%d payload=%s%n",
                                term, pos, pe.startOffset(),
                                payload == null ? "null" : payload.utf8ToString());
                    }
                }
            }
        }
    }
}
```
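Term vectors are a per-document structure; independently of them, the same payloads also land in the inverted index, because `TextField` indexes positions and Lucene stores payloads in the postings whenever the token stream supplies them. A sketch of that alternative read path (the class name is mine, same assumption that `dir` is the directory written in step 3), which needs none of the term-vector flags:
```java
import org.apache.lucene.index.*;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;

public class PostingsPayloadDemo {
    // Walks the inverted index of the "body" field and prints each
    // (doc, term, position, payload) tuple straight from the postings.
    public static void dump(Directory dir) throws Exception {
        try (IndexReader reader = DirectoryReader.open(dir)) {
            for (LeafReaderContext ctx : reader.leaves()) {
                Terms terms = ctx.reader().terms("body");
                if (terms == null) continue;
                TermsEnum te = terms.iterator();
                while (te.next() != null) {
                    String term = te.term().utf8ToString();
                    PostingsEnum pe = te.postings(null, PostingsEnum.PAYLOADS);
                    while (pe.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                        for (int i = 0; i < pe.freq(); i++) {
                            int pos = pe.nextPosition();
                            BytesRef payload = pe.getPayload();
                            System.out.printf("doc=%d term=%s pos=%d payload=%s%n",
                                    ctx.docBase + pe.docID(), term, pos,
                                    payload == null ? "null" : payload.utf8ToString());
                        }
                    }
                }
            }
        }
    }
}
```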
---
5. Sample run
```
term=cars pos=1 offset=8 payload=n
term=fast pos=2 offset=13 payload=x
term=running pos=0 offset=0 payload=v
```
`TermsEnum` returns terms in byte order (hence cars, fast, running), and every token carries its single-byte POS payload, confirming the whole write/read path works end to end.
---
One-sentence summary
> Set a value on `PayloadAttribute` inside a custom `TokenFilter` and turn on `setStoreTermVectorPayloads(true)` in the `FieldType`, and any binary/string payload is stored in the index alongside each token, ready to be read back verbatim through `TermsEnum`/`PostingsEnum`.