【Lucene】A custom TokenFilter that attaches payloads

1. A custom `TokenFilter` attaches a part-of-speech (POS) tag to every token as a payload;

2. At index time, term vectors are enabled with positions, offsets, and payloads (`with_positions_offsets_payloads`);

3. At query time, the payloads are read back verbatim through the term-vectors API.

 

The code has no dependencies beyond lucene-core and lucene-analysis-common, and runs straight from a `main` method.

 

---

 

1. Maven dependencies (Lucene 9.8)

 

```xml
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-core</artifactId>
  <version>9.8.0</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analysis-common</artifactId>
  <version>9.8.0</version>
</dependency>
```

 

---

 

2. The custom TokenFilter: POS payloads

 

```java
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

/**
 * Attaches a simple part-of-speech tag to each token as a payload.
 * (A real system would use something like OpenNLP's POSTagger;
 * hard-coded suffix rules are enough for the demo.)
 */
public final class PosPayloadFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);

  public PosPayloadFilter(TokenStream input) { // public: instantiated from another class below
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) return false;

    String word = termAtt.toString();
    // Naive suffix rule: verb / noun / other
    String pos;
    if (word.endsWith("ing")) pos = "v";
    else if (word.endsWith("s")) pos = "n";
    else pos = "x";

    payAtt.setPayload(new BytesRef(pos.getBytes(StandardCharsets.UTF_8)));
    return true;
  }
}
```
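The suffix checks are order-sensitive: `endsWith("ing")` is tried before `endsWith("s")`, so "sing" tags as `v` while "sings" tags as `n`. A standalone sketch of the same rule (the `tagPos` helper is hypothetical, extracted here only for illustration):

```java
public class PosRuleSketch {
  // Same naive suffix rule as PosPayloadFilter, as a pure function.
  static String tagPos(String word) {
    if (word.endsWith("ing")) return "v"; // checked first, so it wins over "s"
    if (word.endsWith("s"))   return "n";
    return "x";
  }

  public static void main(String[] args) {
    for (String w : new String[] {"running", "cars", "fast", "sing", "sings"}) {
      System.out.println(w + " -> " + tagPos(w));
    }
  }
}
```

Pulling the rule into a pure function also makes it trivial to swap in a real tagger later without touching the filter plumbing.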

 

---

 

3. Indexing: turn on term vectors + payloads

 

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;

public class PayloadIndexDemo {
  public static void main(String[] args) throws Exception {
    Directory dir = new ByteBuffersDirectory();
    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        StandardTokenizer src = new StandardTokenizer();
        TokenStream tok = new PosPayloadFilter(src); // key step: chain the filter after the tokenizer
        return new TokenStreamComponents(src, tok);
      }
    };

    IndexWriterConfig cfg = new IndexWriterConfig(analyzer);
    IndexWriter writer = new IndexWriter(dir, cfg);

    FieldType ft = new FieldType(TextField.TYPE_NOT_STORED);
    ft.setStoreTermVectors(true);
    ft.setStoreTermVectorPositions(true); // payloads require positions
    ft.setStoreTermVectorOffsets(true);
    ft.setStoreTermVectorPayloads(true);  // <-- the crucial switch
    ft.freeze();

    Document doc = new Document();
    doc.add(new Field("body", "running cars fast", ft));
    writer.addDocument(doc);
    writer.commit();
    writer.close();
  }
}
```

 

---

 

4. Query time: reading the payloads back

 

```java
import org.apache.lucene.index.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;

public class PayloadReaderDemo {
  // Must be handed the very Directory instance PayloadIndexDemo wrote to:
  // ByteBuffersDirectory is purely in-memory, so a freshly constructed one is empty.
  public static void readPayloads(Directory dir) throws Exception {
    IndexReader reader = DirectoryReader.open(dir);

    Terms terms = reader.getTermVector(0, "body");
    if (terms != null) {
      TermsEnum te = terms.iterator();
      PostingsEnum pe = null;
      BytesRef term;
      while ((term = te.next()) != null) {
        pe = te.postings(pe, PostingsEnum.ALL); // positions + offsets + payloads
        pe.nextDoc(); // advance to the single pseudo-doc backing this term vector
        int freq = pe.freq();
        for (int i = 0; i < freq; i++) { // nextPosition() must be called exactly freq() times
          int pos = pe.nextPosition();
          BytesRef payload = pe.getPayload();
          System.out.printf("term=%s pos=%d offset=%d-%d payload=%s%n",
              term.utf8ToString(), pos, pe.startOffset(), pe.endOffset(),
              payload == null ? "null" : payload.utf8ToString());
        }
      }
    }
    reader.close();
  }
}
```

 

---

 

5. Sample output

 

```
term=cars pos=1 offset=8-12 payload=n
term=fast pos=2 offset=13-17 payload=x
term=running pos=0 offset=0-7 payload=v
```

 

Every token carries its one-byte POS payload, confirming the whole write/read path works. Note that `TermsEnum` returns terms in sorted order (cars, fast, running), not document order.
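It is easy to conflate `pos` (the token's position, i.e. its index in the token stream) with the start/end offsets (character indices into the original text). A self-contained sketch, using a plain space split in place of Lucene's tokenizer, shows how the two line up for the demo sentence:

```java
import java.util.ArrayList;
import java.util.List;

public class PosVsOffset {
  // Tokenize on spaces and report (position, startOffset, endOffset) per token.
  static List<String> describe(String text) {
    List<String> out = new ArrayList<>();
    int position = 0; // token index: what PositionIncrementAttribute accumulates
    int cursor = 0;   // scan pointer for locating character offsets
    for (String token : text.split(" ")) {
      int start = text.indexOf(token, cursor); // startOffset: char index where the token begins
      int end = start + token.length();        // endOffset: exclusive end, like endOffset()
      out.add(String.format("term=%s pos=%d offset=%d-%d", token, position++, start, end));
      cursor = end;
    }
    return out;
  }

  public static void main(String[] args) {
    describe("running cars fast").forEach(System.out::println);
    // running sits at position 0, chars 0-7; cars at position 1, chars 8-12; fast at position 2, chars 13-17
  }
}
```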

 

---

 

One-sentence summary

 

> Set the `PayloadAttribute` in a custom `TokenFilter` and enable `setStoreTermVectorPayloads(true)` on the `FieldType`, and any binary or string payload is stored in the index alongside each token, to be read back verbatim via `TermsEnum`/`PostingsEnum`.
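Payloads are raw bytes, so "any binary payload" really does mean arbitrary: a common trick is to store a per-term float boost as a 4-byte payload instead of a string tag. A minimal sketch of the encoding with plain `java.nio` (lucene-analysis-common ships a `PayloadHelper` class serving a similar purpose):

```java
import java.nio.ByteBuffer;

public class FloatPayloadSketch {
  // Pack a float boost into the 4 bytes a payload carries.
  static byte[] encode(float boost) {
    return ByteBuffer.allocate(4).putFloat(boost).array();
  }

  // Unpack it on the read side, e.g. after PostingsEnum.getPayload().
  static float decode(byte[] payload) {
    return ByteBuffer.wrap(payload).getFloat();
  }

  public static void main(String[] args) {
    byte[] payload = encode(2.5f); // would go into payAtt.setPayload(new BytesRef(payload))
    System.out.println("round-trip: " + decode(payload));
  }
}
```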
