SQLite Source Code Analysis: Tokenizer (Part 2)

License: CC 4.0 BY-SA

Tags: #sqlite

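This article continues the look at SQLite's tokenizers and focuses on the Porter_Tokenizer module, which applies the Porter stemming algorithm to reduce English words to their stems so that full-text queries also match related word forms. Through code analysis, it explains how the porter tokenizer processes its input and how it differs from the "simple" tokenizer.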

2021SC@SDUSC

Contents

- Introduction
- Code analysis

Introduction

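Following on from the previous article, SQLite Source Code Analysis: Tokenizer (Part 1), this article introduces another tokenizer module: Porter_Tokenizer.
In addition to the "simple" tokenizer, the FTS source code provides a tokenizer that uses the Porter stemming algorithm. It splits the input document into terms using the same rules as "simple", including folding all terms to lowercase, and then applies the Porter stemming algorithm to reduce related English words to a common stem. For example, given the input document "Right now they're very frustrated" (the same document used in the SQL example in this section), the porter tokenizer extracts the tokens "right now thei veri frustrat". Even though some of these terms are not English words, in some cases a full-text index built from them is more useful than one built from the more readable output of the simple tokenizer. With the porter tokenizer, the document matches not only the full-text query MATCH 'Frustrated' but also MATCH 'Frustration', because the Porter stemming algorithm reduces "Frustration" to "frustrat", just as it does "Frustrated". As a result, with the porter tokenizer FTS finds not only exact matches for the query terms but also matches on similar English terms.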
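To make the matching behaviour described above concrete, here is a minimal Python sketch. It is not SQLite's implementation. `split_terms` approximates the shared splitting rule under the assumption that a term is a run of ASCII letters and digits or non-ASCII characters, with ASCII letters folded to lowercase. `STEMS` is a hypothetical lookup table that holds only the stems quoted in the paragraph above and stands in for the real Porter algorithm, which derives stems by rule. `porter_like_terms` and `matches` are likewise illustrative names.

```python
def split_terms(text):
    """Split text into lowercase terms, roughly as the "simple" tokenizer does.

    Assumption: a term is a run of ASCII letters/digits or non-ASCII
    characters, and only ASCII letters are folded to lowercase.
    """
    terms, current = [], []
    for ch in text:
        if ord(ch) >= 128:
            current.append(ch)              # non-ASCII: kept unchanged
        elif ch.isalnum():
            current.append(ch.lower())      # ASCII letter or digit
        elif current:
            terms.append("".join(current))  # a separator ends the current term
            current = []
    if current:
        terms.append("".join(current))
    return terms


# Stand-in for the Porter stemmer: only the stems quoted in the text above.
# The real algorithm computes stems by rule rather than by table lookup.
STEMS = {
    "they": "thei",
    "very": "veri",
    "frustrated": "frustrat",
    "frustration": "frustrat",
}


def porter_like_terms(text):
    """Apply the stand-in stem table to every term produced by split_terms."""
    return [STEMS.get(term, term) for term in split_terms(text)]


def matches(document, query, tokenize):
    """Return True if every query term occurs among the document's terms."""
    indexed = set(tokenize(document))
    return all(term in indexed for term in tokenize(query))


doc = "Right now they're very frustrated"
print(matches(doc, "Frustrated", split_terms))         # True
print(matches(doc, "Frustration", split_terms))        # False
print(matches(doc, "Frustrated", porter_like_terms))   # True
print(matches(doc, "Frustration", porter_like_terms))  # True
```

The key point is that the same normalisation is applied to both the indexed document and the query, which is why "Frustration" can match a document that only contains "frustrated".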
The following SQL illustrates the difference between the "simple" and "porter" tokenizers:

-- Create a table using the simple tokenizer. Insert a document into it.
CREATE VIRTUAL TABLE simple USING fts3(tokenize=simple);
INSERT INTO simple VALUES('Right now they''re very frustrated');

-- The first of the following two queries matches the document stored in
-- table "simple". The second does not.
SELECT * FROM simple WHERE simple MATCH 'Frustrated';
SELECT * FROM simple WHERE simple MATCH 'Frustration';

-- Create a table using the porter tokenizer. Insert the same document into it
CREATE VIRTUAL TABLE porter USING fts3(tokenize=porter);