How to Define a Custom Analyzer in Elasticsearch

二进制修理工

Published 2025-02-07 17:35:25

License: CC 4.0 BY-SA

Category column: Elasticsearch相关. Article tags: elasticsearch, full-text search, search engine

Article link: https://blog.youkuaiyun.com/Coastlise/article/details/145499461

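In Elasticsearch, an analyzer defines how text is analyzed and processed, both when documents are indexed and when queries are run. You can define a custom analyzer to meet specific search requirements. Doing so usually involves the following steps: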

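Define character filters: character filters preprocess the raw text before tokenization, for example by stripping HTML tags or replacing specific characters.
Define a tokenizer: the tokenizer splits the text into tokens.
Define token filters: token filters process the tokens further, for example by changing case or removing stop words.
Combine them into an analyzer: the character filters, the tokenizer, and the token filters together make up the custom analyzer. At analysis time they run in that order: character filters first, then the tokenizer, then the token filters in the order they are listed.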
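
To make the order of the stages concrete, here is a minimal plain-Java sketch of the same pipeline. It does not call Elasticsearch or Lucene: the class AnalysisPipelineSketch and its methods are hypothetical names, and the regex-based tag stripping and splitting are simplified stand-ins for a character filter and a tokenizer, not the behavior of the real html_strip filter or standard tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration of the three analysis stages.
// Hypothetical names throughout; this is not Elasticsearch or Lucene code.
class AnalysisPipelineSketch {

    // Stage 1, character filter: rewrite the raw text before tokenizing
    // (here: replace anything that looks like an HTML tag with a space).
    static String stripHtml(String text) {
        return text.replaceAll("<[^>]*>", " ");
    }

    // Stage 2, tokenizer: split the filtered text into tokens
    // (here: break on every run of characters that are not letters or digits).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Stage 3, token filter: transform each token (here: lowercase it).
    static List<String> lowercase(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            result.add(token.toLowerCase(Locale.ROOT));
        }
        return result;
    }

    // The "analyzer" is the composition of the three stages, in this order.
    static List<String> analyze(String text) {
        return lowercase(tokenize(stripHtml(text)));
    }

    public static void main(String[] args) {
        // prints [hello, world]
        System.out.println(analyze("<p>Hello, <b>World</b>!</p>"));
    }
}
```

The point of the sketch is only the composition in analyze(): each stage consumes the output of the previous one, which is why a character filter can remove markup that the tokenizer would otherwise turn into tokens.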

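The following example shows how to define a custom analyzer in Elasticsearch.

Example: Custom Analyzer

Suppose we want to create a custom analyzer that uses the standard tokenizer and adds the lowercase and asciifolding token filters.

1. Define the custom analyzer when creating the index

You can define a custom analyzer in the settings section of the create-index request.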

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}

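2. Use the custom analyzer

In the example above we defined a custom analyzer named my_custom_analyzer and applied it to the content field. Text indexed into content is processed by this analyzer, and so are full-text queries against the field unless a separate search_analyzer is configured.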
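
To give a feel for what this filter chain does to a piece of text, here is a small plain-Java approximation. It is only a sketch under stated assumptions: LowercaseFoldingSketch is a hypothetical class, whitespace splitting stands in for the standard tokenizer, and accent removal via java.text.Normalizer stands in for asciifolding, which in Elasticsearch covers more characters than this.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Approximation of "standard tokenizer + lowercase + asciifolding" in plain Java.
// Hypothetical class; the real filters are implemented inside Lucene.
class LowercaseFoldingSketch {

    // Rough stand-in for asciifolding: decompose accented characters (NFD)
    // and drop the combining marks that the decomposition produces.
    static String foldAccents(String token) {
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Crude stand-in for the standard tokenizer: split on whitespace.
        for (String part : text.split("\\s+")) {
            if (part.isEmpty()) {
                continue;
            }
            // Same filter order as in the analyzer definition: lowercase, then fold.
            tokens.add(foldAccents(part.toLowerCase(Locale.ROOT)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Input is "Café Déjà Vu", written with Unicode escapes; prints [cafe, deja, vu]
        System.out.println(analyze("Caf\u00e9 D\u00e9j\u00e0 Vu"));
    }
}
```

Because the same normalization is applied to both the indexed text and the query text, a search for cafe can match a document that contains Café.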

Example: Defining a Custom Analyzer with the Java API

If you use the Java API (the example below uses RestHighLevelClient), you can define the custom analyzer through Settings.Builder.

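import org.apache.http.HttpHost;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
// The request/response classes from the client package accept a typeless
// mapping via mapping(String, XContentType).
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.client.indices.CreateIndexResponse;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.xcontent.XContentType;

import java.io.IOException;

public class CustomAnalyzerExample {

    public static void main(String[] args) throws IOException {
        // Assumption: a single local node on localhost:9200; replace with your own client configuration.
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));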

        CreateIndexRequest request = new CreateIndexRequest("my_index");
        request.settings(Settings.builder()
                .put("analysis.analyzer.my_custom_analyzer.type", "custom")
                .put("analysis.analyzer.my_custom_analyzer.tokenizer", "standard")
                .putList("analysis.analyzer.my_custom_analyzer.filter", "lowercase", "asciifolding"));

        request.mapping("{\n" +
                "  \"properties\": {\n" +
                "    \"content\": {\n" +
                "      \"type\": \"text\",\n" +
                "      \"analyzer\": \"my_custom_analyzer\"\n" +
                "    }\n" +
                "  }\n" +
                "}", XContentType.JSON);

        CreateIndexResponse createIndexResponse = client.indices().create(request, RequestOptions.DEFAULT);
        System.out.println("Index created: " + createIndexResponse.isAcknowledged());

        client.close();
    }
}
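
The dotted keys passed to Settings.builder() correspond to the nested JSON from the first example: each key is the path from the root of the settings object down to a leaf value. The following plain-Java sketch illustrates that correspondence. SettingsKeySketch, flatten, and exampleSettings are hypothetical names used only for illustration; this is not how Elasticsearch itself parses settings.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrates how nested settings map to dotted keys. Hypothetical helper,
// independent of the Elasticsearch Settings class.
class SettingsKeySketch {

    // Flattens nested maps into "a.b.c" keys; non-map values (including lists) are leaves.
    static Map<String, Object> flatten(Map<String, Object> nested) {
        Map<String, Object> flat = new LinkedHashMap<>();
        flattenInto("", nested, flat);
        return flat;
    }

    @SuppressWarnings("unchecked")
    private static void flattenInto(String prefix, Map<String, Object> node, Map<String, Object> out) {
        for (Map.Entry<String, Object> entry : node.entrySet()) {
            String key = prefix.isEmpty() ? entry.getKey() : prefix + "." + entry.getKey();
            if (entry.getValue() instanceof Map) {
                flattenInto(key, (Map<String, Object>) entry.getValue(), out);
            } else {
                out.put(key, entry.getValue());
            }
        }
    }

    // The "settings" object from the create-index request, built as nested maps.
    static Map<String, Object> exampleSettings() {
        Map<String, Object> analyzer = new LinkedHashMap<>();
        analyzer.put("type", "custom");
        analyzer.put("tokenizer", "standard");
        analyzer.put("filter", Arrays.asList("lowercase", "asciifolding"));

        Map<String, Object> analyzers = new LinkedHashMap<>();
        analyzers.put("my_custom_analyzer", analyzer);

        Map<String, Object> analysis = new LinkedHashMap<>();
        analysis.put("analyzer", analyzers);

        Map<String, Object> settings = new LinkedHashMap<>();
        settings.put("analysis", analysis);
        return settings;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Object> entry : flatten(exampleSettings()).entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

Running it lists three keys, all starting with analysis.analyzer.my_custom_analyzer, which are the same three keys set through put and putList in the Java example.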

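Custom Tokenizers and Token Filters

If you need more complex tokenization logic than the built-in components offer, you can write your own tokenizer or token filter. This usually means writing custom Java classes and packaging them as a plugin.

Example: Custom Token Filter

Suppose we want to create a custom token filter that converts every token to uppercase.

Create the custom token filter class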

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import java.io.IOException;

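// Token filter that uppercases every token in place.
public class UpperCaseFilter extends TokenFilter {
    // Attribute that holds the text of the current token.
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    // Public so that a filter factory in another package can instantiate it.
    public UpperCaseFilter(TokenStream input) {
        super(input);
    }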

    @Override
    public final boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            char[] buffer = termAtt.buffer();
            int length = termAtt.length();
            for (int i = 0; i < length; i++) {
                buffer[i] = Character.toUpperCase(buffer[i]);
            }
            return true;
        } else {
            return false;
        }
    }
}

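Create the plugin

Package the custom token filter as an Elasticsearch plugin and install it on the Elasticsearch nodes. For a classic plugin this typically means a plugin class that implements AnalysisPlugin and registers a factory for the filter in getTokenFilters(); the exact packaging steps depend on the Elasticsearch version.

Use the custom token filter in an index

The type value in the filter definition must be the name under which the plugin registers its filter. Note that uppercase is also the name of a token filter built into Elasticsearch, so the request below works with the built-in filter even when no plugin is installed; a plugin would normally register its filter under a distinct name. The lowercase step is redundant in this chain, because every token is uppercased at the end.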

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "uppercase_filter": {
          "type": "uppercase"
        }
      },
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "uppercase_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}