Flexible Search and Analytics with ElasticsearchRetriever

License: CC 4.0 BY-SA

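Search and information retrieval are core requirements of modern data-driven applications. Elasticsearch is a distributed, RESTful search and analytics engine that supports keyword search, vector search, hybrid search, and complex filtering. This article shows how to use ElasticsearchRetriever for flexible search, with working code examples along the way.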

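Technical Background

Elasticsearch is a multitenant-capable search engine that works with schema-free JSON documents. ElasticsearchRetriever exposes Elasticsearch's functionality through the Query DSL, so keyword matching, vector retrieval, fuzzy matching, and complex filtering are all available through the same retriever.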

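How It Works

ElasticsearchRetriever is a wrapper around Elasticsearch's Query DSL. You supply a function that turns the search string into a request body, and the retriever runs that body against the index. Changing the search strategy therefore only means changing that function.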
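
To make the idea concrete, the sketch below shows such a body-building function on its own, with no Elasticsearch connection involved. The function name `match_body` and the default field name `text` are illustrative assumptions, not part of the library.

```python
import json
from typing import Any, Dict


def match_body(search_query: str, field: str = "text") -> Dict[str, Any]:
    """Build a Query DSL request body for a full-text match on one field."""
    return {"query": {"match": {field: search_query}}}


# The body is an ordinary dict, so it can be inspected or logged before use.
body = match_body("hello world")
print(json.dumps(body, indent=2))
```

Because the body is plain data, it is easy to unit-test query construction separately from the cluster.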

Code Walkthrough

The examples below show how to use ElasticsearchRetriever for different types of search.

Environment Setup

from elasticsearch import Elasticsearch
# Connect to a local Elasticsearch instance
es_url = "http://localhost:9200"
es_client = Elasticsearch(hosts=[es_url])

Creating the Index and Indexing Data

from elasticsearch.helpers import bulk
from langchain_core.documents import Document
from langchain_core.embeddings import Embeddings
from langchain_community.embeddings import DeterministicFakeEmbedding

# Define the sample data
index_name = "test-langchain-retriever"
text_field = "text"
dense_vector_field = "fake_embedding"
num_characters_field = "num_characters"
texts = ["foo", "bar", "world", "hello world", "hello", "foo bar", "bla bla foo"]

# Create the sample index with text, dense-vector, and integer fields
def create_index(es_client, index_name, text_field, dense_vector_field, num_characters_field):
    es_client.indices.create(
        index=index_name,
        mappings={"properties": {
            text_field: {"type": "text"},
            dense_vector_field: {"type": "dense_vector"},
            num_characters_field: {"type": "integer"},
        }},
    )

# Embed the texts and bulk-index them (num_characters_field is read from module scope)
def index_data(es_client, index_name, text_field, dense_vector_field, embeddings, texts, refresh=True):
    create_index(es_client, index_name, text_field, dense_vector_field, num_characters_field)
    vectors = embeddings.embed_documents(list(texts))
    requests = [{
        "_op_type": "index",
        "_index": index_name,
        "_id": i,
        text_field: text,
        dense_vector_field: vector,
        num_characters_field: len(text),
    } for i, (text, vector) in enumerate(zip(texts, vectors))]
    bulk(es_client, requests)
    if refresh:
        es_client.indices.refresh(index=index_name)

embeddings = DeterministicFakeEmbedding(size=3)
index_data(es_client, index_name, text_field, dense_vector_field, embeddings, texts)

Retrieval Examples

Vector Search

from typing import Dict

from langchain_elasticsearch import ElasticsearchRetriever

# Build a kNN request body from the embedded query string
def vector_query(search_query: str) -> Dict:
    vector = embeddings.embed_query(search_query)
    return {
        "knn": {
            "field": dense_vector_field,
            "query_vector": vector,
            "k": 5,
            "num_candidates": 10,
        }
    }

# Instantiate the vector retriever
vector_retriever = ElasticsearchRetriever.from_es_params(
    index_name=index_name,
    body_func=vector_query,
    content_field=text_field,
    url=es_url,
)

# Run the search
results = vector_retriever.invoke("foo")
print(results)
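
For intuition about what a kNN query returns, here is a minimal brute-force sketch in pure Python: rank the stored vectors by cosine similarity to the query vector and keep the top `k`. Elasticsearch's `knn` search is approximate and its similarity function depends on the field mapping, so this illustrates the idea rather than the engine's implementation. The helper names and the sample vectors are made up for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_knn(
    query_vector: Sequence[float],
    doc_vectors: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Score every document against the query and return the k best (id, score) pairs."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in doc_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical three-dimensional vectors, matching the embedding size used in this article.
sample_vectors = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
}
top = brute_force_knn([1.0, 0.0, 0.0], sample_vectors, k=2)
print(top)
```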

Keyword Matching (BM25)

# Build a full-text match request body (scored with BM25)
def bm25_query(search_query: str) -> Dict:
    return {
        "query": {
            "match": {
                text_field: search_query,
            },
        },
    }

# Instantiate the BM25 retriever
bm25_retriever = ElasticsearchRetriever.from_es_params(
    index_name=index_name,
    body_func=bm25_query,
    content_field=text_field,
    url=es_url,
)

# Run the search
results = bm25_retriever.invoke("foo")
print(results)
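
The sample index also stores a `num_characters` field that the examples so far never query. As a sketch of the filtering mentioned earlier, the body function below combines a `match` clause with a `range` filter inside a `bool` query. The function name `filter_query` and the `max_characters` threshold are illustrative assumptions; the field names are repeated as defaults so the snippet runs on its own.

```python
from typing import Any, Dict


def filter_query(
    search_query: str,
    text_field: str = "text",
    num_characters_field: str = "num_characters",
    max_characters: int = 5,
) -> Dict[str, Any]:
    """Match on the text field, keeping only documents at or below a length limit."""
    return {
        "query": {
            "bool": {
                # "must" clauses contribute to the relevance score.
                "must": [{"match": {text_field: search_query}}],
                # "filter" clauses only include or exclude documents.
                "filter": [
                    {"range": {num_characters_field: {"lte": max_characters}}}
                ],
            }
        }
    }


body = filter_query("foo")
print(body)
```

Since the extra parameters all have defaults, a function like this should be usable as `body_func` in `ElasticsearchRetriever.from_es_params` in the same way as `bm25_query`.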

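Use Cases

ElasticsearchRetriever fits many scenarios, such as building search engines, interactive question-answering systems, and other natural language applications. Whether you are extending an existing search feature or building a new application, the flexibility of the Query DSL lets the same retriever cover a wide range of requirements.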

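Practical Recommendations

Choose a suitable indexing strategy: pick index settings that match your data volume and query complexity so that performance holds up.
Combine retrieval methods: use hybrid search and similar techniques to draw on the strengths of each retrieval technique.
Use a good embedding model: the examples here use DeterministicFakeEmbedding as a placeholder; in a real application, a high-quality embedding model can noticeably improve retrieval quality.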
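
One common way to combine rankings from different retrieval methods is reciprocal rank fusion (RRF). The sketch below fuses ranked lists of document IDs on the client side. It is not taken from this article's retriever code and is not a built-in Elasticsearch call; the function name, the sample rankings, and the constant `k = 60` (a conventional default) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked ID lists; each appearance adds 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical result lists, e.g. IDs returned by a vector retriever and a BM25 retriever.
vector_ranking = ["doc_a", "doc_b", "doc_c"]
keyword_ranking = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)
```

Documents that rank well in both lists rise to the top, while documents found by only one method still appear in the fused result.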

If you run into problems, feel free to discuss them in the comments.
