The Definitive Guide to Elasticsearch: Field Collapsing with the Top Hits Aggregation

廉咏燃

Published 2025-06-11 09:07:54

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/gitblog_00676/article/details/148577101

elasticsearch-definitive-guide: The Definitive Guide to Elasticsearch. Project URL: https://gitcode.com/gh_mirrors/el/elasticsearch-definitive-guide

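What Is Field Collapsing

Field collapsing is a common requirement in search: it groups search results by the value of a particular field, so that each group appears once together with its best matches. In a blog system, for example, we may want to show the most relevant blog posts grouped by author name.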

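How It Works

Elasticsearch implements this by combining the terms aggregation with the top_hits aggregation:

terms aggregation: groups documents by the value of the specified field
top_hits aggregation: returns the most relevant documents within each group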
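
To make the division of labor concrete, here is a minimal in-memory sketch in Python of what the two aggregations achieve together. It is not how Elasticsearch implements them internally; the collapse function, the sample documents, and their scores are all invented for illustration.

```python
from collections import defaultdict

def collapse(hits, field, size=5):
    """Group scored hits by `field`, keep the top `size` per group,
    and order the groups by their best score (descending)."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit[field]].append(hit)      # "terms" step: one bucket per value

    buckets = []
    for key, docs in groups.items():
        docs.sort(key=lambda d: d["score"], reverse=True)
        buckets.append({
            "key": key,
            "doc_count": len(docs),
            "top_score": docs[0]["score"],  # best score in the bucket
            "top_hits": docs[:size],        # "top_hits" step
        })
    buckets.sort(key=lambda b: b["top_score"], reverse=True)
    return buckets

# Hypothetical hits; the scores are made up.
hits = [
    {"author": "John Smith", "title": "Relationships", "score": 0.9},
    {"author": "Alice John", "title": "Relationships are cool", "score": 0.5},
    {"author": "John Smith", "title": "More on relationships", "score": 0.4},
]
result = collapse(hits, "author", size=1)
```

With size=1, each author contributes only its single best post, which is the "one entry per group" behavior that field collapsing is after.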

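Preparation

Index Mapping

For grouping to work properly, the grouping field must hold the original, unanalyzed value; otherwise the terms aggregation would bucket on individual tokens rather than on full names. A multi-field mapping is the usual way to keep both forms. The "string" type with "index": "not_analyzed" shown here is the syntax of the older Elasticsearch releases the guide covers; from 5.0 onward the equivalent is a "keyword" sub-field.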

PUT /my_index/_mapping/blogpost
{
  "properties": {
    "user": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "raw": {
              "type":  "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

user.name: the analyzed field, used for full-text search
user.name.raw: the unanalyzed field, used for grouping
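
The difference between the two fields can be sketched in a few lines of Python. The analyze function below is a rough stand-in (lowercase, then split on whitespace) assumed for illustration; the real standard analyzer does more, but the effect on grouping is the same in spirit.

```python
from collections import Counter

def analyze(text):
    """Rough stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()

names = ["John Smith", "Alice John"]

# Grouping on the analyzed field: every token becomes its own bucket.
analyzed_buckets = Counter(tok for name in names for tok in analyze(name))

# Grouping on the unanalyzed field: each full name is a single term.
raw_buckets = Counter(names)
```

Grouping on the analyzed form lumps both authors into one "john" bucket, while the raw form keeps "John Smith" and "Alice John" apart, which is why the aggregation targets user.name.raw.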

Test Data

PUT /my_index/user/1
{
  "name": "John Smith",
  "email": "john@smith.com",
  "dob": "1970/10/24"
}

PUT /my_index/blogpost/2
{
  "title": "Relationships",
  "body": "It's complicated...",
  "user": {
    "id": 1,
    "name": "John Smith"
  }
}

PUT /my_index/user/3
{
  "name": "Alice John",
  "email": "alice@john.com",
  "dob": "1979/01/04"
}

PUT /my_index/blogpost/4
{
  "title": "Relationships are cool",
  "body": "It's not complicated at all...",
  "user": {
    "id": 3,
    "name": "Alice John"
  }
}

Running the Grouped Query

Here is a complete field collapsing query:

GET /my_index/blogpost/_search 
{
  "size" : 0,
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "relationships" }},
        { "match": { "user.name": "John" }}
      ]
    }
  },
  "aggs": {
    "users": {
      "terms": {
        "field": "user.name.raw",
        "order": { "top_score": "desc" }
      },
      "aggs": {
        "top_score": { "max": { "script": "_score" }},
        "blogposts": { "top_hits": { "_source": "title", "size": 5 }}
      }
    }
  }
}

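Key Parameters

"size": 0: return no regular search hits, only the aggregation results
query: matches blog posts whose title contains "relationships" and whose author name contains "John" (both John Smith and Alice John qualify)
terms aggregation: groups by the user's unanalyzed name (user.name.raw)
top_score sub-aggregation: computes the highest document score in each group, which is used to order the groups
top_hits sub-aggregation: returns the titles of the 5 most relevant blog posts in each group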
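
When the same request is issued from application code, it can help to assemble the body from parameters. The following Python sketch only builds the dictionary and prints it as JSON; the function name build_collapse_query and its parameter names are hypothetical, and sending the body to a cluster is left out.

```python
import json

def build_collapse_query(title_text, name_text,
                         group_field="user.name.raw", per_group=5):
    """Return the search body as a plain dict (hypothetical helper)."""
    return {
        "size": 0,  # suppress regular hits; only aggregations come back
        "query": {
            "bool": {
                "must": [
                    {"match": {"title": title_text}},
                    {"match": {"user.name": name_text}},
                ]
            }
        },
        "aggs": {
            "users": {
                "terms": {
                    "field": group_field,
                    "order": {"top_score": "desc"},  # rank buckets by best score
                },
                "aggs": {
                    "top_score": {"max": {"script": "_score"}},
                    "blogposts": {
                        "top_hits": {"_source": "title", "size": per_group}
                    },
                },
            }
        },
    }

body = build_collapse_query("relationships", "John", per_group=3)
print(json.dumps(body, indent=2))
```

Keeping group_field and per_group as arguments makes it easy to reuse the same shape for other grouping fields or group sizes.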

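Interpreting the Results

The response has two main parts:

An empty hits array: because we set "size": 0
The aggregation results:
- One bucket per user
- Each bucket contains:
  - The user's name (key)
  - The number of matching documents (doc_count)
  - That user's most relevant blog posts (blogposts.hits)
  - The highest score in the group (top_score)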
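
Walking that structure in client code might look like the Python sketch below. The helper titles_by_user is hypothetical, and sample_response is a hand-written, abridged response whose scores are invented; it is not real output from a cluster.

```python
def titles_by_user(response):
    """Flatten the 'users' buckets into (name, top_score, [titles]) tuples."""
    rows = []
    for bucket in response["aggregations"]["users"]["buckets"]:
        titles = [hit["_source"]["title"]
                  for hit in bucket["blogposts"]["hits"]["hits"]]
        rows.append((bucket["key"], bucket["top_score"]["value"], titles))
    return rows

# Hand-written, abridged response; the scores are invented, not real output.
sample_response = {
    "hits": {"total": 2, "hits": []},  # empty because "size" was 0
    "aggregations": {
        "users": {
            "buckets": [
                {
                    "key": "John Smith",
                    "doc_count": 1,
                    "top_score": {"value": 0.6},
                    "blogposts": {"hits": {"total": 1, "hits": [
                        {"_id": "2", "_score": 0.6,
                         "_source": {"title": "Relationships"}},
                    ]}},
                },
                {
                    "key": "Alice John",
                    "doc_count": 1,
                    "top_score": {"value": 0.4},
                    "blogposts": {"hits": {"total": 1, "hits": [
                        {"_id": "4", "_score": 0.4,
                         "_source": {"title": "Relationships are cool"}},
                    ]}},
                },
            ]
        }
    },
}

rows = titles_by_user(sample_response)
```

Note that the per-group documents sit two levels down, under blogposts.hits.hits, mirroring the shape of a normal search response nested inside each bucket.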