Search Highlighting
The highlighting feature marks the matched text snippets in search results.
1. What Is a Fragment
In Elasticsearch highlighting, a fragment is a short piece of text extracted from the original field that contains the search keywords. Its purpose is to let users quickly see where the match sits in the original text, rather than returning the entire field content.
The fragments returned by a search are controlled jointly by fragment_size and number_of_fragments.
Feature | Description
---|---
fragment_size | Target character length of each fragment; defaults to 100
number_of_fragments | Maximum number of fragments returned per field; defaults to 5
Fragment content | The text surrounding the matched keywords, with matches wrapped in highlight tags
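In the search response, these fragments appear under each hit's highlight key as an array of strings per field. A simplified, purely illustrative shape (the content field name is taken from the examples below):
"highlight": {
  "content": [
    "...first fragment with the matched <em>keyword</em> wrapped in tags...",
    "...second fragment..."
  ]
}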
2. Hands-On Examples
2.1 Preparing Test Data
First, we create an index named blog_posts and insert some test data:
PUT /blog_posts
{
"mappings": {
"properties": {
"title": { "type": "text" },
"content": { "type": "text" },
"author": { "type": "keyword" },
"views": { "type": "integer" },
"publish_date": { "type": "date" },
"tags": { "type": "keyword" }
}
}
}
POST /blog_posts/_bulk
{"index":{}}
{"title":"Elasticsearch Basics","content":"Learn the basics of Elasticsearch and how to perform simple queries.","author":"John Doe","views":1500,"publish_date":"2023-01-15","tags":["search","database"]}
{"index":{}}
{"title":"Advanced Search Techniques","content":"Explore advanced search techniques in Elasticsearch including aggregations and filters.","author":"Jane Smith","views":3200,"publish_date":"2023-02-20","tags":["search","advanced"]}
{"index":{}}
{"title":"Data Analytics with ELK","content":"How to use the ELK stack for data analytics and visualization.","author":"John Doe","views":2800,"publish_date":"2023-03-10","tags":["analytics","elk"]}
{"index":{}}
{"title":"Elasticsearch Performance Tuning","content":"Tips and tricks for optimizing Elasticsearch performance in production environments.","author":"Mike Johnson","views":4200,"publish_date":"2023-04-05","tags":["performance","optimization"]}
{"index":{}}
{"title":"Kibana Dashboard Guide","content":"Creating effective dashboards in Kibana for monitoring and analysis.","author":"Jane Smith","views":1900,"publish_date":"2023-05-12","tags":["kibana","visualization"]}
2.2 Basic Highlight Syntax
By default no parameters are specified. Even so, fragment_size then defaults to 100 and number_of_fragments defaults to 5.
GET /blog_posts/_search
{
"query": {
"match": {
"content": "Elasticsearch"
}
},
"highlight": {
"fields": {
"content": {}
}
}
}
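With the test data above, the documents whose content mentions Elasticsearch are matched, and each hit's highlight block might look roughly like this (exact fragments depend on the analyzer; shown for illustration, with the default <em> tags):
"highlight": {
  "content": [
    "Learn the basics of <em>Elasticsearch</em> and how to perform simple queries."
  ]
}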
2.3 Custom Highlight Tags
GET /blog_posts/_search
{
"query": {
"match": {
"content": "techniques"
}
},
"highlight": {
"pre_tags": ["<strong>"],
"post_tags": ["</strong>"],
"fields": {
"content": {
"fragment_size": 150,
"number_of_fragments": 3
}
}
}
}
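For the "Advanced Search Techniques" document, the configured pre_tags/post_tags replace the default <em>...</em>, so the highlight might look roughly like:
"highlight": {
  "content": [
    "Explore advanced search <strong>techniques</strong> in Elasticsearch including aggregations and filters."
  ]
}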
2.4 Multi-Field Highlighting
GET /blog_posts/_search
{
"query": {
"multi_match": {
"query": "search",
"fields": ["title", "content"]
}
},
"highlight": {
"fields": {
"title": {},
"content": {
"fragment_size": 100,
"number_of_fragments": 2
}
}
}
}
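A hit that matches in both fields gets one entry per highlighted field; for the "Advanced Search Techniques" document this might look roughly like:
"highlight": {
  "title": [
    "Advanced <em>Search</em> Techniques"
  ],
  "content": [
    "Explore advanced <em>search</em> techniques in Elasticsearch including aggregations and filters."
  ]
}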
2.5 Returning the Full Field Content
Set number_of_fragments to 0 to return the full field content as a single highlighted value. This is suitable for short text such as titles.
GET /blog_posts/_search
{
"query": { "match": { "title": "Kibana" } },
"highlight": {
"fields": {
"title": { "number_of_fragments": 0 }
}
}
}
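Because number_of_fragments is 0, the whole title comes back as one highlighted value instead of a trimmed fragment, roughly:
"highlight": {
  "title": [
    "<em>Kibana</em> Dashboard Guide"
  ]
}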
2.6 Long Text with Multiple Match Points
Update one document to simulate a long-text field with multiple match points:
POST /blog_posts/_update/Nmgc2ZcB9mA5oeTvZT0A
{
"doc": {
"content": "Elasticsearch is a tool. Elasticsearch is fast. Elasticsearch scales well. Repeat: Elasticsearch is a tool."
}
}
GET /blog_posts/_search
{
"query": { "match": { "content": "Elasticsearch" } },
"highlight": {
"fields": {
"content": {
"fragment_size": 30,
"number_of_fragments": 3
}
}
}
}
This shows the first 3 match points (the 4th, repeated match is ignored).
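For the updated document, the highlight block might therefore look roughly like this (exact boundaries depend on the fragmenter):
"highlight": {
  "content": [
    "<em>Elasticsearch</em> is a tool.",
    "<em>Elasticsearch</em> is fast.",
    "<em>Elasticsearch</em> scales well."
  ]
}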
2.7 Limiting Fragment Count and Size
Each fragment is limited to roughly 30 characters, and only 2 fragments are returned (even if there are more matches).
GET /blog_posts/_search
{
"query": { "match": { "content": "Elasticsearch" } },
"highlight": {
"fields": {
"content": {
"fragment_size": 30,
"number_of_fragments": 2
}
}
}
}
If number_of_fragments is not set, it defaults to 5, so here all matching fragments are returned.
GET /blog_posts/_search
{
"query": { "match": { "content": "Elasticsearch" } },
"highlight": {
"fields": {
"content": {
"fragment_size": 30
}
}
}
}
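With the default of 5 in effect, all four occurrences in the updated document can produce fragments, roughly:
"highlight": {
  "content": [
    "<em>Elasticsearch</em> is a tool.",
    "<em>Elasticsearch</em> is fast.",
    "<em>Elasticsearch</em> scales well.",
    "Repeat: <em>Elasticsearch</em> is a tool."
  ]
}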
3. Why Fragment Length Sometimes Exceeds fragment_size
Even when fragment_size is set, the fragments actually returned may be slightly longer. Reasons include:
- Word-integrity protection: Elasticsearch does not cut a word in half, so a fragment extends to the next space or punctuation mark. For example, with fragment_size=20 and the match term "Elasticsearch", the text "This is a test with Elasticsearch and other words" may come back as "test with <em>Elasticsearch</em> and other" (27 characters in practice).
- Highlight tag overhead: HTML highlight tags (such as <em>) add extra characters; these are not counted toward fragment_size.
- Boundary extension: to keep the context readable, Elasticsearch may slightly extend the fragment range.
3.1 Example Analysis
Insert the test data.
PUT /test/_doc/1
{
"text": "Elasticsearch is a distributed search engine. It is built on top of Lucene. Elasticsearch provides powerful full-text search capabilities. Many companies use Elasticsearch for log analytics."
}
3.1.1 Query 1: Strictly Limiting Fragment Size
GET /test/_search
{
"query": { "match": { "text": "Elasticsearch" } },
"highlight": {
"fields": {
"text": {
"fragment_size": 20,
"number_of_fragments": 2
}
}
}
}
Possible response:
"highlight": {
"text": [
"<em>Elasticsearch</em> is a", // 实际21字符(含空格)
"use <em>Elasticsearch</em> for" // 实际22字符
]
}
Note: although fragment_size=20, the fragments run slightly over so that words stay intact.
3.1.2 Query 2: Observing Fragment Extension
GET /test/_search
{
"query": { "match": { "text": "search" } },
"highlight": {
"fields": {
"text": {
"fragment_size": 10,
"number_of_fragments": 1
}
}
}
}
Sample response:
"highlight": {
"text": [
"distributed <em>search</em> engine" // 实际25字符(远超10)
]
}
Reason: a minimum amount of context must be kept before and after the matched term "search", so the fragment cannot be truncated strictly.
3.2 Key Takeaways
- A fragment is a piece of text surrounding the matched term, used to show the keyword in context.
- fragment_size is a target value; the actual length may exceed it in order to:
  - keep words intact
  - include the highlight tags
  - preserve a minimum amount of context
- When a strict limit is required, you can combine "type": "plain" with "boundary_scanner": "chars" (which may break words apart):
"highlight": {
  "fields": {
    "text": {
      "fragment_size": 20,
      "number_of_fragments": 1,
      "type": "plain",               // disable smart fragmenting
      "boundary_scanner": "chars"    // cut by characters rather than words
    }
  }
}
This design balances precise control against readability. If your application is sensitive to fragment length, it is advisable to post-process the fragments on the backend.