Elasticsearch Exam Prep: Enrich Processor

Article link: https://blog.youkuaiyun.com/hengzhepa/article/details/141779937

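1. Problem

The cluster has two indices, index_a and index_b.

index_a has the fields field_a, title, and date.

index_b has the fields field_a, author, and publisher.

Reindex into a new index, index_c, that contains all the data from index_a plus two added fields, author and publisher, taken from the index_b document whose field_a has the same value.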

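2. Analysis

The problem tests two main topics: the Enrich processor in an ingest pipeline, and Reindex. It asks us to match field_a in index_a against field_a in index_b and, for each match, carry the index_b fields over into the new index index_c. This is much like a join between two database tables. So why use Enrich rather than some other operation? Because the enrich processor is the ingest-time mechanism for adding data from an existing index to incoming documents, and a reindex can run every document through an ingest pipeline.

Look at the official enrich flow diagram. It involves three indices: the source index, the incoming index (the documents being ingested), and the target index. "Enrich" means just that: an enrich policy and an enrich processor match the incoming documents against the source index and combine the result into the enriched content of the target index. In this problem the source index is index_b, the incoming documents come from index_a, and the target index is index_c. We split the work into three steps:

Create the enrich policy
Execute the policy
Create a pipeline that uses an enrich processor and names the policy to apply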

3. Solution

Step 0. Preparation

Create index_a and index_b and load sample data.

# Create index_a
# DELETE index_a
PUT index_a
{
  "mappings":{
    "properties":{
      "field_a":{
        "type":"keyword"
      },
      "title":{
        "type":"keyword"
      },
      "date":{
        "type":"date",
        "format": "yyyy-MM-dd"
      }
    }
  }
}

# Index sample data (field_a must equal a field_a value in index_b for the enrich lookup to match)
POST index_a/_bulk
{"index":{"_id":1}}
{"field_a":"aaa", "title":"elasticsearch in action", "date":"2017-07-01"}

# Create index_b
# DELETE index_b
PUT index_b
{
  "mappings": {
    "properties": {
      "field_a": {
        "type": "keyword"
      },
      "author": {
        "type": "keyword"
      },
      "publisher": {
        "type": "keyword"
      }
    }
  }
}
 
# Index sample data
POST index_b/_bulk
{"index":{"_id":1}}
{"field_a":"aaa", "author":"jerry", "publisher":"qinghua"}

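Step 1. Create the enrich policy

Parameters:

indices: one or more source indices
match_field: the field in the source index used for matching
enrich_fields: the fields to copy from the matching source document into the incoming document

Notes:

The request is a PUT to _enrich/policy/{policy_name}
The policy type here is match; for matching on geographic location, use geo_match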

# DELETE _enrich/policy/data-policy
PUT /_enrich/policy/data-policy
{
  "match": {
    "indices": "index_b",
    "match_field": "field_a",
    "enrich_fields": ["author","publisher"]
  }
}
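
Before executing the policy, it can help to confirm that it was stored as intended. A minimal check, assuming the policy name data-policy used above, is the get enrich policy API:

```
# Return the stored definition of the data-policy enrich policy
GET /_enrich/policy/data-policy
```

The response should echo the match policy with its indices, match_field, and enrich_fields, which makes a typo in a field name easy to spot before moving on.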

Step 2. Execute the policy

Notes:

The request is a PUT
Append _execute after the policy name; note the leading underscore

PUT _enrich/policy/data-policy/_execute
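
Executing a policy builds a separate enrich index from the source index as it is at that moment, so later changes to index_b are not seen by the enrich processor until the policy is executed again. A small sketch of that workflow follows; the extra record (id 2, "bbb", "tom", "example press") is hypothetical and only there for illustration:

```
# Hypothetical extra record added to the source index after the first execution
POST index_b/_doc/2?refresh=true
{"field_a":"bbb", "author":"tom", "publisher":"example press"}

# Execute the policy again so the enrich index is rebuilt with the new record
PUT _enrich/policy/data-policy/_execute
```

The refresh=true parameter is used here so the new document is searchable before the policy runs.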

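Step 3. Create the ingest pipeline

The enrich processor stores the matched data as a JSON object under target_field. The append processors copy individual properties of that object up to the top level of the document, and remove then deletes the now-redundant matched object.

Parameters:

1) enrich parameters:

policy_name: the name of the enrich policy to use
field: the field in the incoming document whose value is matched against the policy's match_field
target_field: the field that receives the matched data
max_matches: the maximum number of matched documents to include under target_field. If max_matches is greater than 1, target_field becomes a JSON array; otherwise it becomes a JSON object. To keep documents from growing too large, the maximum allowed value is 128.

2) append parameters:

field: the field to append to (created as an array if it does not exist yet)
value: the value to append

Note: reference another field's value with the {{xxx}} template syntax

3) remove: field is the field to delete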

# DELETE /_ingest/pipeline/data_lookup
PUT /_ingest/pipeline/data_lookup
{
  "processors": [
    {
      "enrich": {
        "policy_name": "data-policy",
        "field": "field_a",
        "target_field": "field_a_target",
        "max_matches": 1
      }
    },
    {
      "append": {
        "field": "publisher",
        "value": "{{field_a_target.publisher}}"
      }
    },
    {
      "append": {
        "field": "author",
        "value": "{{field_a_target.author}}"
      }
    },
    {
     "remove": {
       "field": "field_a_target"
     }
    }
  ]
}
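
The pipeline can be tried out before running the reindex. The sketch below uses the simulate pipeline API with one hand-written test document; it assumes the policy has already been executed and that index_b holds a document with field_a "aaa":

```
# Run a sample document through data_lookup without indexing anything
POST /_ingest/pipeline/data_lookup/_simulate
{
  "docs": [
    {
      "_source": {
        "field_a": "aaa",
        "title": "elasticsearch in action",
        "date": "2017-07-01"
      }
    }
  ]
}
```

If the lookup matches, the simulated document should show author and publisher at the top level (as single-element arrays, because append creates an array when the field does not exist) and no field_a_target. A document whose field_a has no match gets no field_a_target at all, so the remove processor would fail on it unless ignore_missing is set.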

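Step 4. Use reindex with the pipeline to create the new index index_c

Parameters:

source: the index to read from
dest: the index to write to; its pipeline setting names the ingest pipeline the reindex runs documents through

Notes:

Both source and dest need an index
pipeline goes inside dest
The request is a POST
The destination index does not have to exist beforehand; reindex will not fail. Create it in advance if you want precise mapping types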
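
If exact mapping types matter, index_c can be created before the reindex request. This is a sketch that reuses the field types from index_a and index_b in Step 0; treating author and publisher as keyword in index_c is an assumption carried over from index_b:

```
# Optional: create index_c up front so the enriched fields get explicit types
PUT index_c
{
  "mappings": {
    "properties": {
      "field_a":   { "type": "keyword" },
      "title":     { "type": "keyword" },
      "date":      { "type": "date", "format": "yyyy-MM-dd" },
      "author":    { "type": "keyword" },
      "publisher": { "type": "keyword" }
    }
  }
}
```

Without this step, dynamic mapping decides the types of the new fields when the first enriched document arrives.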

# DELETE index_c
POST _reindex
{
  "source": {"index": "index_a"},
  "dest": {"index": "index_c", "pipeline": "data_lookup"}
}

Step 5. Query index_c to verify the result

GET index_c/_search
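
To narrow the check to documents that were actually enriched, a more targeted search can help. A minimal sketch, assuming the pipeline wrote the author field as described above:

```
# Return only documents that received an author value from the enrich lookup
GET index_c/_search
{
  "query": {
    "exists": { "field": "author" }
  }
}
```

Comparing the hit count here with the total document count of index_a gives a quick sense of how many documents found a match in index_b.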

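4. Summary

1. Enrich is a bit like a table join in SQL: the goal is a data structure with richer content.

2. Enrich is easy to follow once you have that diagram in mind. Remember the three steps: 1) create the policy, 2) execute the policy, 3) create an ingest pipeline that applies enrich.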
