[Elasticsearch] Pipeline configuration for the ingest attachment plugin, for searching Word and PDF files

Single attachment:

1. Create the attachment pipeline

Define the text extraction pipeline

PUT /_ingest/pipeline/attachment
{
    "description": "Extract attachment information",
    "processors": [
        {
            "attachment": {
                "field": "content",
                "ignore_missing": true
            }
        },
        {
            "remove": {
                "field": "content"
            }
        }
    ]
}

2. Create the index

Create the knowbase index

PUT /knowbase
{
  "mappings": {
    "properties": {
      "esId": {
        "type": "keyword"
      },
      "assortId": {
        "type": "long"
      },
      "title":{
        "type": "text",
        "analyzer": "ik_max_word",
        "copy_to": "all"
      },
      "articleContent":{
        "type": "text",
        "analyzer": "ik_max_word",
        "copy_to": "all"
      },
      "viewNum": {
        "type": "long"
      },
      "version": {
        "type": "long"
      },
      "label": {
        "type": "keyword"
      },
      "code": {
        "type": "keyword"
      },
      "tenantId": {
        "type": "keyword"
      },
      "releaseTime": {
        "type": "date"
      },
      "createBy": {
        "type": "keyword",
        "index": false
      },
      "createTime": {
        "type": "date"
      },
      "all":{
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "attachment": {
        "properties": {
          "content":{
            "type": "text",
            "analyzer": "ik_smart",
            "copy_to": "all"
          }
        }
      }
    }
  }
}

3. Index data

Insert a Word document

POST /knowbase/_doc?pipeline=attachment
{
  "name": "知识库文档2.0",
  "type": "word",
  "content": "<base64-encoded document>"
}

Also: to convert a file to base64, see base64.guru
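The base64 step can also be scripted instead of going through a website. A minimal sketch in Python using only the standard library; the function name build_single_attachment_doc is hypothetical, and the field names name, type, and content simply follow the request above.

```python
import base64
import json


def build_single_attachment_doc(name: str, doc_type: str, data: bytes) -> dict:
    """Build the body for POST /knowbase/_doc?pipeline=attachment."""
    # The attachment processor reads the file as a base64 string from "content".
    encoded = base64.b64encode(data).decode("ascii")
    return {"name": name, "type": doc_type, "content": encoded}


if __name__ == "__main__":
    # Placeholder bytes; in practice read the file with open(path, "rb").read().
    body = build_single_attachment_doc("知识库文档2.0", "word", b"example bytes")
    print(json.dumps(body, ensure_ascii=False))
```

The printed JSON can be pasted as the request body, or sent with any HTTP client.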

4. Query

Query all documents

GET /knowbase/_search
{
	"query": {
		"match_all": {}
	}
}
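Matching everything is mainly a check that indexing worked. Because the mapping copies title, articleContent, and attachment.content into the all field, a keyword search would more likely target all. A minimal sketch in Python that only builds the request body for GET /knowbase/_search; the function name build_all_field_query and the choice to exclude attachment.content from _source are assumptions for illustration, not part of the original setup.

```python
import json


def build_all_field_query(keywords: str, size: int = 10) -> dict:
    """Build a full-text query against the "all" field from the mapping."""
    return {
        "size": size,
        # title, articleContent and attachment.content are copied into "all".
        "query": {"match": {"all": keywords}},
        # Assumption: leave the extracted text out of the response to keep it small.
        "_source": {"excludes": ["attachment.content"]},
    }


if __name__ == "__main__":
    print(json.dumps(build_all_field_query("知识库"), ensure_ascii=False, indent=2))
```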

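Multiple attachments:

1. Create the attachment pipeline

Define the text extraction pipeline for multiple attachments (it reuses the name attachment, so it replaces the single-attachment pipeline defined earlier)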

PUT /_ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "attachment": {
            "field": "_ingest._value.content",
            "target_field": "_ingest._value.attachment",
            "remove_binary": true
          }
        }
      }
    }
  ]
}

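Note that with multiple attachments, field and target_field must be written as _ingest._value.*; otherwise they will not resolve to the correct fields.
Starting with ES 8.0, removing the binary file content only requires setting the remove_binary property to true on the attachment processor, so there is no need for a separate remove processor as in the single-attachment pipeline.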
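The two variants described here can be compared side by side. A minimal sketch in Python that builds the pipeline body either with remove_binary or, for clusters where that option is not available, with a second foreach that runs a remove processor on each element. The function name build_multi_attachment_pipeline and the use_remove_binary flag are hypothetical; the fallback variant is an assumption based on how foreach exposes each element as _ingest._value, and is not taken from the original post.

```python
import json


def build_multi_attachment_pipeline(use_remove_binary: bool = True) -> dict:
    """Build the body for PUT /_ingest/pipeline/attachment (multiple attachments)."""
    attachment = {
        # Inside foreach, the current array element is exposed as _ingest._value.
        "field": "_ingest._value.content",
        "target_field": "_ingest._value.attachment",
    }
    if use_remove_binary:
        attachment["remove_binary"] = True

    processors = [
        {"foreach": {"field": "attachments", "processor": {"attachment": attachment}}}
    ]
    if not use_remove_binary:
        # Assumed fallback: drop the base64 payload from each element separately.
        processors.append(
            {
                "foreach": {
                    "field": "attachments",
                    "processor": {"remove": {"field": "_ingest._value.content"}},
                }
            }
        )
    return {"description": "Extract attachment information", "processors": processors}


if __name__ == "__main__":
    print(json.dumps(build_multi_attachment_pipeline(False), indent=2))
```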

2. Create the index

Create the knowledge base index (multiple attachments)

PUT /knowbase
{
  "mappings": {
    "properties": {
      "esId": {
        "type": "keyword"
      },
      "assortId": {
        "type": "long"
      },
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "copy_to": "all"
      },
      "articleContent": {
        "type": "text",
        "analyzer": "ik_max_word",
        "copy_to": "all"
      },
      "viewNum": {
        "type": "long"
      },
      "version": {
        "type": "long"
      },
      "label": {
        "type": "keyword"
      },
      "code": {
        "type": "keyword"
      },
      "tenantId": {
        "type": "keyword"
      },
      "releaseTime": {
        "type": "date"
      },
      "createBy": {
        "type": "keyword",
        "index": false
      },
      "createTime": {
        "type": "date"
      },
      "all": {
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "attachments": {
        "properties": {
          "attachment": {
            "properties": {
              "content":{
                "type": "text",
                "analyzer": "ik_smart",
                "copy_to": "all"
              }
            }
          }
        }
      }
    }
  }
}

3. Index data

Insert Word documents (multiple attachments)

POST /knowbase/_doc?pipeline=attachment
{
  "name": ["知识库文档2.0", "test知识库文档2.0"],
  "type": "word",
  "attachments": [
    {"content": "<base64-encoded document>"},
    {"content": "<base64-encoded document>"}
  ]
}
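Building this body by hand gets tedious once there are several files. A minimal sketch in Python that encodes each file and places it under attachments, matching the field the foreach processor iterates over; the function name build_multi_attachment_doc is hypothetical and the file contents below are placeholders.

```python
import base64
import json


def build_multi_attachment_doc(names: list, doc_type: str, files: list) -> dict:
    """Build the body for POST /knowbase/_doc?pipeline=attachment (multiple files)."""
    # One object per file, each carrying its base64 payload in "content".
    attachments = [
        {"content": base64.b64encode(data).decode("ascii")} for data in files
    ]
    return {"name": list(names), "type": doc_type, "attachments": attachments}


if __name__ == "__main__":
    body = build_multi_attachment_doc(
        ["知识库文档2.0", "test知识库文档2.0"],
        "word",
        [b"first file bytes", b"second file bytes"],  # placeholder payloads
    )
    print(json.dumps(body, ensure_ascii=False))
```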

4. Query

Query all documents

GET /knowbase/_search
{
	"query": {
		"match_all": {}
	}
}

Query result:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "knowbase",
        "_id": "KYfN9oUBJjbl-1BDeeXL",
        "_score": 1,
        "_source": {
          "name": [
            "知识库文档2.0",
            "test知识库文档2.0"
          ],
          "attachments": [
            {
              "attachment": {
                "date": "2023-01-28T02:12:00Z",
                "content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
                "author": "tong shaoqing",
                "modifier": "tong shaoqing",
                "modified": "2023-01-28T02:12:00Z",
                "language": "lt",
                "content": """document content""",
                "content_length": 30
              }
            },
            {
              "attachment": {
                "date": "2023-01-28T02:12:00Z",
                "content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
                "author": "tong shaoqing",
                "modifier": "tong shaoqing",
                "modified": "2023-01-28T02:12:00Z",
                "language": "lt",
                "content": """document content""",
                "content_length": 30
              }
            }
          ],
          "type": "word"
        }
      }
    ]
  }
}
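In this response the extracted text and metadata sit under _source.attachments[].attachment, and the original base64 content field is gone. A minimal sketch in Python for flattening such a response into one row per attachment; the function name extract_attachments and the selection of fields are assumptions for illustration, and the sample dictionary is a trimmed stand-in for a real response.

```python
def extract_attachments(response: dict) -> list:
    """Flatten a _search response into one row per extracted attachment."""
    rows = []
    for hit in response.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        for item in source.get("attachments", []):
            meta = item.get("attachment", {})
            rows.append(
                {
                    "_id": hit.get("_id"),
                    "content_type": meta.get("content_type"),
                    "content_length": meta.get("content_length"),
                    "content": meta.get("content"),
                }
            )
    return rows


if __name__ == "__main__":
    # Trimmed sample shaped like the response above.
    sample = {
        "hits": {
            "hits": [
                {
                    "_id": "doc-1",
                    "_source": {
                        "attachments": [
                            {"attachment": {"content": "text", "content_length": 4}}
                        ]
                    },
                }
            ]
        }
    }
    for row in extract_attachments(sample):
        print(row["_id"], row["content_length"])
```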

References:

https://blog.youkuaiyun.com/catoop/article/details/124611260
https://www.jianshu.com/p/774e5ed120ba
