Creating an Elasticsearch index (mappings and settings)

Original article, last modified 2025-06-25 09:27:56

License: CC 4.0 BY-SA

Tags: #elasticsearch #database

First published 2024-03-31 19:13:06

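This article shows how to create the person_news index in Elasticsearch. It covers the field definitions (such as companyName and newsTitle), the data structure (such as the nested field), and how different analyzers (ik_max_word, ik_smart, and standard) tokenize Chinese text. Examples show how to insert documents and how to inspect the output of each analyzer.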

1. First, define the index as follows:

PUT /person_news
{
  "settings": {
    "index": {
      "number_of_shards": "3",
      "number_of_replicas": "0",
      "max_result_window": "2000000000"
    }
  },
  "mappings": {
    "properties": {
      "companyName": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "newsSource": {
        "type": "keyword"
      },
      "newsContent": {
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "newsTitle": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "labels": {
        "type": "keyword"
      },
      "personInfo": {
        "type": "nested",
        "properties": {
          "personName": {
            "type": "keyword"
          },
          "age": {
            "type": "integer"
          }
        }
      },
      "hotPoint": {
        "type": "long"
      }
    }
  }
}

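person_news is an index of news items and the people mentioned in them. companyName is the company name: a text field analyzed with the IK analyzer (ik_max_word), plus a keyword sub-field (companyName.keyword) that is not analyzed and can be used for aggregations and exact matching;
newsSource is the news source, not analyzed;
newsContent is the news body, analyzed;
newsTitle is the news title, analyzed, with a keyword sub-field (same as companyName);
labels holds the tags, not analyzed. This field is meant to store an array, since one news item can carry several labels (see the inserted documents below);
personInfo holds the people mentioned in the news item. It is mapped as nested and stores an array of objects, each with a personName and an age field;
hotPoint is the popularity score of the news item, typically used to sort the news.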
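
To illustrate why personInfo is mapped as nested rather than left as a plain array of objects, here is a minimal Python sketch that simulates the two matching behaviours in memory. It does not call Elasticsearch. The helper names flattened_match and nested_match are made up for this illustration, and the sample data is the personInfo array of document 3 from the insert step of this article. The sketch assumes the documented behaviour that a plain object array is indexed as independent per-field value lists, while nested keeps each object as a separate hidden document.

```python
# Simulation only: mimics how an array of objects is matched when it is
# flattened (default "object" mapping) versus mapped as "nested".

def flattened_match(person_info, name, age):
    """Flattened arrays lose the pairing between fields:
    each condition is checked against the values of ALL objects."""
    names = {p["personName"] for p in person_info}
    ages = {p["age"] for p in person_info}
    return name in names and age in ages


def nested_match(person_info, name, age):
    """Nested objects are matched one at a time, so both
    conditions must hold inside the SAME object."""
    return any(p["personName"] == name and p["age"] == age for p in person_info)


# personInfo of document 3 in this article
doc3_person_info = [
    {"personName": "张勇", "age": 54},
    {"personName": "夏海钧", "age": 59},
]

# No object says "张勇, aged 59", yet the flattened view still matches.
print(flattened_match(doc3_person_info, "张勇", 59))  # True (false positive)
print(nested_match(doc3_person_info, "张勇", 59))     # False
```

In Elasticsearch itself, the per-object behaviour is obtained by wrapping the conditions in a nested query with "path": "personInfo".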

2. Insert data

PUT person_news/_doc/1
{
  "companyName": "中国恒大有限责任公司",
  "newsSource": "新华社",
  "newsContent": "今日中国证监会对中国恒大董事长许家印罚款4000万，并对其做出终身不能入市的处罚规定，其公司其他高管夏海钧也被做出相应处罚",
  "newsTitle": "恒大许家印被罚",
  "labels": [
    "恒大",
    "许家印"
  ],
  "personInfo": [
    {
      "personName": "许家印",
      "age": 60
    },
    {
      "personName": "夏海钧",
      "age": 59
    }
  ],
  "hotPoint": 1
}

PUT person_news/_doc/2
{
  "companyName": "阿里巴巴有限责任公司",
  "newsSource": "新华社",
  "newsContent": "今日阿里公司集团董事长张勇卸任，由蔡崇信接任",
  "newsTitle": "阿里张勇卸任",
  "labels": [
    "阿里",
    "蔡崇信",
    "张勇"
  ],
  "personInfo": [
    {
      "personName": "张勇",
      "age": 60
    },
    {
      "personName": "蔡崇信",
      "age": 54
    }
  ],
  "hotPoint": 2
}

PUT person_news/_doc/3
{
  "companyName": "中国恒大有限责任公司",
  "newsSource": "路透社",
  "newsContent": "中国恒大董事长传闻跳楼，恒大资产负债高达几万亿，传闻阿里张勇将对恒大进行投资，进军房地产，具体消息恒大高管夏海钧予以否认",
  "newsTitle": "恒大董事长许家印",
  "labels": [
    "恒大",
    "张勇"
  ],
  "personInfo": [
    {
      "personName": "张勇",
      "age": 54
    },
    {
      "personName": "夏海钧",
      "age": 59
    }
  ],
  "hotPoint": 3
}
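
The three PUT requests above can also be combined into a single _bulk request. The following is a minimal Python sketch, using only the standard library, that builds the newline-delimited JSON body the _bulk API expects: one action line followed by one source line per document, with a trailing newline. The function name build_bulk_body is hypothetical, and the documents are trimmed to a few fields to keep the example short. Sending the body (for example as POST /_bulk with Content-Type: application/x-ndjson) is left out.

```python
import json


def build_bulk_body(index_name, docs):
    """Return an NDJSON string for the _bulk API: one action line
    followed by one source line per document, newline-terminated."""
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # ensure_ascii=False keeps the Chinese text readable in the payload
        lines.append(json.dumps(source, ensure_ascii=False))
    return "\n".join(lines) + "\n"  # _bulk requires the final newline


# Trimmed-down versions of documents 1 and 2 (only a few fields kept)
docs = {
    "1": {"companyName": "中国恒大有限责任公司", "newsSource": "新华社", "hotPoint": 1},
    "2": {"companyName": "阿里巴巴有限责任公司", "newsSource": "新华社", "hotPoint": 2},
}

body = build_bulk_body("person_news", docs)
print(body)
```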

3. In Kibana, the _analyze API shows how a given analyzer tokenizes a piece of text (here ik_max_word, which performs the finest-grained segmentation):

GET /person_news/_analyze
{
  "analyzer": "ik_max_word",
  "text": "中国恒大有限责任公司"
}

The result is:

{
  "tokens" : [
    {
      "token" : "中国",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "恒",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "大有",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "有限责任",
      "start_offset" : 4,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "有限",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "责任",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "公司",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 6
    }
  ]
}

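With ik_smart (coarsest-grained segmentation), that is, the same request with "analyzer": "ik_smart", the result is: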

{
  "tokens" : [
    {
      "token" : "中国",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "恒",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "大",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "有限责任",
      "start_offset" : 4,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "公司",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}
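
One way to summarize the difference between the two IK outputs is that ik_max_word emits tokens whose character ranges overlap, while ik_smart emits a non-overlapping split. The following Python sketch checks this on the offsets copied from the two results above. The function name has_overlap is hypothetical, and the token lists keep only the token text and offsets.

```python
def has_overlap(tokens):
    """Return True if any two tokens cover overlapping character ranges."""
    spans = sorted((t["start_offset"], t["end_offset"]) for t in tokens)
    # After sorting by start offset, checking neighbours is sufficient.
    return any(
        next_start < prev_end
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:])
    )


def make_tokens(rows):
    """Build token dicts from (token, start_offset, end_offset) tuples."""
    return [{"token": t, "start_offset": s, "end_offset": e} for t, s, e in rows]


# Offsets copied from the ik_max_word result in this article
ik_max_word_tokens = make_tokens([
    ("中国", 0, 2), ("恒", 2, 3), ("大有", 3, 5), ("有限责任", 4, 8),
    ("有限", 4, 6), ("责任", 6, 8), ("公司", 8, 10),
])

# Offsets copied from the ik_smart result in this article
ik_smart_tokens = make_tokens([
    ("中国", 0, 2), ("恒", 2, 3), ("大", 3, 4), ("有限责任", 4, 8), ("公司", 8, 10),
])

print(has_overlap(ik_max_word_tokens))  # True
print(has_overlap(ik_smart_tokens))     # False
```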

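With Elasticsearch's built-in default analyzer (standard), the result is as follows (Chinese text is split into individual characters):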

GET /person_news/_analyze
{
  "analyzer": "standard",
  "text": "中国恒大有限责任公司"
}

{
  "tokens" : [
    {
      "token" : "中",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "国",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "恒",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "大",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "有",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "限",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "责",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    },
    {
      "token" : "任",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "<IDEOGRAPHIC>",
      "position" : 7
    },
    {
      "token" : "公",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<IDEOGRAPHIC>",
      "position" : 8
    },
    {
      "token" : "司",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "<IDEOGRAPHIC>",
      "position" : 9
    }
  ]
}
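
The standard analyzer output above is simple enough to reproduce in a few lines. The following Python sketch is only an approximation that holds for text made up entirely of Chinese characters, such as the sample string. It is not how the standard analyzer is implemented, and the function name per_character_tokens is hypothetical.

```python
def per_character_tokens(text):
    """Approximate the standard analyzer on Han-only text:
    one token per character, with matching offsets and positions."""
    return [
        {"token": ch, "start_offset": i, "end_offset": i + 1, "position": i}
        for i, ch in enumerate(text)
    ]


tokens = per_character_tokens("中国恒大有限责任公司")
print([t["token"] for t in tokens])
```

Because every character becomes its own term, a match query for 恒大 on a field analyzed this way is turned into the two single-character terms 恒 and 大, which is why an analyzer such as IK is usually preferred for Chinese text.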