Elasticsearch Field Types

Latest recommended article published on 2024-10-15 16:22:55

土豆TM

License: CC 4.0 BY-SA

Categories: Elasticsearch, Elasticsearch hands-on tutorial (elasticserch实战教程). Tags: Elasticsearch, datatype

Article link: https://blog.youkuaiyun.com/tudou201601/article/details/83245854

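This article describes the field types in Elasticsearch in detail: core types such as string, text, keyword, numeric, and date; complex types such as array, object, and nested; geo types such as geo-point and geo-shape; and specialised types such as IP, completion, and token count. Understanding these field types is essential to using Elasticsearch effectively.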

To use Elasticsearch with confidence, you need to know which field types a mapping supports and which scenarios each type is suited to.

Core types
Complex types
Geo types
Specialised types
Multi-fields

1. Mapping parameters

Parameter descriptions

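analyzer	The analyzer used for analyzed string fields, both at index time and at search time (unless overridden by search_analyzer). Defaults to the default index analyzer, or the standard analyzer.
boost	Mapping field-level query-time boosting. Accepts a floating-point number, defaults to 1.0.
coerce	Try to convert strings to numbers and truncate fractions for integers. Accepts true (default) and false.
eager_global_ordinals	Should global ordinals be loaded eagerly on refresh? Accepts true or false (default). Enabling this is a good idea on fields that are frequently used for (significant) terms aggregations.
enabled	Whether the JSON value given for the object field should be parsed and indexed (true, default) or completely ignored (false).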
enable_position_increments	Indicates if position increments should be counted. Set to false if you don’t want to count tokens removed by analyzer filters (like stop). Defaults to true.
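dynamic	Whether new properties should be added dynamically to an existing object. Accepts true (default), false, and strict.
doc_values	Should the field be stored on disk in a column-stride fashion, so that it can later be used for sorting, aggregations, or scripting? Accepts true (default) or false.
fielddata	Can the field use in-memory fielddata for sorting, aggregations, or scripting? Accepts true or false (default).
fielddata_frequency_filter	Expert setting that allows deciding which values to load into memory when fielddata is enabled. By default all values are loaded.
fields	Multi-fields allow the same string value to be indexed in multiple ways for different purposes, such as one field for search and a multi-field for sorting and aggregations, or the same string value analyzed by different analyzers.
format	The date format(s) that can be parsed. Defaults to strict_date_optional_time||epoch_millis.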
ignore_above	Do not index any string longer than this value. Defaults to 2147483647 so that all values would be accepted.
ignore_malformed	If true, malformed numbers are ignored. If false (default), malformed numbers throw an exception and reject the whole document.
ignore_z_value	If true (default) three dimension points will be accepted (stored in source) but only latitude and longitude values will be indexed; the third dimension is ignored. If false, geo-points containing any more than latitude and longitude (two dimensions) values throw an exception and reject the whole document.
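include_in_all	Whether the field value should be included in the _all field. Accepts true or false. Defaults to false if index is set to no, or if a parent object field sets include_in_all to false. Otherwise defaults to true.
index	Should the field be searchable? Accepts true (default) or false.
index_options	What information should be stored in the index, for search and highlighting purposes. Defaults to positions.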
index_prefixes	If enabled, term prefixes of between 2 and 5 characters are indexed into a separate field. This allows prefix searches to run more efficiently, at the expense of a larger index. Accepts an index-prefix configuration block
index_phrases	If enabled, two-term word combinations (shingles) are indexed into a separate field. This allows exact phrase queries to run more efficiently, at the expense of a larger index. Note that this works best when stopwords are not removed, as phrases containing stopwords will not use the subsidiary field and will fall back to a standard phrase query. Accepts true or false (default).
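locale	The locale to use when parsing dates, since months do not have the same names or abbreviations in all languages. Defaults to the ROOT locale.
normalizer	[experimental] This functionality is experimental and may be changed or removed completely in a future release. Elastic will take a best effort approach to fix any issues, but experimental features are not subject to the support SLA of official GA features. How to pre-process the keyword prior to indexing. Defaults to null, meaning the keyword is kept as-is.
norms	Whether field length should be taken into account when scoring queries. Accepts true (default) or false.
null_value	Accepts a value of the field's own type (for a boolean field, any of the true or false values) that replaces any explicit null values. Defaults to null, which means the field is treated as missing.
position_increment_gap	The number of fake term positions that should be inserted between each element of an array of strings. Defaults to the position_increment_gap configured on the analyzer, which defaults to 100. 100 was chosen because it prevents phrase queries with reasonably large slops (less than 100) from matching terms across field values.
properties	The fields within the object, which can be of any datatype, including object. New properties may be added to an existing object.
store	Whether the field value should be stored and retrievable separately from the _source field. Accepts true or false (default).
search_analyzer	The analyzer that should be used at search time on analyzed fields. Defaults to the analyzer setting.
search_quote_analyzer	The analyzer that should be used at search time when a phrase is encountered. Defaults to the search_analyzer setting.
similarity	Which scoring algorithm or similarity should be used. Defaults to BM25.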
split_queries_on_whitespace	Whether full text queries should split the input on whitespace when building a query for this field. Accepts true or false (default).
term_vector	Whether term vectors should be stored for an analyzed field. Defaults to no.

Datatype and parameter cross-reference table

Percolator, Join, Alias, and Array have no parameters in this table.

	Binary	Range	Boolean	Date	IP	Keyword	Nested	Numeric	Object	Text	Token count
analyzer										Y	Y
boost		Y	Y	Y	Y	Y		Y		Y	Y
coerce		Y						Y
eager_global_ordinals						Y				Y
enabled									Y
enable_position_increments											Y
dynamic							Y		Y
doc_values	Y		Y	Y	Y	Y		Y			Y
fielddata										Y
fielddata_frequency_filter										Y
fields						Y				Y
format				Y
ignore_above						Y
ignore_malformed				Y				Y
ignore_z_value
include_in_all
index		Y	Y	Y	Y	Y		Y		Y	Y
index_options						Y				Y
index_prefixes										Y
index_phrases										Y
locale				Y
normalizer						Y
norms						Y				Y
null_value			Y	Y	Y	Y		Y			Y
position_increment_gap										Y
properties							Y		Y
store	Y	Y	Y	Y	Y	Y		Y		Y	Y
search_analyzer										Y
search_quote_analyzer										Y
similarity						Y				Y
split_queries_on_whitespace						Y
term_vector										Y

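2. Field types

Core types

Strings (string)

The string field is unsupported for indexes created in 5.x, in favor of the text and keyword fields. Attempting to create a string field in an index created in 5.x causes Elasticsearch to try to upgrade the string into the appropriate text or keyword field. It returns an HTTP Warning header telling you that string is deprecated. This upgrade process is not always perfect, because some combinations of features are supported by string but not by text or keyword. For that reason it is better to use text or keyword.
Indexes imported from 2.x support only string, not text or keyword. To ease the migration from 2.x, Elasticsearch downgrades text and keyword mappings applied to indexes imported from 2.x into string. Long-lived indexes created before 5.x will eventually need to be recreated against 5.x before upgrading to 6.x; the downgrade smooths the process until you find time to recreate the index.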

text

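A field to index full-text values, such as the body of an email or the description of a product. These fields are analyzed, that is, they are passed through an analyzer that converts the string into a list of individual terms before indexing. The analysis process allows Elasticsearch to search for individual words within each full-text field. Text fields are not used for sorting and are seldom used for aggregations (although the significant terms aggregation is a notable exception).

The parameters accepted by text fields are marked in the Text column of the cross-reference table above.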

keyword

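A field to index structured content such as email addresses, hostnames, status codes, zip codes, or tags. Keyword fields are typically used for filtering (find all blog posts whose status is published), sorting, and aggregations. They are searchable only by their exact value.

If you need to index full-text content such as email bodies or product descriptions, you should probably use a text field instead.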
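
As a minimal sketch of how the two types sit side by side, the Python snippet below builds the `properties` part of a mapping as a plain dict and serialises it to JSON. The field names `body` and `status` are hypothetical, chosen to mirror the email-body and status examples above; how this fragment is wrapped in a create-index request depends on the Elasticsearch version, so that part is left out.

```python
import json

# Hypothetical field names, for illustration only.
properties = {
    "body": {"type": "text"},       # analyzed: searchable word by word
    "status": {"type": "keyword"},  # not analyzed: matched by exact value
}


def field_type(name):
    """Return the mapped type of a field in the sketch above."""
    return properties[name]["type"]


# Serialise the fragment as it would appear inside a mapping request body.
mapping_json = json.dumps({"properties": properties}, indent=2)
print(mapping_json)
```

A field such as `status` would then be filtered with exact values like `published`, while `body` would be queried with full-text queries.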

Numeric types

long, integer, short, byte, double, float, half_float, scaled_float

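Date type (date)

JSON has no date datatype, so a date in Elasticsearch can take any of the following forms:

A string containing a formatted date, for example "2015-01-01" or "2015/01/01 12:10:30".
A long number representing milliseconds-since-the-epoch.
An integer representing seconds-since-the-epoch.

Internally, dates are converted to UTC (if a time zone is specified) and stored as a long number representing milliseconds-since-the-epoch.
Date formats can be customised, but if no format is specified, the default is used:
"strict_date_optional_time||epoch_millis"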

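The relationship between a formatted date string and its internal milliseconds-since-the-epoch value can be sketched in Python. This is only an illustration under stated assumptions: the field name `created` is hypothetical, the `format` value combines several patterns with `||`, and `to_epoch_millis` uses Python's own `strptime` patterns rather than Elasticsearch's date parser.

```python
from datetime import datetime, timezone

# Date mapping fragment accepting several formats, separated by "||".
# The field name "created" is hypothetical.
date_mapping = {
    "properties": {
        "created": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis",
        }
    }
}


def to_epoch_millis(text, pattern="%Y-%m-%d"):
    """Parse a formatted date string as UTC and return milliseconds since the epoch."""
    parsed = datetime.strptime(text, pattern).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)


print(to_epoch_millis("2015-01-01"))
print(to_epoch_millis("2015/01/01 12:10:30", "%Y/%m/%d %H:%M:%S"))
```
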
Boolean type (boolean)

Boolean fields accept JSON true and false values, but can also accept strings that are interpreted as either true or false:

False values	`false`, `"false"`
True values	`true`, `"true"`
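
The table can be read as a small lookup rule. The helper below, `parse_boolean`, is a hypothetical Python sketch of that rule and not Elasticsearch's own parser: it accepts exactly the four values listed and rejects everything else.

```python
def parse_boolean(value):
    """Interpret a value according to the table above; reject anything else."""
    if value is True or value == "true":
        return True
    if value is False or value == "false":
        return False
    raise ValueError(f"not an accepted boolean value: {value!r}")


print(parse_boolean("true"), parse_boolean(False))
```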

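Range types

The following range types are supported:

`integer_range`	A range of signed 32-bit integers, with a minimum value of -2^31 and a maximum value of 2^31-1.
`float_range`	A range of single-precision 32-bit IEEE 754 floating-point values.
`long_range`	A range of signed 64-bit integers, with a minimum value of -2^63 and a maximum value of 2^63-1.
`double_range`	A range of double-precision 64-bit IEEE 754 floating-point values.
`date_range`	A range of date values represented as unsigned 64-bit integer milliseconds elapsed since the system epoch.
`ip_range`	A range of ip values supporting IPv4 or IPv6 (or mixed) addresses.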
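
To make the bounds concrete, the sketch below writes them out in Python and pairs an `integer_range` mapping fragment with a document value. The field name `expected_attendees` is an assumption for illustration, and `fits_integer_range` is a hypothetical client-side check, not something Elasticsearch exposes; `gte` and `lte` are the inclusive bound keys of a range value.

```python
# Bounds of the integer range types, as listed in the table above.
INTEGER_RANGE_BOUNDS = (-2**31, 2**31 - 1)
LONG_RANGE_BOUNDS = (-2**63, 2**63 - 1)

# Hypothetical field name, for illustration only.
range_mapping = {"properties": {"expected_attendees": {"type": "integer_range"}}}
range_doc = {"expected_attendees": {"gte": 10, "lte": 20}}


def fits_integer_range(value):
    """Check that both inclusive bounds fit in a signed 32-bit integer."""
    low, high = INTEGER_RANGE_BOUNDS
    return low <= value["gte"] <= value["lte"] <= high


print(fits_integer_range(range_doc["expected_attendees"]))
```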

Binary type (binary)

The binary type accepts a binary value as a Base64-encoded string. By default the field is not stored and is not searchable.
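
Preparing a value for a binary field therefore comes down to Base64 encoding on the client. A minimal Python sketch, assuming a hypothetical field named `blob`:

```python
import base64


def encode_blob(raw: bytes) -> str:
    """Return the Base64 string form that a binary field expects."""
    return base64.b64encode(raw).decode("ascii")


# Hypothetical field name "blob"; the value is sent as an ordinary JSON string.
binary_doc = {"blob": encode_blob(b"Some binary blob")}
print(binary_doc)
```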

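Complex types

Arrays

In Elasticsearch, there is no dedicated array type. By default, any field can contain zero or more values, but all values in the array must be of the same datatype. For example:

an array of strings: ["one", "two"]
an array of integers: [1, 2]
an array of arrays: [1, [2, 3]], which is equivalent to [1, 2, 3]
an array of objects: [ { "name": "Mary", "age": 12 }, { "name": "John", "age": 10 }]

When a field is added dynamically, the first value in the array determines the field type. All subsequent values must be of the same datatype, or must at least be coercible to it.
Arrays with a mixture of datatypes are not supported: [10, "some string"]

An array may contain null values, which are either replaced by the configured null_value or skipped entirely. An empty array [] is treated as a missing field, that is, a field with no values.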
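
The equivalence of [1, [2, 3]] and [1, 2, 3], and the handling of null values, can be sketched with a small Python helper. `flatten` is a hypothetical illustration of the rules described above, not Elasticsearch code; its `null_value` argument stands in for the mapping parameter of the same name.

```python
def flatten(values, null_value=None):
    """Flatten nested lists; replace nulls with null_value, or skip them if none is set."""
    flat = []
    for value in values:
        if isinstance(value, list):
            flat.extend(flatten(value, null_value))
        elif value is None:
            if null_value is not None:
                flat.append(null_value)
        else:
            flat.append(value)
    return flat


print(flatten([1, [2, 3]]))
print(flatten([1, None, 2], null_value=0))
```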

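Object type (object)

JSON documents are hierarchical in nature: a document may contain inner objects, which in turn may contain inner objects themselves. In the example below, the outer document is itself a JSON object; it contains an inner object called manager, which in turn contains an inner object called name.

curl -XPUT 'localhost:9200/my_index/my_type/1?pretty' -H 'Content-Type: application/json' -d'
{
  "region": "US",
  "manager": {
    "age":     30,
    "name": {
      "first": "John",
      "last":  "Smith"
    }
  }
}
'

Internally, this document is indexed as a simple, flat list of key-value pairs, something like this: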

{
  "region":             "US",
  "manager.age":        30,
  "manager.name.first": "John",
  "manager.name.last":  "Smith"
}
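
The flattening step can be sketched in a few lines of Python. `flatten_object` is a hypothetical helper that mimics the dotted-path layout shown above; it illustrates the idea and is not how Elasticsearch is implemented.

```python
def flatten_object(obj, prefix=""):
    """Turn nested dicts into a flat dict keyed by dotted paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_object(value, path))
        else:
            flat[path] = value
    return flat


doc = {
    "region": "US",
    "manager": {"age": 30, "name": {"first": "John", "last": "Smith"}},
}
flat_doc = flatten_object(doc)
print(flat_doc)
```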

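Nested type (nested)

The nested type is a specialised version of the object type that allows arrays of objects to be indexed so that each object can be queried independently of the others. It can be used in place of object when the fields of each array element must stay associated with one another.

curl -XPUT 'localhost:9200/my_index/my_type/1?pretty' -H 'Content-Type: application/json' -d'
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}
'

If user is not mapped as nested (for example when it is added dynamically, which makes it an object field), the document is flattened internally into something like the following, and the association between alice and white is lost: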

{
  "group" :        "fans",
  "user.first" : [ "alice", "john" ],
  "user.last" :  [ "smith", "white" ]
}
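
The practical difference between the flattened form and nested indexing shows up when a query combines two fields. The Python sketch below is an illustration only, with hypothetical helper names: `matches_flattened` pools the values of each field across all objects, as in the flattened document above, while `matches_nested` requires both conditions to hold within the same object.

```python
users = [
    {"first": "John", "last": "Smith"},
    {"first": "Alice", "last": "White"},
]


def matches_flattened(objects, first, last):
    """Object-style matching: each field is checked against all values pooled together."""
    firsts = {o["first"] for o in objects}
    lasts = {o["last"] for o in objects}
    return first in firsts and last in lasts


def matches_nested(objects, first, last):
    """Nested-style matching: both conditions must hold within the same object."""
    return any(o["first"] == first and o["last"] == last for o in objects)


# "Alice Smith" does not exist, yet the flattened form reports a match.
print(matches_flattened(users, "Alice", "Smith"))
print(matches_nested(users, "Alice", "Smith"))
```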

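Geo types

Geo-point (geo_point)

Fields of type geo_point accept latitude-longitude pairs, which can be used:

to find geo-points within a bounding box, within a certain distance of a central point, or within a polygon.
to aggregate documents geographically or by distance from a central point.
to integrate distance into a document's relevance score.
to sort documents by distance.

See: Geo-points datatype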
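
As a minimal sketch, a geo_point mapping fragment and a document value in object form (`lat` and `lon` keys) might look as follows in Python. The field name `location` and the helper `geo_point` are assumptions for illustration; the helper only checks the usual latitude and longitude ranges on the client side.

```python
# Hypothetical field name "location".
geo_mapping = {"properties": {"location": {"type": "geo_point"}}}


def geo_point(lat, lon):
    """Build a geo-point in object form, rejecting out-of-range coordinates."""
    if not -90 <= lat <= 90:
        raise ValueError("latitude must be between -90 and 90")
    if not -180 <= lon <= 180:
        raise ValueError("longitude must be between -180 and 180")
    return {"lat": lat, "lon": lon}


geo_doc = {"location": geo_point(41.12, -71.34)}
print(geo_doc)
```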

Geo-shape (geo_shape)

geo_shape is used to describe complex shapes such as polygons.
The geo_shape datatype facilitates the indexing of and searching with arbitrary geo shapes such as rectangles and polygons. It should be used when either the data being indexed or the queries being executed contain shapes other than just points.

See: Geo-Shape datatype

Specialised types

IP type

ip is used to describe IPv4 and IPv6 addresses.
See: IP datatype
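
A short Python sketch of an ip mapping fragment, together with a client-side check that a string is a valid IPv4 or IPv6 address before it is indexed. The field name `ip_addr` and the helper `ip_version` are hypothetical; the check relies on Python's standard `ipaddress` module, not on Elasticsearch.

```python
import ipaddress

# Hypothetical field name "ip_addr".
ip_mapping = {"properties": {"ip_addr": {"type": "ip"}}}


def ip_version(text):
    """Return 4 or 6 for a valid address; raise ValueError otherwise."""
    return ipaddress.ip_address(text).version


print(ip_version("192.168.1.1"), ip_version("2001:db8::1"))
```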

Completion type (completion)

completion provides auto-complete suggestions.

Token count type (token_count)

token_count counts the number of tokens in a string.
See: Token count datatype

Attachment type (attachment)

See the mapper-attachments plugin, which supports indexing attachments such as Microsoft Office formats, Open Document formats, ePub, HTML, and so on as the attachment datatype.

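Percolator type (percolator)

The percolator field type parses a JSON structure into a native query and stores that query, so that the percolate query can use it to match provided documents.
See: Percolator datatype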

join type

The join datatype is a special field that creates parent/child relations within documents of the same index.
See: join datatype

Alias type

An alias mapping defines an alternate name for a field in the index. The alias can be used in place of the target field in search requests.
See: Alias datatype

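Multi-fields

It is often useful to index the same field in different ways for different purposes. For example, a string field can be mapped as a text field for full-text search and also as a keyword field for sorting and aggregations. Alternatively, you can index a text field with the standard analyzer, the english analyzer, and the french analyzer.

This is the purpose of multi-fields. Most datatypes support multi-fields via the fields parameter.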
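
A minimal sketch of a multi-field mapping fragment, written as a Python dict. The names `city` and `raw` are hypothetical: `city` is analyzed as text for full-text search, while its sub-field, addressed as `city.raw`, keeps the exact value as a keyword for sorting and aggregations. The helper `sub_field_paths` is only there to show how the sub-fields are addressed.

```python
# Hypothetical field "city" with a keyword sub-field "raw".
multi_field_mapping = {
    "properties": {
        "city": {
            "type": "text",
            "fields": {
                "raw": {"type": "keyword"},
            },
        }
    }
}


def sub_field_paths(mapping):
    """List the dotted paths under which each multi-field can be addressed."""
    paths = []
    for name, spec in mapping["properties"].items():
        for sub in spec.get("fields", {}):
            paths.append(f"{name}.{sub}")
    return paths


print(sub_field_paths(multi_field_mapping))
```

Queries would then search `city` and sort or aggregate on `city.raw`.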