1.5 Elasticsearch-常见字段类型：text/keyword、date、geo_point

原创于 2025-10-18 11:51:08 发布 · 817 阅读

20 ·

CC 4.0 BY-SA版权

CC BY-NC-SA 3.0

文章标签：

#elasticsearch #大数据 #搜索引擎

elasticsearch 专栏收录该内容

25 篇文章

订阅专栏

在这里插入图片描述
1.5 Elasticsearch-常见字段类型：text/keyword、date、geo_point
——把“搜得到”拆成“怎么存、怎么搜、怎么算”

在 Elasticsearch 里，字段类型（field datatype）是映射（mapping）的最小决策单元，它同时决定了三条数据链路：

写入链：文档被如何分词、如何索引、如何压缩存储；
查询链：倒排表、BKD 树、geohash 前缀如何被检索；
聚合链：哪些数据结构（doc_values、global ordinals、geo_hash grid）会被加载到内存做统计。

下面用“踩过就忘不掉”的四个高频类型——text、keyword、date、geo_point——把“存、搜、算”一次性拆开讲清。所有示例基于 8.x 版本，7.x 同样适用。

text：全文搜索的“入口”，但不是“出口”

存储视角
├─ 倒排索引：被 analyzer 拆成 term 流，默认存 postings + term vectors（可关）
├─ 正排数据：doc_values 被强制禁用，无法做聚合/排序；fielddata 默认关闭，需显式开
└─ 存储成本： norms(.nvd)、positions(.pos)、payloads 占大头；开启 highlight 再 +50%

映射模板
PUT blog
{
“mappings”: {
“properties”: {
“title”: {
“type”: “text”,
“analyzer”: “ik_max_word”, // 写入分词
“search_analyzer”: “ik_smart”, // 查询分词
“fields”: {
“raw”: { “type”: “keyword” }, // 子字段，供排序/聚合
“suggest”: { “type”: “completion” }
}
}
}
}
}

查询视角

match：会对查询串再走一次 search_analyzer，生成同写入侧可对齐的 term 集合
match_phrase：在 term 对齐基础上加 slop 与 position 检查
multi_match：跨 text/keyword 混查时，keyword 字段会走 term query，等价“=”过滤

聚合视角
text 字段默认不能聚合；需要：
a) 开启 fielddata（堆内存爆炸风险）
b) 或者通过子字段 keyword 走 .keyword 后缀
生产环境永远选 b。

keyword：精确值“身份证”，也是聚合“现金”

存储视角
├─ 倒排：整串当单个 term 写 postings，无分词
├─ 正排：doc_values 默认开启，内存友好，可 hot 阶段常驻
└─ 存储成本：无 norms、无 positions，磁盘占用最低一档

典型场景

状态码、用户 ID、订单号、枚举值、tag 数组
wildcard、prefix、regexp、term、terms 查询
排序、聚合、脚本、父子 join、runtime field

映射技巧
“status”: {
“type”: “keyword”,
“ignore_above”: 256, // 超长文本直接丢弃，避免脏数据撑爆内存
“normalizer”: “lowercase” // 查询时统一转小写，保持“精确”的同时忽略大小写
}

数组支持
keyword 天然支持多值：{“tag”: [“java”, “search”]} 内部存成两个 term，与倒排表天然对齐。

date：时间戳的“统一口径”

内部存储

固定转成 UTC 毫秒 long（64-bit），用 BKD 树索引，压缩率极高，范围查询 O(log n)

映射模板
“create_time”: {
“type”: “date”,
“format”: “yyyy-MM-dd HH:mm:ss||epoch_millis||strict_date_optional_time”,
“ignore_malformed”: true, // 脏数据不抛异常，直接跳
“null_value”: “1970-01-01” // 显式空值兜底
}

查询写法

区间：{“range”: {“create_time”: {“gte”: “2025-10-18||-7d/d”}}} // 支持日期数学
聚合：date_histogram 默认用 1 毫秒桶，生产务必加 “fixed_interval”: “1h” 或 “calendar_interval”: “1M”

时区陷阱
Kibana 把浏览器时区带在请求头，ES 用 time_zone 参数转 UTC；脚本直调用必须自己带 time_zone，否则结果差 8 小时。

geo_point：经纬度的一键“圈地”

存储结构

内部双字段：lat/lon 各存 double + geohash 前缀 + BKD 树
支持 5 种输入格式：object、string、“lat,lon”、数组 [lon,lat]、WKT POINT

映射模板
“location”: {
“type”: “geo_point”,
“ignore_malformed”: true,
“ignore_z_value”: true // 带海拔数据时忽略 z 轴
}

查询全家桶

geo_distance：圆
geo_bounding_box：矩形
geo_polygon：多边形
geo_shape：当需要复杂图形时用 geo_shape 类型，但 geo_point 可借助 “geo_shape”: {“relation”: “within”} 与 geo_shape 字段互查

聚合玩法

geo_distance：按距中心点每 10 km 一个桶
geohash_grid：把地图切成网格，桶 key 就是 geohash 前缀，配合 “precision”: 7（≈152 m）
geo_centroid：求每个桶的重心，外卖“热力图”标配

性能提示

geo_point 字段默认 doc_values 开启，内存占用约为 2×8×doc_count 字节；十亿级文档需 16 GB，冷热分层可省内存
距离排序必须加 “unit”: “km” 与 “distance_type”: “arc”，默认 sloppy_arc 已足够精确，plane 模式误差大但快 3 倍

四者混用的“黄金组合”

场景：O2O 门店搜索

门店名称：text + ik_max_word，供用户模糊搜
门店 ID：keyword，精确查、做父子 join
创建时间：date，按天统计新店
坐标：geo_point，3 km 内优先展示

一条查询同时用到四种类型：
GET store/_search
{
“query”: {
“bool”: {
“must”: [
{ “match”: { “name”: “海底捞” } },
{ “range”: { “create_time”: { “gte”: “now-30d/d” } } },
{ “geo_distance”: { “distance”: “3km”, “location”: { “lat”: 31.2, “lon”: 121.5 } } }
],
“filter”: [
{ “term”: { “city_id”: “310100” } }
]
}
},
“aggs”: {
“per_day”: {
“date_histogram”: { “field”: “create_time”, “fixed_interval”: “1d” }
},
“hot_grid”: {
“geohash_grid”: { “field”: “location”, “precision”: 6 }
}
},
“sort”: [
{ “_geo_distance”: { “location”: { “lat”: 31.2, “lon”: 121.5 }, “order”: “asc”, “unit”: “km” } }
]
}