Analysis (Tokenization)
The process of converting text into a series of words; each resulting word is called a term or token.
Underlying principle: the inverted index (each term maps to the list of documents that contain it).
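A minimal sketch of what an inverted index looks like conceptually (the documents and doc IDs here are made up, not actual Lucene output):
// doc 1: "hello world", doc 2: "hello elasticsearch"
{
  "hello": [1, 2],          // term -> posting list of matching doc IDs
  "world": [1],
  "elasticsearch": [2]
}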
Components of an analyzer and their invocation order:
1. Character Filter: pre-processes the raw text (e.g. stripping HTML tags) before tokenization.
2. Tokenizer: splits the filtered text into individual terms according to a set of rules.
3. Token Filter: further processes the terms produced in step 2, e.g. lowercasing. (A combined example follows below.)
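To illustrate all three stages in one call, here is a minimal sketch using only built-in components (html_strip char filter, standard tokenizer, lowercase token filter); it should return the terms hello and world:
POST _analyze
{
  "char_filter": ["html_strip"], // 1. strip HTML tags from the raw text
  "tokenizer": "standard",       // 2. split into terms
  "filter": ["lowercase"],       // 3. lowercase each term
  "text": "<b>Hello</b> World"
}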
1. Testing with a specified analyzer
Request:
POST _analyze
{
"analyzer": "standard", //指定默认的分词器
"text": "hello world" //分词文本
}
Response:
The text is split into two terms.
{
"tokens": [ //分词结果
{
"token": "hello", //
"start_offset": 0, //起始偏移量
"end_offset": 5, //结束偏移量
"type": "<ALPHANUM>",
"position": 0 //位置
},
{
"token": "world",
"start_offset": 6,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
}
]
}
2. Testing against a specific field
Request
POST /test_index/_analyze // test_index is the index to test against; if it does not exist, an error is returned.
{
"field": "name", //测试的字段
"text": "hello world"
}
Response
{
"tokens": [ //分词结果
{
"token": "hello", //
"start_offset": 0, //起始偏移量
"end_offset": 5, //结束偏移量
"type": "<ALPHANUM>",
"position": 0 //位置
},
{
"token": "world",
"start_offset": 6,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
}
]
}
3. Testing with a custom combination of analyzer components
POST _analyze
{
"tokenizer": "standard",
"filter": ["lowercase"], //自定义的分词器,全部转化成小写
"text": "heLLo World"
}
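This should return the terms hello and world, both lowercased by the lowercase token filter.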
Built-in analyzers in ES
Standard
- Splits on word boundaries, supports multiple languages. Tokenizer: standard
- Lowercases terms. Token Filters: standard, lowercase, stop (disabled by default)
POST _analyze
{
"analyzer": "standard", //指定默认的分词器
"text": "Hello World" //分词文本
}
Returns: hello, world
Simple
POST _analyze
{
"analyzer": "simple",
"text": "heLLo World 2018"
}
The result contains only the letter tokens (hello, world), lowercased; digits, underscores, etc. are dropped.
Whitespace
POST _analyze
{
"analyzer": "whitespace",
"text": "heLLo World 2018"
}
Returns: heLLo, World, 2018 (splits on whitespace only; case is preserved).
Stop
POST _analyze
{
"analyzer": "stop",
"text": "the heLLo World 2018"
}
Returns: hello, world (stop words such as "the" are removed, and non-letter tokens such as 2018 are dropped).
Keyword
POST _analyze
{
"analyzer": "keyword",
"text": "the heLLo World 2018"
}
Returns the entire input as a single term:
{
"tokens": [
{
"token": "the heLLo World 2018",
"start_offset": 0,
"end_offset": 21,
"type": "word",
"position": 0
}
]
}
Pattern
- Splits using a regular expression. Tokenizer: pattern. Token Filters: lowercase, stop (disabled by default)
- The default pattern is \W+, i.e. non-word characters act as separators.
POST _analyze
{
"analyzer": "pattern",
"text": "the heLLo 'World -2018"
}
Returns:
{
"tokens": [
{
"token": "the",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "hello",
"start_offset": 4,
"end_offset": 9,
"type": "word",
"position": 1
},
{
"token": "world",
"start_offset": 11,
"end_offset": 16,
"type": "word",
"position": 2
},
{
"token": "2018",
"start_offset": 19,
"end_offset": 23,
"type": "word",
"position": 3
}
]
}
4. Chinese word segmentation
- Challenges
- Chinese segmentation splits a sentence into individual words. In English, words are naturally separated by spaces; Chinese has no such formal delimiter.
- Different contexts can yield very different segmentations, e.g. cross ambiguity. Both of the following segmentations look plausible:
今天/民政局/发/放女朋友 今天/民政/局/发放/女/朋友
4.1 The IK analyzer
1. Segments both Chinese and English text; supports the ik_smart and ik_max_word modes.
2. Supports custom dictionaries and hot-reloading of the segmentation dictionary.
https://github.com/medcl/elasticsearch-analysis-ik
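Assuming the IK plugin is installed (it is not bundled with ES), a minimal sketch of testing it looks like the request below; the exact tokens depend on the plugin version and dictionary:
POST _analyze
{
  "analyzer": "ik_max_word", // or "ik_smart" for coarser-grained segmentation
  "text": "今天民政局发放女朋友"
}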
5. Custom analysis
- When the built-in analyzers cannot meet your needs, you can customize the analysis.
- This is done by combining Character Filters, a Tokenizer, and Token Filters.
5.1 Character Filters
Example:
POST _analyze
{
"tokenizer": "keyword",
"char_filter": [
"html_strip" //去掉html中的符号
],
"text": "<p>I'm so <b>happy</b>!</p>"
}
- Built-in Tokenizers include:
- standard: splits on standard word boundaries
- letter: splits on non-letter characters
- whitespace: splits on whitespace
- uax_url_email: behaves like standard, but does not split emails and URLs
Example:
POST _analyze
{
"tokenizer": "uax_url_email",
"text": "www.baidu.com 1112@qq.com hello world"
}
Result:
{
"tokens": [
{
"token": "www.baidu.com",
"start_offset": 0,
"end_offset": 13,
"type": "<URL>",
"position": 0
},
{
"token": "1112@qq.com",
"start_offset": 14,
"end_offset": 25,
"type": "<EMAIL>",
"position": 1
},
{
"token": "hello",
"start_offset": 26,
"end_offset": 31,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "world",
"start_offset": 32,
"end_offset": 37,
"type": "<ALPHANUM>",
"position": 3
}
]
}
- ngram and edge_ngram: character n-gram splitting, similar to search-as-you-type suggestions in Baidu/Google
POST _analyze
{
"tokenizer": "ngram", //会一次查询出来你每个字后面的词,edge_ngram只会查询出第一个词后面的词
"text": "你好"
}
{
"tokens": [
{
"token": "你",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
},
{
"token": "你好",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 1
},
{
"token": "好",
"start_offset": 1,
"end_offset": 2,
"type": "word",
"position": 2
}
]
}
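For contrast, a minimal sketch of the same text with the edge_ngram tokenizer; with the default settings (min_gram 1, max_gram 2) it should emit only the prefixes 你 and 你好:
POST _analyze
{
  "tokenizer": "edge_ngram", // only grams anchored at the beginning of the token
  "text": "你好"
}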
- path_hierarchy: splits along file-path separators
Example:
POST _analyze
{
"tokenizer": "path_hierarchy",
"text": "/baidu/com"
}
Result:
{
"tokens": [
{
"token": "/baidu",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 0
},
{
"token": "/baidu/com",
"start_offset": 0,
"end_offset": 10,
"type": "word",
"position": 0
}
]
}
5.2 Token Filter
- Adds, removes, or modifies the terms output by the tokenizer.
Built-in filters; multiple filters can be chained and are applied in order:
- lowercase: lowercases every term
- stop: removes stop words
- ngram and edge_ngram: split terms into n-grams
- synonym: adds synonym terms (see the sketch after the example below)
POST _analyze
{
"tokenizer": "standard",
"filter": [
"stop",
"lowercase",
{
"type":"ngram",
"min_gram":4, 最小四个
"max_gram":4 最大四个
}
],
"text": "a hello world"
}
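The synonym filter from the list above, shown as a minimal sketch with an inline synonym rule (the words used here are just an illustration):
POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "synonym",
      "synonyms": ["quick, fast"] // treat quick and fast as equivalent terms
    }
  ],
  "text": "a quick car"
}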
5.3 Custom analyzers
Defined in the settings of an index.
Structure:
PUT test_index
{
"settings": {
"analysis": {
"char_filter": {},
"tokenizer": {},
"filter": {},
"analyzer": {}
}
}
}
Creating your own analyzer:
PUT test_index1
{
"settings": {
"analysis": {
"analyzer": {
"my_first_analyzer":{
"type":"custom",
"tokenizer":"standard",
"char_filter":[
"html_strip"
],
"filter":[
"lowercase",
"asciifolding"
]
}
}
}
}
}
Testing the custom analyzer:
POST /test_index1/_analyze
{
"analyzer": "my_first_analyzer",
"text": "<p>I'm so <b>happy</b>!</p>"
}
Result:
{
"tokens": [
{
"token": "i'm",
"start_offset": 3,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "so",
"start_offset": 12,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "happy",
"start_offset": 18,
"end_offset": 27,
"type": "<ALPHANUM>",
"position": 2
}
]
}
6. When analysis is applied
- When a document is created or updated (index time), the document's fields are analyzed.
- At query time (search time), the query string is analyzed.
6.1 Index-time analysis is configured via the analyzer property of each field in the index mapping; when not specified, the default is standard.
Demo:
PUT test_index2
{
"mappings": {
"doc":{
"properties": {
"title":{
"type":"text",
"analyzer":"whitespace" //指定分词器
}
}
}
}
}
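Note: the doc level in these mappings is an ES 6.x mapping type; from ES 7 onward mappings are typeless and that level is omitted.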
6.2 Specifying the analyzer at query time
1. Specify the analyzer directly in the query:
POST /test_index/_search
{
"query": {
"match": {
"message": {
"query": "hello",
"analyzer": "standard" //指定分词器
}
}
}
}
2. Set search_analyzer in the index mapping:
PUT /test_index
{
"mappings": {
"doc": {
"properties": {
"title":{
"type": "text",
"analyzer": "whitespace",
"search_analyzer": "standard"
}
}
}
}
}
7. Official documentation
Refer to the official documentation often:
https://www.elastic.co/guide/index.html