This article first shows how to install and use the IK Chinese analyzer in Elasticsearch (ES), then compares IK's output with that of the default standard analyzer, and finally gives a Java demo for accessing ES.
1. Install the Chinese analyzer
- Download the IK release that matches your ES version. Downloads and the IK/ES version compatibility table are at: https://github.com/medcl/elasticsearch-analysis-ik
- Unzip the downloaded zip file, enter the root of the extracted directory, and build it with Maven: mvn package
- After the build succeeds, go to the target/releases directory, where you will find an elasticsearch-analysis-ik-xx.xx.xx.zip file; unzip it.
- Create an ik directory under the plugins directory of your ES installation and copy all the files extracted in the previous step into it.
- Restart ES and you are done.
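The steps above can be condensed into a shell sketch. This is a setup outline, not a tested script; ES_HOME and IK_VERSION are placeholders you must set for your environment, and the zip file name must match what your build actually produced:

```shell
# Assumptions: IK_VERSION matches your ES version; ES_HOME points at your ES install.
IK_VERSION=x.y.z
git clone https://github.com/medcl/elasticsearch-analysis-ik.git
cd elasticsearch-analysis-ik
mvn package                                    # build the plugin
mkdir -p "$ES_HOME/plugins/ik"
unzip "target/releases/elasticsearch-analysis-ik-$IK_VERSION.zip" \
      -d "$ES_HOME/plugins/ik"                 # install the built files into plugins/ik
# finally, restart ES so the plugin is loaded
```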
2. Test the Chinese analyzer
We run the same sentence through the IK analyzer and the standard analyzer and compare the results.
First, tokenize with IK. (Note: recent IK releases register the analyzers as ik_max_word and ik_smart; the name ik shown here only works with older versions.)
GET /_analyze
{
"analyzer": "ik",
"text": "只有社会主义才能救中国"
}
The output is shown below. As you can see, IK tokenizes the sentence according to Chinese word semantics:
{
"tokens": [
{
"token": "只有",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "社会主义",
"start_offset": 2,
"end_offset": 6,
"type": "CN_WORD",
"position": 1
},
{
"token": "社会",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 2
},
{
"token": "主义",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 3
},
{
"token": "才能救",
"start_offset": 6,
"end_offset": 9,
"type": "CN_WORD",
"position": 4
},
{
"token": "才能",
"start_offset": 6,
"end_offset": 8,
"type": "CN_WORD",
"position": 5
},
{
"token": "救",
"start_offset": 8,
"end_offset": 9,
"type": "CN_CHAR",
"position": 6
},
{
"token": "中国",
"start_offset": 9,
"end_offset": 11,
"type": "CN_WORD",
"position": 7
}
]
}
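The multi-granularity tokens above (社会主义 alongside 社会 and 主义) come from dictionary-based segmentation: at each position IK emits every dictionary word it finds, falling back to single characters. The sketch below is a toy illustration of that idea, not IK's real algorithm, and its mini-dictionary is a hypothetical stand-in for IK's dictionary; it also shows the standard analyzer's per-character splitting for contrast:

```python
# Toy ik_max_word-style segmenter: emit every dictionary word found in the
# text, longest-first at each start position. DICT is a hypothetical
# mini-dictionary, not IK's real one.
DICT = {"只有", "社会主义", "社会", "主义", "才能救", "才能", "救", "中国"}
MAX_LEN = max(len(w) for w in DICT)

def ik_max_word_like(text):
    tokens = []
    for start in range(len(text)):
        # try the longest candidate first, like IK's fine-grained mode
        for length in range(min(MAX_LEN, len(text) - start), 0, -1):
            word = text[start:start + length]
            if word in DICT:
                tokens.append((word, start, start + length))
    return tokens

def standard_like(text):
    # the standard analyzer splits CJK text into single characters
    return [(ch, i, i + 1) for i, ch in enumerate(text)]

sentence = "只有社会主义才能救中国"
print([t[0] for t in ik_max_word_like(sentence)])
# ['只有', '社会主义', '社会', '主义', '才能救', '才能', '救', '中国']
print([t[0] for t in standard_like(sentence)])
# ['只', '有', '社', '会', '主', '义', '才', '能', '救', '中', '国']
```

With this tiny dictionary the toy segmenter happens to reproduce the IK token list above exactly, which makes the contrast with the per-character output easy to see.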
Next we use the standard analyzer. As the results below show, it simply splits the Chinese text into individual characters:
GET /_analyze
{
"analyzer": "standard",
"text": "只有社会主义才能救中国"
}
The output is:
{
"tokens": [
{
"token": "只",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "有",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "社",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "会",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "主",
"start_offset": 4,
"end_offset": 5,
"type": "<IDEOGRAPHIC>",
"position": 4
},
{
"token": "义",
"start_offset": 5,
"end_offset": 6,
"type": "<IDEOGRAPHIC>",
"position": 5
},
{
"token": "才",
"start_offset": 6,
"end_offset": 7,
"type": "<IDEOGRAPHIC>",
"position": 6
},
{
"token": "能",
"start_offset": 7,
"end_offset": 8,
"type": "<IDEOGRAPHIC>",
"position": 7
},
{
"token": "救",
"start_offset": 8,
"end_offset": 9,
"type": "<IDEOGRAPHIC>",
"position": 8
},
{
"token": "中",
"start_offset": 9,
"end_offset": 10,
"type": "<IDEOGRAPHIC>",
"position": 9