ICU Analysis Plugin
The ICU analysis plugin provides Unicode normalization, collation and folding. The plugin is called elasticsearch-analysis-icu.
The plugin includes the following analysis components:
ICU Normalization
Normalizes characters as explained here. It registers itself by default under icu_normalizer or icuNormalizer using the default settings. The name parameter can be provided to select the normalization form; it accepts the following values: nfc, nfkc, and nfkc_cf. Here is a sample setting:
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "normalization" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_normalizer"]
                }
            }
        }
    }
}
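To make the three values of the name parameter more concrete, here is a minimal sketch in Python. It uses only the standard library's unicodedata module, not the plugin or ICU itself, so treat it as an approximation: the helper normalize is hypothetical, and the nfkc_cf branch (NFKC plus case folding) only approximates what ICU does.

```python
import unicodedata

def normalize(text: str, name: str) -> str:
    """Approximate the nfc, nfkc and nfkc_cf modes with the standard library."""
    if name == "nfc":
        return unicodedata.normalize("NFC", text)
    if name == "nfkc":
        return unicodedata.normalize("NFKC", text)
    if name == "nfkc_cf":
        # NFKC, then case folding, then NFKC again because case folding
        # can produce sequences that are no longer normalized.
        folded = unicodedata.normalize("NFKC", text).casefold()
        return unicodedata.normalize("NFKC", folded)
    raise ValueError(f"unknown normalization form: {name}")

# e + combining acute accent, the "fi" ligature, Roman numeral nine
samples = ["e\u0301", "\ufb01", "\u2168"]
for sample in samples:
    print([normalize(sample, name) for name in ("nfc", "nfkc", "nfkc_cf")])
```

nfc only composes canonically equivalent sequences, nfkc also replaces compatibility characters such as ligatures, and nfkc_cf additionally removes case distinctions.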
ICU Folding
Folding of Unicode characters based on UTR#30. It registers itself under the icu_folding and icuFolding names.
The filter also does lowercasing, which means the lowercase filter can normally be left out. Sample setting:
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_folding"]
                }
            }
        }
    }
}
Filtering
The folding can be restricted to a set of Unicode characters with the parameter unicodeSetFilter. This is useful for a non-internationalized search engine that needs to retain a set of national characters which are primary letters in a specific language. See syntax for the UnicodeSet here.
The following example exempts Swedish characters from the folding. Note that the exempted characters are NOT lowercased, which is why the lowercase filter is added after the folding filter.
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "standard",
                    "filter" : ["my_icu_folding", "lowercase"]
                }
            },
            "filter" : {
                "my_icu_folding" : {
                    "type" : "icu_folding",
                    "unicodeSetFilter" : "[^åäöÅÄÖ]"
                }
            }
        }
    }
}
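The effect of the exemption can be illustrated with a rough Python sketch. It does not use ICU or implement UTR#30: the hypothetical helpers fold_char and icu_folding_sketch merely strip combining marks and case with the standard library, which is enough to show why the exempted letters keep both their diacritics and their case until the lowercase filter runs.

```python
import unicodedata

def fold_char(ch: str) -> str:
    """Strip accents and case from one character (rough stand-in for folding)."""
    decomposed = unicodedata.normalize("NFKD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.casefold()

def icu_folding_sketch(text: str, exempt: str = "") -> str:
    # Characters listed in `exempt` are outside the filter's set, so they
    # pass through untouched: neither folded nor lowercased.
    return "".join(ch if ch in exempt else fold_char(ch) for ch in text)

token = "\u00c5sa \u00d6berg Caf\u00e9"  # "Åsa Öberg Café", precomposed characters

print(icu_folding_sketch(token))             # every character folded
swedish = icu_folding_sketch(token, exempt="åäöÅÄÖ")
print(swedish)                               # Å and Ö survive, still uppercase
print(swedish.lower())                       # what the extra lowercase filter adds
```

The sketch assumes precomposed input characters; a decomposed "A" followed by a combining ring would not match the exempt set here.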
ICU Collation
Uses the collation token filter. The collation rules (defined here) can be specified with the rules parameter, which either points to a location (possibly relative to the config location) or contains the rules inline in the settings. Alternatively, the language parameter selects a locale, which can be further specialized by country and variant. By default the filter registers under icu_collation or icuCollation and uses the default locale.
Here is a sample setting:
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_collation"]
                }
            }
        }
    }
}
And here is a sample of a custom collation:
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["myCollator"]
                }
            },
            "filter" : {
                "myCollator" : {
                    "type" : "icu_collation",
                    "language" : "en"
                }
            }
        }
    }
}
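When the same settings are needed for several locales, the body can be generated rather than written by hand. The following Python sketch builds the JSON shown above with the language parameter further specialized by country; the function collation_settings, the filter name swedish_collator and the locale values sv and SE are illustrative assumptions, not names defined by the plugin.

```python
import json

def collation_settings(filter_name: str, language: str, country: str = "") -> str:
    """Build an index settings body with a locale-specific icu_collation filter."""
    collator = {"type": "icu_collation", "language": language}
    if country:
        collator["country"] = country  # narrows the language to a region
    settings = {
        "index": {
            "analysis": {
                "analyzer": {
                    "collation": {"tokenizer": "keyword", "filter": [filter_name]}
                },
                "filter": {filter_name: collator},
            }
        }
    }
    return json.dumps(settings, indent=2)

# Hypothetical example: a Swedish collator named "swedish_collator"
print(collation_settings("swedish_collator", "sv", "SE"))
```

Generating the body through json.dumps also guards against the missing-comma mistakes that are easy to make when editing nested settings by hand.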