5、语言处理与自动补全技术探测
实现的效果
实现的最终效果如下图京东搜索相似,输入词的时候返回提示。同时输入拼音和首字母也会有相同的提示效果
输入汉字
输入拼音
输入首字母
5.1 自定义语料库
5.1.1 语料库映射OpenAPI
索引映射OpenAPI
-
定义索引(映射)接口
/** * 索引操作接口 */ public interface ElasticsearchIndexService { //新增索引+映射 public boolean addIndexAndMapping(CommonEntity commonEntity) throws Exception; }
-
定义索引(映射)实现
@Override public boolean addIndexAndMapping(CommonEntity commonEntity) throws Exception { boolean flag=false; //创建索引请求 CreateIndexRequest request=new CreateIndexRequest(commonEntity.getIndexName()); //获取下游业务参数 Map<String,Object> map =commonEntity.getMap(); //循环参数 for(Map.Entry<String,Object> entry:map.entrySet()){ //设置settings参数 if("settings".equals(entry.getKey()) && entry.getValue() instanceof Map && ((Map)entry.getValue()).size()>0){ request.settings(((Map)entry.getValue())); } //设置mapping参数 if("mapping".equals(entry.getKey()) && entry.getValue() instanceof Map && ((Map)entry.getValue()).size()>0){ request.mapping(((Map)entry.getValue())); } } //创建索引操作客户端 IndicesClient indicesClient=client.indices(); //创建响应对象 CreateIndexResponse response=indicesClient.create(request,RequestOptions.DEFAULT); flag=response.isAcknowledged(); return flag; }
-
新增控制器
/** * 索引操作控制器 */ @RestController @RequestMapping("v1/indices") public class ElasticsearchIndexController { private static final Logger logger = LoggerFactory.getLogger(ElasticsearchIndexController.class); @Autowired ElasticsearchIndexService elasticsearchIndexService; @PostMapping(value = "/add") public ResponseData addIndexAndMapping(@RequestBody CommonEntity commonEntity) { //构造返回下游业务数据 ResponseData rData = new ResponseData(); if (StringUtils.isEmpty(commonEntity.getIndexName())) { rData.setResultEnum(ResultEnum.param_isnull); return rData; } //增加索引(映射)是否成功 boolean isSuccess = false; try { //通过接口调用远程结构化查询方法 isSuccess = elasticsearchIndexService.addIndexAndMapping(commonEntity); //通过类型推断自动装箱(多个参数取交集) rData.setResultEnum(isSuccess, ResultEnum.success, null); //日志记录 logger.info(TipsEnum.create_index_success.getMessage()); } catch (Exception e) { //打印到控制台 e.printStackTrace(); //日志记录 logger.error(TipsEnum.create_index_fail.getMessage()); //构建错误返回信息 rData.setResultEnum(ResultEnum.error); } //返回 return rData; } }
-
开始新增映射
http://127.0.0.1:8888/v1/indices/add
参数
自定义分词器ik_pinyin_analyzer(ik和pinyin组合分词器)
tips
在创建映射前,需要安装拼音插件
{
"indexName": "product_completion_index",
"map": {
"settings": {
"number_of_shards": 1,
"number_of_replicas": 2,
"analysis": {
"analyzer": {
"ik_pinyin_analyzer": {
"type": "custom",
"tokenizer": "ik_smart",
"filter": "pinyin_filter"
}
},
"filter": {
"pinyin_filter": {
"type": "pinyin",
"keep_first_letter": true,
"keep_separate_first_letter": false,
"keep_full_pinyin": true,
"keep_original": true,
"limit_first_letter_length": 16,
"lowercase": true,
"remove_duplicated_term": true
}
}
}
},
"mapping": {
"properties": {
"name": {
"type": "keyword"
},
"searchkey": {
"type": "completion",
"analyzer": "ik_pinyin_analyzer"
}
}
}
}
}
settings下面的为索引的设置信息,动态设置参数,遵循DSL写法
mapping下为映射的字段信息,动态设置参数,遵循DSL写法
属性 | 说明 |
---|---|
keep_first_letter | 启用此选项时,例如:刘德华> ldh,默认值: true |
keep_separate_first_letter | 启用该选项时,将保留第一个字母分开,例如: 刘德华> l,d,h,默认:假的,注意:查询结果 也许是太模糊,由于长期过频 |
limit_first_letter_length | 设置first_letter结果的最大长度,默认值:16 |
keep_full_pinyin | 当启用该选项,例如:刘德华> [ liu,de, hua],默认值:true |
keep_joined_full_pinyin | 当启用此选项时,例如:刘德华> [ liudehua], 默认值:false |
keep_none_chinese | 在结果中保留非中文字母或数字,默认值:true |
keep_none_chinese_together | 默认值:true,如:DJ音乐家- > DJ,yin,yue, jia,当设置为false,例如:DJ音乐家- > D,J, yin,yue,jia,注意:keep_none_chinese必 须先启动 |
keep_none_chinese_in_first_letter | 第一个字母保持非中文字母,例如:刘德华 AT2016- > ldhat2016,默认值:true |
keep_none_chinese_in_joined_full_pinyin | 保留非中文字母加入完整拼音,例如:刘德华 2016- > liudehua2016,默认:false |
none_chinese_pinyin_tokenize | 打破非中国信成单独的拼音项,如果他们拼音, 默认值:true,如: liudehuaaibaba13zhuanghan- > liu,de, hua,a,li,ba,ba,13,zhuang,han,注 意:keep_none_chinese和 keep_none_chinese_together应首先启用 |
keep_original | 当启用此选项时,也会保留原始输入,默认值: false |
lowercase | 小写非中文字母,默认值:true |
trim_whitespace | 默认值:true |
remove_duplicated_term | 当启用此选项时,将删除重复项以保存索引,例 如:de的> de,默认值:false,注意:位置相关 查询可能受影响 |
返回
5.1.2 语料库文档OpenAPI
-
定义批量新增文档接口
//批量新增文档 public RestStatus bulkAndDoc(CommonEntity commonEntity) throws Exception;
-
定义批量新增文档实现
@Override public RestStatus bulkAndDoc(CommonEntity commonEntity) throws Exception { //构建批量新增请求 BulkRequest bulkRequest = new BulkRequest(commonEntity.getIndexName()); //循环下游业务文档数据 for (int i = 0; i < commonEntity.getList().size(); i++) { bulkRequest.add(new IndexRequest().source(XContentType.JSON, SearchTools.mapToObjectGroup(commonEntity.getList().get(i)))); } //开始执行批量新增操作 BulkResponse bulkResponse = client.bulk(bulkRequest, RequestOptions.DEFAULT); return bulkResponse.status(); }
如上图,需要定义成箭头中的形式
所以上面SearchTools.mapToObjectGroup将map转成了数组 -
定义批量新增文档控制器
@PostMapping(value = "/batch") public ResponseData bulkAndDoc(@RequestBody CommonEntity commonEntity) { //构造返回下游业务数据 ResponseData rData = new ResponseData(); if (StringUtils.isEmpty(commonEntity.getIndexName()) || CollectionUtils.isEmpty(commonEntity.getList())) { rData.setResultEnum(ResultEnum.param_isnull); return rData; } //定义批量返回结果 RestStatus result = null; try { //通过接口调用批量新增方法 result = elasticsearchDocumentService.bulkAndDoc(commonEntity); //通过类型推断自动装箱(多个参数取交集) rData.setResultEnum(result, ResultEnum.success, null); //日志记录 logger.info(TipsEnum.batch_create_doc_success.getMessage()); } catch (Exception e) { //打印到控制台 e.printStackTrace(); //日志记录 logger.error(TipsEnum.batch_create_doc_fail.getMessage()); //构建错误返回信息 rData.setResultEnum(ResultEnum.error); } //返回 return rData; }
-
开始批量新增调用
http://127.0.0.1:8888/v1/docs/batch
参数
定义23个suggest词库(定义了两个小米手机,验证是否去重)
tips
学完聚合通过日志埋点、数据落盘进行维护
{
"indexName": "product_completion_index",
"list": [
{
"searchkey": "小米手机",
"name": "小米(MI)"
},
{
"searchkey": "小米10",
"name": "小米(MI)"
},
{
"searchkey": "小米电视",
"name": "小米(MI)"
},
{
"searchkey": "小米路由器",
"name": "小米(MI)"
},
{
"searchkey": "小米9",
"name": "小米(MI)"
},
{
"searchkey": "小米手机",
"name": "小米(MI)"
},
{
"searchkey": "小米耳环",
"name": "小米(MI)"
},
{
"searchkey": "小米8",
"name": "小米(MI)"
},
{
"searchkey": "小米10Pro",
"name": "小米(MI)"
},
{
"searchkey": "小米笔记本",
"name": "小米(MI)"
},
{
"searchkey": "小米摄像头",
"name": "小米(MI)"
},
{
"searchkey": "小米电饭煲",
"name": "小米(MI)"
},
{
"searchkey": "小米充电宝",
"name": "小米(MI)"
},
{
"searchkey": "adidas男鞋",
"name": "adidas男鞋"
},
{
"searchkey": "adidas女鞋",
"name": "adidas女鞋"
},
{
"searchkey": "adidas外套",
"name": "adidas外套"
},
{
"searchkey": "adidas裤子",
"name": "adidas裤子"
},
{
"searchkey": "adidas官方旗舰店",
"name": "adidas官方旗舰店"
},
{
"searchkey": "阿迪达斯袜子",
"name": "阿迪达斯袜子"
},
{
"searchkey": "阿迪达斯外套",
"name": "阿迪达斯外套"
},
{
"searchkey": "阿迪达斯运动鞋",
"name": "阿迪达斯运动鞋"
},
{
"searchkey": "耐克外套",
"name": "耐克外套"
},
{
"searchkey": "耐克运动鞋",
"name": "耐克运动鞋"
}
]
}
返回
查看
GET product_completion_index/_search
5.2 产品搜索与自动补全
- Term suggester :词条建议器。对给输入的文本进进行分词,为每个分词提供词项建议
- Phrase suggester :短语建议器,在term的基础上,会考量多个term之间的关系
- Completion Suggester,它主要针对的应用场景就是"Auto Completion"。
- Context Suggester:上下文建议器
GET product_completion_index/_search
{
"from": 0,
"size": 100,
"suggest": {
"czbk-suggest": {
"prefix": "小米",
"completion": {
"field": "searchkey",
"size": 20,
"skip_duplicates": true
}
}
}
}
5.2.1 汉字补全OpenAPI
-
定义自动补全接口
//自动补全(完成建议) public List<String> cSuggest(CommonEntity commonEntity) throws Exception;
-
定义自动补全实现
@Override public List<String> cSuggest(CommonEntity commonEntity) throws Exception { //定义返回 List<String> suggestList = new ArrayList<>(); //定义自动完成构建器 CompletionSuggestionBuilder completionSuggestionBuilder = SuggestBuilders.completionSuggestion(commonEntity.getSuggestFileld()); //定义搜索关键字 completionSuggestionBuilder.prefix(commonEntity.getSuggestValue()); //去重 completionSuggestionBuilder.skipDuplicates(true); //获取建议条数 completionSuggestionBuilder.size(commonEntity.getSuggestCount()); //定义返回字段 SearchRequest searchRequest = new SearchRequest().indices(commonEntity.getIndexName()).source(new SearchSourceBuilder().sort(new ScoreSortBuilder().order(SortOrder.DESC)).suggest(new SuggestBuilder().addSuggestion("czbk-suggest", completionSuggestionBuilder))); //定义查找响应 SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT); //定义完成建议对象 CompletionSuggestion completionSuggestion = response.getSuggest().getSuggestion("czbk-suggest"); //获取返回数据 List<CompletionSuggestion.Entry.Option> optionList = completionSuggestion.getEntries().get(0).getOptions(); //从optionList取出结果 if (!CollectionUtils.isEmpty(optionList)) { optionList.forEach(item -> { suggestList.add(item.getText().toString()); }); } return suggestList; }
-
定义自动补全控制器
@GetMapping(value = "/csuggest") public ResponseData cSuggest(@RequestBody CommonEntity commonEntity) { //构造返回下游业务数据 ResponseData rData = new ResponseData(); if (StringUtils.isEmpty(commonEntity.getIndexName()) || StringUtils.isEmpty(commonEntity.getSuggestFileld()) || StringUtils.isEmpty(commonEntity.getSuggestValue())) { rData.setResultEnum(ResultEnum.param_isnull); return rData; } //定义建议返回结果 List<String> result = null; try { //通过接口调用批量新增方法 result = elasticsearchDocumentService.cSuggest(commonEntity); //通过类型推断自动装箱(多个参数取交集) rData.setResultEnum(result, ResultEnum.success, result.size()); //日志记录 logger.info(TipsEnum.csuggest_get_doc_success.getMessage()); } catch (Exception e) { //打印到控制台 e.printStackTrace(); //日志记录 logger.error(TipsEnum.csuggest_get_doc_fail.getMessage()); //构建错误返回信息 rData.setResultEnum(ResultEnum.error); } //返回 return rData; }
-
自动补全调用验证
http://192.168.150.7:6666/v1/docs/csuggest
或者
http://localhost:6666/v1/docs/csuggest
参数
{
"indexName": "product_completion_index",
"suggestFileld": "searchkey",
"suggestValue": "小米",
"suggestCount": 13
}
- indexName索引名称
- suggestFileld:自动补全查找列
- suggestValue:自动补全输入的关键字
- suggestCount:自动补全返回个数(京东是13个)
返回
{
"code": "200",
"desc": "操作成功!",
"data": [
"小米10",
"小米10Pro",
"小米8",
"小米9",
"小米充电宝",
"小米手机",
"小米摄像头",
"小米电视",
"小米电饭煲",
"小米笔记本",
"小米耳环",
"小米路由器"
],
"count": 12
}
自动补全自动去重
5.2.2 拼音补全OpenAPI
使用拼音访问【小米】
http://localhost:8888/v1/docs/csuggest
参数
// 全拼访问
{
"indexName": "product_completion_index",
"suggestFileld": "searchkey",
"suggestValue": "xiaomi",
"suggestCount": 13
}
// 全拼访问(分隔)
{
"indexName": "product_completion_index",
"suggestFileld": "searchkey",
"suggestValue": "xiao mi",
"suggestCount": 13
}
// 首字母访问
{
"indexName": "product_completion_index",
"suggestFileld": "searchkey",
"suggestValue": "xm",
"suggestCount": 13
}
5.3 产品搜索与语言处理
5.3.1 什么是语言处理(拼写纠错)
场景描述
例如:错误输入"【adidaas官方旗舰店】 ”能够纠错为【adidas官方旗舰店】
5.3.2 语言处理OpenAPI
GET product_completion_index/_search
{
"suggest": {
"czbk-suggestion": {
"text": "adidaas官方旗舰店",
"phrase": {
"field": "name",
"size": 13
}
}
}
}
-
定义拼写纠错接口
// 拼写纠错 public String pSuggest(CommonEntity commonEntity) throws Exception;
-
定义拼写纠错实现
@Override public String pSuggest(CommonEntity commonEntity) throws Exception { //定义返回 String pSuggestString = new String(); //定义短语建议器的构建器 PhraseSuggestionBuilder phraseSuggestionBuilder = new PhraseSuggestionBuilder(commonEntity.getSuggestFileld()); //设置搜索关键字 phraseSuggestionBuilder.text(commonEntity.getSuggestValue()); //数量匹配 phraseSuggestionBuilder.size(1); //定义返回字段 SearchRequest searchRequest = new SearchRequest().indices(commonEntity.getIndexName()).source(new SearchSourceBuilder().sort(new ScoreSortBuilder().order(SortOrder.DESC)).suggest(new SuggestBuilder().addSuggestion("czbk-suggest", phraseSuggestionBuilder))); //定义查找响应 SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT); //定义短语建议对象 PhraseSuggestion phraseSuggestion = response.getSuggest().getSuggestion("czbk-suggest"); //获取返回数据 List<PhraseSuggestion.Entry.Option> optionList = phraseSuggestion.getEntries().get(0).getOptions(); //从optionList取出结果 if (!CollectionUtils.isEmpty(optionList)) { pSuggestString = optionList.get(0).getText().toString(); } return pSuggestString; }
-
定义拼写纠错控制器
@GetMapping(value = "/psuggest") public ResponseData pSuggest(@RequestBody CommonEntity commonEntity) { //构造返回下游业务数据 ResponseData rData = new ResponseData(); if (StringUtils.isEmpty(commonEntity.getIndexName()) || StringUtils.isEmpty(commonEntity.getSuggestFileld()) || StringUtils.isEmpty(commonEntity.getSuggestValue())) { rData.setResultEnum(ResultEnum.param_isnull); return rData; } //定义纠错返回结果 String result = null; try { //通过接口调用批量新增方法 result = elasticsearchDocumentService.pSuggest(commonEntity); //通过类型推断自动装箱(多个参数取交集) rData.setResultEnum(result, ResultEnum.success, null); //日志记录 logger.info(TipsEnum.psuggest_get_doc_success.getMessage()); } catch (Exception e) { //打印到控制台 e.printStackTrace(); //日志记录 logger.error(TipsEnum.psuggest_get_doc_fail.getMessage()); //构建错误返回信息 rData.setResultEnum(ResultEnum.error); } //返回 return rData; }
-
语言处理调用验证
http://192.168.150.7:6666/v1/docs/psuggest
或者
http://localhost:6666/v1/docs/psuggest参数
{ "indexName": "product_completion_index", "suggestFileld": "name", "suggestValue": "adidaas官方旗舰店" }
- indexName索引名称
- suggestFileld:自动补全查找列
- suggestValue:自动补全输入的关键字
返回
{ "code": "200", "desc": "操作成功!", "data": "adidas官方旗舰店" }
5.4 总结
- 需要一个搜索词库/语料库,不要和业务索引库在一起,方便维护和升级语料库
- 根据分词及其他搜索条件去语料库中查询若干条(京东13条、淘宝(天猫)10条、百度4条)记录
返回 - 为了提升准确率,通常都是前缀搜索
6、电商平台产品推荐
6.1 什么是搜索推荐
例如:关键词输入【阿迪达斯 耐克 外套 运动鞋 袜子】
汪~没有找到与“阿迪达斯 耐克 外套 运动鞋 袜子”相关的商品,为您推荐“ 阿迪达斯耐克运动鞋”的相关商品,或者试试:
6.2 产品推荐OpenAPI
GET product_completion_index/_search
{
"suggest": {
"czbk-suggestion": {
"text": "阿迪达斯 耐克 外套 运动鞋 袜子",
"term": {
"field": "name",
"min_word_length": 2,
"string_distance": "ngram",
"analyzer": "ik_smart"
}
}
}
}
注意的地方,查看官网
https://www.elastic.co/guide/en/elasticsearch/reference/7.4/search-suggesters.html#te
rm-suggester
-
定义搜索推荐接口
//搜索推荐 public String tSuggest(CommonEntity commonEntity) throws Exception;
-
定义搜索推荐实现
@Override public String tSuggest(CommonEntity commonEntity) throws Exception { //定义返回 String tSuggestString = new String(); //定义词条建议器的构建器 TermSuggestionBuilder termSuggestionBuilder = SuggestBuilders.termSuggestion(commonEntity.getSuggestFileld()); //定义搜索关键字 termSuggestionBuilder.text(commonEntity.getSuggestValue()); //设置分词 termSuggestionBuilder.analyzer("ik_smart"); //定义查询长度 termSuggestionBuilder.minWordLength(2); //设置查找算法 termSuggestionBuilder.stringDistance(TermSuggestionBuilder.StringDistanceImpl.NGRAM); //定义返回字段 SearchRequest searchRequest = new SearchRequest().indices(commonEntity.getIndexName()).source(new SearchSourceBuilder().sort(new ScoreSortBuilder().order(SortOrder.DESC)).suggest(new SuggestBuilder().addSuggestion("czbk-suggest", termSuggestionBuilder))); //定义查找响应 SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT); //定义term建议对象 TermSuggestion termSuggestion = response.getSuggest().getSuggestion("czbk-suggest"); //获取返回数据 List<TermSuggestion.Entry.Option> optionList = termSuggestion.getEntries().get(0).getOptions(); //从optionList取出结果 if (!CollectionUtils.isEmpty(optionList)) { tSuggestString = optionList.get(0).getText().toString(); } return tSuggestString; }
-
定义搜索推荐控制器
@GetMapping(value = "/tsuggest") public ResponseData tSuggest(@RequestBody CommonEntity commonEntity) { //构造返回下游业务数据 ResponseData rData = new ResponseData(); if (StringUtils.isEmpty(commonEntity.getIndexName()) || StringUtils.isEmpty(commonEntity.getSuggestFileld()) || StringUtils.isEmpty(commonEntity.getSuggestValue())) { rData.setResultEnum(ResultEnum.param_isnull); return rData; } //定义搜索推荐返回结果 String result = null; try { //通过接口调用批量新增方法 result = elasticsearchDocumentService.tSuggest(commonEntity); //通过类型推断自动装箱(多个参数取交集) rData.setResultEnum(result, ResultEnum.success, null); //日志记录 logger.info(TipsEnum.tsuggest_get_doc_success.getMessage()); } catch (Exception e) { //打印到控制台 e.printStackTrace(); //日志记录 logger.error(TipsEnum.tsuggest_get_doc_fail.getMessage()); //构建错误返回信息 rData.setResultEnum(ResultEnum.error); } //返回 return rData; }
-
语言处理调用验证
http://127.0.0.1:8888/v1/docs/tsuggest
参数
{ "indexName": "product_completion_index", "suggestFileld": "name", "suggestValue": "阿迪达斯 耐克 外套 运动鞋 袜子" }
- indexName索引名称
- suggestFileld:自动补全查找列
- suggestValue:自动补全输入的关键字
返回
{ "code": "200", "desc": "操作成功!", "data": "阿迪达斯外套" }
7、指标聚合与下钻分析
7.1 指标聚合与分类
什么是指标聚合(Metric)
聚合分析是数据库中重要的功能特性,完成对某个查询的数据集中数据的聚合计算,
如:找出某字段(或计算表达式的结果)的最大值、最小值,计算和、平均值等。
ES作为搜索引擎兼数据库,同样提供了强大的聚合分析能力。
对一个数据集求最大值、最小值,计算和、平均值等指标的聚合,在ES中称为指标聚合。
Metric聚合分析分为单值分析和多值分析两类
- 单值分析,只输出一个分析结果
min,max,avg,sum,cardinality(cardinality 求唯一值,即不重复的字段有多少(相当于mysql中的
distinct) - 多值分析,输出多个分析结果
stats,extended_stats,percentile,percentile_rank
7.2 指标聚合与下钻设计
语法
"aggregations" : {
"<aggregation_name>" : { <!--聚合的名字 -->
"<aggregation_type>" : { <!--聚合的类型 -->
<aggregation_body> <!--聚合体:对哪些字段进行聚合 -->
}
[,"meta" : { [<meta_data_body>] } ]? <!--元 -->
[,"aggregations" : { [<sub_aggregation>]+ } ]? <!--在聚合里面在定义子聚合-->
}
[,"<aggregation_name_2>" : { ... } ]* <!--聚合的名字 -->
}
openAPI设计目标与原则:
- DSL调用与语法进行高度抽象,参数动态设计
- Open API通过结果转换器支持上百种组合调用
qurey,constant_score,match/matchall/filter/sort/size/frm/higthlight/_source/includes - 逻辑处理公共调用,提升API业务处理能力
- 保留原生API与参数的用法
7.2.1 基础框架搭建
7.2.2 单值分析API设计
-
Avg(平均值)
从聚合文档中提取的价格的平均值。
对所有文档进行avg聚合(DSL)POST product_list/_search { "size": 0, "aggs": { "czbk": { "avg": { "field": "price" } } } }
以上汇总计算了所有文档的平均值。
“size”: 0, 表示只查询文档聚合数量,不查文档,如查询50,size=50
aggs:表示是一个聚合
czbk:可自定义,聚合后的数据将显示在自定义字段中结果:
{ "took" : 1662, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 10000, "relation" : "gte" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "czbk" : { "value" : 920.1535462724372 } } }
OpenAPI查询参数设计
{ "indexName": "product_list", "map": { "size": 0, "aggs": { "czbk": { "avg": { "field": "price" } } } } }
对筛选后的文档聚合
POST product_list/_search { "size": 0, "query": { "match": { "onelevel": "手机通讯" } }, "aggs": { "czbk": { "avg": { "field": "price" } } } }
结果:
{ "took" : 159, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 10000, "relation" : "gte" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "czbk" : { "value" : 314.77633210684854 } } }
OpenAPI查询参数设计
{ "indexName": "product_list", "map": { "size": 0, "query": { "match": { "onelevel": "手机通讯" } }, "aggs": { "czbk": { "avg": { "field": "price" } } } } }
根据Script计算平均值:
es所使用的脚本语言是painless这是一门安全-高效的脚本语言,基于jvm的
#统计所有 POST product_list/_search?size=0 { "aggs": { "czbk": { "avg": { "script": { "source": "doc.evalcount.value" } } } } } 结果:"value" : 599929.110015995 #有条件 POST product_list/_search?size=0 { "query": { "match": { "onelevel": "手机通讯" } }, "aggs": { "czbk": { "avg": { "script": { "source": "doc.evalcount" } } } } } 结果:"value" : 600055.6935087288
OpenAPI查询参数设计
{ "indexName": "product_list", "map": { "size": 0, "aggs": { "czbk": { "avg": { "script": { "source": "doc.evalcount" } } } } } }
总结:
avg平均
1、统一avg(所有文档)
2、有条件avg(部分文档)
3、脚本统计(所有)
4、脚本统计(部分)代码编写
//平均值 if (m.getValue() instanceof ParsedAvg) { map.put("value", ((ParsedAvg) m.getValue()).getValue()); }
访问验证
http://localhost:6666/v1/analysis/metric/agg
或者
http://localhost:5555/v1/analysis/metric/agg -
Max(最大值)
计算从聚合文档中提取的数值的最大值。
统计所有文档
POST product_list/_search { "size": 0, "aggs": { "czbk": { "max": { "field": "price" } } } }
结果: “value” :1.0E8
OpenAPI查询参数设计
{ "indexName": "product_list", "map": { "size": 0, "aggs": { "czbk": { "max": { "field": "price" } } } } }
统计过滤后的文档
POST product_list/_search { "size": 0, "query": { "match": { "onelevel": "手机通讯" } }, "aggs": { "czbk": { "max": { "field": "price" } } } }
结果: “value” :2474000.0
OpenAPI查询参数设计
{ "indexName": "product_list", "map": { "size": 0, "query": { "match": { "onelevel": "手机通讯" } }, "aggs": { "czbk": { "max": { "field": "price" } } } } }
结果: “value” : 2474000.0
代码编写
//最大值 if (m.getValue() instanceof ParsedMax) { map.put("value", ((ParsedMax) m.getValue()).getValue()); }
访问验证
http://localhost:6666/v1/analysis/metric/agg
OR
http://localhost:5555/v1/analysis/metric/agg -
Min(最小值)
计算从聚合文档中提取的数值的最小值。
统计所有文档
POST product_list/_search { "size": 0, "aggs": { "czbk": { "min": { "field": "price" } } } }
结果:“value”: 0.0
OpenAPI查询参数设计
{ "indexName": "product_list", "map": { "size": 0, "aggs": { "czbk": { "min": { "field": "price" } } } } }
统计筛选后的文档
POST product_list/_search { "size": 1, "query": { "match": { "onelevel": "手机通讯" } }, "aggs": { "czbk": { "min": { "field": "price" } } } }
结果:“value”: 0.0
参数size=1;可查询出金额为0的数据
OpenAPI查询参数设计
{ "indexName": "product_list", "map": { "size": 1, "query": { "match": { "onelevel": "手机通讯" } }, "aggs": { "czbk": { "min": { "field": "price" } } } } }
代码编写
//最小值 if (m.getValue() instanceof ParsedMin) { map.put("value", ((ParsedMin) m.getValue()).getValue()); }
访问验证
http://localhost:6666/v1/analysis/metric/agg
或者
http://localhost:5555/v1/analysis/metric/agg -
Sum(总和)
统计所有文档汇总
POST product_list/_search { "size": 0, "query": { "constant_score": { "filter": { "match": { "threelevel": "手机" } } } }, "aggs": { "czbk": { "sum": { "field": "price" } } } }
结果:“value” : 9.652872986812243E8
OpenAPI查询参数设计
{ "indexName": "product_list", "map": { "size": 0, "query": { "constant_score": { "filter": { "match": { "threelevel": "手机" } } } }, "aggs": { "czbk": { "sum": { "field": "price" } } } } }
代码编写
//求和 if (m.getValue() instanceof ParsedSum) { map.put("value", ((ParsedSum) m.getValue()).getValue()); }
访问验证
http://localhost:6666/v1/analysis/metric/agg
OR
http://localhost:5555/v1/analysis/metric/agg -
Cardinality(唯一值)
Cardinality Aggregation,基数聚合。它属于multi-value,基于文档的某个值(可以是特定的字段,
也可以通过脚本计算而来),计算文档非重复的个数(去重计数),相当于sql中的distinct。cardinality 求唯一值,即不重复的字段有多少(相当于mysql中的distinct)
统计所有文档
POST product_list/_search { "size": 0, "aggs": { "czbk": { "cardinality": { "field": "storename.keyword" } } } }
结果:“value” : 103169
OpenAPI查询参数设计
{ "indexName": "product_list", "map": { "size": 0, "aggs": { "czbk": { "cardinality": { "field": "storename.keyword" } } } } }
统计筛选后的文档
POST product_list/_search { "size": 0, "query": { "constant_score": { "filter": { "match": { "threelevel": "手机" } } } }, "aggs": { "czbk": { "cardinality": { "field": "storename.keyword" } } } }
OpenAPI查询参数设计
{ "indexName": "product_list", "map": { "size": 0, "query": { "constant_score": { "filter": { "match": { "threelevel": "手机" } } } }, "aggs": { "czbk": { "cardinality": { "field": "storename.keyword" } } } } }
代码编写
//不重复的值 if (m.getValue() instanceof ParsedCardinality) { map.put("value", ((ParsedCardinality) m.getValue()).getValue()); }
访问验证
http://localhost:6666/v1/analysis/metric/agg
OR
http://localhost:5555/v1/analysis/metric/agg
7.2.3 多值分析API设计
-
Stats Aggregation
Stats Aggregation,统计聚合。它属于multi-value,基于文档的某个值(可以是特定的数值型字段,也可以通过脚本计算而来),计算出一些统计信息(min、max、sum、count、avg5个值)
统计所有文档
POST product_list/_search { "size": 0, "aggs": { "czbk": { "stats": { "field": "price" } } } }
返回
"aggregations" : { "czbk" : { "count" : 5072448, "min" : 0.0, "max" : 1.0E8, "avg" : 920.1535462724372, "sum" : 4.667431015482532E9 } }
OpenAPI查询参数设计
{ "indexName": "product_list", "map": { "size": 0, "aggs": { "czbk": { "stats": { "field": "price" } } } } }
统计筛选文档
POST product_list/_search { "size": 0, "query": { "constant_score": { "filter": { "match": { "threelevel": "手机" } } } }, "aggs": { "czbk": { "stats": { "field": "price" } } } }
OpenAPI查询参数设计
{ "indexName": "product_list", "map": { "size": 0, "query": { "constant_score": { "filter": { "match": { "threelevel": "手机" } } } }, "aggs": { "czbk": { "stats": { "field": "price" } } } } }
代码编写
//状态统计 if (m.getValue() instanceof ParsedStats) { map.put("count", ((ParsedStats) m.getValue()).getCount()); map.put("min", ((ParsedStats) m.getValue()).getMin()); map.put("max", ((ParsedStats) m.getValue()).getMax()); map.put("avg", ((ParsedStats) m.getValue()).getAvg()); map.put("sum", ((ParsedStats) m.getValue()).getSum()); }
访问验证
http://localhost:6666/v1/analysis/metric/agg
OR
http://localhost:5555/v1/analysis/metric/agg -
扩展状态统计
Extended Stats Aggregation,扩展统计聚合。它属于multi-value,比stats多4个统计结果: 平方
和、方差、标准差、平均值加/减两个标准差的区间统计所有文档
POST product_list/_search { "size": 0, "aggs": { "czbk": { "extended_stats": { "field": "price" } } } }
返回
"aggregations" : { "czbk" : { "count" : 5072448, "min" : 0.0, "max" : 1.0E8, "avg" : 920.1535462724372, "sum" : 4.667431015482532E9, "sum_of_squares" : 2.0182209454063148E16, "variance" : 3.9779441210362864E9, "variance_population" : 3.9779441210362864E9, "variance_sampling" : 3.9779449052621484E9, "std_deviation" : 63070.94514145389, "std_deviation_population" : 63070.94514145389, "std_deviation_sampling" : 63070.951358467304, "std_deviation_bounds" : { "upper" : 127062.04382918023, "lower" : -125221.73673663534, "upper_population" : 127062.04382918023, "lower_population" : -125221.73673663534, "upper_sampling" : 127062.05626320705, "lower_sampling" : -125221.74917066217 } } }
- sum_of_squares:平方和
- variance:方差
- std_deviation:标准差
- std_deviation_bounds:标准差的区间
OpenAPI查询参数设计
{ "indexName": "product_list", "map": { "size": 0, "aggs": { "czbk": { "extended_stats": { "field": "price" } } } } }
统计筛选后的文档
POST product_list/_search { "size": 1, "query": { "constant_score": { "filter": { "match": { "threelevel": "手机" } } } }, "aggs": { "czbk": { "extended_stats": { "field": "price" } } } }
返回
"aggregations" : { "czbk" : { "count" : 340378, "min" : 0.0, "max" : 2474000.0, "avg" : 2835.927406240193, "sum" : 9.652872986812243E8, "sum_of_squares" : 6.06065362437439E13, "variance" : 1.7001407710991383E8, "variance_population" : 1.7001407710991383E8, "variance_sampling" : 1.7001457659747353E8, "std_deviation" : 13038.944631752749, "std_deviation_population" : 13038.944631752749, "std_deviation_sampling" : 13038.963785419206, "std_deviation_bounds" : { "upper" : 28913.81666974569, "lower" : -23241.961857265305, "upper_population" : 28913.81666974569, "lower_population" : -23241.961857265305, "upper_sampling" : 28913.854977078605, "lower_sampling" : -23242.00016459822 } } }
OpenAPI查询参数设计
{ "indexName": "product_list", "map": { "size": 1, "query": { "constant_score": { "filter": { "match": { "threelevel": "手机" } } } }, "aggs": { "czbk": { "extended_stats": { "field": "price" } } } } }
代码编写
状态统计ParsedStats是扩展状态统计ParsedExtendedStats父类
判断无需更改顺序
//扩展统计 if (m.getValue() instanceof ParsedExtendedStats) { map.put("count", ((ParsedExtendedStats) m.getValue()).getCount()); map.put("min", ((ParsedExtendedStats) m.getValue()).getMin()); map.put("max", ((ParsedExtendedStats) m.getValue()).getMax()); map.put("avg", ((ParsedExtendedStats) m.getValue()).getAvg()); map.put("sum", ((ParsedExtendedStats) m.getValue()).getSum()); map.put("sum_of_squares", ((ParsedExtendedStats) m.getValue()).getSumOfSquares()); map.put("variance", ((ParsedExtendedStats) m.getValue()).getVariance()); map.put("std_deviation", ((ParsedExtendedStats) m.getValue()).getStdDeviation()); map.put("upper", ((ParsedExtendedStats) m.getValue()).getStdDeviationBound(ExtendedStats.Bounds.UPPER)); map.put("lower", ((ParsedExtendedStats) m.getValue()).getStdDeviationBound(ExtendedStats.Bounds.LOWER)); }
访问验证
http://localhost:6666/v1/analysis/metric/agg
OR
http://localhost:5555/v1/analysis/metric/agg -
百分位度量/百分比统计
Percentiles Aggregation,百分比聚合。它属于multi-value,对指定字段(脚本)的值按从小到大累计每个值对应的文档数的占比(占所有命中文档数的百分比),返回指定占比比例对应的值。默认返回[1, 5, 25, 50, 75, 95, 99 ]分位上的值。
它们表示了人们感兴趣的常用百分位数值。
统计所有文档
POST product_list/_search { "size": 0, "aggs": { "czbk": { "percentiles": { "field": "price" } } } }
返回
}, "aggregations" : { "czbk" : { "values" : { "1.0" : 0.0, "5.0" : 14.99999272133453, "25.0" : 58.76038168571048, "50.0" : 139.47447505232998, "75.0" : 388.59368606915626, "95.0" : 3634.3835145207904, "99.0" : 12547.450833578012 } } }
OpenAPI查询参数设计
{ "indexName": "product_list", "map": { "size": 0, "aggs": { "czbk": { "percentiles": { "field": "price" } } } } }
统计筛选后的文档
POST product_list/_search { "size": 0, "query": { "constant_score": { "filter": { "match": { "threelevel": "手机" } } } }, "aggs": { "czbk": { "percentiles": { "field": "price" } } } }
OpenAPI查询参数设计
{ "indexName": "product_list", "map": { "size": 0, "query": { "constant_score": { "filter": { "match": { "threelevel": "手机" } } } }, "aggs": { "czbk": { "percentiles": { "field": "price" } } } } }
代码编写
//百分位度量 if (m.getValue() instanceof ParsedTDigestPercentiles) { for (Iterator<Percentile> iterator = ((ParsedTDigestPercentiles) m.getValue()).iterator(); iterator.hasNext(); ) { Percentile p = iterator.next(); map.put(p.getPercent(), p.getValue()); } }
访问验证
http://localhost:6666/v1/analysis/metric/agg
OR
http://localhost:5555/v1/analysis/metric/agg -
百分位等级/百分比排名聚合
百分比排名聚合:这里有另外一个紧密相关的度量叫 percentile_ranks 。 percentiles 度量告诉
我们落在某个百分比以下的所有文档的最小值。统计所有文档
统计价格在15元之内统计价格在30元之内文档数据占有的百分比
tips:
统计数据会变化
这里的15和30;完全可以理解万SLA的200;比较字段不一样而已POST product_list/_search { "size": 0, "aggs": { "czbk": { "percentile_ranks": { "field": "price", "values": [ 15, 30 ] } } } }
返回
价格在15元之内的文档数据占比是4.92%
价格在30元之内的文档数据占比是12.72%"aggregations" : { "czbk" : { "values" : { "15.0" : 4.89331591488828, "30.0" : 12.732247823263487 } } }
OpenAPI查询参数设计
{ "indexName": "product_list", "map": { "size": 0, "aggs": { "czbk": { "percentile_ranks": { "field": "price", "values": [ 15, 30 ] } } } } }
统计过滤后的文档
POST product_list/_search { "size": 0, "query": { "constant_score": { "filter": { "match": { "threelevel": "手机" } } } }, "aggs": { "czbk": { "percentile_ranks": { "field": "price", "values": [ 15, 30 ] } } } }
OpenAPI查询参数设计
{ "indexName": "product_list", "map": { "size": 0, "query": { "constant_score": { "filter": { "match": { "threelevel": "手机" } } } }, "aggs": { "czbk": { "percentile_ranks": { "field": "price", "values": [ 15, 30 ] } } } } }
代码编写
//百分位等级 if (m.getValue() instanceof ParsedTDigestPercentileRanks) { for (Iterator<Percentile> iterator = ((ParsedTDigestPercentileRanks) m.getValue()).iterator(); iterator.hasNext(); ) { Percentile p = iterator.next(); map.put(p.getValue(), p.getPercent()); } }
访问验证
http://localhost:6666/v1/analysis/metric/agg
OR
http://localhost:5555/v1/analysis/metric/agg
8、电商平台日志埋点与搜索热词
8.1 什么是热度搜索
以下为【京东】热搜词
8.2 提取热度搜索
热搜词分析流程图
8.3 日志埋点
下面的配置针对需要埋点的服务
这里以service-elasticsearch为例
-
Spring Cloud 整合Log4j2
相比与其他的日志系统,log4j2丢数据这种情况少;disruptor技术,在多线程环境下,性能高于logback等10倍以上;利用jdk1.5并发的特性,减少了死锁的发生;
排除logback的默认集成。
因为Spring Cloud 默认集成了logback, 所以首先要排除logback的集成,在pom.xml文件<!--排除logback的默认集成 Spring Cloud 默认集成了logback--> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-web</artifactId> <exclusions> <exclusion> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-logging</artifactId> </exclusion> </exclusions> </dependency>
-
引入log4j2起步依赖
<!-- 引入log4j2起步依赖--> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-log4j2</artifactId> </dependency> <!-- log4j2依赖环形队列--> <dependency> <groupId>com.lmax</groupId> <artifactId>disruptor</artifactId> <version>3.4.2</version> </dependency>
-
设置配置文件
如果自定义了文件名,需要在application.yml中配置
进入Nacos修改配置
logging: config: classpath:log4j2-dev.xml
-
配置文件模板
<Configuration> <Appenders> <Socket name="Socket" host="192.168.150.7" port="4567"> <JsonLayout compact="true" eventEol="true" /> </Socket> </Appenders> <Loggers> <Root level="info"> <AppenderRef ref="Socket"/> </Root> </Loggers> </Configuration>
从配置文件中可以看到,这里使用的是Socket Appender来将日志打印的信息发送到Logstash。
注意了,Socket的Appender必须要配置到下面的Logger才能将日志输出到Logstash里!
另外这里的host是部署了Logstash服务端的地址,并且端口号要和你在Logstash里配置的一致才行。
-
日志埋点
private void getClientConditions(CommonEntity commonEntity, SearchSourceBuilder searchSourceBuilder) { //循环下游业务查询条件 for (Map.Entry<String, Object> m : commonEntity.getMap().entrySet()) { if (StringUtils.isNotEmpty(m.getKey()) && m.getValue() != null) { String key = m.getKey(); String value = String.valueOf(m.getValue()); //构造DSL请求体中的query searchSourceBuilder.query(QueryBuilders.matchQuery(key, value)); logger.info("search for the keyword:" + value); } } }
-
创建索引
下面的索引存储用户输入的关键字,最终通过聚合的方式处理索引数据,最终将数据放到语料库
PUT es-log/ { "mappings": { "properties": { "@timestamp": { "type": "date" }, "host": { "type": "text" }, "searchkey": { "type": "keyword" }, "port": { "type": "long" }, "loggerName": { "type": "text" } } } }
8.4 数据落盘
- 配置Logstash.conf
连接logstash方式有两种
(1) 一种是Socket连接
(2)另外一种是gelf连接
对外暴露logstash容器的4567端口:参考文档
- 执行全文检索
http://localhost:8888/v1/docs/mquery
参数
{
"pageNumber": 1,
"pageSize": 3,
"indexName": "product_list",
"highlight": "productname",
"map": {
"productname": "小米"
}
}
- 查询是否有数据
GET es-log/_search
{
"from": 0,
"size": 200,
"query": {
"match_all": {}
}
}
返回
"hits" : [
{
"_index" : "es-log",
"_type" : "_doc",
"_id" : "H94AKpQB5vqCNWpIYHYT",
"_score" : 1.0,
"_source" : {
"host" : "192.168.150.1",
"loggerName" : "com.xin.service.impl.ElasticsearchDocumentServiceImpl",
"@timestamp" : "2025-01-03T02:30:55.118Z",
"searchkey" : "小米",
"port" : 54544
}
},
{
"_index" : "es-log",
"_type" : "_doc",
"_id" : "ZdgAKpQBrYxtVgSQgvHB",
"_score" : 1.0,
"_source" : {
"host" : "192.168.150.1",
"loggerName" : "com.xin.service.impl.ElasticsearchDocumentServiceImpl",
"@timestamp" : "2025-01-03T02:31:04.021Z",
"searchkey" : "小米",
"port" : 54544
}
}
]
8.5 热度搜索OpenAPI
聚合
获取es-log索引中的文档数据并对其进行分组,统计热搜词出现的频率,根据频率获取有效数据。
DSL实现
POST es-log/_search?size=0
{
"aggs": {
"czbk": {
"terms": {
"field": "searchkey",
"min_doc_count": 5,
"size": 2,
"order": {
"_count": "desc"
}
}
}
}
}
返回
{
"took" : 155,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 14,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"czbk" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "华为",
"doc_count" : 7
},
{
"key" : "小米",
"doc_count" : 7
}
]
}
}
}
OpenAPI查询参数设计
-
定义搜索推荐接口
//获取搜索热词 public Map<String, Long> hotWords(CommonEntity commonEntity) throws Exception;
-
定义搜索推荐实现
@Override public Map<String, Long> hotWords(CommonEntity commonEntity) throws Exception { //定义返回数据 Map<String, Long> map = new LinkedHashMap<>(); //执行查询 SearchResponse response = getSearchResponse(commonEntity); //接收数据 Terms termsAggData = response.getAggregations().get(response.getAggregations().getAsMap().entrySet().iterator().next().getKey()); for (Terms.Bucket entry : termsAggData.getBuckets()) { if (entry.getKey() != null) { //key为分组字段 String key = entry.getKey().toString(); //count数据条数 Long count = entry.getDocCount(); //设置到map map.put(key, count); } } return map; }
-
定义搜索推荐控制器
@GetMapping(value = "/hotwords") public ResponseData hotWords(@RequestBody CommonEntity commonEntity) { //构造返回数据 ResponseData responseData = new ResponseData(); if (StringUtils.isEmpty(commonEntity.getIndexName())) { responseData.setResultEnum(ResultEnum.param_isnull); return responseData; } //定义查询返回结果 Map<String, Long> result = null; try { result = analysisService.hotWords(commonEntity); //通过类型推断自动装箱 responseData.setResultEnum(result, ResultEnum.success, null); //日志记录 logger.info(TipsEnum.hotwords_get_doc_success.getMessage()); } catch (Exception e) { //打印到控制台 e.printStackTrace(); //日志记录 logger.error(TipsEnum.hotwords_get_doc_fail.getMessage()); //构建错误信息 responseData.setResultEnum(ResultEnum.error); } return responseData; }
-
调用验证
获取分析服务热搜词数据
http://localhost:5555/v1/analysis/hotwords
参数
{ "indexName": "es-log", "map": { "aggs": { "per_count": { "terms": { "field": "searchkey", "min_doc_count": 5, "size": 2, "order": { "_count": "desc" } } } } } }
- field表示需要查找的列
- min_doc_count:热搜词在文档中出现的次数
- size表示本次取出多少数据
- order表示排序(升序or降序)
返回
{ "code": "200", "desc": "操作成功!", "data": { "华为": 7, "小米": 7 } }