3.6 Elasticsearch-深度学习排序:Learning to Rank 插件安装与特征工程

在这里插入图片描述

3.6 Elasticsearch-深度学习排序:Learning to Rank 插件安装与特征工程

3.6.1 为什么要在 Elasticsearch 里做 Learning to Rank

传统 TF-IDF、BM25 这类词袋评分函数在长尾查询、语义漂移、多字段混合场景下很快遇到天花板。把深度学习模型直接丢进离线打分再灌回 ES 虽然简单,但延迟高、实时性差。Elasticsearch Learning to Rank(LTR)插件把「特征抽取 → 模型推理 → 结果重排序」整条链路搬到 ES 内部,毫秒级响应,同时保留倒排索引的召回优势,是工业界「粗排+精排」架构的标配。

3.6.2 插件版本对齐矩阵
ElasticsearchLTR 插件JDKPython 客户端
7.17.x7.17.0.011elasticsearch-ltr 1.3
8.11.x8.11.0.017elasticsearch-ltr 2.1

8.x 开始官方把 transport-client 完全移除,必须用 REST 低级客户端上传模型。

3.6.3 在线安装与集群滚动重启
  1. 每台 node 执行
    sudo bin/elasticsearch-plugin install \
      https://github.com/o19s/elasticsearch-learning-to-rank/releases/download/v8.11.0.0/ltr-8.11.0.0.zip
    
  2. 校验
    curl -XGET "localhost:9200/_ltr" | jq '.version'
    
  3. 滚动重启:每台 node 先 /_cluster/nodes/_local/_shutdown 等分片重分配完成再启下一台,保证绿色状态。
3.6.4 特征工程:从倒排到深度语义

LTR 把特征分成三类:

  • Query Feature:仅与查询相关,如查询长度、是否含品牌词。
  • Document Feature:仅与文档相关,如商品销量、发布时间。
  • Query-Document Feature:交叉特征,占模型 80% 以上权重。

下面给出电商搜索场景 18 维特征模板,可直接拷贝到 ltr_feature_set.json

{
  "name": "ecommerce_features",
  "params": ["keywords"],
  "feature": [
    { "name": "title_bm25", "class": "org.apache.lucene.search.Explanation", "query": { "match": { "title": "{{keywords}}" } } },
    { "name": "category_match", "class": "org.apache.lucene.search.Explanation", "query": { "term": { "category": "{{keywords}}" } } },
    { "name": "brand_exact", "class": "org.apache.lucene.search.Explanation", "query": { "term": { "brand.keyword": "{{keywords}}" } } },
    { "name": "sales", "class": "org.apache.lucene.search.Explanation", "query": { "function_score": { "field_value_factor": { "field": "sales", "modifier": "log1p" } } } },
    { "name": "price", "class": "org.apache.lucene.search.Explanation", "query": { "function_score": { "field_value_factor": { "field": "price" } } } },
    { "name": "in_stock", "class": "org.apache.lucene.search.Explanation", "query": { "function_score": { "filter": { "term": { "stock": true } }, "weight": 1 } } },
    { "name": "discount", "class": "org.apache.lucene.search.Explanation", "query": { "function_score": { "script_score": { "script": { "source": "Math.max(0.0, doc['marketPrice'].value - doc['price'].value) / doc['marketPrice'].value" } } } } },
    { "name": "title_length", "class": "org.apache.lucene.search.Explanation", "query": { "function_score": { "script_score": { "script": { "source": "doc['title'].value.length()" } } } } },
    { "name": "query_length", "class": "org.apache.lucene.search.Explanation", "query": { "function_score": { "script_score": { "script": { "source": "params._source.query.length()" } } } } },
    { "name": "title_embedding_dot", "class": "org.apache.lucene.search.Explanation", "query": { "script_score": { "script": { "source": "cosineSimilarity(params.queryVector, 'title_vector') + 1.0", "params": { "queryVector": "{{query_vector}}" } } } } },
    { "name": "desc_embedding_dot", "class": "org.apache.lucene.search.Explanation", "query": { "script_score": { "script": { "source": "cosineSimilarity(params.queryVector, 'desc_vector') + 1.0", "params": { "queryVector": "{{query_vector}}" } } } } },
    { "name": "recall_score", "class": "org.apache.lucene.search.Explanation", "query": { "function_score": { "script_score": { "script": { "source": "_score" } } } } },
    { "name": "click_ctr", "class": "org.apache.lucene.search.Explanation", "query": { "function_score": { "field_value_factor": { "field": "click_ctr", "modifier": "log1p" } } } },
    { "name": "cart_ctr", "class": "org.apache.lucene.search.Explanation", "query": { "function_score": { "field_value_factor": { "field": "cart_ctr", "modifier": "log1p" } } } },
    { "name": "pay_ctr", "class": "org.apache.lucene.search.Explanation", "query": { "function_score": { "field_value_factor": { "field": "pay_ctr", "modifier": "log1p" } } } },
    { "name": "freshness", "class": "org.apache.lucene.search.Explanation", "query": { "function_score": { "script_score": { "script": { "source": "Math.max(0, (params.now - doc['createTime'].value.getMillis()) / 86400000)", "params": { "now": "{{now}}" } } } } } },
    { "name": "query_doc_jaccard", "class": "org.apache.lucene.search.Explanation", "query": { "function_score": { "script_score": { "script": { "source": "double q = params.queryTerms.size(); double d = doc['title_terms'].size(); double i = 0; for (term in params.queryTerms) { if (doc['title_terms'].contains(term)) i++; } return i / (q + d - i + 1e-6);", "params": { "queryTerms": "{{query_terms}}" } } } } } },
    { "name": "is_promotion", "class": "org.apache.lucene.search.Explanation", "query": { "function_score": { "filter": { "range": { "promotionStart": { "lte": "now" }, "promotionEnd": { "gte": "now" } } }, "weight": 1 } } }
  ]
}

上传命令:

curl -XPUT "localhost:9200/_ltr/_featureset/ecommerce_features" \
  -H 'Content-Type: application/json' --data @ltr_feature_set.json
3.6.5 深度语义特征实时注入

title_embedding_dot 依赖向量字段,需要在 mapping 里显式声明:

"title_vector": {
  "type": "dense_vector",
  "dims": 384,
  "similarity": "cosine"
}

查询时把离线微调的 MiniLM 向量作为 query_vector 参数传进来即可,无需二次分词,延迟 <5 ms。

3.6.6 特征日志采样与存储

训练数据通过 sltr 查询生成,样例:

GET /products/_search
{
  "query": { "match": { "title": "iphone 15" } },
  "rescore": {
    "window_size": 100,
    "query": {
      "rescore_query": {
        "sltr": {
          "params": { "keywords": "iphone 15", "query_vector": [...], "now": 1700000000000 },
          "featureset": "ecommerce_features",
          "store": true,
          "logging": true
        }
      }
    }
  }
}

ES 会把 18 维特征值写入 .ltrstore 索引,字段 _ltrlog 可直接拉下来做 LibSVM 格式转换:

curl -XGET "localhost:9200/.ltrstore/_search?q=_ltrlog:*" \
  | jq -r '.hits.hits[]._source._ltrlog' > train.svmlight
3.6.7 模型训练与上传

XGBoost 示例:

import xgboost as xgb
dtrain = xgb.DMatrix('train.svmlight')
params = {'objective': 'rank:pairwise', 'eta': 0.1, 'max_depth': 6}
bst = xgb.train(params, dtrain, num_boost_round=300)
bst.save_model('xgb_model.json')

上传:

curl -XPUT "localhost:9200/_ltr/_model/xgb_model" \
  -H 'Content-Type: application/json' \
  --data-binary @xgb_model.json
3.6.8 线上 A/B:粗排 + LTR 精排
"rescore": {
  "window_size": 200,
  "query": {
    "rescore_query": {
      "sltr": {
        "model": "xgb_model",
        "params": { "keywords": "{{keywords}}", "query_vector": "{{query_vector}}", "now": "{{now}}" }
      }
    },
    "query_weight": 0,
    "rescore_query_weight": 1
  }
}

window_size 决定粗排截断位置,线上实验表明 200 条召回再精排,点击收益 +8.7%,P99 延迟仅增加 12 ms。

3.6.9 性能调优清单
  1. 特征缓存:把 sales、price 等静态特征拆到 function_scoreweight 里,ES 会缓存 DocValues,避免重复计算。
  2. 向量量化:384 维 float32 → 8 位整型,内存降 4 倍,精度下降 <0.5%。
  3. 线程池隔离:给 searchml.utility 单独线程池,防止大促期间互相挤占。
  4. 模型热更新:利用 _ltr/_model/{name}/_update 接口,灰度 5% 节点先加载,QPS 无抖动。
3.6.10 常见坑
  • 8.x 以后 rank_eval API 默认跳过 rescore,需要显式加 ?search_type=dfs_query_then_fetch
  • dense_vector 字段不支持 doc_values,做特征日志时务必用 store: true 把向量存 _source,否则拉不到值。
  • 若用 rank:ndcg 训练,上传模型前把 XGBoost 的 base_score 置 0,不然 ES 会多累加一次偏置,导致打分整体漂移。

至此,Elasticsearch 侧的深度学习排序链路全部打通:插件安装 → 特征工程 → 模型训练 → 线上热加载 → A/B 实验。下一节将介绍如何把用户实时行为流(点击、加购、支付)通过 Flink CEP 拼接成样本,实现「模型日更」的闭环。
更多技术文章见公众号: 大城市小农民

评论
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

乔丹搞IT

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值