[Search Engines Notes] 16: Ranked retrieval: Feature-based models

License: CC 4.0 BY-SA


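This post introduces the basic concepts of Learning to Rank and how it is applied in search engines. It explains why learning-to-rank methods are needed and compares the three main approaches: pointwise, pairwise, and listwise.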

References:

Jamie's course slides: http://boston.lti.cs.cmu.edu/classes/11-642/

SE notes by 阿衡 (Aheng): http://www.shuang0420.com/categories/NLP/Search-Engines/

Why Learning to Rank:

We have already covered many retrieval methods:

Retrieval models: Vector Space, BM25, language models, ...
Representations: title, body, url, inlink, ...
Query templates: sequential dependency models, ...
Query-independent evidence: PageRank, URL depth, ...

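Evidence from these different methods can be combined to improve search engine accuracy. There are many possible combinations and many parameters to tune, so combining the methods by hand is impractical. That is the motivation for Learning to Rank.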

Main idea:

Learn a model that combines many types of evidence (each type as a feature).

As in other ML problems:

Given training data that maps feature vectors to desired scores, learn a model Y = f(X; w).
For a new data point x, plug it into the model to get y = f(x; w).
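The two steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the notes do not say what form f takes, so the sketch assumes a linear model f(x; w) = w · x trained by gradient descent on squared error, and the training data is made up.

```python
from typing import List

def predict(x: List[float], w: List[float]) -> float:
    """Assumed model form: linear scoring function f(x; w) = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def fit(X: List[List[float]], Y: List[float],
        lr: float = 0.1, epochs: int = 2000) -> List[float]:
    """Fit w by batch gradient descent on mean squared error."""
    n, dim = len(X), len(X[0])
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for x, y in zip(X, Y):
            err = predict(x, w) - y
            for j in range(dim):
                grad[j] += 2.0 * err * x[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Made-up training data: two features per document and a desired score.
# The scores were generated as 2 * x[0] + 1 * x[1].
X_train = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
Y_train = [2.0, 1.0, 3.0, 1.5]

w = fit(X_train, Y_train)           # learn Y = f(X; w)
y_new = predict([0.2, 0.8], w)      # score a new data point x
```

On this toy data the learned weights approach [2, 1], the weights used to generate the desired scores.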

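Introduction:

LeToR is a supervised learning task. The familiar supervised tasks are classification (harsh, because a misclassified item counts as completely wrong) and regression (more forgiving, because the loss is the difference from the target, so even a large error is not counted as completely wrong). The ranking task is to find the best ranking (order) of the given documents, but in practice the problem is usually recast as finding ranking scores.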

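In a large search engine, retrieval is carried out by a sequence of retrieval models:

Exact-match Boolean: form a set of docs. This stage has to be fast, and because document quality is still very uneven at this point, a cruder algorithm is good enough.
Best-match retrieval: rank the set and keep the top portion.
L2R: rerank and keep the top portion. By this stage there are few documents and they are of similar quality, so a more complex method is needed to tell them apart.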

In large search engines, LeToR is used at the top of the pipeline (reranking a small number of documents) to balance efficiency and effectiveness.
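A minimal sketch of such a three-stage cascade, to make the data flow concrete. Everything here is illustrative: the function names, the toy documents, the term-frequency score standing in for a real best-match model such as BM25, and the `learned_score` table standing in for a trained L2R model are all assumptions, not part of the course material.

```python
from typing import Callable, Dict, List

def boolean_match(query: List[str], docs: Dict[str, str]) -> List[str]:
    """Stage 1: exact-match Boolean AND. Keep docs containing every query term."""
    return sorted(doc_id for doc_id, text in docs.items()
                  if all(term in text.lower().split() for term in query))

def best_match(query: List[str], docs: Dict[str, str],
               candidates: List[str], k: int) -> List[str]:
    """Stage 2: cheap ranking. Summed term frequency stands in for BM25 here."""
    def tf_score(doc_id: str) -> int:
        tokens = docs[doc_id].lower().split()
        return sum(tokens.count(term) for term in query)
    return sorted(candidates, key=lambda d: (-tf_score(d), d))[:k]

def rerank(candidates: List[str], model: Callable[[str], float], k: int) -> List[str]:
    """Stage 3: rerank the few surviving docs with a (more expensive) learned model."""
    return sorted(candidates, key=lambda d: (-model(d), d))[:k]

# Toy collection (made up).
docs = {
    "d1": "cheap flights to boston cheap cheap",
    "d2": "boston flights schedule",
    "d3": "cheap hotels in boston",
    "d4": "cheap flights boston deals",
}
query = ["cheap", "boston"]

# Hypothetical output of a trained L2R model, hard-coded for the example.
learned_score = {"d1": 0.2, "d2": 0.1, "d3": 0.9, "d4": 0.5}

stage1 = boolean_match(query, docs)              # d2 is dropped (no "cheap")
stage2 = best_match(query, docs, stage1, k=2)    # keep the top 2 by term frequency
final = rerank(stage2, lambda d: learned_score[d], k=2)
```

Each stage sees fewer documents than the one before it, which is why the most expensive model can be reserved for the last step.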

L2R has three main dimensions:

document representation (features)
type of training data
ML algorithm

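L2R Framework:

S1: For each query, run feature extraction over the docs, representing doc d as a feature vector x.

S2: Learn a model from the training data.

The model assigns each doc a score with respect to the query.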

Document representation (features):

VSM
coordinationMatch: number of query terms matching doc d
BM25 for either doc or url, body, inlink, title, keywords, ...
Indri
PageRank
Spam
URLDepth
Wiki score
Avg word length
...
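A few of the simpler features in this list can be computed directly, as the sketch below shows. It is a rough illustration only: whitespace tokenization, the definition of URL depth as the number of non-empty path segments, and the example URL are assumptions made for the example, and the helper names are hypothetical.

```python
from typing import List
from urllib.parse import urlparse

def coordination_match(query_terms: List[str], doc_tokens: List[str]) -> int:
    """Number of distinct query terms that occur in the document."""
    doc_vocab = set(doc_tokens)
    return sum(1 for term in set(query_terms) if term in doc_vocab)

def avg_word_length(doc_tokens: List[str]) -> float:
    """Mean token length in characters (0.0 for an empty document)."""
    if not doc_tokens:
        return 0.0
    return sum(len(tok) for tok in doc_tokens) / len(doc_tokens)

def url_depth(url: str) -> int:
    """Assumed definition: number of non-empty path segments in the URL."""
    return len([seg for seg in urlparse(url).path.split("/") if seg])

def extract_features(query_terms: List[str], doc_text: str, url: str) -> List[float]:
    """Build the feature vector for one (query, document) pair."""
    tokens = doc_text.lower().split()   # naive whitespace tokenization
    return [
        float(coordination_match(query_terms, tokens)),
        avg_word_length(tokens),
        float(url_depth(url)),
    ]

features = extract_features(
    ["learning", "rank"],
    "Learning to rank for search",
    "http://example.com/a/b/page.html",
)
```

Features such as BM25, Indri, or PageRank scores would be appended to the same vector, but they need collection statistics or a link graph and are left out of this sketch.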

Types of training data:

Binary assessments: relevant or not relevant (two-valued)
Document scores: a real-valued score or a few graded levels
Preferences (di > dj)
Rankings (di > dj > dk > dm > ...)
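These kinds of training data are related: graded document scores can be turned into preferences or into a ranking. The sketch below shows one way to do that, assuming made-up relevance grades for a single query. Ties produce no preference, and in the ranking they are broken by document id, which is an arbitrary choice made for the example.

```python
from itertools import combinations
from typing import Dict, List, Tuple

def scores_to_preferences(scores: Dict[str, int]) -> List[Tuple[str, str]]:
    """Return (preferred, other) pairs; tied documents yield no preference."""
    prefs = []
    for a, b in combinations(sorted(scores), 2):
        if scores[a] > scores[b]:
            prefs.append((a, b))
        elif scores[b] > scores[a]:
            prefs.append((b, a))
    return prefs

def scores_to_ranking(scores: Dict[str, int]) -> List[str]:
    """Sort by descending score; ties broken by doc id (arbitrary choice)."""
    return sorted(scores, key=lambda d: (-scores[d], d))

# Made-up graded relevance labels for one query (2 = best, 0 = not relevant).
grades = {"d1": 2, "d2": 0, "d3": 1, "d4": 0}

preferences = scores_to_preferences(grades)
ranking = scores_to_ranking(grades)
```

The conversion only goes one way: a preference such as d1 > d3 does not say how much better d1 is, so the original grades cannot be recovered from the pairs.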

L2R Approaches:

Pointwise
- Training data is the class or score of a single document
- An accurate score does not imply an accurate ranking
- Position information is ignored
Pairwise
- Training data is a preference between a pair of documents
- An accurate preference does not imply an accurate ranking
- Position information is ignored
Listwise
- Training data is a ranking of the documents
- Directly optimizing ranking metrics is hard
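The pointwise caveat that an accurate score does not imply an accurate ranking can be checked with a small example. The sketch below uses made-up numbers and compares two hypothetical models on mean squared error (a typical pointwise objective) and on the number of misordered pairs (what a pairwise objective cares about); both choices are assumptions made for the illustration.

```python
from itertools import combinations
from typing import Dict

def mse(pred: Dict[str, float], target: Dict[str, float]) -> float:
    """Pointwise view: mean squared error between predicted and target scores."""
    return sum((pred[d] - target[d]) ** 2 for d in target) / len(target)

def misordered_pairs(pred: Dict[str, float], target: Dict[str, float]) -> int:
    """Pairwise view: count pairs whose predicted order contradicts the target."""
    count = 0
    for a, b in combinations(sorted(target), 2):
        if target[a] == target[b]:
            continue  # no preference between tied documents
        better, worse = (a, b) if target[a] > target[b] else (b, a)
        if pred[better] <= pred[worse]:
            count += 1
    return count

# Made-up target scores and two hypothetical models' predictions.
target = {"d1": 2.0, "d2": 1.0}
model_a = {"d1": 1.4, "d2": 1.6}   # close to the targets, but order is flipped
model_b = {"d1": 4.0, "d2": 3.0}   # far from the targets, but order is correct

mse_a, mse_b = mse(model_a, target), mse(model_b, target)
bad_a = misordered_pairs(model_a, target)
bad_b = misordered_pairs(model_b, target)
```

Here model_a has the lower squared error yet ranks the two documents in the wrong order, while model_b has the higher error and a perfect ranking.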

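In common:

All three use a trained model h to estimate the score of x (the doc's feature vector);

All three rank documents by the scores that h produces.

Differences:

Different training methods;

Different training data.

Pointwise is the weakest of the three approaches.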