Preface
Starting from this post, I plan to take my notes in English, supplemented with Chinese explanations where appropriate. The reason is that the final requires oral answers, and I worry that if I keep taking notes in Chinese I will feel unfamiliar with certain terms and ways of describing things.
By the way, the semester has gone pretty well so far, and I have learned how to organise my knowledge. Keep it up!
Rank Fusion
- In rank fusion, the final ranked list of documents is composed from the result lists of multiple search engines.
- There are three main approaches to rank fusion:
1. Voting: based on the number of result lists in which a document appears
2. Rank aggregation: the rank of each document appearing in a list is taken into account
3. Score aggregation: both the rank and the score of each document appearing in a list are taken into account
Fusing Methods
- Voting
Use an election strategy to select k documents from among those nominated by the result lists.
- Other Voting Methods
- Plurality rule: number of lists where the item is ranked first
- Copeland rule: number of pairwise victories minus number of pairwise defeats (a pairwise victory means the item is ranked above a given other item in a strict majority of the lists; a pairwise defeat is the opposite)
- Borda: the score of an item with respect to a list is the number of items in the list that are ranked lower (i.e. the total number of documents in the list minus the rank of the candidate document; see the formula under Rank-based Fusion below)
Scores are summed over the lists
Borda takes into account rank information -> it is a rank-based strategy too
- Condorcet winner: an item that defeats every other item in a strict majority sense.
- A voting rule is a Condorcet extension if the following always holds: whenever the candidates can be split into two parts C and C^- such that every x∈C beats every y∈C^- by a strict majority, the rule selects its winners from C.
- Plurality rule: not Condorcet
Copeland rule: Condorcet
Borda: not Condorcet
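To make these voting rules concrete, here is a minimal Python sketch (my own illustration, not from the lecture; all function names are mine) of Plurality, Borda and Copeland over a set of ranked lists, plus a check for a Condorcet winner.

```python
from collections import defaultdict
from itertools import combinations

# Each "voter" is a ranked result list of document ids, best first.
lists = [
    ["d1", "d2", "d3"],
    ["d2", "d1", "d3"],
    ["d1", "d3", "d2"],
]

def plurality(lists):
    # Number of lists where the item is ranked first.
    score = defaultdict(int)
    for l in lists:
        score[l[0]] += 1
    return dict(score)

def borda(lists):
    # Per list: number of items ranked lower; then summed over the lists.
    score = defaultdict(int)
    for l in lists:
        n = len(l)
        for rank, doc in enumerate(l, start=1):
            score[doc] += n - rank
    return dict(score)

def wins(lists, x, y):
    # Number of lists that rank x above y (assumes both occur in every list).
    return sum(1 for l in lists if l.index(x) < l.index(y))

def copeland(lists):
    # Pairwise victories minus pairwise defeats.
    docs = sorted({d for l in lists for d in l})
    score = {d: 0 for d in docs}
    for x, y in combinations(docs, 2):
        if wins(lists, x, y) > wins(lists, y, x):
            score[x] += 1; score[y] -= 1
        elif wins(lists, y, x) > wins(lists, x, y):
            score[y] += 1; score[x] -= 1
    return score

def condorcet_winner(lists):
    # An item that beats every other item by a strict majority, if one exists.
    docs = {d for l in lists for d in l}
    for x in docs:
        if all(wins(lists, x, y) > len(lists) / 2 for y in docs if y != x):
            return x
    return None

print(plurality(lists))         # {'d1': 2, 'd2': 1}
print(borda(lists))             # {'d1': 5, 'd2': 3, 'd3': 1}
print(copeland(lists))          # {'d1': 2, 'd2': 0, 'd3': -2}
print(condorcet_winner(lists))  # d1
```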
- Rank-based Fusion
- Borda: the score of an item with respect to a list is the number of items in the list that are ranked lower
Scores are summed over the lists:
$$\text{Borda}(d) = \sum_{i \in L(d)} \big( |L_i| - r_i(d) \big)$$
where $L(d)$ denotes the set of lists containing document $d$, $|L_i|$ is the number of documents in the $i$-th list, and $r_i(d)$ is the rank of document $d$ in the $i$-th list.
- RRF (Reciprocal Rank Fusion): discounts the weight of documents occurring deep in the lists using a reciprocal distribution:
$$\text{RRF}(d) = \sum_{i \in L(d)} \frac{1}{v + r_i(d)}$$
where $v$ is a parameter, typical value 60.
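A minimal sketch of the two rank-based fusers, assuming 1-based ranks; the function names are my own, not from the slides.

```python
from collections import defaultdict

def borda_fuse(lists):
    # Borda fusion: for every list containing d, add |L_i| - r_i(d).
    score = defaultdict(float)
    for l in lists:
        n = len(l)
        for rank, doc in enumerate(l, start=1):
            score[doc] += n - rank
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)

def rrf_fuse(lists, v=60):
    # Reciprocal Rank Fusion: add 1 / (v + r_i(d)) for every list containing d.
    score = defaultdict(float)
    for l in lists:
        for rank, doc in enumerate(l, start=1):
            score[doc] += 1.0 / (v + rank)
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)

# Two result lists from different engines, best first.
runs = [["d1", "d2", "d3", "d4"], ["d3", "d1", "d5"]]
print(borda_fuse(runs))  # d1: (4-1)+(3-2)=4, d3: (4-3)+(3-1)=3, d2: 2, d4: 0, d5: 0
print(rrf_fuse(runs))    # d1 and d3 lead again, depth discounted by 1/(60+rank)
```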
- Score-based Fusion
- CombSUM: fused score of a doc is the sum of the scores of the doc in each list (a very simple score fusion: just add up the scores of document d across all lists)
- Linear: like CombSUM, but allows weighting each list differently (CombSUM with a per-list weight)
- CombMNZ: fused score of a doc is the sum of the scores of the doc in each list, multiplied by the number of lists in which it occurs (CombSUM with an extra factor m, the number of lists the document appears in)
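A small sketch of the three score-based fusers, assuming each run is a dict mapping document id to a (normalised) score; names and example values are my own.

```python
from collections import defaultdict

# Each run maps document id -> retrieval score (ideally already normalised).
runs = [
    {"d1": 0.75, "d2": 0.5, "d3": 0.25},
    {"d1": 0.5, "d3": 0.5, "d4": 0.25},
]

def comb_sum(runs):
    # CombSUM: sum of the document's scores over the lists in which it occurs.
    fused = defaultdict(float)
    for run in runs:
        for doc, s in run.items():
            fused[doc] += s
    return dict(fused)

def linear(runs, weights):
    # Linear combination: like CombSUM, but each list has its own weight.
    fused = defaultdict(float)
    for w, run in zip(weights, runs):
        for doc, s in run.items():
            fused[doc] += w * s
    return dict(fused)

def comb_mnz(runs):
    # CombMNZ: CombSUM multiplied by the number of lists containing the document.
    hits = defaultdict(int)
    for run in runs:
        for doc in run:
            hits[doc] += 1
    return {doc: s * hits[doc] for doc, s in comb_sum(runs).items()}

print(comb_sum(runs))           # {'d1': 1.25, 'd2': 0.5, 'd3': 0.75, 'd4': 0.25}
print(linear(runs, [0.7, 0.3])) # weight the first engine more heavily
print(comb_mnz(runs))           # {'d1': 2.5, 'd2': 0.5, 'd3': 1.5, 'd4': 0.25}
```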
Score Normalisation
- Why do we need normalisation?
Normalisation addresses the problem that scores from different ranking functions / systems for the same item are not directly comparable.
- A normalised score should have the following three properties:
1. Shift invariant: both the shifted and unshifted scores should normalise to the same ordering (i.e. the ordering of the scores is unchanged after normalisation)
2. Scale invariant: the scheme should be insensitive to scaling by a multiplicative constant (e.g. multiplying every score by 10 should not change the normalised result: under Min-Max, the scores 2, 4, 6 and 20, 40, 60 normalise identically)
3. Outlier insensitive: a single item should not significantly affect the normalised scores of the other items.
- Approaches to Score Normalisation
- Min-Max (Standard Norm): normalise the scores into the [0, 1] range.
In words: the Min-Max normalised score of a document is the difference between the document's score and the minimum score, divided by the difference between the maximum and minimum scores:
$$s'(d) = \frac{s(d) - s_{\min}}{s_{\max} - s_{\min}}$$
- Sum normalisation (Sum Norm): shift the minimum value to 0, and scale the sum to 1
- Zero Mean and Unit Variance: based on the Z-score statistic: shift the mean to 0, and scale the variance to 1.
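A quick sketch of the three normalisation schemes (my own code, not from the lecture); the last two calls also illustrate why Min-Max is shift and scale invariant.

```python
def min_max(scores):
    # Min-Max: (s - min) / (max - min); maps scores into [0, 1].
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def sum_norm(scores):
    # Sum norm: shift the minimum to 0, then scale so the values sum to 1.
    lo = min(scores)
    shifted = [s - lo for s in scores]
    total = sum(shifted)
    return [s / total for s in shifted]

def z_score(scores):
    # Zero mean and unit variance: (s - mean) / std.
    n = len(scores)
    mean = sum(scores) / n
    std = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5
    return [(s - mean) / std for s in scores]

raw = [2.0, 4.0, 6.0]
print(min_max(raw))                        # [0.0, 0.5, 1.0]
# Shifting and scaling the raw scores leaves the Min-Max result unchanged.
print(min_max([s * 10 + 3 for s in raw]))  # [0.0, 0.5, 1.0]
print(sum_norm(raw))                       # [0.0, 0.3333..., 0.6666...]
print(z_score(raw))
```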
Rank to Score Transformations
- Idea: in score-based rank fusion methods, use a function of rank instead of the actual score
- avoids dealing with largely differing score sources and distributions
- Borda: $s_i(d) = |L_i| - r_i(d)$, i.e. the number of items in the list ranked below $d$
- RR (reciprocal rank): $s_i(d) = \frac{1}{v + r_i(d)}$, the same reciprocal discount used by RRF
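A tiny sketch, assuming 1-based ranks, of turning a ranked list into pseudo-scores that can then be fed to CombSUM / Linear / CombMNZ; the transform names are my own.

```python
def rank_to_scores(ranked_list, transform):
    # Turn a ranked list (best first, 1-based ranks) into a doc -> score map.
    n = len(ranked_list)
    return {doc: transform(rank, n) for rank, doc in enumerate(ranked_list, start=1)}

def borda_transform(rank, n):
    # Borda-style pseudo-score: number of items ranked lower.
    return n - rank

def rr_transform(rank, n, v=60):
    # Reciprocal-rank pseudo-score, same discount as RRF (n is unused here).
    return 1.0 / (v + rank)

run = ["d3", "d1", "d5"]
print(rank_to_scores(run, borda_transform))  # {'d3': 2, 'd1': 1, 'd5': 0}
print(rank_to_scores(run, rr_transform))     # {'d3': 1/61, 'd1': 1/62, 'd5': 1/63}
# These pseudo-scores can then be fused with any score-based method.
```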
Retrieval List Selection
- Which lists should we fuse? All lists? A random subset? A biased/elite subset?
There really wasn't much material this week...