INFS7410 Lecture Notes - Week 4 (Retrieval Models II)

Preface

Starting from this post, I plan to take notes in English, supplemented with explanations in Chinese where appropriate. The reason is that the final requires an oral exam, and I worry that if I keep taking notes in Chinese, some terms and ways of describing them will feel unfamiliar.
By the way, the semester has gone pretty well so far; I have learned to organise my knowledge. Keep it up!

Rank Fusion

  • In rank fusion, the final ranked list of documents is composed from the result lists of multiple search engines.
  • There are three main families of rank fusion methods:
    1. Voting: based on the number of result lists a document appears in
    2. Rank aggregation: the rank of every document appearing in the lists is taken into account
    3. Score aggregation: the rank and the score of every document appearing in the lists are taken into account

 Fusing Methods

  • Voting
    Use an election strategy to select k documents from among those nominated by the result lists.
  • Other Voting Methods
    -  Plurality rule: number of lists where the item is ranked first
    -  Copeland rule: number of pairwise victories minus number of pairwise defeats (x scores a pairwise victory over y when a strict majority of the lists rank x above y; a defeat is the reverse)
    -  Borda: the score of an item with respect to a list is the number of items in the list that are ranked lower (i.e., the total number of documents in the list minus the candidate document's rank; the formula below makes this concrete)
    Scores are summed over the lists
    Borda takes rank information into account -> it is a rank-based strategy too
    A sketch of these voting rules appears after this list.
  • Condorcet winner: an item that defeats every other item in the strict majority sense.
    -  One way to judge whether a voting rule is a Condorcet extension: take any split of the candidates into two sets C and C^-, such that for every x ∈ C and y ∈ C^-, x always receives more support than y (a strict majority prefers x to y). If the rule is guaranteed to pick its winners from C in this situation, it is a Condorcet extension.
    -  Plurality rule: not Condorcet
       Copeland rule: Condorcet
       Borda: not Condorcet
  • Rank-based Fusion
    Borda: the score of an item with respect to a list is the number of items in the list that are ranked lower
    Scores are summed over the lists
                                                               \sum_{L_i:d\in L_i}\frac{n-r_{L_i}(d)+1}{n}
    Here L_i : d \in L_i ranges over all lists that contain document d, n is the number of documents in list L_i, and r_{L_i}(d) is the rank of document d in list L_i.
    RRF (Reciprocal Rank Fusion): discounts the weight of documents occurring deep in the lists using a reciprocal distribution. v is a parameter; a typical value is 60.
                                                                   \sum_{L_i:d \in L_i}\frac{1}{v+r_{L_i}(d)}
  • Score-based Fusion
    -  CombSUM: the fused score of a document is the sum of its scores in each list (the simplest score fusion: just add up document d's scores across all lists; see the sketch after this list)
                                                                       \sum_{L_i:d \in L_i}s_{L_i}(d)
    -  Linear: like CombSUM, but allows each list to be weighted differently (CombSUM with a per-list weight ω_i)
                                                                       \sum_{L_i:d \in L_i}\omega_i \cdot s_{L_i}(d)
    -  CombMNZ: the fused score of a document is the sum of its scores in each list, multiplied by the number of lists in which it occurs (again CombSUM, but with a leading factor m counting how many lists the document appears in)
                                                                        m \cdot \sum_{L_i:d \in L_i}s_{L_i}(d)
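
As a reference for the voting rules above, here is a minimal Python sketch. It assumes each "ballot" is a ranked list of document ids (best first); the function names and the handling of documents missing from a list are my own choices, not from the lecture.

from collections import Counter
from itertools import combinations

def rank_of(ballot, d):
    # Documents absent from a list are treated as ranked below all present ones.
    return ballot.index(d) if d in ballot else len(ballot)

def plurality(ballots):
    # Plurality rule: score = number of lists where the item is ranked first.
    return Counter(b[0] for b in ballots)

def copeland(ballots):
    # Copeland rule: pairwise victories minus pairwise defeats.
    # x scores a victory over y when a strict majority of lists rank x above y.
    items = {d for b in ballots for d in b}
    scores = {d: 0 for d in items}
    majority = len(ballots) / 2
    for x, y in combinations(items, 2):
        prefer_x = sum(1 for b in ballots if rank_of(b, x) < rank_of(b, y))
        prefer_y = sum(1 for b in ballots if rank_of(b, y) < rank_of(b, x))
        if prefer_x > majority:
            scores[x] += 1
            scores[y] -= 1
        elif prefer_y > majority:
            scores[y] += 1
            scores[x] -= 1
    return scores

ballots = [["d1", "d2", "d3"], ["d2", "d1", "d3"], ["d1", "d3", "d2"]]
print(plurality(ballots))  # d1 is ranked first in 2 lists, d2 in 1
print(copeland(ballots))   # d1 wins both pairwise contests, d3 loses both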
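
A sketch of the fusion formulas themselves follows the same conventions: ranked lists are Python lists with the best document first, score-based runs are {doc: score} dicts, and all function names are mine.

from collections import Counter, defaultdict

def borda_fuse(lists):
    # Borda: sum over the lists containing d of (n - r(d) + 1) / n, ranks from 1.
    fused = defaultdict(float)
    for lst in lists:
        n = len(lst)
        for r, d in enumerate(lst, start=1):
            fused[d] += (n - r + 1) / n
    return dict(fused)

def rrf_fuse(lists, v=60):
    # RRF: sum over the lists containing d of 1 / (v + r(d)); v = 60 is typical.
    fused = defaultdict(float)
    for lst in lists:
        for r, d in enumerate(lst, start=1):
            fused[d] += 1.0 / (v + r)
    return dict(fused)

def comb_sum(runs, weights=None):
    # CombSUM; passing per-list weights gives the Linear variant.
    weights = weights or [1.0] * len(runs)
    fused = defaultdict(float)
    for w, run in zip(weights, runs):
        for d, s in run.items():
            fused[d] += w * s
    return dict(fused)

def comb_mnz(runs):
    # CombMNZ: CombSUM score times the number of lists the document occurs in.
    fused = comb_sum(runs)
    occurrences = Counter(d for run in runs for d in run)
    return {d: occurrences[d] * s for d, s in fused.items()}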

Score Normalisation

  • Why do we need normalisation?
    Normalisation addresses the problem that scores from different ranking functions / systems for the same item are not directly comparable.
  • A normalised score should have the following three properties:
    1. Shift invariant: both the shifted and unshifted scores should normalise to the same ordering (adding a constant to all raw scores does not change the order of the normalised scores)
    2. Scale invariant: the scheme should be insensitive to scaling by a multiplicative constant (e.g., multiplying every raw score by 10 should leave the normalised scores unchanged; min-max normalisation behaves this way, since the factor cancels in numerator and denominator)
    3. Outlier insensitive: a single item should not significantly affect the normalised scores of the other items.
  • Approaches of Score Normalisation
    -  Min-Max (Standard Norm): normalises scores into the [0, 1] range
                                          s_{L}^{minmax}(d)=\frac{s_L(d)-\min_{d'\in L}s_L(d')}{\max_{d'\in L}s_L(d') - \min_{d'\in L}s_L(d')}
    In words: the min-max normalised score of a document is the difference between the document's own score and the minimum score, divided by the difference between the maximum and minimum scores.
    -  Sum normalisation (Sum Norm): shift the minimum value to 0, and scale the sum to 1
                                         s_{L}^{sum}(d) = \frac{s_L(d)-\min_{d' \in L}s_L(d')}{\sum_{d' \in L}\left[s_L(d')-\min_{d' \in L}s_L(d')\right]}
    -  Zero Mean and Unit Variance: based on the Z-score statistic: shift the mean to 0, and scale the variance to 1. (All three schemes are sketched below.)
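
A sketch of the three schemes, assuming a run is a {doc: score} dict with at least two distinct score values (otherwise the denominators below are zero); function names are mine.

import statistics

def min_max_norm(run):
    # (s - min) / (max - min): maps the scores into [0, 1].
    lo, hi = min(run.values()), max(run.values())
    return {d: (s - lo) / (hi - lo) for d, s in run.items()}

def sum_norm(run):
    # Shift the minimum to 0, then scale so that the scores sum to 1.
    lo = min(run.values())
    total = sum(s - lo for s in run.values())
    return {d: (s - lo) / total for d, s in run.items()}

def z_norm(run):
    # Z-score: shift the mean to 0 and scale the standard deviation to 1.
    mu = statistics.mean(run.values())
    sigma = statistics.pstdev(run.values())  # population standard deviation
    return {d: (s - mu) / sigma for d, s in run.items()}

run = {"d1": 12.0, "d2": 7.5, "d3": 3.0}
print(min_max_norm(run))  # {'d1': 1.0, 'd2': 0.5, 'd3': 0.0}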

Rank to Score Transformations

  • Idea: in score-based rank fusion methods, use a function of rank instead of the actual score (a small sketch follows this list)
    -  avoids dealing with largely differing score sources and distributions
    -  Borda:       |L_i|-r_{L_i}(d)
    -  RR:          \frac{1}{v + r_{L_i}(d)}
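
To make the idea concrete, a tiny sketch: derive a pseudo-score from the rank with one of the two transforms above, then sum it as any score-based fuser would (function names are mine; v = 60 as before).

from collections import defaultdict

def fuse_by_rank(lists, transform):
    # Score-based fusion where the "score" is a function of rank alone.
    fused = defaultdict(float)
    for lst in lists:
        for r, d in enumerate(lst, start=1):
            fused[d] += transform(len(lst), r)
    return dict(fused)

borda = lambda n, r: n - r           # |L_i| - r_{L_i}(d)
rr    = lambda n, r: 1.0 / (60 + r)  # 1 / (v + r_{L_i}(d)), v = 60

# Fusing with the rr transform reproduces RRF without touching raw scores.
print(fuse_by_rank([["d1", "d2"], ["d2", "d3", "d1"]], rr))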

Retrieval List Selection

  • Which lists should we fuse? All lists? A random subset? A biased/elite subset?

 There really isn't much content this week......
