Lucene Study Notes (7): Analyzing the Lucene Search Process (4)

This article describes in detail how Lucene builds its search-time query objects, including the construction of the Weight object tree (for example the creation of TermWeight and of ConstantScoreQuery's ConstantWeight). It then walks through how the Term Weight score is computed, covering the sumOfSquaredWeights, queryNorm, and normalize methods. Finally, it analyzes the creation of the Scorer and SumScorer object trees, in particular how the different clause types (MUST, SHOULD, MUST_NOT) affect the merging of posting lists.

 

2.4 Searching with the Query Object

 

2.4.1.2 Creating the Weight Object Tree

BooleanQuery.createWeight(Searcher) ultimately returns new BooleanWeight(searcher). The BooleanWeight constructor is implemented as follows:

public BooleanWeight(Searcher searcher) {
  this.similarity = getSimilarity(searcher);
  weights = new ArrayList(clauses.size());
  //This is again a recursive process, walking the new Query object tree all the way down to its leaf nodes.
  for (int i = 0 ; i < clauses.size(); i++) {
    weights.add(clauses.get(i).getQuery().createWeight(searcher));
  }
}

For a TermQuery leaf node, TermQuery.createWeight(Searcher) returns a new TermWeight(searcher) object. The TermWeight constructor is as follows:

public TermWeight(Searcher searcher) {
  this.similarity = getSimilarity(searcher);
  //The idf is computed here.
  idfExp = similarity.idfExplain(term, searcher);
  idf = idfExp.getIdf();
}

The idf computation follows the formula in the documentation exactly:

idf(t) = ln( numDocs / (docFreq + 1) ) + 1

public IDFExplanation idfExplain(final Term term, final Searcher searcher) {
  final int df = searcher.docFreq(term);
  final int max = searcher.maxDoc();
  final float idf = idf(df, max);
  return new IDFExplanation() {
      public float getIdf() {
        return idf;
      }};
}

public float idf(int docFreq, int numDocs) {
  return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);
}
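
As a quick sanity check, consider hypothetical numbers back-derived from the idf values seen in the Weight tree dump below: if the index contains maxDoc = 12 documents and the term contents:boy occurs in docFreq = 3 of them, then

idf = ln( 12 / (3 + 1) ) + 1 = ln(3) + 1 ≈ 2.0986

which matches the idf = 2.0986123 stored in the corresponding TermWeight.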

ConstantScoreQuery.createWeight(Searcher), by contrast, only creates a ConstantScoreQuery.ConstantWeight(searcher) object; it does not compute an idf.

The Weight object tree created this way looks as follows:

weight    BooleanQuery$BooleanWeight  (id=169)   
   |   similarity    DefaultSimilarity  (id=177)   
   |   this$0    BooleanQuery  (id=89)   
   |   weights    ArrayList  (id=188)   
   |      elementData    Object[3]  (id=190)   
   |------[0]    BooleanQuery$BooleanWeight  (id=171)   
   |          |   similarity    DefaultSimilarity  (id=177)   
   |          |   this$0    BooleanQuery  (id=105)   
   |          |   weights    ArrayList  (id=193)   
   |          |      elementData    Object[2]  (id=199)   
   |          |------[0]    ConstantScoreQuery$ConstantWeight  (id=183)   
   |          |               queryNorm    0.0   
   |          |               queryWeight    0.0   
   |          |               similarity    DefaultSimilarity  (id=177)   

   |          |               //ConstantScore(contents:apple*)  
   |          |               this$0    ConstantScoreQuery  (id=123)   
   |          |------[1]    TermQuery$TermWeight  (id=175)   
   |                         idf    2.0986123   
   |                         idfExp    Similarity$1  (id=241)   
   |                         queryNorm    0.0   
   |                         queryWeight    0.0   
   |                         similarity    DefaultSimilarity  (id=177)   

   |                         //contents:boy
   |                        this$0    TermQuery  (id=124)   
   |                         value    0.0   
   |                 modCount    2   
   |                 size    2   
   |------[1]    BooleanQuery$BooleanWeight  (id=179)   
   |          |   similarity    DefaultSimilarity  (id=177)   
   |          |   this$0    BooleanQuery  (id=110)   
   |          |   weights    ArrayList  (id=195)   
   |          |      elementData    Object[2]  (id=204)   
   |          |------[0]    ConstantScoreQuery$ConstantWeight  (id=206)   
   |          |               queryNorm    0.0   
   |          |               queryWeight    0.0   
   |          |               similarity    DefaultSimilarity  (id=177)   

   |          |               //ConstantScore(contents:cat*)
   |          |               this$0    ConstantScoreQuery  (id=135)   
   |          |------[1]    TermQuery$TermWeight  (id=207)   
   |                         idf    1.5389965   
   |                         idfExp    Similarity$1  (id=210)   
   |                         queryNorm    0.0   
   |                         queryWeight    0.0   
   |                         similarity    DefaultSimilarity  (id=177)

   |                         //contents:dog
   |                         this$0    TermQuery  (id=136)   
   |                         value    0.0   
   |                 modCount    2   
   |                 size    2   
   |------[2]    BooleanQuery$BooleanWeight  (id=182)   
              |  similarity    DefaultSimilarity  (id=177)   
              |  this$0    BooleanQuery  (id=113)   
              |  weights    ArrayList  (id=197)   
              |     elementData    Object[2]  (id=216)   
              |------[0]    BooleanQuery$BooleanWeight  (id=181)   
              |          |    similarity    BooleanQuery$1  (id=220)   
              |          |    this$0    BooleanQuery  (id=145)   
              |          |    weights    ArrayList  (id=221)   
              |          |      elementData    Object[2]  (id=224)   
              |          |------[0]    TermQuery$TermWeight  (id=226)   
              |          |                idf    2.0986123   
              |          |                idfExp    Similarity$1  (id=229)   
              |          |                queryNorm    0.0   
              |          |                queryWeight    0.0   
              |          |                similarity    DefaultSimilarity  (id=177)   

              |          |                //contents:eat
              |          |                this$0    TermQuery  (id=150)   
              |          |                value    0.0   
              |          |------[1]    TermQuery$TermWeight  (id=227)   
              |                          idf    1.1823215   
              |                          idfExp    Similarity$1  (id=231)   
              |                          queryNorm    0.0   
              |                          queryWeight    0.0   
              |                          similarity    DefaultSimilarity  (id=177)   

              |                          //contents:cat^0.33333325
              |                          this$0    TermQuery  (id=151)   
              |                          value    0.0   
              |                  modCount    2   
              |                  size    2   
              |------[1]    TermQuery$TermWeight  (id=218)   
                            idf    2.0986123   
                            idfExp    Similarity$1  (id=233)   
                            queryNorm    0.0   
                            queryWeight    0.0   
                            similarity    DefaultSimilarity  (id=177)   

                            //contents:foods
                            this$0    TermQuery  (id=154)   
                            value    0.0   
                    modCount    2   
                    size    2   
        modCount    3   
        size    3   


 

2.4.1.3 Computing the Term Weight Score
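
The three steps below are all driven from Query.weight(Searcher), which IndexSearcher invokes while preparing the query. In simplified form it does roughly the following (a sketch; exact details vary slightly between Lucene versions):

Query query = searcher.rewrite(this);                 //rewrite the query first
Weight weight = query.createWeight(searcher);         //build the Weight tree (2.4.1.2)
float sum = weight.sumOfSquaredWeights();             //step (1)
float norm = getSimilarity(searcher).queryNorm(sum);  //step (2)
weight.normalize(norm);                               //step (3)
return weight;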

(1) First, compute sumOfSquaredWeights

Following the formula:

sumOfSquaredWeights = q.getBoost()^2 * Σ over t in q of ( idf(t) * t.getBoost() )^2

The code is as follows:

float sum = weight.sumOfSquaredWeights();

//As can be seen, this is again a recursive process.
public float sumOfSquaredWeights() throws IOException {
  float sum = 0.0f;
  for (int i = 0 ; i < weights.size(); i++) {
    float s = weights.get(i).sumOfSquaredWeights();
    if (!clauses.get(i).isProhibited())
      sum += s;
  }
  sum *= getBoost() * getBoost();  //multiply by the query boost, squared
  return sum ;
}

For the leaf node TermWeight, TermQuery$TermWeight.sumOfSquaredWeights() is implemented as follows:

public float sumOfSquaredWeights() {
  //Compute part of the score, idf*t.getBoost(); it will be used again later.
  queryWeight = idf * getBoost();
  //Return (idf*t.getBoost())^2
  return queryWeight * queryWeight;
}

For the leaf node ConstantWeight, ConstantScoreQuery$ConstantWeight.sumOfSquaredWeights() is as follows:

public float sumOfSquaredWeights() {
  //Apart from the user-specified boost, nothing else contributes to the score.
  queryWeight = getBoost();
  return queryWeight * queryWeight;
}
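
As a hypothetical worked example (all boosts assumed to be 1.0): a BooleanWeight holding one ConstantWeight plus two TermWeights with idf values 2.0986 and 1.5390 would return

sum = 1.0^2 + 2.0986^2 + 1.5390^2 ≈ 1.000 + 4.404 + 2.369 = 7.773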

(2) Compute queryNorm

Its formula is:

queryNorm(q) = 1 / sqrt(sumOfSquaredWeights)

The code is:

public float queryNorm(float sumOfSquaredWeights) {
  return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
}
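
Continuing the hypothetical numbers from step (1):

queryNorm = 1 / sqrt(7.773) ≈ 0.359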

(3) Factor queryNorm into the score

The code is:

weight.normalize(norm);

//Once more a recursive process
public void normalize(float norm) {
  norm *= getBoost();
  for (Weight w : weights) {
    w.normalize(norm);
  }
}

For the leaf node TermWeight, TermQuery$TermWeight.normalize(float) is as follows:

public void normalize(float queryNorm) {
  this.queryNorm = queryNorm;
  //queryWeight used to be idf*t.getBoost(); it now becomes queryNorm*idf*t.getBoost().
  queryWeight *= queryNorm;
  //At this point the score has accumulated queryNorm*idf*t.getBoost()*idf = queryNorm*idf^2*t.getBoost().
  value = queryWeight * idf;
}
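
Continuing the hypothetical numbers above, for the TermWeight with idf = 2.0986 and t.getBoost() = 1.0:

queryWeight = 0.359 * 2.0986 ≈ 0.753
value = queryWeight * idf = 0.753 * 2.0986 ≈ 1.58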

We know that Lucene's overall scoring formula is:

score(q,d) = coord(q,d) * queryNorm(q) * Σ over t in q of ( tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d) )

At this point we have computed the part highlighted in red in the original figure, namely queryNorm(q) * idf(t)^2 * t.getBoost().

 

2.4.2 Creating the Scorer and SumScorer Object Trees

Once the Weight object tree has been created, IndexSearcher.search(Weight, Filter, int) is invoked. Its code is as follows:

//(a) Create the document-id collector
TopScoreDocCollector collector = TopScoreDocCollector.create(nDocs, !weight.scoresDocsOutOfOrder());
search(weight, filter, collector);
//(b) Return the search results
return collector.topDocs();

public void search(Weight weight, Filter filter, Collector collector)
    throws IOException {
  if (filter == null) {
    for (int i = 0; i < subReaders.length; i++) {
      collector.setNextReader(subReaders[i], docStarts[i]);
      //(c) Create the Scorer object tree, as well as the SumScorer tree used to merge posting lists
      Scorer scorer = weight.scorer(subReaders[i], !collector.acceptsDocsOutOfOrder(), true);
      if (scorer != null) {
        //(d) Merge the posting lists, (e) collect the document ids
        scorer.score(collector);
      }
    }
  } else {
    for (int i = 0; i < subReaders.length; i++) {
      collector.setNextReader(subReaders[i], docStarts[i]);
      searchWithFilter(subReaders[i], weight, filter, collector);
    }
  }
}
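
For reference, the call scorer.score(collector) at (d)/(e) essentially just walks the merged posting list and hands every matching document id to the collector. A simplified sketch of Scorer.score(Collector) (the details are covered in 2.4.3):

public void score(Collector collector) throws IOException {
  collector.setScorer(this);
  int doc;
  while ((doc = nextDoc()) != NO_MORE_DOCS) {
    collector.collect(doc);
  }
}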

In this section we focus on (c), the creation of the Scorer object tree and of the SumScorer tree used to merge posting lists. Section 2.4.3 analyzes (d), the merging of posting lists; section 2.4.4 covers (a), the creation of the result collector, (e), the collection of matching documents, and (b), the returning of the results.

BooleanQuery$BooleanWeight.scorer(IndexReader, boolean, boolean) is implemented as follows:

public Scorer scorer(IndexReader reader, boolean scoreDocsInOrder, boolean topScorer){
  //Holds the Scorers for the MUST clauses
  List required = new ArrayList();
  //Holds the Scorers for the MUST_NOT clauses
  List prohibited = new ArrayList();
  //Holds the Scorers for the SHOULD clauses
  List optional = new ArrayList();
  //Walk every sub-clause, create its sub-Scorer, and add it to the matching list; this is a recursive process.
  Iterator cIter = clauses.iterator();
  for (Weight w  : weights) {
    BooleanClause c =  cIter.next();
    Scorer subScorer = w.scorer(reader, true, false);
    if (subScorer == null) {
      if (c.isRequired()) {
        return null;
      }
    } else if (c.isRequired()) {
      required.add(subScorer);
    } else if (c.isProhibited()) {
      prohibited.add(subScorer);
    } else {
      optional.add(subScorer);
    }
  }
  //This branch is described in detail in the section on BooleanScorer and scoreDocsInOrder
  if (!scoreDocsInOrder && topScorer && required.size() == 0 && prohibited.size() < 32) {
    return new BooleanScorer(similarity, minNrShouldMatch, optional, prohibited);
  }
  //Create the Scorer object tree, and with it the SumScorer object tree
  return new BooleanScorer2(similarity, minNrShouldMatch, required, prohibited, optional);
}
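
To make the three lists concrete, here is a small hypothetical BooleanQuery (not the one traced above) together with the list each clause's Scorer ends up in:

BooleanQuery bq = new BooleanQuery();
bq.add(new TermQuery(new Term("contents", "apple")),  BooleanClause.Occur.MUST);      // -> required
bq.add(new TermQuery(new Term("contents", "banana")), BooleanClause.Occur.MUST_NOT);  // -> prohibited
bq.add(new TermQuery(new Term("contents", "cherry")), BooleanClause.Occur.SHOULD);    // -> optional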
