Lucene's field cache

License: CC 4.0 BY-SA


static final class StringIndexCache extends Cache {
    StringIndexCache(FieldCacheImpl wrapper) {
      super(wrapper);
    }


    @Override
    protected Object createValue(IndexReader reader, Entry entryKey, boolean setDocsWithField /* ignored */)
        throws IOException {
      String field = StringHelper.intern(entryKey.field);
      final int[] retArray = new int[reader.maxDoc()];
      String[] mterms = new String[reader.maxDoc()+1];
      TermDocs termDocs = reader.termDocs();
      TermEnum termEnum = reader.terms (new Term (field));
      int t = 0;  // current term number


      // an entry for documents that have no terms in this field
      // should a document with no terms be at top or bottom?
      // this puts them at the top - if it is changed, FieldDocSortedHitQueue
      // needs to change as well.
      mterms[t++] = null;


      try {
        do {
          Term term = termEnum.term();
          if (term==null || term.field() != field || t >= mterms.length) break;

          // Store every distinct value of this field; a document's value can later be looked up as mterms[retArray[docId]]

          // store term text
          mterms[t] = term.text();

          // Record each document's rank (ordinal) in this field, so that sorting only needs the doc id to find its sort key
          termDocs.seek (termEnum);
          while (termDocs.next()) {
            retArray[termDocs.doc()] = t;
          }


          t++;
        } while (termEnum.next());
      } finally {
        termDocs.close();
        termEnum.close();
      }


      if (t == 0) {
        // if there are no terms, make the term array
        // have a single null entry
        mterms = new String[1];
      } else if (t < mterms.length) {
        // if there are less terms than documents,
        // trim off the dead array space
        String[] terms = new String[t];
        System.arraycopy (mterms, 0, terms, 0, t);
        mterms = terms;
      }


      StringIndex value = new StringIndex (retArray, mterms);
      return value;
    }
  }
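To make the loop above easier to follow, here is a minimal stand-alone sketch of the same idea without any Lucene classes. It assumes a hypothetical in-memory `postings` map (term text to the ids of documents containing it) in place of `TermEnum` and `TermDocs`; the class name `StringIndexSketch` and its `build` method are illustrative only and are not part of Lucene. A `TreeMap` is used because it iterates its keys in sorted order, which is what the term enumeration provides in the real code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class StringIndexSketch {
    final int[] order;     // order[docId] = ordinal of the doc's term, 0 if the doc has none
    final String[] lookup; // lookup[ordinal] = term text, lookup[0] = null

    StringIndexSketch(int[] order, String[] lookup) {
        this.order = order;
        this.lookup = lookup;
    }

    // postings is a hypothetical stand-in for TermEnum/TermDocs:
    // term text -> ids of the documents that contain that term.
    static StringIndexSketch build(int maxDoc, TreeMap<String, List<Integer>> postings) {
        int[] order = new int[maxDoc];
        String[] lookup = new String[postings.size() + 1];
        int t = 0;
        lookup[t++] = null; // slot 0 is reserved for documents without a term
        for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
            lookup[t] = e.getKey();          // store the term text
            for (int doc : e.getValue()) {
                order[doc] = t;              // remember the term's rank per document
            }
            t++;
        }
        return new StringIndexSketch(order, lookup);
    }

    public static void main(String[] args) {
        TreeMap<String, List<Integer>> postings = new TreeMap<>();
        postings.put("banana", Arrays.asList(0, 3));
        postings.put("apple", Arrays.asList(2));
        StringIndexSketch idx = build(5, postings);
        System.out.println(Arrays.toString(idx.order));  // [2, 0, 1, 2, 0]
        System.out.println(Arrays.toString(idx.lookup)); // [null, apple, banana]
    }
}
```

Documents 1 and 4 have no term in the field, so they keep ordinal 0, which is why they end up at the top when sorting ascending.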

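StringIndex holds two important pieces of information:

retArray (exposed as the order field of StringIndex): for each document, the rank (ordinal) of its term within this field.

mterms (exposed as the lookup field of StringIndex): all distinct values of the field, in sorted order.

The cache for a string field is handled differently from the cache for numeric fields. For a numeric type, the cached array stores the value itself, because documents can be sorted by simply comparing those values. For strings, the ordinal-plus-lookup layout has, in my view, two benefits:

1) It can save space. When many documents share the same value, each distinct string is stored only once in a separate array, and each document holds just an int. Knowing this also makes it easier to judge when a string field is best avoided.

2) Sorting is faster. If the string values were stored per document, as is done for numeric types, every sort would have to compare strings, which is relatively expensive. With ordinals, sorting only compares ints.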
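The sorting benefit can be illustrated with a small sketch. It assumes an `order` array shaped like the one built by the cache (one ordinal per document, 0 for documents without a value); the class `OrdinalSortSketch` and its `sortDocs` helper are hypothetical names used for illustration and do not reflect how Lucene's own sort comparators are written.

```java
import java.util.Arrays;
import java.util.Comparator;

public class OrdinalSortSketch {
    // Returns the doc ids sorted ascending by their ordinal.
    // Only ints are compared; the strings themselves are never touched.
    static Integer[] sortDocs(final int[] order) {
        Integer[] docs = new Integer[order.length];
        for (int i = 0; i < docs.length; i++) {
            docs[i] = i;
        }
        // Arrays.sort on objects is stable, so ties keep doc id order.
        Arrays.sort(docs, Comparator.comparingInt((Integer d) -> order[d]));
        return docs;
    }

    public static void main(String[] args) {
        int[] order = {2, 0, 1, 2, 0};               // per-document ordinals
        String[] lookup = {null, "apple", "banana"}; // ordinal -> term text
        for (int doc : sortDocs(order)) {
            System.out.println(doc + " -> " + lookup[order[doc]]);
        }
        // prints docs in the order 1, 4, 2, 0, 3
    }
}
```

Because the terms were enumerated in sorted order when the cache was built, comparing two ordinals gives the same result as comparing the two strings, at the cost of a single int comparison.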

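Lucene's field cache plays an important role in Solr's function query feature.

For example, functions such as ord() and rord() rely on the cache. The ordinals are not fetched and recomputed on every request; once the cache is built, these values can be read very quickly.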
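As a rough sketch of how such functions can be served from the cached arrays: the code below assumes that ord is simply the cached ordinal of the document, and that rord is the lookup array length minus that ordinal. The second formula is an assumption about Solr's behaviour rather than something stated in this post, and the class `OrdFunctionsSketch` is a hypothetical name, not Solr's actual value source implementation.

```java
public class OrdFunctionsSketch {
    // ord: the document's ordinal, read straight from the cached array.
    static int ord(int[] order, int doc) {
        return order[doc];
    }

    // rord: reverse ordinal. Assumed here to be lookup.length - ord,
    // so the largest term gets the smallest value.
    static int rord(int[] order, String[] lookup, int doc) {
        return lookup.length - order[doc];
    }

    public static void main(String[] args) {
        int[] order = {2, 0, 1, 2, 0};               // per-document ordinals
        String[] lookup = {null, "apple", "banana"}; // ordinal -> term text
        System.out.println(ord(order, 2));           // 1 (doc 2 holds "apple")
        System.out.println(rord(order, lookup, 2));  // 2
        System.out.println(ord(order, 0));           // 2 (doc 0 holds "banana")
        System.out.println(rord(order, lookup, 0));  // 1
    }
}
```

Both functions are a single array read per document, which is why they are cheap once the cache has been populated.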