A small example of sorting by Field values in Lucene 7.x


License: CC 4.0 BY-SA


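This post shows how to use DocValues in Lucene 7.x to order documents, focusing on NumericDocValuesField and SortedDocValuesField. It builds an index and subclasses CustomScoreQuery so that books come back ordered by publication date, newest first. It also covers the memory and performance benefits of DocValues and how the 7.x API differs from earlier versions.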


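Lucene 7 changed parts of the DocValues API, and this post walks through those changes. The example writes some book records to an index and then, at search time, orders the results by publication date, newest first.

First, the code that writes the documents: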

        import java.nio.file.Paths;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.document.NumericDocValuesField;
        import org.apache.lucene.document.SortedDocValuesField;
        import org.apache.lucene.document.StoredField;
        import org.apache.lucene.document.StringField;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.index.IndexWriterConfig;
        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.FSDirectory;
        import org.apache.lucene.util.BytesRef;

        Directory dir=FSDirectory.open(Paths.get("E:/lucene_indexes"));
        IndexWriterConfig config=new IndexWriterConfig();
        IndexWriter writer = new IndexWriter(dir,config);
        Document document = new Document();
        Document document1 = new Document();
        Document document2 = new Document();
        Document document3 = new Document();
        Document document4 = new Document();
        // Each book gets four fields: a doc values field and a stored field for
        // "pubdate" (a yyyyMM integer), and an indexed/stored field plus a
        // doc values field for "title".
        document.add(new NumericDocValuesField("pubdate",201006));
        document.add(new StoredField("pubdate", 201006));
        document.add(new StringField("title","Spring In Action",Field.Store.YES));
        // BytesRef(CharSequence) encodes as UTF-8, independent of the platform charset.
        document.add(new SortedDocValuesField("title",new BytesRef("Spring In Action")));
        document1.add(new NumericDocValuesField("pubdate",201007));
        document1.add(new StoredField("pubdate", 201007));
        document1.add(new StringField("title","Lucene In Action",Field.Store.YES));
        document1.add(new SortedDocValuesField("title",new BytesRef("Lucene In Action")));
        document2.add(new NumericDocValuesField("pubdate",201008));
        document2.add(new StoredField("pubdate", 201008));
        document2.add(new StringField("title","Solr In Action",Field.Store.YES));
        document2.add(new SortedDocValuesField("title",new BytesRef("Solr In Action")));
        document3.add(new NumericDocValuesField("pubdate",201009));
        document3.add(new StoredField("pubdate", 201009));
        document3.add(new StringField("title","Hadoop In Action",Field.Store.YES));
        document3.add(new SortedDocValuesField("title",new BytesRef("Hadoop In Action")));
        document4.add(new NumericDocValuesField("pubdate",201010));
        document4.add(new StoredField("pubdate", 201010));
        document4.add(new StringField("title","Spark In Action",Field.Store.YES));
        document4.add(new SortedDocValuesField("title",new BytesRef("Spark In Action")));
        writer.addDocument(document);
        writer.addDocument(document1);
        writer.addDocument(document2);
        writer.addDocument(document3);
        writer.addDocument(document4);
        // close() commits the pending documents and releases the write lock.
        writer.close();

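This code first creates the Directory using the API introduced after 4.x, which takes a java.nio.file.Path, built here with Paths.get.

It then creates the IndexWriterConfig and the IndexWriter.

Next come the Document objects. The fields on each Document deserve some explanation.

NumericDocValuesField and SortedDocValuesField are built on doc values, a feature introduced in Lucene 4.x. For every field with doc values enabled, indexing additionally builds a column-oriented mapping from document to field value (with the values pre-sorted in the case of SortedDocValuesField). This reduces the memory needed for sorting and grouping and speeds both up considerably, at the cost of some extra disk space.

The former holds numeric values and the latter holds string values.

StoredField and StringField (with Field.Store.YES) store the value of the same field in the index so that it can be read back from a hit. A doc values field on its own does not make the value retrievable that way.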
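To make the column-oriented idea concrete, here is a minimal sketch in plain Java. It is only a conceptual model and says nothing about Lucene's actual storage format: the class DocValuesColumnSketch and its method are made up for illustration. It keeps one array per field, indexed by document ID, and orders document IDs by consulting that array alone.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Conceptual model only: one array ("column") per field, indexed by document ID. */
class DocValuesColumnSketch {

    /** Returns document IDs ordered by the given numeric column, largest value first. */
    static int[] sortDocIdsDescending(long[] column) {
        Integer[] docIds = new Integer[column.length];
        for (int i = 0; i < docIds.length; i++) {
            docIds[i] = i;
        }
        // Only the column is consulted; no per-document record has to be loaded.
        Arrays.sort(docIds, Comparator.comparingLong((Integer id) -> column[id]).reversed());
        int[] result = new int[docIds.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = docIds[i];
        }
        return result;
    }

    public static void main(String[] args) {
        // "pubdate" column for the five sample books, doc IDs 0..4 in insertion order.
        long[] pubdate = {201006, 201007, 201008, 201009, 201010};
        System.out.println(Arrays.toString(sortDocIdsDescending(pubdate))); // [4, 3, 2, 1, 0]
    }
}
```

The stored fields play the opposite role: they are what you read once a document has already been selected, which is why the example indexes "pubdate" both ways.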

Finally the Documents are written, and the index is complete.

The next step is to subclass CustomScoreQuery:

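import java.io.IOException;
import java.util.Date;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.queries.CustomScoreProvider;
import org.apache.lucene.queries.CustomScoreQuery;
import org.apache.lucene.search.Query;

public class RecencyBoostingQuery extends CustomScoreQuery{
	
	// Milliseconds in one day.
	static final int MSEC_PER_DAY=1000*3600*24;
	
	private final double multiplier;
	
	// Today, as a number of days since the epoch.
	private final int today;
	
	private final int maxDaysAgo;
	
	private final String dayField;
	
	public RecencyBoostingQuery(Query subQuery,double multiplier,int maxDaysAgo,String dayField) {
		super(subQuery);
		today=(int) (new Date().getTime()/MSEC_PER_DAY);
		this.multiplier=multiplier;
		this.maxDaysAgo=maxDaysAgo;
		this.dayField=dayField;
	}

	private class RecencyBooster extends CustomScoreProvider{
		
		// Per-segment iterator over dayField; null if the segment has no such doc values.
		private final NumericDocValues publishDay;  
		
		public RecencyBooster(LeafReaderContext context) throws IOException {
			super(context);
			publishDay=context.reader().getNumericDocValues(dayField);
		}
		
		@Override
		public float customScore(int doc,float subQueryScore,float valSrcScore) throws IOException {
			// advanceExact returns false when this document has no value, whereas
			// advance(doc) would silently move on to a later document.
			if(publishDay==null || !publishDay.advanceExact(doc)) {
				return subQueryScore;
			}
			// The formula assumes dayField holds days since the epoch. The sample index
			// stores yyyyMM integers instead, so daysAgo is negative there; the boost
			// still grows with pubdate, which is what yields newest-first ordering.
			int daysAgo=(int) (today-publishDay.longValue());
			if(daysAgo<maxDaysAgo) {
				float boost=(float) (multiplier*(maxDaysAgo-daysAgo)/maxDaysAgo);
				return (float) (subQueryScore*(1.0+boost));
			}else {
				return subQueryScore;
			}
		}
	}
	
	@Override
	public CustomScoreProvider getCustomScoreProvider(LeafReaderContext context) throws IOException {
		return new RecencyBooster(context);
	}
}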

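I won't go through the algorithm itself; it is taken from the source code that accompanies the book Lucene in Action. The part worth explaining is the private inner class RecencyBooster, which extends CustomScoreProvider and is the class that supplies the scoring formula.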
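Since the formula is not explained further, a small stand-alone sketch of the arithmetic may help. It is written in plain Java with no Lucene dependency; the class RecencyBoostSketch and the helper toEpochDay are hypothetical names introduced here, and "today" is passed in as a parameter instead of being read from the clock. The boost formula is copied from customScore. The helper shows one way to turn a yyyyMM integer such as 201006 into a day count, which is the unit the formula works in.

```java
import java.time.LocalDate;

/** Stand-alone version of the recency boost arithmetic (names are illustrative). */
class RecencyBoostSketch {

    /** Converts a yyyyMM integer such as 201006 to days since 1970-01-01 (first day of the month). */
    static int toEpochDay(int yyyyMM) {
        return (int) LocalDate.of(yyyyMM / 100, yyyyMM % 100, 1).toEpochDay();
    }

    /** Same formula as customScore: a linear boost that fades to zero at maxDaysAgo. */
    static float boostedScore(float subQueryScore, int today, long publishDay,
                              int maxDaysAgo, double multiplier) {
        int daysAgo = (int) (today - publishDay);
        if (daysAgo < maxDaysAgo) {
            float boost = (float) (multiplier * (maxDaysAgo - daysAgo) / maxDaysAgo);
            return (float) (subQueryScore * (1.0 + boost));
        }
        return subQueryScore;
    }

    public static void main(String[] args) {
        int today = toEpochDay(201101); // pretend that "today" is 2011-01-01
        for (int pubdate : new int[] {201006, 201010}) {
            float score = boostedScore(1.0f, today, toEpochDay(pubdate), 2 * 365, 2.0);
            System.out.println(pubdate + " -> " + score);
        }
    }
}
```

With multiplier 2.0 and maxDaysAgo 730, a document published today has its score tripled, one published 365 days ago has it doubled, and anything 730 or more days old keeps the sub-query score unchanged.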

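Its member publishDay is a NumericDocValues object, which gives access to the values of one numeric field across the documents of a segment. The constructor initialises it by calling reader() on the LeafReaderContext and then getNumericDocValues on the result. LeafReaderContext replaced AtomicReaderContext in Lucene 5.0 and represents the context of a single index segment (a leaf reader).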

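customScore overrides the parent method and exposes the formula. The NumericDocValues class behind publishDay changed noticeably in 7.x. Up to 6.x, the value for a document was read with get(docId). In 7.x the class is an iterator over documents: advance(target) moves it to the document whose ID equals target or, failing that, to the first later document that has a value, and longValue() then returns the value of the document the iterator is positioned on. Because advance can land on a later document when the target has no value, advanceExact(target), which reports whether the target document itself has a value, is the safer call for per-document lookups like this one.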
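The difference between the two calls is easiest to see on a field that some documents lack. The following is a toy model in plain Java, not Lucene code: SparseLongValuesSketch is a hypothetical class that only mimics the documented behaviour of advance, advanceExact and longValue on a sparse field.

```java
/** Toy model of an iterator over a sparse numeric field (not a Lucene class). */
class SparseLongValuesSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docIds;   // ascending IDs of the documents that have a value
    private final long[] values;  // values[i] belongs to docIds[i]
    private int pos = 0;          // current index into docIds

    SparseLongValuesSketch(int[] docIds, long[] values) {
        this.docIds = docIds;
        this.values = values;
    }

    /** Moves to the first document >= target that has a value and returns its ID. */
    int advance(int target) {
        while (pos < docIds.length && docIds[pos] < target) {
            pos++;
        }
        return pos < docIds.length ? docIds[pos] : NO_MORE_DOCS;
    }

    /** Moves to target and reports whether that exact document has a value. */
    boolean advanceExact(int target) {
        return advance(target) == target;
    }

    /** Value of the current document; only meaningful after a successful advance. */
    long longValue() {
        return values[pos];
    }

    public static void main(String[] args) {
        int[] ids = {0, 2, 4};                       // documents 1 and 3 have no value
        long[] vals = {201006, 201008, 201010};

        SparseLongValuesSketch a = new SparseLongValuesSketch(ids, vals);
        System.out.println(a.advance(1));            // 2: skipped past document 1
        System.out.println(a.longValue());           // 201008: the value of document 2

        SparseLongValuesSketch b = new SparseLongValuesSketch(ids, vals);
        System.out.println(b.advanceExact(1));       // false: document 1 has no value
    }
}
```

In the sample index every document has a "pubdate", so both calls behave the same there; the distinction only matters once a field is missing on some documents.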

Finally, the outer class overrides getCustomScoreProvider to register this provider class.

Now for the last stage, the search:

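  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.*;
  import org.apache.lucene.store.Directory;

  // TestUtil is a helper that is not shown here; it must return the Directory the index was written to.
  Directory dir=TestUtil.getBookIndexDirectory();
  IndexReader reader=DirectoryReader.open(dir);
  IndexSearcher searcher=new IndexSearcher(reader);
  Query q1=new MatchAllDocsQuery();
  Query q2=new RecencyBoostingQuery(q1, 2.0, 2*365, "pubdate");

  // Sort by the boosted score first; equal scores are ordered by title in reverse.
  Sort sort=new Sort(new SortField[] {SortField.FIELD_SCORE,new SortField("title",SortField.Type.STRING,true)});
  // The two booleans ask for per-document scores and the maximum score to be computed.
  TopDocs hits=searcher.search(q2, 10, sort, true, true);
  System.out.println(hits.totalHits);
  // Iterate over the returned hits (at most 10), not totalHits, which may be larger.
  for(int i=0;i<hits.scoreDocs.length;i++) {
    Document doc=reader.document(hits.scoreDocs[i].doc);
    System.out.println(doc);
    System.out.println(doc.getField("title").stringValue()+":"+doc.get("pubdate")+":"+hits.scoreDocs[i].score);
  }
  reader.close();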

The documents returned by the search now come back ordered by publication date, newest first.
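The Sort used in the search has two keys: the boosted score, then the title in reverse order. The plain-Java sketch below models that ordering with an ordinary Comparator. SortOrderSketch and its Hit class are hypothetical stand-ins, and the sketch assumes ASCII titles, for which Java string order matches the order of the underlying bytes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Models "score first, then title reversed" with a plain Comparator (illustrative only). */
class SortOrderSketch {

    /** Hypothetical stand-in for a search hit: just a score and a title. */
    static final class Hit {
        final float score;
        final String title;
        Hit(float score, String title) {
            this.score = score;
            this.title = title;
        }
    }

    /** Highest score first; ties are broken by title in reverse (Z to A) order. */
    static final Comparator<Hit> SCORE_THEN_TITLE_REVERSED =
            Comparator.comparingDouble((Hit h) -> h.score).reversed()
                    .thenComparing((Hit h) -> h.title, Comparator.<String>reverseOrder());

    /** Returns the titles of the hits in sorted order without modifying the input list. */
    static List<String> sortedTitles(List<Hit> hits) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(SCORE_THEN_TITLE_REVERSED);
        List<String> titles = new ArrayList<>();
        for (Hit h : copy) {
            titles.add(h.title);
        }
        return titles;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit(1.0f, "Solr In Action"),
                new Hit(2.0f, "Lucene In Action"),
                new Hit(1.0f, "Spring In Action"));
        // prints [Lucene In Action, Spring In Action, Solr In Action]
        System.out.println(sortedTitles(hits));
    }
}
```

The title key only matters when two scores are equal. If the goal is simply to order by the field value without any custom scoring, Lucene can sort on the numeric doc values directly, presumably with something like new Sort(new SortField("pubdate", SortField.Type.LONG, true)) passed to the same search call.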