Lucene/Solr Dev 1: Lucene Indexing Time (Date) & Lucene Query Time (Date)

Let's start with a piece of code and its output:

File indexFile = new File("lucene.index_all");
QueryService service = new QueryService();	
IndexReader reader = CBEUtil.getIndexReader(indexFile);
IndexSearcher searcher = new IndexSearcher(reader);
String  start = "2010-07-27T14:30:57.78Z", end = "2010-07-27T14:44:49.187Z";
BooleanQuery.setMaxClauseCount(999999999);
service.singleRangeQuery(start, end, searcher);
service.multiRangeQuery(start, end, searcher);
service.queryDateService(indexFile, start, end, "creationTimeStr");

 

Output:

range = creationTimeStr:[2010-07-27T14:30:57.78Z TO 2010-07-27T14:44:49.187Z]
hits = 355210
Single range spent: 1593ms

booleanQuery = +creationTimeStr:[2010-07-27T14:30:57.78Z TO zzzzzzzzz] +creationTimeStr:[000000000 TO 2010-07-27T14:44:49.187Z]
hits = 355210
multi Range spent: 15500ms

query result: total matching documents 355210 total spent 750 milliseconds

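Comparing the results: all three calls find the same 355210 Documents, but singleRangeQuery() takes 1593 ms, multiRangeQuery() takes 15500 ms, and queryDateService() takes only 750 ms. The efficiency gap is large: multiRangeQuery() is roughly 10 times slower than singleRangeQuery() and roughly 20 times slower than queryDateService(). A brief analysis of this follows: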

Here is the code of singleRangeQuery():

public  void singleRangeQuery(String fromDate, String toDate, IndexSearcher indexSearcher) throws IOException {
        long start = System.currentTimeMillis();
        RangeQuery range = new RangeQuery(new Term("creationTimeStr", fromDate), new Term("creationTimeStr", toDate), true);

        System.out.println("range = " + range);
        Hits hits = indexSearcher.search(range);

        long end = System.currentTimeMillis();
        System.out.println("hits = " + hits.length());
        System.out.println("Single range spent: " + (end -start) + "ms");
    }

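This method uses a RangeQuery to find the Documents whose field value lies between the start time and the end time (both bounds are inclusive here); RangeQuery has since been deprecated;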

The code of multiRangeQuery():

public void multiRangeQuery(String fromDate, String toDate, IndexSearcher indexSearcher) throws IOException {
        long start = System.currentTimeMillis();
        RangeQuery from = new RangeQuery(new Term("creationTimeStr", fromDate), new Term("creationTimeStr",DateField.MAX_DATE_STRING()), true);
        RangeQuery to = new RangeQuery(new Term("creationTimeStr",DateField.MIN_DATE_STRING()), new Term("creationTimeStr", toDate) , true);
        BooleanQuery booleanQuery = new BooleanQuery();
        booleanQuery.add(new BooleanClause(from, BooleanClause.Occur.MUST));
        booleanQuery.add(new BooleanClause(to,BooleanClause.Occur.MUST));

        System.out.println("booleanQuery = " + booleanQuery);
        Hits hits = indexSearcher.search(booleanQuery);

        long end = System.currentTimeMillis();
        System.out.println("hits = " + hits.length());
        System.out.println("multi Range spent: " + (end -start) + "ms");
    }

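This method uses a BooleanQuery to find the Documents whose field value lies between the start time and the end time. BooleanQuery has an add(Query query, BooleanClause.Occur occur) method, so it can hold several queries; here it holds two RangeQuery clauses, one open-ended above and one open-ended below. It is easy to see that this approach is too slow to meet the application's needs, and many of the APIs it uses are likewise deprecated now;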

The comparison of the two methods above illustrates a known rule about Lucene time ranges: "Date searchers should use a single Range term rather than two".
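One way to picture the rule is to count how many indexed terms each variant has to cover. The sketch below is a simplified model, not Lucene's implementation: it assumes a range is evaluated by walking the field's sorted terms between its bounds, and the class RangeTermCountSketch, the helper termsVisited, and the one-term-per-minute dictionary are all hypothetical.

```java
import java.util.Locale;
import java.util.TreeSet;

final class RangeTermCountSketch {
    // Counts the terms in [lower, upper], both bounds inclusive,
    // in a sorted term dictionary (a stand-in for the field's terms).
    static int termsVisited(TreeSet<String> terms, String lower, String upper) {
        return terms.subSet(lower, true, upper, true).size();
    }

    public static void main(String[] args) {
        // Hypothetical dictionary: one term per minute from 14:00 to 14:59.
        TreeSet<String> terms = new TreeSet<>();
        for (int m = 0; m < 60; m++) {
            terms.add(String.format(Locale.ROOT, "2010-07-27T14:%02d:00.000Z", m));
        }
        String from = "2010-07-27T14:30:00.000Z";
        String to = "2010-07-27T14:44:00.000Z";
        // Same sentinel bounds that appear in the printed booleanQuery.
        String min = "000000000";
        String max = "zzzzzzzzz";

        // One closed range only touches the terms between the two bounds.
        int single = termsVisited(terms, from, to);
        // Two open-ended ranges each run to one end of the dictionary.
        int twoRanges = termsVisited(terms, from, max) + termsVisited(terms, min, to);

        System.out.println("single range terms: " + single);
        System.out.println("two open-ended ranges terms: " + twoRanges);
    }
}
```

In this toy dictionary the single range covers 15 terms while the two open-ended ranges together cover 75, and the gap widens as the index holds more timestamps outside the queried window.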

The code of queryDateService():

public void queryDateService(File indexFile, String start, String end, String dateField) {
		Count.set();
		IndexReader reader = null;
		try {
			reader = CBEUtil.getIndexReader(indexFile);
			IndexSearcher searcher = new IndexSearcher(reader);
			TermRangeQuery query = new TermRangeQuery(dateField, start, end, true,true);
			TopDocs matches = searcher.search(query, null, 10, new Sort(dateField));
			System.out.println(" query result: total matching documents " + matches.totalHits + " total spent " + (Count.result() + " milliseconds"));
			Count.destory();
		}  catch (IOException e) {
			errorHanlder("",e);
		}
	}

Judging from the measured results, this method is the most efficient, so it is the one to use in application development;

 

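Summary of indexing and querying Time in Lucene

Lucene study notes (2) already touched on indexing and querying time values in Lucene. Here I summarize Lucene indexing Time(Date) & Lucene Query Time(Date) mainly from the point of view of query efficiency:

1. Two approaches to indexing:

Method One: a Time(Date) corresponds to a long value, so it can be indexed with a NumericField;

Method Two: convert the Time(Date) into a formatted string and index it with an ordinary Field

To study this in detail, Method One is split into two cases (indexing by milliseconds and indexing by seconds)

The indexing code: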

public Document getDocument() {
		Document doc = new Document();
		doc.add(new NumericField("creationTimeSec", Field.Store.YES, true)
				.setLongValue(new Date().getTime() / 1000));
		doc.add(new NumericField("creationTimeMill", Field.Store.YES, true)
				.setLongValue(new Date().getTime()));
		doc.add(new Field("creationTimeStr", new SimpleDateFormat(
				"yyyy-MM-dd'T'HH:mm:ss.S'Z'").format(new Date()),
				Field.Store.YES, Field.Index.NOT_ANALYZED));
		return doc;
	}

As the code shows, three fields are added to every Document: NumericField/seconds, NumericField/milliseconds, and Field/string;
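A string field only supports range queries correctly if the lexicographic order of the strings matches chronological order. The sketch below illustrates one detail worth checking: a single S in a SimpleDateFormat pattern writes milliseconds without zero-padding, while SSS writes a fixed three digits. The class SortableDateStringSketch and its format helper are hypothetical, and formatting in UTC is an assumption made for the example.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class SortableDateStringSketch {
    // Formats an epoch-millisecond value with the given pattern, in UTC (assumption).
    static String format(String pattern, long epochMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat(pattern, Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(epochMillis));
    }

    public static void main(String[] args) {
        long earlier = 78L;   // 78 ms after the epoch
        long later = 187L;    // 187 ms after the epoch

        String unpadded = "yyyy-MM-dd'T'HH:mm:ss.S'Z'";    // pattern used in the indexing code
        String padded = "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'";    // fixed-width milliseconds

        String u1 = format(unpadded, earlier);
        String u2 = format(unpadded, later);
        String p1 = format(padded, earlier);
        String p2 = format(padded, later);

        // compareTo < 0 means the first string sorts before the second.
        System.out.println(u1 + " before " + u2 + " ? " + (u1.compareTo(u2) < 0));
        System.out.println(p1 + " before " + p2 + " ? " + (p1.compareTo(p2) < 0));
    }
}
```

With the unpadded pattern, ".78Z" sorts after ".187Z" even though it is the earlier instant; the fixed-width pattern keeps string order and time order aligned.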

Querying the index above likewise takes two methods; here they are:

 

public void queryDateService(File indexFile, long startDate, long endDate, String dateField) {
		Count.set();
		IndexReader reader = null;
		try {
			reader = CBEUtil.getIndexReader(indexFile);
			IndexSearcher searcher = new IndexSearcher(reader);
			NumericRangeQuery query = NumericRangeQuery.newLongRange(dateField, startDate, endDate, true,true);
			TopDocs matches = searcher.search(query, null, 10, new Sort(dateField));
			System.out.println(" query result: total matching documents " + matches.totalHits + " total spent " + (Count.result() + " milliseconds"));
			Count.destory();
		}  catch (IOException e) {
			errorHanlder("",e);
		}
	} 
	
	public void queryDateService(File indexFile, String start, String end, String dateField) {
		Count.set();
		IndexReader reader = null;
		try {
			reader = CBEUtil.getIndexReader(indexFile);
			IndexSearcher searcher = new IndexSearcher(reader);
			TermRangeQuery query = new TermRangeQuery(dateField, start, end, true,true);
			TopDocs matches = searcher.search(query, null, 10, new Sort(dateField));
			System.out.println(" query result: total matching documents " + matches.totalHits + " total spent " + (Count.result() + " milliseconds"));
			Count.destory();
		}  catch (IOException e) {
			errorHanlder("",e);
		}
	}

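Analysis of the code above:

queryDateService(File indexFile, long startDate, long endDate, String dateField) takes the index file to query, the long value of the start Time(Date), the long value of the end Time(Date), and the name of the Field that holds the Time(Date). The long values can be in milliseconds (new Date().getTime()) or in seconds (new Date().getTime() / 1000), as long as they match the field being queried;

queryDateService(File indexFile, String start, String end, String dateField) takes the index file to query, the formatted string of the start Time(Date), the formatted string of the end Time(Date), and the name of the Field that holds the Time(Date);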
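To make the relationship between the two overloads concrete, the following sketch derives millisecond and second bounds from the same formatted strings used at the top of this post. It is only an illustration: the class DateBoundsSketch and its helpers are hypothetical, and it assumes the strings denote UTC times (in the pattern the trailing Z is just a quoted literal).

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

final class DateBoundsSketch {
    // Parses a string in the post's format into epoch milliseconds (UTC assumed).
    static long toMillis(String text) throws ParseException {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.S'Z'", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.parse(text).getTime();
    }

    // Same truncation as the indexing code: getTime() / 1000.
    static long toSeconds(long epochMillis) {
        return epochMillis / 1000;
    }

    public static void main(String[] args) throws ParseException {
        String start = "2010-07-27T14:30:57.78Z";
        String end = "2010-07-27T14:44:49.187Z";

        long startMs = toMillis(start);
        long endMs = toMillis(end);

        // Bounds for the creationTimeMill field.
        System.out.println("milliseconds range: [" + startMs + " TO " + endMs + "]");
        // Bounds for the creationTimeSec field.
        System.out.println("seconds range: [" + toSeconds(startMs) + " TO " + toSeconds(endMs) + "]");
        // Bounds for the creationTimeStr field are the strings themselves.
        System.out.println("string range: [" + start + " TO " + end + "]");
    }
}
```

Note that the seconds bounds are coarser: truncating to whole seconds widens or narrows the window by up to a second at each end compared with the millisecond bounds.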

 

The test results follow:



 

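In the chart above, the X axis is the size of the index in MB, which grows from 0 MB to 1456 MB over the course of the experiment; the Y axis is the query time in milliseconds, and the slowest query in the experiment took 1922 ms;

The three curves in the chart:

         query by milliseconds range: indexed as NumericField/milliseconds, queried with a Time Range in milliseconds

         query by seconds range: indexed as NumericField/seconds, queried with a Time Range in seconds

         query by string range: indexed as Field/string, queried with a String Range

Analysis of the chart:

1. When the index is around 200 MB, the three approaches differ the least, each taking roughly 400 ms

2. NumericField/milliseconds is the slowest to query and Field/string is the fastest

3. As the index grows, the query time of the Field/string approach increases the most slowly, which makes it the most suitable Time Range query mode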

 

 

The data behind the chart:

Indexed file size (MB)                   207   416   624   837  1040  1248  1456
Time, query by milliseconds range (ms)   453   734   953  1218  1438  1687  1922
Time, query by seconds range (ms)        406   563   781  1000  1188  1360  1562
Time, query by string range (ms)         344   484   609   765   875  1015  1140
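The growth rates can be read off the table with a little arithmetic. The sketch below computes the average increase in query time per MB of index between the first and last measurement of each series; the class QueryTimeGrowthSketch and its helper msPerMb are hypothetical names, and the numbers are copied from the table.

```java
import java.util.Locale;

final class QueryTimeGrowthSketch {
    // Average slope between the first and last data points, in ms per MB.
    static double msPerMb(int[] sizesMb, int[] timesMs) {
        int last = sizesMb.length - 1;
        return (double) (timesMs[last] - timesMs[0]) / (sizesMb[last] - sizesMb[0]);
    }

    public static void main(String[] args) {
        int[] sizesMb = {207, 416, 624, 837, 1040, 1248, 1456};
        int[] millisRange = {453, 734, 953, 1218, 1438, 1687, 1922};
        int[] secondsRange = {406, 563, 781, 1000, 1188, 1360, 1562};
        int[] stringRange = {344, 484, 609, 765, 875, 1015, 1140};

        System.out.println(String.format(Locale.ROOT,
                "milliseconds range: %.2f ms/MB", msPerMb(sizesMb, millisRange)));
        System.out.println(String.format(Locale.ROOT,
                "seconds range:      %.2f ms/MB", msPerMb(sizesMb, secondsRange)));
        System.out.println(String.format(Locale.ROOT,
                "string range:       %.2f ms/MB", msPerMb(sizesMb, stringRange)));
    }
}
```

This works out to about 1.18, 0.93, and 0.64 ms per MB respectively, so over the measured range the string series grows at a little more than half the rate of the milliseconds series.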

 

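The table and the chart correspond one to one. From these results it is easy to see that converting the Time(Date) into a formatted string, indexing it with an ordinary Field, and querying it with a String range is the best choice;

Conclusion: in these measurements, the best way to index a Time(Date) and query the resulting index is to convert the Time(Date) into a formatted string, index it with an ordinary Field, and query it with a String range;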

 

 

 

 

 

 

 

 

 

 

 

 
