Lucene: Indexing numbers, dates, and times, and field truncation

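This article looks at how to index numeric and date/time values effectively in a search engine: preserving numbers found in text as separate tokens so that they can be searched, and indexing a field's single numeric value directly for exact matching, range queries, and sorting.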

Although most content is textual in nature, in many cases handling numeric or date/time values is crucial. In a commerce setting, the product’s price, and perhaps other numeric attributes like weight and height, are clearly important. A video search engine may index the duration of each video. Press releases and articles have a time-stamp.

 

Indexing numbers

There are two common scenarios in which indexing numbers is important.

  • In one scenario, numbers are embedded in the text to be indexed, and you want to make sure those numbers are preserved and indexed as their own tokens so that you can use them later as ordinary tokens in searches. To enable this, simply pick an analyzer that doesn’t discard numbers.
  • In the other scenario, you have a field that contains a single number and you want to index it as a numeric value and then use it for precise (equals) matching, range searching, and/or sorting.

For example, a price can be indexed as a double value:

doc.add(new NumericField("price").setDoubleValue(19.99));
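One reason to index a single-number field as a number rather than as text is that text terms compare character by character, which does not match numeric order. The sketch below is plain Java with no Lucene dependency; the class PriceOrdering and its methods are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Lucene: contrasts text order with numeric order.
final class PriceOrdering {
    // Sorts prices the way plain strings compare: character by character.
    static List<String> sortAsText(List<String> prices) {
        List<String> copy = new ArrayList<String>(prices);
        Collections.sort(copy);
        return copy;
    }

    // Sorts the same prices by their parsed numeric value.
    static List<String> sortAsNumbers(List<String> prices) {
        List<String> copy = new ArrayList<String>(prices);
        Collections.sort(copy, new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                return Double.compare(Double.parseDouble(a), Double.parseDouble(b));
            }
        });
        return copy;
    }
}
```

With the prices "9.99", "19.99", and "100.00", sortAsText puts "100.00" first and "9.99" last, while sortAsNumbers returns them in ascending numeric order. Range searching and sorting on a numerically indexed field rely on the second kind of ordering.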

 

Indexing dates and times

 

Date and time values are easily handled by first converting them to an equivalent int or long value, and then indexing that value as a number. The simplest approach is to use Date.getTime to get the equivalent value, in millisecond precision, for a Java Date object:

doc.add(new NumericField("timestamp")
    .setLongValue(new Date().getTime()));

 

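If millisecond precision is not needed, the value can be quantized to a coarser unit. Date.getTime returns milliseconds, so day resolution means dividing by the number of milliseconds in a day:

doc.add(new NumericField("day")
    .setIntValue((int) (new Date().getTime() / (24L * 3600 * 1000))));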
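Quantizing a timestamp is integer division of the millisecond value by the size of the target unit. A minimal sketch in plain Java follows, assuming dates at or after the epoch; TimestampQuantizer and its method names are hypothetical and not part of Lucene.

```java
import java.util.Date;

// Hypothetical helper (not a Lucene class): reduces a Date to coarser units.
final class TimestampQuantizer {
    static final long MILLIS_PER_SECOND = 1000L;
    static final long MILLIS_PER_DAY = 24L * 3600L * 1000L;

    // Whole seconds since the epoch (1970-01-01T00:00:00 UTC).
    static long toEpochSeconds(Date date) {
        return date.getTime() / MILLIS_PER_SECOND;
    }

    // Whole days since the epoch, counted in UTC.
    // Assumption: the date is not before 1970; integer division rounds
    // toward zero, so earlier dates would need floor division instead.
    static int toEpochDays(Date date) {
        return (int) (date.getTime() / MILLIS_PER_DAY);
    }
}
```

The results could then be passed to setLongValue or setIntValue on a NumericField. Day counts fit comfortably in an int, whereas raw millisecond timestamps need a long.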

 

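Individual components of a date, such as the day of the month, can be extracted with Calendar and indexed in the same way:

Calendar cal = Calendar.getInstance();
cal.setTime(date);
doc.add(new NumericField("dayOfMonth")
    .setIntValue(cal.get(Calendar.DAY_OF_MONTH)));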
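Calendar.getInstance() with no arguments uses the JVM's default time zone, so the same instant can yield a different day of the month on different machines. The sketch below passes the zone explicitly; DateParts and dayOfMonth are hypothetical names, and whether a fixed zone is appropriate depends on the application.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical helper (not a Lucene class): extracts a date component
// in an explicitly chosen time zone.
final class DateParts {
    // Returns the day of the month (1-31) of the given instant in the given zone.
    static int dayOfMonth(Date date, TimeZone zone) {
        Calendar cal = Calendar.getInstance(zone);
        cal.setTime(date);
        return cal.get(Calendar.DAY_OF_MONTH);
    }
}
```

For example, the epoch instant new Date(0L) falls on day 1 in UTC but on day 31 in a zone eight hours behind UTC.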

 

------------------------------------------------------------------------------------------------------------------------------------

Field truncation

Some applications index documents whose sizes aren’t known in advance. As a safety mechanism to control the amount of RAM and hard disk space used, you may want to limit the amount of input that is indexed per field. It’s also possible that a large binary document is accidentally misclassified as a text document, or contains embedded binary content that your document filter failed to process, which quickly adds many absurd binary terms to your index. Other applications deal with documents of known size, but you’d like to index only a portion of each. For example, you may want to index only the first 200 words of each document.

 

To support these diverse cases, IndexWriter allows you to truncate per-Field indexing so that only the first N terms are indexed for an analyzed field. When you instantiate IndexWriter, you must pass in a MaxFieldLength instance expressing this limit. MaxFieldLength provides two convenient default instances: MaxFieldLength.UNLIMITED, which means no truncation will take place, and MaxFieldLength.LIMITED, which means fields are truncated at 10,000 terms. You can also instantiate MaxFieldLength with your own limit.
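The effect of such a limit can be illustrated without Lucene. The sketch below is a conceptual illustration only, not Lucene's implementation: it assumes simple whitespace tokenization, and FieldTruncation and firstTerms are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration (not Lucene code): keep only the first N terms.
final class FieldTruncation {
    // Splits the text on whitespace and returns at most maxTerms tokens;
    // everything after the limit is dropped, as if it had never been indexed.
    static List<String> firstTerms(String text, int maxTerms) {
        List<String> kept = new ArrayList<String>();
        for (String token : text.trim().split("\\s+")) {
            if (kept.size() >= maxTerms) {
                break;
            }
            if (!token.isEmpty()) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Calling firstTerms("one two three four", 2) keeps only "one" and "two". In Lucene the tokens come from the field's analyzer rather than a whitespace split, and terms past the limit are not searchable, so a truncated field cannot match queries on words that appear only later in the document.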

 
