Lucene: Indexing numbers, dates, and times, and field truncation

License: CC 4.0 BY-SA

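This article looks at how a search engine can index numeric and date/time values effectively: preserving numbers that appear in text as separate tokens so they can be searched, and indexing a field that holds a single numeric value for exact matching, range queries, and sorting.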

Although most content is textual in nature, in many cases handling numeric or date/time values is crucial. In a commerce setting, the product’s price, and perhaps other numeric attributes like weight and height, are clearly important. A video search engine may index the duration of each video. Press releases and articles have a time-stamp.

Indexing numbers

There are two common scenarios in which indexing numbers is important.

In one scenario, numbers are embedded in the text to be indexed, and you want to make sure those numbers are preserved and indexed as their own tokens so that you can use them later as ordinary tokens in searches. To enable this, simply pick an analyzer that doesn’t discard numbers.
In the other scenario, you have a field that contains a single number and you want to index it as a numeric value and then use it for precise (equals) matching, range searching, and/or sorting. For example:

doc.add(new NumericField("price").setDoubleValue(19.99));
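For the first scenario, the choice of analyzer decides whether embedded numbers survive. In Lucene 3.x, SimpleAnalyzer breaks text at non-letter characters and so drops digits, while WhitespaceAnalyzer and StandardAnalyzer keep them. The sketch below is plain Java, not Lucene code: the class NumberTokenSketch and its two methods are hypothetical helpers that only imitate a letters-only policy and a letters-and-digits policy, to show what is lost.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch only: these hypothetical helpers imitate two tokenization
// policies in plain Java. They are not Lucene classes.
class NumberTokenSketch {

    // Letters-only policy: any non-letter ends a token, so digits vanish.
    static List<String> lettersOnly(String text) {
        return split(text, false);
    }

    // Letters-and-digits policy: digits count as token characters, so numbers survive.
    static List<String> lettersAndDigits(String text) {
        return split(text, true);
    }

    private static List<String> split(String text, boolean keepDigits) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            boolean tokenChar = Character.isLetter(c) || (keepDigits && Character.isDigit(c));
            if (tokenChar) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A separator closes the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Model 42 ships in 2010";
        System.out.println(lettersOnly(text));      // [model, ships, in]
        System.out.println(lettersAndDigits(text)); // [model, 42, ships, in, 2010]
    }
}
```

With the letters-only policy a search for "42" can never match, because that token was never produced.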

Indexing dates and times

Such values are easily handled by first converting them to an equivalent int or long value, and then indexing that value as a number. The simplest approach is to use Date.getTime to get the equivalent value, in millisecond precision, for a Java Date object:

doc.add(new NumericField("timestamp")
    .setLongValue(new Date().getTime()));

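If you don't need millisecond resolution, quantize the value with integer division. Date.getTime returns milliseconds, so the day number is obtained by dividing by 1000, then by 3600, then by 24:

doc.add(new NumericField("day")
    .setIntValue((int) (new Date().getTime() / 1000 / 3600 / 24)));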
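Quantizing a timestamp is plain arithmetic on the millisecond value, so it can be checked without Lucene. The following is a minimal sketch using only the JDK. The class name TimestampQuantizer and its methods are hypothetical, and the use of Math.floorDiv (so that dates before 1970 round toward earlier units rather than toward zero) is a choice made here, not something the original text prescribes. The results are the values you would pass to setLongValue or setIntValue.

```java
import java.util.Date;

// Hypothetical helper: converts a Date to coarser units counted from the epoch.
class TimestampQuantizer {
    static final long MILLIS_PER_SECOND = 1000L;
    static final long MILLIS_PER_HOUR = 3600L * MILLIS_PER_SECOND;
    static final long MILLIS_PER_DAY = 24L * MILLIS_PER_HOUR;

    // Whole seconds since 1970-01-01T00:00:00Z.
    static long toEpochSeconds(Date d) {
        return Math.floorDiv(d.getTime(), MILLIS_PER_SECOND);
    }

    // Whole hours since the epoch; toIntExact throws if the value overflows an int.
    static int toEpochHours(Date d) {
        return Math.toIntExact(Math.floorDiv(d.getTime(), MILLIS_PER_HOUR));
    }

    // Whole days since the epoch (UTC day boundaries, not local ones).
    static int toEpochDays(Date d) {
        return Math.toIntExact(Math.floorDiv(d.getTime(), MILLIS_PER_DAY));
    }

    public static void main(String[] args) {
        // Three days and five seconds after the epoch.
        Date d = new Date(3 * MILLIS_PER_DAY + 5 * MILLIS_PER_SECOND);
        System.out.println(toEpochSeconds(d)); // 259205
        System.out.println(toEpochHours(d));   // 72
        System.out.println(toEpochDays(d));    // 3
    }
}
```

Because the division is done on the UTC millisecond value, day boundaries fall at midnight UTC. If local-time days matter, a Calendar-based approach is the safer route.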

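To index a calendar component such as the year, month, or day of month, read it from a Calendar instance (here date is the Date being indexed):

Calendar cal = Calendar.getInstance();
cal.setTime(date);
doc.add(new NumericField("dayOfMonth")
    .setIntValue(cal.get(Calendar.DAY_OF_MONTH)));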
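Two details are easy to miss when pulling fields out of a Calendar: Calendar.MONTH is zero-based (January is 0), and the result depends on the calendar's time zone. The sketch below uses only the JDK and passes the time zone in explicitly. The class CalendarFieldSketch and its methods are hypothetical names, and using GregorianCalendar directly (instead of Calendar.getInstance) is an assumption made here so the result does not vary with the default locale.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Hypothetical helper: extracts calendar components suitable for numeric fields.
class CalendarFieldSketch {

    // Returns {year, month (1-12), dayOfMonth} for the date in the given zone.
    static int[] yearMonthDay(Date date, TimeZone zone) {
        Calendar cal = new GregorianCalendar(zone);
        cal.setTime(date);
        return new int[] {
            cal.get(Calendar.YEAR),
            cal.get(Calendar.MONTH) + 1, // Calendar.MONTH is zero-based
            cal.get(Calendar.DAY_OF_MONTH)
        };
    }

    // Returns a Calendar.SUNDAY..Calendar.SATURDAY constant (1..7).
    static int dayOfWeek(Date date, TimeZone zone) {
        Calendar cal = new GregorianCalendar(zone);
        cal.setTime(date);
        return cal.get(Calendar.DAY_OF_WEEK);
    }

    public static void main(String[] args) {
        TimeZone utc = TimeZone.getTimeZone("UTC");
        int[] ymd = yearMonthDay(new Date(0L), utc);
        System.out.println(ymd[0] + "-" + ymd[1] + "-" + ymd[2]); // 1970-1-1
        System.out.println(dayOfWeek(new Date(0L), utc) == Calendar.THURSDAY); // true
    }
}
```

Each returned int could then be indexed with setIntValue in its own field, for example one field for the year and another for the month.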

------------------------------------------------------------------------------------------------------------------------------------

Field truncation

Some applications index documents whose sizes aren’t known in advance. As a safety mechanism to control the amount of RAM and hard disk space used, you may want to limit the amount of input they are allowed to index per field. It’s also possible that a large binary document is accidentally misclassified as a text document, or contains embedded binary content that your document filter failed to process, which quickly adds many absurd binary terms to your index, much to your horror. Other applications deal with documents of known size, but you’d like to index only a portion of each. For example, you may want to index only the first 200 words of each document.

To support these diverse cases, IndexWriter allows you to truncate per-Field indexing so that only the first N terms are indexed for an analyzed field. When you instantiate IndexWriter, you must pass in a MaxFieldLength instance expressing this limit. MaxFieldLength provides two convenient default instances: MaxFieldLength.UNLIMITED, which means no truncation will take place, and MaxFieldLength.LIMITED, which means fields are truncated at 10,000 terms. You can also instantiate MaxFieldLength with your own limit.
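The effect of such a limit can be modeled in a few lines of plain Java. This is a conceptual sketch only, not how IndexWriter implements truncation: the class FieldTruncationSketch and the method firstTerms are hypothetical, and splitting on whitespace stands in for whatever analyzer the field actually uses. It shows that terms past the limit are silently dropped, so they can never match a query.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual model of per-field truncation; not Lucene's implementation.
class FieldTruncationSketch {

    // Keeps at most maxFieldLength terms from the start of the text.
    static List<String> firstTerms(String text, int maxFieldLength) {
        List<String> kept = new ArrayList<>();
        for (String term : text.trim().split("\\s+")) {
            if (kept.size() >= maxFieldLength) {
                break; // everything after the limit is ignored
            }
            if (!term.isEmpty()) { // guards against the empty-input case
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String body = "alpha beta gamma delta epsilon";
        List<String> indexed = firstTerms(body, 3);
        System.out.println(indexed);                     // [alpha, beta, gamma]
        System.out.println(indexed.contains("epsilon")); // false
    }
}
```

In the Lucene API described above, the equivalent of the limit 3 would be a custom MaxFieldLength instance passed to the IndexWriter constructor. Since truncation is silent, it is worth choosing the limit with the largest legitimate documents in mind.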