Lucene's limits on the content it indexes, and things to watch for


License: CC 4.0 BY-SA

Tags:

#lucene #Google


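This post looks at why Lucene limits document length during indexing: the limit bounds memory use so that very large documents do not exhaust memory. By default, each field is truncated to at most 10,000 terms, which is roughly 40 pages of English text at an average of about 250 words per page. If a document exceeds this limit and must be indexed in full, raise the limit with IndexWriter.setMaxFieldLength(). The value can be tuned to the memory you have available.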

Limit on content length:

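The main purpose of the limit is to keep the indexing process from running out of memory. If enough memory is available, the value can be set to Integer.MAX_VALUE, which covers any document size you are likely to encounter today.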

Reference material:

Documents are truncated by default

The indexer by default truncates documents to IndexWriter.DEFAULT_MAX_FIELD_LENGTH or 10,000 terms in Lucene 2.0.

Rule of thumb: an average page of English text contains about 250 words. (Source: Google Answers.) This means only about 40 pages are indexed by default. If any of your documents are longer than this (and you want them indexed in full), you should raise the limit with IndexWriter.setMaxFieldLength().
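The arithmetic behind the "about 40 pages" figure, and the reverse calculation for choosing a larger limit, can be sketched in a few lines of plain Java. This is only an illustration: the class and method names are hypothetical, the 250-words-per-page figure is the rule of thumb quoted above, and it treats one word as one term, which an analyzer that drops or splits tokens would not do exactly.

```java
public class FieldLengthEstimate {
    // Rule-of-thumb figure quoted above; real documents vary.
    static final int WORDS_PER_PAGE = 250;
    // Default limit in Lucene 2.0, as stated above.
    static final int DEFAULT_MAX_FIELD_LENGTH = 10_000;

    /** Approximate number of pages covered by a given term limit. */
    static int pagesCovered(int maxFieldLength) {
        return maxFieldLength / WORDS_PER_PAGE;
    }

    /** Approximate term limit for a document of the given page count, capped at Integer.MAX_VALUE. */
    static int limitForPages(int pages) {
        long terms = (long) pages * WORDS_PER_PAGE; // long arithmetic avoids int overflow
        return (int) Math.min(terms, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        System.out.println(pagesCovered(DEFAULT_MAX_FIELD_LENGTH)); // 40
        System.out.println(limitForPages(500));                     // 125000
        // With Lucene 2.x, the estimate would then be passed to
        // IndexWriter.setMaxFieldLength(...); that call is omitted here
        // so the sketch has no dependency on Lucene.
    }
}
```

In practice it is probably wise to add headroom to such an estimate, since pages in real documents can hold far more than 250 words.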

public void setMaxFieldLength(int maxFieldLength)

The maximum number of terms that will be indexed for a single field in a document. This limits the amount of memory required for indexing, so that collections with very large files will not crash the indexing process by running out of memory. This setting refers to the number of running terms, not to the number of different terms.

Note: this silently truncates large documents, excluding from the index all terms that occur further in the document. If you know your source documents are large, be sure to set this value high enough to accommodate the expected size. If you set it to Integer.MAX_VALUE, then the only limit is your memory, but you should anticipate an OutOfMemoryError.
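To make the two points above concrete (the limit counts running terms, not distinct ones, and everything past it is dropped without warning), here is a minimal simulation in plain Java. It is not Lucene code: the class and method are hypothetical, and whitespace splitting plus lowercasing stands in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

public class TruncationSketch {
    /**
     * Splits text on whitespace and keeps at most maxFieldLength running
     * terms. A simplified stand-in for an analyzer plus the field limit.
     */
    static List<String> indexedTerms(String text, int maxFieldLength) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) continue;             // empty input yields one empty token
            if (kept.size() >= maxFieldLength) break;  // the rest is dropped, with no error
            kept.add(token.toLowerCase());
        }
        return kept;
    }

    public static void main(String[] args) {
        String text = "to be or not to be that is the question";
        List<String> kept = indexedTerms(text, 6);
        System.out.println(kept);                       // [to, be, or, not, to, be]
        System.out.println(new HashSet<>(kept).size()); // 4 distinct terms among 6 running terms
        System.out.println(kept.contains("question"));  // false: a search for it would find nothing
    }
}
```

Repeated terms use up the budget just like new ones, so a long document with a small vocabulary is truncated at the same position as one with a large vocabulary.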

By default, no more than DEFAULT_MAX_FIELD_LENGTH terms will be indexed for a field.