Lucene: limits and caveats on the content it indexes

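This article discusses why the document length limit matters for memory management and for keeping the indexing process from running out of memory. By default, documents are truncated to at most 10,000 terms (the limit applies per field), which is roughly 40 pages of English text at about 250 words per page. If a document exceeds this limit and must be indexed in full, raise the limit with IndexWriter.setMaxFieldLength(). The method lets you adjust the maximum length according to the memory available.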


Limit on content length:

 

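The main purpose of the limit is to keep the indexing process from running out of memory. If enough memory is available, the value can be set to Integer.MAX_VALUE, which covers any document size currently possible.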

 

 

Reference material:

Documents are truncated by default

The indexer by default truncates documents to IndexWriter.DEFAULT_MAX_FIELD_LENGTH, which is 10,000 terms in Lucene 2.0.

Rule of thumb: an average page of English text contains about 250 words. (Source: Google Answers.) This means only about 40 pages are indexed by default. If any of your documents are longer than this (and you want them indexed in full), you should raise the limit with IndexWriter.setMaxFieldLength().
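The arithmetic behind the 40-page figure can be sketched in a few lines of Java. This is only a back-of-the-envelope helper: the class and method names are hypothetical (not part of Lucene), and it assumes one word maps to roughly one indexed term, which depends on the analyzer in use.

```java
/** Rough estimate of how far a term limit reaches; names are illustrative, not Lucene API. */
final class FieldLimitEstimate {

    /** Number of full pages that fit within the given term limit. */
    static int pagesIndexed(int maxFieldLength, int wordsPerPage) {
        return maxFieldLength / wordsPerPage;
    }

    /** Smallest term limit that covers a document of the given page count. */
    static long limitForPages(long pages, int wordsPerPage) {
        return pages * wordsPerPage;
    }

    public static void main(String[] args) {
        // Default limit of 10,000 terms at about 250 words per page.
        System.out.println(pagesIndexed(10_000, 250)); // prints 40
        // A hypothetical 300-page manual would need a limit of at least this many terms.
        System.out.println(limitForPages(300, 250));   // prints 75000
    }
}
```

A value computed this way could then be passed to IndexWriter.setMaxFieldLength(), with some headroom, since the real term count depends on how the analyzer tokenizes the text.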



public void setMaxFieldLength(int maxFieldLength)
The maximum number of terms that will be indexed for a single field in a document. This limits the amount of memory required for indexing, so that collections with very large files will not crash the indexing process by running out of memory. This setting refers to the number of running terms, not to the number of different terms.
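To make the distinction between running terms and different terms concrete, here is a small simulation in plain Java. It does not call Lucene; the class and its methods are hypothetical, and the whitespace-split tokens stand in for whatever the analyzer would produce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Simulates a limit on running terms; illustrative only, not Lucene's implementation. */
final class RunningTermsDemo {

    /** Keeps at most maxFieldLength tokens, counting every occurrence, repeats included. */
    static List<String> truncate(List<String> tokens, int maxFieldLength) {
        int end = Math.min(tokens.size(), maxFieldLength);
        return new ArrayList<>(tokens.subList(0, end));
    }

    /** The different terms among the tokens, in first-seen order. */
    static Set<String> distinctTerms(List<String> tokens) {
        return new LinkedHashSet<>(tokens);
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("to be or not to be that is the question".split(" "));
        List<String> kept = truncate(tokens, 6);
        System.out.println(kept.size());                 // prints 6 (running terms)
        System.out.println(distinctTerms(kept).size());  // prints 4 (to, be, or, not)
    }
}
```

With a limit of 6, the repeated "to" and "be" each consume a slot, so only four different terms are kept and everything from "that" onward is dropped.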

Note: this silently truncates large documents, excluding from the index all terms that occur further in the document. If you know your source documents are large, be sure to set this value high enough to accommodate the expected size. If you set it to Integer.MAX_VALUE, then the only limit is your memory, but you should anticipate an OutOfMemoryError.
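Because the truncation is silent, one possible safeguard is to estimate the token count before indexing and flag documents that would lose content. The sketch below is an assumption-laden approximation: the class and methods are hypothetical, and splitting on whitespace only approximates what a real analyzer emits.

```java
/** Pre-indexing check for silent truncation; a rough sketch, not part of Lucene. */
final class TruncationCheck {

    /** Approximate token count using whitespace splitting (an assumption, not the analyzer's count). */
    static int countTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Number of trailing tokens that would be excluded under the given limit. */
    static int droppedTokens(String text, int maxFieldLength) {
        return Math.max(0, countTokens(text) - maxFieldLength);
    }

    public static void main(String[] args) {
        String body = "alpha beta gamma delta epsilon";
        int dropped = droppedTokens(body, 3);
        if (dropped > 0) {
            System.out.println("Would drop " + dropped + " tokens; consider raising the limit.");
        }
    }
}
```

A check like this trades one extra pass over the text for an explicit warning, which may be preferable to discovering later that the tail of a long document is not searchable.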

By default, no more than DEFAULT_MAX_FIELD_LENGTH terms will be indexed for a field.


 

 
