DeDuplicator for Heritrix 3 - 27/07/2010
Version 3.0.0-SNAPSHOT-20100727 is now available here.
This version is compiled against Heritrix 3.0.0.
It also moves from Lucene 2.0.0 to Lucene 3.0.2. Please note that changes in the Lucene library mean that memory usage will be approximately 40% greater than before: it appears to be approximately 5 bytes per URL in the index, compared to 3.6 bytes per URL previously. Query times, however, have improved significantly and are now constant regardless of index size. For large indexes this can mean queries that are 10 to 30 times faster. Building indexes is also much faster (approximately 3 to 4 times as fast).
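To get a feel for what the per-URL figures mean in practice, the arithmetic can be sketched as below. This is only a rough estimate under stated assumptions: the function names and the example index size of 100 million URLs are hypothetical and are not part of the DeDuplicator or Lucene; only the two bytes-per-URL constants come from the notes above.

```python
# Approximate per-URL memory figures quoted in the release notes.
BYTES_PER_URL_OLD = 3.6  # with Lucene 2.0.0
BYTES_PER_URL_NEW = 5.0  # with Lucene 3.0.2


def estimated_index_memory_bytes(url_count: int, bytes_per_url: float) -> float:
    """Return a rough memory estimate for an index holding url_count URLs."""
    return url_count * bytes_per_url


def relative_increase(old: float, new: float) -> float:
    """Return the fractional increase going from old to new."""
    return (new - old) / old


if __name__ == "__main__":
    urls = 100_000_000  # hypothetical index size, not taken from the release notes
    old_mb = estimated_index_memory_bytes(urls, BYTES_PER_URL_OLD) / 1e6
    new_mb = estimated_index_memory_bytes(urls, BYTES_PER_URL_NEW) / 1e6
    print(f"old: {old_mb:.0f} MB, new: {new_mb:.0f} MB")
    increase = relative_increase(BYTES_PER_URL_OLD, BYTES_PER_URL_NEW)
    print(f"increase: {increase:.0%}")
```

For the assumed 100 million URLs this works out to roughly 360 MB before and 500 MB after, an increase of about 39%, which matches the "approximately 40%" figure.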
The DeDupFetchHTTP processor has not yet been converted.
This release marks the end of the existing DeDuplicator, which is built against Heritrix 1.14. One final release (1.0.0) with some accumulated bugfixes will follow soon. A release candidate is available here.
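Heritrix 3 deduplication component update
An updated DeDuplicator, version 3.0.0-SNAPSHOT-20100727, has been released for Heritrix 3.0.0. This version adopts Lucene 3.0.2 and significantly improves query speed and index-building efficiency.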