Heritrix Plugin: DeDuplicator

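An updated DeDuplicator release for Heritrix 3.0.0, version 3.0.0-SNAPSHOT-20100727, has been published. It moves to Lucene 3.0.2 and significantly improves query speed and index-building efficiency.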


DeDuplicator for Heritrix 3 - 27/07/2010


Version 3.0.0-SNAPSHOT-20100727 is now available here.

This version is compiled against Heritrix 3.0.0.

It also updates Lucene from 2.0.0 to 3.0.2. Please note that changes in the Lucene library mean that memory usage will be approximately 40% greater than before: it appears to be roughly 5 bytes per URL in the index, compared with 3.6 bytes per URL previously. Query times have, however, improved significantly and are now constant regardless of index size. For large indexes this can mean query times 10 to 30 times shorter. Building indexes is also much faster (approximately 3 to 4 times as fast).
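The per-URL figures quoted above can be turned into a rough sizing estimate. The sketch below is only back-of-the-envelope arithmetic on those two approximate numbers. The function names are hypothetical and are not part of DeDuplicator or Lucene, and real memory usage will depend on the JVM and on the contents of the index.

```python
# Approximate per-URL memory figures taken from the release note.
BYTES_PER_URL_OLD = 3.6  # previous release (Lucene 2.0.0)
BYTES_PER_URL_NEW = 5.0  # this release (Lucene 3.0.2)


def estimated_index_memory_bytes(url_count, bytes_per_url=BYTES_PER_URL_NEW):
    """Rough memory estimate for an index holding url_count URLs.

    Hypothetical helper for illustration only; it simply multiplies
    the URL count by the quoted per-URL figure.
    """
    return url_count * bytes_per_url


def relative_increase(old=BYTES_PER_URL_OLD, new=BYTES_PER_URL_NEW):
    """Fractional growth in per-URL memory between the two releases."""
    return (new - old) / old


if __name__ == "__main__":
    urls = 100_000_000  # assumed index size, chosen only as an example
    new_mb = estimated_index_memory_bytes(urls) / 1e6
    old_mb = estimated_index_memory_bytes(urls, BYTES_PER_URL_OLD) / 1e6
    print(f"old: ~{old_mb:.0f} MB, new: ~{new_mb:.0f} MB")
    print(f"increase: ~{relative_increase():.0%}")
```

With these inputs the relative increase comes out at about 39%, which is consistent with the "approximately 40%" quoted in the note.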

Currently the DeDupFetchHTTP processor has not been converted.

This release heralds the end of the existing DeDuplicator, built against Heritrix 1.14. One final release (1.0.0) will follow soon with some accumulated bug fixes. A release candidate is available here.