heritrix3
javaite
Articles in this column
How to install heritrix3
Use svn to check out the project from sourceforge.net at https://archive-crawler.svn.sourceforge.net/svnroot/archive-crawler/trunk/heritrix3. Especially if you're customizing Heritrix (as seem...
Original · 2012-12-09 12:11:02 · 155 views · 0 comments
org.archive.crawler.framework.ToeThread
1. controller.getFetchChain().process(curi, this);
   1.1. org.archive.crawler.prefetch.Preselector
   1.2. org.archive.crawler.prefetch.PreconditionEnforcer
   1.3. org.archive.modules.fetcher.FetchD...
Original · 2012-12-17 23:15:41 · 165 views · 0 comments
org.archive.modules.deciderules.DecideRuleSequence
Call sequence:
ToeThread.run()
ProcessorChain.process(CrawlURI curi, ChainStatusReceiver thread)
Processor.process(CrawlURI curi)
Scoper.isInScope(CrawlURI caUri)
// foreach getRules()
DecideResult r = rule.de...
Original · 2012-12-17 17:34:34 · 136 views · 0 comments
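The excerpt above is cut off just as the rule loop begins, so the following is only a minimal sketch of how a sequence of decide rules can be folded into one verdict. It assumes the common "last rule with an opinion wins" convention, where a rule that returns NONE leaves the running result untouched. The class `DecideSequenceSketch`, the `Decision` enum, and the three example rules are hypothetical stand-ins written for illustration; they are not Heritrix's own classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class DecideSequenceSketch {
    /** Hypothetical stand-in for a rule verdict. */
    enum Decision { ACCEPT, REJECT, NONE }

    /** Applies every rule in order; the last non-NONE verdict wins (assumed convention). */
    static Decision decide(List<Function<String, Decision>> rules, String uri) {
        Decision result = Decision.NONE;
        for (Function<String, Decision> rule : rules) {
            Decision r = rule.apply(uri);
            if (r != Decision.NONE) {
                result = r; // a later rule overrides an earlier one
            }
        }
        return result;
    }

    /** Illustrative rules: reject everything, accept one host, then reject .exe files. */
    static List<Function<String, Decision>> exampleRules() {
        List<Function<String, Decision>> rules = new ArrayList<>();
        rules.add(uri -> Decision.REJECT);
        rules.add(uri -> uri.startsWith("http://foo.org/") ? Decision.ACCEPT : Decision.NONE);
        rules.add(uri -> uri.endsWith(".exe") ? Decision.REJECT : Decision.NONE);
        return rules;
    }

    public static void main(String[] args) {
        System.out.println(decide(exampleRules(), "http://foo.org/a.html"));
        System.out.println(decide(exampleRules(), "http://foo.org/a.exe"));
    }
}
```

Starting with a blanket reject and then carving out exceptions keeps each rule small, at the cost of making rule order significant.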
Processor
When a URI is crawled, a ToeThread will execute a series of processors on it. The processors are split into 5 distinct chains that are executed in sequence:
- Pre-fetch processing chain
- Fetch pro...
Original · 2012-12-11 22:01:34 · 162 views · 0 comments
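The idea of running chains of processors in sequence over one URI can be sketched as below. This is a simplified illustration, not Heritrix's implementation: `ChainSketch`, `Step`, `Chain`, and the step bodies are hypothetical, and only the two chains named in the excerpt are modeled because the rest of the list is elided.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ChainSketch {
    /** Hypothetical stand-in for a processor: one unit of work on a URI. */
    interface Step {
        void process(List<String> log, String uri);
    }

    /** A named chain that runs its steps in the order given. */
    static final class Chain {
        final String name;
        final List<Step> steps;

        Chain(String name, Step... steps) {
            this.name = name;
            this.steps = Arrays.asList(steps);
        }
    }

    /** Runs every chain in sequence on one URI, as a worker thread would. */
    static List<String> crawlOne(String uri, List<Chain> chains) {
        List<String> log = new ArrayList<>();
        for (Chain chain : chains) {
            log.add("enter " + chain.name);
            for (Step step : chain.steps) {
                step.process(log, uri);
            }
        }
        return log;
    }

    /** Builds the two chains named in the excerpt, with illustrative steps. */
    static List<String> demo() {
        Chain preFetch = new Chain("pre-fetch", (log, uri) -> log.add("precheck " + uri));
        Chain fetch = new Chain("fetch", (log, uri) -> log.add("fetch " + uri));
        return crawlOne("http://foo.org/bar/", Arrays.asList(preFetch, fetch));
    }

    public static void main(String[] args) {
        for (String line : demo()) {
            System.out.println(line);
        }
    }
}
```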
crawler-beans.cxml
1. CrawlMetadata: including identification of crawler/operator
   org.archive.modules.CrawlMetadata: Basic crawl metadata, as consulted by functional modules and recorded in ARCs/WARCs....
Original · 2012-12-11 14:06:41 · 186 views · 0 comments
Mirroring HTML Files Only
You would like to save the crawled files in a file/directory format instead of saving them in WARC files. First, create a job with a single seed, http://foo.org/bar/. Configure the warcWriter bean...
Original · 2012-12-11 08:10:31 · 123 views · 0 comments
hbase-writer
A Hadoop HBase WriterPool implementation for the Heritrix crawler
Original · 2012-12-10 23:59:27 · 159 views · 0 comments
org.archive.crawler.restlet.JobResource
1. build: validateConfiguration()
2. launch: launch() starts a new Thread, CrawlController.requestCrawlStart(), getFrontier().run();
3. pause: getCrawlController().requestCrawlPause()
4. unpause: getC...
Original · 2012-12-09 23:30:49 · 144 views · 0 comments
org.archive.crawler.Heritrix
1. Ensure Java 1.6+ is in use before hitting a later cryptic error.
2. Set some system properties early: ignoredSchemes, maxFormSize.
3. Parse command-line options.
4. DEFAULTS until changed by cmd-line op...
Original · 2012-12-09 22:26:35 · 216 views · 0 comments
A Quick Guide to Running Your First Crawl Job
The Main Console page is displayed after you have installed Heritrix and logged into the WUI. Enter the name of the new job in the text box with the "Create new job with recommended starting confi...
Original · 2012-12-09 16:21:09 · 148 views · 0 comments
org.archive.modules.extractor.Hop
The kind of "hop" from one URI to another. Each hop type can be represented by a single character; strings of these characters can appear in logs. E.g., "LLLX" means that a URI was t...
Original · 2012-12-20 21:41:18 · 117 views · 0 comments
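A hop path such as "LLLX" can be expanded one character at a time, as in this small sketch. The letter-to-name table is an assumption based on commonly documented Heritrix hop codes (L, E, X, P, R) and is not taken from the truncated excerpt; `HopPathDecoder` and its method are hypothetical helpers, not part of Heritrix.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HopPathDecoder {
    // Assumed mapping from hop character to a readable name.
    private static final Map<Character, String> HOP_NAMES = new HashMap<>();
    static {
        HOP_NAMES.put('L', "navlink");
        HOP_NAMES.put('E', "embed");
        HOP_NAMES.put('X', "speculative");
        HOP_NAMES.put('P', "prerequisite");
        HOP_NAMES.put('R', "redirect");
    }

    /** Expands each character of a hop path into its name, in order. */
    static List<String> decode(String hopPath) {
        List<String> hops = new ArrayList<>();
        for (char c : hopPath.toCharArray()) {
            String name = HOP_NAMES.get(c);
            // Characters outside the assumed table are reported, not dropped.
            hops.add(name != null ? name : "unknown(" + c + ")");
        }
        return hops;
    }

    public static void main(String[] args) {
        System.out.println(decode("LLLX")); // prints [navlink, navlink, navlink, speculative]
    }
}
```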