【Heritrix Basics Tutorial, Part 3】The Basic Architecture of Heritrix


Heritrix can be divided into four major modules:

1. The controller: CrawlController

2. The list of URIs awaiting processing: the Frontier

3. The thread pool: ToeThreads

4. The processors for each processing step:

(1) Pre-fetch processing chain: mainly handles DNS lookups, robots.txt, authentication, and crawl-scope checks.
(2) Fetch processing chain: the fetch processors; each supported protocol is backed by its own class, e.g. FetchHTTP.
(3) Extractor processing chain: the content extractors, used to extract links from pages.
(4) Write/index processing chain: writes the fetched files into archive files, and sometimes builds indexes as well.
(5) Post-processing chain: updates the crawl state and checks whether the links extracted from the fetched page are within the crawl scope.


Attached below is the relevant section of the official documentation:

4. Overview of the crawler

The Heritrix Web Crawler is designed to be modular. Which modules to use can be set at runtime from the user interface. Our hope is that if you want the crawler to behave differently from the default, it should only be a matter of writing a new module as a replacement for, or in addition to, the modules shipped with the crawler.

The rest of this document assumes you have a basic understanding of how to run a crawl (see: [Heritrix User Guide]). Since the crawler is written in the Java programming language, you also need a fairly good understanding of Java.

The crawler consists of core classes and pluggable modules. The core classes can be configured, but not replaced. The pluggable classes can be substituted by altering the configuration of the crawler. A set of basic pluggable classes are shipped with the crawler, but if you have needs not met by these classes you could write your own.


Figure 1. Crawler overview


4.1. The CrawlController

The CrawlController collects all the classes which cooperate to perform a crawl, provides a high-level interface to the running crawl, and executes the "master thread" which doles out URIs from the Frontier to the ToeThreads. As the "global context" for a crawl, subcomponents will usually reach each other through the CrawlController.
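
As a rough illustration of how these pieces are wired together when embedding the crawler in your own code: under Heritrix 1.x the usual pattern is to load the crawl order file through a settings handler, hand it to a CrawlController, and request that the crawl start. The class and method names below (XMLSettingsHandler, initialize, requestCrawlStart) follow that 1.x API as commonly shown in tutorials; treat the snippet as a hedged sketch rather than a guaranteed, version-exact recipe, and substitute whatever order file your installation uses for order.xml.

    import java.io.File;

    import org.archive.crawler.framework.CrawlController;
    import org.archive.crawler.settings.XMLSettingsHandler;

    public class StartCrawl {
        public static void main(String[] args) throws Exception {
            // Load the crawl configuration ("order" file) that defines the scope,
            // the processor chains, the Frontier implementation, and so on.
            XMLSettingsHandler settings = new XMLSettingsHandler(new File("order.xml"));
            settings.initialize();

            // The CrawlController wires the Frontier, the ToeThreads and the
            // processor chains together and exposes the crawl lifecycle.
            CrawlController controller = new CrawlController();
            controller.initialize(settings);
            controller.requestCrawlStart(); // returns quickly; the crawl runs on the ToeThreads
        }
    }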

4.2. The Frontier

The Frontier is responsible for handing out the next URI to be crawled. It is responsible for maintaining politeness, that is making sure that no web server is crawled too heavily. After a URI is crawled, it is handed back to the Frontier along with any newly discovered URIs that the Frontier should schedule for crawling.

It is the Frontier which keeps the state of the crawl. This includes, but is not limited to:

  • What URIs have been discovered

  • What URIs are being processed (fetched)

  • What URIs have been processed

The Frontier implements the Frontier interface and can be replaced by any Frontier that implements this interface. It should be noted though that writing a Frontier is not a trivial task.

The Frontier relies on the behavior of at least the following external processors: PreconditionEnforcer, LinksScoper and the FrontierScheduler (see below for more on each of these Processors). The PreconditionEnforcer makes sure DNS and robots.txt are checked ahead of any fetching. LinksScoper tests if we are interested in a particular URL -- whether the URL is 'within the crawl scope' and, if so, what our level of interest in the URL is, i.e. the priority with which it should be fetched. The FrontierScheduler adds ('schedules') URLs to the Frontier for crawling.
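
To make this division of labour concrete, here is a stripped-down sketch of a frontier's responsibilities. It is deliberately not the real org.archive.crawler.framework.Frontier interface (which is much larger); the three methods simply mirror what the text above describes: hand out the next URI, accept newly discovered URIs, and record completed ones.

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;

    /** Simplified illustration of a frontier's responsibilities (not Heritrix's Frontier interface). */
    interface SimpleFrontier {
        String next();             // hand the next URI to a worker thread (null if nothing is ready)
        void schedule(String uri); // accept a newly discovered URI for later crawling
        void finished(String uri); // record that a URI has been processed
    }

    /** Naive in-memory implementation: a FIFO queue plus "seen/done" bookkeeping. */
    class InMemoryFrontier implements SimpleFrontier {
        private final Queue<String> pending = new ArrayDeque<>();
        private final Set<String> seen = new HashSet<>();
        private final Set<String> done = new HashSet<>();

        public synchronized void schedule(String uri) {
            if (seen.add(uri)) {   // only URIs discovered for the first time are queued
                pending.add(uri);
            }
        }

        public synchronized String next() {
            return pending.poll(); // a real frontier would also enforce per-host politeness here
        }

        public synchronized void finished(String uri) {
            done.add(uri);         // the frontier owns the crawl state: discovered / in progress / done
        }
    }

A production frontier additionally has to persist this state and group URIs into per-host queues so that politeness delays can be enforced, which is why writing one is not a trivial task.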

4.3. ToeThreads

The Heritrix web crawler is multithreaded. Every URI is handled by its own thread called a ToeThread. A ToeThread asks the Frontier for a new URI, sends it through all the processors and then asks for a new URI.
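
In code, that loop might look like the following sketch. It reuses the hypothetical SimpleFrontier from the previous example and is not Heritrix's actual ToeThread class:

    import java.util.List;
    import java.util.function.Consumer;

    /** Simplified worker loop: ask the frontier for a URI, run it through the
     *  processors, report back, and repeat. (Illustrative only.) */
    class SimpleToeThread implements Runnable {
        private final SimpleFrontier frontier;
        private final List<Consumer<String>> processors; // each processor handles one URI in turn

        SimpleToeThread(SimpleFrontier frontier, List<Consumer<String>> processors) {
            this.frontier = frontier;
            this.processors = processors;
        }

        public void run() {
            String uri;
            while ((uri = frontier.next()) != null) {
                for (Consumer<String> processor : processors) {
                    processor.accept(uri);  // pre-fetch, fetch, extract, write, post-process
                }
                frontier.finished(uri);     // hand the URI back to the frontier
            }
        }
    }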

4.4. Processors

Processors are grouped into processor chains (Figure 2, “Processor chains”). Each chain does some processing on a URI. When a Processor is finished with a URI the ToeThread sends the URI to the next Processor until the URI has been processed by all the Processors. A processor has the option of telling the URI to skip to a particular chain. Also if a processor throws a fatal error, the processing skips to the Post-processing chain.
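
The dispatch logic can be sketched roughly as below. The types here are hypothetical (they are not Heritrix's Processor or ProcessorChain classes); the sketch only shows the two behaviours described above: a processor may ask to jump to a later chain, and an unhandled error sends the URI straight to the post-processing chain.

    import java.util.List;
    import java.util.Map;

    /** Illustration of how a URI moves through named processor chains (hypothetical types). */
    class ChainRunner {
        interface Step {
            /** Process the URI; return the name of a chain to skip to, or null to continue. */
            String process(String uri) throws Exception;
        }

        static void runChains(String uri, List<String> chainOrder, Map<String, List<Step>> chains) {
            int i = 0;
            while (i < chainOrder.size()) {
                try {
                    for (Step step : chains.get(chainOrder.get(i))) {
                        String skipTo = step.process(uri);
                        if (skipTo != null) {       // processor asked to jump to another chain
                            i = chainOrder.indexOf(skipTo) - 1;
                            break;
                        }
                    }
                } catch (Exception fatal) {
                    int post = chainOrder.indexOf("post-processing");
                    if (i >= post) {
                        break;                      // error inside post-processing: give up on this URI
                    }
                    i = post - 1;                   // fatal error: skip ahead to the post-processing chain
                }
                i++;                                // otherwise continue with the next chain
            }
        }
    }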

Figure 2. Processor chains


The tasks performed by the different processing chains are as follows:

4.4.1. Pre-fetch processing chain

The first chain is responsible for investigating if the URI could be crawled at this point. That includes checking if all preconditions are met (DNS-lookup, fetching robots.txt, authentication). It is also possible to completely block the crawling of URIs that have not passed through the scope check.

In the Pre-fetch processing chain the following processors should be included (or replacement modules that perform similar operations):

  • Preselector

    Last check if the URI should indeed be crawled. Can for example recheck scope. Useful if scope rules have been changed after the crawl starts. The scope is usually checked by the LinksScoper, before new URIs are added to the Frontier to be crawled. If the user changes the scope limits, it will not affect already queued URIs. By rechecking the scope at this point, you make sure that only URIs that are within current scope are being crawled.

  • PreconditionEnforcer

    Ensures that all preconditions for crawling a URI have been met. These currently include verifying that DNS and robots.txt information has been fetched for the URI. (A simplified sketch of these pre-fetch checks follows this list.)
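
Below is a minimal sketch of what these two checks amount to. It is not the real Preselector or PreconditionEnforcer code; the scope predicate and the DNS probe are stand-ins for the checks those processors perform (the real PreconditionEnforcer also schedules the robots.txt fetch as a prerequisite).

    import java.net.InetAddress;
    import java.net.URI;
    import java.net.UnknownHostException;
    import java.util.function.Predicate;

    /** Hypothetical pre-fetch checks, loosely modelled on Preselector / PreconditionEnforcer. */
    class PreFetchChecks {

        /** Preselector-like step: re-check the URI against the current crawl scope. */
        static boolean inScope(URI uri, Predicate<URI> currentScope) {
            return currentScope.test(uri); // scope rules may have changed since the URI was queued
        }

        /** PreconditionEnforcer-like step: require that DNS resolves before fetching. */
        static boolean dnsResolved(URI uri) {
            try {
                InetAddress.getByName(uri.getHost());
                return true;
            } catch (UnknownHostException e) {
                return false;              // precondition not met: defer or drop the fetch
            }
        }

        public static void main(String[] args) throws Exception {
            URI uri = new URI("http://crawler.archive.org/index.html");
            Predicate<URI> scope = u -> u.getHost() != null && u.getHost().endsWith("archive.org");
            System.out.println("in scope: " + inScope(uri, scope));
            System.out.println("dns ok:   " + dnsResolved(uri));
        }
    }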

4.4.2. Fetch processing chain

The processors in this chain are responsible for getting the data from the remote server. There should be one processor for each protocol that Heritrix supports: e.g. FetchHTTP.
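
For orientation, the essence of a fetch processor can be sketched with plain java.net classes. This is not the real FetchHTTP (which also handles politeness delays, request headers, retries and mid-fetch abort), just an illustration of the step's job: pull the bytes for the URI so that later chains can work on them.

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    /** Minimal illustration of an HTTP fetch step (not Heritrix's FetchHTTP). */
    class SimpleFetch {
        static byte[] fetch(String uri) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(uri).openConnection();
            conn.setRequestProperty("User-Agent", "example-crawler"); // identify the crawler politely
            conn.setConnectTimeout(10_000);
            conn.setReadTimeout(30_000);
            try (InputStream in = conn.getInputStream();
                 ByteArrayOutputStream out = new ByteArrayOutputStream()) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);          // buffer the response body for the later chains
                }
                return out.toByteArray();
            } finally {
                conn.disconnect();
            }
        }
    }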

4.4.3. Extractor processing chain

At this point the content of the document referenced by the URI is available and several processors will in turn try to get new links from it.
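
As a toy example of what an extractor does, the sketch below pulls href values out of HTML with a regular expression. Heritrix's real extractors (for HTML, CSS, JavaScript and other content types) are considerably more thorough than this.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /** Naive link-extraction sketch (real extractors parse the content far more carefully). */
    class SimpleExtractor {
        private static final Pattern HREF =
                Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

        static List<String> extractLinks(String html) {
            List<String> links = new ArrayList<>();
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                links.add(m.group(1)); // discovered URIs are scoped and scheduled in later chains
            }
            return links;
        }
    }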

4.4.4. Write/index processing chain

This chain is responsible for writing the data to archive files. Heritrix comes with an ARCWriterProcessor which writes to the ARC format. New processors could be written to support other formats and even create indexes.
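
Purely for illustration, a write-chain step could look like the sketch below, which appends each fetched record and a line of metadata to a single file. The on-disk layout here is invented; the real ARCWriterProcessor produces the standardized ARC container format.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.time.Instant;

    /** Toy write-chain step: append fetched content plus minimal metadata to a file.
     *  (Not the ARC format that Heritrix actually writes.) */
    class SimpleWriter {
        private final Path archive;

        SimpleWriter(Path archive) {
            this.archive = archive;
        }

        void write(String uri, byte[] body) throws IOException {
            String header = uri + " " + Instant.now() + " " + body.length + "\n";
            Files.write(archive, header.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            Files.write(archive, body, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            Files.write(archive, "\n".getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }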

4.4.5. Post-processing chain

A URI should always pass through this chain even if a processor earlier in the processing decided not to crawl the URI. The post-processing chain must contain the following processors (or replacement modules that perform similar operations):

  • CrawlStateUpdater

    Updates the per-host information that may have been affected by the fetch. This is currently robots and IP address info.

  • LinksScoper

    Checks all links extracted from the current download against the crawl scope. Those that are out of scope are discarded. Logging of discarded URLs can be enabled.

  • FrontierScheduler

    'Schedules' any URLs stored as CandidateURIs found in the current CrawlURI with the frontier for crawling. Also schedules prerequisites if any. (A simplified sketch of this scope-check-and-schedule step follows this list.)
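
Taken together, the last two processors amount to the step sketched below: drop extracted links that fall outside the crawl scope and hand the remaining ones to the frontier. It reuses the hypothetical SimpleFrontier from the earlier sketch and is not the real LinksScoper or FrontierScheduler.

    import java.util.List;
    import java.util.function.Predicate;

    /** Illustration of the LinksScoper + FrontierScheduler behaviour (hypothetical types). */
    class ScopeAndSchedule {
        static void process(List<String> extractedLinks, Predicate<String> scope, SimpleFrontier frontier) {
            for (String link : extractedLinks) {
                if (scope.test(link)) {
                    frontier.schedule(link); // in scope: queue it for crawling
                }
                // out of scope: discard (optionally log the rejection)
            }
        }
    }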


Reposted from: https://www.cnblogs.com/jinhong-lu/p/4559449.html

Heritrix is the Internet Archive's (IA) open-source, extensible, whole-web, archival web crawler project. The project began in early 2003; IA's goal was to build a dedicated crawler that archives resources on the web and forms the basis of a digital library of the web. Over the past six years IA has accumulated 400 TB of data.

IA wants its crawler to support the following kinds of crawls:

  • Broad crawls: crawl sites at higher bandwidth.

  • Focused crawls: concentrate on selected topics.

  • Continuous crawls: crawl not only the current pages but also pages that are updated later.

  • Experimental crawls: experiment with crawling techniques, to decide what to crawl and to analyze crawl results across different protocols.

Heritrix's home page is http://crawler.archive.org.

Heritrix is a crawler framework into which interchangeable components can be plugged. Its execution proceeds recursively, in roughly the following steps:

1. Choose a URI from the queue of scheduled URIs.
2. Fetch the URI.
3. Analyze and archive the result.
4. Select the newly discovered URIs of interest and add them to the scheduling queue.
5. Mark the URI as processed.

Heritrix has three main components: the scope, the frontier, and the processor chains.

  • Scope: decides, according to configured rules, which URIs get enqueued.

  • Frontier: tracks which scheduled URIs are still to be collected and which have already been collected, selects the next URI, and weeds out URIs that have already been processed.

  • Processor chains: contain the processors that fetch URIs, analyze the results, and pass them back to the frontier.

Heritrix's remaining components include:

  • Web administration console: largely a self-contained web application with an embedded Java HTTP server; operators drive the crawler by issuing crawler commands through the console.

  • Crawler command processing: carries enough information to create the URIs to be crawled.

  • Server cache: holds persistent per-server information that the crawl components can look up at any time, including IP addresses, fetch history, and robots policies.

The processor chains are:

  • Pre-fetch chain: mainly preparatory work, for example delaying or re-ordering processing, or vetoing subsequent operations.

  • Fetch chain: obtains the resource, performs DNS resolution, and fills in the request and response.

  • Extract chain: once fetching is complete, extracts the interesting HTML and JavaScript, which usually contain new, relevant URIs; at this stage URIs are only discovered, not evaluated.

  • Write chain: stores the crawl results (the returned content and extracted features), filtering before storage.

  • Post-processing chain: final maintenance, for example testing URIs that are out of scope and reporting back to the frontier.

Heritrix 1.0.0 includes the following key features:

1. A single crawler can continuously and recursively crawl multiple independent sites.
2. Crawls start from a provided seed and collect the exact URIs and exact hosts within a site.
3. Processing is mainly breadth-first.
4. The main components are efficient and extensible.
5. Good configurability, including:
   a. configurable locations for output logs, archive files and temporary files;
   b. configurable maximum download bytes, maximum number of downloaded documents, and maximum crawl time;
   c. configurable number of worker threads;
   d. configurable upper bound on the bandwidth used;
   e. settings can be revisited and adjusted some time after they are made;
   f. configurable filtering mechanisms, such as regular expressions and URI path-depth selection.

Heritrix's limitations:

1. Crawler instances are single instances and cannot cooperate with one another.
2. It requires complex operation even with limited machine resources.
3. It is officially supported and tested only on Linux.
4. Each crawler works in isolation; there is no revisiting of updated content.
5. Recovery from hardware and system failures is poor.
6. Little time has been spent on performance optimization.