Translation of "An Introduction to Heritrix" (continued)

Translated and posted on 2010-06-19 21:36:00

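This post describes several core components of the crawler in detail: the Web Administrative Console, the CrawlOrder configuration object, the CrawlController global context, and the Scope and Frontier components. It covers what each one does and how they work together to crawl pages and manage crawl resources.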

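The high in Shanghai today was above 35 degrees Celsius, and translating in this heat is not easy. The weather already had me irritable, and the patient deliberation that translation demands only made the restlessness worse. I managed to get through a few short paragraphs, and I am still not entirely happy with them...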

----------------------------------------------------------------------------------------------------------------------------

Key components:

In this section we describe each of the components shown in Figure 1 in more detail.

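In many respects the Web Administrative Console is a standalone web application, hosted by an embedded Jetty Java HTTP server. Its web pages let the operator choose a crawl's components and parameters, and those choices make up a CrawlOrder, a configuration object that also has an external XML representation.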
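To make the idea of "a configuration object that also has an external XML representation" concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the class layout, the element names (`crawl-order`, `component`, `parameter`) and the example values are invented, and this is not the format Heritrix itself uses.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class CrawlOrder:
    """Toy configuration object: chosen components plus their parameters."""
    components: dict = field(default_factory=dict)  # role -> implementation name
    parameters: dict = field(default_factory=dict)  # parameter name -> value (text)

    def to_xml(self) -> str:
        # Element and attribute names are invented for this illustration.
        root = ET.Element("crawl-order")
        for role, impl in self.components.items():
            ET.SubElement(root, "component", role=role, impl=impl)
        for name, value in self.parameters.items():
            ET.SubElement(root, "parameter", name=name).text = str(value)
        return ET.tostring(root, encoding="unicode")

    @classmethod
    def from_xml(cls, text: str) -> "CrawlOrder":
        root = ET.fromstring(text)
        components = {e.get("role"): e.get("impl") for e in root.findall("component")}
        parameters = {e.get("name"): e.text for e in root.findall("parameter")}
        return cls(components, parameters)


# Hypothetical choices an operator might make in the console.
order = CrawlOrder(
    components={"scope": "ToyScope", "frontier": "ToyFrontier"},
    parameters={"max-threads": "10"},
)
restored = CrawlOrder.from_xml(order.to_xml())
```

The point of the sketch is the round trip: the same object can be built in memory from the operator's choices or read back from its XML form.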

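A crawl is started by passing this CrawlOrder to the CrawlController, the component that instantiates all configured crawl components and holds references to them. The CrawlController is the crawl's global context: all subcomponents reach one another through it. The Web Administrative Console also controls the crawl through it.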
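The "global context" role can be sketched as follows. This is a simplified illustration, not Heritrix's actual classes: the `REGISTRY` lookup, the `pause` method, the dict-shaped order and the seed URL are all assumptions made for the example.

```python
class Component:
    """Base class: every component keeps a reference to its controller."""

    def __init__(self, controller):
        self.controller = controller


class Scope(Component):
    def seeds(self):
        return ["http://example.com/"]  # placeholder seed for the sketch


class Frontier(Component):
    def __init__(self, controller):
        super().__init__(controller)
        self.pending = []

    def load_seeds(self):
        # Reach the Scope through the shared controller, not a direct reference.
        self.pending.extend(self.controller.components["scope"].seeds())


# Hypothetical registry: implementation name -> class.
REGISTRY = {"Scope": Scope, "Frontier": Frontier}


class CrawlController:
    """Instantiates the configured components and holds references to them."""

    def __init__(self, order):
        # `order` is a plain dict of role -> implementation name in this sketch.
        self.paused = False
        self.components = {}
        for role, impl in order.items():
            self.components[role] = REGISTRY[impl](self)

    def pause(self):
        # A console would steer the crawl through calls like this one.
        self.paused = True


controller = CrawlController({"scope": "Scope", "frontier": "Frontier"})
controller.components["frontier"].load_seeds()
```

Because each component only knows the controller, swapping one implementation for another means changing the order, not the other components.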

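The CrawlOrder contains enough information to create the crawl scope component (the Scope). The Scope seeds the Frontier with the initial URIs, and it is consulted to decide which URIs discovered later should also be queued.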
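The two jobs of the Scope described above (seeding, then answering "is this URI in scope?") might look roughly like this. The rule used here, accepting only URIs on the same hosts as the seeds, is just one possible scope rule chosen for the example, and `HostScope` is a made-up name.

```python
from urllib.parse import urlsplit


class HostScope:
    """Toy scope: a URI is in scope if its host matches one of the seed hosts."""

    def __init__(self, seeds):
        self.seeds = list(seeds)
        self.hosts = {urlsplit(s).hostname for s in self.seeds}

    def seed(self, frontier):
        # Hand the initial URIs to the frontier (a plain list in this sketch).
        for uri in self.seeds:
            frontier.append(uri)

    def accepts(self, uri):
        # Consulted for every URI discovered later in the crawl.
        return urlsplit(uri).hostname in self.hosts


scope = HostScope(["http://example.com/"])
frontier = []
scope.seed(frontier)

discovered = ["http://example.com/about", "http://other.example.org/"]
frontier.extend(u for u in discovered if scope.accepts(u))
```

After this runs, the frontier holds the seed and the same-host link, while the off-site link has been dropped.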

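The Frontier is responsible for ordering the URIs to be visited, ensuring that URIs are not revisited unnecessarily, and moderating the crawler's visits to any single remote site. It does this by maintaining a series of internal queues of URIs to be visited, together with a list of all URIs that have already been visited or queued.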

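URIs are released from the queues for fetching only in a manner compatible with the configured politeness policy. The default Frontier implementation provides a primarily breadth-first, order-of-discovery policy for choosing URIs to process, with an option to prefer finishing sites already in progress over beginning new sites. Other Frontier implementations are possible.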
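The Frontier's bookkeeping (queues of URIs to visit, a record of everything already seen, and a politeness rule) can be sketched like this. It is a deliberately small model and not the default Heritrix Frontier: the per-host queues, the fixed delay, and the `now` argument (passed in so the example needs no real clock) are assumptions for the illustration.

```python
from collections import deque
from urllib.parse import urlsplit


class ToyFrontier:
    def __init__(self, delay_seconds=2.0):
        self.delay = delay_seconds
        self.queues = {}   # host -> deque of URIs, in order of discovery
        self.seen = set()  # every URI already visited or queued
        self.next_ok = {}  # host -> earliest time the host may be fetched again

    def schedule(self, uri):
        """Queue a URI unless it was already seen. Returns True if queued."""
        if uri in self.seen:
            return False
        self.seen.add(uri)
        host = urlsplit(uri).hostname
        self.queues.setdefault(host, deque()).append(uri)
        return True

    def next(self, now):
        """Return a URI whose host may be contacted at time `now`, else None."""
        for host, queue in self.queues.items():
            if queue and self.next_ok.get(host, 0.0) <= now:
                self.next_ok[host] = now + self.delay
                return queue.popleft()
        return None


f = ToyFrontier(delay_seconds=2.0)
f.schedule("http://a.example/1")
f.schedule("http://a.example/2")
f.schedule("http://b.example/1")
duplicate_queued = f.schedule("http://a.example/1")  # already seen, so not queued

first = f.next(now=0.0)    # a.example is free, so its oldest URI is released
second = f.next(now=0.0)   # a.example must wait, so b.example is served
third = f.next(now=0.0)    # nothing is allowed yet
fourth = f.next(now=2.0)   # the delay for a.example has elapsed
```

Within each host the URIs leave in the order they were discovered, and the delay is what keeps the crawler from hitting any one site too often.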

-----------------------------------------------------------English original--------------------------------------------------------

Key Components

In this section we go into more detail on each of the components featured in Figure 1.

The Web Administrative Console is in many ways a standalone web application, hosted by the embedded Jetty Java HTTP server. Its web pages allow the operator to choose a crawl's components and parameters by composing a CrawlOrder, a configuration object that also has an external XML representation.

A crawl is initiated by passing this CrawlOrder to the CrawlController, a component which instantiates and holds references to all configured crawl components. The CrawlController is the crawl's global context: all subcomponents can reach each other through it. The Web Administrative Console controls the crawl through the CrawlController.

The CrawlOrder contains sufficient information to create the Scope. The Scope seeds the Frontier with initial URIs and is consulted to decide which later-discovered URIs should also be scheduled.

The Frontier has responsibility for ordering the URIs to be visited, ensuring URIs are not revisited unnecessarily, and moderating the crawler's visits to any one remote site. It achieves these goals by maintaining a series of internal queues of URIs to be visited, and a list of all URIs already visited or queued. URIs are only released from queues for fetching in a manner compatible with the configured politeness policy. The default provided Frontier implementation offers a primarily breadth-first, order-of-discovery policy for choosing URIs to process, with an option to prefer finishing sites in progress to beginning new sites. Other Frontier implementations are possible.