When reposting this article, please credit the source: http://blog.youkuaiyun.com/pwlazy
Introduction
The open-source Nutch search engine consists, very roughly, of three components:
- the crawler, which discovers and retrieves web pages
- the WebDB, a custom database that stores known URLs and fetched page contents
- the indexer, which dissects pages and builds keyword-based indexes from them
This document describes the operation of the crawler. We begin with theory and drill down into the details needed to create a customized crawler.
Nutch is implemented in Java, so basic knowledge of the language is assumed.