Crawling Website Content with Apache Nutch

税码行者

Published 2025-02-12 16:38:02

License: CC 4.0 BY-SA

Category: Apache Nutch & Manticoresearch integration. Tags: apache, web crawler

Article link: https://blog.youkuaiyun.com/taxCode/article/details/145590514

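Nutch requires two configuration changes before it can crawl a website:

Customize the crawl properties, at minimum giving your crawler a name so that external servers can identify it
Set a seed list of URLs to crawl

Customize the crawl properties

The default crawl properties can be viewed and edited in conf/nutch-default.xml; most of them can be used without modification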

<!-- nutch-default.xml -->

<!-- HTTP properties -->

<property>
<!-- Give the crawler a name so that external servers can identify it -->
  <name>http.agent.name</name>
  <value></value>
  <description>'User-Agent' name: a single word uniquely identifying your crawler.

  The value is used to select the group of robots.txt rules addressing your
  crawler. It is also sent as part of the HTTP 'User-Agent' request header.

  This property MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  Following RFC 9309 the 'User-Agent' name (aka. 'product token')
  &quot;MUST contain only uppercase and lowercase letters ('a-z' and
  'A-Z'), underscores ('_'), and hyphens ('-').&quot;

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.
  </description>
</property>

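conf/nutch-site.xml: this file is the place to add your own custom crawl properties, which override the ones in nutch-default.xml. The following example gives the crawler a name so that external servers can identify it; this setting overrides the http.agent.name value in nutch-default.xml: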

<!-- nutch-site.xml -->
<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyFirstNutchSpider</value>
  </property>
  <property>
    <name>generate.parser.text</name>
    <value>true</value>
  </property>
  <property>
    <name>generate.parser.data</name>
    <value>true</value>
  </property>
</configuration>
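
The property description quoted earlier restricts the agent name to letters, underscores, and hyphens. As a rough illustration, a candidate name could be checked before it goes into nutch-site.xml with a small shell function like the one below. The function name is_valid_agent_name and the sample names are made up for this sketch and are not part of Nutch; the check only mirrors the character rule quoted from RFC 9309.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): succeeds only when the name is
# non-empty and contains nothing but a-z, A-Z, '_' and '-'.
is_valid_agent_name() {
  case "$1" in
    '' | *[!A-Za-z_-]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Try one name that follows the rule and one that contains a space.
for name in "ExampleSpider" "Example Spider"; do
  if is_valid_agent_name "$name"; then
    echo "ok: $name"
  else
    echo "rejected: $name"
  fi
done
```

The first sample name is accepted and the second is rejected because of the space.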

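The official tutorial also requires the plugin.includes property to contain indexer-solr, because the content crawled and indexed by Nutch is pushed to the Solr search engine. If you use Elasticsearch rather than Solr as the search service, configure indexer-elastic instead. nutch-default.xml already configures a set of default plugins: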

<!-- nutch-default.xml -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  By default Nutch includes plugins to crawl HTML and various other
  document formats via HTTP/HTTPS and indexing the crawled content
  into Solr.  More plugins are available to support more indexing
  backends, to fetch ftp:// and file:// URLs, for focused crawling,
  and many other use cases.
  </description>
</property>
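
To make the swap from Solr to Elasticsearch more concrete, here is a minimal sketch that derives an overriding plugin.includes value from the default one and prints a property block that could be pasted into conf/nutch-site.xml. The variable names are illustrative, and the sketch assumes the default value is exactly the string quoted in this article; check the plugin.includes value shipped with your Nutch version before relying on it.

```shell
#!/bin/sh
# Default value as quoted from nutch-default.xml in this article.
default_plugins='protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)'

# Swap the Solr index writer for the Elasticsearch one.
elastic_plugins=$(printf '%s\n' "$default_plugins" | sed 's/indexer-solr/indexer-elastic/')

# Print a property block for the <configuration> element of nutch-site.xml.
cat <<EOF
<property>
  <name>plugin.includes</name>
  <value>$elastic_plugins</value>
</property>
EOF
```

Only the indexer entry changes; every other plugin in the list is carried over unchanged.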

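Create a URL seed list

A URL seed list is a list of websites, one per line, that Nutch will crawl
conf/regex-urlfilter.txt: this file provides regular expressions that let Nutch filter and narrow down the types of web resources to crawl and download

Create a URL seed list

Go back to the Nutch installation directory
mkdir -p urls
cd urls
touch seed.txt: create a text file with the following content (one URL per line for each site you want Nutch to crawl). Then return to the installation directory with cd .., because the later bin/nutch commands are run from there.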

seed.txt

#seed.txt
https://fgk.chinatax.gov.cn/zcfgk/c100006/listflfg.html
https://shanghai.chinatax.gov.cn/zcfw/zcfgk/ 
https://fg.taxation.cn/tax/#/home/welcome
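
The same seed file can also be written in one step. The sketch below assumes it is run from the Nutch installation directory and writes urls/seed.txt without changing directory, so that relative paths such as bin/nutch and urls keep working afterwards.

```shell
#!/bin/sh
# Assumption: the current directory is the Nutch installation directory.
mkdir -p urls

# One seed URL per line; the quoted 'EOF' keeps the shell from expanding anything.
cat > urls/seed.txt <<'EOF'
https://fgk.chinatax.gov.cn/zcfgk/c100006/listflfg.html
https://shanghai.chinatax.gov.cn/zcfw/zcfgk/
https://fg.taxation.cn/tax/#/home/welcome
EOF

# Report how many seed URLs were written.
echo "seed URLs: $(grep -c '^http' urls/seed.txt)"
```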

Configure the regular expression filter

Edit conf/regex-urlfilter.txt and replace the following lines

# accept anything else
+.

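with regular expressions that match the domains you want to crawl. For example, to limit the crawl to the domains in the seed list, the lines should read: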

# accept only URLs on the domains to be crawled
+^https?://fgk\.chinatax\.gov\.cn\/zcfgk\/
+^https?://fg\.taxation\.cn
+^https?://shanghai\.chinatax\.gov\.cn

Note: if you do not restrict regex-urlfilter.txt to specific domains, every domain linked from your seed URLs will be crawled as well.
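
Before starting a crawl, it can help to try the accept rules against a few sample URLs. The sketch below approximates the three rules with grep -E; Nutch itself evaluates regex-urlfilter.txt with Java regular expressions and also applies the other rules in that file in order, so treat this only as a quick sanity check. The function name url_allowed and the sample URLs that are not in the seed list are invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the URL matches one of the accept rules.
# The patterns are the three accept rules with the redundant '\/' escapes written as '/'.
url_allowed() {
  printf '%s\n' "$1" | grep -Eq \
    -e '^https?://fgk\.chinatax\.gov\.cn/zcfgk/' \
    -e '^https?://fg\.taxation\.cn' \
    -e '^https?://shanghai\.chinatax\.gov\.cn'
}

# Check one seed URL, one URL outside the allowed path, and one foreign domain.
for url in \
  "https://fgk.chinatax.gov.cn/zcfgk/c100006/listflfg.html" \
  "https://fgk.chinatax.gov.cn/other/page.html" \
  "https://example.com/"; do
  if url_allowed "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

A rule with no slash after the host name, such as the second and third ones, also matches any host name that merely begins with that text; appending / to the pattern tightens it.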

Seeding the crawldb with a list of URLs

1. Bootstrapping from the initial seed list

bin/nutch inject crawl/crawldb urls

We now have a web database (a new crawl folder is created in the current directory) that contains your URLs, none of which have been fetched yet
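
One way to confirm that the seeds were injected is to ask Nutch for crawldb statistics with its readdb command. The wrapper below is only a sketch: the function name show_crawldb_stats and the NUTCH variable are assumptions of this example, and the default of bin/nutch presumes that the Nutch installation directory is the current directory.

```shell
#!/bin/sh
# Assumption: NUTCH points at the nutch launcher; override it if yours lives elsewhere.
NUTCH="${NUTCH:-bin/nutch}"

# Hypothetical wrapper: print summary statistics for the given crawldb.
show_crawldb_stats() {
  "$NUTCH" readdb "$1" -stats
}

# Usage, from the Nutch installation directory after the inject step:
#   show_crawldb_stats crawl/crawldb
```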

2. Fetching

To fetch, we first generate a fetch list from the database

bin/nutch generate crawl/crawldb crawl/segments

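The segments directory is created automatically under the crawl directory

This generates a fetch list for all of the pages that are due to be fetched. The fetch list is placed in a newly created segment directory, which is named after the time it was created. We save the name of this segment in the shell variable s1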

s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
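
Because segment directories are named after their creation timestamp, sorting the names alphabetically also sorts them by age, which is why tail -1 picks the newest one. The following sketch shows this with two made-up segment names in a temporary directory; it does not need Nutch, and it uses $(...) instead of backticks, which behaves the same here.

```shell
#!/bin/sh
# Build a throwaway directory with two invented, timestamp-named segments.
demo=$(mktemp -d)
mkdir -p "$demo/segments/20250212163802" "$demo/segments/20250213090000"

# Same selection as in the article: list the segments and keep the last (newest) name.
s1=$(ls -d "$demo"/segments/2* | tail -1)
echo "newest segment: $(basename "$s1")"

# Clean up the throwaway directory.
rm -r "$demo"
```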

Now we run the fetcher on this segment:

bin/nutch fetch $s1

Then we parse the entries:

bin/nutch parse $s1

When this is complete, we update the database with the results of the fetch:

bin/nutch updatedb crawl/crawldb $s1

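The database now contains updated entries for all of the initial pages, as well as new entries for the newly discovered pages linked from the initial set.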
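
The generate, fetch, parse, and updatedb steps form one crawl round, and running them again picks up the newly discovered pages. As a rough sketch, they could be wrapped in a shell function like this. The names crawl_round, last_segment, and NUTCH are assumptions made for the example and are not part of Nutch.

```shell
#!/bin/sh
# Assumption: NUTCH points at the nutch launcher (bin/nutch in the install directory).
NUTCH="${NUTCH:-bin/nutch}"

# Hypothetical helper: run one generate/fetch/parse/updatedb round.
# $1 is the crawl directory (the one holding crawldb and segments).
crawl_round() {
  crawl_dir="$1"

  # Create a new segment holding the fetch list.
  "$NUTCH" generate "$crawl_dir/crawldb" "$crawl_dir/segments" || return 1

  # Segment names are timestamps, so the newest one sorts last.
  last_segment=$(ls -d "$crawl_dir"/segments/2* | tail -1)

  # Fetch and parse the segment, then fold the results back into the crawldb.
  "$NUTCH" fetch "$last_segment" || return 1
  "$NUTCH" parse "$last_segment" || return 1
  "$NUTCH" updatedb "$crawl_dir/crawldb" "$last_segment" || return 1
}

# Usage, from the Nutch installation directory after the inject step:
#   crawl_round crawl && echo "finished segment $last_segment"
```

Stopping at the first failing command keeps a failed fetch or parse from being written back into the crawldb.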

3. Extracting the parsed content

bin/nutch readseg -dump $s1 fetch-result -nocontent -noparse

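fetch-result is a folder that I created in the current directory; the dump file is saved into it.

dump file (a text file): the metadata for each URL (title, fetch time, status, and so on).

Use the cat command to view the contents of the dump file: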