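Nutch requires two configuration changes before it can crawl a website:
- Customize the crawl properties, at a minimum giving the crawler a name so that external servers can identify it
- Set up a seed list of URLs to crawl
Customizing the crawl properties
- The default crawl properties can be viewed and edited in conf/nutch-default.xml; most of them can be used without modification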
<!-- nutch-config.xml -->
<!-- HTTP properties -->
<property>
<!-- Give the crawler a name so that external servers can identify it -->
<name>http.agent.name</name>
<value></value>
<description>'User-Agent' name: a single word uniquely identifying your crawler.
The value is used to select the group of robots.txt rules addressing your
crawler. It is also sent as part of the HTTP 'User-Agent' request header.
This property MUST NOT be empty -
please set this to a single word uniquely related to your organization.
Following RFC 9309 the 'User-Agent' name (aka. 'product token')
"MUST contain only uppercase and lowercase letters ('a-z' and
'A-Z'), underscores ('_'), and hyphens ('-')."
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
- conf/nutch-site.xml: this file is where you add your own custom crawl…
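As a minimal sketch of such an override, the fragment below sets http.agent.name in conf/nutch-site.xml. It assumes the same property layout as the default file shown above. The value ExampleOrgCrawler is a hypothetical placeholder, not a name from the original article; replace it with a single word tied to your own organization.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: site-specific overrides (illustrative sketch) -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- Hypothetical placeholder: letters, underscores, and hyphens only -->
    <value>ExampleOrgCrawler</value>
  </property>
</configuration>
```

Values set here take precedence over those in conf/nutch-default.xml, so the default file can stay unmodified.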




