Several strategies Scrapy recommends to keep a crawler from being banned

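This article shows how to counter a website's anti-crawler measures with the Scrapy framework by randomly rotating the User-Agent, and lists further countermeasures: using an IP pool, setting a request delay, and disabling cookies.
1. Rotate the User-Agent at random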

Add a new downloader middleware, rotate_useragent.py, in the same directory as the settings.py configuration file:

# -*- coding: utf-8 -*-
import random

# scrapy.contrib.downloadermiddleware was renamed; this is the current import path
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    """Downloader middleware that gives each request a random User-Agent."""

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # setdefault leaves a User-Agent already set on the request untouched
        request.headers.setdefault('User-Agent', random.choice(self.user_agent_list))

    # UA pool; more User-Agent strings: http://www.useragentstring.com/pages/useragentstring.php
    # Each entry is ONE string written across two lines (no comma after the first half).
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
        "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]

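Edit settings.py to enable the downloader middleware. The dotted path must match where rotate_useragent.py actually lives: a file placed next to settings.py in a project named WebCrawler is imported as WebCrawler.rotate_useragent, whereas the path shown here assumes the file sits in the spiders package.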

DOWNLOADER_MIDDLEWARES = {
    'WebCrawler.spiders.rotate_useragent.RotateUserAgentMiddleware': 1,
}

2. IP pool

Spread requests across several IP addresses so that no single IP is banned for requesting too often.
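
The original gives no code for this step, so the following is only a minimal sketch of one way an IP pool could be wired in, mirroring the User-Agent middleware above. The class name RandomProxyMiddleware, the PROXY_LIST setting, and the proxy addresses are all hypothetical; the one piece that comes from Scrapy itself is that a request is routed through whatever proxy URL is stored in request.meta['proxy'].

```python
import random


class RandomProxyMiddleware:
    """Hypothetical downloader middleware: picks a random proxy for each request."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("proxy list must not be empty")
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a made-up setting name for this sketch, not a Scrapy built-in.
        # Example value in settings.py (placeholder addresses):
        #   PROXY_LIST = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Scrapy sends the request through the proxy URL found in request.meta["proxy"].
        request.meta["proxy"] = random.choice(self.proxies)
```

Such a class would be registered in DOWNLOADER_MIDDLEWARES in the same way as the User-Agent middleware. The proxies themselves have to come from somewhere the crawler operator controls or rents; the sketch does not check whether a proxy is alive.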


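3. Request delay

Limiting the crawl speed both reduces the chance of being blocked by anti-crawler measures and lightens the load on the server.
Edit settings.py and add the following line: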

DOWNLOAD_DELAY = 3
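
As a hedged extension of the line above, the fragment below shows a few related Scrapy settings that are often combined with DOWNLOAD_DELAY. The setting names are Scrapy's own; the values other than DOWNLOAD_DELAY = 3 are illustrative assumptions, not recommendations from the original article.

```python
# settings.py fragment (values are illustrative)
DOWNLOAD_DELAY = 3                  # base delay, in seconds, between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True     # Scrapy's default: wait 0.5x to 1.5x DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # assumption: one request at a time per domain

# Optional: let the AutoThrottle extension adapt the delay to server latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3        # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30         # upper bound on the delay when the server is slow
```

With AutoThrottle enabled, DOWNLOAD_DELAY acts as the lower bound on the adjusted delay, so the crawler never goes faster than the fixed setting allows.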

4. Disable cookies

This prevents the site from tracking the crawler's behavior through cookies.
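
The original does not show the setting for this step. In Scrapy, cookie handling is controlled by the COOKIES_ENABLED setting, so the change would presumably be one more line in settings.py:

```python
# settings.py fragment: turn off Scrapy's cookie middleware
COOKIES_ENABLED = False
```

Sites that require a login session depend on cookies, so this only suits crawls that do not need to stay authenticated.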

Reposted from: https://my.oschina.net/u/2400083/blog/735887
