I. Anti-scraping settings
1. Lower the request frequency
Set in settings.py:
DOWNLOAD_DELAY = 3  # download delay of 3 seconds
RANDOMIZE_DOWNLOAD_DELAY = True  # actual delay is a random value between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY
2. Disable cookies
Set in settings.py:
COOKIES_ENABLED = False
3. Disguise as a random browser
Method 1: use the third-party fake-useragent library
- Pros: convenient and easy to call; no need to add a long list of user-agents to settings.py by hand
- Cons: depends on the third-party library; fetching a random user-agent often hits timeout errors, in which case no user-agent is obtained
(1) Install fake-useragent
sudo pip install fake-useragent
(2) Pick a random user-agent in a UserAgentMiddleware subclass
Set in middlewares.py:
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
from fake_useragent import UserAgent

class KaidailiUserAgentMiddleware(UserAgentMiddleware):
    # Sets a random user-agent on every request.
    # Inherits from UserAgentMiddleware.
    def process_request(self, request, spider):
        # Called for every outgoing Request
        ua = UserAgent()
        request.headers['User-Agent'] = ua.random
        print(request.headers['User-Agent'])
(3) Enable the middleware
Set in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'pcProxy.middlewares.PcproxyDownloaderMiddleware': None,
    'pcProxy.middlewares.KaidailiUserAgentMiddleware': 100,
}
Method 2: maintain the user-agent list manually
- Pros: no dependency on external libraries or services
- Cons: the user-agents have to be added to settings.py by hand, and the list is never exhaustive
(1) Define the browser list
Add some user-agents manually in settings.py; extend the list as needed:
# user-agent pool
MY_USER_AGENT = [
'Mozilla/5.0 (compatible; U; ABrowse 0.6; Syllable) AppleWebKit/420+ (KHTML, like Gecko)',
'Mozilla/5.0 (compatible; ABrowse 0.4; Syllable)',
'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser 1.98.744; .NET CLR 3.5.30729)',
'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser; GTB5; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; InfoPath.1; .NET CLR 3.5.30729; .NET CLR 3.0.30618)',
]
(2) Pick a random user-agent in a UserAgentMiddleware subclass
Set in middlewares.py:
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random
from pcProxy.settings import MY_USER_AGENT

class MyUserAgentMiddleware(UserAgentMiddleware):
    # Sets a random user-agent; applies to every spider in the project.
    # Inherits from UserAgentMiddleware.
    def process_request(self, request, spider):
        # Called for every outgoing Request
        agent = random.choice(list(MY_USER_AGENT))
        request.headers.setdefault('User-Agent', agent)
(3) Enable the middleware
Set in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'pcProxy.middlewares.PcproxyDownloaderMiddleware': None,
    'pcProxy.middlewares.MyUserAgentMiddleware': 100,
}
4. Rotate the IP address
Assign the proxy server URL to the proxy key of the Request's meta parameter:
Request(url, meta={"proxy":"http://68.185.57.66:80"})
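When the proxy should vary across requests, the same meta key can also be set centrally in a downloader middleware. A minimal sketch, assuming a hypothetical RandomProxyMiddleware registered in DOWNLOADER_MIDDLEWARES and a hardcoded PROXIES list just for illustration:

# middlewares.py (illustrative sketch, not part of the project code above)
import random

class RandomProxyMiddleware(object):
    # In practice this list would come from a proxy pool or database
    PROXIES = [
        "http://68.185.57.66:80",
    ]

    def process_request(self, request, spider):
        # Scrapy routes the request through whatever URL is set in meta['proxy']
        request.meta['proxy'] = random.choice(self.PROXIES)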
II. Database persistence
1. Redis
(1) Add the Redis configuration to settings.py
# Redis
REDIS_HOST = "localhost"
REDIS_PORT = 6379
REDIS_DB_INDEX = 0
REDIS_DB_PASSWORD = "123456"
(2) Add a RedisPipeline class in pipelines.py
import redis

class RedisPipeline(object):
    # When the spider opens, read the database settings and connect to Redis
    def open_spider(self, spider):
        host = spider.settings.get("REDIS_HOST")
        port = spider.settings.get("REDIS_PORT")
        db_index = spider.settings.get("REDIS_DB_INDEX")
        db_password = spider.settings.get("REDIS_DB_PASSWORD")
        self.db_conn = redis.StrictRedis(
            host=host,
            port=port,
            db=db_index,
            password=db_password,
        )

    # Store the item in Redis
    def process_item(self, item, spider):
        if spider.name == 'kuaidaili':
            item_dict = dict(item)
            self.db_conn.sadd("ip", item_dict['url'])
        return item

    # When the spider closes, release the database connection
    def close_spider(self, spider):
        self.db_conn.connection_pool.disconnect()
(3) Enable RedisPipeline in settings.py
ITEM_PIPELINES = {
    'pcProxy.pipelines.RedisPipeline': 300,
}
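To check what the pipeline has stored, the set can be read back with redis-py; a quick sketch assuming the same connection settings as above:

import redis

conn = redis.StrictRedis(host="localhost", port=6379, db=0, password="123456")
print(conn.scard("ip"))      # number of proxies collected so far
print(conn.smembers("ip"))   # the proxy URLs themselves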
2. MySQL
(1) Add the MySQL configuration to settings.py
# MySQL
MYSQL_HOST = "localhost"
MYSQL_USER = "sql"
MYSQL_PASSWORD = "123456"
MYSQL_DB_NAME = "test"
MYSQL_TB_NAME = "proxy"
(2) Add a MysqlPipeline class in pipelines.py
import MySQLdb

class MysqlPipeline(object):
    # Item pipeline that saves data to MySQL
    def open_spider(self, spider):
        # When the spider opens, read the database settings
        host = spider.settings.get("MYSQL_HOST", "localhost")
        user = spider.settings.get("MYSQL_USER", "sql")
        pwd = spider.settings.get("MYSQL_PASSWORD", "123456")
        db_name = spider.settings.get("MYSQL_DB_NAME", "test")
        # Connect to the MySQL server
        self.db_conn = MySQLdb.connect(
            db=db_name,
            host=host,
            user=user,
            password=pwd,
            charset="utf8",
        )
        # Get a cursor
        self.db_cursor = self.db_conn.cursor()

    def process_item(self, item, spider):
        # Save the item to MySQL
        tb_name = spider.settings.get("MYSQL_TB_NAME", "proxy")
        values = (
            item['ip'],
            item['port'],
            item['anonymity_levels'],
            item['protocol'],
            # item['position'],
            item['country'],
        )
        sql = 'insert into {tb_name}(ip, port, anonymity_levels, protocol, country) values (%s, %s, %s, %s, %s)'.format(tb_name=tb_name)
        self.db_cursor.execute(sql, values)
        return item

    def close_spider(self, spider):
        # When the spider closes, commit and close the connection
        self.db_conn.commit()
        self.db_cursor.close()
        self.db_conn.close()
(3) Enable MysqlPipeline in settings.py
ITEM_PIPELINES = {
    'pcProxy.pipelines.MysqlPipeline': 300,
}
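Both pipelines read fields off the item (url for Redis; ip, port, anonymity_levels, protocol, country for MySQL), so items.py has to define them and the proxy table needs matching columns. A minimal sketch of items.py, assuming a hypothetical class name ProxyItem and field names taken from the pipeline code above:

# items.py (sketch)
import scrapy

class ProxyItem(scrapy.Item):
    url = scrapy.Field()               # full proxy URL, stored by RedisPipeline
    ip = scrapy.Field()
    port = scrapy.Field()
    anonymity_levels = scrapy.Field()
    protocol = scrapy.Field()
    position = scrapy.Field()          # commented out in the MySQL insert above
    country = scrapy.Field()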
III. Ways to launch
1. Run a single spider
- From the command line:
scrapy crawl spider_name
2. Run several spiders
- Create a main.py file in the project directory and put the example code below into it; to run the project, execute python main.py from the command line.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Build a CrawlerProcess from the project settings
process = CrawlerProcess(get_project_settings())
# Add the spiders to run
process.crawl('kuaidaili')
process.crawl('proxydb')
process.crawl('listdaily')
# Start crawling
process.start()
3. Run all spiders
- Edit main.py as follows, then run python main.py:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.spiderloader import SpiderLoader

# Build a CrawlerProcess from the project settings
process = CrawlerProcess(get_project_settings())
# SpiderLoader gives access to the names of all spiders in the project
spider_loader = SpiderLoader(get_project_settings())
# Add every spider to the process
for spidername in spider_loader.list():
    process.crawl(spidername)
# Start crawling
process.start()
4. Pass arguments to a spider at launch
(1) From the command line
scrapy crawl kuaidaili -a url='http://www.baidu.com'
(2) From the main.py script
- Pass the argument through process.crawl, e.g.:
process.crawl('kuaidaili', url='http://www.baidu.com')
In both cases the kuaidaili spider class can receive the url argument in its __init__ method, e.g.:
class KuaidailiSpider(scrapy.Spider):
    name = 'kuaidaili'

    def __init__(self, url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.url = url
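The spider can then use self.url, for example to issue its first request. A minimal sketch (start_requests is standard Scrapy; the logging in parse is just a placeholder):

import scrapy

class KuaidailiSpider(scrapy.Spider):
    name = 'kuaidaili'

    def __init__(self, url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.url = url

    def start_requests(self):
        # Seed the crawl with the url passed via -a url=... or process.crawl(..., url=...)
        yield scrapy.Request(self.url, callback=self.parse)

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)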
IV. Per-spider configuration
1. When it is needed
- The same project contains several spiders, each targeting a different site or task. Because sites differ in their anti-scraping measures, a particular spider may need its own DOWNLOAD_DELAY; item structures also vary between sites, so different spiders may need different pipeline settings, and so on.
2. Branch on spider.name
- In a pipeline:
- if spider.name == 'kuaidaili': ...
def process_item(self, item, spider):
    # Save the item to MySQL
    tb_name = spider.settings.get("MYSQL_TB_NAME", "proxy")
    if spider.name == 'kuaidaili':  # handling specific to the kuaidaili site
        values = (
            item['ip'],
            item['port'],
            item['anonymity_levels'],
            item['protocol'],
            item['position'],
            item['country'],
        )
        sql = 'insert into {tb_name}(ip, port, anonymity_levels'.format(tb_name=tb_name) + \
              ', protocol, position, country) values (%s, %s, %s, %s, %s, %s)'
        self.db_cursor.execute(sql, values)
    return item
3. Use custom_settings
- Scrapy applies settings in increasing precedence:
default_settings < project settings (settings.py) < custom_settings
so anything placed in custom_settings overrides the value in settings.py.
custom_settings in the spider:
class CheckSpider(scrapy.Spider):
    name = 'check'
    start_urls = ['http://127.0.0.1:5000/all']
    custom_settings = {
        'DOWNLOAD_DELAY': 0.01
    }
settings.py:
DOWNLOAD_DELAY = 10
RANDOMIZE_DOWNLOAD_DELAY = True
Resulting log output:
2020-03-13 12:02:30 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'pcProxy',
'COOKIES_ENABLED': False,
'DOWNLOAD_DELAY': 0.01,
'NEWSPIDER_MODULE': 'pcProxy.spiders',
'SPIDER_MODULES': ['pcProxy.spiders']}
V. Spider vs. CrawlSpider
1. Creation
- scrapy genspider exampleSpider example.com creates a spider class based on the scrapy.Spider template
- scrapy genspider -t crawl exampleSpider example.com creates a spider class based on the scrapy.spiders.CrawlSpider template
2. Usage
- Spider: the Responses for start_urls are parsed by the parse method you write. To go deeper, extract the follow-up URLs from the response with XPath or CSS selectors and send them with scrapy.Request; each resulting Response is then parsed by the callback specified on that Request.
- CrawlSpider: Rule objects can be used instead. Each Rule holds a LinkExtractor whose patterns describe the URLs worth following; links in the Responses for start_urls are matched against these rules, matching URLs are requested automatically, and their Responses are parsed by the specified callback. With follow=True, every returned Response is matched against the Rules again, and the loop continues until all links are exhausted (Scrapy's request queue deduplicates, so a repeated URL is only requested once).
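A minimal CrawlSpider sketch to illustrate the above (the spider name, domain and the /page/\d+ pattern are made-up placeholders; CrawlSpider, Rule, LinkExtractor and follow=True are the standard Scrapy API):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleCrawlSpider(CrawlSpider):
    name = 'exampleSpider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = (
        # Follow pagination links and parse each matched page with parse_item
        Rule(LinkExtractor(allow=r'/page/\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Extract whatever the item needs; here we just log the URL
        self.logger.info("Crawled %s", response.url)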