Scrapy Framework Principles
Core principles of the Scrapy asynchronous framework:
- Synchronous vs. asynchronous
- How Scrapy works
Synchronous vs. Asynchronous
Synchronous: each call depends on the result of the previous call.
Asynchronous: the next call does not depend on the result of the previous call.
The problem with a synchronous crawler:
from urllib.request import urlopen

href_s = [url1, url2, url3]   # placeholder list of URLs
for href in href_s:
    response = urlopen(href).read()
    # process response
The next URL can only be fetched after the previous one has completed, so the crawler spends most of its time waiting on each request/response round trip.
How asynchrony solves this:
Each URL issues its request and receives its response independently of the others, so multiple requests can be in flight at the same time.
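As an illustration, the sketch below overlaps several requests using threads from the Python standard library. This is only a rough analogy: Scrapy itself is built on the Twisted event loop rather than threads, and the URLs here are just examples taken from this article.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# example URLs; any list of pages works
href_s = ['http://edu.youkuaiyun.com/', 'http://www.ebnew.com/']

def fetch(href):
    # blocking call, but many of them can run at the same time
    return urlopen(href).read()

with ThreadPoolExecutor(max_workers=16) as pool:
    for response in pool.map(fetch, href_s):
        pass  # process response here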
How Scrapy Works
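In brief: the Engine takes requests from the Spider, queues them in the Scheduler, and sends them through the Downloader middleware to the Downloader; responses come back through the middleware to the Spider's callbacks, items yielded by the Spider are handed to the Item Pipeline, and any new requests go back to the Scheduler. The whole cycle runs on Twisted's asynchronous event loop, which is why many requests can be in flight at once.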
Scrapy Project Creation and Configuration
- Installing Scrapy
- Creating a project
- Basic configuration
- A first example
项目创建
- 创建项目 >> scrapy startproject 【项目名字】
- 进入项目 >> cd 【项目名字】
- 创建爬虫文件 >> scrapy genspider 【爬虫名字】 “【HOST地址】“
- 运行爬虫文件 >> scrapy crawl 【爬虫名字】
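As a concrete example, using the myscrapy1 project and the s1 spider that appear in the code later in these notes (the host comes from allowed_domains below):
>> scrapy startproject myscrapy1
>> cd myscrapy1
>> scrapy genspider s1 "edu.youkuaiyun.com"
>> scrapy crawl s1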
Basic Configuration
Common options in settings.py:
USER_AGENT = ""                # User-Agent string sent with requests
ROBOTSTXT_OBEY = True | False  # whether to obey robots.txt
DEFAULT_REQUEST_HEADERS = {}   # default request headers
CONCURRENT_REQUESTS = 16       # maximum number of requests the downloader handles concurrently
DOWNLOAD_DELAY = 3             # download delay (seconds)
SPIDER_MIDDLEWARES             # spider middleware
DOWNLOADER_MIDDLEWARES         # downloader middleware
ITEM_PIPELINES                 # item pipelines
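For reference, a minimal settings.py excerpt that fills in these options might look like the following; the header and User-Agent values are only illustrative.
# settings.py (excerpt) -- illustrative values only
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # pretend to be a normal browser
ROBOTSTXT_OBEY = False            # ignore robots.txt while testing
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9",
}
CONCURRENT_REQUESTS = 16          # downloader concurrency
DOWNLOAD_DELAY = 3                # wait 3 seconds between requests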
Attributes and methods of the Spider (e.g. spiders/s1.py as generated by scrapy genspider s1):
import scrapy


class S1Spider(scrapy.Spider):
    # spider name
    name = 's1'
    # requests whose URL host is not in allowed_domains are filtered out
    allowed_domains = ['edu.youkuaiyun.com']
    # URLs requested when the project starts
    start_urls = ['http://edu.youkuaiyun.com/']

    # called with the response obtained for each URL in start_urls
    def parse(self, response):  # response is the response object
        pass

    # called when the spider starts; overriding it takes the place of start_urls
    def start_requests(self):
        yield scrapy.Request(             # send a Request object to the scheduler
            url='http://edu.youkuaiyun.com',  # request URL, GET by default
            callback=self.parse2          # function called once the response arrives
        )

    def parse2(self, response):           # called once the response arrives
        print(response.body)              # response body as bytes
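Since the pipeline shown later reads item['data'], a parse callback typically yields a dict (or Item) containing that field. A hypothetical sketch, to be placed inside the spider class; the XPath expression and the 'data' key are made up for illustration:
    def parse(self, response):
        # Hypothetical extraction: grab some text nodes and pass them on as an item.
        titles = response.xpath('//h1/text()').extract()  # list of strings
        yield {'data': titles}  # plain dicts reach the item pipelines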
Applying Middleware
Configuring a DownloaderMiddleware (middlewares.py), for example to set a proxy IP:
# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals
from scrapy.http.headers import Headers
from myscrapy1 import user_agent
import urllib.request as ur
class Myscrapy1SpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
class Myscrapy1DownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Replace the request headers with a random PC User-Agent taken
        # from the project's own user_agent helper module.
        request.headers = Headers(
            {
                'User-Agent': user_agent.get_user_agent_pc()
            }
        )
        # Fetch a proxy address from the proxy provider's API and attach it
        # to the request via request.meta['proxy'].
        request.meta['proxy'] = 'http://' + ur.urlopen(
            'http://api.ip.data5u.com/dynamic/get.html?order=b32af6184d3eb674429f3110c217364e&sep=3'
        ).read().decode('utf-8').strip()
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
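The user_agent module imported above is a project-local helper that is not shown in the source; a minimal hypothetical version of myscrapy1/user_agent.py could look like this.
# myscrapy1/user_agent.py -- hypothetical helper, not part of the original source
import random

# A small pool of desktop (PC) User-Agent strings; extend as needed.
_PC_USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0',
]


def get_user_agent_pc():
    """Return a random desktop User-Agent string."""
    return random.choice(_PC_USER_AGENTS)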
Enable the DownloaderMiddleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
'myscrapy1.middlewares.Myscrapy1DownloaderMiddleware': 543,
}
To start the Scrapy project from code instead of typing the command each time, put the launch command into a .py file inside the project: create a new start.py and add the following code.
from scrapy import cmdline
cmdline.execute('scrapy crawl s1'.split())
Configuring the Item Pipeline (pipelines.py)
Save the scraped text to a local file:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
class Myscrapy1Pipeline(object):
    def process_item(self, item, spider):
        # process the item
        # print('pipeline ran')
        print(item)
        data = item['data']
        save_data = ','.join(data)
        # append mode: process_item is called once per item, and 'w' mode
        # would overwrite the file for every new item
        with open('data.txt', 'a', encoding='utf-8') as f:
            f.write(save_data + '\n')
        return item
Enable the pipeline in settings.py:
ITEM_PIPELINES = {
'myscrapy1.pipelines.Myscrapy1Pipeline': 300,
}
Features that can be enabled in settings.py:
- Downloader middleware
- Item pipelines
- Download delay
- Default request headers
- robots.txt compliance
- Logging
LOG_FILE = "<path to the log file>"
LOG_LEVEL = "<log level>"
Log levels: CRITICAL (critical errors), ERROR (regular errors), WARNING (warning messages), INFO (informational messages), DEBUG (debugging messages)
# LOG_FILE = 'zhaobiao.log'
# LOG_LEVEL = 'ERROR'
Making the Request Chain More Robust
Optimize the request chain by building a loop that generates the follow-up requests:
    # Inside the spider class; requires `import re` and `from copy import deepcopy`.
    def start_requests(self):
        # Issue one POST request per search keyword, always starting at page 1.
        for keyword in self.keyword_s:
            form_data = deepcopy(self.form_data)
            form_data['key'] = keyword
            form_data['currentPage'] = '1'
            request = scrapy.FormRequest(
                url='http://ss.ebnew.com/tradingSearch/index.htm',
                formdata=form_data,
                callback=self.parse_start
            )
            # Carry the form data along so parse_start can reuse it for later pages.
            request.meta['form_data'] = form_data
            yield request
        # yield scrapy.Request(
        #     url='http://www.ebnew.com/businessShow/643996492.html',
        #     callback=self.parse_page2
        # )
        #
        # form_data = self.form_data
        # form_data['key'] = '路由器'
        # form_data['currentPage'] = '2'
        # yield scrapy.FormRequest(
        #     url='http://ss.ebnew.com/tradingSearch/index.htm',
        #     formdata=form_data,
        #     callback=self.parse_page1,
        # )

    def parse_start(self, response):
        # Read the pager links to find the largest page number.
        a_text_s = response.xpath('//form[@id="pagerSubmitForm"]/a/text()').extract()
        page_max = max(
            [int(a_text) for a_text in a_text_s if re.match(r'\d+', a_text)]
        )
        page_max = 2  # overrides the value above and limits the crawl to 2 pages
        # The first page's response is already here, so parse it directly.
        self.parse_page1(response)
        # Then request the remaining pages with the same form data.
        for page in range(2, page_max + 1):
            form_data = deepcopy(response.meta['form_data'])
            form_data['currentPage'] = str(page)
            request = scrapy.FormRequest(
                url='http://ss.ebnew.com/tradingSearch/index.htm',
                formdata=form_data,
                callback=self.parse_page1
            )
            request.meta['form_data'] = form_data
            yield request
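These two methods refer to attributes (keyword_s, form_data) and callbacks (parse_page1, parse_page2) defined elsewhere in the spider. Purely for orientation, a hypothetical skeleton of the surrounding class could look like this; the attribute values and the parse_page1 body are illustrative, only the names come from the snippet above.
# Hypothetical spider skeleton -- names inferred from the snippet above.
import re
from copy import deepcopy

import scrapy


class EbnewSpider(scrapy.Spider):
    name = 'ebnew'
    allowed_domains = ['ebnew.com']

    keyword_s = ['路由器']                        # search keywords to loop over (add more as needed)
    form_data = {'key': '', 'currentPage': '1'}   # base form fields for the search POST

    # start_requests() and parse_start() as shown above ...

    def parse_page1(self, response):
        # Parse one page of search results and yield items / follow detail pages.
        pass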
Data Persistence with Asynchronous IO
Basic pymysql usage:
- Import the package:
  import pymysql
- Create a connection object:
  mysql_conn = pymysql.connect(
      host='localhost',          # host address
      port=3306,                 # port
      user='root',               # user name
      password='',               # password
      database='<database name>',
      charset='utf8',            # UTF-8 encoding
  )
- Create a cursor object:
  cs = mysql_conn.cursor()
- Execute the SQL statement:
  cs.execute('<SQL statement>')
- Commit:
  mysql_conn.commit()
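Putting these steps together, a pipeline that writes each item into MySQL could look like the following sketch. The table name, column, and connection settings are assumptions; for truly asynchronous writes, Scrapy projects often use twisted.enterprise.adbapi instead of a blocking connection.
# Hypothetical MySQL pipeline -- table/column names and credentials are examples.
import pymysql


class MysqlPipeline(object):
    def open_spider(self, spider):
        # One connection for the whole crawl.
        self.mysql_conn = pymysql.connect(
            host='localhost', port=3306, user='root', password='',
            database='scrapy_demo', charset='utf8'
        )

    def process_item(self, item, spider):
        cs = self.mysql_conn.cursor()
        # Parameterized query to avoid SQL injection.
        cs.execute('INSERT INTO bid_info (data) VALUES (%s)', (','.join(item['data']),))
        self.mysql_conn.commit()
        cs.close()
        return item

    def close_spider(self, spider):
        self.mysql_conn.close()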