Why learn Splash?
We often write our crawlers with the Scrapy framework, and standing on the shoulders of giants feels great. But as soon as a site renders its content dynamically with JavaScript, Scrapy alone falls short. Selenium, which we already know, can handle dynamic loading and return the browser-rendered page, but today we will not talk about Selenium: Scrapy-Splash (a tool that adds JavaScript rendering support to Scrapy) can do the same job, so let's look at how to hook Splash up to Scrapy.
Official documentation: https://splash.readthedocs.io/en/stable/
Preparation
- Installation: https://splash.readthedocs.io/en/stable/install.html
Scrapy-Splash renders pages through Splash's HTTP API, so we need to install Splash itself; here we install it via Docker.
Linux + Docker
- Install Docker
- Pull the image:
sudo docker pull scrapinghub/splash
- Start the container:
sudo docker run -it -p 8050:8050 scrapinghub/splash
OS X + Docker
- Install Docker for Mac (see https://docs.docker.com/docker-for-mac/).
- Pull the image:
docker pull scrapinghub/splash
- Start the container:
docker run -it -p 8050:8050 scrapinghub/splash
Once the container is running, open http://0.0.0.0:8050 in a browser and you should see the Splash web console.
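If you would rather verify the service from code than from the browser, a minimal sanity check might look like this (assuming Splash listens on localhost:8050 and using the standard render.html endpoint):
import requests

# Ask Splash to render a trivial page; HTTP 200 means the service is up.
resp = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "https://example.com", "wait": 1},
    timeout=10,
)
print(resp.status_code)  # expect 200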
Install scrapy-splash
pip3 install scrapy-splash
Usage comparison
Step 1: Create a project with plain Scrapy:
- Modify the settings file as follows:
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
}
- The spider code is as follows:
import scrapy


class SplashspiderSpider(scrapy.Spider):
    name = 'splashSpider'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/subject_search?search_text=%E6%88%90%E9%BE%99&cat=1002']

    def parse(self, response):
        print(response.status, response.url)
        with open('page.html', 'w') as file:
            file.write(response.text)
After saving the response locally we find that it contains none of the movie data shown on the page, because that data is rendered by JavaScript, which plain Scrapy does not execute.
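A quick way to confirm this from inside parse is to count the result nodes in the raw response; the CSS selector below is an assumption about Douban's markup, not taken from the page:
# Inside parse(): the non-rendered HTML contains no result nodes.
items = response.css('div.item-root')  # hypothetical result-item selector
print(f"found {len(items)} result nodes")  # prints 0 without JS rendering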
Step 2: Make the same request with scrapy-splash
- In settings.py, you need to add the following extra entries:
# URL of the rendering service (local, or a remote server IP)
SPLASH_URL = 'http://127.0.0.1:8050'
# Configure the spider middleware
SPIDER_MIDDLEWARES = {
    # 'splashpeoject.middlewares.SplashpeojectSpiderMiddleware': 543,
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# Configure the downloader middlewares.
# These three downloader middlewares are the core of scrapy-splash: unlike the
# Selenium approach, we do not have to write our own middleware; scrapy-splash
# ships them, and we only need to enable them here.
DOWNLOADER_MIDDLEWARES = {
    # 'splashpeoject.middlewares.SplashpeojectDownloaderMiddleware': 543,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Configure the deduplication filter class
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Use a Splash-aware HTTP cache
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
- Modify the spider file as follows:
# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest, SplashFormRequest


class SplashspiderSpider(scrapy.Spider):
    name = 'splashSpider'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/subject_search?search_text=%E6%88%90%E9%BE%99&cat=1002']

    def start_requests(self):
        for url in self.start_urls:
            # The first two arguments of SplashRequest are still the request URL
            # and the callback. We can also pass rendering parameters via args
            # (e.g. the wait time) and pick a rendering endpoint via the endpoint
            # argument. See https://github.com/scrapy-plugins/scrapy-splash#requests
            # for the full list of parameters.
            yield SplashRequest(
                url=url,
                callback=self.parse,
                meta={'title': 'xxxx'},
                args={
                    'wait': 1,
                },
            )

    def parse(self, response):
        print(response.status, response.url, response.meta)
        with open('page.html', 'w') as file:
            file.write(response.text)
With this basic code in place we can use Splash to fetch the page: we build the request as a SplashRequest, Scrapy forwards it to Splash, Splash renders the page, and the rendered result is returned to the spider for parsing.
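Since the response now holds the rendered HTML, parse can extract the movie data directly instead of dumping the page to disk. A minimal sketch (the CSS selectors are assumptions about Douban's markup and should be checked against the real page):
def parse(self, response):
    # Illustrative selectors; inspect the rendered page for the real ones.
    for item in response.css('div.item-root'):
        yield {
            'title': item.css('a.title-text::text').get(),
            'rating': item.css('span.rating_nums::text').get(),
        }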
Using SplashRequest together with a Lua script in the Spider
- For example, if we need to click "next page" automatically after the page has loaded, we have to execute the corresponding JS code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest, SplashFormRequest


class SplashspiderSpider(scrapy.Spider):
    name = 'splashSpider'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/subject_search?search_text=%E6%88%90%E9%BE%99&cat=1002']

    script = """
    function main(splash, args)
        splash.images_enabled = false
        assert(splash:go(args.url))
        assert(splash:wait(args.wait))
        js = "document.querySelector('a.next').click()"
        splash:evaljs(js)
        assert(splash:wait(args.wait))
        return splash:html()
    end
    """

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                callback=self.parse,
                endpoint='execute',
                args={
                    'wait': 1,
                    'lua_source': self.script,
                },
            )

    def parse(self, response):
        print(response.status, response.url, response.meta, response.request.headers)
        with open('page.html', 'w') as file:
            file.write(response.text)
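The script above only lands on the second page. To keep paging, one option is to let parse look for the next-page link in the rendered HTML and issue another SplashRequest; a sketch, reusing the a.next selector from the Lua script above:
def parse(self, response):
    # ...extract items from the rendered page here...
    next_href = response.css('a.next::attr(href)').get()
    if next_href:
        # Render the next page normally, without re-running the click script.
        yield SplashRequest(
            response.urljoin(next_href),
            callback=self.parse,
            args={'wait': 1},
        )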
Using scrapy-splash with a proxy IP
- Method 1: add a proxy middleware to our Scrapy project's middlewares:
import base64

proxyServer = "http://proxy_ip:proxy_port"  # placeholder proxy address

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # For Splash requests the proxy goes into the splash args, not meta['proxy'].
        request.meta['splash']['args']['proxy'] = proxyServer
        proxy_user_pass = "USERNAME:PASSWORD"
        # base64.encodestring() is gone in Python 3; use b64encode instead.
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers["Proxy-Authorization"] = 'Basic ' + encoded_user_pass
Note: the proxy is no longer set with request.meta['proxy'] = proxyServer, but with request.meta['splash']['args']['proxy'] = proxyServer.
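For the middleware to take effect it also has to be registered in settings.py. A sketch, where the module path splashpeoject.middlewares.ProxyMiddleware is an assumption based on the project name used earlier:
DOWNLOADER_MIDDLEWARES = {
    'splashpeoject.middlewares.ProxyMiddleware': 100,  # hypothetical path
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}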
- Method 2: set the proxy when constructing the request:
def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(
            url,
            callback=self.parse,
            args={
                'wait': 5,
                'proxy': 'http://proxy_ip:proxy_port',
            },
        )
Splash + requests GET request example
import requests


def splash_render(url):
    splash_url = "http://localhost:8050/render.html"
    args = {
        "url": url,
        "timeout": 5,
        "images": 0,  # the parameter is 'images', not 'image'
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    }
    response = requests.get(splash_url, params=args, headers=headers)
    return response.text


if __name__ == '__main__':
    url = "https://movie.douban.com/subject_search?search_text=%E6%88%90%E9%BE%99&cat=1002"
    html = splash_render(url)
    with open('page1.html', 'w') as file:
        file.write(html)
Explanation of the args parameters (a js_source sketch follows the list):
- url: the page to render
- timeout: timeout in seconds
- proxy: proxy to use
- wait: time to wait for rendering
- images: whether to download images, default 1 (download)
- js_source: JS code executed before the page is rendered
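For example, js_source injects a script before rendering; Splash accepts it most reliably as a JSON POST. A minimal sketch (the scroll snippet is only an illustration):
import requests

resp = requests.post(
    "http://localhost:8050/render.html",
    json={
        "url": "https://example.com",
        "wait": 2,
        # Runs in the page context; scroll down to trigger lazy loading.
        "js_source": "window.scrollTo(0, document.body.scrollHeight);",
    },
)
html = resp.text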
Other Splash references:
https://www.cnblogs.com/lmx123/p/9989915.html