Introduction to the Scrapy Framework
(Scrapy architecture diagram; image source: Baidu Images)
Leaving aside the Scrapy Engine (which coordinates everything and issues the commands), the rough flow is: you write a spider (Spiders); its requests are handed to the Scheduler, which enqueues them; the Scheduler hands requests to the Downloader; the responses returned by the Downloader go back to the spider for extraction; if what is extracted is a URL, the process above repeats, and if it is Items data, it is passed to the Item Pipeline for storage.
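This flow maps directly onto a spider's callbacks. Below is a minimal sketch against Scrapy's demo site quotes.toscrape.com (the spider name and the CSS selectors assume that site's markup and are only for illustration): requests yielded by the spider go through the Scheduler and Downloader, the response comes back into parse(), yielded dicts are treated as items and sent to the Item Pipeline, and yielded requests go back into the cycle.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"                                   # illustrative spider name
    start_urls = ["http://quotes.toscrape.com/"]      # Scrapy's demo site

    def parse(self, response):
        # Extracted data: these dicts are handed to the Item Pipeline
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").extract_first()}
        # Extracted URL: the new request goes back to the Scheduler and the cycle repeats
        next_page = response.css("li.next a::attr(href)").extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)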
Setting Up the Environment on Windows
Open a command prompt and run the following:
pip install --upgrade pip
pip install Scrapy
If pip is already up to date, the first command simply reports that there is nothing to upgrade. While running the second command, I hit an error building Twisted.
The error means Twisted failed to install. The workaround is to download a Twisted wheel that matches your Python version, put it in a directory, and install it from that directory, roughly as follows:
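As a sketch (the path and wheel filename below are only examples; use the wheel that matches your Python version and architecture, e.g. one downloaded from the Unofficial Windows Binaries for Python Extension Packages page):

cd C:\downloads                                   # example: the directory holding the downloaded wheel
pip install Twisted-17.9.0-cp36-cp36m-win32.whl   # example filename for 32-bit Python 3.6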
Once Twisted is installed this way, re-run pip install Scrapy and the installation completes.
When I then ran the scrapy bench command, I hit another error: ModuleNotFoundError: No module named 'win32api'. This means pywin32 is not installed, so I installed it from the command line with pip install pywin32. After that, scrapy bench ran and could benchmark the machine.
A Few Commands
Typing scrapy at the command line prints the following:
Scrapy 1.5.0 - no active project
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
[ more ] More commands available when run from project directory
Use "scrapy <command> -h" to see more info about a command
bench
Run quick benchmark test
Benchmarks the machine: roughly how many pages per minute it can crawl.
scrapy bench
fetch
Fetch a URL using the Scrapy downloader
scrapy fetch "http://www.baidu.com"
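fetch downloads the page with Scrapy's downloader and dumps the response body to stdout. To keep the log lines out of that output you can pass the global --nolog option:

scrapy fetch --nolog "http://www.baidu.com"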
genspider
Generate new spider using pre-defined templates
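For example (the spider name and domain here are placeholders):

scrapy genspider example example.com

This generates example.py from the default "basic" template, containing a Spider subclass with name, allowed_domains, start_urls and an empty parse() method; run inside a project, the file is placed under the spiders/ directory.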
runspider
Run a self-contained spider (without creating a project)
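A self-contained spider is just a single .py file holding a Spider subclass, no project required. A minimal sketch (the file name and spider are made up for illustration):

# titlespider.py -- a single-file spider
import scrapy

class TitleSpider(scrapy.Spider):
    name = "title"
    start_urls = ["http://www.baidu.com"]

    def parse(self, response):
        # yield the page title as a single item
        yield {"title": response.xpath("//title/text()").extract_first()}

Run it directly, optionally exporting the items to a file:

scrapy runspider titlespider.py -o titles.json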
settings
Get settings values
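For example, --get prints the value of a single setting; outside a project it shows the defaults (BOT_NAME defaults to scrapybot, which matches the "bot: scrapybot" line in the log below):

scrapy settings --get BOT_NAME
scrapy settings --get DOWNLOAD_DELAY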
shell
Interactive scraping console
scrapy shell "http://www.baidu.com"
Output:
2018-04-05 21:12:14 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-04-05 21:12:14 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 16:07:46) [MSC v.1900 32 bit (Intel)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Windows-10-10.0.15063-SP0
2018-04-05 21:12:14 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2018-04-05 21:12:14 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2018-04-05 21:12:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-04-05 21:12:15 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-04-05 21:12:15 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-04-05 21:12:15 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-04-05 21:12:15 [scrapy.core.engine] INFO: Spider opened
2018-04-05 21:12:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.baidu.com> (referer: None)
response.body
Output:
b'...\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93...'  (the raw HTML of the Baidu homepage as bytes, truncated here; the \x escapes are the UTF-8 bytes of the Chinese text, e.g. this run decodes to "百度一下,你就知道")
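Inside the shell you can also try selectors directly on the response; for example, the following (a plain XPath for the page title, which on this page should decode to Baidu's title "百度一下,你就知道") :

response.xpath('//title/text()').extract_first()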
startproject
Create new project
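For example (the project name tutorial is arbitrary):

scrapy startproject tutorial

In Scrapy 1.5 this creates a directory roughly like the following:

tutorial/
    scrapy.cfg            # deploy configuration
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # put your spiders here
            __init__.py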
version
Print Scrapy version
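Adding -v also prints the versions of the main dependencies (lxml, Twisted, Python, and so on), which is handy when reporting bugs:

scrapy version -v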
view
Open URL in browser, as seen by Scrapy
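This downloads the page with Scrapy's downloader and opens the saved copy in your browser, which is useful for spotting content that is rendered only by JavaScript and therefore missing from what Scrapy actually sees:

scrapy view "http://www.baidu.com"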