Introduction to the Scrapy framework:
Scrapy is an application framework written in Python for crawling websites and extracting structured data. It can be used for data mining, and a user only needs to customize a few modules to build a working crawler, which makes crawling tasks much easier to complete.
How to install the Scrapy framework:
- If Anaconda is already installed (the conda command below ships with it), open a command prompt (Win + R, then type cmd) and run:
conda install Scrapy
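If you are not using Anaconda, Scrapy can also be installed with pip (on Windows this may require the pre-compiled wheels described in the later step):
pip install scrapy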
- After installation, enter the following command. If Scrapy prints its version and a list of available commands, the installation succeeded; otherwise it failed:
scrapy
- If the installation fails, go to http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml, a site that hosts pre-compiled builds of many Python libraries, and download the lxml wheel that matches your Python version and operating system.
- After downloading, run the following commands to install the wheel package and then the downloaded .whl file (this example uses the Python 3.5, 32-bit build; substitute the filename for other versions). When that finishes, rerun the scrapy command from step 2 to confirm the installation:
pip install wheel
pip install lxml-3.7.2-cp35-cp35m-win32.whl
- Once the installation succeeds, try creating a project. First create a folder to hold your projects.
- Move into that Scrapy folder and enter the following command to create a project:
scrapy startproject baidu
- Once the project is created, you can view its file structure with the tree command; it should look roughly like the listing below.
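A freshly generated project follows the standard Scrapy layout (reconstructed here because the original screenshot is unavailable):
baidu/
├── scrapy.cfg
└── baidu/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py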
- The project mainly contains the following files:
- scrapy.cfg: the project's deployment configuration file
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = baidu.settings

[deploy]
#url = http://localhost:6800/
project = baidu
- items.py: the project's items file
# -*- coding: utf-8 -*-
# The data models (items) for your crawler are defined here.
#
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class BaiduItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
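As a sketch of how this file is usually filled in, a spider that collects page titles and links might declare those fields on the item (the field names here are illustrative, not part of the generated template):
import scrapy


class BaiduItem(scrapy.Item):
    # Hypothetical fields: declare one Field per piece of data the spider extracts.
    title = scrapy.Field()
    url = scrapy.Field()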
- pipelines.py: the project's pipelines file
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


# Pipeline class: every item the spider yields is passed to process_item()
class BaiduPipeline(object):

    def process_item(self, item, spider):
        return item
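Pipelines are typically used to clean or persist items. The sketch below, adapted from the JSON-lines example in the Scrapy documentation (it is not part of the generated project and must be registered in ITEM_PIPELINES before it runs), writes each item to a file:
import json


class JsonWriterPipeline(object):
    # Hypothetical pipeline: append each scraped item to a JSON-lines file.

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item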
- settings.py: the project's settings file
# -*- coding: utf-8 -*-

# Scrapy settings for baidu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'baidu'

# Where the spider modules live
SPIDER_MODULES = ['baidu.spiders']
NEWSPIDER_MODULE = 'baidu.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'baidu (+http://www.yourdomain.com)'

# Obey robots.txt rules (whether to respect the site's crawling policy)
# ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    # Priority 0-1000: lower values mean higher priority and run earlier
#    'baidu.middlewares.BaiduSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    # Priority 0-1000: lower values mean higher priority and run earlier
#    'baidu.middlewares.BaiduDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # Lower values mean higher priority and run earlier
    'baidu.pipelines.BaiduPipeline': 1,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
- middlewares.py: the project's spider and downloader middlewares
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


# Spider middleware
class BaiduSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class BaiduDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
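As an illustration of where these hooks are useful, a common custom downloader middleware rotates the User-Agent header. The sketch below is an assumption (class name and agent strings are invented) and would need to be enabled in DOWNLOADER_MIDDLEWARES before Scrapy uses it:
import random


class RandomUserAgentMiddleware(object):
    # Hypothetical middleware: pick a random User-Agent for every outgoing request.
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15',
    ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        return None  # let the request continue through the remaining middleware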
- spiders/: the directory that holds the spiders
- Use the following command to generate a spider and set the domain it is allowed to crawl (Baidu is used as the example):
scrapy genspider baiduSpider baidu.com
- To fetch the response body, first edit baiduSpider.py, for example as in the sketch below.
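The original screenshot is unavailable, so the following is a minimal sketch of what baidu/spiders/baiduSpider.py might look like after the edit; the body of parse(), which prints the response, is an assumption, while the rest matches what scrapy genspider generates:
# -*- coding: utf-8 -*-
import scrapy


class BaiduspiderSpider(scrapy.Spider):
    name = 'baiduSpider'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        # Print the raw response body to confirm the request succeeded.
        print(response.text)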
- Then execute the following command to crawl the site and fetch the response body:
scrapy crawl baiduSpider
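If you later want to save scraped items instead of only printing the response, Scrapy's -o option exports them to a file, for example (the filename is arbitrary):
scrapy crawl baiduSpider -o items.json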