Python 3: Installing and Configuring the Scrapy Framework

This article introduces the basics of the Scrapy framework and its installation steps, then walks through the files of a generated project, including the deploy configuration file, the item model file, the pipeline file, and more, to help readers get started quickly.

What is Scrapy:

    Scrapy is an application framework written in Python for crawling websites and extracting structured data, and it can be applied to data mining and similar tasks. You only need to customize a few modules to get a working spider, which makes crawling jobs much easier to complete.

How to install Scrapy:

  1. If you have Anaconda installed, open a command prompt (press Win + R, type cmd) and run the following command:

    conda install Scrapy
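
    If you are not using Anaconda, installing Scrapy with pip should work just as well:

    pip install scrapy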


  2. After installation, run the following command; if the output looks like the sketch below (the Scrapy version banner plus a list of available commands), the installation succeeded, otherwise it failed:

    scrapy
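
    Roughly, a successful install prints something along these lines (the version number and the exact command list depend on your Scrapy release):

    Scrapy 1.5.0 - no active project

    Usage:
      scrapy <command> [options] [args]

    Available commands:
      bench         Run quick benchmark test
      fetch         Fetch a URL using the Scrapy downloader
      genspider     Generate new spider using pre-defined templates
      startproject  Create new project
      version       Print Scrapy version
      ...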


  3. If the installation fails, go to this site, which hosts precompiled wheels for many Python libraries: http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml, and download the lxml wheel that matches your Python version and operating system.

  4. After downloading, run the following commands to install the wheel file (the example uses the Python 3.5, 32-bit build; swap in the file name you downloaded). Once it finishes, repeat the command from step 2 to verify the installation:

    pip install wheel
    pip install lxml-3.7.2-cp35-cp35m-win32.whl
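
    If you are unsure which build matches your interpreter, this one-liner prints the Python version and whether it is a 32-bit or 64-bit build:

    python -c "import sys, struct; print(sys.version, struct.calcsize('P') * 8, 'bit')"
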
  5. Once the installation succeeds, try creating a project. First create a folder to hold your projects (called Scrapy in this walkthrough).
  6. Change into that Scrapy folder and run the following command to create a project:

    scrapy startproject baidu


  7. Once the project has been created, you can inspect its layout with the tree command; a sketch of the expected structure follows.
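
    Assuming the default project template, the generated layout should look roughly like this:

    baidu/
        scrapy.cfg
        baidu/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py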

  8. The project mainly contains the following files:

    1. scrapy.cfg: the project's deploy configuration file

      # Automatically created by: scrapy startproject
      #
      # For more information about the [deploy] section see:
      # https://scrapyd.readthedocs.io/en/latest/deploy.html
      
      [settings]
      default = baidu.settings
      
      [deploy]
      #url = http://localhost:6800/
      project = baidu
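
    The [deploy] section only matters when you publish the project to a Scrapyd server. As a hedged example (not part of the generated file), a named deploy target with the url uncommented might look like this and would be used with scrapyd-deploy from the scrapyd-client package:

      [deploy:local]
      url = http://localhost:6800/
      project = baidu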
      


    2. items.py: the project's item definitions (the data models for scraped data)

      # -*- coding: utf-8 -*-
      
      # Item models
      # Define the data models for your scraped items here
      # Define here the models for your scraped items
      #
      # See documentation in:
      # https://doc.scrapy.org/en/latest/topics/items.html
      
      import scrapy
      
      
      class BaiduItem(scrapy.Item):
          # define the fields for your item here like:
          # name = scrapy.Field()
          pass
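
    As a sketch of how the item class gets filled in (these fields are hypothetical examples, not part of the generated template):

      import scrapy

      class BaiduItem(scrapy.Item):
          # Hypothetical example fields; each piece of scraped data gets a scrapy.Field()
          title = scrapy.Field()
          url = scrapy.Field()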
      


    3. pipelines.py: the project's item pipeline file

      # -*- coding: utf-8 -*-
      
      # Define your item pipelines here
      #
      # Don't forget to add your pipeline to the ITEM_PIPELINES setting
      # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
      
      # Item pipeline
      class BaiduPipeline(object):
          def process_item(self, item, spider):
              return item
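
    As an illustration of what a working pipeline might do (this class is hypothetical, not generated by Scrapy), the sketch below appends every item to a JSON Lines file; it would also need to be registered in ITEM_PIPELINES in settings.py:

      import json

      class JsonWriterPipeline(object):
          # Hypothetical pipeline: write each item to items.jl as one JSON object per line
          def open_spider(self, spider):
              self.file = open('items.jl', 'w', encoding='utf-8')

          def close_spider(self, spider):
              self.file.close()

          def process_item(self, item, spider):
              self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
              return item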

       

    4. settings.py: the project settings file

      # -*- coding: utf-8 -*-
      
      # Scrapy settings for baidu project
      #
      # For simplicity, this file contains only settings considered important or
      # commonly used. You can find more settings consulting the documentation:
      #
      #     https://doc.scrapy.org/en/latest/topics/settings.html
      #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
      #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
      
      BOT_NAME = 'baidu'
      
      # Where the spider modules live
      SPIDER_MODULES = ['baidu.spiders']
      NEWSPIDER_MODULE = 'baidu.spiders'
      
      
      # Crawl responsibly by identifying yourself (and your website) on the user-agent
      #USER_AGENT = 'baidu (+http://www.yourdomain.com)'
      
      # Obey robots.txt rules
      # Whether to obey the robots.txt rules
      # ROBOTSTXT_OBEY = False
      
      # Configure maximum concurrent requests performed by Scrapy (default: 16)
      # Maximum number of concurrent requests (default: 16)
      #CONCURRENT_REQUESTS = 32
      
      # Delay between successive requests to the same website
      # Configure a delay for requests for the same website (default: 0)
      # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
      # See also autothrottle settings and docs
      #DOWNLOAD_DELAY = 3
      # The download delay setting will honor only one of:
      #CONCURRENT_REQUESTS_PER_DOMAIN = 16
      #CONCURRENT_REQUESTS_PER_IP = 16
      # COOKIES_ENABLED: whether cookies are sent and accepted
      # Disable cookies (enabled by default)
      #COOKIES_ENABLED = False
      
      # Disable Telnet Console (enabled by default)
      #TELNETCONSOLE_ENABLED = False
      
      # Override the default request headers:
      #DEFAULT_REQUEST_HEADERS = {
      #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      #   'Accept-Language': 'en',
      #}
      
      # Enable or disable spider middlewares
      # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
      #SPIDER_MIDDLEWARES = {
          # 0-1000: the lower the value, the higher the priority and the earlier it runs
      #    'baidu.middlewares.BaiduSpiderMiddleware': 543,
      #}
      
      # Enable or disable downloader middlewares
      # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
      #DOWNLOADER_MIDDLEWARES = {
      #       0-1000: the lower the value, the higher the priority and the earlier it runs
      #    'baidu.middlewares.BaiduDownloaderMiddleware': 543,
      #}
      
      # Enable or disable extensions
      # See https://doc.scrapy.org/en/latest/topics/extensions.html
      #EXTENSIONS = {
      #    'scrapy.extensions.telnet.TelnetConsole': None,
      #}
      
      # Configure item pipelines
      # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
      ITEM_PIPELINES = {
      #     The lower the value, the higher the priority and the earlier it runs
         'baidu.pipelines.BaiduPipeline': 1,
      }
      
      # Enable and configure the AutoThrottle extension (disabled by default)
      # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
      #AUTOTHROTTLE_ENABLED = True
      # The initial download delay
      #AUTOTHROTTLE_START_DELAY = 5
      # The maximum download delay to be set in case of high latencies
      #AUTOTHROTTLE_MAX_DELAY = 60
      # The average number of requests Scrapy should be sending in parallel to
      # each remote server
      #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
      # Enable showing throttling stats for every response received:
      #AUTOTHROTTLE_DEBUG = False
      
      # Enable and configure HTTP caching (disabled by default)
      # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
      #HTTPCACHE_ENABLED = True
      #HTTPCACHE_EXPIRATION_SECS = 0
      #HTTPCACHE_DIR = 'httpcache'
      #HTTPCACHE_IGNORE_HTTP_CODES = []
      #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
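
    Most first crawls only need a handful of these settings changed; a hedged example of typical overrides (the values below are illustrative, not from the generated template):

      USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
      ROBOTSTXT_OBEY = True
      DOWNLOAD_DELAY = 1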
      

       

    5. middlewares.py: the spider and downloader middleware file

      # -*- coding: utf-8 -*-
      
      # Define here the models for your spider middleware
      #
      # See documentation in:
      # https://doc.scrapy.org/en/latest/topics/spider-middleware.html
      
      from scrapy import signals
      
      # Spider middleware
      class BaiduSpiderMiddleware(object):
          # Not all methods need to be defined. If a method is not defined,
          # scrapy acts as if the spider middleware does not modify the
          # passed objects.
      
          @classmethod
          def from_crawler(cls, crawler):
              # This method is used by Scrapy to create your spiders.
              s = cls()
              crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
              return s
      
          def process_spider_input(self, response, spider):
              # Called for each response that goes through the spider
              # middleware and into the spider.
      
              # Should return None or raise an exception.
              return None
      
          def process_spider_output(self, response, result, spider):
              # Called with the results returned from the Spider, after
              # it has processed the response.
      
              # Must return an iterable of Request, dict or Item objects.
              for i in result:
                  yield i
      
          def process_spider_exception(self, response, exception, spider):
              # Called when a spider or process_spider_input() method
              # (from other spider middleware) raises an exception.
      
              # Should return either None or an iterable of Response, dict
              # or Item objects.
              pass
      
          def process_start_requests(self, start_requests, spider):
              # Called with the start requests of the spider, and works
              # similarly to the process_spider_output() method, except
              # that it doesn’t have a response associated.
      
              # Must return only requests (not items).
              for r in start_requests:
                  yield r
      
          def spider_opened(self, spider):
              spider.logger.info('Spider opened: %s' % spider.name)
      
      
      class BaiduDownloaderMiddleware(object):
          # Not all methods need to be defined. If a method is not defined,
          # scrapy acts as if the downloader middleware does not modify the
          # passed objects.
      
          @classmethod
          def from_crawler(cls, crawler):
              # This method is used by Scrapy to create your spiders.
              s = cls()
              crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
              return s
      
          def process_request(self, request, spider):
              # Called for each request that goes through the downloader
              # middleware.
      
              # Must either:
              # - return None: continue processing this request
              # - or return a Response object
              # - or return a Request object
              # - or raise IgnoreRequest: process_exception() methods of
              #   installed downloader middleware will be called
              return None
      
          def process_response(self, request, response, spider):
              # Called with the response returned from the downloader.
      
              # Must either:
              # - return a Response object
              # - return a Request object
              # - or raise IgnoreRequest
              return response
      
          def process_exception(self, request, exception, spider):
              # Called when a download handler or a process_request()
              # (from other downloader middleware) raises an exception.
      
              # Must either:
              # - return None: continue processing this exception
              # - return a Response object: stops process_exception() chain
              # - return a Request object: stops process_exception() chain
              pass
      
          def spider_opened(self, spider):
              spider.logger.info('Spider opened: %s' % spider.name)
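
    These middleware classes are generated but not enabled by default; to activate them, uncomment the corresponding entries in settings.py, for example:

      DOWNLOADER_MIDDLEWARES = {
          'baidu.middlewares.BaiduDownloaderMiddleware': 543,
      }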
      

       

    6. spiders/: the directory that holds your spider modules

  9. Use the following command to generate a spider and set the domain it is allowed to crawl (Baidu is used as the example):

    scrapy genspider baiduSpider baidu.com

  10. To fetch and inspect the response body, first edit the generated baiduSpider.py file; a minimal sketch of the modified spider follows.
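
    A minimal sketch of baiduSpider.py, assuming the default basic spider template (the generated class name and start URL may differ slightly):

    # -*- coding: utf-8 -*-
    import scrapy


    class BaiduspiderSpider(scrapy.Spider):
        name = 'baiduSpider'
        allowed_domains = ['baidu.com']
        start_urls = ['https://www.baidu.com/']

        def parse(self, response):
            # Print the response body to confirm the page was fetched
            print(response.text)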

  11. Run the following command from the project directory to crawl the page and get the response body:

    scrapy crawl baiduSpider
