Step 01: First create the Scrapy project, named matplotlib_examples, then use the scrapy genspider command to create the Spider:
$ scrapy startproject matplotlib_examples
$ cd matplotlib_examples
$ scrapy genspider examples matplotlib.org
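For reference, these two commands generate a project skeleton roughly like the one below (only the files edited in this section are shown):

matplotlib_examples/
├── scrapy.cfg
└── matplotlib_examples/
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── examples.py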
Step 02: Enable FilesPipeline in the settings.py configuration file, and specify the download directory for files:
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 'examples_src'
Step 03: Implement ExampleItem, which needs to define the two fields file_urls and files. Add the following code in items.py:
import scrapy

class ExampleItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()
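FilesPipeline reads the urls in file_urls, downloads each one, and writes the download results back into files. As a rough illustration (the url, hash, and checksum below are made up), an exported item ends up looking like this:

{
    'file_urls': ['http://matplotlib.org/mpl_examples/axes_grid/demo_floating_axes.py'],
    'files': [{
        'url': 'http://matplotlib.org/mpl_examples/axes_grid/demo_floating_axes.py',
        'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.py',  # SHA1-based name
        'checksum': 'b9d9b3a1f2d44527bd4b5a4e2f05d7c2',
    }],
}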
Step 04: Implement ExamplesSpider. First set the starting crawl point:
import scrapy

class ExamplesSpider(scrapy.Spider):
    name = "examples"
    allowed_domains = ["matplotlib.org"]
    start_urls = ['http://matplotlib.org/examples/index.html']

    def parse(self, response):
        pass
The parse method is the parsing function for the example index page. In it, we extract the link to each example page and use it to construct and submit a Request object. The details of link extraction were discussed during the page analysis. The parse method is implemented as follows:
import scrapy
from scrapy.linkextractors import LinkExtractor

class ExamplesSpider(scrapy.Spider):
    name = "examples"
    allowed_domains = ["matplotlib.org"]
    start_urls = ['http://matplotlib.org/examples/index.html']

    def parse(self, response):
        le = LinkExtractor(restrict_css='div.toctree-wrapper.compound',
                           deny='/index.html$')
        print(len(le.extract_links(response)))
        for link in le.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse_example)

    def parse_example(self, response):
        pass
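The LinkExtractor settings can be checked interactively before running the spider. A quick session in scrapy shell (the exact count will vary with the live page):

$ scrapy shell http://matplotlib.org/examples/index.html
>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor(restrict_css='div.toctree-wrapper.compound',
...                    deny='/index.html$')
>>> len(le.extract_links(response))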
In the code above, we set parse_example as the parsing function for example pages; now let's implement it. Each example page contains a download link for the example's source file. In parse_example, we extract the source file's url, put it into a list, and assign the list to the file_urls field of an ExampleItem. The parse_example method is implemented as follows:
import scrapy
from scrapy.linkextractors import LinkExtractor
from ..items import ExampleItem

class ExamplesSpider(scrapy.Spider):
    name = "examples"
    allowed_domains = ["matplotlib.org"]
    start_urls = ['http://matplotlib.org/examples/index.html']

    def parse(self, response):
        le = LinkExtractor(restrict_css='div.toctree-wrapper.compound',
                           deny='/index.html$')
        print(len(le.extract_links(response)))
        for link in le.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse_example)

    def parse_example(self, response):
        href = response.css('a.reference.external::attr(href)').extract_first()
        url = response.urljoin(href)
        example = ExampleItem()
        example['file_urls'] = [url]
        return example
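At this point the spider is already functional and can be run from the project directory:

$ scrapy crawl examples -o examples.json

The source files are downloaded into examples_src/full, but FilesPipeline names each one after the SHA1 hash of its url (e.g. full/0a79c4....py), which makes the files hard to identify.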
Instead, we can use the last directory and file name of each source url (the parenthesized part noted in the page analysis) as the file path. To do so, implement MyFilesPipeline in pipelines.py, overriding the file_path method:
from scrapy.pipelines.files import FilesPipeline
from urllib.parse import urlparse
from os.path import basename, dirname, join

class MyFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        path = urlparse(request.url).path
        return join(basename(dirname(path)), basename(path))
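To see what this override produces, here is the same path logic traced on a sample url (the url is illustrative):

from urllib.parse import urlparse
from os.path import basename, dirname, join

url = 'http://matplotlib.org/mpl_examples/axes_grid/demo_floating_axes.py'
path = urlparse(url).path  # '/mpl_examples/axes_grid/demo_floating_axes.py'
# keep only the last directory plus the file name
print(join(basename(dirname(path)), basename(path)))
# -> axes_grid/demo_floating_axes.py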
Finally, modify the configuration file to use MyFilesPipeline in place of the default FilesPipeline:
ITEM_PIPELINES = {
    # 'scrapy.pipelines.files.FilesPipeline': 1,
    'matplotlib_examples.pipelines.MyFilesPipeline': 1,
}
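Running the spider again, the downloaded example sources now land under examples_src in directories named after their category. A hypothetical excerpt of the result:

$ scrapy crawl examples -o examples.json
$ ls examples_src
animation  api  axes_grid  ...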