Step 01: First create the Scrapy project, named matplotlib_examples, then use the scrapy genspider command to create the Spider:
$ scrapy startproject matplotlib_examples
$ cd matplotlib_examples
$ scrapy genspider examples matplotlib.org
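For reference, these two commands generate a project skeleton roughly like the one below (only the files edited in this section are shown):

matplotlib_examples/
├── scrapy.cfg
└── matplotlib_examples/
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── examples.py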
Step 02: Enable FilesPipeline in the settings.py configuration file, and specify the download directory for files:
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 'examples_src'
Step 03: Implement ExampleItem, which needs to define the two fields file_urls and files. Add the following code in items.py:
import scrapy

class ExampleItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()
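FilesPipeline reads the urls in file_urls, downloads each one, and writes the download results back into files. As a rough illustration (the url, hash, and checksum below are made up), an exported item ends up looking like this:

{
    'file_urls': ['http://matplotlib.org/mpl_examples/axes_grid/demo_floating_axes.py'],
    'files': [{
        'url': 'http://matplotlib.org/mpl_examples/axes_grid/demo_floating_axes.py',
        'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.py',  # SHA1-based name
        'checksum': 'b9d9b3a1f2d44527bd4b5a4e2f05d7c2',
    }],
}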
Step 04: Implement ExamplesSpider. First set the starting crawl point:
import scrapy

class ExamplesSpider(scrapy.Spider):
    name = "examples"
    allowed_domains = ["matplotlib.org"]
    start_urls = ['http://matplotlib.org/examples/index.html']

    def parse(self, response):
        pass
The parse method is the parsing function for the example index page. In it, we extract the link to each example page and use it to construct and submit a Request object. The details of link extraction were discussed during the page analysis. The parse method is implemented as follows:
import scrapy
from scrapy.linkextractors import LinkExtractor

class ExamplesSpider(scrapy.Spider):
    name = "examples"
    allowed_domains = ["matplotlib.org"]
    start_urls = ['http://matplotlib.org/examples/index.html']

    def parse(self, response):
        le = LinkExtractor(restrict_css='div.toctree-wrapper.compound',
                           deny='/index.html$')
        print(len(le.extract_links(response)))
        for link in le.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse_example)

    def parse_example(self, response):
        pass
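The LinkExtractor settings can be checked interactively before running the spider. A quick session in scrapy shell (the exact count will vary with the live page):

$ scrapy shell http://matplotlib.org/examples/index.html
>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor(restrict_css='div.toctree-wrapper.compound',
...                    deny='/index.html$')
>>> len(le.extract_links(response))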
In the code above, we set parse_example as the parsing function for example pages; now let's implement it. Each example page contains a download link for the example's source file. In parse_example, we extract the source file's url, put it into a list, and assign the list to the file_urls field of an ExampleItem. The parse_example method is implemented as follows:
import scrapy
from scrapy.linkextractors import LinkExtractor
from ..items import ExampleItem

class ExamplesSpider(scrapy.Spider):
    name = "examples"
    allowed_domains = ["matplotlib.org"]
    start_urls = ['http://matplotlib.org/examples/index.html']

    def parse(self, response):
        le = LinkExtractor(restrict_css='div.toctree-wrapper.compound',
                           deny='/index.html$')
        print(len(le.extract_links(response)))
        for link in le.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse_example)

    def parse_example(self, response):
        href = response.css('a.reference.external::attr(href)').extract_first()
        url = response.urljoin(href)
        example = ExampleItem()
        example['file_urls'] = [url]
        return example
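At this point the spider is already functional and can be run from the project directory:

$ scrapy crawl examples -o examples.json

The source files are downloaded into examples_src/full, but FilesPipeline names each one after the SHA1 hash of its url (e.g. full/0a79c4....py), which makes the files hard to identify.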
Instead, we can use the last directory and file name of each source url (the parenthesized part noted in the page analysis) as the file path. To do so, implement MyFilesPipeline in pipelines.py, overriding the file_path method:
from scrapy.pipelines.files import FilesPipeline
from urllib.parse import urlparse
from os.path import basename, dirname, join

class MyFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        path = urlparse(request.url).path
        return join(basename(dirname(path)), basename(path))
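To see what this override produces, here is the same path logic traced on a sample url (the url is illustrative):

from urllib.parse import urlparse
from os.path import basename, dirname, join

url = 'http://matplotlib.org/mpl_examples/axes_grid/demo_floating_axes.py'
path = urlparse(url).path  # '/mpl_examples/axes_grid/demo_floating_axes.py'
# keep only the last directory plus the file name
print(join(basename(dirname(path)), basename(path)))
# -> axes_grid/demo_floating_axes.py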
Finally, modify the configuration file to use MyFilesPipeline in place of the default FilesPipeline:
ITEM_PIPELINES = {
    # 'scrapy.pipelines.files.FilesPipeline': 1,
    'matplotlib_examples.pipelines.MyFilesPipeline': 1,
}
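Running the spider again, the downloaded example sources now land under examples_src in directories named after their category. A hypothetical excerpt of the result:

$ scrapy crawl examples -o examples.json
$ ls examples_src
animation  api  axes_grid  ...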