Scrapy入门项目

最新推荐文章于 2024-08-28 17:46:53 发布

原创最新推荐文章于 2024-08-28 17:46:53 发布 · 455 阅读

9 ·

CC 4.0 BY-SA版权

文章标签：

#python #Scrapy

爬虫专栏收录该内容

8 篇文章

订阅专栏

项目目标

创建一个 Scrapy 项目。
创建一个 Spider 来抓取站点和处理数据。
通过命令行将抓取的内容导出。
将抓取的内容保存到 MongoDB 数据库。

开发工具

Scrapy 框架
MongoDB
PyMongo 库

创建项目

创建一个 Scrapy 项目，项目文件可以直接用 scrapy 命令生成，命令如下所示：

scrapy startproject tutorial

这个命令将会创建一个名为 tutorial 的文件夹，文件夹结构如下所示：

scrapy.cfg     # Scrapy 部署时的配置文件
tutorial         # 项目的模块，引入的时候需要从这里引入
    __init__.py    
    items.py     # Items 的定义，定义爬取的数据结构
    middlewares.py   # Middlewares 的定义，定义爬取时的中间件
    pipelines.py       # Pipelines 的定义，定义数据管道
    settings.py       # 配置文件
    spiders         # 放置 Spiders 的文件夹
    __init__.py

创建 Spider

在spiders文件夹里创建.py爬虫文件，格式如下：

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass

这里有三个属性 ——name、allowed_domains 和 start_urls，还有一个方法 parse。

name，它是每个项目唯一的名字，用来区分不同的 Spider。
allowed_domains，它是允许爬取的域名，如果初始或后续的请求链接不是这个域名下的，则请求链接会被过滤掉。
start_urls，它包含了 Spider 在启动时爬取的 url 列表，初始请求是由它来定义的。
parse，它是 Spider 的一个方法。默认情况下，被调用时 start_urls 里面的链接构成的请求完成下载执行后，返回的响应就会作为唯一的参数传递给这个函数。该方法负责解析返回的响应、提取数据或者进一步生成要处理的请求。

创建 Item

Item 是保存爬取数据的容器，它的使用方法和字典类似。不过，相比字典，Item 多了额外的保护机制，可以避免拼写错误或者定义字段错误。
创建 Item 需要继承 scrapy.Item 类，并且定义类型为 scrapy.Field 的字段。类似如下定义：

import scrapy

class QuoteItem(scrapy.Item):

    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

解析 Response

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass

parse() 方法的参数 response 是 start_urls 里面的链接爬取后的结果。所以在 parse() 方法中，我们可以直接对 response 变量包含的内容进行解析，比如浏览请求结果的网页源代码，或者进一步分析源代码内容，或者找出结果中的链接而得到下一个请求。
提取的方式可以是 CSS 选择器或 XPath 选择器。

css选择器

例：
源码：

<div class="quote" itemscope=""itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world"> 
            <a class="tag" href="/tag/change/page/1/">change</a>
            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
            <a class="tag" href="/tag/thinking/page/1/">thinking</a>
            <a class="tag" href="/tag/world/page/1/">world</a>
        </div>
    </div>

不同css选择器的返回结果如下：
quote.css(’.text’)

[<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]"data='<span class="text"itemprop="text">“The '>]

quote.css(’.text::text’)

[<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]/text()"data='“The world as we have created it is a pr'>]

quote.css(’.text’).extract()

['<span class="text"itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>']

quote.css(’.text::text’).extract()

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']

quote.css(’.text::text’).extract_first()

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”

使用Item

上文定义了 Item，接下来就要使用它了。Item 可以理解为一个字典，不过在声明的时候需要实例化。然后依次用刚才解析的结果赋值 Item 的每一个字段，最后将 Item 返回即可。
QuotesSpider 的改写如下所示：

import scrapy
from tutorial.items import QuoteItem # 导入库

class QuotesSpider(scrapy.Spider): # 自定义爬虫类 继承scrapy.Spider
    name = "quotes" # 爬虫名字
    allowed_domains = ["quotes.toscrape.com"] # 待爬取网站域名
    start_urls = ['http://quotes.toscrape.com/'] # 待爬取网站的起始网址

    def parse(self, response): # 解析/提取规则
     '''
        <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world">
            <a class="tag" href="/tag/change/page/1/">change</a>
            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
            <a class="tag" href="/tag/thinking/page/1/">thinking</a>
            <a class="tag" href="/tag/world/page/1/">world</a>
        </div>
    </div>
        '''
        quotes = response.css('.quote') # 获取当页所有名言 div标签
        for quote in quotes: 
            item = QuoteItem() # 实例化
            item['text'] = quote.css('.text::text').extract_first() # .text css选择器 ::text获取节点的文本内容，结果是列表，用extract_first()获取第一个元素
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract() # 获取整个列表
            yield item

后续 Request

这一页爬完了，要生成下一页的链接，构造请求时需要用到 scrapy.Request。这里我们传递两个参数 ——url 和 callback，这两个参数的说明如下。

url：它是请求链接。
callback：它是回调函数。当指定了该回调函数的请求完成之后，获取到响应，引擎会将该响应作为参数传递给这个回调函数。回调函数进行解析或生成下一个请求，回调函数如上文的 parse() 所示。

利用选择器得到下一页链接并生成请求，在 parse() 方法后追加如下的代码：

next = response.css('.pager .next a::attr(href)').extract_first()
url = response.urljoin(next)
yield scrapy.Request(url=url, callback=self.parse)

第一句代码首先通过 CSS 选择器获取下一个页面的链接，即要获取 a 超链接中的 href 属性。这里用到了::attr(href) 操作。然后再调用 extract_first() 方法获取内容。

第二句代码调用了 urljoin() 方法，urljoin() 方法可以将相对 URL 构造成一个绝对的 URL。例如，获取到的下一页地址是 /page/2，urljoin() 方法处理后得到的结果就是：http://quotes.toscrape.com/page/2/。

第三句代码通过 url 和 callback 变量构造了一个新的请求，回调函数 callback 依然使用 parse() 方法。这个请求完成后，响应会重新经过 parse 方法处理，得到第二页的解析结果，然后生成第二页的下一页，也就是第三页的请求。这样爬虫就进入了一个循环，直到最后一页。

通过几行代码，我们就轻松实现了一个抓取循环，将每个页面的结果抓取下来了。

改写之后的整个 Spider 类如下所示：

import scrapy
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item
    # 下一个要爬取的页面url
        '''
        <li class="next">
                <a href="/page/2/">Next <span aria-hidden="true">→</span></a>
            </li>
        '''
        next = response.css('.pager .next a::attr("href")').extract_first()
        url = response.urljoin(next) # 生成新的URL
        yield scrapy.Request(url=url, callback=self.parse) # 当请求完成后，引擎将响应作为参数传递给回调函数 继续解析

运行

进入目录，运行如下命令：

scrapy crawl quotes

保存到文件

Scrapy 提供的 Feed Exports 可以轻松将抓取结果输出。例如，我们想将上面的结果保存成 JSON 文件，可以执行如下命令：

scrapy crawl quotes -o quotes.json

命令运行后，项目内多了一个 quotes.json 文件，文件包含了刚才抓取的所有内容，内容是 JSON 格式。
另外我们还可以每一个 Item 输出一行 JSON，输出后缀为 jl，为 jsonline 的缩写，命令如下所示：

scrapy crawl quotes -o quotes.jl

或

scrapy crawl quotes -o quotes.jsonlines

输出格式还支持很多种，例如 csv、xml、pickle、marshal 等，还支持 ftp、s3 等远程输出，另外还可以通过自定义 ItemExporter 来实现其他的输出。

例如，下面命令对应的输出分别为 csv、xml、pickle、marshal 格式以及 ftp 远程输出：

scrapy crawl quotes -o quotes.csv
scrapy crawl quotes -o quotes.xml
scrapy crawl quotes -o quotes.pickle
scrapy crawl quotes -o quotes.marshal
scrapy crawl quotes -o ftp://user:pass@ftp.example.com/path/to/quotes.csv

其中，ftp 输出需要正确配置用户名、密码、地址、输出路径，否则会报错。

通过 Scrapy 提供的 Feed Exports，我们可以轻松地输出抓取结果到文件。对于一些小型项目来说，这应该足够了。不过如果想要更复杂的输出，如输出到数据库等，我们可以使用 Item Pileline 来完成。

使用 Item Pipeline

如果想进行更复杂的操作，如将结果保存到 MongoDB 数据库，或者筛选某些有用的 Item，则我们可以定义 Item Pipeline 来实现。

Item Pipeline 为项目管道。当 Item 生成后，它会自动被送到 Item Pipeline 进行处理，我们常用 Item Pipeline 来做如下操作：

清洗 HTML 数据
验证爬取数据，检查爬取字段
查重并丢弃重复内容
将爬取结果储存到数据库

要实现 Item Pipeline 很简单，只需要定义一个类并实现 process_item() 方法即可。启用 Item Pipeline 后，Item Pipeline 会自动调用这个方法。process_item() 方法必须返回包含数据的字典或 Item 对象，或者抛出 DropItem 异常。

**process_item() 方法有两个参数。**一个参数是 item，每次 Spider 生成的 Item 都会作为参数传递过来。另一个参数是 spider，就是 Spider 的实例。

实现一个 Item Pipeline，筛掉 text 长度大于 50 的 Item，并将结果保存到 MongoDB。代码如下：

import pymongo
from scrapy.exceptions import DropItem

class TextPipeline(object):
    def __init__(self):
        self.limit = 50

    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit: # 存在item 的 text 属性，判断长度是否大于 50
                item['text'] = item['text'][0:self.limit].rstrip() + '...' # 大于50，那就截断然后拼接省略号
            return item
        else:
            return DropItem('Missing Text') # 不存在 item 的 text 属性，抛出 DropItem 异常

# 将处理后的 item 存入 MongoDB，定义另外一个 Pipeline
# 实现另一个类 MongoPipeline
class MongoPipeline(object):
    def __init__(self,mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    # 从配置文件setting.py中获取mongo_uri，mongo_db 需要自己在setting.py中定义
    # MongoDB 连接需要的地址(mongo_uri)和数据库名称(mongo_db)
    @classmethod
    def from_crawler(cls, crawler):
        return cls(mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    # 连接并打开数据库
    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    # 该方法必须定义，而且必须要有item和spider两个参数 其他方法可以随便写
    def process_item(self, item, spider):
        name = item.__class__.__name__
        self.db[name].insert(dict(item))  # 将数据插入集合 要转换为字典形式 键值对
        return item

    # 关闭连接
    def close_spider(self, spider):
        self.client.close()

MongoPipeline 类实现了 API 定义的另外几个方法。

from_crawler，这是一个类方法，用 @classmethod 标识，是一种依赖注入的方式，方法的参数就是 crawler，通过 crawler 这个我们可以拿到全局配置的每个配置信息，在全局配置 settings.py 中我们可以定义 MONGO_URI 和 MONGO_DB 来指定 MongoDB 连接需要的地址和数据库名称，拿到配置信息之后返回类对象即可。所以这个方法的定义主要是用来获取 settings.py 中的配置的。
open_spider，当 Spider 被开启时，这个方法被调用。在这里主要进行了一些初始化操作。
close_spider，当 Spider 被关闭时，这个方法会调用，在这里将数据库连接关闭。

最主要的 process_item() 方法则执行了数据插入操作。

定义好 TextPipeline 和 MongoPipeline 这两个类后，我们需要在 settings.py 中使用它们。MongoDB 的连接信息还需要定义。
我们在 settings.py 中加入如下内容：

# 赋值 ITEM_PIPELINES 字典，键名是 Pipeline 的类名称，键值是调用优先级，是一个数字，数字越小则对应的 Pipeline 越先被调用。
ITEM_PIPELINES = {
   'tutorial.pipelines.TextPipeline': 300,
   'tutorial.pipelines.MongoPipeline': 400,
}
MONGO_URI='localhost'
MONGO_DB='tutorial