scrapy1

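This post walks through installing, configuring, and using the Scrapy crawler at a basic level, covering project creation, data extraction, link following, and output formats. It uses examples to show how to parse page content with CSS and XPath, and is aimed at beginners who want to get started quickly.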

1. Installing Scrapy

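I installed Scrapy 1.3.3. Installing it through PyCharm avoids a lot of trouble. Scrapy needs VC++ 9.0 installed first; I could not find that installer, so I installed VCForPython27 instead, and the installation succeeded. (I am on Python 3.5 and this package still installed.... never mind, I just went with it.)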

2. Reading the official Scrapy 1.3 documentation (URL: https://docs.scrapy.org/en/latest/intro/tutorial.html)

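I had barely used Python before and only had some experience writing simple crawlers in Java, so my plan is to read through the documentation, type out the examples myself, and write down the parts I find useful for later reference. It is also a way to get familiar with Python.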

Scrapy Tutorial

2.1 Creating a project

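Open a command-line window in the folder where you want the project and run: scrapy startproject tutorial (tutorial is the project name). This creates the project automatically.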

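Notes on the first example:

First, subclass scrapy.Spider. Scrapy also has a CrawlSpider class, which differs in that it follows links according to a set of rules.

name: the unique identifier of the spider; it must not clash with the name of another spider (within the same project).

Define a start_requests method that hands each request to parse explicitly. This form is easy to follow, and it lines up with the start_urls = [] form shown later, where parse is called by default.

yield scrapy.Request(url=url, callback=self.parse): I am not yet clear on yield and will come back to it. The Request object describes a request; Scrapy sends it and gets back a response, which is the content at that URL. The response is passed to the function given as callback, which parses it. The mechanism is flexible: different URLs can be handled by different callbacks.

parse() parses the returned content.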

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
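On yield, which the notes above leave open: a function that contains yield is a generator, and calling it runs nothing until a consumer asks for values. The sketch below uses only plain Python; start_requests_sketch and the (url, callback name) tuples are stand-ins made up for illustration, not Scrapy objects.

```python
def start_requests_sketch(urls):
    """Generator: hands out one (url, callback_name) pair at a time."""
    for url in urls:
        # Execution pauses here until the consumer asks for the next value.
        yield (url, "parse")


urls = [
    "http://quotes.toscrape.com/page/1/",
    "http://quotes.toscrape.com/page/2/",
]

gen = start_requests_sketch(urls)  # nothing has run yet
first = next(gen)                  # runs the body up to the first yield
rest = list(gen)                   # drains whatever is left
```

In the spider, Scrapy plays the role of the consumer: it pulls each yielded Request from start_requests and schedules it.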

The start_urls = [] form, where parse() is called by default:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
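Both versions of parse build the file name from the URL with split("/")[-2]. A small sketch of that step in isolation, with filename_for as a helper name made up here; it also shows that the index depends on the trailing slash.

```python
def filename_for(url):
    # "http://quotes.toscrape.com/page/1/".split("/") ends with [..., "page", "1", ""],
    # so index -2 is the page number only when the URL ends with a slash.
    page = url.split("/")[-2]
    return "quotes-%s.html" % page


with_slash = filename_for("http://quotes.toscrape.com/page/1/")
without_slash = filename_for("http://quotes.toscrape.com/page/1")
```

Here with_slash is 'quotes-1.html', while without_slash comes out as 'quotes-page.html', so this trick only suits URLs of the exact shape used in the tutorial.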

2.2 How to run our spider

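Change into the tutorial directory you just created, open a command-line window, and run: scrapy crawl quotes (quotes is the spider name). Even when the spider defines no output of its own, scrapy crawl quotes -o quotes.json writes the scraped data out as JSON. Scrapy supports many output formats; I will look into those later.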

Alternatively, create run.py under the project's spiders directory, put the following in it, and run it.

from scrapy import cmdline

if __name__ == "__main__":  # guard: importing this file must not start a crawl
    cmdline.execute("scrapy crawl quotes".split())

2.3 Extracting data

Scrapy offers two ways to parse a page, that is, the content of the response: CSS and XPath.

Extracting with CSS (the page being parsed is 'http://quotes.toscrape.com/page/1/'):

Get the text of the title:

response.css('title::text').extract_first()

Get the <div class="quote"> elements:

response.css("div.quote")

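Two things to note: 1. appending ::text extracts the element's text; without it you get the whole element, tags included. 2. extract_first() returns only the first match, while extract() returns all matches as a list.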

A regular expression can be added to narrow the result further:

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
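The three results above can be reproduced with Python's own re module. This is only a rough analogy, not Scrapy's implementation; re_flat is a helper name made up here, and it assumes the behaviour visible in the session above: whole matches when the pattern has no groups, otherwise the groups flattened into one list.

```python
import re


def re_flat(pattern, text):
    """Return whole matches, or the captured groups flattened into one list."""
    results = []
    for match in re.finditer(pattern, text):
        groups = match.groups()
        if groups:
            results.extend(groups)          # pattern has groups: keep only those
        else:
            results.append(match.group(0))  # no groups: keep the whole match
    return results


title = "Quotes to Scrape"
whole = re_flat(r"Quotes.*", title)
word = re_flat(r"Q\w+", title)
pair = re_flat(r"(\w+) to (\w+)", title)
```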

Parsing with XPath:

Get the title element and its text:

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'

Of the two, I plan to use XPath wherever I can from now on, because the official docs recommend it....
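To get a feel for XPath-style queries without starting a crawl, here is a minimal sketch using the standard library's xml.etree.ElementTree. It supports only a subset of XPath (no text(), for instance), so it stands in for Scrapy's selectors rather than reproducing them. SAMPLE is a simplified, hand-written fragment that only imitates the quote markup, and extract_quotes is a name made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed stand-in for the quote markup (an assumption, not the real page).
SAMPLE = """
<html><body>
  <div class="quote">
    <span class="text">First quote</span>
    <span>by <small class="author">Author A</small></span>
    <div class="tags"><a class="tag">t1</a><a class="tag">t2</a></div>
  </div>
  <div class="quote">
    <span class="text">Second quote</span>
    <span>by <small class="author">Author B</small></span>
    <div class="tags"><a class="tag">t3</a></div>
  </div>
</body></html>
"""


def extract_quotes(markup):
    root = ET.fromstring(markup.strip())
    items = []
    # .//div[@class='quote'] selects every quote block anywhere under the root.
    for quote in root.findall(".//div[@class='quote']"):
        items.append({
            "text": quote.find("./span[@class='text']").text,
            "author": quote.find(".//small[@class='author']").text,
            "tags": [a.text for a in quote.findall("./div[@class='tags']/a[@class='tag']")],
        })
    return items


quotes = extract_quotes(SAMPLE)
```

Each query is relative to the quote element it is called on, which is the same pattern the spider uses when it calls quote.css(...) inside the loop.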

Extracting the information in this example:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

Here yield emits one dictionary per quote, and Scrapy treats each dictionary as a scraped item.

2.4 Follow Links

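This means taking the links on the current page and requesting their content in turn, as with next_page, so that the crawl loops through the pages.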

For example, this is the HTML of the next-page link:

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

To get the href:

response.css('li.next a::attr(href)').extract_first()

The complete code for scraping and moving on to the next page:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
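The href in the pager is relative ("/page/2/"), which is why the spider passes it through response.urljoin before requesting it. The sketch below imitates that loop offline with urllib.parse.urljoin. NEXT_LINKS is a made-up table standing in for what the selector would return on each page, and follow_pages is a hypothetical helper, so this shows the control flow only, not a real crawl.

```python
from urllib.parse import urljoin

# Hypothetical data: page URL -> relative href of its "next" link (None on the last page).
NEXT_LINKS = {
    "http://quotes.toscrape.com/page/1/": "/page/2/",
    "http://quotes.toscrape.com/page/2/": "/page/3/",
    "http://quotes.toscrape.com/page/3/": None,
}


def follow_pages(start_url):
    """Yield each page URL in turn, resolving the relative next link against it."""
    url = start_url
    while url is not None:
        yield url
        href = NEXT_LINKS.get(url)
        # Same check as the spider: stop when there is no next link.
        url = urljoin(url, href) if href is not None else None


visited = list(follow_pages("http://quotes.toscrape.com/page/1/"))
```

The loop ends for the same reason the spider does: the last page has no li.next link, so next_page is None and no further request is yielded.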

The documentation gives one more example, which is closer to real development work:

import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_author)

        # follow pagination links
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

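Scrapy also filters out duplicate author-page links for you automatically. (Filtering duplicate links is Scrapy's default behaviour. In a real crawling system, I think the URLs should additionally be stored in a database and filtered against it, to guard against the program terminating unexpectedly.)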
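The database-backed filtering suggested above could look roughly like the sketch below, which uses sqlite3 from the standard library. SeenUrls, its seen table, and add_if_new are names invented for this illustration; none of this is part of Scrapy, and nothing here is wired into a spider.

```python
import sqlite3


class SeenUrls:
    """Remember visited URLs in SQLite so the set can survive a restart."""

    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" to keep the data between runs.
        self.conn = sqlite3.connect(path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")

    def add_if_new(self, url):
        """Record the URL; return True only the first time it is seen."""
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO seen (url) VALUES (?)", (url,)
        )
        self.conn.commit()
        return cur.rowcount == 1  # 0 rows changed means the URL was already stored


seen = SeenUrls()
first_time = seen.add_if_new("http://quotes.toscrape.com/author/Albert-Einstein")
second_time = seen.add_if_new("http://quotes.toscrape.com/author/Albert-Einstein")
```

A callback could check add_if_new(url) before yielding a request for that URL. Scrapy itself also has a JOBDIR setting for persisting crawl state between runs, which may be worth trying before writing a custom filter.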

Next, I will look at how CSS and XPath are used in more detail and write a simple page crawler.

Reposted from: https://my.oschina.net/u/3411375/blog/875765