A Small Scrapy Practice Project

Development Environment

Windows 10 + Python 3.6 + Scrapy 1.5 + MySQL 5.7

Environment Setup

Scrapy installation: https://blog.youkuaiyun.com/nima1994/article/details/74931621

For everything else, search online for resources.

Scrapy Architecture

Official documentation: https://doc.scrapy.org/en/latest/topics/architecture.html

Goal

Crawl quotes from http://quotes.toscrape.com and save the data to a MySQL database.

Walkthrough

1. Preparation

Create the quotes_t table in the database:

create table quotes_t(
s_id int not null auto_increment,
quotes varchar(4000),
author varchar(50),
author_link varchar(100),
tags varchar(100),
primary key (s_id)
)engine=innodb charset='utf8';

Scout the site with the scrapy shell tool

D:\quotes>scrapy shell "http://quotes.toscrape.com"
-------- (output omitted) ----------
In [1]: response.status
Out[1]: 200
  • In a Windows environment the URL must be wrapped in double quotes, otherwise the command fails.

Inspect the site with the browser developer tools (F12): each page contains 10 quotes.

Get all the quote blocks:

In [2]: quotes = response.css('div.quote')
In [3]: quotes
Out[3]:
[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>]

Here is a single quote block:

In [8]: quote = quotes[0]
In [9]: quote
Out[9]: <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>

The corresponding HTML of this quote block is:

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world"> 
            <a class="tag" href="/tag/change/page/1/">change</a>
            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>      
            <a class="tag" href="/tag/thinking/page/1/">thinking</a>     
            <a class="tag" href="/tag/world/page/1/">world</a>
        </div>
    </div>

Extract the quote text:

In [10]: quote.css('span.text::text').extract_first()
Out[10]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

Using extract_first() here has two advantages:

  • It returns the first match as a str rather than a list;
  • It does not raise an error when nothing matches (a quick comparison follows below).
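
A quick comparison of extract() and extract_first() (the class name nosuch is made up here only to show the no-match case):

quote.css('span.text::text').extract()          # list containing one str
quote.css('span.text::text').extract_first()    # the str itself
quote.css('span.nosuch::text').extract()        # [] when nothing matches
quote.css('span.nosuch::text').extract_first()  # None, no exception raised
quote.css('span.nosuch::text').extract_first(default='')  # fallback value instead of None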

Extract the author:

In [11]: quote.css('small.author::text').extract_first()
Out[11]: 'Albert Einstein'

Extract the author's bio link:

In [13]: quote.css('a::attr(href)').extract_first()
Out[13]: '/author/Albert-Einstein'

Extract the tags:

In [14]: quote.css('div.tags a.tag::text').extract()
Out[14]: ['change', 'deep-thoughts', 'thinking', 'world']

2. Create a quotes project

On the command line, switch to drive D and run:

C:\Users\asus>d:

D:\>scrapy startproject quotes
New Scrapy project 'quotes', using template directory 'd:\\programs\\python\\anaconda3\\lib\\site-packages\\scrapy\\templates\\project', created in:
    D:\quotes
You can start your first spider with:
    cd quotes
    scrapy genspider example example.com

The project has been created successfully; the generated directory structure is sketched below.
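
This is roughly what the default Scrapy project template produces (the exact file list may vary slightly between versions):

quotes/
    scrapy.cfg            # deploy configuration
    quotes/               # the project's Python module
        __init__.py
        items.py          # item definitions (step 3)
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines (step 6)
        settings.py       # project settings (step 5)
        spiders/          # spiders live here (step 4)
            __init__.py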

Enter the quotes directory on the command line and generate the spider:

D:\>cd quotes

D:\quotes>scrapy genspider myspider quotes.toscrape.com
Created spider 'myspider' using template 'basic' in module:
  quotes.spiders.myspider

After this command, the new spider file quotes/spiders/myspider.py appears in the project.

3. Write items.py

import scrapy
class QuotesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    quotes = scrapy.Field()
    author = scrapy.Field()
    author_link = scrapy.Field()
    tags = scrapy.Field()
  • The fields to be saved are defined here (a small usage sketch follows).
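
A scrapy.Item behaves like a dict that only accepts its declared fields; a minimal sketch of how QuotesItem is used (standalone, assuming the project is on the Python path):

from quotes.items import QuotesItem

item = QuotesItem()
item['author'] = 'Albert Einstein'   # declared field, works
print(item['author'])                # 'Albert Einstein'
# item['year'] = 1879                # raises KeyError: field not declared in QuotesItem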

4. Write myspider.py

Write the actual parsing logic, i.e. the same extraction we just tried in the scrapy shell:

# -*- coding: utf-8 -*-
import scrapy
from quotes.items import QuotesItem
class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    page = 0

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuotesItem()  # build a fresh item for each quote
            item['quotes'] = quote.css('span.text::text').extract_first().replace("'", "\\'")
            item['author'] = quote.css('small.author::text').extract_first()
            link = quote.css('a::attr(href)').extract_first()
            item['author_link'] = response.urljoin(link)
            item['tags'] = ';'.join(quote.css('div.tags a.tag::text').extract())
            yield item
        self.page += 1
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page and self.page < 3:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
  • The page counter limits the crawl to the first 3 pages;
  • The quote text may contain single quotes, which must be escaped, otherwise the database insert fails;
  • Tags are joined with semicolons. A quick way to test the spider before adding the database pipeline is shown below.
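
Before wiring up MySQL, the spider can be tested on its own with Scrapy's built-in feed export (quotes.json is just an arbitrary output file name):

D:\quotes>scrapy crawl myspider -o quotes.json

The generated JSON file should contain the same fields declared in QuotesItem, which makes it easy to verify the selectors before touching the database.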

5. Write settings.py

ITEM_PIPELINES = {
    'quotes.pipelines.QuotesPipeline': 300
}
HOST = 'localhost'
DATABASE = 'test'
USER = 'winds'
PASSWORD = 'winds'
PORT = 3306
  • Register the item pipeline handler (the class defined in pipelines.py);
  • Define the MySQL connection settings.

6. Write pipelines.py

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql
class QuotesPipeline(object):
    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port
        self.db = None
        self.cursor = None

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get("HOST"),
            database=crawler.settings.get("DATABASE"),
            user=crawler.settings.get("USER"),
            password=crawler.settings.get("PASSWORD"),
            port=crawler.settings.get("PORT")
        )

    def open_spider(self, spider):
        self.db = pymysql.connect(host=self.host, user=self.user, password=self.password, database=self.database, port=self.port, charset='utf8')
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        sql = "insert into quotes_t(quotes,author,author_link,tags) values('%s','%s','%s','%s')" \
              % (item['quotes'], item['author'], item['author_link'], item['tags'])
        print(sql)
        self.cursor.execute(sql)
        self.db.commit()
        return item

    def close_spider(self, spider):
        self.db.close()
  • __init__ stores the basic database connection information;
  • from_crawler(cls, crawler) reads the connection settings configured in settings.py;
  • open_spider(self, spider) opens the database connection;
  • process_item(self, item, spider) inserts each item into the database (a parameterized alternative is sketched below);
  • close_spider(self, spider) closes the connection.
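
Building the SQL with % string formatting works here because single quotes are escaped in the spider, but a parameterized query lets pymysql do the escaping itself; a minimal sketch of an alternative process_item (same table, same fields):

    def process_item(self, item, spider):
        # pymysql substitutes and escapes the values safely
        sql = "insert into quotes_t(quotes, author, author_link, tags) values(%s, %s, %s, %s)"
        self.cursor.execute(sql, (item['quotes'], item['author'], item['author_link'], item['tags']))
        self.db.commit()
        return item

With this variant, the .replace("'", "\\'") call in the spider would no longer be needed.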

7. Run the spider

Database state before the crawl:

C:\Users\asus>mysql -u winds -p
Enter password: *****

mysql> use test
Database changed

mysql> select count(1) from quotes_t;
+----------+
| count(1) |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)

Run the crawler:

D:\quotes>scrapy list
myspider

D:\quotes>scrapy crawl myspider
----------- (output omitted) -------------
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2018, 6, 15, 3, 1, 28, 553687)}
2018-06-15 11:01:30 [scrapy.core.engine] INFO: Spider closed (finished)
  • The command must be run from inside the quotes project directory.

Check the database again; the data has been written successfully:

mysql> select count(1) from quotes_t;
+----------+
| count(1) |
+----------+
|       30 |
+----------+
1 row in set (0.00 sec)

 

The above is a simple Scrapy crawler example; there is much more to dig into.

 

Source code: https://github.com/sunwingshao/PythonCrawl/tree/master/quotes

To learn Scrapy in full, read the official documentation: https://doc.scrapy.org/en/latest/#getting-help

 
