Development Environment
Windows 10 + Python 3.6 + Scrapy 1.5 + MySQL 5.7
Environment Setup
Installing Scrapy: https://blog.youkuaiyun.com/nima1994/article/details/74931621
For everything else, search online for resources.
Scrapy Architecture
Official documentation: https://doc.scrapy.org/en/latest/topics/architecture.html
Objective
Crawl quotes from http://quotes.toscrape.com and save the data to a MySQL database.
Procedure
1. Preparation
Create the quotes_t table in the database:
create table quotes_t(
    s_id int not null auto_increment,
    quotes varchar(4000),
    author varchar(50),
    author_link varchar(100),
    tags varchar(100),
    primary key (s_id)
) engine=innodb charset='utf8';
Scout the site with the scrapy shell tool:
D:\quotes>scrapy shell "http://quotes.toscrape.com"
--------省略----------
In [1]: response.status
Out[1]: 200
- In a Windows environment the URL must be wrapped in double quotes, otherwise the command fails with an error.
Inspecting the page with the browser developer tools (F12) shows that each page contains 10 quotes.
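As a quick sanity check, that count can be verified directly in the same shell session (the selector is the one used throughout this post; the exact In/Out numbering will differ):
len(response.css('div.quote'))   # should return 10 on the first page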
Get all quotes:
In [2]: quotes = response.css('div.quote')
In [3]: quotes
Out[3]:
[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>]
Here is one of these quote blocks in detail:
In [8]: quote = quotes[0]
In [9]: quote
Out[9]: <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world">
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>
Get the quote text:
In [10]: quote.css('span.text::text').extract_first()
Out[10]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
Using extract_first() here has two advantages:
- it returns the first match as a str rather than a list;
- it does not raise an exception even when nothing matches, as the quick check below illustrates.
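For example, with a selector that matches nothing (a made-up class name, used purely for illustration), extract_first() simply returns None, whereas indexing the result of extract() would raise an IndexError:
quote.css('span.no-such-class::text').extract_first()   # returns None, no exception
quote.css('span.no-such-class::text').extract()[0]      # raises IndexError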
Get the author:
In [11]: quote.css('small.author::text').extract_first()
Out[11]: 'Albert Einstein'
Get the author's bio link:
In [13]: quote.css('a::attr(href)').extract_first()
Out[13]: '/author/Albert-Einstein'
Get the tags:
In [14]: quote.css('div.tags a.tag::text').extract()
Out[14]: ['change', 'deep-thoughts', 'thinking', 'world']
2. Create a quotes project
At the command line, switch to drive D and run:
C:\Users\asus>d:
D:\>scrapy startproject quotes
New Scrapy project 'quotes', using template directory 'd:\\programs\\python\\anaconda3\\lib\\site-packages\\scrapy\\templates\\project', created in:
D:\quotes
You can start your first spider with:
cd quotes
scrapy genspider example example.com
The output shows the project has been created successfully; the resulting directory structure is sketched below.
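Roughly, this is the layout that scrapy startproject generates with the default Scrapy 1.5 template:
quotes/
    scrapy.cfg            # deploy configuration file
    quotes/               # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where the spiders live
            __init__.py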
Change into the quotes directory and generate a spider class:
D:\>cd quotes
D:\quotes>scrapy genspider myspider quotes.toscrape.com
Created spider 'myspider' using template 'basic' in module:
quotes.spiders.myspider
After this command, a new spider file myspider.py appears under quotes/spiders/.
3. Write items.py
import scrapy


class QuotesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    quotes = scrapy.Field()
    author = scrapy.Field()
    author_link = scrapy.Field()
    tags = scrapy.Field()
- The fields that need to be saved are declared here (see the short sketch below).
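A quick sketch of how a declared Item behaves, run from inside the project (e.g. in a scrapy shell session): it supports dict-like access, and assigning to a field that was not declared in the class raises a KeyError:
In [1]: from quotes.items import QuotesItem
In [2]: item = QuotesItem(author='Albert Einstein')
In [3]: item['author']
Out[3]: 'Albert Einstein'
In [4]: item['birthday'] = '1879-03-14'   # not declared above -> KeyError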
4. Write myspider.py
Write the actual parsing logic, i.e. exactly what we just worked out in the scrapy shell:
# -*- coding: utf-8 -*-
import scrapy
from quotes.items import QuotesItem


class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    page = 0

    def parse(self, response):
        for quote in response.css('div.quote'):
            # create a fresh item for every quote block
            item = QuotesItem()
            item['quotes'] = quote.css('span.text::text').extract_first().replace("'", "\\'")
            item['author'] = quote.css('small.author::text').extract_first()
            link = quote.css('a::attr(href)').extract_first()
            item['author_link'] = response.urljoin(link)
            item['tags'] = ';'.join(quote.css('div.tags a.tag::text').extract())
            yield item
        self.page += 1
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page and self.page < 3:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
- Crawling is limited to 3 pages here; the next-page selector used for pagination can be double-checked in the shell, as shown below.
- The quote text may contain single quotes, which must be escaped, otherwise the database insert raises an error.
- The tags are joined with a semicolon separator.
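Before running the full crawl, the pagination selector can be tried in the same scrapy shell session; on the first page it should return something like '/page/2/':
response.css('li.next a::attr(href)').extract_first()   # e.g. '/page/2/'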
5. Edit settings.py
ITEM_PIPELINES = {
    'quotes.pipelines.QuotesPipeline': 300,
}
HOST = 'localhost'
DATABASE = 'test'
USER = 'winds'
PASSWORD = 'winds'
PORT = 3306
- Register the pipeline handler (the class in pipelines.py).
- Define the MySQL connection settings; these are ordinary Scrapy settings, so they can also be overridden from the command line, as shown below.
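For example (assuming the myspider spider created above), any of these custom keys can be overridden at run time with scrapy's -s option, without touching settings.py:
scrapy crawl myspider -s HOST=127.0.0.1 -s USER=winds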
6. Write pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql


class QuotesPipeline(object):
    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port
        self.db = None
        self.cursor = None

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get("HOST"),
            database=crawler.settings.get("DATABASE"),
            user=crawler.settings.get("USER"),
            password=crawler.settings.get("PASSWORD"),
            port=crawler.settings.get("PORT")
        )

    def open_spider(self, spider):
        self.db = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                  database=self.database, port=self.port, charset='utf8')
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        sql = "insert into quotes_t(quotes,author,author_link,tags) values('%s','%s','%s','%s')" \
              % (item['quotes'], item['author'], item['author_link'], item['tags'])
        print(sql)
        self.cursor.execute(sql)
        self.db.commit()
        return item

    def close_spider(self, spider):
        self.db.close()
- The __init__ constructor stores the basic database connection information.
- from_crawler(cls, crawler) reads the database connection settings configured in settings.py.
- open_spider(self, spider) opens the database connection.
- process_item(self, item, spider) inserts each item into the database (a parameterized-query alternative is sketched after this list).
- close_spider(self, spider) closes the database connection.
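As an alternative sketch (not what the code above does), process_item could let pymysql fill in the values through query parameters instead of building the SQL string by hand; this also makes the manual single-quote escaping in the spider unnecessary:
    def process_item(self, item, spider):
        # placeholders are substituted by the driver, which handles quoting/escaping
        sql = "insert into quotes_t(quotes, author, author_link, tags) values(%s, %s, %s, %s)"
        self.cursor.execute(sql, (item['quotes'], item['author'], item['author_link'], item['tags']))
        self.db.commit()
        return item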
7. Run the spider
Check the database first (the table is still empty):
C:\Users\asus>mysql -u winds -p
Enter password: *****
mysql> use test
Database changed
mysql> select count(1) from quotes_t;
+----------+
| count(1) |
+----------+
| 0 |
+----------+
1 row in set (0.00 sec)
Run the spider:
D:\quotes>scrapy list
myspider
D:\quotes>scrapy crawl myspider
-----------省略-------------
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2018, 6, 15, 3, 1, 28, 553687)}
2018-06-15 11:01:30 [scrapy.core.engine] INFO: Spider closed (finished)
- The crawl command must be executed inside the quotes project directory.
Check the database again; the data has been written successfully:
mysql> select count(1) from quotes_t;
+----------+
| count(1) |
+----------+
| 30 |
+----------+
1 row in set (0.00 sec)
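To spot-check the content rather than just the row count, a query along these lines can be used (output omitted here):
select quotes, author, tags from quotes_t limit 3;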
The above is a simple Scrapy crawler example; there is much more to explore.
Source code: https://github.com/sunwingshao/PythonCrawl/tree/master/quotes
For everything Scrapy offers, see the official documentation: https://doc.scrapy.org/en/latest/#getting-help