Scrapy installation
pip install scrapy
Windows:
On Windows, Scrapy currently supports only Python 2.7. First install the Microsoft Visual C++ Compiler for Python 2.7, then install lxml:
pip install lxml
# If pip fails to build lxml, download the matching .whl file from http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml — wait for the page to finish loading, press Ctrl+F to search for "lxml", and pick the file under that entry that matches your Python version and architecture.
# If wheel is not installed yet, install it first:
pip install wheel
# Change into the download directory and run the install command; my file was lxml-3.6.4-cp27-cp27m-win_amd64.whl
pip install lxml-3.6.4-cp27-cp27m-win_amd64.whl
Ubuntu:
sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
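Either way, a quick sanity check is to print the installed version afterwards:
scrapy version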
Scrapy data flow
Scrapy fetches pages with its downloader; we only need to specify the URLs to crawl. The scraped content then enters the item pipeline, where user-defined processing takes place. The framework encapsulates downloading, scheduling, and the other low-level machinery, which makes writing a crawler much simpler.
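As a minimal sketch of this division of labor (the spider name and URL below are placeholders of my own, not part of this project), a spider only declares what to fetch and how to turn a response into items; downloading and scheduling happen inside the framework:
import scrapy

class MinimalSpider(scrapy.Spider):  # hypothetical example spider
    name = 'minimal'
    start_urls = ['http://example.com']  # placeholder URL

    def parse(self, response):
        # Scrapy has already downloaded the page; we only extract data here
        yield {'title': response.css('title::text').extract_first()}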
Debugging Scrapy in PyCharm
On the command line a spider is launched directly via the scrapy command, but debugging a Scrapy application in an IDE requires Scrapy's Python entry script, so that a debug configuration can point at it. That script is $PYTHON_HOME\Lib\site-packages\scrapy\cmdline.py; fill the parameters field of the run configuration with the same arguments you would pass on the command line.
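Alternatively, a small runner script in the project root gives you a debug entry point without touching site-packages. This is a sketch (the file name run.py is my own choice), equivalent to running scrapy crawl lianjia:
# run.py -- set breakpoints anywhere in the project and debug this file
from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'lianjia'])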
Scrapy usage
Creating a project
scrapy startproject ppspider
Project directory structure
ppspider/
    scrapy.cfg            # deploy configuration file
    ppspider/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
In Python, configuration is code too: settings.py holds the project-wide configuration that applies to every spider in the project, and you can also define your own properties there, such as database connection settings.
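For example, here is a sketch of the project-level settings this walkthrough relies on. DB_SERVER and DB_CONNECT are this project's own keys (they are read back later in PersistentPipeline); all values are placeholders:
# settings.py
BOT_NAME = 'ppspider'
SPIDER_MODULES = ['ppspider.spiders']
NEWSPIDER_MODULE = 'ppspider.spiders'

# custom keys consumed by PersistentPipeline; values are placeholders
DB_SERVER = 'MySQLdb'          # DB-API driver module name handed to adbapi.ConnectionPool
DB_CONNECT = {
    'host': '127.0.0.1',
    'port': 3306,
    'user': 'root',
    'passwd': '******',
    'db': 'ppspider',
    'charset': 'utf8',
}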
Defining a spider
Defining a spider involves three steps: first, implement your own spider under the project's spiders directory, specifying the data source to crawl and the spider's custom_settings; second, define the structured item in items.py, which is what flows through the pipeline; finally, define the data-processing pipeline in pipelines.py.
LianjiaSpider
To analyze housing prices across the districts of Shanghai, we crawl listing information (layout, floor area, address, price, and so on) from the Lianjia site into MySQL, then use SQL to filter and aggregate the price data.
1. Analyze the URLs and content of the target pages, and define the spider entry point
The spider inherits from scrapy.Spider. custom_settings customizes per-spider configuration, and start_requests is the entry point for downloading: it defines the requests along with the callback invoked on each response to extract structured information. When extracting from HTML, nodes can be located and data pulled out with either CSS selectors or XPath.
import scrapy
from ppspider.items import House

class LianJiaSpider(scrapy.Spider):
    name = 'lianjia'
    url = 'http://sh.lianjia.com/ershoufang/d{}'
    custom_settings = {
        'ITEM_PIPELINES': {
            'ppspider.pipelines.persistent_pipeline.PersistentPipeline': 800,
        }
    }

    def start_requests(self):
        for i in xrange(3501, 4000):  # crawling too many pages too fast may get you blocked
            yield scrapy.Request(self.url.format(i), callback=self.get_house_info)

    def get_house_info(self, response):
        house_lst = response.css('#house-lst')
        house_lst_li = house_lst.xpath('li')
        for house_info in house_lst_li:
            house = House()  # create a fresh item for each listing
            info = house_info.css('.info-panel')
            house['id'] = info.xpath('h2/a/@key').extract()[0]
            house['title'] = info.xpath('h2/a/text()').extract()[0]
            col1 = info.css('.col-1')
            where = col1.css('.where')
            house['cell'] = where.xpath('a/span/text()').extract()[0]
            house['area'] = u','.join(where.xpath('span/text()').extract())
            other = col1.css('.other').css('.con')
            house['location'] = u','.join(other.xpath('a/text()').extract())
            chanquan = col1.css('.chanquan').css('.agency').css('.view-label')
            house['chanquan'] = u','.join(chanquan.xpath('span/span/text()').extract())
            col3 = info.css('.col-3')
            house['total_price'] = col3.css('.num').xpath('text()').extract()[0]
            house['price'] = col3.css('.price-pre').xpath('text()').extract()[0]
            yield house
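The selectors above are easiest to work out interactively before wiring them into the spider; scrapy shell fetches a page and drops you into a session with response already populated (the page number here is just one from the crawled range):
scrapy shell 'http://sh.lianjia.com/ershoufang/d3501'
>>> response.css('#house-lst').xpath('li')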
2. Define the listing's attributes in items.py, for use in the response callback
import scrapy
from scrapy import Field

class House(scrapy.Item):
    id = Field()
    title = Field()
    cell = Field()
    area = Field()
    location = Field()
    chanquan = Field()
    total_price = Field()
    price = Field()
3. The item pipeline decouples crawling from processing, asynchronously handling further transformation, filtering, persistence, and so on
Here we define a generic persistent_pipeline: it builds an insert statement from the item's fields and stores the extracted information in the database.
from scrapy.utils.project import get_project_settings
from twisted.enterprise import adbapi
import logging

class PersistentPipeline(object):
    # table name and column list are filled in per item; REPLACE keeps reruns idempotent
    insert_sql = """replace into %s (%s) values (%s)"""

    def __init__(self):
        settings = get_project_settings()
        db_args = settings.get('DB_CONNECT')    # connection kwargs defined in settings.py
        db_server = settings.get('DB_SERVER')   # DB-API driver module name, e.g. MySQLdb
        self.db_pool = adbapi.ConnectionPool(db_server, **db_args)

    def __del__(self):
        pass
        # self.db_pool.close()

    def process_item(self, item, spider):
        keys = item.fields.keys()
        fields = u','.join(keys)
        qm = u','.join([u'%s'] * len(keys))
        # the spider's name doubles as the table name
        sql = self.insert_sql % (spider.name, fields, qm)
        data = []
        for k in keys:
            if isinstance(item[k], list) or isinstance(item[k], dict):
                data.append('%s' % item[k])
            else:
                data.append(item[k])
        logging.debug(sql)
        logging.debug(data)
        self.db_pool.runOperation(sql, data)  # executes asynchronously on the pool
        return item  # pass the item on to any later pipelines
Running the spider
scrapy crawl lianjia
With that, a simple spider is complete; of course, you can define additional pipelines for your own business logic. Once the listings are stored in MySQL, SQL's GROUP BY/ORDER BY/WHERE makes it easy to filter and aggregate all kinds of price statistics, and a tool like Apache Zeppelin can visualize the results.
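For instance, a hypothetical query against the lianjia table written by the pipeline above (assuming total_price was stored in a numeric column) that ranks locations by average total price:
select location, count(*) as listings, avg(total_price) as avg_total
from lianjia
group by location
order by avg_total desc;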
Summary
Scrapy is simple to pick up: with the framework and its CSS/XPath selectors you can focus purely on the business data, without worrying about low-level details like networking and scheduling, and put together a crawler very quickly.