Scrapy is a Python crawler framework. Getting the environment set up on Ubuntu took me two whole days >_< since 14.04 LTS ships with Python 2.7, and switching over to 3.4, installing pip3, and dealing with OpenSSL all had their share of pitfalls. Fortunately I finally found the answers on Baidu and got everything installed.
Once the excitement wore off, it was time to get to work. Following the Scrapy 1.2 documentation, I worked through the first two chapters and wrote a simple spider. What did it crawl? panda.tv again, of course, 23333~
On to the code ->
At the command line (in your projects directory):
scrapy startproject panda
cd panda
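If everything goes smoothly, `startproject` generates a project skeleton roughly like this (layout as of Scrapy 1.2; the item definition below belongs in `items.py`, and the spider in a new file under `spiders/`):

```
panda/
    scrapy.cfg            # deploy/config file
    panda/
        __init__.py
        items.py          # item (data type) definitions go here
        pipelines.py
        settings.py
        spiders/
            __init__.py   # spider modules go in this package
```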
# items.py -- define the data type for one live-stream entry
import scrapy

class Video(scrapy.Item):
    title = scrapy.Field()
    name = scrapy.Field()
    population = scrapy.Field()
    category = scrapy.Field()

# spiders/panda_spider.py
import scrapy
from panda.items import Video

class PandaSpider(scrapy.Spider):
    name = 'panda'  # spider name, must be unique
    allowed_domains = ['panda.tv']
    start_urls = ['http://www.panda.tv/all']

    # start_urls is shorthand for this:
    # def start_requests(self):
    #     yield scrapy.Request('http://www.panda.tv/all', self.parse)

    def parse(self, response):
        video = Video()
        items = response.xpath('//a[@class="video-list-item-wrap"]')
        for info in items:
            subinfo = info.xpath('.//div[@class="video-info"]')
            # .extract() returns a list of every match;
            # .extract_first() would give just the first string
            video['title'] = info.xpath('.//div[@class="video-title"]/text()').extract()
            video['name'] = subinfo.xpath('.//span[@class="video-nickname"]/text()').extract()
            video['population'] = subinfo.xpath('.//span[@class="video-number"]/text()').extract()
            video['category'] = subinfo.xpath('.//span[@class="video-cate"]/text()').extract()
            # Video is dict-like, so this could simply be `yield video`
            yield {
                'title': video['title'],
                'name': video['name'],
                'population': video['population'],
                'category': video['category']
            }

        # Pagination, still to be finished
        # (note: extract_first() is needed here -- extract() returns a list,
        # which is never None and can't be passed to urljoin):
        # next_page = response.css('a.j-page-next::attr(href)').extract_first()
        # if next_page is not None:
        #     next_page = response.urljoin(next_page)
        #     yield scrapy.Request(next_page, self.parse)
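The pagination sketch relies on `response.urljoin`, which resolves the (possibly relative) `href` of the next-page link against the current page's URL. This is ordinary RFC 3986 URL joining, the same behavior as the stdlib's `urllib.parse.urljoin`; a quick illustration (the paths here are made-up examples, not real panda.tv links):

```python
from urllib.parse import urljoin

# A root-relative href is resolved against the page it appeared on
print(urljoin('http://www.panda.tv/all', '/all?pageno=2'))
# -> http://www.panda.tv/all?pageno=2

# An absolute href passes through unchanged
print(urljoin('http://www.panda.tv/all', 'http://www.panda.tv/cate'))
# -> http://www.panda.tv/cate
```

Note that `urljoin` only accepts a string, which is why the next-page `href` should be pulled out with `extract_first()` rather than `extract()` (the latter returns a list).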
At the command line (the spider's name):
scrapy crawl panda
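Each dict the spider yields is collected by Scrapy as one item. One thing that surprised me at first: because `.extract()` returns every match as a list, each field comes out wrapped in a list. A sketch of what a single scraped item looks like when serialized to JSON (the values here are invented placeholders, not real scrape results):

```python
import json

# One yielded item, shaped the way parse() builds it:
# every field is a list because .extract() returns all matches.
item = {
    'title': ['Some stream title'],
    'name': ['some_streamer'],
    'population': ['1.2万'],
    'category': ['LOL'],
}
print(json.dumps(item, ensure_ascii=False))
```

Switching the spider to `.extract_first()` would make each field a plain string instead.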
The output:
Compared with the Node.js crawler I wrote before, this feels a lot more convenient~ First step accomplished; more learning to come!