1. In PyCharm, create a new project named KwScrapySpider.
2. Go to File -> Settings -> Python Interpreter and install scrapy.
3. Open the Terminal and run the following commands:
scrapy startproject KwSpider : create the Scrapy project
cd KwSpider
scrapy genspider kuwo kuwo.cn : generate a spider (the domain sets the allowed crawl scope)
(venv) E:\work\python\PycharmProjects\KwScrapySpider>scrapy startproject KwSpider
New Scrapy project 'KwSpider', using template directory 'e:\work\python\pycharmprojects\kwscrapyspider\venv\lib\site-packages\scrapy\templates\project', created in:
E:\work\python\PycharmProjects\KwScrapySpider\KwSpider
You can start your first spider with:
cd KwSpider
scrapy genspider example example.com
(venv) E:\work\python\PycharmProjects\KwScrapySpider>cd KwSpider
(venv) E:\work\python\PycharmProjects\KwScrapySpider\KwSpider>scrapy genspider kuwo kuwo.cn
Created spider 'kuwo' using template 'basic' in module:
{spiders_module.__name__}.{module}
(venv) E:\work\python\PycharmProjects\KwScrapySpider>
After running these commands, the project directory structure is:
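To inspect the generated layout from Python, a small sketch like the following can help. For reference, `scrapy startproject` normally generates a `scrapy.cfg` file plus a package containing `items.py`, `middlewares.py`, `pipelines.py`, `settings.py` and a `spiders/` directory. The helper name `tree_lines` is made up for this illustration.

```python
from pathlib import Path


def tree_lines(root, prefix=""):
    """Return the directory tree under `root` as a list of indented lines."""
    lines = []
    # Sort by name so the listing is stable across runs and platforms.
    for entry in sorted(Path(root).iterdir(), key=lambda p: p.name):
        if entry.name == "__pycache__":
            continue  # skip bytecode caches, they are not part of the project
        if entry.is_dir():
            lines.append(f"{prefix}{entry.name}/")
            lines.extend(tree_lines(entry, prefix + "    "))
        else:
            lines.append(f"{prefix}{entry.name}")
    return lines


if __name__ == "__main__":
    # Assumes the current directory contains the KwSpider project folder.
    project = Path("KwSpider")
    if project.is_dir():
        print("\n".join(tree_lines(project)))
```

Run it from the KwScrapySpider directory after `scrapy startproject KwSpider` has finished.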
- Open kuwo.py
Before modification:
import scrapy


class KuwoSpider(scrapy.Spider):
    name = 'kuwo'
    allowed_domains = ['kuwo.cn']
    start_urls = ['http://kuwo.cn/']

    def parse(self, response):
        pass
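- Goal: print the recommended playlist information
Write the XPath expressions and print the name of each recommended playlist.
Open settings.py and lower the log verbosity: the default level, DEBUG, prints too much, so set LOG_LEVEL = "WARNING".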
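As a concrete sketch, the relevant line in settings.py would look roughly like this. Note that Python requires straight quotes around the value; the rest of the generated file is assumed to be unchanged.

```python
# KwSpider/settings.py (excerpt; all other generated settings are assumed unchanged)

# Only show messages at WARNING level or above, so that the output of
# print() in the spider is not buried in Scrapy's own log lines.
LOG_LEVEL = "WARNING"
```

The same level can also be set for a single run from the command line with `scrapy crawl kuwo -L WARNING`.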
def parse(self, response):
    div_list = response.xpath("//div[@class='rec_list']//div[@class='item']")
    for item in div_list:
        # playlist name
        name = item.xpath(".//p[@class='name']/span/text()").extract_first()
        print(name)
Run the project:
(venv) E:\work\python\PycharmProjects\KwScrapySpider\KwSpider>scrapy crawl kuwo
每日最新单曲推荐
我买了两本几米的漫画,另一本,将来送给你
当你放开手,遗忘在昨天
【德云社】德云女孩必修曲儿
【最新广场舞】春暖花开,广场舞跳起来
(venv) E:\work\python\PycharmProjects\KwScrapySpider\KwSpider>
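Since the project lives in PyCharm, it can be convenient to start the crawl from a script instead of the Terminal. A minimal sketch, assuming a file (here hypothetically called run.py) placed next to scrapy.cfg; `crawl_argv` is a made-up helper, while `scrapy.cmdline.execute` is the entry point the `scrapy` command itself uses:

```python
def crawl_argv(spider_name):
    """Build the argument list equivalent to `scrapy crawl <spider_name>`."""
    return ["scrapy", "crawl", spider_name]


if __name__ == "__main__":
    # Imported here so the helper above stays usable without Scrapy installed.
    from scrapy import cmdline

    # Must be run from the directory that contains scrapy.cfg.
    # execute() exits the interpreter when the crawl finishes.
    cmdline.execute(crawl_argv("kuwo"))
```

Running this file with PyCharm's Run button should behave like typing `scrapy crawl kuwo` in the Terminal.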
Next, scrape the image URLs by adding the following inside the loop:
pic_out = item.xpath(".//div[@class='pic_out']//img/@src").extract_first()
print(pic_out)
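It turns out that every scraped URL points to a static placeholder image; the real image URLs are not printed.
A quick search on Baidu suggested that Scrapy on its own cannot scrape dynamically loaded content and has to be combined with Selenium; see here.
The code after the changes: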
li_list = response.xpath('//div[@class="item item"]')
for li in li_list:
    item = {}
    pic_out = li.xpath('.//img[@class="pic"]/@data-src').extract_first()
    # print(pic_out)
    name = li.xpath('.//p[@class="name"]//span/text()').extract_first()
    # print(name)
    count = li.xpath('.//p[@class="count"]/text()').extract_first()
    # print(count)
    item['pic'] = pic_out
    item['name'] = name
    item['count'] = count
    # print(item)
    yield item