1. Purpose
I have just finished learning a pile of scattered knowledge about the Scrapy framework, and it felt like none of it had stuck. To make the study time count, I picked a crawler project from the internet to build. This article is only a record of my own learning process; please do not use it for anything else!!!
2. Analysis
In the past I always crawled playlists, audio files, or comments separately, which was never complete, so this time I put them together and crawl the music data more or less end to end.
Approach:
- Crawl every song shown on the toplist pages, i.e. only the songs that are broadly popular. First collect the URL of each toplist page.
- On each toplist page, extract the song information and the URL that routes to each song.
- Visit each song's URL to get the remaining details.
- Store the lyrics, singer, song name and so on in a CSV file, save the cover image as a .jpg, and save the audio as an .mp3.
The content to crawl is as follows.
Fields to collect:
1. Song cover image: img_url
2. Album the song belongs to: album
3. Song duration: duration
4. Song name: songname
5. Singer: singer
6. Lyrics: lyric
7. Comments: comment
"""
A comment consists of
1. the commenter
2. the comment text
3. the comment time
4. the number of likes
Replies are not included
"""
8. The song's audio file: media
3. Difficulties encountered
These notes were written after the project was finished, so many intermediate results were never saved as screenshots and I can only describe them here. The code comes later; if you do not care about the narrative, jump straight to the code blocks.
3.1 The site's anti-crawling measures
3.1.1 Access forbidden (404), None returned
Cause: Scrapy is effectively asynchronous, so it fires requests at the site very quickly. Requesting too fast and too often gets you blocked, and nothing useful comes back.
Solution: this project uses the following counter-measures (see the snippet after this list for how they map to the Scrapy configuration):
- Rotate the User-Agent header randomly
- Disable cookies (Scrapy enables them by default)
- Throttle the request rate
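These counter-measures all appear in the project's own configuration; the excerpt below pulls the relevant lines out of settings.py (section 4.1) and the downloader middleware (section 4.4):

# settings.py (excerpt): slow down and stop sending cookies
DOWNLOAD_DELAY = 0.25        # delay between requests to the same site
AUTOTHROTTLE_ENABLED = True  # let Scrapy adapt the delay to the site's latency
COOKIES_ENABLED = False      # Scrapy enables cookies by default, so turn them off

# middlewares.py (excerpt): attach a random User-Agent to every outgoing request
import random
from .settings import USER_AGENT_LIST

class EasecloudDownloaderMiddleware:
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)
        return None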
3.1.2 Interference from JS, Ajax, etc.
If you think you can just send a request to the site and it will hand back what you want, you are badly mistaken.
Symptom: I had clearly located the right element, so why did the response come back as a pile of unreadable expressions instead of the song data?
Cause: the NetEase Cloud Music site applies JS encryption and similar tricks. The page you visit looks as if the data you want is right there in its elements, but that content is actually filled in via JS/Ajax from another endpoint, and you have to find the endpoint that really serves the data.
Solution: debug in the browser console (F12) to find the corresponding endpoint. For example, for lyrics the endpoint turns out to be https://music.163.com/weapi/song/lyric?csrf_token= , so to get a song's lyrics you send your request to that URL rather than to the page URL you see in the address bar.
You will also notice that the audio endpoint, the lyric endpoint and the comment endpoint are each a single fixed URL, regardless of which song you are after. That tells you immediately that to get a particular song's data you request the same endpoint with different parameters, and the obvious parameter here is the song ID.
Here are the three endpoints, all found through console debugging:
"""
Comments https://music.163.com/weapi/comment/resource/comments/get?csrf_token=
Audio    https://music.163.com/weapi/song/enhance/player/url/v1?csrf_token=
Lyrics   https://music.163.com/weapi/song/lyric?csrf_token=
"""
3.1.3 JS-encrypted parameters
Symptom: no matter how you hit these endpoints, POST or GET, nothing comes back, only a parameter error.
Cause: NetEase encrypts the request parameters; the endpoints only answer when the parameters are generated according to their rules.
Solution process: only two parameters are required, params and encSecKey, so the job becomes finding out in the console how those two are produced.
They are generated with AES and RSA encryption. By stepping through the JS in the console I eventually worked out how both are built, and a test request confirmed it.
I spent a whole day on the decryption with little to show for it, but luckily I found an excellent write-up by another blogger; thanks to that article I could finally follow the encryption process. If you are interested in the details, go read it.
If you are not comfortable debugging in the console, that kind of article is also a good place to start.
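Once params and encSecKey can be generated, hitting one of the endpoints above is just a POST carrying those two form fields. The sketch below mirrors how the spider in section 4.5 requests the lyric endpoint; self.encrypt is the helper object from parse_params.py (section 4.6), and self.encrypt.lyric is assumed to be a JSON payload template that the song id is substituted into:

# sketch, mirroring parse_detail_songname() in section 4.5
liric_data = self.encrypt.resultEncrypt(self.encrypt.lyric % song_id)  # -> {'params': ..., 'encSecKey': ...}
yield scrapy.FormRequest(
    url='https://music.163.com/weapi/song/lyric?csrf_token=',
    formdata=liric_data,
    callback=self.parse_detail_liric,
    meta={'song_id': song_id, 'item': item},
)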
3.2 Problems getting the data
Symptom: even with the right URL and parameters, the data you need sits inside a textarea as a JSON-like string, which is hard to get at with element selectors alone.
Cause: again, the anti-crawling measures.
Solution: convert the JSON-like string into a real JSON object and read the fields with ordinary JSON access.
There is a pitfall here: if you want to parse a fairly long string as JSON, you must do it like this, otherwise it raises an error:
result = json.loads(string, strict=False)
!!! And in Scrapy, if you want the binary payload of a response it is response.body, not response.content. This one cost me dearly, which I can only blame on my shaky fundamentals.
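Putting both pitfalls together, a minimal illustration (the textarea id is the one the spider in section 4.5 actually reads):

import json

# the toplist page embeds the song list as a JSON string inside a <textarea>
detail = response.xpath('//textarea[@id="song-list-pre-data"]/text()').extract_first()
result = json.loads(detail, strict=False)   # strict=False tolerates control characters inside long strings
album = result[0]['album']['name']

# for the binary payload (the mp3 bytes) it has to be response.body; a Scrapy response has no .content
audio_bytes = response.body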
3.3 Problems storing the data
Symptom: the audio files and the text cannot simply go through the same pipeline as-is; normally they would have to be stored separately.
Solution: I wanted a single pipeline class that stores the audio, the text and the image information together, so I refused to split them up. That reminded me of binary IO streams: the spider wraps the audio bytes in an in-memory buffer and passes it along with the item, which avoids the whole problem:
media_io = io.BytesIO()          # in-memory buffer for the audio
media_io.write(response.body)    # response.body is the raw binary payload
media_io.getvalue()              # returns the bytes; the pipeline writes them to an .mp3
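On the pipeline side (section 4.2) the buffer is simply read back and written to disk:

# pipelines.py (excerpt): dump the buffered audio to an .mp3 file
media_io = item['media']
with open(f'./music/{item["songname"]}.mp3', 'wb') as f:
    f.write(media_io.getvalue())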
3.4 An unresolved error
Symptom: even with correct parameters, there are always a few songs whose lyrics cannot be found; the string that should be parsed as JSON comes back empty. This is still unsolved.
4. Code
4.1 settings.py
# Scrapy settings for easecloud project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'easecloud'
SPIDER_MODULES = ['easecloud.spiders']
NEWSPIDER_MODULE = 'easecloud.spiders'
USER_AGENT_LIST = [
"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14",
"Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Win64; x64; Trident/6.0)",
'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
'Opera/9.25 (Windows NT 5.1; U; en)',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
"Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 Chrome/16.0.912.77 Safari/535.7",
"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0 "
]
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = USER_AGENT_LIST[0]
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_FILE = './log.log'
LOG_LEVEL = 'WARNING'
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.25
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'easecloud.middlewares.EasecloudSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
'easecloud.middlewares.EasecloudDownloaderMiddleware': 543,
}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'easecloud.pipelines.EasecloudPipeline': 300,
'easecloud.pipelines.ImgEasecloudPipeline': 301,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
REDIRECT_ENABLED = False
IMAGES_STORE='./images'
4.2 pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
import scrapy
from itemadapter import ItemAdapter
import logging, csv, os
from scrapy.pipelines.images import ImagesPipeline
log = logging.getLogger(__name__)
class EasecloudPipeline:
    def open_spider(self, spider):
        log.warning("Spider started")
        headers = ['Song name', 'Singer', 'Album', 'Duration', 'Lyrics']
        headers_comment = ['Commenter', 'Comment', 'Time', 'Likes']
        os.makedirs('./music', exist_ok=True)  # the audio files are written into this directory
        self.info = open('./wycomment.csv', 'w+', encoding='utf-8', newline='')
        self.info_comment = open('./wycomment_comment.csv', 'w+', encoding='utf-8', newline='')
        # write the header rows once instead of once per item
        csv.writer(self.info).writerow(headers)
        csv.writer(self.info_comment).writerow(headers_comment)
    def process_item(self, item, spider):
        try:
            media_io = item['media']
            # write the buffered audio bytes out, named after the song
            with open(f'./music/{item["songname"]}.mp3', 'wb') as f:
                f.write(media_io.getvalue())
        except Exception as e:
            log.warning(f'Error {e} while saving the audio for {item["songname"]}')
        finally:
            data = []
            data_comment = []
            data_singer = item['singer']
            data_songname = item['songname']
            data_album = item['album']
            data_duration = item['duration']
            data_lyric = item['lyric']
            data_comment_commenter_name = item['comment']['commenter_name']
            data_comment_comment_datail = item['comment']['comment_datail']
            data_comment_time = item['comment']['time']
            data_comment_thumb_number = item['comment']['thumb_number']
            # column order matches the header rows written in open_spider
            data.append((data_songname, data_singer, data_album, data_duration, data_lyric))
            data_comment.append((data_comment_commenter_name, data_comment_comment_datail,
                                 data_comment_time, data_comment_thumb_number))
            csv.writer(self.info).writerows(data)
            csv.writer(self.info_comment).writerows(data_comment)
return item
def close_spider(self,spider):
log.warning("爬虫结束,存储数据完毕")
self.info.close()
self.info_comment.close()
class ImgEasecloudPipeline(ImagesPipeline):
def get_media_requests(self, item, info):
yield scrapy.Request(item['img_url'])
    def file_path(self, request, response=None, info=None, *, item=None):
        # file name (and relative path) used for the downloaded cover image
        return f'{item["songname"]}.jpg'
    def item_completed(self, results, item, info):
        # hand the item on to the next pipeline class
        return item
4.3 items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class EasecloudItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
"""
要获取
1.歌曲的图片img_url
2.album
3.歌曲的时长duration
4.歌曲的时长名字songname
5.歌曲手singer
6.歌词lyric
7.评论comment
"""
img_url = scrapy.Field()
album = scrapy.Field()
singer = scrapy.Field()
duration = scrapy.Field()
songname = scrapy.Field()
lyric = scrapy.Field()
comment = scrapy.Field()
media = scrapy.Field()
class CommentDetail(scrapy.Item):
"""
评论的内容又包括
1.评论人
2.评论内容
3.评论时间
4.评论点赞数
不包括回复的内容
"""
commenter_name = scrapy.Field()
commenter_url = scrapy.Field()
comment_datail = scrapy.Field()
time = scrapy.Field()
thumb_number = scrapy.Field()
4.4 middlewares.py
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import random,time
from .settings import USER_AGENT_LIST
from scrapy import signals
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
import logging
log = logging.getLogger(__name__)
class EasecloudSpiderMiddleware:
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the spider middleware does not modify the
# passed objects.
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_spider_input(self, response, spider):
# Called for each response that goes through the spider
# middleware and into the spider.
# Should return None or raise an exception.
return None
def process_spider_output(self, response, result, spider):
# Called with the results returned from the Spider, after
# it has processed the response.
# Must return an iterable of Request, or item objects.
for i in result:
yield i
def process_spider_exception(self, response, exception, spider):
# Called when a spider or process_spider_input() method
# (from other spider middleware) raises an exception.
# Should return either None or an iterable of Request or item objects.
pass
def process_start_requests(self, start_requests, spider):
# Called with the start requests of the spider, and works
# similarly to the process_spider_output() method, except
# that it doesn’t have a response associated.
# Must return only requests (not items).
for r in start_requests:
yield r
def spider_opened(self, spider):
spider.logger.info('Spider opened: %s' % spider.name)
class EasecloudDownloaderMiddleware:
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the downloader middleware does not modify the
# passed objects.
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_request(self, request, spider):
# Called for each request that goes through the downloader
# middleware.
# Must either:
# - return None: continue processing this request
# - or return a Response object
# - or return a Request object
# - or raise IgnoreRequest: process_exception() methods of
# installed downloader middleware will be called
request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)
return None
def process_response(self, request, response, spider):
# Called with the response returned from the downloader.
# Must either;
# - return a Response object
# - return a Request object
# - or raise IgnoreRequest
return response
def process_exception(self, request, exception, spider):
# Called when a download handler or a process_request()
# (from other downloader middleware) raises an exception.
# Must either:
# - return None: continue processing this exception
# - return a Response object: stops process_exception() chain
# - return a Request object: stops process_exception() chain
pass
def spider_opened(self, spider):
spider.logger.info('Spider opened: %s' % spider.name)
4.5 wycomment.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import logging
from ..items import EasecloudItem, CommentDetail
import json
import io
log = logging.getLogger(__name__)
from .parse_params import eneryctAES
class WycommentSpider(CrawlSpider):
name = 'wycomment'
start_urls = ['https://music.163.com/discover/toplist?id=19723756']
rules = (
Rule(LinkExtractor(allow=r'toplist\?i'), callback='parse_item', follow=True),
)
encrypt = eneryctAES()
    def parse_item(self, response):
        """
        Visiting the start URL https://music.163.com/discover/toplist?id=19723756 yields the link of every toplist,
        and each of those links is then handled by this function.
        The song data is embedded as JS/JSON text in the page, so it is read from there.
        For every song, extract the song id (song_id), the album it belongs to (album), the cover image URL (img_url),
        the singer (singer) and the duration (total_time), then hand the rest of the work to the next callback.
        :param response:
        :return:
        """
        log.warning(f'Crawling toplist URL {response.request.url}')
        log.warning(f'User-Agent: {response.request.headers["User-Agent"]}')
item = EasecloudItem()
try:
detail = response.xpath('//textarea[@id="song-list-pre-data"]/text()').extract_first()
log.warning(type(detail))
            # detail can come back as None here because of the site's anti-crawling measures
result = json.loads(detail)
for i in range(len(result)):
album =result[i]['album']['name']
img_url =result[i]['album']['picUrl']
singer = result[i]['artists'][0]['name']
total_time = result[i]['duration']
song_id = result[i]['id']
total_time = total_time // 1000
minute = total_time // 60
second = total_time % 60
item['album'] = album
item['img_url'] = img_url
item['singer'] = singer
                item['duration'] = f'{minute}:{second:02d}'
yield scrapy.Request(url=f'https://music.163.com/song?id={song_id}',callback=self.parse_detail_songname,meta={'song_id':song_id,'item':item})
                break  # only the first song of each toplist is processed
except Exception as e:
            log.warning(f'parse_item failed: {e}, failing URL: {response.request.url}')
def parse_detail_songname(self,response):
song_id = response.meta['song_id']
item = response.meta['item']
songname = response.xpath('//title/text()').extract_first()
item['songname'] = songname
liric_data = self.encrypt.resultEncrypt(self.encrypt.lyric % song_id)
yield scrapy.FormRequest(url=self.encrypt.lyric_url,
callback=self.parse_detail_liric, meta={'song_id': song_id, 'item': item},
formdata=liric_data)
    def parse_detail_liric(self, response):
        """
        1. The lyric URL needs the submitted parameters params and encSecKey, which involves reversing the
           JS-side AES/RSA encryption; it is fairly involved (see 3.1.3).
        2. The encryption-related helpers are create16RandomBytes, AESEncrypt, RSAEncrypt and resultEncrypt;
           their whole purpose is to produce the right form parameters.
        3. Extract the lyric content.
        :param response:
        :return:
        """
try:
song_id = response.meta['song_id']
item = response.meta['item']
            # the response can come back empty here because of the anti-crawling measures;
            # sometimes the first request gets nothing, still to be fixed
result = json.loads(response.text)
lyric = result["lrc"]["lyric"]
lyric = self.parse_lyric(lyric)
item['lyric'] = lyric
comment_data = self.encrypt.resultEncrypt(self.encrypt.comment % (song_id,song_id))
yield scrapy.FormRequest(url=self.encrypt.comment_url,
callback=self.parse_detail_comment, meta={'id': song_id, 'item': item},
formdata=comment_data)
except Exception as e:
            log.warning(f'parse_detail_liric failed: {e}, failing URL: {response.request.url}')
    def parse_detail_comment(self, response):
        """
        1. The full comment data turned out to be too complex, so only the hot comments are crawled.
        2. From the hot comments extract: the commenter, the comment text, the comment time and the number of likes; replies are not included.
        :param response:
        :return:
        """
try:
song_id = response.meta['id']
item = response.meta['item']
comment_item = CommentDetail()
result = json.loads(response.text,strict = False)
for i in range(len(result['data']['hotComments'])):
comment_item['comment_datail'] = result['data']['hotComments'][i]['content']
comment_item['time'] = result['data']['hotComments'][i]['timeStr']
comment_item['thumb_number'] = result['data']['hotComments'][i]['likedCount']
comment_item['commenter_name'] = result['data']['hotComments'][i]['user']['nickname']
comment_item['commenter_url'] = result['data']['hotComments'][i]['user']['avatarUrl']
item['comment'] = comment_item
meida_data = self.encrypt.resultEncrypt(self.encrypt.meida % song_id)
yield scrapy.FormRequest(url=self.encrypt.meida_url,
callback=self.parse_detail_media_url, meta={'song_id': song_id, 'item': item},
formdata=meida_data)
                break  # only the first hot comment is kept
except Exception as e:
            log.warning(f'parse_detail_comment failed: {e}, failing URL: {response.request.url}')
def parse_detail_media_url(self,response):
item = response.meta['item']
try:
result = json.loads(response.text,strict = False)
media_url = result['data'][0]['url']
log.warning(f"这是你找到的音频网址{media_url}")
if media_url:
yield scrapy.Request(url=media_url,callback=self.parse_detail_media,meta={'item':item})
except Exception as e:
log.warning(f"parse_detail_media_url函数出现了问题{e},找不到访问url失败,出错网址是{response.request.url}")
def parse_detail_media(self,response):
item = response.meta['item']
        # log.warning(f'Accumulated item so far: {item}')
try:
media_io = io.BytesIO()
media_io.write(response.body)
item['media'] = media_io
except Exception as e:
log.warning("没有二进制数据")
yield item
    def parse_lyric(self, string):
        # raw LRC lyrics look like '[00:12.34]line one[00:15.67]line two' once whitespace is stripped
        lyric = ''.join(string.split())   # remove all whitespace
        lyric = lyric.replace(']', '[')   # delimit every timestamp with '['
        lyric = lyric.split('[')          # -> ['', '00:12.34', 'line one', '00:15.67', 'line two', ...]
        # keep only the text parts (even indices, skipping the leading empty string)
        lyric = [lyric[2 * i] for i in range(len(lyric)) if 2 * i < len(lyric) and i != 0]
        return ''.join(lyric)
"""
评论 https://music.163.com/weapi/comment/resource/comments/get?csrf_token=
音频 https://music.163.com/weapi/song/enhance/player/url/v1?csrf_token=
歌词 https://music.163.com/weapi/song/lyric?csrf_token=
"""
4.6 parse_params.py
This file is the key to the parameter-generation rules. To keep others from misusing it, I am not posting it here.
The whole project is on GitHub; clone it if you are interested.
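parse_params.py itself is not shown, but for readability here is the interface that the spider in section 4.5 expects it to expose. This is only a hypothetical stub sketching the attribute names and the return shape (they are taken from how wycomment.py calls the object); the payload templates and the actual AES/RSA routines are placeholders, not the real thing:

# hypothetical stub of what wycomment.py assumes parse_params.py provides
class eneryctAES:
    # endpoints the spider posts the encrypted form data to
    lyric_url = 'https://music.163.com/weapi/song/lyric?csrf_token='
    comment_url = 'https://music.163.com/weapi/comment/resource/comments/get?csrf_token='
    meida_url = 'https://music.163.com/weapi/song/enhance/player/url/v1?csrf_token='
    # JSON payload templates the song id is %-substituted into (placeholders, not the real payloads)
    lyric = '{"id": "%s"}'
    comment = '{"songId": "%s", "threadId": "%s"}'   # takes the id twice
    meida = '{"ids": "[%s]"}'

    def resultEncrypt(self, text):
        # the real implementation builds params with AES and encSecKey with RSA (see 3.1.3);
        # it is stubbed out here
        params, encSecKey = '', ''
        return {'params': params, 'encSecKey': encSecKey}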
5. Pros and cons
Pros:
- It crawls NetEase Cloud Music more or less end to end, so the amount of information collected is huge
- The project is simple
The whole thing was finished in two days, so it is a bit rushed; quite a few bugs are still unfixed and will be cleaned up later.
Cons:
- VIP-only songs cannot be crawled yet; that would need the right cookies and so on
- Because of the deliberate slow-down delay the download speed is not great; with an IP proxy pool the throttling would not be necessary
- The downloaded lyrics are one long string without punctuation, which is hard to read
- The comment data is very thin, and only the hot comments are crawled
- Every error is only written to the log file; nothing is actually handled