1. Purpose
I have just finished learning a pile of scattered knowledge about the Scrapy framework, and it felt like none of it had stuck. To make the study time count, I picked a crawler project from the internet to build. This article is only a record of my own learning process; please do not use it for anything else!!!
2. Analysis
In the past I always crawled playlists, audio files, or comments separately, which was never complete, so this time I put them together and crawl the music data more or less end to end.
Approach:
- Crawl every song shown on the toplist pages, i.e. only the songs that are broadly popular. First collect the URL of each toplist page.
- On each toplist page, extract the song information and the URL that routes to each song.
- Visit each song's URL to get the remaining details.
- Store the lyrics, singer, song name and so on in a CSV file, save the cover image as a .jpg, and save the audio as an .mp3.
The content to crawl is as follows.
Fields to collect:
1. Song cover image: img_url
2. Album the song belongs to: album
3. Song duration: duration
4. Song name: songname
5. Singer: singer
6. Lyrics: lyric
7. Comments: comment
"""
A comment consists of
1. the commenter
2. the comment text
3. the comment time
4. the number of likes
Replies are not included
"""
8. The song's audio file: media
3. Difficulties encountered
These notes were written after the project was finished, so many intermediate results were never saved as screenshots and I can only describe them here. The code comes later; if you do not care about the narrative, jump straight to the code blocks.
3.1 The site's anti-crawling measures
3.1.1 Access forbidden (404), None returned
Cause: Scrapy is effectively asynchronous, so it fires requests at the site very quickly. Requesting too fast and too often gets you blocked, and nothing useful comes back.
Solution: this project uses the following counter-measures (see the snippet after this list for how they map to the Scrapy configuration):
- Rotate the User-Agent header randomly
- Disable cookies (Scrapy enables them by default)
- Throttle the request rate
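These counter-measures all appear in the project's own configuration; the excerpt below pulls the relevant lines out of settings.py (section 4.1) and the downloader middleware (section 4.4):

# settings.py (excerpt): slow down and stop sending cookies
DOWNLOAD_DELAY = 0.25        # delay between requests to the same site
AUTOTHROTTLE_ENABLED = True  # let Scrapy adapt the delay to the site's latency
COOKIES_ENABLED = False      # Scrapy enables cookies by default, so turn them off

# middlewares.py (excerpt): attach a random User-Agent to every outgoing request
import random
from .settings import USER_AGENT_LIST

class EasecloudDownloaderMiddleware:
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)
        return None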
3.1.2 Interference from JS, Ajax, etc.
If you think you can just send a request to the site and it will hand back what you want, you are badly mistaken.
Symptom: I had clearly located the right element, so why did the response come back as a pile of unreadable expressions instead of the song data?
Cause: the NetEase Cloud Music site applies JS encryption and similar tricks. The page you visit looks as if the data you want is right there in its elements, but that content is actually filled in via JS/Ajax from another endpoint, and you have to find the endpoint that really serves the data.
Solution: debug in the browser console (F12) to find the corresponding endpoint. For example, for lyrics the endpoint turns out to be https://music.163.com/weapi/song/lyric?csrf_token= , so to get a song's lyrics you send your request to that URL rather than to the page URL you see in the address bar.
You will also notice that the audio endpoint, the lyric endpoint and the comment endpoint are each a single fixed URL, regardless of which song you are after. That tells you immediately that to get a particular song's data you request the same endpoint with different parameters, and the obvious parameter here is the song ID.
Here are the three endpoints, all found through console debugging:
"""
Comments https://music.163.com/weapi/comment/resource/comments/get?csrf_token=
Audio    https://music.163.com/weapi/song/enhance/player/url/v1?csrf_token=
Lyrics   https://music.163.com/weapi/song/lyric?csrf_token=
"""
3.1.3 JS-encrypted parameters
Symptom: no matter how you hit these endpoints, POST or GET, nothing comes back, only a parameter error.
Cause: NetEase encrypts the request parameters; the endpoints only answer when the parameters are generated according to their rules.
Solution process: only two parameters are required, params and encSecKey, so the job becomes finding out in the console how those two are produced.
They are generated with AES and RSA encryption. By stepping through the JS in the console I eventually worked out how both are built, and a test request confirmed it.
I spent a whole day on the decryption with little to show for it, but luckily I found an excellent write-up by another blogger; thanks to that article I could finally follow the encryption process. If you are interested in the details, go read it.
If you are not comfortable debugging in the console, that kind of article is also a good place to start.
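Once params and encSecKey can be generated, hitting one of the endpoints above is just a POST carrying those two form fields. The sketch below mirrors how the spider in section 4.5 requests the lyric endpoint; self.encrypt is the helper object from parse_params.py (section 4.6), and self.encrypt.lyric is assumed to be a JSON payload template that the song id is substituted into:

# sketch, mirroring parse_detail_songname() in section 4.5
liric_data = self.encrypt.resultEncrypt(self.encrypt.lyric % song_id)  # -> {'params': ..., 'encSecKey': ...}
yield scrapy.FormRequest(
    url='https://music.163.com/weapi/song/lyric?csrf_token=',
    formdata=liric_data,
    callback=self.parse_detail_liric,
    meta={'song_id': song_id, 'item': item},
)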
3.2 Problems getting the data
Symptom: even with the right URL and parameters, the data you need sits inside a textarea as a JSON-like string, which is hard to get at with element selectors alone.
Cause: again, the anti-crawling measures.
Solution: convert the JSON-like string into a real JSON object and read the fields with ordinary JSON access.
There is a pitfall here: if you want to parse a fairly long string as JSON, you must do it like this, otherwise it raises an error:
result = json.loads(string, strict=False)
!!! And in Scrapy, if you want the binary payload of a response it is response.body, not response.content. This one cost me dearly, which I can only blame on my shaky fundamentals.
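Putting both pitfalls together, a minimal illustration (the textarea id is the one the spider in section 4.5 actually reads):

import json

# the toplist page embeds the song list as a JSON string inside a <textarea>
detail = response.xpath('//textarea[@id="song-list-pre-data"]/text()').extract_first()
result = json.loads(detail, strict=False)   # strict=False tolerates control characters inside long strings
album = result[0]['album']['name']

# for the binary payload (the mp3 bytes) it has to be response.body; a Scrapy response has no .content
audio_bytes = response.body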
3.3 Problems storing the data
Symptom: the audio files and the text cannot simply go through the same pipeline as-is; normally they would have to be stored separately.
Solution: I wanted a single pipeline class that stores the audio, the text and the image information together, so I refused to split them up. That reminded me of binary IO streams: the spider wraps the audio bytes in an in-memory buffer and passes it along with the item, which avoids the whole problem:
media_io = io.BytesIO()          # in-memory buffer for the audio
media_io.write(response.body)    # response.body is the raw binary payload
media_io.getvalue()              # returns the bytes; the pipeline writes them to an .mp3
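On the pipeline side (section 4.2) the buffer is simply read back and written to disk:

# pipelines.py (excerpt): dump the buffered audio to an .mp3 file
media_io = item['media']
with open(f'./music/{item["songname"]}.mp3', 'wb') as f:
    f.write(media_io.getvalue())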
3.4 An unresolved error
Symptom: even with correct parameters, there are always a few songs whose lyrics cannot be found; the string that should be parsed as JSON comes back empty. This is still unsolved.
4. Code
4.1 settings.py
# Scrapy settings for easecloud project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'easecloud'
SPIDER_MODULES = ['easecloud.spiders']
NEWSPIDER_MODULE = 'easecloud.spiders'
USER_AGENT_LIST = [
"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14",
"Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Win64; x64; Trident/6.0)",
'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
'Opera/9.25 (Windows NT 5.1; U; en)',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
"Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 Chrome/16.0.912.77 Safari/535.7",
"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0 "
]
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = USER_AGENT_LIST[0]
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_FILE = './log.log'
LOG_LEVEL = 'WARNING'
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.25
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'easecloud.middlewares.EasecloudSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
'easecloud.middlewares.EasecloudDownloaderMiddleware': 543,
}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'easecloud.pipelines.EasecloudPipeline': 300,
'easecloud.pipelines.ImgEasecloudPipeline': 301,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
REDIRECT_ENABLED = False
IMAGES_STORE='./images'
4.2 pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
import scrapy
from itemadapter import ItemAdapter
import logging, csv, os
from scrapy.pipelines.images import ImagesPipeline
log = logging.getLogger(__name__)
class EasecloudPipeline:
    def open_spider(self, spider):
        log.warning("Spider started")
        headers = ['Song name', 'Singer', 'Album', 'Duration', 'Lyrics']
        headers_comment = ['Commenter', 'Comment', 'Time', 'Likes']
        os.makedirs('./music', exist_ok=True)  # the audio files are written into this directory
        self.info = open('./wycomment.csv', 'w+', encoding='utf-8', newline='')
        self.info_comment = open('./wycomment_comment.csv', 'w+', encoding='utf-8', newline='')
        # write the header rows once instead of once per item
        csv.writer(self.info).writerow(headers)
        csv.writer(self.info_comment).writerow(headers_comment)
    def process_item(self, item, spider):
        try:
            media_io = item['media']
            # write the buffered audio bytes out, named after the song
            with open(f'./music/{item["songname"]}.mp3', 'wb') as f:
                f.write(media_io.getvalue())
        except Exception as e:
            log.warning(f'Error {e} while saving the audio for {item["songname"]}')
        finally:
            data = []
            data_comment = []
            data_singer = item['singer']
            data_songname = item['songname']
            data_album = item['album']
            data_duration = item['duration']
            data_lyric = item['lyric']
            data_comment_commenter_name = item['comment']['commenter_name']
            data_comment_comment_datail = item['comment']['comment_datail']
            data_comment_time = item['comment']['time']
            data_comment_thumb_number = item['comment']['thumb_number']
            # column order matches the header rows written in open_spider
            data.append((data_songname, data_singer, data_album, data_duration, data_lyric))
            data_comment.append((data_comment_commenter_name, data_comment_comment_datail,
                                 data_comment_time, data_comment_thumb_number))
            csv.writer(self.info).writerows(data)
            csv.writer(self.info_comment).writerows(data_comment)
return item
def close_spider(self,spider):
log.warning("爬虫结束,存储数据完毕")
self.info.close()
self.info_comment.close()
class ImgEasecloudPipeline(ImagesPipeline):
def get_media_requests(self, item, info):
yield scrapy.Request(item['img_url'])
    def file_path(self, request, response=None, info=None, *, item=None):
        # file name (and relative path) used for the downloaded cover image
        return f'{item["songname"]}.jpg'
    def item_completed(self, results, item, info):
        # hand the item on to the next pipeline class
        return item
4.3 items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class EasecloudItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
"""
要获取
1.歌曲的图片img_url
2.album
3.歌曲的时长duration
4.歌曲的时长名字songname
5.歌曲手singer
6.歌词lyric
7.评论comment
"""
img_url = scrapy.Field()
album = scrapy.Field()
singer = scrapy.Field()
duration = scrapy.Field()
songname = scrapy.Field()
lyric = scrapy.Field()
comment = scrapy.Field()
media = scrapy.Field()
class CommentDetail(scrapy.Item):
"""
评论的内容又包括
1.评论人
2.评论内容
3.评论时间
4.评论点赞数
不包括回复的内容
"""
commenter_name = scrapy.Field()
commenter_url = scrapy.Field()
comment_datail = scrapy.Field()
time = scrapy.Field()
thumb_number = scrapy.Field()
4.4 middlewares.py
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import random,time
from .settings import USER_AGENT_LIST
from scrapy import signals
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
import logging
log = logging.getLogger(__name__)
class EasecloudSpiderMiddleware:
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the spider middleware does not modify the
# passed objects.
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_spider_input(self, response, spider):
# Called for each response that goes through the spider
# middleware and into the spider.
# Should return None or raise an exception.
return None
def process_spider_output(self, response, result, spider):
# Called with the results returned from the Spider, after
# it has processed the response.
# Must return an iterable of Request, or item objects.
for i in result:
yield i
def process_spider_exception(self, response, exception, spider):
# Called when a spider or process_spider_input() method
# (from other spider middleware) raises an exception.
# Should return either None or an iterable of Request or item objects.
pass
def process_start_requests(self, start_requests, spider):
# Called with the start requests of the spider, and works
# similarly to the process_spider_output() method, except
# that it doesn’t have a response associated.
# Must return only requests (not items).
for r in start_requests:
yield r
def spider_opened(self, spider):
spider.logger.info('Spider opened: %s' % spider.name)
class EasecloudDownloaderMiddleware:
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the downloader middleware does not modify the
# passed objects.
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_request(self, request, spider):
# Called for each request that goes through the downloader
# middleware.
# Must either:
# - return None: continue processing this request
# - or return a Response object
# - or return a Request object
# - or raise IgnoreRequest: process_exception() methods of
# installed downloader middleware will be called
request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)
return None
def process_response(self, request, response, spider):
# Called with the response returned from the downloader.
# Must either;
# - return a Response object
# - return a Request object
# - or raise IgnoreRequest
return response
def process_exception(self, request, exception, spider):
# Called when a download handler or a process_request()
# (from other downloader middleware) raises an exception.
# Must either:
# - return None: continue processing this exception
# - return a Response object: stops process_exception() chain
# - return a Request object: stops process_exception() chain
pass
def spider_opened(self, spider):
spider.logger.info('Spider opened: %s' % spider.name)
4.5 wycomment.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import logging
from ..items import EasecloudItem, CommentDetail
import json
import io
log = logging.getLogger(__name__)
from .parse_params import eneryctAES
class WycommentSpider(CrawlSpider):
name = 'wycomment'
start_urls = ['https://music.163.com/discover/toplist?id=19723756']
rules = (
Rule(LinkExtractor(allow=r'toplist\?i'), callback='parse_item', follow=True),
)
encrypt = eneryctAES()
    def parse_item(self, response):
        """
        Visiting the start URL https://music.163.com/discover/toplist?id=19723756 yields the link of every toplist,
        and each of those links is then handled by this function.
        The song data is embedded as JS/JSON text in the page, so it is read from there.
        For every song, extract the song id (song_id), the album it belongs to (album), the cover image URL (img_url),
        the singer (singer) and the duration (total_time), then hand the rest of the work to the next callback.
        :param response:
        :return:
        """
        log.warning(f'Crawling toplist URL {response.request.url}')
        log.warning(f'User-Agent: {response.request.headers["User-Agent"]}')
item = EasecloudItem()
try:
detail = response.xpath('//textarea[@id="song-list-pre-data"]/text()').extract_first()
log.warning(type(detail))
            # detail can come back as None here because of the site's anti-crawling measures
result = json.loads(detail)
for i in range(len(result)):
album =result[i]['album']['name']
img_url =result[i]['album']['picUrl']
singer = result[i]['artists'][0]['name']
total_time = result[i]['duration']
song_id = result[i]['id']
total_time = total_time // 1000
minute = total_time // 60
second = total_time % 60
item['album'] = album
item['img_url'] = img_url
item['singer'] = singer
                item['duration'] = f'{minute}:{second:02d}'
yield scrapy.Request(url=f'https://music.163.com/song?id={song_id}',callback=self.parse_detail_songname,meta={'song_id':song_id,'item':item})
                break  # only the first song of each toplist is processed
except Exception as e:
            log.warning(f'parse_item failed: {e}, failing URL: {response.request.url}')
def parse_detail_songname(self,response):
song_id = response.meta['song_id']
item = response.meta['item']
songname = response.xpath('//title/text()').extract_first()
item['songname'] = songname
liric_data = self.encrypt.resultEncrypt(self.encrypt.lyric % song_id)
yield scrapy.FormRequest(url=self.encrypt.lyric_url,
callback=self.parse_detail_liric, meta={'song_id': song_id, 'item': item},
formdata=liric_data)
    def parse_detail_liric(self, response):
        """
        1. The lyric URL needs the submitted parameters params and encSecKey, which involves reversing the
           JS-side AES/RSA encryption; it is fairly involved (see 3.1.3).
        2. The encryption-related helpers are create16RandomBytes, AESEncrypt, RSAEncrypt and resultEncrypt;
           their whole purpose is to produce the right form parameters.
        3. Extract the lyric content.
        :param response:
        :return:
        """
try:
song_id = response.meta['song_id']
item = response.meta['item']
            # the response can come back empty here because of the anti-crawling measures;
            # sometimes the first request gets nothing, still to be fixed
result = json.loads(response.text)
lyric = result["lrc"]["lyric"]
lyric = self.parse_lyric(lyric)
item['lyric'] = lyric
comment_data = self.encrypt.resultEncrypt(self.encrypt.comment % (song_id,song_id))
yield scrapy.FormRequest(url=self.encrypt.comment_url,
callback=self.parse_detail_comment, meta={'id': song_id, 'item': item},
formdata=comment_data)
except Exception as e:
            log.warning(f'parse_detail_liric failed: {e}, failing URL: {response.request.url}')
    def parse_detail_comment(self, response):
        """
        1. The full comment data turned out to be too complex, so only the hot comments are crawled.
        2. From the hot comments extract: the commenter, the comment text, the comment time and the number of likes; replies are not included.
        :param response:
        :return:
        """
try:
song_id = response.meta['id']
item = response.meta['item']
comment_item = CommentDetail()
result = json.loads(response.text,strict = False)
for i in range(len(result['data']['hotComments'])):
comment_item['comment_datail'] = result['data']['hotComments'][i]['content']
comment_item['time'] = result['data']['hotComments'][i]['timeStr']
comment_item['thumb_number'] = result['data']['hotComments'][i]['likedCount']
comment_item['commenter_name'] = result['data']['hotComments'][i]['user']['nickname']
comment_item['commenter_url'] = result['data']['hotComments'][i]['user']['avatarUrl']
item['comment'] = comment_item
meida_data = self.encrypt.resultEncrypt(self.encrypt.meida % song_id)
yield scrapy.FormRequest(url=self.encrypt.meida_url,
callback=self.parse_detail_media_url, meta={'song_id': song_id, 'item': item},
formdata=meida_data)
                break  # only the first hot comment is kept
except Exception as e:
            log.warning(f'parse_detail_comment failed: {e}, failing URL: {response.request.url}')
def parse_detail_media_url(self,response):
item = response.meta['item']
try:
result = json.loads(response.text,strict = False)
media_url = result['data'][0]['url']
log.warning(f"这是你找到的音频网址{media_url}")
if media_url:
yield scrapy.Request(url=media_url,callback=self.parse_detail_media,meta={'item':item})
except Exception as e:
log.warning(f"parse_detail_media_url函数出现了问题{e},找不到访问url失败,出错网址是{response.request.url}")
def parse_detail_media(self,response):
item = response.meta['item']
        # log.warning(f'Accumulated item so far: {item}')
try:
media_io = io.BytesIO()
media_io.write(response.body)
item['media'] = media_io
except Exception as e:
log.warning("没有二进制数据")
yield item
    def parse_lyric(self, string):
        # raw LRC lyrics look like '[00:12.34]line one[00:15.67]line two' once whitespace is stripped
        lyric = ''.join(string.split())   # remove all whitespace
        lyric = lyric.replace(']', '[')   # delimit every timestamp with '['
        lyric = lyric.split('[')          # -> ['', '00:12.34', 'line one', '00:15.67', 'line two', ...]
        # keep only the text parts (even indices, skipping the leading empty string)
        lyric = [lyric[2 * i] for i in range(len(lyric)) if 2 * i < len(lyric) and i != 0]
        return ''.join(lyric)
"""
评论 https://music.163.com/weapi/comment/resource/comments/get?csrf_token=
音频 https://music.163.com/weapi/song/enhance/player/url/v1?csrf_token=
歌词 https://music.163.com/weapi/song/lyric?csrf_token=
"""
4.6 parse_params.py
This file is the key to the parameter-generation rules. To keep others from misusing it, I am not posting it here.
The whole project is on GitHub; clone it if you are interested.
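parse_params.py itself is not shown, but for readability here is the interface that the spider in section 4.5 expects it to expose. This is only a hypothetical stub sketching the attribute names and the return shape (they are taken from how wycomment.py calls the object); the payload templates and the actual AES/RSA routines are placeholders, not the real thing:

# hypothetical stub of what wycomment.py assumes parse_params.py provides
class eneryctAES:
    # endpoints the spider posts the encrypted form data to
    lyric_url = 'https://music.163.com/weapi/song/lyric?csrf_token='
    comment_url = 'https://music.163.com/weapi/comment/resource/comments/get?csrf_token='
    meida_url = 'https://music.163.com/weapi/song/enhance/player/url/v1?csrf_token='
    # JSON payload templates the song id is %-substituted into (placeholders, not the real payloads)
    lyric = '{"id": "%s"}'
    comment = '{"songId": "%s", "threadId": "%s"}'   # takes the id twice
    meida = '{"ids": "[%s]"}'

    def resultEncrypt(self, text):
        # the real implementation builds params with AES and encSecKey with RSA (see 3.1.3);
        # it is stubbed out here
        params, encSecKey = '', ''
        return {'params': params, 'encSecKey': encSecKey}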
5. Pros and cons
Pros:
- It crawls NetEase Cloud Music more or less end to end, so the amount of information collected is huge
- The project is simple
The whole thing was finished in two days, so it is a bit rushed; quite a few bugs are still unfixed and will be cleaned up later.
Cons:
- VIP-only songs cannot be crawled yet; that would need the right cookies and so on
- Because of the deliberate slow-down delay the download speed is not great; with an IP proxy pool the throttling would not be necessary
- The downloaded lyrics are one long string without punctuation, which is hard to read
- The comment data is very thin, and only the hot comments are crawled
- Every error is only written to the log file; nothing is actually handled