[Web Scraping] Final Mini-Project: NetEase Cloud Music

Article link: https://blog.youkuaiyun.com/weixin_52527444/article/details/122140579

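This article explains in detail how to use Python's selenium and BeautifulSoup libraries to scrape song information from four NetEase Cloud Music charts, 飙升榜 (Rising), 新歌榜 (New Songs), 原创榜 (Originals), and 热歌榜 (Hot Songs), including song title, duration, artist, and comment data. A headless Chrome browser simulates user actions to open the charts page; the HTML is parsed for the first-level page data, and each song's detail page is then visited for the comment text, like count, and date. Finally, the data is saved to a CSV file for later analysis. The article also includes code for special cases, such as converting dates and like counts. Storing the data in MySQL requires some additional database code.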

This article uses the following packages: selenium, BeautifulSoup, time, csv, datetime
import time, csv, datetime
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
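selenium has to be downloaded, installed, and configured separately (together with chromedriver); other blog posts cover the setup if you need it.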

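Requirements:

Scrape [song title, duration, artist] from four NetEase Cloud Music charts, then open each song's second-level page and scrape [album, commenter, comment text, like count, comment date], and finally write both sets of data to a CSV file.

Requirements analysis:

1. Load the chromedriver driver, enable headless mode, and create the browser object

2. Open the NetEase Cloud Music home page

3. Click through to the charts page

4. Define a function that fetches the first-level page

5. Switch to the correct frame and select one of the four charts

6. Grab the whole current page

7. Use BeautifulSoup to extract the fields

8. Loop over the fields and clean them up

9. In each iteration, call the function that fetches the second-level page fields

10. Define a function that fetches the second-level page

11. Create a second driver and browser object, also headless

12. Open the second-level URL passed in from the first-level function

13. Switch to the correct frame and grab the current page

14. Use BeautifulSoup to process the fields

15. Return the second-level list to the first-level function

16. In the first-level function, read the fields out of the second-level list and add them to the final list

17. Write the result to CSV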
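Step 17 can be sketched on its own. The helper below is a minimal illustration, not the article's code: `append_row` and the English `HEADER` names are hypothetical, and the idea is only that the header row is written once, when the file is new or empty, so that re-running the scraper in append mode does not repeat it.

```python
import csv
import os

# Hypothetical column names for illustration
HEADER = ["Song", "Minutes", "Seconds", "Artist", "Album",
          "Top commenter", "Top comment", "Likes", "Comment date"]


def append_row(path, row, header=HEADER):
    """Append one row; write the header first only if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if need_header:
            writer.writerow(header)
        writer.writerow(row)
```

Calling `append_row("chart.csv", song_list)` once per song would then replace both the header write and the per-row write.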

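Implementation:

1. Define a variable that holds the number of the chart to scrape

2. Load the chromedriver driver, enable headless mode, and create the browser object

3. Open the NetEase Cloud Music home page

4. Click through to the charts page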

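# Ask which chart to scrape
print('Charts:')
print('1 飙升榜 (Rising)')
print('2 新歌榜 (New Songs)')
print('3 原创榜 (Originals)')
print('4 热歌榜 (Hot Songs)')
charlist = input("Enter the number of the chart to scrape (1, 2, 3, 4): ")
# Keep asking until the input is one of the four valid numbers
# (this also rejects non-numeric input, which int() would crash on)
while charlist.strip() not in ('1', '2', '3', '4'):
    charlist = input('Invalid chart number, please try again: ')
# URL of the NetEase Cloud Music home page
url = 'https://music.163.com'
# Run Chrome in headless mode (no visible window)
op = Options()
op.add_argument("--headless")
# Create the browser object; the raw string keeps the backslashes in the Windows path literal.
# Passing the driver path positionally works on Selenium releases that still accept it;
# newer Selenium 4 releases expect webdriver.Chrome(service=Service(path), options=op).
b = webdriver.Chrome(r"E:\谷歌浏览器\Google\Chrome\Application\chromedriver.exe", options=op)
# Open the home page
b.get(url)
# Find the charts link and click it
b.find_element(By.CSS_SELECTOR, 'div > ul > li:nth-child(2) > a').click()
# Wait 1 s for the page to load
time.sleep(1)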

Without headless mode, a Chrome window pops up at run time; enabling it is simply more convenient.
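The chart-number prompt in the code above can also be separated into a small, testable function. This is only a sketch under the assumption that valid choices are the integers 1 to 4; the names `parse_chart_choice` and `ask_chart_choice` are made up for illustration.

```python
def parse_chart_choice(text, low=1, high=4):
    """Return the chart number as an int, or None if the text is not a valid choice."""
    try:
        value = int(text.strip())
    except ValueError:
        # Non-numeric input such as "abc" or an empty string
        return None
    return value if low <= value <= high else None


def ask_chart_choice(prompt="Enter the number of the chart to scrape (1, 2, 3, 4): "):
    """Prompt repeatedly until the user enters a valid chart number."""
    while True:
        choice = parse_chart_choice(input(prompt))
        if choice is not None:
            return choice
        prompt = "Invalid chart number, please try again: "
```

Keeping the parsing apart from `input()` means the validation can be checked without a terminal.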

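Define a function to fetch the first-level page:

5. Switch to the correct frame and select one of the four charts

6. Grab the whole current page

7. Use BeautifulSoup to extract the fields

8. Loop over the fields and clean them up

9. In each iteration, call the function that fetches the second-level page fields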

def getCharts(chartslist):
    # Switch into the iframe that holds the page content
    b.switch_to.frame(b.find_element(By.ID, 'g_iframe'))
    # Select the chart that matches the number passed in
    listname = ""
    if int(chartslist) == 1:
        b.find_element(By.CSS_SELECTOR, 'ul:nth-child(2) > li.mine.z-selected > div > div > a').click()  # 飙升榜 (Rising)
        listname = "飙升榜"
    elif int(chartslist) == 2:
        b.find_element(By.CSS_SELECTOR, 'ul:nth-child(2) > li:nth-child(2) > div > div > a').click()  # 新歌榜 (New Songs)
        listname = "新歌榜"
    elif int(chartslist) == 3:
        b.find_element(By.CSS_SELECTOR, 'ul:nth-child(2) > li:nth-child(3) > div > div > a').click()  # 原创榜 (Originals)
        listname = "原创榜"
    else:
        b.find_element(By.CSS_SELECTOR, 'ul:nth-child(2) > li:nth-child(4) > div > div > a').click()  # 热歌榜 (Hot Songs)
        listname = "热歌榜"
    # Open the CSV file in append mode and write the header row
    with open("%s.csv" % listname, "a", newline="", encoding='utf-8', errors='ignore') as f:
        w1 = csv.writer(f)
        title = ["Song", "Minutes", "Seconds", "Artist", "Album", "Top commenter", "Top comment", "Likes", "Comment date"]
        w1.writerow(title)
    # Grab the HTML of the current page
    html = b.page_source
    # Parse it
    soup = BeautifulSoup(html, 'lxml')
    # All song titles
    soupname = soup.select('div > div > div > span > a > b')
    # All durations
    soupduration = soup.select('td.s-fc3 > span')
    # All artists
    soupsinger = soup.select('td:nth-child(4) > div > span')
    # All song links (the href carries the song id)
    soupid = soup.select('div > div > div > span > a')
    # Walk the four lists in parallel and collect the data
    for n, d, s, i in zip(soupname, soupduration, soupsinger, soupid):
        name = n.get('title').replace('\xa0', ' ')
        minute = d.text.split(':')[0]
        second = d.text.split(':')[1]
        singer = s.get('title')
        song_href = i.get('href')
        newurl = url + song_href
        newlist = getnewHtml(newurl)
        album = newlist[0]
        commentator = newlist[1]
        comment = newlist[2].replace('\xa0', ' ').replace("\"","“")
        praise = newlist[3]
        untreatedYesterday = datetime.datetime.today().date() - datetime.timedelta(days=1)  # yesterday's date
        yesterday = '%i月%i日' % (untreatedYesterday.month, untreatedYesterday.day)
        # The date is the fifth element (index 4); rewrite '昨天' (yesterday) as month and day
        date = newlist[4].replace('昨天', '%s ' % yesterday)
        # Collect one row of data
        song_list = [name, minute, second, singer,album,commentator, comment, praise, date]
        # Print it
        print(song_list)
        # Append it to the CSV file created above
        with open("%s.csv" % listname, "a", newline="", encoding='utf-8', errors='ignore') as f:
            w1 = csv.writer(f)
            w1.writerow(song_list)

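Notes on parts of the code:

Song title: use get to read the tag's title attribute, then use replace to turn each '\xa0' (non-breaking space) into ' '.

Duration: split the song's duration on ':' into minutes and seconds and store each in a variable.

Second-level page URL: build a new URL from each song's link (href) obtained above, scrape the second-level page through it, and store the returned list as newlist.

Album: the first element of newlist

Commenter: the second element of newlist

Comment text: the third element of newlist, with '\xa0' replaced by ' '. Some comments contain straight double quotes (for example at the start), and each '"' is changed to the Chinese opening quote '“'.

Like count: the fourth element of newlist

Date: a comment's date is sometimes shown not as month and day plus time but as '昨天 xx:xx' ('yesterday xx:xx'). In that case, compute yesterday's date (today minus one day), take its month and day, and rewrite the '昨天' in the fifth element of newlist as yesterday's month and day.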
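The date rewrite described above can be isolated as follows. This is a sketch, not the article's exact code: `normalize_comment_date` is a hypothetical name, and the optional `today` parameter is an addition so the function gives the same result no matter when it runs. It assumes, like the original replace call, that the site's text puts the time directly after '昨天'.

```python
import datetime


def normalize_comment_date(raw, today=None):
    """Rewrite '昨天' (yesterday) in a comment date as 'M月D日 '."""
    today = today or datetime.date.today()
    yesterday = today - datetime.timedelta(days=1)
    label = "%d月%d日 " % (yesterday.month, yesterday.day)
    # Dates that do not contain '昨天' are returned unchanged
    return raw.replace("昨天", label)
```

For example, with `today=datetime.date(2022, 1, 1)` the input `"昨天10:30"` becomes `"12月31日 10:30"`, which also shows the month rolling back across a year boundary.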

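Code for the second-level page:

10. Define a function that fetches the second-level page

11. Create a second driver and browser object, also headless

12. Open the second-level URL passed in from the first-level function

13. Switch to the correct frame and grab the current page

14. Use BeautifulSoup to process the fields

15. Return the second-level list to the first-level function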

def getnewHtml(url):
    # Open the song page in a second headless browser
    b2 = webdriver.Chrome(r"E:\谷歌浏览器\Google\Chrome\Application\chromedriver.exe", options=op)
    b2.get(url)
    b2.switch_to.frame(b2.find_element(By.ID, 'g_iframe'))
    html = b2.page_source
    # Close this browser so that one Chrome process is not left running per song
    b2.quit()
    soup2 = BeautifulSoup(html, 'lxml')
    soupalbum = soup2.select('div.cnt > p:nth-child(3) > a')
    soupcomment = soup2.select('div.cntwrap > div:nth-child(1) > div')
    souppraise = soup2.select('div.cntwrap > div.rp > a:nth-child(3)')
    soupdate = soup2.select('div.cntwrap > div.rp > div')
    album = soupalbum[0].text

    def like_count(text):
        # "(1234)" -> 1234; "(1.2万)" -> 12000; no parentheses (no likes) -> 0
        if "(" not in text:
            return 0
        praise = text.split('(')[1].split(')')[0]
        if "万" in praise:
            if "." in praise:
                # One decimal digit: drop the '.' and turn '万' into 000
                praise = praise.replace(".", "").replace("万", "000")
            else:
                praise = praise.replace("万", "0000")
        return int(praise)

    # Compare like counts as numbers; the raw strings would compare character by character
    first = like_count(souppraise[0].text)
    second = like_count(souppraise[1].text)
    # Take the second comment only if it has more likes (an ad comment sometimes sits above
    # the top-liked one); on a tie or with no likes, default to the first comment
    k = 1 if second > first else 0
    # Split on '：': part 0 is the commenter, the rest is the comment. The comment itself
    # may contain '：', so the remaining parts are appended back one by one
    parts = soupcomment[k].text.split('：')
    commentator = parts[0]
    comment = parts[1]
    i = 1
    while True:
        if i <= len(parts) - 2:
            comment = comment + "：" + parts[i + 1]
        else:
            break
        i += 1
    praise = like_count(souppraise[k].text)  # like count
    date = soupdate[k].text  # comment date
    song_list2 = [album, commentator, comment, praise, date]
    return song_list2

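Notes on parts of the code:

The if check: if the first comment has more likes than the second, take the first comment; otherwise take the second. (The check exists because an ad comment sometimes sits above the top-liked comment.)

Comment: split on '：'; index 0 is the commenter and everything after it is the comment (a comment usually splits into just two elements, so index 1 is taken as comment by default).

Sometimes a comment contains more than one '：'. Apart from index 0, which is the commenter's name, a loop has to join all the remaining pieces back together.

The while True loop: i = 1 is defined before the loop. While i is at most the length of the split list minus 2, the loop appends '：' plus the element at index i+1 to comment and then increments i (i += 1); once the condition fails, break exits the loop.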
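As an alternative to splitting on every '：' and re-joining the pieces in a loop, the text can be split at the first '：' only. The sketch below uses `str.partition` for that; `split_comment` is a hypothetical helper name, and returning an empty commenter when no separator is found is an assumption made here for illustration.

```python
def split_comment(text, sep="："):
    """Split 'commenter：comment' at the first separator only."""
    commentator, found, comment = text.partition(sep)
    if not found:
        # No separator at all: treat the whole text as the comment
        return "", text
    return commentator, comment
```

Any further '：' characters stay inside the comment untouched, so no re-joining is needed.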

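Like count: read the page element directly and strip the '(' and ')' so that only the number remains; if the count is shown as xx.x万 (万 means 10,000), remove the '.' and replace '万' with 000.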
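Since the like counts are later compared with each other, it helps to turn them into integers first: as strings, "(999)" sorts above "(12345)" because the comparison goes character by character. The following is a sketch only; `parse_like_count` is a hypothetical name, and the accepted text shapes ("(1234)", "(1.2万)", or no parenthesised count) are assumptions based on the cases the article describes.

```python
import re


def parse_like_count(text):
    """Convert like text such as '(1234)' or '(1.2万)' to an int; 0 if absent."""
    match = re.search(r"\((\d+(?:\.\d+)?)(万?)\)", text)
    if not match:
        return 0
    number, unit = match.groups()
    value = float(number)
    if unit:
        # 万 means ten thousand
        value *= 10000
    return int(round(value))
```

With integer counts, the top-liked comment can be picked with an ordinary numeric comparison.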

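Comment date: read the page element directly; it is cleaned up later in the first-level function.

Besides the two cases where the first comment has more or fewer likes than the second, there is a third case: the two comments have no likes or the same number of likes, in which case the first comment is taken by default.

The comment is extracted the same way as above.

For the like count: if '(' appears in the like text, it is extracted the same way as above; otherwise praise = 0.

The date is extracted the same way as above.

Put the required fields into song_list2 and return it to the first-level function.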

The complete code for this article:

import time, csv, datetime
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options

# Fetch the album, top comment, like count, and date from the second-level page
def getnewHtml(url):
    # Open the song page in a second headless browser
    b2 = webdriver.Chrome(r"E:\谷歌浏览器\Google\Chrome\Application\chromedriver.exe", options=op)
    b2.get(url)
    b2.switch_to.frame(b2.find_element(By.ID, 'g_iframe'))
    html = b2.page_source
    # Close this browser so that one Chrome process is not left running per song
    b2.quit()
    soup2 = BeautifulSoup(html, 'lxml')
    soupalbum = soup2.select('div.cnt > p:nth-child(3) > a')
    soupcomment = soup2.select('div.cntwrap > div:nth-child(1) > div')
    souppraise = soup2.select('div.cntwrap > div.rp > a:nth-child(3)')
    soupdate = soup2.select('div.cntwrap > div.rp > div')
    album = soupalbum[0].text

    def like_count(text):
        # "(1234)" -> 1234; "(1.2万)" -> 12000; no parentheses (no likes) -> 0
        if "(" not in text:
            return 0
        praise = text.split('(')[1].split(')')[0]
        if "万" in praise:
            if "." in praise:
                # One decimal digit: drop the '.' and turn '万' into 000
                praise = praise.replace(".", "").replace("万", "000")
            else:
                praise = praise.replace("万", "0000")
        return int(praise)

    # Compare like counts as numbers; the raw strings would compare character by character
    first = like_count(souppraise[0].text)
    second = like_count(souppraise[1].text)
    # Take the second comment only if it has more likes (an ad comment sometimes sits above
    # the top-liked one); on a tie or with no likes, default to the first comment
    k = 1 if second > first else 0
    # Split on '：': part 0 is the commenter, the rest is the comment. The comment itself
    # may contain '：', so the remaining parts are appended back one by one
    parts = soupcomment[k].text.split('：')
    commentator = parts[0]
    comment = parts[1]
    i = 1
    while True:
        if i <= len(parts) - 2:
            comment = comment + "：" + parts[i + 1]
        else:
            break
        i += 1
    praise = like_count(souppraise[k].text)  # like count
    date = soupdate[k].text  # comment date
    song_list2 = [album, commentator, comment, praise, date]
    return song_list2

# Fetch the song title, duration, artist, and second-level page link from the first-level page
def getCharts(chartslist):
    # Switch into the iframe that holds the page content
    b.switch_to.frame(b.find_element(By.ID, 'g_iframe'))
    # Select the chart that matches the number passed in
    listname = ""
    if int(chartslist) == 1:
        b.find_element(By.CSS_SELECTOR, 'ul:nth-child(2) > li.mine.z-selected > div > div > a').click()  # 飙升榜 (Rising)
        listname = "飙升榜"
    elif int(chartslist) == 2:
        b.find_element(By.CSS_SELECTOR, 'ul:nth-child(2) > li:nth-child(2) > div > div > a').click()  # 新歌榜 (New Songs)
        listname = "新歌榜"
    elif int(chartslist) == 3:
        b.find_element(By.CSS_SELECTOR, 'ul:nth-child(2) > li:nth-child(3) > div > div > a').click()  # 原创榜 (Originals)
        listname = "原创榜"
    else:
        b.find_element(By.CSS_SELECTOR, 'ul:nth-child(2) > li:nth-child(4) > div > div > a').click()  # 热歌榜 (Hot Songs)
        listname = "热歌榜"
    # Open the CSV file in append mode and write the header row
    with open("%s.csv" % listname, "a", newline="",encoding='utf-8', errors='ignore') as f:
        w1 = csv.writer(f)
        title = ["Song", "Minutes", "Seconds", "Artist", "Album", "Top commenter", "Top comment", "Likes", "Comment date"]
        w1.writerow(title)

    # Grab the HTML of the current page
    html = b.page_source
    # Parse it
    soup = BeautifulSoup(html, 'lxml')
    # All song titles
    soupname = soup.select('div > div > div > span > a > b')
    # All durations
    soupduration = soup.select('td.s-fc3 > span')
    # All artists
    soupsinger = soup.select('td:nth-child(4) > div > span')
    # All song links (the href carries the song id)
    soupid = soup.select('div > div > div > span > a')
    # Walk the four lists in parallel and collect the data
    for n, d, s, i in zip(soupname, soupduration, soupsinger, soupid):
        name = n.get('title').replace('\xa0', ' ')
        minute = d.text.split(':')[0]
        second = d.text.split(':')[1]
        singer = s.get('title')
        song_href = i.get('href')
        newurl = url + song_href
        newlist = getnewHtml(newurl)
        album = newlist[0]
        commentator = newlist[1]
        comment = newlist[2].replace('\xa0', ' ').replace("\"","“")
        praise = newlist[3]
        untreatedYesterday = datetime.datetime.today().date() - datetime.timedelta(days=1)  # yesterday's date
        yesterday = '%i月%i日' % (untreatedYesterday.month, untreatedYesterday.day)
        date = newlist[4].replace('昨天', '%s ' % yesterday)  # rewrite '昨天' (yesterday) as month and day
        song_list = [name, minute, second, singer,album,commentator, comment, praise, date]
        print(song_list)
        with open("%s.csv" % listname, "a", newline="", encoding='utf-8', errors='ignore') as f:
            w1 = csv.writer(f)
            w1.writerow(song_list)


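# Ask which chart to scrape
print('Charts:')
print('1 飙升榜 (Rising)')
print('2 新歌榜 (New Songs)')
print('3 原创榜 (Originals)')
print('4 热歌榜 (Hot Songs)')
charlist = input("Enter the number of the chart to scrape (1, 2, 3, 4): ")
# Keep asking until the input is one of the four valid numbers
while charlist.strip() not in ('1', '2', '3', '4'):
    charlist = input('Invalid chart number, please try again: ')
# URL of the NetEase Cloud Music home page
url = 'https://music.163.com'

# Headless Chrome; the raw string keeps the backslashes in the Windows path literal
op = Options()
op.add_argument("--headless")
b = webdriver.Chrome(r"E:\谷歌浏览器\Google\Chrome\Application\chromedriver.exe", options=op)
b.get(url)
b.find_element(By.CSS_SELECTOR, 'div > ul > li:nth-child(2) > a').click()  # click the charts link
time.sleep(1)

getCharts(charlist)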

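If anything here falls short, let's talk it over in the comments.

Addendum: to write the data to MySQL as well, add the following code

Open the connection above the getCharts(charlist) call

At the end of the getCharts function, after the CSV write, add the SQL statement

Close the connection below the getCharts(charlist) call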

import pymysql

# Open the MySQL connection
conn = pymysql.connect(host="localhost", port=3306, user="root", password="123456", database="wyyyy", charset="utf8mb4")

getCharts(charlist)
# Close the MySQL connection (cursors are closed where they are used)
conn.close()

Add the following at the end of the first-level page function

        with open("%s.csv" % listname, "a", newline="", encoding='utf-8', errors='ignore') as f:
            w1 = csv.writer(f)
            w1.writerow(song_list)
        # Write to MySQL: pick the table for the selected chart
        table = {1: "bsb", 2: "xgb", 3: "ycb"}.get(int(charlist), "rgb")
        # Let the driver fill in the %s placeholders, so quotes inside titles or
        # comments cannot break the statement
        sql = 'insert into ' + table + ' values (%s,%s,%s,%s,%s,%s,%s,%s,%s);'
        with conn.cursor() as cursor:
            cursor.execute(sql, (name, minute, second, singer, album, commentator, comment, praise, date))
        conn.commit()

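Note: when creating the MySQL tables, give the columns that hold text the text type and set the character set to utf8mb4; otherwise the insert will raise an error.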
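Scraped comments often contain quotation marks, which is a good reason to pass values as query parameters instead of formatting them into the SQL string. The sketch below illustrates the idea with Python's built-in sqlite3 as a stand-in, so it runs without a MySQL server. This is a swapped-in database, not the article's setup: sqlite3 uses `?` placeholders where pymysql uses `%s`, and the three-column `bsb` table here is a cut-down, hypothetical version of the real one.

```python
import sqlite3

# In-memory database used purely as a stand-in for MySQL
conn = sqlite3.connect(":memory:")
conn.execute("create table bsb (name text, comment text, praise integer)")

# A row whose text contains both kinds of quotes
row = ('Song "A"', "she said: \"it's great\"", 12000)

# The driver binds the values, so the quotes need no manual escaping
conn.execute("insert into bsb values (?, ?, ?)", row)
conn.commit()

stored = conn.execute("select name, comment, praise from bsb").fetchone()
conn.close()
```

The stored row comes back exactly as it was passed in, quotes included.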