Scraping Bilibili subtitles with Python

天下<*+*>

Published 2025-03-05 19:08:53

Tags: python, programming languages

Article link: https://blog.youkuaiyun.com/m0_73177400/article/details/146049839

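It started like this: I was watching some online videos for the 408 postgraduate entrance exam and wanted to turn them into electronic notes. Typing everything by hand was far too tiring, so I looked for a way to download the subtitles. A web search turned up the Bilibili video "快速提取视频字幕！适用B站、AI字幕等等。好用 - 哔哩哔哩 (bilibili.com)", but that approach was still a hassle, so I wondered whether I could scrape the subtitles with Python instead.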

I noticed a pattern and found where the subtitle file I was looking for is stored.

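After several experiments I found that the BV or AV number is all that is needed to fetch it. So how do you get the BV or AV number? My first search turned up the converter at BV-AV号互转 - 碧梨工具箱 | bilitools.top, so I reproduced its logic in Python: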

from functools import reduce

# Constants needed for the conversion
XORCODE = 23442827791579
MASKCODE = 2251799813685247
BASE = 58
# Base-58 alphabet: a character's position in this string is its digit value
datastring = "FcwAPNKTMug3GV5Lj7EJnHpWsx4tb8haYeviqBz6rkCy12mUSDQX9RdoZf"


# Step function passed to reduce
def reduceoperation(t, n):
    index = datastring.index(n)  # position of the character in the alphabet
    return t * BASE + index  # fold the digit into the running total


def bv2av(bv):
    n = list(bv)
    # Undo the character shuffle: swap positions 3 and 9, then 4 and 7
    n[3], n[9] = n[9], n[3]
    n[4], n[7] = n[7], n[4]
    # Drop the "BV1" prefix and read the rest as a base-58 number, starting from 0
    n = n[3:]
    o = reduce(reduceoperation, n, 0)

    return (o & MASKCODE) ^ XORCODE


# Convert a BV number to an AV number
bv = "BV1tP9iYsEzq"
av = bv2av(bv)

# Print the result
print(av)
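
Since the tool converts in both directions, the reverse step (AV to BV) can be sketched by running the same arithmetic backwards. This is only a sketch under assumptions: the constants are copied from the conversion code above, while `MAX_AID` (a marker bit set at 2**51) and the function name `av2bv` are not from the original and are introduced here for illustration. A round trip through both functions is a simple way to check that the two directions agree.

```python
# Constants copied from the BV -> AV code above
XOR_CODE = 23442827791579
MASK_CODE = 2251799813685247  # equals 2**51 - 1
MAX_AID = 1 << 51             # assumption: marker bit set before encoding
BASE = 58
ALPHABET = "FcwAPNKTMug3GV5Lj7EJnHpWsx4tb8haYeviqBz6rkCy12mUSDQX9RdoZf"


def bv2av(bv):
    """Decode a 12-character BV number into its numeric AV number."""
    chars = list(bv)
    chars[3], chars[9] = chars[9], chars[3]
    chars[4], chars[7] = chars[7], chars[4]
    value = 0
    for ch in chars[3:]:
        value = value * BASE + ALPHABET.index(ch)
    return (value & MASK_CODE) ^ XOR_CODE


def av2bv(av):
    """Encode a numeric AV number as a BV number (hypothetical inverse of bv2av)."""
    if not 0 < av < MAX_AID:
        raise ValueError("av must be between 1 and 2**51 - 1")
    chars = list("BV1000000000")  # "BV1" prefix plus nine digit slots
    value = (MAX_AID | av) ^ XOR_CODE
    pos = len(chars) - 1
    # Write base-58 digits from the last slot backwards
    while value > 0:
        chars[pos] = ALPHABET[value % BASE]
        value //= BASE
        pos -= 1
    # Apply the same swaps that bv2av undoes
    chars[3], chars[9] = chars[9], chars[3]
    chars[4], chars[7] = chars[7], chars[4]
    return "".join(chars)


if __name__ == "__main__":
    for sample_av in (2, 170001, 113000000000000):
        encoded = av2bv(sample_av)
        # Decoding the encoded value should give back the original number
        print(sample_av, encoded, bv2av(encoded) == sample_av)
```

The round-trip check only shows that the two functions are consistent with each other; it does not by itself prove that the result matches what the site produces.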

But I later found that none of this is necessary: the information can be extracted directly from the page.

Look for window_playinfo in the page source, then extract the aid and cid values from it.
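
The extraction step can be tried in isolation on a string before touching the network. The sketch below assumes the page source contains `"aid":<number>` and a `"<BV number>","cid":<number>` pair; the helper name `extract_ids`, the `window.__DATA__` variable, and the sample fragment are made up for illustration and are not taken from a real page.

```python
import re


def extract_ids(html, bvid):
    """Return (aid, cid) as ints, read from page source for the given BV number."""
    aid_match = re.search(r'"aid":(\d+)', html)
    # The cid is looked up right after the BV number it belongs to
    cid_match = re.search(rf'"{re.escape(bvid)}","cid":(\d+)', html)
    if aid_match is None or cid_match is None:
        raise ValueError("aid or cid not found in page source")
    return int(aid_match.group(1)), int(cid_match.group(1))


# Made-up page fragment, for illustration only
sample_html = (
    '<script>window.__DATA__={"aid":123456,'
    '"bvid":"BV1wANCebE8z","cid":7890123,"p":1}</script>'
)

if __name__ == "__main__":
    print(extract_ids(sample_html, "BV1wANCebE8z"))
```

Raising an error when either value is missing makes a failed extraction obvious, instead of surfacing later as an `IndexError` on an empty `findall` result.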

import json
import re

import requests

# Request headers; the Cookie value has to come from your own logged-in browser session
headers = {
    "Cookie": "",  # left blank here, paste your own cookie
    "Origin": "https://www.bilibili.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Edg/126.0.0.0",
}
bvname = "BV1wANCebE8z"
pp = "0"

# Target URL
url = f"https://www.bilibili.com/video/{bvname}/?spm_id_from=333.1007.tianma.1-2-2.click&vd_source=efbc10526f5fa5642530923cf09ce506&p={pp}"

# Send the HTTP request for the video page
response = requests.get(url, headers=headers)

# Pull aid and cid out of the page source
aid_match = re.findall(r'.*,"aid":(\d+),', response.text)[0]
cids_match = re.findall(rf'.*"{bvname}","cid":(\d+),', response.text)[0]

# Query the player API, whose response contains the subtitle URL
htur = f"https://api.bilibili.com/x/player/wbi/v2?aid={aid_match}&cid={cids_match}"
print(htur)
response2 = requests.get(htur, headers=headers)

# Use re.search() to find the subtitle URL in the response text
pattern = r'"subtitle_url":"(.*?)","subtitle_url_v2"'
match = re.search(pattern, response2.text)

# Stop here if nothing usable was found; otherwise take the first captured group
if not match or not match.group(1):
    raise SystemExit("No match found")
subtitle_url = match.group(1)
print(subtitle_url)

# The URL has no scheme, so prepend "https:" before downloading
uul = f"https:{subtitle_url}"
responses1 = requests.get(uul, headers=headers)
print(responses1.text)
data = json.loads(responses1.text)

# Extract the content field of every subtitle entry
print("\n\n\n\n")
contents = [item['content'] for item in data['body']]
print(contents)
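
As an alternative to matching the subtitle URL with a regular expression, the player response can be parsed as JSON, and the subtitle lines can be joined into plain text for notes. This is a rough sketch: the nesting `data` -> `subtitle` -> `subtitles` is an assumption about the response layout, the helper names `pick_subtitle_url` and `subtitle_text` are hypothetical, and the sample strings are made up so the code runs without any network access.

```python
import json


def pick_subtitle_url(player_json):
    """Return the first non-empty subtitle URL from a player response, or None."""
    payload = json.loads(player_json)
    # Assumed layout: data -> subtitle -> subtitles -> [{"subtitle_url": ...}]
    data = payload.get("data") or {}
    subtitles = (data.get("subtitle") or {}).get("subtitles") or []
    for entry in subtitles:
        url = entry.get("subtitle_url")
        if url:
            # The URL comes without a scheme, as in the script above
            return "https:" + url if url.startswith("//") else url
    return None


def subtitle_text(subtitle_json):
    """Join the content field of every body entry into one line per subtitle."""
    data = json.loads(subtitle_json)
    return "\n".join(item["content"] for item in data["body"])


# Made-up sample payloads, for illustration only
sample_player = json.dumps(
    {"data": {"subtitle": {"subtitles": [
        {"subtitle_url": "//example.com/sub.json", "subtitle_url_v2": ""}
    ]}}}
)
sample_subtitle = json.dumps(
    {"body": [{"content": "first line"}, {"content": "second line"}]}
)

if __name__ == "__main__":
    print(pick_subtitle_url(sample_player))
    print(subtitle_text(sample_subtitle))
```

Parsing the JSON also decodes any escaped characters in the URL, which a regex over the raw response text would leave untouched.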

Note that every request must carry the cookie; otherwise the returned data is incomplete.

We request