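Scraping MP4 videos from the 百思不得姐 (budejie.com) website. Limitation: the MP4 links must be present in the page HTML itself, so this stopped working after the web version of the site was redesigned in March 2018.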

Article link: https://blog.youkuaiyun.com/WASEFADG/article/details/80907831

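This article presents a simple crawler written in Python. It fetches the HTML pages of a given website that contain videos, parses each video's URL and name out of the HTML, and downloads the video. The techniques involved include the requests library and regular expressions.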
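The parsing step can be tried without touching the network. The following is a minimal sketch, assuming a made-up HTML fragment that only imitates the pre-2018 page layout; `SAMPLE_HTML`, `extract_videos`, and the example.com URL are hypothetical names used for illustration, not part of the original script.

```python
import re

# Hypothetical fragment imitating the old page layout; not real site data.
SAMPLE_HTML = '''
<div class="j-r-list-c">
  <div class="j-r-list-c-desc">
    <a href="/detail-12345678.html">Sample clip</a>
  </div>
  <div class="j-video" data-mp4="http://example.com/clip.mp4"></div>
</div>
'''

def extract_videos(html):
    """Return (url, name) pairs found in the given HTML text."""
    # Non-greedy capture of the attribute value that holds the video URL.
    urls = re.findall(r'data-mp4="(.*?)"', html)
    # The link text of the detail page serves as the video name.
    names = re.findall(r'<a href="/detail-.{8}\.html">(.*?)</a>', html)
    # Pair them positionally; this assumes every block has both a URL and a name.
    return list(zip(urls, names))

videos = extract_videos(SAMPLE_HTML)
print(videos)
```

Pairing by position is only safe when every block carries both pieces of data. Splitting the page into per-post blocks first and searching each block separately avoids mismatched pairs when some posts have no video.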


Here is the code, shared for anyone who finds it useful.

#-*-coding:utf-8 -*-

import requests # third-party library
import re # regular expression module
import urllib # only needed by the commented-out urlretrieve alternative
def get_response(url):
    response=requests.get(url).text
    return response # return the page source as text

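def get_content(html): # extract the HTML blocks that contain videos and return them
    reg=re.compile(r'(<div class="j-r-list-c">.*?</div>.*?</div>)',re.S) # the metacharacter . matches any character except a newline by default
    # re.S lets . match newlines too; * matches the preceding element zero or more times
    return re.findall(reg,html)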

def get_mp4_url(response):
    reg=r'data-mp4="(.*?)"'
    return re.findall(reg,response)

def get_mp4_name(response):
    reg=re.compile(r'<a href="/detail-.{8}\.html">(.*?)</a>') # raw string; \. matches a literal dot
    return re.findall(reg,response)
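def download_mp4(mp4_url,path):
    path=''.join(path.split()) # strip all whitespace from the video name
    path='D:\\xx\\{}.mp4'.format(path) # where the video is saved; a Python 3 str needs no decode/encode step
    #urllib.urlretrieve(mp4_url,path) # Python 2 alternative; urllib.request.urlretrieve in Python 3
    content=requests.get(mp4_url).content # raw bytes; get_response() returns decoded text, which cannot be written in 'wb' mode
    with open(path,'wb') as f:
        f.write(content)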
if __name__== '__main__': # run only when this file is executed directly
    start_url='http://www.budejie.com/'
    #print(len(get_content(get_response(start_url))))
    content=get_content(get_response(start_url))
    for i in content:
        mp4_url=get_mp4_url(i)
        if mp4_url:
            mp4_name=get_mp4_name(i)
            print(mp4_url[0],mp4_name[0])
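
The main block above only prints each URL and name; it never calls `download_mp4`. The following is a minimal sketch of how the download step could look, assuming Python 3. It swaps in the standard library's `urllib.request.urlopen` for requests so that it has no third-party dependency and streams the video to disk rather than holding it in memory. The names `safe_filename` and `download_video` and the `videos` output directory are hypothetical, not part of the original script.

```python
import os
import re
import shutil
import urllib.request

def safe_filename(name, max_len=80):
    """Drop whitespace and characters that Windows forbids in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', '', name)
    # Fall back to a placeholder if nothing usable is left.
    return cleaned[:max_len] or 'untitled'

def download_video(mp4_url, name, out_dir='videos'):
    """Stream one video into out_dir and return the path that was written."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, safe_filename(name) + '.mp4')
    # copyfileobj copies in chunks, so the whole file is never held in memory.
    with urllib.request.urlopen(mp4_url, timeout=30) as resp, open(path, 'wb') as f:
        shutil.copyfileobj(resp, f)
    return path
```

Inside the loop, one possible call would be `download_video(mp4_url[0], mp4_name[0])`, made only after checking that both lists are non-empty, since a post without a matching title link would otherwise raise an `IndexError`.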