A Simple Way to Fetch Weibo Data Without Logging In

Original post · latest recommended article published 2022-09-08 15:59:49 · 11k views


License: CC 4.0 BY-SA

Tags: #python #Weibo mobile web #crawler


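This post briefly records how to fetch data from Weibo's mobile web pages with Python without logging in. The number of posts collected differs somewhat from the reported total, but the program was checked and no error was found. The login flow will be shared in a later post.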


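Just noting this down; I'll post the simple web-side login flow when I have time. The approach here goes through the mobile version of the site as served to a browser. For reference only.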

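import json

import requests

# The original snippet does not show how these are initialized;
# the starting values below are assumptions.
h = 1           # page number to request
page_count = 0  # number of cards seen so far (includes skipped ones)

while True:
    phone_url = "***********&page={0}".format(h)
    header = {
        "User-Agent": "*********",
    }
    req = requests.get(url=phone_url, headers=header)
    res_data = json.loads(req.text).get("data")
    detail_datas = res_data.get('c****ds') or []
    for detail_data in detail_datas:
        page_count += 1
        # Skip cards that are not post bodies
        if detail_data.get('car*****up'):
            continue
        detail_data_all = detail_data.get('m****og')
        # Video fields live under: page_info -> media_info -> mp4_hd_url
        # Like count
        zan_count = detail_data_all.get('attitudes_count', 0)
        # Comment count
        ping_count = detail_data_all.get('comments_count', 0)
        # Publish time
        creat_time = detail_data_all.get('created_at')
        # Image URLs (empty list when the post has no pictures)
        pics = [i.get('url') for i in detail_data_all.get('pics', [])]
        # Repost count
        reposts_count = detail_data_all.get('reposts_count', 0)
        # Video info; reset per post so values do not leak from the previous one
        video_url = None
        video_see_count = None
        video_infos = detail_data_all.get('page_info')
        if video_infos:
            video_info = video_infos.get('media_info')
            if video_info:
                # Video URL
                video_url = video_info.get('m******rl')
                # Video play count
                video_see_count = video_info.get('play_count')
        # The text needs slightly more work; text() is the helper defined below
        # Post text
        text_base = detail_data_all.get('t**t')
        weibo_text = text(header, detail_data_all, text_base)
        # Text of the reposted (original) post, if any
        retweeted_text = ""
        retweeted_status = detail_data_all.get('retweeted_status')
        if retweeted_status:
            retweeted_texts = retweeted_status.get('text')
            retweeted_text = text(header, retweeted_status, retweeted_texts)

    h += 1

    # Stop once the API no longer reports an integer next page
    res_until = res_data.get('cardlistInfo').get('page')
    if not isinstance(res_until, int):
        break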
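The stop rule in the loop above (keep going while `cardlistInfo.page` is an integer) can be isolated and exercised without touching the network. The following is only a sketch under assumptions: `iter_pages`, `fake_fetch`, the `items` key, and the `max_pages` cap are hypothetical names introduced for illustration and are not part of the original code or of any Weibo API.

```python
from typing import Callable, Iterator, Optional


def iter_pages(fetch_page: Callable[[int], Optional[dict]],
               start: int = 1,
               max_pages: int = 1000) -> Iterator[dict]:
    """Yield each page's data dict until no integer next page is reported."""
    page = start
    for _ in range(max_pages):  # hard cap so a misbehaving API cannot loop forever
        data = fetch_page(page)
        if not data:
            break
        yield data
        # Same stop rule as the loop above: a non-int `page` means no more pages.
        next_page = (data.get("cardlistInfo") or {}).get("page")
        if not isinstance(next_page, int):
            break
        page += 1


def fake_fetch(page: int) -> dict:
    """Hypothetical stand-in for the HTTP call: three pages of made-up data."""
    return {
        "items": [{"n": page}],
        "cardlistInfo": {"page": page + 1 if page < 3 else None},
    }


pages = list(iter_pages(fake_fetch))
```

In real use, `fetch_page` would wrap the `requests.get(...)` call and return the parsed `data` object. Passing the fetcher in as a parameter keeps the paging logic testable on its own.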

Text extraction helper

import re

import requests
from lxml import etree


def text(header, detail_data_all, text_base):
    """Return a post's plain text, following the full-text link for long posts."""
    is_long_text = detail_data_all.get('isLongText')
    # Restore line breaks before parsing
    text_html = etree.HTML(text_base.replace("****", "\n"))
    # Long posts are truncated and carry a "全文" (full text) link
    if is_long_text:
        # Full-text URL
        hrefs = text_html.xpath(r"//a[contains(text(),'全文') and contains(@href,'/status/')]/@href")
        if hrefs:
            all_text_url = "https://m.weibo.cn" + hrefs[0]
            all_text_req = requests.get(all_text_url, headers=header)
            full_texts = re.findall(r'"text": "(.*?)\n', all_text_req.text)
            if full_texts:
                text_base = full_texts[0]
    imgs = re.findall(r'<img.*?>', text_base)
    # Replace each <img> tag (emoticons such as a smiley) with its attributes,
    # in place, so the symbol keeps its position in the text
    for i in imgs:
        img = {}
        img_style = re.findall(r"style='(.*?)'", i)
        img['style'] = img_style[0] if img_style else ""
        img_src = re.findall(r"src=.*(//.*?png)", i)
        img['src'] = img_src[0] if img_src else ""
        img_alt = re.findall(r'alt=(.*?) ', i)
        img['img_alt'] = img_alt[0] if img_alt else ""
        text_base = text_base.replace(i, str(img) + ',')
    text_html = etree.HTML(text_base.replace("*****", "\n"))
    # Extract the text content
    plain_text = text_html.xpath('string(.)').strip()
    return plain_text
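For comparison, here is a standard-library-only sketch of the same idea: keep emoticons at their position while stripping markup. It is a simplified variant and not the author's method. It substitutes each `<img>` with its `alt` value instead of a dict of attributes, and it uses regular expressions rather than `lxml`. The name `html_to_plain`, the sample markup, and the `<br />` marker passed in the example are all made up for illustration.

```python
import re
from html import unescape

IMG_TAG = re.compile(r"<img\b[^>]*>")
ALT_ATTR = re.compile(r"""alt=["']?([^"'\s>]+)""")
ANY_TAG = re.compile(r"<[^>]+>")


def html_to_plain(fragment: str, newline_marker: str) -> str:
    """Turn an HTML fragment into plain text, keeping emoticon alt text in place."""
    def img_to_alt(match):
        # Use the tag's alt value as its textual stand-in; drop the tag if none.
        alt = ALT_ATTR.search(match.group(0))
        return alt.group(1) if alt else ""

    text = fragment.replace(newline_marker, "\n")
    text = IMG_TAG.sub(img_to_alt, text)
    text = ANY_TAG.sub("", text)  # strip any remaining tags
    return unescape(text).strip()


# Made-up sample markup, not taken from a real response.
sample = (
    'Nice day<br />out '
    '<img alt=[smile] src="//example.com/e/smile.png" style=\'width:1em\'> '
    '<a href="/status/1">full text</a>'
)
plain = html_to_plain(sample, "<br />")
```

Regex-based tag stripping is fragile on arbitrary HTML, so the `lxml` approach in the helper above remains the safer choice for real pages.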

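A quick note for the record: the results match the format of what the page shows, and some parts are commented out for certain reasons. The number of posts collected differs from the total reported by the API by a dozen or so; I checked the program and found no problem. The login part will be posted later.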
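One possible way to investigate the gap in counts is to tally what each card turns into. Note that `page_count` in the loop is incremented before non-post cards are skipped, so it counts skipped cards too. The sketch below assumes, without confirmation from the source, that skipped cards or posts repeated across pages could explain the difference. The key names `post`, `group`, and `id` are hypothetical placeholders, not the real (redacted) field names.

```python
from collections import Counter
from typing import Iterable


def tally_cards(cards: Iterable[dict],
                post_key: str = "post",     # hypothetical key holding the post
                skip_key: str = "group"     # hypothetical key marking non-post cards
                ) -> Counter:
    """Count cards by outcome: skipped, missing post, duplicate, or unique post."""
    counts = Counter()
    seen_ids = set()
    for card in cards:
        counts["cards"] += 1
        if card.get(skip_key):
            counts["skipped_non_post"] += 1
            continue
        post = card.get(post_key)
        if not post:
            counts["missing_post"] += 1
            continue
        post_id = post.get("id")  # hypothetical unique identifier field
        if post_id in seen_ids:
            counts["duplicates"] += 1
            continue
        seen_ids.add(post_id)
        counts["unique_posts"] += 1
    return counts


# Made-up cards covering each outcome.
sample_cards = [
    {"post": {"id": 1}},
    {"group": [{"post": {"id": 9}}]},
    {"post": {"id": 1}},
    {"post": {"id": 2}},
    {},
]
counts = tally_cards(sample_cards)
```

Comparing `unique_posts` plus the other buckets against the total the API reports would show which bucket, if any, accounts for the difference.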
