Web Scraping Zhihu Live: Live Comments, Audience, and MongoDB Storage

This post describes a Python approach to scraping Zhihu Live data: parsing the dynamically loaded (Ajax) API address and storing the results in MongoDB. It requests the API to collect Live links and audience IDs, and sleeps for a random interval between requests to avoid getting banned.


1. Parses the Ajax dynamic-loading address

2. Stores the results as documents in MongoDB
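For reference, the crawl loop below only relies on a few fields of the homefeed API's JSON response: `paging.next` (the URL of the next page) and `paging.is_end` (the stop flag), plus the `data` array of Live entries. The sketch below uses a trimmed, hypothetical response with made-up values to show how those fields drive the pagination:

```python
import json

# A trimmed, hypothetical homefeed response; only the fields the
# crawler actually touches are shown, with made-up values.
sample = json.dumps({
    "paging": {
        "is_end": False,
        "next": "https://api.zhihu.com/lives/homefeed?includes=live&after_id=123"
    },
    "data": [
        {"live": {"id": "897097999497437184",
                  "speaker": {"member": {"name": "example-speaker"}}}}
    ]
})

decodejson = json.loads(sample)
print(decodejson['paging']['next'])    # URL to request on the next iteration
print(decodejson['paging']['is_end'])  # False -> keep looping
for each in decodejson['data']:
    print(each['live']['id'])
```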

The code is as follows.

First, collect the link address of each Live from the zhihu-live feed:

import json, time
import random
import requests
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.zhihu_live
collection = db.zhihu_live

is_end = False
link = 'https://api.zhihu.com/lives/homefeed?includes=live'

def scrapy(link):
    headers = {
        'authority': 'api.zhihu.com',
        'origin': 'https://www.zhihu.com',
        'referer': 'https://www.zhihu.com/lives/897097999497437184/related',
        'user-agent': ('Mozilla/5.0 (Windows NT 10.0; WOW64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/63.0.3239.132 Safari/537.36')
    }
    try:
        r = requests.get(link, headers=headers)
        return r.text
    except Exception as e:
        print('Error:', e)
        time.sleep(5)           # back off briefly, then retry
        return scrapy(link)     # propagate the retried result

while not is_end:
    html = scrapy(link)
    decodejson = json.loads(html)
    collection.insert_one(decodejson)
    link = decodejson['paging']['next']
    is_end = decodejson['paging']['is_end']
    # random delay between requests to avoid being banned
    time.sleep(random.randint(2, 3) + random.random())
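The sleep at the end of the loop is the anti-ban measure mentioned above: `random.randint(2, 3)` contributes an integer part of 2 or 3, and `random.random()` adds a fraction in [0, 1), so each delay falls somewhere in [2, 4) seconds rather than at a fixed interval. A small self-contained check:

```python
import random

def crawl_delay():
    # integer part is 2 or 3, fractional part is [0, 1) -> total in [2, 4)
    return random.randint(2, 3) + random.random()

delays = [crawl_delay() for _ in range(1000)]
assert all(2 <= d < 4 for d in delays)
```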

 

Then fetch the audience IDs and related information for each Live link:

import json, time
import random
import requests
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.zhihu_live
collection = db.zhihu_live

def get_audience(live_id):
    headers = {
        'authority': 'api.zhihu.com',
        'origin': 'https://www.zhihu.com',
        'referer': 'https://www.zhihu.com/lives/897097999497437184/related',
        'user-agent': ('Mozilla/5.0 (Windows NT 10.0; WOW64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/63.0.3239.132 Safari/537.36')
    }
    link = 'https://api.zhihu.com/lives/' + live_id + '/members?limit=10&offset=0'

    is_end = False
    while not is_end:
        try:
            r = requests.get(link, headers=headers)
            decodejson = json.loads(r.text)
            decodejson['live_id'] = live_id  # tag each page with its Live
            db.zhihu_live_audience.insert_one(decodejson)

            link = decodejson['paging']['next']
            is_end = decodejson['paging']['is_end']
            time.sleep(random.randint(2, 3) + random.random())
        except Exception as e:
            print('Error:', e)
            break  # stop on failure instead of looping forever

def id_get():
    first_page = collection.find_one()
    for each in first_page['data']:
        live_id = each['live']['id']
        print(each['live']['id'], '\t', each['live']['speaker']['member']['name'])
        get_audience(live_id)


if __name__ == '__main__':
    id_get()
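Once both collections are populated, the stored homefeed pages can be flattened into per-Live records. The helper below is a sketch (not part of the original scripts) that operates on plain dicts shaped like the documents inserted above, so the same function could be fed documents retrieved via `collection.find()`:

```python
def flatten_homefeed(pages):
    """Yield (live_id, speaker_name) from stored homefeed page documents."""
    for page in pages:
        for each in page.get('data', []):
            live = each['live']
            yield live['id'], live['speaker']['member']['name']

# Example with an in-memory document shaped like the stored ones:
pages = [{'data': [{'live': {'id': '897097999497437184',
                             'speaker': {'member': {'name': 'example-speaker'}}}}]}]
print(list(flatten_homefeed(pages)))
```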

A test run prints the ID and speaker name of each Live (original result screenshot omitted).
