Scraping Toutiao (今日头条) App Data with Appium + Charles

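This article shows how to scrape content from Toutiao using the Nox (夜神) Android emulator, mitmproxy, Charles, and Appium. It covers installing the CA certificates, automatically capturing article titles, author information, and share links, downloading images with multiple threads, and using Appium to simulate swipes so that new data keeps arriving.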


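Preparation

Download the Nox emulator (夜神模拟器) and install the Toutiao Lite (今日头条极速版) app.

Install mitmproxy and mitmdump, and install the mitmproxy CA certificate for Android. mitmdump is needed to hook in the Python script.

Download the Charles packet-capture tool and install the Charles CA certificate for Android. Charles is needed to capture and analyze the traffic of the app running in the Nox emulator.

Download the Appium automation tool, which is needed to simulate swipe gestures on the Android emulator.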

Packet capture analysis with Charles

script.py

import time
import json
import requests
from mitmproxy import http
from mitmproxy.tools.main import mitmdump
import os
from threading import Thread


def response(flow: http.HTTPFlow):
    # URL prefix of the article-content request, found in the Charles capture
    start_url = 'https://lf3-article.toutiaostatic.com/article/content/lite/17/1/'
    if flow.request.url.startswith(start_url):
        print(flow.request.url)

        data = json.loads(flow.response.text)['data']
        #print(data)
        extra = data['h5_extra']
        title = extra['title']
        name = extra['name']
        description = extra['media']['description']
        share_url = extra['itemCell']['shareInfo']['shareURL']
        url_list = data['main_image_list']

        print(title)
        print(name)
        print(description)
        print(share_url)

        # Windows forbids \ / : * ? " < > | in names, so replace them in the folder name
        folder = ''.join('_' if c in '\\/:*?"<>|' else c for c in title).strip()
        base_dir = 'F:\\APP_project\\jinritoutiao\\' + folder
        # makedirs creates both the article folder and its image subfolder,
        # and exist_ok=True keeps a repeated capture of the same article from failing
        os.makedirs(base_dir + '\\image', exist_ok=True)

        # Labels written to the file: title, author, author bio, share link
        with open(base_dir + '\\' + '标题信息.txt', 'a', encoding='utf-8') as file:
            file.write('标题: ' + title + '\n' + '博主: ' + name + '\n' + '博主介绍: ' + description + '\n'
                       + '分享链接: ' + share_url + '\n')

        save_img = img(url_list, folder)
        total = len(url_list)
        a = total // 4
        if total < 4:
            # Too few images to be worth splitting; download them directly
            save_img.save_image(0, total)
        else:
            # Four threads, each taking a quarter; the last also takes the remainder
            bounds = [(0, a), (a, a * 2), (a * 2, a * 3), (a * 3, total)]
            threads = [Thread(target=save_img.save_image, args=b) for b in bounds]
            for t in threads:
                t.start()
            for t in threads:
                t.join()


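class img():
    def __init__(self, url_list, title):
        self.url_list = url_list
        self.title = title

    def save_image(self, start, end):
        # Download images with indices start..end-1 and save them as image_<n>.jpg (n starts at 1)
        for num in range(start, end):
            image = requests.get(self.url_list[num]['url'])
            path = 'F:\\APP_project\\jinritoutiao\\' + self.title + '\\image' + '\\image_' + str(num+1) + '.jpg'
            # The with-block closes the file once the image has been written
            with open(path, "wb") as file:
                file.write(image.content)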


addons = [response]

if __name__ == "__main__":
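    # Run mitmdump and pass it the command-line arguments
    # (raw string so the backslashes in the Windows path are taken literally)
    mitmdump(['-s', r'F:\APP_project\jinritoutiao\script.py'])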

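Open the Toutiao app and Charles, and examine the JSON in the captured traffic. Comparing it with the title and author shown in the app reveals the URL prefix of the article request. The script uses startswith to match URLs with that prefix, then parses and saves the response.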
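The prefix check and field extraction can be tried without a running proxy. The following is a minimal sketch: the helper name extract_article and the sample payload are made up for illustration, and the payload contains only the keys that script.py reads, not the full response.

```python
import json

# URL prefix taken from script.py.
START_URL = 'https://lf3-article.toutiaostatic.com/article/content/lite/17/1/'


def extract_article(url, body):
    """Return the fields script.py reads, or None if the URL does not match.

    Hypothetical helper for illustration; not part of the original script.
    """
    if not url.startswith(START_URL):
        return None
    data = json.loads(body)['data']
    extra = data['h5_extra']
    return {
        'title': extra['title'],
        'author': extra['name'],
        'description': extra['media']['description'],
        'share_url': extra['itemCell']['shareInfo']['shareURL'],
        'image_urls': [item['url'] for item in data['main_image_list']],
    }


# Made-up payload containing only the keys used above.
sample = json.dumps({'data': {
    'h5_extra': {
        'title': 'Sample title',
        'name': 'Sample author',
        'media': {'description': 'Sample bio'},
        'itemCell': {'shareInfo': {'shareURL': 'https://example.com/share'}},
    },
    'main_image_list': [{'url': 'https://example.com/1.jpg'}],
}})

article = extract_article(START_URL + 'abc', sample)
print(article['title'], article['image_urls'])
print(extract_article('https://example.com/other', sample))
```

Requests whose URL does not start with the prefix return None and are ignored, which is what keeps the mitmproxy hook from trying to parse unrelated traffic.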

 

The captured title, author, author description, and share link are written to the file 标题信息.txt ("title information").
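Article titles can contain characters that Windows does not allow in folder names, and a folder may already exist from an earlier capture. The sketch below shows one way to handle both when writing the info file. The helper names safe_dir_name and write_info are illustrative assumptions, and a temporary directory stands in for the project folder.

```python
import os
import tempfile


def safe_dir_name(title):
    """Replace characters that Windows forbids in file names with '_'."""
    return ''.join('_' if c in '\\/:*?"<>|' else c for c in title).strip()


def write_info(base_dir, title, author, description, share_url):
    """Create <base_dir>/<safe title>/ and append the four fields to the info file."""
    folder = os.path.join(base_dir, safe_dir_name(title))
    os.makedirs(folder, exist_ok=True)  # does not fail if the folder already exists
    path = os.path.join(folder, '标题信息.txt')
    with open(path, 'a', encoding='utf-8') as f:
        f.write(f'标题: {title}\n博主: {author}\n'
                f'博主介绍: {description}\n分享链接: {share_url}\n')
    return path


# Made-up values; a temporary directory is used instead of the real project path.
with tempfile.TemporaryDirectory() as tmp:
    info_path = write_info(tmp, 'What is new?', 'Sample author',
                           'Sample bio', 'https://example.com/share')
    with open(info_path, encoding='utf-8') as f:
        content = f.read()
    folder_name = os.path.basename(os.path.dirname(info_path))

print(folder_name)
print(content)
```

The original title is still written into the file unchanged; only the folder name is sanitized.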

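Because there are many images and downloading them takes a long time, the script uses multiple threads, each downloading a share of the images, to speed things up.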
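The way the image list is divided among threads can be written as a small helper. This is a sketch under the same scheme as script.py (four slices, the last one taking the remainder). The names split_ranges and run_in_threads are assumptions for illustration, and a stand-in function replaces the real download so the example runs offline. Unlike script.py, a list with fewer than four items is handled by a single worker thread rather than inline.

```python
from threading import Lock, Thread


def split_ranges(total, workers=4):
    """Split range(total) into (start, end) slices, one per worker.

    The last slice also takes the remainder. With fewer items than
    workers, a single slice covering everything is returned.
    """
    if total < workers:
        return [(0, total)]
    size = total // workers
    ranges = [(i * size, (i + 1) * size) for i in range(workers - 1)]
    ranges.append(((workers - 1) * size, total))
    return ranges


def run_in_threads(task, total, workers=4):
    """Call task(start, end) once per slice, each in its own thread."""
    threads = [Thread(target=task, args=r) for r in split_ranges(total, workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


# Stand-in for the download: record which indices each call handled.
handled = []
lock = Lock()


def fake_download(start, end):
    with lock:
        handled.extend(range(start, end))


run_in_threads(fake_download, 10)
print(split_ranges(10))
print(sorted(handled))
```

Every index is covered exactly once, so no image is skipped or downloaded twice regardless of whether the count is divisible by four.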

The images are saved to the image folder, which sits in the same directory as the text file.

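New URLs are only requested as the Toutiao feed is scrolled, so the app has to be swiped continuously before Charles can capture new packets. The Appium automation tool can simulate this swiping.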

Simulating swipes with Appium

main.py

from appium import webdriver
import time
from selenium.webdriver.common.by import By

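def main():
    # Capabilities for the emulator and the Toutiao Lite app
    desired_caps = {
        'newCommandTimeout': 3600,
        'platformName': 'Android',
        'deviceName': 'TAS-AN00',
        'platformVersion': '7.1.2',
        'appPackage': 'com.ss.android.article.lite',
        'appActivity': '.activity.SplashActivity'
    }
    # Passing the capabilities dict directly matches older Appium-Python-Client
    # releases; newer ones may require an options object instead.
    driver = webdriver.Remote('http://127.0.0.1:4723/wd/hub', desired_caps)
    time.sleep(5)

    # Tap through the start-up screens. The app resource IDs (vj, r9) are the
    # ones found by the author and may differ in other app versions.
    driver.find_element(By.ID, value="com.ss.android.article.lite:id/vj").click()
    time.sleep(3)

    # Accept the two Android permission dialogs
    driver.find_element(By.ID, value="com.android.packageinstaller:id/permission_allow_button").click()
    time.sleep(3)

    driver.find_element(By.ID, value="com.android.packageinstaller:id/permission_allow_button").click()
    time.sleep(3)

    driver.find_element(By.ID, value="com.ss.android.article.lite:id/r9").click()
    time.sleep(3)

    # Press at screen coordinate (720, 190) for 1000 ms
    driver.tap([(720, 190)], 1000)
    time.sleep(5)

    # Screen size, used to compute the swipe coordinates
    window_size = driver.get_window_size()
    width, height = window_size.get('width'), window_size.get('height')
    #print('width:', width)
    #print('height:', height)

    # Swipe up indefinitely so the feed keeps loading; stop the script manually
    while True:
        #data_old = driver.page_source

        driver.swipe(int(width * 0.5), int(height * 0.75), int(width * 0.5), int(height * 0.05), 2000)
        time.sleep(5)

        #data_new = driver.page_source

        # Optional stop condition: break once the page no longer changes
        #if data_new == data_old:
        #    break
        #else:
        #    continue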

if __name__ == "__main__":
    main()

The script first gets the emulator's screen size, then uses the swipe function to scroll upward.
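The coordinate arithmetic can be checked on its own, without an Appium session. A minimal sketch, assuming a hypothetical 1080x1920 screen; swipe_up_coords is an illustrative helper, not an Appium API, and its default ratios are the ones used in main.py.

```python
def swipe_up_coords(width, height, start_ratio=0.75, end_ratio=0.05):
    """Return integer (start_x, start_y, end_x, end_y) for an upward swipe.

    The swipe runs along the vertical centre line of the screen, from
    start_ratio of the height up to end_ratio of the height.
    """
    x = round(width * 0.5)
    return x, round(height * start_ratio), x, round(height * end_ratio)


# Hypothetical screen size; in main.py the values come from get_window_size().
coords = swipe_up_coords(1080, 1920)
print(coords)
```

Rounding to integers keeps the values in whole pixels before they are handed to swipe.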

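First run mitmdump -s F:\APP_project\jinritoutiao\script.py on the command line (this loads the Python script), then run main.py in PyCharm. (The screenshot of the final result is not reproduced here.)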
