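The Ultimate Guide to Bypassing Google reCAPTCHA with Python Crawlers

I. Introduction: Google reCAPTCHA, the Biggest Challenge in Crawler Engineering

As web security grows in importance, Google reCAPTCHA has become one of the strongest lines of defense in a website's anti-crawling stack. According to a 2025 crawler security report, 93.7% of international websites and 78.2% of websites in China have deployed Google reCAPTCHA.

Google reCAPTCHA has evolved through several versions:

  • reCAPTCHA v1 (2007-2018): distorted-text recognition (retired)
  • reCAPTCHA v2 (2014-present): "I'm not a robot" checkbox plus image challenges
  • reCAPTCHA v3 (2018-present): invisible verification based on a user-behavior score
  • reCAPTCHA Enterprise (2020-present): enterprise offering with more sophisticated AI-based detection

Against a verification system this capable, traditional crawling techniques on their own no longer work. This article walks systematically through bypass strategies for each kind of reCAPTCHA, from basic principles to advanced techniques and from free approaches to paid services, so that you end up with a complete toolkit.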


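II. reCAPTCHA Technology in Depth

2.1 How reCAPTCHA v2 Works

reCAPTCHA v2 uses a multi-layer verification mechanism:

2.1.1 Layer 1: checkbox verification
  • The user clicks the "I'm not a robot" checkbox
  • Google collects behavioral data such as the browser fingerprint, mouse trajectory, and click timing
  • If the risk score is low, verification passes immediately
  • If the risk score is high, the user moves on to layer 2
2.1.2 Layer 2: image challenge
  • 9 or 16 image tiles are displayed
  • The user must select the tiles that contain a given object
  • Several rounds may be required (select all tiles with traffic lights → select all tiles with crosswalks)
2.1.3 Implementation
<!-- reCAPTCHA v2 embed code -->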
<div class="g-recaptcha" data-sitekey="6LcXAAAAA..."></div>
<script src="https://www.google.com/recaptcha/api.js"></script>

After a successful verification, a hidden field carrying the response token is added to the form:

<input type="hidden" name="g-recaptcha-response" value="03A...">
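Both the site key and the response field are plain attributes in the page markup, so they can be read with an ordinary HTML parser. The following is a minimal sketch using only the standard library; the class name `SiteKeyParser` and the helper `find_site_keys` are illustrative names, and it assumes the widget is embedded with the `g-recaptcha` class and `data-sitekey` attribute shown above.

```python
from html.parser import HTMLParser


class SiteKeyParser(HTMLParser):
    """Collect data-sitekey values from elements carrying the g-recaptcha class."""

    def __init__(self):
        super().__init__()
        self.site_keys = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "g-recaptcha" in classes and attr_map.get("data-sitekey"):
            self.site_keys.append(attr_map["data-sitekey"])


def find_site_keys(html):
    """Return every site key found in the given HTML text, in document order."""
    parser = SiteKeyParser()
    parser.feed(html)
    return parser.site_keys
```

Pages that render the widget from JavaScript instead of static markup will not match this pattern and need a different lookup.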

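2.2 How reCAPTCHA v3 Works

reCAPTCHA v3 verifies the user without any visible interaction:

2.2.1 Key characteristics
  • No user interaction: the user never notices the verification
  • Behavior score: returns a score from 0.0 to 1.0 (1.0 means very likely human, 0.0 means very likely a bot)
  • API call: verification runs in the background through the JavaScript API
  • Custom threshold: each site sets its own score threshold (0.5 is typical)
2.2.2 Implementation
// reCAPTCHA v3 JavaScript call
grecaptcha.execute('6LcXAAAAA...', {action: 'login'}).then(function(token) {
    // Send the token to the server for verification
    document.getElementById('recaptchaResponse').value = token;
});

Server-side verification:

# Python server-side verification example
import requests

def verify_recaptcha_v3(token, secret_key, remote_ip=None):
    url = "https://www.google.com/recaptcha/api/siteverify"
    data = {
        'secret': secret_key,
        'response': token,
    }
    # remoteip is optional; pass the client's address if the web framework exposes it
    if remote_ip:
        data['remoteip'] = remote_ip
    response = requests.post(url, data=data, timeout=10)
    result = response.json()
    # 'score' is missing when verification fails, so default it to 0.0
    return result.get('success', False) and result.get('score', 0.0) >= 0.5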
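The decision on the server involves more than the score: the siteverify response also reports which `action` the token was issued for. As a hedged sketch, the check can be isolated into a pure function that works on the already-parsed JSON; the function name `evaluate_siteverify` and its parameters are illustrative, not part of any library.

```python
def evaluate_siteverify(result, expected_action, threshold=0.5):
    """Decide whether a parsed siteverify response should be accepted.

    result is the decoded JSON dict; expected_action is the action name the
    page passed to grecaptcha.execute (for example 'login').
    """
    if not result.get("success"):
        return False
    # A token issued for a different action should not be accepted here
    if result.get("action") != expected_action:
        return False
    # Missing score is treated as the lowest possible score
    return result.get("score", 0.0) >= threshold
```

Keeping the decision separate from the HTTP call makes the threshold easy to unit-test and to tune per action.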

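2.3 reCAPTCHA Detection Mechanisms in Detail

Google reCAPTCHA detects bots along the following dimensions:

Detection dimension    Specific signals                                     Bypass difficulty
Browser fingerprint    User-Agent, WebGL, Canvas, font list                 ★★★★☆
Behavior analysis      Mouse trajectory, click patterns, time on page       ★★★★★
Network features       IP reputation, request rate, TLS fingerprint         ★★★★☆
Automation tooling     WebDriver property, automation-framework traits      ★★★★★
Environment checks     Browser plugins, screen resolution, time zone        ★★★☆☆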

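III. Overview of Bypass Strategies

3.1 Strategy Categories

Strategy                Applicable scenario              Success rate        Cost    Complexity
Environment spoofing    reCAPTCHA v2 checkbox            60-70%              -       ★★☆☆☆
Behavior simulation     reCAPTCHA v2 image challenge     80-90%              -       ★★★★☆
Third-party services    All types                        95%+                -       ★★☆☆☆
Token reuse             reCAPTCHA v3                     70-80%              -       ★★★☆☆
Proxy IP pool           Combined with other strategies   10-20% improvement  -       ★★★☆☆

3.2 Principles for Choosing a Strategy

  1. Cost first: with a limited budget, try the free approaches first
  2. Success rate first: for business-critical scenarios, choose a paid service
  3. Maintenance cost: consider the long-term maintenance complexity
  4. Legal compliance: make sure your usage complies with the site's terms of service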

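IV. Environment Spoofing in Practice

4.1 Browser Fingerprint Evasion

4.1.1 WebDriver property detection

Google checks the navigator.webdriver property:

// Detection code
if (navigator.webdriver) {
    // Classified as an automation tool
}

Workaround: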

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def create_stealth_driver():
    """Create a stealth Chrome driver."""
    chrome_options = Options()
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    
    driver = webdriver.Chrome(options=chrome_options)
    
    # Mask the webdriver property on every new document (the getter returns undefined)
    driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
        'source': '''
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            });
        '''
    })
    
    return driver
4.1.2 Canvas/WebGL fingerprint evasion
def inject_canvas_fingerprint(driver):
    """Inject a script that perturbs Canvas and WebGL fingerprints."""
    canvas_js = """
    // Override Canvas-related methods
    const toBlob = HTMLCanvasElement.prototype.toBlob;
    const toDataURL = HTMLCanvasElement.prototype.toDataURL;
    const getImageData = CanvasRenderingContext2D.prototype.getImageData;

    HTMLCanvasElement.prototype.toBlob = function() {
        setTimeout(() => toBlob.apply(this, arguments), 1000);
    };

    HTMLCanvasElement.prototype.toDataURL = function(type, quality) {
        const result = toDataURL.apply(this, arguments);
        // Perturb the encoded output by inserting a character near the end
        return result.replace(/(.{10})$/, 'X$1');
    };

    CanvasRenderingContext2D.prototype.getImageData = function(x, y, w, h) {
        const imageData = getImageData.apply(this, arguments);
        if (imageData.data.length > 0) {
            // Add slight noise to the first pixel value
            imageData.data[0] = (imageData.data[0] + Math.floor(Math.random() * 3)) % 256;
        }
        return imageData;
    };
    
    // WebGL fingerprint spoofing
    const getContext = HTMLCanvasElement.prototype.getContext;
    HTMLCanvasElement.prototype.getContext = function(type, attributes) {
        const context = getContext.call(this, type, attributes);
        // getContext returns null when WebGL is unavailable, so guard before patching
        if (context && (type === 'webgl' || type === 'experimental-webgl')) {
            const getParameter = context.getParameter;
            context.getParameter = function(parameter) {
                if (parameter === 37445) { // UNMASKED_VENDOR_WEBGL
                    return 'Intel Inc.';
                }
                if (parameter === 37446) { // UNMASKED_RENDERER_WEBGL
                    return 'Intel Iris OpenGL Engine';
                }
                return getParameter.call(this, parameter);
            };
        }
        return context;
    };
    """
    
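    # execute_script only affects the document that is currently loaded
    driver.execute_script(canvas_js)
4.1.3 Plugin and language detection evasion
def inject_plugin_fingerprint(driver):
    """Inject a script that spoofs plugin-related fingerprint values."""
    plugin_js = """
    // Mimic the plugins of a real browser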
    Object.defineProperty(navigator, 'plugins', {
        get: () => [
            {
                name: 'Chrome PDF Plugin',
                filename: 'internal-pdf-viewer',
                description: 'Portable Document Format'
            },
            {
                name: 'Chrome PDF Viewer',
                filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai',
                description: ''
            },
            {
                name: 'Native Client',
                filename: 'internal-nacl-plugin',
                description: ''
            }
        ]
    });
    
    // Set languages
    Object.defineProperty(navigator, 'languages', {
        get: () => ['en-US', 'en']
    });
    
    // Set hardware concurrency
    Object.defineProperty(navigator, 'hardwareConcurrency', {
        get: () => 8
    });
    
    // Set device memory
    Object.defineProperty(navigator, 'deviceMemory', {
        get: () => 8
    });
    """
    
    driver.execute_script(plugin_js)

4.2 Complete Environment Spoofing Example

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time
import random

class StealthBrowser:
    def __init__(self):
        self.driver = self._create_driver()
        self._inject_stealth_scripts()
    
    def _create_driver(self):
        """Create the base driver."""
        chrome_options = Options()
        chrome_options.add_argument("--disable-blink-features=AutomationControlled")
        chrome_options.add_argument("--disable-infobars")
        chrome_options.add_argument("--disable-extensions")
        chrome_options.add_argument("--disable-gpu")
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-dev-shm-usage")
        chrome_options.add_argument("--window-size=1920,1080")
        chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
        
        # Disable automation indicators
        chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
        chrome_options.add_experimental_option('useAutomationExtension', False)
        
        driver = webdriver.Chrome(options=chrome_options)
        return driver
    
    def _inject_stealth_scripts(self):
        """Register all stealth scripts to run on every new document."""
        stealth_scripts = [
            self._get_webdriver_script(),
            self._get_canvas_script(),
            self._get_plugin_script(),
            self._get_chrome_script()
        ]
        
        for script in stealth_scripts:
            self.driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
                'source': script
            })
    
    def _get_webdriver_script(self):
        return """
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        });
        """
    
    def _get_canvas_script(self):
        return """
        const toBlob = HTMLCanvasElement.prototype.toBlob;
        const toDataURL = HTMLCanvasElement.prototype.toDataURL;
        const getImageData = CanvasRenderingContext2D.prototype.getImageData;
        
        HTMLCanvasElement.prototype.toBlob = function() {
            setTimeout(() => toBlob.apply(this, arguments), 1000);
        };
        
        HTMLCanvasElement.prototype.toDataURL = function(type, quality) {
            const result = toDataURL.apply(this, arguments);
            return result.replace(/(.{10})$/, 'X$1');
        };
        
        CanvasRenderingContext2D.prototype.getImageData = function(x, y, w, h) {
            const imageData = getImageData.apply(this, arguments);
            if (imageData.data.length > 0) {
                imageData.data[0] = (imageData.data[0] + Math.floor(Math.random() * 3)) % 256;
            }
            return imageData;
        };
        """
    
    def _get_plugin_script(self):
        return """
        Object.defineProperty(navigator, 'plugins', {
            get: () => [
                {name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer', description: 'Portable Document Format'},
                {name: 'Chrome PDF Viewer', filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai', description: ''},
                {name: 'Native Client', filename: 'internal-nacl-plugin', description: ''}
            ]
        });
        
        Object.defineProperty(navigator, 'languages', {
            get: () => ['en-US', 'en']
        });
        
        Object.defineProperty(navigator, 'hardwareConcurrency', {
            get: () => 8
        });
        
        Object.defineProperty(navigator, 'deviceMemory', {
            get: () => 8
        });
        """
    
    def _get_chrome_script(self):
        return """
        window.chrome = {
            runtime: {},
            loadTimes: function() {
                return {
                    requestTime: new Date().getTime(),
                    startLoadTime: new Date().getTime(),
                    commitLoadTime: new Date().getTime(),
                    finishDocumentLoadTime: new Date().getTime(),
                    finishLoadTime: new Date().getTime(),
                    firstPaintTime: new Date().getTime(),
                    firstPaintAfterLoadTime: new Date().getTime(),
                    navigationType: "Other",
                    wasFetchedViaSpdy: false,
                    wasNpnNegotiated: false,
                    npnNegotiatedProtocol: "",
                    wasAlternateProtocolAvailable: false,
                    connectionInfo: "unknown"
                };
            }
        };
        """
    
    def get_driver(self):
        return self.driver
    
    def close(self):
        self.driver.quit()

# Usage example
if __name__ == "__main__":
    browser = StealthBrowser()
    driver = browser.get_driver()
    
    try:
        driver.get("https://www.google.com/recaptcha/api2/demo")
        time.sleep(5)
        
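        # Rough check: is the reCAPTCHA widget present in the page source?
        if "recaptcha" in driver.page_source.lower():
            print("Page loaded; further handling may be needed")
        else:
            print("May have been detected as a bot")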
            
    finally:
        browser.close()

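V. Behavior Simulation in Practice

5.1 Mouse Trajectory Simulation

5.1.1 Characteristics of human mouse movement

A real user's mouse trajectory has the following characteristics:

  • Changing acceleration: slow at the start, fast in the middle, slow at the end
  • Curved path: not a straight line, with natural bends
  • Small jitter: coordinates fluctuate slightly at random
  • Pauses: a brief pause near the target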
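The slow-fast-slow profile in the first bullet can be expressed with an easing function. The sketch below uses the standard smoothstep polynomial 3t² - 2t³ to turn evenly spaced time steps into progress along the path; the function names are illustrative, and smoothstep is one common choice of easing curve rather than a measured model of human motion.

```python
def smoothstep(t):
    """Ease-in-out curve on [0, 1]: flat at both ends, steepest in the middle."""
    return 3 * t ** 2 - 2 * t ** 3


def eased_progress(steps):
    """Return the fraction of the path covered after each of `steps` equal time slices."""
    return [smoothstep(i / steps) for i in range(steps + 1)]


progress = eased_progress(10)
# Distance covered per time slice: small near the ends, largest around the middle
increments = [b - a for a, b in zip(progress, progress[1:])]
```

Sampling a curve at these progress values, instead of at evenly spaced parameter values, gives movement that accelerates and then decelerates.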
5.1.2 Generating a Bezier-curve trajectory
import time
import random
from selenium.webdriver.common.action_chains import ActionChains

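def generate_bezier_curve(start_x, start_y, end_x, end_y, points=20):
    """
    Generate a cubic Bezier-curve trajectory.
    :param start_x: starting X coordinate
    :param start_y: starting Y coordinate
    :param end_x: ending X coordinate
    :param end_y: ending Y coordinate
    :param points: number of segments (the result holds points + 1 samples)
    :return: list of (x, y) trajectory points
    """
    # Pick control points by randomly offsetting the start and end points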
    control_x1 = start_x + random.randint(-100, 100)
    control_y1 = start_y + random.randint(-100, 100)
    control_x2 = end_x + random.randint(-100, 100)
    control_y2 = end_y + random.randint(-100, 100)
    
    trajectory = []
    for i in range(points + 1):
        t = i / points
        # Cubic Bezier formula
        x = (1 - t)**3 * start_x + 3 * (1 - t)**2 * t * control_x1 + 3 * (1 - t) * t**2 * control_x2 + t**3 * end_x
        y = (1 - t)**3 * start_y + 3 * (1 - t)**2 * t * control_y1 + 3 * (1 - t) * t**2 * control_y2 + t**3 * end_y
        
        # Add slight jitter
        x += random.uniform(-2, 2)
        y += random.uniform(-2, 2)
        
        trajectory.append((int(x), int(y)))
    
    return trajectory

def human_like_move_to_element(driver, element, duration=2.0):
    """
    Move the pointer to an element along a human-like path.
    :param driver: WebDriver instance
    :param element: target element
    :param duration: nominal total move time in seconds (not enforced; step delays are randomized)
    """
    # Selenium cannot read the pointer position, so assume it starts at the viewport origin (0, 0)
    current_x, current_y = 0, 0

    # Centre of the target element
    location = element.location
    target_x = location['x'] + element.size['width'] // 2
    target_y = location['y'] + element.size['height'] // 2

    # Build the trajectory
    trajectory = generate_bezier_curve(current_x, current_y, target_x, target_y)

    # Step along the trajectory using relative offsets
    total_points = len(trajectory)
    pos_x, pos_y = current_x, current_y
    for i, (x, y) in enumerate(trajectory):
        if i == 0:
            continue

        # Negative viewport coordinates raise an out-of-bounds error, so clamp them
        x, y = max(x, 0), max(y, 0)
        dx = x - pos_x
        dy = y - pos_y
        pos_x, pos_y = x, y

        # Vary the delay by phase of the movement
        if i < total_points * 0.3:  # first phase: shortest delays
            time_delay = random.uniform(0.01, 0.03)
        elif i < total_points * 0.7:  # middle phase
            time_delay = random.uniform(0.02, 0.05)
        else:  # final phase: longest delays while approaching the target
            time_delay = random.uniform(0.03, 0.08)

        actions = ActionChains(driver)
        actions.move_by_offset(dx, dy)
        actions.perform()
        time.sleep(time_delay)

    # Jitter and clamping can leave the pointer slightly off target, so finish on the element
    ActionChains(driver).move_to_element(element).perform()
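Because `move_by_offset` takes relative movements, the absolute trajectory points have to be converted into per-step deltas before they are sent to the browser. A minimal, browser-free sketch of that conversion follows; `trajectory_to_offsets` is an illustrative helper name, not a Selenium function.

```python
def trajectory_to_offsets(points):
    """Convert absolute (x, y) points into the relative (dx, dy) steps between them."""
    return [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(points, points[1:])]


path = [(0, 0), (3, 4), (10, 10)]
offsets = trajectory_to_offsets(path)

# The deltas always add up to the displacement from the first point to the last
total_dx = sum(dx for dx, _ in offsets)
total_dy = sum(dy for _, dy in offsets)
```

Checking that the summed deltas equal the intended displacement is a cheap way to catch drift introduced by rounding or jitter.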

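5.2 Simulating the reCAPTCHA v2 Checkbox Click

import random
import time
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def click_recaptcha_checkbox(driver, max_retries=3):
    """
    Click the reCAPTCHA checkbox the way a human would.
    :param driver: WebDriver instance
    :param max_retries: maximum number of attempts
    :return: True if verification passed, otherwise False
    """
    for attempt in range(max_retries):
        try:
            # Wait for the checkbox to become clickable
            checkbox = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.CLASS_NAME, "recaptcha-checkbox-checkmark"))
            )

            # Move to the checkbox along a human-like path
            human_like_move_to_element(driver, checkbox)

            # Random pause
            time.sleep(random.uniform(0.5, 1.5))

            # Click the checkbox
            checkbox.click()

            # Wait for the verification result
            time.sleep(2)

            # Check whether an image challenge appeared
            try:
                driver.find_element(By.CLASS_NAME, "rc-imageselect-payload")
                print("Image challenge shown; further handling required")
                return False
            except NoSuchElementException:
                # Check whether verification succeeded
                try:
                    driver.find_element(By.CLASS_NAME, "recaptcha-checkbox-checked")
                    print("reCAPTCHA verification succeeded!")
                    return True
                except NoSuchElementException:
                    print(f"Attempt {attempt + 1} failed, retrying...")
                    continue

        except Exception as e:
            print(f"Error while clicking the checkbox: {e}")
            continue

    print("All retries failed")
    return False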

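5.3 Handling the Image Challenge

5.3.1 Downloading and recognizing the images
import base64
import random
import requests
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def download_recaptcha_images(driver):
    """
    Download the tile images of a reCAPTCHA image challenge.
    :param driver: WebDriver instance
    :return: list of image bytes (None for tiles that failed to download)
    """
    images = []

    # Wait for the tiles to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "rc-image-tile-wrapper"))
    )

    # Collect all tile elements
    image_elements = driver.find_elements(By.CLASS_NAME, "rc-image-tile-44")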
    
    for element in image_elements:
        try:
            # Get the image URL
            img_url = element.find_element(By.TAG_NAME, "img").get_attribute("src")

            # Fetch the image bytes
            if img_url.startswith("data:image"):
                # Inline Base64 image
                img_data = img_url.split(",")[1]
                img_bytes = base64.b64decode(img_data)
            else:
                # Remote image
                response = requests.get(img_url, timeout=10)
                img_bytes = response.content

            images.append(img_bytes)

        except Exception as e:
            print(f"Error while downloading image: {e}")
            images.append(None)
    
    return images

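def recognize_recaptcha_images(images, instruction):
    """
    Recognize reCAPTCHA tiles (via a third-party service or a local model).
    :param images: list of image bytes
    :param instruction: challenge instruction (e.g. "select all images with traffic lights")
    :return: list of tile indices to click
    """
    # A service such as Chaojiying or 2Captcha can be integrated here,
    # or a locally trained deep-learning model

    # Placeholder: pseudo-random selection (a real application needs actual recognition)
    selected_indices = []
    for i, img in enumerate(images):
        if img is not None:
            # A real recognition service should be called here;
            # for demonstration, each tile is selected with 50% probability
            if random.random() > 0.5:
                selected_indices.append(i)

    return selected_indices
5.3.2 Simulating the tile clicks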
def click_recaptcha_images(driver, selected_indices):
    """
    Click the selected reCAPTCHA tiles.
    :param driver: WebDriver instance
    :param selected_indices: list of tile indices to click
    """
    # Collect all tile elements
    image_elements = driver.find_elements(By.CLASS_NAME, "rc-image-tile-44")
    
    for index in selected_indices:
        if index < len(image_elements):
            element = image_elements[index]
            
            # Click the way a human would
            human_like_move_to_element(driver, element)
            time.sleep(random.uniform(0.3, 0.8))
            element.click()
            
            # Random pause
            time.sleep(random.uniform(0.2, 0.5))

    # Click the verify button
    try:
        verify_button = driver.find_element(By.ID, "recaptcha-verify-button")
        human_like_move_to_element(driver, verify_button)
        time.sleep(random.uniform(0.5, 1.0))
        verify_button.click()
    except Exception:
        print("Verify button not found")

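VI. Integrating Third-Party Services

6.1 2Captcha Integration

6.1.1 About 2Captcha

2Captcha is a commercial CAPTCHA-solving service that supports reCAPTCHA v2/v3 and advertises a solve rate above 95%.

6.1.2 Python integration code
import requests
import time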

class TwoCaptchaSolver:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://2captcha.com"  # HTTPS keeps the API key off the wire in clear text
    
    def solve_recaptcha_v2(self, site_key, page_url, invisible=0):
        """
        Solve reCAPTCHA v2.
        :param site_key: the website's site key
        :param page_url: page URL
        :param invisible: whether this is an invisible reCAPTCHA (0 or 1)
        :return: reCAPTCHA response token
        """
        # 1. Submit the CAPTCHA task
        task_data = {
            "key": self.api_key,
            "method": "userrecaptcha",
            "googlekey": site_key,
            "pageurl": page_url,
            "invisible": invisible,
            "json": 1
        }
        
        response = requests.post(f"{self.base_url}/in.php", data=task_data)
        result = response.json()
        
        if result["status"] == 1:
            task_id = result["request"]
            print(f"CAPTCHA task submitted, task ID: {task_id}")

            # 2. Poll for the result
            for _ in range(30):  # up to 30 polls, 5 s apart (about 150 s in total)
                time.sleep(5)
                result = self._get_result(task_id)
                if result["status"] == 1:
                    return result["request"]
                elif result["request"] == "CAPCHA_NOT_READY":  # spelled this way by the API
                    continue
                else:
                    raise Exception(f"CAPTCHA solving failed: {result['request']}")

            raise Exception("CAPTCHA solving timed out")
        else:
            raise Exception(f"Failed to submit CAPTCHA task: {result['request']}")
    
    def solve_recaptcha_v3(self, site_key, page_url, action="verify", min_score=0.3):
        """
        Solve reCAPTCHA v3.
        :param site_key: the website's site key
        :param page_url: page URL
        :param action: action name
        :param min_score: minimum required score
        :return: reCAPTCHA response token
        """
        task_data = {
            "key": self.api_key,
            "method": "userrecaptcha",
            "version": "v3",
            "googlekey": site_key,
            "pageurl": page_url,
            "action": action,
            "min_score": min_score,
            "json": 1
        }
        
        response = requests.post(f"{self.base_url}/in.php", data=task_data)
        result = response.json()
        
        if result["status"] == 1:
            task_id = result["request"]
            print(f"reCAPTCHA v3 task submitted, task ID: {task_id}")

            # Poll for the result (up to 20 polls, 3 s apart)
            for _ in range(20):
                time.sleep(3)
                result = self._get_result(task_id)
                if result["status"] == 1:
                    return result["request"]
                elif result["request"] == "CAPCHA_NOT_READY":
                    continue
                else:
                    raise Exception(f"reCAPTCHA v3 solving failed: {result['request']}")

            raise Exception("reCAPTCHA v3 solving timed out")
        else:
            raise Exception(f"Failed to submit reCAPTCHA v3 task: {result['request']}")
    
    def _get_result(self, task_id):
        """Fetch the result of a CAPTCHA task."""
        params = {
            "key": self.api_key,
            "action": "get",
            "id": task_id,
            "json": 1
        }
        response = requests.get(f"{self.base_url}/res.php", params=params)
        return response.json()
    
    def report_bad(self, task_id):
        """Report an incorrect CAPTCHA result."""
        params = {
            "key": self.api_key,
            "action": "reportbad",
            "id": task_id,
            "json": 1
        }
        response = requests.get(f"{self.base_url}/res.php", params=params)
        return response.json()

# Usage example
if __name__ == "__main__":
    solver = TwoCaptchaSolver("your_2captcha_api_key")

    try:
        # Solve reCAPTCHA v2
        site_key = "6LcXAAAAA..."  # taken from the page source
        page_url = "https://example.com/login"

        recaptcha_token = solver.solve_recaptcha_v2(site_key, page_url)
        print(f"reCAPTCHA token: {recaptcha_token}")

        # Log in with the token
        login_data = {
            "username": "your_username",
            "password": "your_password",
            "g-recaptcha-response": recaptcha_token
        }

        response = requests.post(page_url, data=login_data)
        print(f"Login result: {response.status_code}")

    except Exception as e:
        print(f"An error occurred: {e}")

6.2 Anti-Captcha Integration

class AntiCaptchaSolver:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://api.anti-captcha.com"
    
    def solve_recaptcha_v2(self, site_key, page_url):
        """Solve reCAPTCHA v2."""
        # Create the task
        task_data = {
            "clientKey": self.api_key,
            "task": {
                "type": "NoCaptchaTaskProxyless",
                "websiteURL": page_url,
                "websiteKey": site_key
            }
        }
        
        response = requests.post(f"{self.base_url}/createTask", json=task_data)
        result = response.json()
        
        if result["errorId"] == 0:
            task_id = result["taskId"]
            print(f"Anti-Captcha task created, task ID: {task_id}")

            # Poll for the result
            for _ in range(30):
                time.sleep(5)
                solution = self._get_task_result(task_id)
                # Error responses carry no "status" key, so read it defensively
                status = solution.get("status")
                if status == "ready":
                    return solution["solution"]["gRecaptchaResponse"]
                elif status == "processing":
                    continue
                else:
                    raise Exception(f"Anti-Captcha task failed: {solution}")

            raise Exception("Anti-Captcha task timed out")
        else:
            raise Exception(f"Failed to create Anti-Captcha task: {result['errorDescription']}")
    
    def _get_task_result(self, task_id):
        """Fetch the result of an Anti-Captcha task."""
        data = {
            "clientKey": self.api_key,
            "taskId": task_id
        }
        response = requests.post(f"{self.base_url}/getTaskResult", json=data)
        return response.json()
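Both solver classes repeat the same submit-then-poll loop with a fixed iteration count. As a sketch of how that loop could be shared, the helper below polls any callable until a deadline; `poll_until_ready` and its parameters are illustrative names, and the `sleep` and `clock` arguments exist only so the loop can be exercised without real waiting.

```python
import time


def poll_until_ready(fetch, is_ready, is_pending, interval=5.0, timeout=150.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call fetch() repeatedly until is_ready(result) holds or the timeout passes.

    Raises RuntimeError if a result is neither ready nor pending, and
    TimeoutError if the deadline would be exceeded by another wait.
    """
    deadline = clock() + timeout
    while True:
        result = fetch()
        if is_ready(result):
            return result
        if not is_pending(result):
            raise RuntimeError(f"task failed: {result!r}")
        if clock() + interval > deadline:
            raise TimeoutError("task did not finish before the deadline")
        sleep(interval)
```

With this shape, the 2Captcha loop would pass `lambda r: r["status"] == 1` as `is_ready` and `lambda r: r["request"] == "CAPCHA_NOT_READY"` as `is_pending`, and the timeout is stated in seconds instead of being implied by an iteration count.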

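VII. reCAPTCHA v3 Bypass Strategies

7.1 Token Reuse

A reCAPTCHA v3 token is valid for two minutes, and Google's siteverify endpoint accepts each token only once. Caching therefore helps only for a token that has been obtained but not yet submitted for verification; a token the target site has already verified is rejected on a second use.

import time

class RecaptchaV3TokenCache:
    def __init__(self, expiration_time=120):  # expires after 2 minutes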
        self.cache = {}
        self.expiration_time = expiration_time
    
    def get_token(self, site_key, action):
        """Return the cached token, or None if absent or expired."""
        key = f"{site_key}:{action}"
        if key in self.cache:
            token, timestamp = self.cache[key]
            if time.time() - timestamp < self.expiration_time:
                return token
            else:
                del self.cache[key]
        return None
    
    def set_token(self, site_key, action, token):
        """Cache a token."""
        key = f"{site_key}:{action}"
        self.cache[key] = (token, time.time())
    
    def clear_expired(self):
        """Remove expired tokens."""
        current_time = time.time()
        expired_keys = []
        for key, (token, timestamp) in self.cache.items():
            if current_time - timestamp >= self.expiration_time:
                expired_keys.append(key)
        
        for key in expired_keys:
            del self.cache[key]

# Usage example
token_cache = RecaptchaV3TokenCache()

def get_recaptcha_v3_token(site_key, action, page_url):
    """Get a reCAPTCHA v3 token."""
    # Check the cache first
    cached_token = token_cache.get_token(site_key, action)
    if cached_token:
        return cached_token
    
    # Fetch a new token from the third-party service
    solver = TwoCaptchaSolver("your_api_key")
    new_token = solver.solve_recaptcha_v3(site_key, page_url, action)
    
    # Cache the new token
    token_cache.set_token(site_key, action, new_token)
    
    return new_token
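Since a token can be verified only once, a cache that hands the same token to several requests will see all but the first rejected. The variant below is a sketch of a store that removes a token as soon as it is handed out; `SingleUseTokenStore` is an illustrative name, and the injectable `clock` is an assumption added so the expiry logic can be tested without waiting.

```python
import time


class SingleUseTokenStore:
    """Hold pre-fetched tokens and release each one at most once, within its lifetime."""

    def __init__(self, ttl=120.0, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._tokens = {}

    def put(self, key, token):
        """Store a token under key, replacing any earlier one."""
        self._tokens[key] = (token, self._clock())

    def take(self, key):
        """Return and remove the token for key, or None if missing or expired."""
        entry = self._tokens.pop(key, None)
        if entry is None:
            return None
        token, stored_at = entry
        if self._clock() - stored_at >= self._ttl:
            return None
        return token
```

This keeps the benefit of fetching a token ahead of time while making it impossible to submit the same token twice by accident.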

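7.2 Simulating reCAPTCHA v3 Locally

For some simplistic reCAPTCHA v3 integrations, such as ones that never verify the token on the server, local simulation may be worth trying:

import hashlib
import time
import random

def generate_fake_recaptcha_v3_token(site_key, action):
    """
    Generate a fake reCAPTCHA v3 token (only relevant to some simplistic integrations).
    Note: this does not work against Google's official reCAPTCHA v3 verification.
    """
    # Build the token from a timestamp and a random string
    timestamp = str(int(time.time()))
    random_str = ''.join(random.choices('abcdefghijklmnopqrstuvwxyz0123456789', k=32))

    # Assemble the token (illustration only; real tokens are opaque strings issued by Google)
    token_data = f"{site_key}.{action}.{timestamp}.{random_str}"
    fake_token = hashlib.sha256(token_data.encode()).hexdigest()

    return fake_token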

VIII. Integrating a Proxy IP Pool

8.1 Why Proxy IPs Matter

Using proxy IPs can:

  • Avoid IP bans
  • Simulate users in different geographic locations
  • Improve the reCAPTCHA pass rate

8.2 Proxy Pool Implementation

import requests

class ProxyPool:
    def __init__(self, proxy_list=None, proxy_api=None):
        self.proxy_list = proxy_list or []
        self.proxy_api = proxy_api
        self.current_index = 0
    
    def get_proxy(self):
        """Return one proxy as a requests-style proxies dict, or None."""
        if self.proxy_api:
            # Fetch a dynamic proxy from the API
            response = requests.get(self.proxy_api, timeout=10)
            if response.status_code == 200:
                proxy = response.text.strip()
                return {"http": f"http://{proxy}", "https": f"http://{proxy}"}

        if self.proxy_list:
            # Round-robin through the static list
            proxy = self.proxy_list[self.current_index]
            self.current_index = (self.current_index + 1) % len(self.proxy_list)
            return {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        
        return None
    
    def test_proxy(self, proxy, timeout=5):
        """Check whether a proxy is usable."""
        try:
            response = requests.get("http://httpbin.org/ip", proxies=proxy, timeout=timeout)
            return response.status_code == 200
        except requests.RequestException:
            return False

# Usage example
proxy_pool = ProxyPool(
    proxy_list=[
        "1.2.3.4:8080",
        "5.6.7.8:8080",
        "9.10.11.12:8080"
    ]
)

def make_request_with_proxy(url, **kwargs):
    """Send a GET request through a proxy from the pool."""
    proxy = proxy_pool.get_proxy()
    if proxy:
        kwargs['proxies'] = proxy
    
    return requests.get(url, **kwargs)
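The pool above rotates through every proxy regardless of whether it still works. One possible refinement, sketched here with illustrative names (`HealthyProxyRotation`, `max_failures`), is to count consecutive failures per proxy and skip entries that exceed a limit; the limit of 3 is an arbitrary assumption.

```python
class HealthyProxyRotation:
    """Round-robin over proxies, skipping those with too many consecutive failures."""

    def __init__(self, proxies, max_failures=3):
        self._proxies = list(proxies)
        self._failures = {proxy: 0 for proxy in self._proxies}
        self._max_failures = max_failures
        self._index = 0

    def next_proxy(self):
        """Return the next usable proxy, or None if every proxy is over the limit."""
        for _ in range(len(self._proxies)):
            proxy = self._proxies[self._index]
            self._index = (self._index + 1) % len(self._proxies)
            if self._failures[proxy] < self._max_failures:
                return proxy
        return None

    def report_failure(self, proxy):
        """Record one more consecutive failure for proxy."""
        self._failures[proxy] += 1

    def report_success(self, proxy):
        """Reset the failure count after a successful request."""
        self._failures[proxy] = 0
```

A caller would report the outcome of each request so that dead proxies drop out of rotation and recovered ones return after a success.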

8.3 Using a Proxy with Selenium

def create_driver_with_proxy(proxy_host, proxy_port):
    """Create a WebDriver that routes traffic through a proxy."""
    chrome_options = Options()
    chrome_options.add_argument(f"--proxy-server={proxy_host}:{proxy_port}")
    
    # Other stealth settings...
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    
    driver = webdriver.Chrome(options=chrome_options)
    
    # Inject the stealth script
    driver.execute_cdp_cmd('Page.addScriptToEvaluate