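1. Introduction: Google reCAPTCHA, the Biggest Challenge in Crawler Engineering
As web security grows in importance, Google reCAPTCHA has become one of the strongest lines of defense in a site's anti-scraping stack. According to a 2025 crawler security report, 93.7% of international websites and 78.2% of websites in China have deployed Google reCAPTCHA.
Google reCAPTCHA has gone through several versions:
- reCAPTCHA v1 (2007-2018): text-recognition CAPTCHA (retired)
- reCAPTCHA v2 (2014-present): "I'm not a robot" checkbox plus image challenges
- reCAPTCHA v3 (2018-present): invisible verification based on behavioral scoring
- reCAPTCHA Enterprise (2020-present): enterprise offering with more sophisticated AI-based detection
Against a verification system this strong, traditional crawling techniques no longer work. This article walks through bypass strategies for each reCAPTCHA variant, from basic principles to advanced techniques and from free approaches to paid services, so that you end up with a complete toolkit.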
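2. reCAPTCHA Technical Deep Dive
2.1 How reCAPTCHA v2 Works
reCAPTCHA v2 uses a layered verification mechanism:
2.1.1 Layer 1: Checkbox Verification
- The user clicks the "I'm not a robot" checkbox
- Google collects behavioral data such as browser fingerprint, mouse trajectory, and click timing
- If the assessed risk is low, verification passes immediately
- If the assessed risk is high, the user moves on to the second layer
2.1.2 Layer 2: Image Challenge
- 9 or 16 image tiles are shown
- The user is asked to select the tiles that contain a specific object
- Several rounds may be required (select all tiles with traffic lights, then all tiles with crosswalks)
2.1.3 Technical Implementation
<!-- reCAPTCHA v2 embed code -->
<div class="g-recaptcha" data-sitekey="6LcXAAAAA..."></div>
<script src="https://www.google.com/recaptcha/api.js"></script>
After successful verification, a hidden field is added to the form:
<input type="hidden" name="g-recaptcha-response" value="03A...">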
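The site key in the embed snippet above is the value that later sections pass to solver services. As a minimal sketch, assuming the page uses the standard `g-recaptcha` class and `data-sitekey` attribute shown above, the key can be read from static HTML with the standard library. `SiteKeyParser`, `find_site_keys`, and `SAMPLE_HTML` are illustrative names, not part of any reCAPTCHA API.

```python
from html.parser import HTMLParser


class SiteKeyParser(HTMLParser):
    """Collects data-sitekey values from elements whose class list includes g-recaptcha."""

    def __init__(self):
        super().__init__()
        self.site_keys = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "g-recaptcha" in classes and attr_map.get("data-sitekey"):
            self.site_keys.append(attr_map["data-sitekey"])


def find_site_keys(html):
    """Return every site key found in the given HTML string."""
    parser = SiteKeyParser()
    parser.feed(html)
    return parser.site_keys


# Hypothetical sample markup; the key is a placeholder, not a real site key
SAMPLE_HTML = '<form><div class="g-recaptcha" data-sitekey="EXAMPLE_SITE_KEY"></div></form>'
print(find_site_keys(SAMPLE_HTML))
```

Pages that render the widget from JavaScript will not expose the key in static markup, so this only covers the declarative embed shown above.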
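2.2 How reCAPTCHA v3 Works
reCAPTCHA v3 verifies users without any visible interaction:
2.2.1 Key Characteristics
- No user interaction: the user never sees the verification take place
- Behavioral scoring: returns a score from 0.0 to 1.0 (1.0 means very likely a legitimate user, 0.0 means very likely a bot)
- API driven: verification runs in the background through the JavaScript API
- Custom threshold: each site chooses its own score threshold (0.5 is a common default)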
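To make the threshold idea concrete, here is a small sketch of how a site might turn a v3 score into a decision. The three-way split and the 0.3 lower bound are assumptions chosen for illustration; only the 0.0-1.0 range and the 0.5 default come from the text above, and `decide` is a hypothetical helper.

```python
def decide(score, allow_at=0.5, block_below=0.3):
    """Map a v3 score to 'allow', 'challenge', or 'block' (thresholds are illustrative)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be within [0.0, 1.0]")
    if score >= allow_at:
        return "allow"       # treated as a legitimate user
    if score < block_below:
        return "block"       # treated as automated traffic
    return "challenge"       # uncertain: ask for an extra verification step


for sample in (0.9, 0.4, 0.1):
    print(sample, decide(sample))
```

A site that only needs pass/fail can collapse this to the single comparison `score >= allow_at`.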
2.2.2 Technical Implementation
// reCAPTCHA v3 JavaScript call
grecaptcha.execute('6LcXAAAAA...', {action: 'login'}).then(function(token) {
    // Send the token to the server for verification
    document.getElementById('recaptchaResponse').value = token;
});
Server-side verification:
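# Python server-side verification example
import requests

def verify_recaptcha_v3(token, secret_key, remote_ip=None, min_score=0.5):
    url = "https://www.google.com/recaptcha/api/siteverify"
    data = {
        'secret': secret_key,
        'response': token,
    }
    if remote_ip:
        # Optional: the end user's IP address, taken from the incoming request
        data['remoteip'] = remote_ip
    response = requests.post(url, data=data, timeout=10)
    result = response.json()
    # 'score' is absent when verification fails, so default to 0.0
    return bool(result.get('success')) and result.get('score', 0.0) >= min_score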
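Besides `success` and `score`, a v3 siteverify response also carries `action`, `hostname`, and, on failure, `error-codes`. The sketch below checks those fields on an already parsed response; `evaluate_siteverify` and the sample dictionaries are illustrative, and the expected action and hostname are whatever the site itself configured.

```python
def evaluate_siteverify(result, expected_action, expected_hostname, min_score=0.5):
    """Return (ok, reason) for a parsed siteverify JSON response."""
    if not result.get("success"):
        codes = result.get("error-codes", [])
        return False, "rejected: " + ",".join(codes)
    if result.get("action") != expected_action:
        return False, "action mismatch"
    if result.get("hostname") != expected_hostname:
        return False, "hostname mismatch"
    if result.get("score", 0.0) < min_score:
        return False, "score below threshold"
    return True, "ok"


# Hypothetical responses for illustration
good = {"success": True, "score": 0.9, "action": "login", "hostname": "example.com"}
reused = {"success": False, "error-codes": ["timeout-or-duplicate"]}
print(evaluate_siteverify(good, "login", "example.com"))
print(evaluate_siteverify(reused, "login", "example.com"))
```

Checking the action and hostname keeps a token that was issued for one page or site from being accepted on another.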
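2.3 reCAPTCHA Detection Mechanisms in Detail
Google reCAPTCHA detects bots along the following dimensions:
| Detection dimension | Specific signals | Bypass difficulty |
|---|---|---|
| Browser fingerprint | User-Agent, WebGL, Canvas, font list | ★★★★☆ |
| Behavioral analysis | Mouse trajectory, click patterns, time on page | ★★★★★ |
| Network characteristics | IP reputation, request rate, TLS fingerprint | ★★★★☆ |
| Automation tooling | WebDriver property, automation framework traits | ★★★★★ |
| Environment checks | Browser plugins, screen resolution, time zone | ★★★☆☆ |
3. Overview of Bypass Strategies
3.1 Strategy Categories
| Strategy | Applicable scenario | Success rate | Cost | Complexity |
|---|---|---|---|---|
| Environment spoofing | reCAPTCHA v2 checkbox | 60-70% | Low | ★★☆☆☆ |
| Behavior simulation | reCAPTCHA v2 image challenge | 80-90% | Medium | ★★★★☆ |
| Third-party services | All types | 95%+ | High | ★★☆☆☆ |
| Token reuse | reCAPTCHA v3 | 70-80% | Low | ★★★☆☆ |
| Proxy IP pool | Combined with other strategies | +10-20% | Medium | ★★★☆☆ |
3.2 Principles for Choosing a Strategy
- Cost first: with a limited budget, try free approaches first
- Success rate first: for business-critical scenarios, choose a paid service
- Maintenance cost: factor in long-term maintenance complexity
- Legal compliance: make sure your usage complies with the site's terms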
4. Environment Spoofing in Practice
4.1 Browser Fingerprint Evasion
4.1.1 WebDriver Property Detection
Google checks the navigator.webdriver property:
// Detection code
if (navigator.webdriver) {
    // Treated as an automation tool
}
Workaround:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def create_stealth_driver():
    """Create a Chrome driver with automation markers reduced."""
    chrome_options = Options()
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    driver = webdriver.Chrome(options=chrome_options)
    # Hide the webdriver property on every new document, before page scripts run
    driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
        'source': '''
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            });
        '''
    })
    return driver
4.1.2 Canvas/WebGL Fingerprint Evasion
def inject_canvas_fingerprint(driver):
    """Inject the Canvas/WebGL fingerprint script into the current page."""
    canvas_js = """
    // Wrap the Canvas read-back methods
    const toBlob = HTMLCanvasElement.prototype.toBlob;
    const toDataURL = HTMLCanvasElement.prototype.toDataURL;
    const getImageData = CanvasRenderingContext2D.prototype.getImageData;
    HTMLCanvasElement.prototype.toBlob = function() {
        setTimeout(() => toBlob.apply(this, arguments), 1000);
    };
    HTMLCanvasElement.prototype.toDataURL = function(type, quality) {
        const result = toDataURL.apply(this, arguments);
        // Perturb the encoded output
        return result.replace(/(.{10})$/, 'X$1');
    };
    CanvasRenderingContext2D.prototype.getImageData = function(x, y, w, h) {
        const imageData = getImageData.apply(this, arguments);
        if (imageData.data.length > 0) {
            // Add a small amount of noise to the first channel value
            imageData.data[0] = (imageData.data[0] + Math.floor(Math.random() * 3)) % 256;
        }
        return imageData;
    };
    // WebGL fingerprint evasion
    const getContext = HTMLCanvasElement.prototype.getContext;
    HTMLCanvasElement.prototype.getContext = function(type, attributes) {
        const context = getContext.call(this, type, attributes);
        // getContext returns null for unsupported types, so guard before patching
        if (context && (type === 'webgl' || type === 'experimental-webgl')) {
            const getParameter = context.getParameter;
            context.getParameter = function(parameter) {
                if (parameter === 37445) { // UNMASKED_VENDOR_WEBGL
                    return 'Intel Inc.';
                }
                if (parameter === 37446) { // UNMASKED_RENDERER_WEBGL
                    return 'Intel Iris OpenGL Engine';
                }
                return getParameter.call(this, parameter);
            };
        }
        return context;
    };
    """
    # execute_script only affects the document that is already loaded;
    # it has to be called again after each navigation
    driver.execute_script(canvas_js)
4.1.3 Plugin and Language Detection Evasion
def inject_plugin_fingerprint(driver):
    """Inject the plugin/language fingerprint script into the current page."""
    plugin_js = """
    // Mimic the plugin list of a regular browser
    Object.defineProperty(navigator, 'plugins', {
        get: () => [
            {
                name: 'Chrome PDF Plugin',
                filename: 'internal-pdf-viewer',
                description: 'Portable Document Format'
            },
            {
                name: 'Chrome PDF Viewer',
                filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai',
                description: ''
            },
            {
                name: 'Native Client',
                filename: 'internal-nacl-plugin',
                description: ''
            }
        ]
    });
    // Languages
    Object.defineProperty(navigator, 'languages', {
        get: () => ['en-US', 'en']
    });
    // Hardware concurrency
    Object.defineProperty(navigator, 'hardwareConcurrency', {
        get: () => 8
    });
    // Device memory
    Object.defineProperty(navigator, 'deviceMemory', {
        get: () => 8
    });
    """
    driver.execute_script(plugin_js)
4.2 Complete Environment Spoofing Example
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

class StealthBrowser:
    def __init__(self):
        self.driver = self._create_driver()
        self._inject_stealth_scripts()

    def _create_driver(self):
        """Create the base driver."""
        chrome_options = Options()
        chrome_options.add_argument("--disable-blink-features=AutomationControlled")
        chrome_options.add_argument("--disable-infobars")
        chrome_options.add_argument("--disable-extensions")
        chrome_options.add_argument("--disable-gpu")
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-dev-shm-usage")
        chrome_options.add_argument("--window-size=1920,1080")
        chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
        # Disable automation markers
        chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
        chrome_options.add_experimental_option('useAutomationExtension', False)
        return webdriver.Chrome(options=chrome_options)

    def _inject_stealth_scripts(self):
        """Register every stealth script to run on each new document."""
        stealth_scripts = [
            self._get_webdriver_script(),
            self._get_canvas_script(),
            self._get_plugin_script(),
            self._get_chrome_script()
        ]
        for script in stealth_scripts:
            self.driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
                'source': script
            })

    def _get_webdriver_script(self):
        return """
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        });
        """

    def _get_canvas_script(self):
        return """
        const toBlob = HTMLCanvasElement.prototype.toBlob;
        const toDataURL = HTMLCanvasElement.prototype.toDataURL;
        const getImageData = CanvasRenderingContext2D.prototype.getImageData;
        HTMLCanvasElement.prototype.toBlob = function() {
            setTimeout(() => toBlob.apply(this, arguments), 1000);
        };
        HTMLCanvasElement.prototype.toDataURL = function(type, quality) {
            const result = toDataURL.apply(this, arguments);
            return result.replace(/(.{10})$/, 'X$1');
        };
        CanvasRenderingContext2D.prototype.getImageData = function(x, y, w, h) {
            const imageData = getImageData.apply(this, arguments);
            if (imageData.data.length > 0) {
                imageData.data[0] = (imageData.data[0] + Math.floor(Math.random() * 3)) % 256;
            }
            return imageData;
        };
        """

    def _get_plugin_script(self):
        return """
        Object.defineProperty(navigator, 'plugins', {
            get: () => [
                {name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer', description: 'Portable Document Format'},
                {name: 'Chrome PDF Viewer', filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai', description: ''},
                {name: 'Native Client', filename: 'internal-nacl-plugin', description: ''}
            ]
        });
        Object.defineProperty(navigator, 'languages', {
            get: () => ['en-US', 'en']
        });
        Object.defineProperty(navigator, 'hardwareConcurrency', {
            get: () => 8
        });
        Object.defineProperty(navigator, 'deviceMemory', {
            get: () => 8
        });
        """

    def _get_chrome_script(self):
        return """
        window.chrome = {
            runtime: {},
            loadTimes: function() {
                return {
                    requestTime: new Date().getTime(),
                    startLoadTime: new Date().getTime(),
                    commitLoadTime: new Date().getTime(),
                    finishDocumentLoadTime: new Date().getTime(),
                    finishLoadTime: new Date().getTime(),
                    firstPaintTime: new Date().getTime(),
                    firstPaintAfterLoadTime: new Date().getTime(),
                    navigationType: "Other",
                    wasFetchedViaSpdy: false,
                    wasNpnNegotiated: false,
                    npnNegotiatedProtocol: "",
                    wasAlternateProtocolAvailable: false,
                    connectionInfo: "unknown"
                };
            }
        };
        """

    def get_driver(self):
        return self.driver

    def close(self):
        self.driver.quit()

# Usage example
if __name__ == "__main__":
    browser = StealthBrowser()
    driver = browser.get_driver()
    try:
        driver.get("https://www.google.com/recaptcha/api2/demo")
        time.sleep(5)
        # Rough check that the page containing the widget loaded
        if "recaptcha" in driver.page_source.lower():
            print("Page loaded; further handling may be needed")
        else:
            print("The session may have been flagged as a bot")
    finally:
        browser.close()
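5. Behavior Simulation in Practice
5.1 Mouse Trajectory Simulation
5.1.1 Characteristics of Human Mouse Movement
Real users' mouse trajectories show the following characteristics:
- Changing speed: slow at the start, fast in the middle, slow at the end
- Curved paths: never a straight line, with natural bends
- Small jitter: coordinates fluctuate slightly at random
- Pauses: a brief pause near the target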
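The slow-fast-slow speed profile listed above can be expressed with an easing function. The sketch below uses smoothstep, which is one common choice and an assumption here rather than anything the article prescribes; `ease_in_out` and `eased_parameters` are illustrative names. Sampling a curve at these eased parameters, at equal time intervals, makes the early and late steps short and the middle steps long.

```python
def ease_in_out(t):
    """Smoothstep easing: progress along the path for normalized time t in [0, 1]."""
    return t * t * (3 - 2 * t)


def eased_parameters(steps):
    """Curve parameters sampled at equal time intervals."""
    return [ease_in_out(i / steps) for i in range(steps + 1)]


params = eased_parameters(10)
# Distance covered (in parameter space) during each time slice
increments = [b - a for a, b in zip(params, params[1:])]
print([round(v, 3) for v in increments])
```

The increments grow toward the middle of the movement and shrink again toward the end, which matches the first bullet of the list.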
5.1.2 Generating a Bezier Curve Trajectory
import random
import time
from selenium.webdriver.common.action_chains import ActionChains

def generate_bezier_curve(start_x, start_y, end_x, end_y, points=20):
    """
    Generate a cubic Bezier trajectory.
    :param start_x: starting X coordinate
    :param start_y: starting Y coordinate
    :param end_x: ending X coordinate
    :param end_y: ending Y coordinate
    :param points: number of segments along the curve
    :return: list of (x, y) trajectory points
    """
    # Control points: random offsets around the start and end points
    control_x1 = start_x + random.randint(-100, 100)
    control_y1 = start_y + random.randint(-100, 100)
    control_x2 = end_x + random.randint(-100, 100)
    control_y2 = end_y + random.randint(-100, 100)
    trajectory = []
    for i in range(points + 1):
        t = i / points
        # Cubic Bezier formula
        x = (1 - t)**3 * start_x + 3 * (1 - t)**2 * t * control_x1 + 3 * (1 - t) * t**2 * control_x2 + t**3 * end_x
        y = (1 - t)**3 * start_y + 3 * (1 - t)**2 * t * control_y1 + 3 * (1 - t) * t**2 * control_y2 + t**3 * end_y
        # Add small jitter
        x += random.uniform(-2, 2)
        y += random.uniform(-2, 2)
        trajectory.append((int(x), int(y)))
    return trajectory

def human_like_move_to_element(driver, element, duration=2.0):
    """
    Move the pointer to an element along a human-like path.
    :param driver: WebDriver instance
    :param element: target element
    :param duration: approximate total movement time in seconds
    """
    # Assumes the pointer starts at the viewport origin (0, 0) and the page is not scrolled
    current_x, current_y = 0, 0
    # Center of the target element
    rect = element.rect
    target_x = int(rect['x'] + rect['width'] / 2)
    target_y = int(rect['y'] + rect['height'] / 2)
    trajectory = generate_bezier_curve(current_x, current_y, target_x, target_y)
    # Keep every point inside the viewport so move_by_offset does not go out of bounds
    trajectory = [(max(x, 0), max(y, 0)) for x, y in trajectory]
    steps = len(trajectory) - 1
    # Step through the trajectory with relative moves
    for i in range(1, steps + 1):
        prev_x, prev_y = trajectory[i - 1]
        x, y = trajectory[i]
        # Longer delays at the start and end, shorter in the middle (slow, fast, slow)
        if i < steps * 0.3 or i > steps * 0.7:
            weight = random.uniform(1.5, 2.5)
        else:
            weight = random.uniform(0.5, 1.0)
        ActionChains(driver).move_by_offset(x - prev_x, y - prev_y).perform()
        # The average weight is about 1.5, so the delays sum to roughly `duration`
        time.sleep(duration * weight / (steps * 1.5))
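5.2 Simulating the reCAPTCHA v2 Checkbox Click
import random
import time
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def click_recaptcha_checkbox(driver, max_retries=3):
    """
    Click the reCAPTCHA checkbox the way a person would.
    The widget renders inside iframes, so switch into the relevant frame
    with driver.switch_to.frame(...) before looking up its elements.
    :param driver: WebDriver instance
    :param max_retries: maximum number of attempts
    :return: True if verification passed without an image challenge
    """
    for attempt in range(max_retries):
        try:
            # Wait for the checkbox to appear
            checkbox = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.CLASS_NAME, "recaptcha-checkbox-checkmark"))
            )
            # Move to the checkbox along a human-like path
            human_like_move_to_element(driver, checkbox)
            # Random pause
            time.sleep(random.uniform(0.5, 1.5))
            checkbox.click()
            # Wait for the verification result
            time.sleep(2)
            # An image challenge means the click alone was not enough
            if driver.find_elements(By.CLASS_NAME, "rc-imageselect-payload"):
                print("Image challenge shown; further handling required")
                return False
            if driver.find_elements(By.CLASS_NAME, "recaptcha-checkbox-checked"):
                print("reCAPTCHA verification succeeded")
                return True
            print(f"Attempt {attempt + 1} failed, retrying...")
        except WebDriverException as e:
            print(f"Error while clicking the checkbox: {e}")
    print("All retries failed")
    return False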
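5.3 Handling the Image Challenge
5.3.1 Downloading and Recognizing Images
import base64
import random
import requests
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def download_recaptcha_images(driver):
    """
    Download the tiles of a reCAPTCHA image challenge.
    :param driver: WebDriver instance
    :return: list of image byte strings (None for tiles that failed)
    """
    images = []
    # Wait for the tiles to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "rc-image-tile-wrapper"))
    )
    # Collect all tile elements
    image_elements = driver.find_elements(By.CLASS_NAME, "rc-image-tile-44")
    for element in image_elements:
        try:
            img_url = element.find_element(By.TAG_NAME, "img").get_attribute("src")
            if img_url.startswith("data:image"):
                # Inline Base64 image
                img_data = img_url.split(",")[1]
                img_bytes = base64.b64decode(img_data)
            else:
                # Remote image
                response = requests.get(img_url, timeout=10)
                img_bytes = response.content
            images.append(img_bytes)
        except Exception as e:
            print(f"Error while downloading an image: {e}")
            images.append(None)
    return images

def recognize_recaptcha_images(images, instruction):
    """
    Recognize reCAPTCHA tiles (via a third-party service or a local model).
    :param images: list of image byte strings
    :param instruction: challenge instruction (e.g. "select all images with traffic lights")
    :return: list of tile indices to click
    """
    # A third-party service such as Chaojiying or 2Captcha,
    # or a locally trained deep learning model, would be integrated here.
    # Placeholder only: the selection below is random and performs no recognition.
    selected_indices = []
    for i, img in enumerate(images):
        if img is not None:
            # A real recognition call belongs here;
            # for demonstration, each tile is picked with 50% probability
            if random.random() > 0.5:
                selected_indices.append(i)
    return selected_indices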
5.3.2 Simulating Tile Clicks
def click_recaptcha_images(driver, selected_indices):
    """
    Click the selected reCAPTCHA tiles.
    :param driver: WebDriver instance
    :param selected_indices: list of tile indices to click
    """
    # Collect all tile elements
    image_elements = driver.find_elements(By.CLASS_NAME, "rc-image-tile-44")
    for index in selected_indices:
        if index < len(image_elements):
            element = image_elements[index]
            # Human-like move and click
            human_like_move_to_element(driver, element)
            time.sleep(random.uniform(0.3, 0.8))
            element.click()
            # Random pause
            time.sleep(random.uniform(0.2, 0.5))
    # Click the verify button
    try:
        verify_button = driver.find_element(By.ID, "recaptcha-verify-button")
        human_like_move_to_element(driver, verify_button)
        time.sleep(random.uniform(0.5, 1.0))
        verify_button.click()
    except NoSuchElementException:  # from selenium.common.exceptions
        print("Verify button not found")
6. Third-Party Service Integration
6.1 Integrating 2Captcha
6.1.1 About 2Captcha
2Captcha is a commercial CAPTCHA-solving service that supports reCAPTCHA v2/v3, with a reported solve rate above 95%.
6.1.2 Python Integration Code
import requests
import time

class TwoCaptchaSolver:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://2captcha.com"

    def solve_recaptcha_v2(self, site_key, page_url, invisible=0):
        """
        Solve reCAPTCHA v2.
        :param site_key: the site's site key
        :param page_url: page URL
        :param invisible: whether the widget is invisible reCAPTCHA (0 or 1)
        :return: reCAPTCHA response token
        """
        task_data = {
            "key": self.api_key,
            "method": "userrecaptcha",
            "googlekey": site_key,
            "pageurl": page_url,
            "invisible": invisible,
            "json": 1
        }
        # 30 polls, 5 seconds apart: waits up to 150 seconds
        return self._submit_and_poll(task_data, "reCAPTCHA v2", attempts=30, interval=5)

    def solve_recaptcha_v3(self, site_key, page_url, action="verify", min_score=0.3):
        """
        Solve reCAPTCHA v3.
        :param site_key: the site's site key
        :param page_url: page URL
        :param action: action name
        :param min_score: minimum score requested
        :return: reCAPTCHA response token
        """
        task_data = {
            "key": self.api_key,
            "method": "userrecaptcha",
            "version": "v3",
            "googlekey": site_key,
            "pageurl": page_url,
            "action": action,
            "min_score": min_score,
            "json": 1
        }
        # 20 polls, 3 seconds apart: waits up to 60 seconds
        return self._submit_and_poll(task_data, "reCAPTCHA v3", attempts=20, interval=3)

    def _submit_and_poll(self, task_data, label, attempts, interval):
        """Submit a task, then poll until the token is ready."""
        # 1. Submit the task
        response = requests.post(f"{self.base_url}/in.php", data=task_data, timeout=30)
        result = response.json()
        if result["status"] != 1:
            raise Exception(f"Failed to submit {label} task: {result['request']}")
        task_id = result["request"]
        print(f"{label} task submitted, task ID: {task_id}")
        # 2. Poll for the result
        for _ in range(attempts):
            time.sleep(interval)
            result = self._get_result(task_id)
            if result["status"] == 1:
                return result["request"]
            # "CAPCHA_NOT_READY" is the service's own spelling
            if result["request"] != "CAPCHA_NOT_READY":
                raise Exception(f"{label} solving failed: {result['request']}")
        raise Exception(f"{label} solving timed out")

    def _get_result(self, task_id):
        """Fetch the result of a task."""
        params = {
            "key": self.api_key,
            "action": "get",
            "id": task_id,
            "json": 1
        }
        response = requests.get(f"{self.base_url}/res.php", params=params, timeout=30)
        return response.json()

    def report_bad(self, task_id):
        """Report an incorrect result."""
        params = {
            "key": self.api_key,
            "action": "reportbad",
            "id": task_id,
            "json": 1
        }
        response = requests.get(f"{self.base_url}/res.php", params=params, timeout=30)
        return response.json()

# Usage example
if __name__ == "__main__":
    solver = TwoCaptchaSolver("your_2captcha_api_key")
    try:
        # Solve reCAPTCHA v2
        site_key = "6LcXAAAAA..."  # taken from the page source
        page_url = "https://example.com/login"
        recaptcha_token = solver.solve_recaptcha_v2(site_key, page_url)
        print(f"reCAPTCHA token: {recaptcha_token}")
        # Log in with the token
        login_data = {
            "username": "your_username",
            "password": "your_password",
            "g-recaptcha-response": recaptcha_token
        }
        response = requests.post(page_url, data=login_data)
        print(f"Login result: {response.status_code}")
    except Exception as e:
        print(f"An error occurred: {e}")
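The submit-then-poll pattern used above recurs with every solver service, so it can be isolated into a small helper with an explicit deadline. This is a sketch under stated assumptions: `poll_until_ready` is a hypothetical helper, and `fetch` is any caller-supplied function that returns `None` while the result is pending (for 2Captcha, while the reply is still `CAPCHA_NOT_READY`). The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.

```python
import time


def poll_until_ready(fetch, interval=5.0, timeout=150.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until it returns a non-None value or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        value = fetch()
        if value is not None:
            return value
        # Stop if waiting one more interval would pass the deadline
        if clock() + interval > deadline:
            raise TimeoutError(f"no result within {timeout} seconds")
        sleep(interval)


# Simulated service: not ready twice, then a placeholder token
replies = iter([None, None, "EXAMPLE_TOKEN"])
token = poll_until_ready(lambda: next(replies), interval=0, timeout=1)
print(token)
```

Expressing the limit as a deadline rather than an attempt count keeps the total wait predictable when the polling interval changes.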
6.2 Integrating Anti-Captcha
import requests
import time

class AntiCaptchaSolver:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://api.anti-captcha.com"

    def solve_recaptcha_v2(self, site_key, page_url):
        """Solve reCAPTCHA v2."""
        # Create the task
        task_data = {
            "clientKey": self.api_key,
            "task": {
                "type": "NoCaptchaTaskProxyless",
                "websiteURL": page_url,
                "websiteKey": site_key
            }
        }
        response = requests.post(f"{self.base_url}/createTask", json=task_data, timeout=30)
        result = response.json()
        if result["errorId"] != 0:
            raise Exception(f"Failed to create Anti-Captcha task: {result['errorDescription']}")
        task_id = result["taskId"]
        print(f"Anti-Captcha task created, task ID: {task_id}")
        # Poll for the result: 30 polls, 5 seconds apart (up to 150 seconds)
        for _ in range(30):
            time.sleep(5)
            solution = self._get_task_result(task_id)
            # Error replies carry no "status" field, so read it defensively
            status = solution.get("status")
            if status == "ready":
                return solution["solution"]["gRecaptchaResponse"]
            if status != "processing":
                raise Exception(f"Anti-Captcha task failed: {solution}")
        raise Exception("Anti-Captcha task timed out")

    def _get_task_result(self, task_id):
        """Fetch the result of an Anti-Captcha task."""
        data = {
            "clientKey": self.api_key,
            "taskId": task_id
        }
        response = requests.post(f"{self.base_url}/getTaskResult", json=data, timeout=30)
        return response.json()
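7. reCAPTCHA v3 Bypass Strategies
7.1 Token Reuse
A reCAPTCHA v3 token is valid for two minutes and can be verified only once; a second siteverify call for the same token fails with the timeout-or-duplicate error. A token can therefore be held for later use only while it has not yet been submitted. The cache below keeps a freshly obtained token for up to two minutes so that a new one is not requested before the first is used.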
import time

class RecaptchaV3TokenCache:
    def __init__(self, expiration_time=120):  # expires after 2 minutes
        self.cache = {}
        self.expiration_time = expiration_time

    def get_token(self, site_key, action):
        """Return the cached token, or None if missing or expired."""
        # Note: a token that has already been verified is rejected if submitted again
        key = f"{site_key}:{action}"
        if key in self.cache:
            token, timestamp = self.cache[key]
            if time.time() - timestamp < self.expiration_time:
                return token
            del self.cache[key]
        return None

    def set_token(self, site_key, action, token):
        """Cache a token."""
        key = f"{site_key}:{action}"
        self.cache[key] = (token, time.time())

    def clear_expired(self):
        """Remove expired tokens."""
        current_time = time.time()
        expired_keys = [
            key for key, (token, timestamp) in self.cache.items()
            if current_time - timestamp >= self.expiration_time
        ]
        for key in expired_keys:
            del self.cache[key]

# Usage example
token_cache = RecaptchaV3TokenCache()

def get_recaptcha_v3_token(site_key, action, page_url):
    """Get a reCAPTCHA v3 token."""
    # Check the cache first
    cached_token = token_cache.get_token(site_key, action)
    if cached_token:
        return cached_token
    # Get a new token from the third-party service
    solver = TwoCaptchaSolver("your_api_key")
    new_token = solver.solve_recaptcha_v3(site_key, page_url, action)
    # Cache the new token
    token_cache.set_token(site_key, action, new_token)
    return new_token
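Because each token is accepted by siteverify only once, a store that hands a token out a single time models the constraint more faithfully than a read-many cache. The following is a minimal sketch of that idea; `SingleUseTokenStore` is a hypothetical class, and the injectable `clock` is an assumption added so expiry can be checked without waiting.

```python
import time


class SingleUseTokenStore:
    """Holds at most one token per (site_key, action) and releases it once."""

    def __init__(self, ttl=120.0, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._tokens = {}

    def put(self, site_key, action, token):
        """Store a freshly obtained token together with its arrival time."""
        self._tokens[(site_key, action)] = (token, self._clock())

    def take(self, site_key, action):
        """Remove and return a fresh token, or None if missing or expired."""
        entry = self._tokens.pop((site_key, action), None)
        if entry is None:
            return None
        token, stored_at = entry
        if self._clock() - stored_at >= self._ttl:
            return None
        return token


store = SingleUseTokenStore()
store.put("EXAMPLE_SITE_KEY", "login", "EXAMPLE_TOKEN")
first = store.take("EXAMPLE_SITE_KEY", "login")
second = store.take("EXAMPLE_SITE_KEY", "login")
print(first, second)
```

The second `take` returns `None`, signalling that a new token has to be obtained before the next submission.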
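7.2 Simulating reCAPTCHA v3 Locally
For sites that never validate the token on the server, a locally generated placeholder can be tried:
import hashlib
import time
import random

def generate_fake_recaptcha_v3_token(site_key, action):
    """
    Generate a fake reCAPTCHA v3 token (only works against trivial implementations).
    Note: this does not work against Google's official reCAPTCHA v3,
    because the server-side siteverify call rejects it.
    """
    # Build a token from a timestamp and a random string
    timestamp = str(int(time.time()))
    random_str = ''.join(random.choices('abcdefghijklmnopqrstuvwxyz0123456789', k=32))
    # Illustration only: real tokens are opaque strings issued by Google
    token_data = f"{site_key}.{action}.{timestamp}.{random_str}"
    fake_token = hashlib.sha256(token_data.encode()).hexdigest()
    return fake_token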
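8. Proxy IP Pool Integration
8.1 Why Proxy IPs Matter
Using proxy IPs can:
- Avoid IP bans
- Simulate users in different geographic locations
- Improve the reCAPTCHA pass rate
8.2 Implementing a Proxy IP Pool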
import requests

class ProxyPool:
    def __init__(self, proxy_list=None, proxy_api=None):
        self.proxy_list = proxy_list or []
        self.proxy_api = proxy_api
        self.current_index = 0

    def get_proxy(self):
        """Return one proxy as a requests-style proxies dict."""
        if self.proxy_api:
            # Fetch a dynamic proxy from the API
            response = requests.get(self.proxy_api, timeout=10)
            if response.status_code == 200:
                proxy = response.text.strip()
                return {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        if self.proxy_list:
            # Round-robin over the static list
            proxy = self.proxy_list[self.current_index]
            self.current_index = (self.current_index + 1) % len(self.proxy_list)
            return {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        return None

    def test_proxy(self, proxy, timeout=5):
        """Check whether a proxy is usable."""
        try:
            response = requests.get("http://httpbin.org/ip", proxies=proxy, timeout=timeout)
            return response.status_code == 200
        except requests.RequestException:
            return False

# Usage example
proxy_pool = ProxyPool(
    proxy_list=[
        "1.2.3.4:8080",
        "5.6.7.8:8080",
        "9.10.11.12:8080"
    ]
)

def make_request_with_proxy(url, **kwargs):
    """Send a request through a proxy."""
    proxy = proxy_pool.get_proxy()
    if proxy:
        kwargs['proxies'] = proxy
    return requests.get(url, **kwargs)
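The round-robin pool above keeps handing out proxies even after they stop working. One possible refinement, sketched here as an assumption rather than as part of the original design, is to count consecutive failures and skip a proxy once it crosses a limit. `HealthAwareRotation` and its method names are hypothetical.

```python
class HealthAwareRotation:
    """Round-robin rotation that skips proxies with too many consecutive failures."""

    def __init__(self, proxies, max_failures=3):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = list(proxies)
        self._failures = {p: 0 for p in self._proxies}
        self._max_failures = max_failures
        self._index = 0

    def next_proxy(self):
        """Return the next healthy proxy, or None if every proxy is over the limit."""
        for _ in range(len(self._proxies)):
            proxy = self._proxies[self._index]
            self._index = (self._index + 1) % len(self._proxies)
            if self._failures[proxy] < self._max_failures:
                return proxy
        return None

    def report(self, proxy, ok):
        """Record the outcome of a request; a success resets the failure count."""
        self._failures[proxy] = 0 if ok else self._failures[proxy] + 1


rotation = HealthAwareRotation(["1.2.3.4:8080", "5.6.7.8:8080"], max_failures=2)
chosen = rotation.next_proxy()
rotation.report(chosen, ok=False)
print(chosen, rotation.next_proxy())
```

Callers would pass the outcome of each request to `report`, for instance using the result of `test_proxy` from the pool above.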
8.3 Using a Proxy with Selenium
def create_driver_with_proxy(proxy_host, proxy_port):
    """Create a WebDriver that routes traffic through a proxy."""
    chrome_options = Options()
    chrome_options.add_argument(f"--proxy-server={proxy_host}:{proxy_port}")
    # Other stealth settings...
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    driver = webdriver.Chrome(options=chrome_options)
    # Inject the stealth scripts
    driver.execute_cdp_cmd('Page.addScriptToEvaluate