Multithreaded Crawling with requests-html in Practice: Squeezing Out System Performance with 5 Lines of Code

Project repository: requests-html, Pythonic HTML Parsing for Humans™. https://gitcode.com/gh_mirrors/re/requests-html

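Is your single-threaded crawler too slow? When it has to work through a large number of pages, a sequential crawler spends most of its time waiting on the network, leaving CPU and bandwidth idle. This article shows how to build a multithreaded crawler with requests-html that puts those idle resources to work. For I/O-bound crawling the speedup can be large, although the actual gain depends on network latency and on how much load the target sites tolerate. After reading, you will know:

The basic implementation of multithreaded crawling
Techniques for controlling thread pool resources
A practical case: fetching page titles in bulk
Performance tuning and strategies for dealing with anti-crawling measures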

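Why requests-html

requests-html is a Pythonic HTML parsing library designed for humans. As its project description "Pythonic HTML Parsing for Humans™" suggests, it makes web page parsing simple and intuitive. Compared with the traditional BeautifulSoup + requests combination, it ships with the following built in:

Full JavaScript support: integrates Chromium through Pyppeteer to handle dynamically rendered content
Flexible selectors: supports both CSS selectors and XPath to cover different scenarios
Session management: a built-in Session object handles cookies and connection pooling automatically
Automatic encoding detection: no need to specify the page encoding by hand

The core code of the project lives in requests_html.py, and the official usage documentation is in README.rst.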

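Multithreaded Crawling Basics

Single-threaded vs. multithreaded

Mode	Advantages	Disadvantages	Suitable for
Single-threaded	Simple and intuitive, low resource usage	Slow, poor resource utilization	Crawling a small number of pages
Multithreaded	Requests overlap, higher throughput	More complex, concurrency must be limited	Crawling a large number of pages

The core of multithreaded crawling is Python's concurrent.futures module, which manages the crawl tasks through a thread pool. requests-html does not provide multithreading support itself, but it works well with ThreadPoolExecutor for concurrent crawling.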
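
To make the comparison concrete, here is a minimal sketch that times both modes against a simulated page fetch. `fake_fetch` is a hypothetical stand-in that sleeps instead of making a network request, so the numbers only illustrate the I/O-bound case and are not a benchmark of requests-html.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url):
    """Hypothetical stand-in for a network request: waits, then returns a title."""
    time.sleep(0.1)  # simulated network latency
    return f"title of {url}"

urls = [f"https://example.com/page/{i}" for i in range(5)]

# Single-threaded: each fetch waits for the previous one to finish.
start = time.perf_counter()
sequential = [fake_fetch(u) for u in urls]
sequential_time = time.perf_counter() - start

# Multithreaded: the waits overlap, so the total is close to one fetch.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as executor:
    threaded = list(executor.map(fake_fetch, urls))
threaded_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s")
```

The gain comes from overlapping waits, not from extra CPU work, which is why threads help with crawling even though Python threads do not run Python bytecode in parallel.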

Quick Start: Installation and Basic Setup

First, install requests-html with pipenv:

$ pipenv install requests-html
✨🍰✨

The dependencies are declared in the project's Pipfile. Make sure your Python version is 3.6 or later.

Basic usage example:

from requests_html import HTMLSession

# Create a session object
session = HTMLSession()

# Send the request
response = session.get('https://example.com')

# Extract the title
title = response.html.find('title', first=True).text
print(title)  # Output: Example Domain

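Practical Case: Fetching Page Titles in Bulk with Multiple Threads

The core in 5 lines of code

Below is a complete example that fetches page titles with multiple threads, using ThreadPoolExecutor to control concurrency: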

from requests_html import HTMLSession
from concurrent.futures import ThreadPoolExecutor

# Create one session object shared by all worker threads.
# Note: requests does not guarantee that a Session is thread-safe;
# use one session per thread if you run into connection problems.
session = HTMLSession()

# URLs to crawl
urls = [
    'https://example.com',
    'https://github.com',
    'https://python.org',
    # Add more URLs here...
]

# Crawl function run by each worker
def fetch_title(url):
    try:
        response = session.get(url, timeout=10)
        # Use a CSS selector to get the title element
        title_element = response.html.find('title', first=True)
        return title_element.text if title_element else 'No title'
    except Exception as e:
        return f"Error: {e}"

# Run the crawl tasks on a thread pool
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fetch_title, urls))

# Print the results (executor.map preserves input order)
for url, title in zip(urls, results):
    print(f"{url}: {title}")
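
`executor.map` hands results back in input order, so one slow page delays everything queued behind it. A possible alternative, sketched here under the assumption that output order does not matter, is `submit` plus `as_completed`, which yields each result as soon as it is ready. `fake_fetch_title` and the `.example` URLs are hypothetical stand-ins so the sketch runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical per-URL latencies, in seconds, for the simulated fetch.
DELAYS = {"https://slow.example": 0.2, "https://fast.example": 0.01}

def fake_fetch_title(url):
    """Stand-in for fetch_title: sleeps for the configured latency."""
    time.sleep(DELAYS[url])
    return f"title of {url}"

completed = []
with ThreadPoolExecutor(max_workers=2) as executor:
    # Map each future back to its URL so results can be labelled.
    future_to_url = {executor.submit(fake_fetch_title, u): u for u in DELAYS}
    for future in as_completed(future_to_url):
        url = future_to_url[future]
        completed.append((url, future.result()))

for url, title in completed:
    print(f"{url}: {title}")
```

Here the fast page is reported first even though it was submitted second.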

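Code walkthrough

Session management: a single global HTMLSession is created so that connections are reused instead of being set up again for every request. requests does not guarantee that a Session is thread-safe, so a session per thread is the safer choice under heavy concurrency.
Thread pool configuration: max_workers controls the concurrency. Twice the CPU core count is a common starting point, but crawling is I/O-bound, so the best value depends more on network latency and on what the target site tolerates.
Exception handling: exceptions raised while requesting or parsing are caught inside the worker, so one failing URL does not abort the whole batch.
Selector usage: response.html.find() with a CSS selector extracts the title.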
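
The example above shares one HTMLSession across all threads. If that turns out to be a problem, one common pattern is to keep a session per thread with `threading.local`. The sketch below shows the pattern only: `get_session` is a hypothetical helper, and `FakeSession` is a stand-in so that the code runs without network access. In a real crawler you would pass `HTMLSession` as the factory.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()

def get_session(factory):
    """Return this thread's session, creating it on first use."""
    if not hasattr(_local, "session"):
        _local.session = factory()
    return _local.session

class FakeSession:
    """Stand-in for HTMLSession (assumption: only .get() is needed here)."""
    def get(self, url):
        return f"fetched {url}"

def fetch(url):
    session = get_session(FakeSession)
    # Report which thread and which session object handled the URL.
    return threading.get_ident(), id(session), session.get(url)

urls = [f"https://example.com/{i}" for i in range(9)]
with ThreadPoolExecutor(max_workers=3) as executor:
    rows = list(executor.map(fetch, urls))

# Group the session ids seen by each worker thread.
sessions_per_thread = {}
for thread_id, session_id, _ in rows:
    sessions_per_thread.setdefault(thread_id, set()).add(session_id)
print({t: len(s) for t, s in sessions_per_thread.items()})
```

Each worker thread ends up with exactly one session, so connections are still reused within a thread without being shared between threads.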

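Thread Pool Resource Control

Adjusting concurrency dynamically

Raising the concurrency does not always make crawling faster. Adjust it according to what the target site can handle and what resources are available locally: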

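import os

# Estimate a starting concurrency from the CPU core count
# (os.cpu_count() can return None, so fall back to 1)
cpu_count = os.cpu_count() or 1
# A common starting point is 2-4x the core count
optimal_workers = cpu_count * 3

# Use a different concurrency for different sites
def get_worker_count(domain):
    # Lower the concurrency for sites with strict anti-crawling measures
    strict_domains = ['github.com', 'google.com']
    if any(d in domain for d in strict_domains):
        return cpu_count
    return optimal_workers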
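
The substring test in `get_worker_count` also matches unrelated hosts such as `notgithub.com`. A slightly stricter variant, sketched below, parses the URL and compares the hostname instead. The names `is_strict` and `get_worker_count_for_url` and the default worker counts are illustrative assumptions, not part of requests-html.

```python
from urllib.parse import urlsplit

STRICT_DOMAINS = ("github.com", "google.com")

def is_strict(url):
    """True if the URL's host is a strict domain or one of its subdomains."""
    host = urlsplit(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in STRICT_DOMAINS)

def get_worker_count_for_url(url, default_workers=8, strict_workers=2):
    # Illustrative numbers: tune both values for your own targets.
    return strict_workers if is_strict(url) else default_workers

for url in ("https://api.github.com/users",
            "https://notgithub.com/",
            "https://python.org/"):
    print(url, get_worker_count_for_url(url))
```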

Task queue and priorities

When crawling a large number of URLs, a queue can be used to manage task priority:

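from concurrent.futures import ThreadPoolExecutor
from queue import PriorityQueue

# Create a priority queue (lower numbers are dequeued first)
task_queue = PriorityQueue()

# Tasks as (priority, URL)
urls_with_priority = [
    (1, 'https://python.org'),    # high priority
    (2, 'https://example.com'),   # medium priority
    (3, 'https://github.com'),    # low priority
]

for priority, url in urls_with_priority:
    task_queue.put((priority, url))

# Take tasks off the queue in priority order and submit them
# (fetch_title and optimal_workers come from the earlier examples)
futures = []
with ThreadPoolExecutor(max_workers=optimal_workers) as executor:
    while not task_queue.empty():
        priority, url = task_queue.get()
        futures.append(executor.submit(fetch_title, url))

    # Collect the results in submission order
    for future in futures:
        print(future.result())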
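
In the example above the queue is drained before any work starts, so the priority only decides the submission order. If priorities should keep applying while the crawl runs, one option is to let the workers pull from the queue themselves. The following is a minimal sketch of that idea. `run_prioritized` and `fake_handler` are hypothetical names, and the handler is a stand-in for a real fetch.

```python
import threading
from queue import PriorityQueue, Empty

def run_prioritized(tasks, handler, num_workers=2):
    """Process (priority, url) tasks; workers always take the lowest number next."""
    task_queue = PriorityQueue()
    for priority, url in tasks:
        task_queue.put((priority, url))

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                priority, url = task_queue.get_nowait()
            except Empty:
                return  # nothing left to do
            value = handler(url)
            with lock:  # protect the shared results list
                results.append((priority, url, value))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

def fake_handler(url):
    """Stand-in for fetch_title so the sketch runs offline."""
    return url.upper()

tasks = [(3, "https://github.com"), (1, "https://python.org"), (2, "https://example.com")]
ordered = run_prioritized(tasks, fake_handler, num_workers=1)
print(ordered)
```

With a single worker the results come back strictly in priority order. With several workers the start order follows priority, but completion order can differ.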

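Advanced Technique: Combining with Async Support

requests-html provides AsyncHTMLSession for asynchronous requests, and it can be combined with a thread pool. Note that AsyncHTMLSession already runs its blocking requests in an internal thread pool, so layering it on top of ThreadPoolExecutor does not necessarily improve throughput. Measure before adopting this pattern: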

from requests_html import AsyncHTMLSession
from concurrent.futures import ThreadPoolExecutor
import asyncio

async def async_fetch_title(url):
    session = AsyncHTMLSession()
    try:
        response = await session.get(url, timeout=10)
        title_element = response.html.find('title', first=True)
        return title_element.text if title_element else 'No title'
    except Exception as e:
        return f"Error: {e}"
    finally:
        # Close the session on both the success and the error path
        await session.close()

# Run one async task, with its own event loop, inside a worker thread
def run_async_task(url):
    return asyncio.run(async_fetch_title(url))

# Run the async tasks on a thread pool
# (urls and optimal_workers come from the earlier examples)
with ThreadPoolExecutor(max_workers=optimal_workers) as executor:
    results = list(executor.map(run_async_task, urls))

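This hybrid mode is particularly suited to crawling dynamic pages that need JavaScript rendering. For the related implementation, see the arender method of the HTML class in requests_html.py.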
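
An alternative to starting one event loop per thread is to run all coroutines in a single loop and cap the number in flight with a semaphore. The sketch below uses only the standard library and a simulated fetch (`fake_fetch_title` is hypothetical and sleeps instead of requesting a page), so it shows the concurrency pattern rather than requests-html behaviour.

```python
import asyncio
import time

async def fake_fetch_title(url, limit):
    """Simulated fetch: waits 0.1 s while holding one slot of the semaphore."""
    async with limit:
        await asyncio.sleep(0.1)
        return f"title of {url}"

async def crawl(urls, max_concurrency=3):
    # Create the semaphore inside the running loop.
    limit = asyncio.Semaphore(max_concurrency)
    # gather returns results in the same order as the input coroutines.
    return await asyncio.gather(*(fake_fetch_title(u, limit) for u in urls))

urls = [f"https://example.com/page/{i}" for i in range(6)]
start = time.perf_counter()
titles = asyncio.run(crawl(urls))
elapsed = time.perf_counter() - start
print(f"{len(titles)} titles in {elapsed:.2f}s")
```

With six simulated pages and three slots, the run takes about two rounds of waiting instead of six.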

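Performance Tuning and Anti-Crawling Strategies

Set sensible request headers

Send browser-like headers to reduce the chance of being blocked by the target site: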

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
}

# Pass the headers argument to the get request
response = session.get(url, headers=headers, timeout=10)

Random delays

Random delays imitate human browsing behaviour and lower the risk of being flagged by anti-crawling systems:

import random
import time

def fetch_title_with_delay(url):
    # Add a random delay of 0.5 to 2 seconds
    time.sleep(random.uniform(0.5, 2))
    return fetch_title(url)
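
A per-call random delay slows each thread down but does not bound the combined request rate once many threads are running. One way to cap the overall rate is a small limiter shared by all threads, as sketched below. `MinIntervalLimiter` is a hypothetical helper written for this article, not a requests-html feature.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class MinIntervalLimiter:
    """Spaces calls from all threads at least min_interval seconds apart."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_allowed)
            self._next_allowed = slot + self.min_interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)

limiter = MinIntervalLimiter(min_interval=0.05)

def limited_task(i):
    limiter.wait()
    return i, time.monotonic()

t0 = time.monotonic()
with ThreadPoolExecutor(max_workers=4) as executor:
    stamps = list(executor.map(limited_task, range(4)))
total = max(t for _, t in stamps) - t0
print(f"4 calls spread over {total:.2f}s")
```

A real crawler would call `limiter.wait()` at the top of its fetch function, possibly with one limiter per target domain.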

Caching results

Cache repeated requests to avoid unnecessary network traffic:

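from functools import lru_cache

# Note: lru_cache builds its key from the function arguments, so take only
# the URL as a parameter and keep the session out of the signature
@lru_cache(maxsize=100)
def get_cached_title(url):
    # Delegates to fetch_title from the earlier example; note that the
    # error strings it returns are cached as well
    return fetch_title(url)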
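
To see what the cache does, the following sketch wraps a simulated lookup and counts how often the underlying function actually runs. `cached_title` and the `calls` list are illustrative names, and no network request is made.

```python
from functools import lru_cache

calls = []  # records every real (uncached) invocation

@lru_cache(maxsize=100)
def cached_title(url):
    """Simulated title lookup, keyed only by the URL."""
    calls.append(url)
    return f"title of {url}"

first = cached_title("https://example.com")
second = cached_title("https://example.com")  # served from the cache
info = cached_title.cache_info()
print(len(calls), info.hits, info.misses)
```

The second call returns the stored value without running the function body, which is what saves the network round trip in a real crawler.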

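Summary and Outlook

With the multithreaded crawling approach described in this article, you now know how to make fuller use of system resources to speed up crawling. The key points are:

Use ThreadPoolExecutor to limit the number of concurrent threads
Choose max_workers carefully to balance throughput against resource usage
Use the selector features of requests-html to parse pages
Apply anti-crawling countermeasures so that the crawler keeps running reliably

The project's test file tests/test_requests_html.py contains more API usage examples that you can use as a reference when extending your crawler.

From here you can explore more advanced topics such as distributed crawling and building a proxy pool. Try converting one of your own crawlers and measure the difference multithreading makes.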

Further Reading

Official documentation: README.rst
Core source code: requests_html.py
Test cases: tests/test_requests_html.py
Async support: requests_html.py (arender method)

Author's note: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.