From Single Machine to Distributed: The Evolution of Python Crawler Architecture

Latest recommended article published on 2026-01-09 10:43:42

License: CC 4.0 BY-SA

Tags: #distributed #python #crawler

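Chapter 1: Single-Machine Crawlers, Their Starting Point and Limits

1. Goals and Audience

Goal: write a stable, maintainable single-machine crawler and lay the groundwork for engineering discipline (logging, retries, rate limiting, persistence).
Audience: readers who already know basic Python syntax and want to turn a "script" into a "dependable tool".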

2. Environment Setup

Python ≥ 3.9
Recommended libraries: requests, beautifulsoup4, lxml, tenacity (retries), loguru (logging, optional)

pip install requests beautifulsoup4 lxml tenacity loguru

3. Minimum Viable Crawler (MVP)

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com", timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
print(soup.title.get_text(strip=True))

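Key points: always set a timeout, because requests otherwise waits indefinitely on a stalled connection; call raise_for_status() so that 4xx/5xx responses surface as exceptions instead of being parsed as if they were valid pages.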

4. Robustness: Retries, Timeouts, Random UA, Polite Crawling

import random
import time

import requests
from tenacity import retry, stop_after_attempt, wait_exponential

UAS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/118 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Version/15.5 Safari/605.1.15",
]

# Up to 3 attempts with exponentially growing waits (bounded to 1..8 seconds).
# reraise=True surfaces the last real exception instead of tenacity's RetryError.
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=8),
    reraise=True,
)
def fetch(url: str) -> str:
    headers = {"User-Agent": random.choice(UAS)}
    r = requests.get(url, headers=headers, timeout=10)
    r.raise_for_status()
    # Politeness: a simple random pause so we do not hammer the site
    time.sleep(random.uniform(0.5, 1.5))
    return r.text

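Key points: exponential backoff (wait_exponential) is gentler on transient errors (429/5xx); add a random delay and a random UA. Note that the decorator as configured retries on any exception, so permanent errors such as 404 are retried as well.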
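To make the backoff idea concrete, and to show one way of retrying only transient failures, here is a minimal standard-library sketch. It is not how tenacity is implemented. The names HttpStatusError, TRANSIENT_STATUSES, backoff_delay and call_with_retry are hypothetical and introduced only for illustration, and the set of retryable status codes is an assumption that should be adjusted per site.

```python
import random
import time

# Assumed set of "worth retrying" status codes; adjust for the target site.
TRANSIENT_STATUSES = {429, 500, 502, 503, 504}


class HttpStatusError(Exception):
    """Hypothetical error type that carries an HTTP status code."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 8.0) -> float:
    """Wait before retry number `attempt` (1-based): base * 2**(attempt - 1), capped."""
    return min(cap, base * 2 ** (attempt - 1))


def call_with_retry(func, attempts: int = 3, sleep=time.sleep, jitter=random.random):
    """Call func(); retry only on transient statuses and re-raise everything else."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except HttpStatusError as exc:
            # Permanent errors (e.g. 404) and the final attempt are not retried.
            if exc.status not in TRANSIENT_STATUSES or attempt == attempts:
                raise
            delay = backoff_delay(attempt)
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + delay * 0.1 * jitter())
```

With the defaults, the waits before the second and third attempts are roughly 1 and 2 seconds, and no wait ever exceeds the 8-second cap. Passing sleep and jitter as parameters is a design choice that keeps the helper easy to test without real pauses.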

5. Three Ways to Parse: CSS / XPath / Regex

from bs4 import BeautifulSoup
from lxml import etree
import re

html = fetch("https://example.com")

# 1) CSS selectors (BS4)
soup = BeautifulSoup(html, "lxml")
title_css = soup.select_one("title").get_text(strip=True)

# 2) XPath (lxml)
dom = etree.HTML(html)
title_xpath = dom.xpath("string(//title)")

# 3) Regex (a fallback, not recommended as the first choice)
match = re.search(r"<title>(.*?)</title>", html, flags=re.I|re.S)
title_re = match.group(1).strip() if match else None

print(title_css, title_xpath, title_re)

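Recommendation: prefer CSS/XPath, which follow the document structure; use regex only as a fallback or for small, local extractions.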
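A small illustration of why a parser is the safer default: a literal pattern such as `<title>(.*?)</title>` stops matching as soon as the tag carries an attribute, while a parser does not care. To stay dependency-free, the sketch below uses the standard library's html.parser rather than BS4 or lxml. TitleParser, title_by_parser, title_by_literal_regex and SAMPLE are hypothetical names introduced here.

```python
import re
from html.parser import HTMLParser
from typing import Optional


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self) -> Optional[str]:
        text = "".join(self._chunks).strip()
        return text or None


def title_by_parser(html: str) -> Optional[str]:
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


def title_by_literal_regex(html: str) -> Optional[str]:
    # Matches only a bare <title> tag with no attributes.
    m = re.search(r"<title>(.*?)</title>", html, flags=re.I | re.S)
    return m.group(1).strip() if m else None


SAMPLE = '<html><head><title lang="en">\n  Hello, crawler\n</title></head><body></body></html>'
print(title_by_parser(SAMPLE))         # parser finds the title text
print(title_by_literal_regex(SAMPLE))  # literal pattern misses it
```

The regex could be loosened to tolerate attributes, but each such patch handles one more special case by hand, which is the maintenance cost the recommendation above is warning about.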

6. Persistence: CSV / SQLite / MongoDB (example: CSV)

import csv
import pathlib
from datetime import datetime, timezone

OUTPUT = pathlib.Path("data.csv")

def save_csv(rows):
    # Write the header only when the file does not exist yet
    exists = OUTPUT.exists()
    with OUTPUT.open("a", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=["url", "title", "ts"])
        if not exists:
            w.writeheader()
        for r in rows:
            w.writerow(r)

rows = [{
    "url": "https://example.com",
    "title": title_css,
    # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12
    "ts": datetime.now(timezone.utc).isoformat(),
}]
save_csv(rows)

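Key points: keep the field names consistent; store timestamps in UTC; append to the file and write the header automatically only when the file is first created.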
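The section title also mentions SQLite, so here is a comparable sketch using the standard library's sqlite3 module. The table name pages, the file name crawl.db, and the functions open_db and save_sqlite are assumptions made for illustration. Using url as the primary key is one possible design choice: re-crawling a page then overwrites its row instead of appending a duplicate, which plain CSV appends cannot do.

```python
import sqlite3


def open_db(path: str = "crawl.db") -> sqlite3.Connection:
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS pages (
            url   TEXT PRIMARY KEY,
            title TEXT,
            ts    TEXT NOT NULL
        )
        """
    )
    return conn


def save_sqlite(conn: sqlite3.Connection, rows) -> None:
    """Insert rows shaped like {"url", "title", "ts"}; an existing url is replaced."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT OR REPLACE INTO pages (url, title, ts) VALUES (:url, :title, :ts)",
            rows,
        )
```

The row dictionaries have the same three fields as the CSV example, so the same list can be passed to either writer.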

7. Single-Machine Concurrency Basics: `ThreadPoolExecutor` + Rate Limiting

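from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone
from urllib.parse import urljoin

BASE = "https://example.com/"
PATHS = ["/", "#", "/?page=2", "/about"]  # sample paths
# Resolve each path against BASE and drop duplicate URLs, keeping order
URLS = list(dict.fromkeys(urljoin(BASE, p) for p in PATHS))

def parse_title(url: str) -> dict:
    html = fetch(url)
    soup = BeautifulSoup(html, "lxml")
    # Pages without a <title> yield None instead of raising AttributeError
    title = soup.title.get_text(strip=True) if soup.title else None
    return {"url": url, "title": title}

results = []
with ThreadPoolExecutor(max_workers=8) as pool:
    # Map each future back to its URL so failures can be reported precisely
    futs = {pool.submit(parse_title, u): u for u in URLS}
    for fut in as_completed(futs):
        try:
            data = fut.result()
            results.append({**data, "ts": datetime.now(timezone.utc).isoformat()})
        except Exception as e:
            print("error:", futs[fut], e)

save_csv(results)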
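The pause inside fetch() delays each worker thread separately, so with several workers the combined request rate is higher than that of a single thread. One way to give all threads a shared budget is a small lock-protected limiter. The following is a minimal standard-library sketch, not a definitive implementation; the RateLimiter class, its min_interval parameter, and the injectable clock and sleep arguments are hypothetical names introduced here.

```python
import threading
import time


class RateLimiter:
    """Allow at most one request start per `min_interval` seconds across all threads."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        # Earliest moment at which the next request may start.
        self._next_slot = float("-inf")

    def wait(self) -> float:
        """Reserve the next free slot, sleep until it arrives, return seconds slept."""
        with self._lock:
            now = self._clock()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self._min_interval
        # Sleep outside the lock so other threads can reserve later slots meanwhile.
        delay = slot - now
        if delay > 0:
            self._sleep(delay)
        return delay
```

A possible usage is to create one module-level instance, for example `limiter = RateLimiter(0.5)`, and call `limiter.wait()` as the first line of the function that issues the request. The pool size then controls how many requests may be in flight, while the limiter controls how often a new one may start.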

Recommendations: