The code you provided is already a cleanly structured, well-modularized Scrapy + Playwright asynchronous crawler: `BrowserManager` manages the browser singleton's lifecycle, `PlaywrightTaskManager` handles task execution and timeout control, and `DemoSpider` is only responsible for scheduling and signal wiring. This design is well suited to production use (especially when deployed with Scrapyd) and works cross-platform (Windows/Linux).
Below I **analyze your code in detail**, point out its strengths and potential problems, give optimization suggestions plus a complete improved version, and finish with some deeper follow-up questions you can use to think further or extend the system.
---
## ✅ Code Walkthrough
### 🧩 1. `BrowserManager`: browser singleton management
```python
class BrowserManager:
def __init__(self):
self._pw = None
self.browser = None
self._started = False
async def start(self):
if not self._started:
self._pw = await async_playwright().start()
self.browser = await self._pw.chromium.launch(headless=True)
self._started = True
logging.getLogger(__name__).info("Browser singleton started")
async def close(self):
if self._started:
try:
await self.browser.close()
finally:
await self._pw.stop()
self._started = False
logging.getLogger(__name__).info("Browser singleton closed")
```
#### ✅ Strengths:
- Implements the browser as a **singleton**, avoiding repeated launches.
- Initializes Playwright correctly with `async_playwright().start()`.
- `close()` calls `self._pw.stop()` inside a `finally` block, preventing resource leaks.
#### ⚠️ Caveats:
- `async_playwright()` returns an async context manager; using `async with` (or driving the protocol explicitly) is the safer pattern.
- The current code works, but it does not handle launch failures: if `chromium.launch()` raises, `_pw` has already been started and is never stopped.
✅ **Suggested improvement**: make sure Playwright is released on every code path, as sketched below.
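A minimal sketch of that idea, using Playwright's documented `start()`/`stop()` pair and cleaning up when the launch itself fails; the class mirrors the original one but is not a drop-in replacement:
```python
from playwright.async_api import async_playwright

class BrowserManager:
    """Browser singleton that stops Playwright even if the launch fails."""

    def __init__(self, headless: bool = True):
        self.headless = headless
        self._pw = None
        self.browser = None

    async def start(self):
        if self.browser:  # already started
            return
        self._pw = await async_playwright().start()
        try:
            self.browser = await self._pw.chromium.launch(headless=self.headless)
        except Exception:
            await self._pw.stop()  # don't leak the Playwright driver process
            self._pw = None
            raise

    async def close(self):
        try:
            if self.browser:
                await self.browser.close()
        finally:
            if self._pw:
                await self._pw.stop()
            self.browser = self._pw = None
```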
---
### 🧩 2. `PlaywrightTaskManager`: task execution and timeout control
```python
class PlaywrightTaskManager:
def __init__(self, browser_manager: BrowserManager, timeout: int = 20):
self.browser_manager = browser_manager
self.timeout = timeout
async def _task_logic(self, url: str):
context = await self.browser_manager.browser.new_context()
page = await context.new_page()
screenshot_path = None
try:
await page.goto(url, wait_until="load")
title = await page.title()
            await asyncio.sleep(5)  # simulate waiting for dynamically loaded content
screenshot_path = str(SCREENSHOT_DIR / f"{int(asyncio.get_event_loop().time())}.png")
await page.screenshot(path=screenshot_path, full_page=True)
return {"url": url, "title": title, "screenshot_path": screenshot_path}
finally:
try:
await context.close()
except Exception:
pass
```
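(Not shown in this excerpt: the `run_task_with_timeout()` method that the spider calls later. Assuming it simply wraps `_task_logic` in `asyncio.wait_for`, it could look like the sketch below; the exact shape of the error dict is my guess, not taken from the original code.)
```python
import asyncio

# inside class PlaywrightTaskManager:
async def run_task_with_timeout(self, url: str):
    """Run _task_logic(url) and turn a timeout into an error result."""
    try:
        return await asyncio.wait_for(self._task_logic(url), timeout=self.timeout)
    except asyncio.TimeoutError:
        return {"url": url, "error": f"timeout after {self.timeout}s"}
```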
#### ✅ Strengths:
- Encapsulates the page workflow (navigation, screenshot, etc.).
- Uses `finally` to make sure the browser context is closed, preventing leaks.
- Supports a configurable timeout.
#### ⚠️ Potential problems:
- `await asyncio.sleep(5)` is a hard-coded delay and a poor way to wait for a page to finish loading; use `page.wait_for_load_state()` or `page.wait_for_selector()` instead.
- Screenshot filenames are timestamp-based and can collide under high concurrency (unlikely, but possible).
- The exception handling is too broad (`except Exception`), which makes debugging harder.
✅ **Suggested improvements** (see the sketch right after this list):
- Replace `sleep(5)` with a smarter wait mechanism.
- Name screenshots with a unique identifier such as a UUID or a hash.
- Handle exceptions with finer granularity.
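A minimal sketch combining those points; the `"body"` selector is only a placeholder for whatever element actually signals that your target page is ready:
```python
import uuid
from pathlib import Path

from playwright.async_api import TimeoutError as PWTimeout

SCREENSHOT_DIR = Path("screenshots")
SCREENSHOT_DIR.mkdir(exist_ok=True)

async def capture(page, url: str, timeout_ms: int = 20_000) -> dict:
    """Navigate, wait for an explicit readiness signal, then take a screenshot."""
    try:
        await page.goto(url, wait_until="load", timeout=timeout_ms)
        # wait for something concrete instead of a fixed asyncio.sleep(5)
        await page.wait_for_selector("body", timeout=timeout_ms)
        await page.wait_for_load_state("networkidle", timeout=timeout_ms)
    except PWTimeout:
        # narrow exception: only timeouts are expected here
        return {"url": url, "error": "timeout"}

    # UUID-based filename: no collisions even when many tasks run at once
    shot = SCREENSHOT_DIR / f"{uuid.uuid4().hex}.png"
    await page.screenshot(path=str(shot), full_page=True)
    return {"url": url, "title": await page.title(), "screenshot_path": str(shot)}
```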
---
### 🧩 3. `DemoSpider`: the Scrapy scheduling core
```python
class DemoSpider(scrapy.Spider):
name = "demo_playwright_task_manager1"
start_urls = [...]
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.timeout = 3
self._loop = None
self.browser_manager = BrowserManager()
self.task_manager = None
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
spider = super().from_crawler(crawler, *args, **kwargs)
crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
return spider
def spider_opened(self):
self._loop = asyncio.new_event_loop()
asyncio.set_event_loop(self._loop)
self._loop.run_until_complete(self.browser_manager.start())
self.task_manager = PlaywrightTaskManager(self.browser_manager, timeout=self.timeout)
def spider_closed(self):
if self.browser_manager.is_running():
self._loop.run_until_complete(self.browser_manager.close())
self._loop.close()
def start_requests(self):
for url in self.start_urls:
result = self._loop.run_until_complete(
self.task_manager.run_task_with_timeout(url)
)
yield scrapy.Request(
url=url,
callback=self.parse,
meta=result,
dont_filter=True
)
def parse(self, response):
meta = response.meta
return {
"url": meta.get("url"),
"title": meta.get("title"),
"screenshot": meta.get("screenshot_path"),
"error": meta.get("error"),
}
```
#### ✅ Strengths:
- Uses Scrapy's signal system to hook the spider lifecycle correctly.
- Manually creates a dedicated event loop, which works around the "no running event loop" error.
- Tasks execute synchronously, blocking the caller (fine for simple scenarios).
#### ⚠️ Potential problems:
- `start_requests` drives the async tasks with `run_until_complete` → **strictly serial execution, poor performance**.
- No concurrent crawling of multiple URLs.
- If the asyncio Twisted reactor (`TWISTED_REACTOR`) is enabled, spinning up a second loop and calling `run_until_complete` in the same thread will fail, because asyncio refuses to run a loop while another one is already running.
- Yielding plain dicts from `parse` works, but defining `scrapy.Item` (or dataclass items) makes the output schema explicit and easier to validate in pipelines.
✅ **Suggested improvements** (a `scrapy-playwright` sketch follows this list):
- Use the `scrapy-playwright` plugin (requests marked with `meta={"playwright": True}` and routed through its download handler), or a custom downloader middleware, to get true asynchronous concurrency.
- If you keep the current architecture, at least switch to an asynchronous generator (`async def start_requests`, Scrapy 2.6+).
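For orientation, here is roughly what the `scrapy-playwright` route looks like; a minimal sketch based on that plugin's documented settings and `meta` keys (verify them against the version you install):
```python
import scrapy

class PluginDemoSpider(scrapy.Spider):
    name = "demo_scrapy_playwright_plugin"
    custom_settings = {
        # hand HTTP(S) downloads over to the plugin's download handler
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }
    start_urls = ["https://www.hao123.com/", "http://www.people.com.cn/"]

    def start_requests(self):
        for url in self.start_urls:
            # the "playwright" meta flag makes this request render in a real browser
            yield scrapy.Request(url, meta={"playwright": True}, callback=self.parse)

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```
Requests here run concurrently through Scrapy's normal scheduler, so no manual `asyncio.gather` is needed.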
---
## ✅ Improved code (optimized version)
```python
# demo_scrapy_playwright_split.py - optimized version
import os
import asyncio
import logging
import uuid
from pathlib import Path
from typing import Optional
import scrapy
from scrapy import signals
from playwright.async_api import async_playwright, TimeoutError as PWTimeout
# logging configuration
logger = logging.getLogger(__name__)
SCREENSHOT_DIR = Path("screenshots")
SCREENSHOT_DIR.mkdir(exist_ok=True)
class BrowserManager:
def __init__(self, headless: bool = True):
self.headless = headless
        self._pw = None
        self._pw_ctx = None  # holds the async_playwright() context manager
        self.browser = None
        self._started = False
async def start(self):
if self._started:
return
        # drive async_playwright()'s context manager explicitly (same effect as async with)
self._pw_ctx = async_playwright()
self._pw = await self._pw_ctx.__aenter__()
try:
self.browser = await self._pw.chromium.launch(headless=self.headless)
self._started = True
logger.info("Browser singleton started.")
except Exception as e:
logger.error(f"Failed to start browser: {e}")
await self._pw_ctx.__aexit__(None, None, None)
raise
async def close(self):
if not self._started:
return
try:
if self.browser:
await self.browser.close()
finally:
if self._pw_ctx:
await self._pw_ctx.__aexit__(None, None, None)
self._started = False
logger.info("Browser singleton closed.")
def is_running(self) -> bool:
return self._started and self.browser is not None
class PlaywrightTaskManager:
def __init__(self, browser_manager: BrowserManager, timeout: int = 20):
self.browser_manager = browser_manager
self.timeout = timeout
async def _task_logic(self, url: str):
context = None
page = None
try:
context = await self.browser_manager.browser.new_context(
viewport={"width": 1920, "height": 1080},
java_script_enabled=True,
)
page = await context.new_page()
await page.goto(url, wait_until="networkidle", timeout=self.timeout * 1000)
await page.wait_for_load_state("domcontentloaded")
title = await page.title()
            # use a UUID to avoid filename collisions
filename = f"{uuid.uuid4().hex}.png"
screenshot_path = SCREENSHOT_DIR / filename
await page.screenshot(path=str(screenshot_path), full_page=True)
return {
"url": url,
"title": title,
"screenshot_path": str(screenshot_path),
"success": True
}
except PWTimeout as e:
logger.warning(f"Timeout navigating to {url}: {e}")
return {"url": url, "error": f"timeout_{self.timeout}s", "success": False}
except Exception as e:
logger.error(f"Error processing {url}: {type(e).__name__}, {e}")
return {"url": url, "error": str(e), "success": False}
finally:
if context:
try:
await context.close()
except Exception as e:
logger.debug(f"Failed to close context: {e}")
class DemoSpider(scrapy.Spider):
name = "demo_playwright_task_manager"
custom_settings = {
'LOG_LEVEL': 'INFO',
'DOWNLOAD_TIMEOUT': 30,
        # optional: cap request concurrency to match what the browser can handle
'CONCURRENT_REQUESTS': 3,
'CONCURRENT_REQUESTS_PER_DOMAIN': 3,
}
start_urls = [
"https://www.hao123.com/",
"http://www.people.com.cn/",
"http://renshi.people.com.cn/",
]
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.timeout = 20
self._loop: Optional[asyncio.AbstractEventLoop] = None
self.browser_manager = BrowserManager(headless=True)
self.task_manager: Optional[PlaywrightTaskManager] = None
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
spider = super().from_crawler(crawler, *args, **kwargs)
crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
return spider
    async def spider_opened(self):
        """Start the browser asynchronously when the spider opens."""
        self._loop = asyncio.get_running_loop()
await self.browser_manager.start()
self.task_manager = PlaywrightTaskManager(self.browser_manager, timeout=self.timeout)
logger.info("Spider opened, browser ready.")
    async def spider_closed(self):
        """Close the browser asynchronously when the spider closes."""
        await self.browser_manager.close()
        logger.info("Spider closed.")
async def start_requests(self):
"""异步并发请求(Scrapy 2.6+ 支持 async generator)"""
tasks = [
self.task_manager._task_logic(url)
for url in self.start_urls
]
results = await asyncio.gather(*tasks)
for result in results:
            # route successful results through a Request so parse emits the item
meta = result.copy()
url = result.get("url")
if result.get("success"):
yield scrapy.Request(
url=url,
callback=self.parse,
meta=meta,
errback=self.handle_error,
dont_filter=True
)
            else:
                # most Scrapy versions only accept Request objects from start_requests,
                # so report failures via the log instead of yielding a dict here
                self.logger.warning(f"Playwright task failed for {url}: {meta.get('error')}")
def parse(self, response):
meta = response.meta
yield {
"url": meta["url"],
"title": meta["title"],
"screenshot": meta["screenshot_path"],
"status": "success"
}
def handle_error(self, failure):
url = failure.request.url
yield {
"url": url,
"error": repr(failure),
"status": "failed"
}
```
---
## ✅ Summary of the main improvements:
| Item | Original code | Improved version |
|-------|--------|--------|
| Playwright initialization | manual `.start()` | context-manager protocol (`__aenter__`/`__aexit__`) |
| Screenshot naming | timestamp-based, collision-prone | unique names via `uuid` |
| Page wait strategy | `asyncio.sleep(5)` | `wait_until="networkidle"` + `wait_for_load_state` |
| Error logging | broad catch-all | categorized logging, easier troubleshooting |
| Concurrency model | serial execution | concurrent via `asyncio.gather` |
| `start_requests` | synchronous, blocking | `async def` asynchronous generator (Scrapy ≥ 2.6) |
| Output format | `yield dict` directly | uniform `Request` + `parse` flow |
> 💡 Tip: `async def start_requests` requires Scrapy ≥ 2.6 and the asyncio Twisted reactor. Scrapy has no `ASYNCIO_ENABLED` setting; enable the reactor through `TWISTED_REACTOR` instead (newer `scrapy startproject` templates already include it).
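A minimal `settings.py` excerpt reflecting that tip (the concurrency cap is just an illustrative value):
```python
# settings.py (excerpt)

# Required for async def start_requests and coroutine callbacks:
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Optional: keep Scrapy's own concurrency in line with what a single browser can handle
CONCURRENT_REQUESTS = 3
```
With that in place the spider runs as usual, e.g. `scrapy crawl demo_playwright_task_manager -O results.json`.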
---