Concurrency Basics (1): The Differences Between wait, sleep, await, and yield

This article compares the sleep, yield, and await/wait methods. Regarding lock release, sleep and yield do not release the lock, while await/wait do. They also resume differently after the call: sleep becomes runnable after the specified time, yield becomes runnable immediately, and await/wait must be woken by notify/signal. They belong to different classes, and their execution environments differ, since await/wait must be called inside a synchronized block.

Whether the lock is released: calling sleep or yield does not release the locks held by the current thread, whereas calling await/wait releases the acquired lock and blocks waiting.
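A minimal sketch of this difference (class and variable names are illustrative, not from the original article): the first thread calls wait() inside a synchronized block and thereby releases the monitor, so the second thread can enter almost immediately; replacing wait(3000) with Thread.sleep(3000) would keep the lock held for the full three seconds.

```java
public class LockReleaseDemo {
    private static final Object LOCK = new Object();

    public static void main(String[] args) throws InterruptedException {
        Thread waiter = new Thread(() -> {
            synchronized (LOCK) {
                System.out.println("waiter: got lock, calling wait()");
                try {
                    LOCK.wait(3000);          // releases LOCK while waiting
                    // Thread.sleep(3000);    // would keep holding LOCK instead
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                System.out.println("waiter: resumed");
            }
        });

        Thread contender = new Thread(() -> {
            synchronized (LOCK) {
                // with wait() above this prints right away; with sleep() it waits ~3s
                System.out.println("contender: got lock");
            }
        });

        waiter.start();
        Thread.sleep(100);        // let the waiter grab the lock first
        contender.start();
    }
}
```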

 

When the thread resumes after the call:

- sleep blocks the thread and keeps it from executing for the specified duration; when the time is up it returns to the runnable state, but it is not necessarily scheduled to run immediately;
- yield only puts the current thread back into the runnable state, so it may well be scheduled and run again right away;
- await/wait blocks on a condition queue until some other thread calls the corresponding notify/signal method; only then does the waiting thread return to the runnable state, and even then it is not necessarily executed immediately (see the sketch below).
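As a rough illustration of the wake-up behaviour (the class and field names are made up for this example): the consumer thread below parks on the object's condition queue until the producer calls notify(), and only then does it move back toward the runnable state.

```java
public class NotifyDemo {
    private static final Object LOCK = new Object();
    private static boolean ready = false;

    public static void main(String[] args) throws InterruptedException {
        Thread consumer = new Thread(() -> {
            synchronized (LOCK) {
                while (!ready) {              // guard against spurious wakeups
                    try {
                        LOCK.wait();          // blocks until notified
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
                System.out.println("consumer: woken up, data is ready");
            }
        });
        consumer.start();

        Thread.sleep(1000);                   // simulate work in the producer
        synchronized (LOCK) {
            ready = true;
            LOCK.notify();                    // moves the waiter back toward runnable
        }
    }
}
```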

 

Whose method: yield and sleep are both methods of the Thread class, wait is a method of the Object class, and await is a method of Condition, the explicit condition queue.
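A small sketch (purely illustrative) of where each call lives: sleep and yield are static methods on Thread, wait is inherited from Object, and await comes from a Condition created off a ReentrantLock.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class WhoOwnsWhat {
    public static void main(String[] args) throws InterruptedException {
        Thread.sleep(10);      // Thread (static)
        Thread.yield();        // Thread (static)

        Object monitor = new Object();
        synchronized (monitor) {
            monitor.wait(10);  // Object, requires holding the monitor
        }

        ReentrantLock lock = new ReentrantLock();
        Condition cond = lock.newCondition();
        lock.lock();
        try {
            cond.await(10, TimeUnit.MILLISECONDS); // Condition, requires holding the lock
        } finally {
            lock.unlock();
        }
    }
}
```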

 

Execution environment: yield and sleep can be called anywhere in a thread, but wait must be called inside a synchronized block on the object (and await while holding the Lock associated with the Condition); otherwise an IllegalMonitorStateException is thrown at run time.
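A minimal sketch of that runtime failure (the class name is illustrative): calling wait() without owning the object's monitor throws IllegalMonitorStateException, while the same call inside a synchronized block on that object is legal.

```java
public class MonitorStateDemo {
    public static void main(String[] args) throws InterruptedException {
        Object lock = new Object();

        try {
            lock.wait(100);                    // not inside synchronized (lock)
        } catch (IllegalMonitorStateException e) {
            System.out.println("wait() outside a synchronized block: " + e);
        }

        synchronized (lock) {
            lock.wait(100);                    // legal: the monitor is held
        }
        System.out.println("wait() inside a synchronized block returned normally");
    }
}
```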

 

|  | await/wait | sleep | yield |
| --- | --- | --- | --- |
| Releases the held lock? | Released | Not released | Not released |
| When does it resume? | Enters the runnable state after being woken | After the specified time | Enters the runnable state immediately |
| Whose method? | Condition / Object | Thread | Thread |
| Execution environment | Synchronized block | Anywhere | Anywhere |
