Scraping E-commerce Site Data with Crawlee-Python's PlaywrightCrawler

Article link: https://blog.youkuaiyun.com/gitblog_00235/article/details/148490821


crawlee-python: Crawlee is a web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation. Project URL: https://gitcode.com/gh_mirrors/cr/crawlee-python

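Project Overview

Crawlee-Python is a Python web scraping framework that builds on modern browser automation tools such as Playwright to provide efficient, reliable page crawling. This article walks through how to use its PlaywrightCrawler to build a complete data scraping solution for an e-commerce site.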

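Core Components

The PlaywrightCrawler class

PlaywrightCrawler is one of the core crawler classes in the Crawlee-Python framework. It is built on Playwright and has the following characteristics:

Handles pages rendered with modern JavaScript
Manages the request queue and concurrency automatically
Provides a full API for interacting with the page
Includes built-in request deduplication and error handling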

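Crawler workflow

The example code follows a typical e-commerce crawling flow with three levels:

Start page: discover and collect all category pages
Category pages: collect links to product detail pages and handle pagination
Detail pages: extract the product information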
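
The three-level flow can be sketched without a browser. The following is a toy, in-memory simulation, not Crawlee's implementation: the SITE dictionary, its URLs, and the label_for and crawl functions are hypothetical and exist only to show how labels route requests, how duplicate links are dropped, and how a request cap stops the crawl.

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to the links found on that page.
SITE = {
    '/collections': ['/collections/tv', '/collections/audio'],
    '/collections/tv': ['/products/sony-tv', '/products/lg-tv', '/collections/tv?page=2'],
    '/collections/tv?page=2': ['/products/sony-tv', '/products/tcl-tv'],
    '/collections/audio': ['/products/jbl-speaker'],
}


def label_for(link: str) -> str:
    """Pick the label a discovered link should carry."""
    return 'DETAIL' if link.startswith('/products/') else 'CATEGORY'


def crawl(start_url: str, max_requests: int) -> list[tuple[str, str]]:
    """Breadth-first crawl that returns (label, url) pairs in processing order."""
    queue = deque([(None, start_url)])  # the start request carries no label
    seen = {start_url}                  # request deduplication
    processed = []
    while queue and len(processed) < max_requests:
        label, url = queue.popleft()
        processed.append((label or 'START', url))
        if label == 'DETAIL':
            continue  # detail pages yield data, not new links
        for link in SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((label_for(link), link))
    return processed


visited = crawl('/collections', max_requests=10)
for label, url in visited:
    print(label, url)
```

Note that the product linked from both the first and the second listing page is processed only once, and that lowering max_requests cuts the crawl short in the same spirit as max_requests_per_crawl in the article's example.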

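Code Walkthrough

Initializing the crawler

# Import path used by recent Crawlee releases; older versions exposed
# these names from crawlee.playwright_crawler instead.
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler(
    max_requests_per_crawl=10,  # cap the number of requests to limit the crawl scope
)

Here we create a PlaywrightCrawler instance and cap the total number of requests with the max_requests_per_crawl parameter, which is very useful during development and testing.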

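Designing the request handler

The request handler is the heart of the crawler. We use the @crawler.router.default_handler decorator to define the default handling logic:

@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    context.log.info(f'Processing {context.request.url}')

The handler receives a PlaywrightCrawlingContext object, which bundles the context of the current request: the request itself, the Playwright page, a logger, and helpers such as enqueue_links and push_data. The label branches shown in the following sections all belong to the body of this handler.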
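
To make the decorator pattern concrete, here is a minimal stand-in for label-based routing. MiniRouter and FakeContext are hypothetical names written for this illustration; they are not Crawlee classes and only mimic the idea that a decorator registers a coroutine and the router later picks a handler from the request label.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class FakeContext:
    """Hypothetical stand-in for a crawling context."""
    url: str
    label: Optional[str] = None


Handler = Callable[[FakeContext], Awaitable[None]]


class MiniRouter:
    """Toy router: maps request labels to async handlers."""

    def __init__(self) -> None:
        self._default: Optional[Handler] = None
        self._by_label: dict[str, Handler] = {}

    def default_handler(self, func: Handler) -> Handler:
        # Used as a bare decorator: @router.default_handler
        self._default = func
        return func

    def handler(self, label: str) -> Callable[[Handler], Handler]:
        # Used with an argument: @router.handler('DETAIL')
        def register(func: Handler) -> Handler:
            self._by_label[label] = func
            return func
        return register

    async def dispatch(self, context: FakeContext) -> None:
        # Fall back to the default handler when no label-specific one exists.
        func = self._by_label.get(context.label or '', self._default)
        if func is None:
            raise RuntimeError(f'No handler for label {context.label!r}')
        await func(context)


router = MiniRouter()
calls: list[str] = []


@router.default_handler
async def start_handler(context: FakeContext) -> None:
    calls.append(f'default:{context.url}')


@router.handler('DETAIL')
async def detail_handler(context: FakeContext) -> None:
    calls.append(f'detail:{context.url}')


async def demo() -> None:
    await router.dispatch(FakeContext('https://example.com/'))
    await router.dispatch(FakeContext('https://example.com/p/1', label='DETAIL'))


asyncio.run(demo())
```

The article's example keeps every branch inside one default handler and switches on context.request.label. Registering one handler per label, as the toy router does, is an alternative way to organize the same logic once the branches grow.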

Handling detail pages

When a request is labeled 'DETAIL', we extract structured data from the product page:

if context.request.label == 'DETAIL':
    # Derive the manufacturer from the last URL segment (text before the first hyphen)
    url_part = context.request.url.split('/').pop()
    manufacturer = url_part.split('-')[0]

    # Extract the product title
    title = await context.page.locator('.product-meta h1').text_content()

    # Extract the SKU number
    sku = await context.page.locator('span.product-meta__sku-number').text_content()

    # Extract the price: take the first price element containing '$',
    # then strip the currency sign and thousands separators
    price_element = context.page.locator('span.price', has_text='$').first
    current_price_string = await price_element.text_content() or ''
    raw_price = current_price_string.split('$')[1]
    price = float(raw_price.replace(',', ''))

    # Check the stock status: the product is in stock if a matching element exists
    in_stock_element = context.page.locator(
        selector='span.product-form__inventory',
        has_text='In stock',
    ).first
    in_stock = await in_stock_element.count() > 0

    # Build the data record and store it in the dataset
    data = {
        'manufacturer': manufacturer,
        'title': title,
        'sku': sku,
        'price': price,
        'in_stock': in_stock,
    }
    await context.push_data(data)
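
The string handling in the detail branch can be pulled into small pure functions that are easy to test without a browser. The sketch below is one possible arrangement, not part of the original example: manufacturer_from_url, parse_price, and clean_text are hypothetical helper names. It assumes prices are written with a dollar sign and comma thousands separators, and it returns None where the inline version would raise on unexpected text.

```python
import re
from typing import Optional
from urllib.parse import urlparse


def manufacturer_from_url(url: str) -> str:
    """Return the text before the first hyphen in the last path segment."""
    last_segment = urlparse(url).path.rstrip('/').split('/')[-1]
    return last_segment.split('-')[0]


def parse_price(text: Optional[str]) -> Optional[float]:
    """Extract the first dollar amount from text, or None if there is none."""
    match = re.search(r'\$\s*(\d[\d,]*(?:\.\d+)?)', text or '')
    if match is None:
        return None
    return float(match.group(1).replace(',', ''))


def clean_text(text: Optional[str]) -> Optional[str]:
    """Strip surrounding whitespace; a missing text value stays None."""
    return text.strip() if text is not None else None
```

Using urlparse means a query string on the product URL does not leak into the manufacturer value, which the plain split('/') approach would allow.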

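Handling category pages

Category pages are where product links are discovered and pagination is followed:

elif context.request.label == 'CATEGORY':
    # Wait for the product list to load
    await context.page.wait_for_selector('.product-item > a')

    # Enqueue the product links, labeled for the detail branch
    await context.enqueue_links(
        selector='.product-item > a',
        label='DETAIL',
    )

    # Handle pagination: enqueue the next page only if a "next" link exists
    next_button = await context.page.query_selector('a.pagination__next')
    if next_button:
        await context.enqueue_links(
            selector='a.pagination__next',
            label='CATEGORY',
        )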

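Handling the start page

The start page is where all categories are discovered. Requests without a label, such as the start URL, fall through to this branch:

else:
    # Wait for the category blocks to load
    await context.page.wait_for_selector('.collection-block-item')

    # Enqueue the category links
    await context.enqueue_links(
        selector='.collection-block-item',
        label='CATEGORY',
    )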

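Running the crawler

Finally, we start the crawler and pass it the start URL. Because run() is a coroutine, this call has to sit inside an async function, for example a main() coroutine executed with asyncio.run(main()):

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections'])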

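Key Takeaways

Layered processing: splitting the crawl logic into start, category, and detail levels keeps the code structure clear
Explicit waits: wait_for_selector ensures elements have loaded before the code acts on them
Precise selectors: combining CSS selectors with text filters (has_text) pins down the right element
Data normalization: raw values are cleaned and converted after extraction (for example, the price string becomes a float)
Asynchronous processing: Python's async features are used to improve crawl throughput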
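
As a rough illustration of the asynchronous point, the sketch below runs several simulated page loads concurrently while capping how many are in flight at once. It uses only asyncio; fake_fetch, fetch_all, and the 0.05-second delay are stand-ins for real page loads, and this is not how Crawlee schedules requests internally.

```python
import asyncio
import time


async def fake_fetch(url: str, limiter: asyncio.Semaphore) -> str:
    """Simulate one page load; the semaphore caps concurrent loads."""
    async with limiter:
        await asyncio.sleep(0.05)  # stand-in for network and rendering time
        return f'fetched {url}'


async def fetch_all(urls: list[str], max_concurrency: int) -> list[str]:
    """Fetch all URLs with at most max_concurrency loads running at once."""
    limiter = asyncio.Semaphore(max_concurrency)
    # gather() returns results in the same order as its inputs.
    return await asyncio.gather(*(fake_fetch(u, limiter) for u in urls))


urls = [f'https://example.com/page/{i}' for i in range(6)]
started = time.perf_counter()
results = asyncio.run(fetch_all(urls, max_concurrency=3))
elapsed = time.perf_counter() - started
print(f'{len(results)} pages in {elapsed:.2f}s')
```

With six pages and three concurrent slots the waits overlap, so the total time is closer to two delays than to six. The same cap also works as a simple way to limit the load placed on the target site.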

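Practical Recommendations

Error handling: add exception handling in real projects to cope with network glitches and page structure changes
Rate limiting: space out requests sensibly so the crawler does not put excessive load on the target site
Data storage: consider exporting the data to a database or to files rather than relying only on the crawler's default dataset
Configuration management: move volatile parts such as selectors into configuration so they are easier to maintain

This example shows how the Crawlee-Python framework simplifies data scraping on a multi-level site while keeping the code readable and maintainable.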
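
One way to act on the configuration advice is to gather the selectors from the example into a single object. This is a sketch under assumptions: the Selectors dataclass and load_selectors function are hypothetical names, the default values are the selectors used in this article, and the override 'h1.product-title' is an invented value for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Selectors:
    """CSS selectors used by the crawler, kept in one place."""
    category_link: str = '.collection-block-item'
    product_link: str = '.product-item > a'
    next_page: str = 'a.pagination__next'
    title: str = '.product-meta h1'
    sku: str = 'span.product-meta__sku-number'
    price: str = 'span.price'
    inventory: str = 'span.product-form__inventory'


def load_selectors(overrides: dict[str, str]) -> Selectors:
    """Build Selectors from defaults plus overrides, rejecting unknown keys."""
    unknown = set(overrides) - set(Selectors.__dataclass_fields__)
    if unknown:
        raise KeyError(f'Unknown selector names: {sorted(unknown)}')
    return Selectors(**overrides)


# Hypothetical override, e.g. read from a JSON or TOML file after a site redesign.
selectors = load_selectors({'title': 'h1.product-title'})
print(selectors.title, '|', selectors.product_link)
```

The handler would then read selectors.product_link instead of a string literal, so a change in the page structure means editing one configuration entry rather than several places in the code.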

Authoring statement: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.