A Deep Dive into Web Crawling Techniques in apify/crawlee-python

Article link: https://blog.youkuaiyun.com/gitblog_00100/article/details/148490813


crawlee-python: Crawlee, a web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation. Project repository: https://gitcode.com/gh_mirrors/cr/crawlee-python

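Overview

apify/crawlee-python is a Python web crawling framework for scraping web data efficiently and reliably. This article focuses on how to crawl pages with it, particularly on sites with a hierarchical page structure such as e-commerce stores.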

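Crawling listing pages

An e-commerce site typically has two main page types: category listing pages and product detail pages. We start with the listing pages.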

Basic crawling approach

In the previous tutorial, we used enqueue_links() to simply collect every link on the page. In practice, we need finer control over what gets crawled:

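async def parse_category(context):
    # `context` is the crawling context that Crawlee passes to a request handler.
    context.log.info(f"Processing category: {context.request.url}")
    # Enqueue only the links matching the selector and label them as category pages.
    await context.enqueue_links(
        selector=".collection-block-item",
        label="CATEGORY",
    )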

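Key parameters

selector: a CSS selector that restricts which links are collected. In the example above, .collection-block-item selects only the category links.
label: a tag attached to every enqueued request so that different request types can be told apart. Here, requests for category pages are labeled "CATEGORY", which makes the request type easy to identify in later processing.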
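
To make the effect of the two parameters concrete, here is a minimal standard-library sketch of what "collect only matching links and tag them" means. It is not how Crawlee implements enqueue_links(); the names LabeledRequest, ClassLinkCollector and select_links, the sample HTML, and the example.com base URL are all hypothetical, and the matching is simplified to a single class name on the `<a>` tag rather than a full CSS selector.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from urllib.parse import urljoin


@dataclass(frozen=True)
class LabeledRequest:
    """Hypothetical stand-in for a queued request: a URL plus a label."""
    url: str
    label: str


class ClassLinkCollector(HTMLParser):
    """Collects href values of <a> tags that carry a given CSS class."""

    def __init__(self, css_class: str) -> None:
        super().__init__()
        self.css_class = css_class
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # An attribute without a value is reported as None, hence the `or ""`.
        classes = (attributes.get("class") or "").split()
        href = attributes.get("href")
        if self.css_class in classes and href:
            self.hrefs.append(href)


def select_links(html: str, base_url: str, css_class: str, label: str) -> list[LabeledRequest]:
    """Return labeled requests for the matching links, in document order."""
    collector = ClassLinkCollector(css_class)
    collector.feed(html)
    return [LabeledRequest(urljoin(base_url, href), label) for href in collector.hrefs]


# Hypothetical page fragment: two category links and one unrelated link.
SAMPLE_HTML = """
<a class="collection-block-item" href="/collections/tv">TV</a>
<a class="footer-link" href="/about">About</a>
<a class="collection-block-item featured" href="/collections/audio">Audio</a>
"""

requests = select_links(SAMPLE_HTML, "https://example.com", "collection-block-item", "CATEGORY")
for request in requests:
    print(request.label, request.url)
```

The footer link is skipped because it lacks the class, and both remaining URLs carry the "CATEGORY" label that later processing can branch on.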

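Crawling detail pages

Reaching the product detail pages works on the same principle as the listing pages; only the selector and the label differ: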

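async def parse_product(context):
    # Runs on a category page; `context` is the crawling context passed to the handler.
    context.log.info(f"Collecting product links from: {context.request.url}")
    # Enqueue the product links and label them as detail pages.
    await context.enqueue_links(
        selector=".product-item > a",
        label="DETAIL",
    )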
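
The two steps together form a small label-driven crawl: the start page yields "CATEGORY" requests, category pages yield "DETAIL" requests, and detail pages are where data would be extracted. The sketch below imitates that flow with the standard library only, so it runs without a network or Crawlee installed. In a real crawler the request queue, deduplication, and handler dispatch are the library's job. The SITE mapping, the "START" label, and the crawl function are hypothetical and exist only for illustration.

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to the links found on that page.
SITE = {
    "/collections": ["/collections/tv", "/collections/audio"],
    "/collections/tv": ["/products/tv-1", "/products/tv-2"],
    "/collections/audio": ["/products/speaker-1", "/products/tv-2"],
}

# Label given to links discovered on a page that carries the key label.
NEXT_LABEL = {"START": "CATEGORY", "CATEGORY": "DETAIL"}


def crawl(start_url: str) -> list[str]:
    """Breadth-first crawl that dispatches on labels; returns visited product URLs."""
    queue = deque([(start_url, "START")])
    seen = {start_url}
    products = []

    while queue:
        url, label = queue.popleft()
        if label == "DETAIL":
            # A product detail page: this is where data extraction would happen.
            products.append(url)
            continue
        # A start or category page: enqueue its links under the next label.
        for link in SITE.get(url, []):
            if link not in seen:  # skip URLs that were already queued
                seen.add(link)
                queue.append((link, NEXT_LABEL[label]))
    return products


products = crawl("/collections")
print(products)
```

Note that /products/tv-2 is listed in both categories but is visited once, because the `seen` set filters out URLs that are already queued.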