Crawl4AI: Simple and Practical

Latest recommended article published on 2025-06-11 18:07:49

shykevin

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/shykevin/article/details/147525271

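1. Overview

Crawl4AI is an open-source web crawler and scraping tool written in Python. Its main purpose is to collect and process data for large language models (LLMs) and AI applications.

Features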

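Open source and free: released under the Apache-2.0 license, so developers can use, modify, and distribute the source code at no cost.

Designed for LLMs: cleans and converts web page data into formats that suit LLMs, such as JSON, cleaned HTML, and Markdown, so the output can be used directly for model training.

Efficient: crawls and processes multiple URLs in parallel, which shortens the time needed for large-scale data collection.

Broad extraction support: extracts text, media tags (images, audio, video), metadata, and internal and external links, and can also take page screenshots.

Highly customizable: users can set authentication, request headers, the user agent, page modifications before crawling, and JavaScript to execute. Crawl depth, crawl frequency, and extraction rules can also be adjusted to fit different page structures and data types.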
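The parallel crawling mentioned in the feature list can be sketched with `arun_many`, the `AsyncWebCrawler` method that takes a list of URLs. This is a minimal sketch, not the project's recommended setup: `crawl_many` and `summarize` are hypothetical helper names, and the crawl4ai import sits inside the function only so that the file can be loaded without the library installed.

```python
import asyncio


def summarize(results):
    """Map each crawled URL to whether its crawl succeeded."""
    return {r.url: bool(r.success) for r in results}


async def crawl_many(urls):
    """Crawl several URLs in one call and report per-URL success."""
    # Imported here (an illustrative choice) so that loading this file
    # does not require crawl4ai to be installed.
    from crawl4ai import AsyncWebCrawler

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls)
    return summarize(results)


# Usage (needs crawl4ai, a browser, and network access):
# print(asyncio.run(crawl_many(["https://36kr.com/information/AI/"])))
```

Keeping `summarize` separate from the crawl makes the reporting step easy to check without touching the network.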

Project links

GitHub: https://github.com/unclecode/crawl4ai

Crawl4AI website: https://crawl4ai.com/

2. Installation

Install the crawl4ai library

pip3 install crawl4ai

Set up the browser

python -m playwright install --with-deps chromium
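After both commands finish, a quick way to confirm that the packages are visible to the current Python interpreter is to look up their installed versions. This is a small sketch using only the standard library; `installed_version` is a hypothetical helper name, and it checks the Python packages only, not the Chromium download.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("crawl4ai", "playwright"):
    print(name, installed_version(name) or "not installed")
```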

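3. Crawling a Web Page

This example uses 36Kr. Open the page: https://36kr.com/information/AI/

Note: the news items in the middle of the page only appear after JavaScript has loaded them.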

36kr.py

import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig


async def main():
    # Per-run settings: scroll to the bottom to trigger lazy loading, then
    # wait until the news list container is present in the DOM.
    run_config = CrawlerRunConfig(
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        wait_for="css:.information-flow-list",
    )
    async with AsyncWebCrawler(verbose=True) as crawler:
        # The news list is loaded dynamically by JavaScript
        result = await crawler.arun(
            url="https://36kr.com/information/AI/",
            config=run_config,
        )
        assert result.success, "Crawl failed"
        # Serialize the whole result to a JSON string
        result_json = result.model_dump_json()
        # Save it to a JSON file
        with open("news.json", "w", encoding="utf-8") as f:
            f.write(result_json)


asyncio.run(main())

Notes:

window.scrollTo(0, document.body.scrollHeight); is a piece of JavaScript that scrolls the browser window to the bottom of the page.

wait_for="css:.information-flow-list" waits until an element with the class information-flow-list appears on the page. The css: prefix marks the value as a CSS selector. Waits like this are common in browser automation tools such as Playwright: later steps run only after the element has been rendered.

The class name depends on the page. To find it, inspect the page source with the browser's developer tools.
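Crawl4AI's documentation describes two forms for a wait condition: a `css:` prefix followed by a selector, and a `js:` prefix followed by a JavaScript function that returns true once the page is ready. The sketch below only builds those strings. The helper names are hypothetical, and the `.information-flow-list a` selector and the threshold of 5 are assumptions chosen for illustration, not values taken from the 36Kr page.

```python
import json


def css_wait(selector):
    """Wait condition: an element matching `selector` exists."""
    return f"css:{selector}"


def js_wait_for_count(selector, minimum):
    """Wait condition: at least `minimum` elements match `selector`."""
    # json.dumps quotes the selector safely for use inside JavaScript
    quoted = json.dumps(selector)
    return f"js:() => document.querySelectorAll({quoted}).length >= {minimum}"


print(css_wait(".information-flow-list"))
print(js_wait_for_count(".information-flow-list a", 5))
```

Either string can be passed as the wait condition. The `js:` form is useful when the container exists early but its items arrive later.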

Run the Python script

python3 36kr.py

Output:

[INIT].... → Crawl4AI 0.5.0.post8
[FETCH]... ↓ https://36kr.com/information/AI/... | Status: True | Time: 2.74s
[SCRAPE].. ◆ https://36kr.com/information/AI/... | Time: 0.182s
[COMPLETE] ● https://36kr.com/information/AI/... | Status: True | Total: 2.93s

After the script runs, you get a file named news.json.

Open the file and paste its contents into an online JSON formatter: https://www.sojson.com/

The formatted output shows the structure of the JSON. In each entry under media.images, the alt field holds the article title.
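The formatting step can also be done locally with Python's standard json module instead of a website. A minimal sketch: `pretty_json` is a hypothetical helper name, and the sample string is made-up data that only imitates the media.images layout.

```python
import json


def pretty_json(raw):
    """Re-indent a JSON string, keeping non-ASCII text readable."""
    return json.dumps(json.loads(raw), ensure_ascii=False, indent=2)


# Made-up sample that imitates the layout of news.json
sample = '{"media": {"images": [{"src": "//example.com/a.png", "alt": "示例标题"}]}}'
print(pretty_json(sample))
```

To format the real file, read it first, for example with `pretty_json(open("news.json", encoding="utf-8").read())`. Setting `ensure_ascii=False` keeps the Chinese titles readable instead of turning them into `\u` escapes.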

Next, modify the code to extract the titles:

import asyncio
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig


async def main():
    # Per-run settings: scroll to the bottom, then wait for the news list.
    # The css: prefix marks the wait condition as a CSS selector.
    run_config = CrawlerRunConfig(
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        wait_for="css:.information-flow-list",
    )
    async with AsyncWebCrawler(verbose=True) as crawler:
        # The news list is loaded dynamically by JavaScript
        result = await crawler.arun(
            url="https://36kr.com/information/AI/",
            config=run_config,
        )
        assert result.success, "Crawl failed"
        # Serialize the result to JSON, then parse it back into a dict
        news_data = json.loads(result.model_dump_json())
        for image in news_data["media"]["images"]:
            # Keep only non-ad entries: skip images whose src starts
            # with //static.36krcdn.com
            if not image["src"].startswith("//static.36krcdn.com"):
                print(image["alt"])


asyncio.run(main())

Run the script again. Output:

被DeepSeek打蒙的豆包，发起反攻了
将Agentic AI嵌入家庭网关，如何改变运营商在物联网市场的游戏规则?
有了一天涨万星的开源项目 Codex，OpenAI为何仍砸 30 亿美元重金收购 Windsurf ？
智谱获2亿元新融资，连发3款开源模型，拿3亿元支持全球开源社区
科技大厂掀起医疗界的AI革命，谁更有胜算？
鹅厂的 AI 大招，真的落在微信上
字节快手，AI视频“狭路又相逢”
英伟达CEO黄仁勋突然访华，都不穿皮衣了，还见了梁文锋

As the output shows, the titles are now extracted correctly.
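Because news.json from the first run already holds the full result, the titles can also be pulled from the saved file without crawling again. This is a minimal sketch, assuming the file has the media.images layout described earlier. `extract_titles` is a hypothetical helper, the sample data is made up, and skipping empty or repeated alt values is an addition that the article's script does not make.

```python
import json

AD_PREFIX = "//static.36krcdn.com"


def extract_titles(data):
    """Return unique, non-empty titles from media.images, skipping ads."""
    titles = []
    seen = set()
    for image in data.get("media", {}).get("images", []):
        src = image.get("src") or ""
        alt = (image.get("alt") or "").strip()
        # Skip ad images, entries without a title, and repeated titles
        if src.startswith(AD_PREFIX) or not alt or alt in seen:
            continue
        seen.add(alt)
        titles.append(alt)
    return titles


# Made-up sample that imitates the layout of news.json
sample = {
    "media": {
        "images": [
            {"src": "//static.36krcdn.com/ad.png", "alt": "ad banner"},
            {"src": "https://img.example/1.jpg", "alt": "Title A"},
            {"src": "https://img.example/2.jpg", "alt": "Title A"},
            {"src": "https://img.example/3.jpg", "alt": ""},
            {"src": "https://img.example/4.jpg", "alt": "Title B"},
        ]
    }
}
print(extract_titles(sample))

# Usage with the saved file:
# with open("news.json", encoding="utf-8") as f:
#     print(extract_titles(json.load(f)))
```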