Deploying the Firecrawl Crawler Locally to Enrich Your AI Knowledge Base

Latest recommended article published on 2025-05-10 08:47:56

测试游记

Tags: web crawler, artificial intelligence

Article link: https://blog.youkuaiyun.com/weixin_37786060/article/details/146929196

https://www.firecrawl.dev/

What is Firecrawl

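Firecrawl is a crawler that converts websites into Markdown that is easy for AI to process. It is offered mainly as an API service and needs no sitemap: given a single URL, it crawls that site and every accessible subpage beneath it.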
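To make the "one URL in, all subpages out" idea concrete, here is a minimal sketch of submitting a crawl job over HTTP with only the standard library. The base URL http://localhost:3002 matches the local deployment described later in this post. The /v1/crawl path and the body fields (url, limit, scrapeOptions.formats) are assumptions based on Firecrawl's v1 REST API and may differ in your version, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_crawl_request(base_url, start_url, limit=10):
    """Return (endpoint, body) for a crawl job.

    The endpoint path and field names are assumptions about the v1 API.
    """
    endpoint = base_url.rstrip("/") + "/v1/crawl"
    body = {
        "url": start_url,  # the single entry URL; no sitemap is supplied
        "limit": limit,  # cap on how many pages the job may visit
        "scrapeOptions": {"formats": ["markdown"]},
    }
    return endpoint, body


def start_crawl(base_url, start_url, limit=10, timeout=30):
    """POST the crawl job and return the decoded JSON response."""
    endpoint, body = build_crawl_request(base_url, start_url, limit)
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    endpoint, body = build_crawl_request("http://localhost:3002", "https://www.firecrawl.dev/")
    print(endpoint)
    print(json.dumps(body, indent=2))
```

Running the file only prints the request that would be sent. Calling start_crawl requires a running instance, and the shape of its response should be checked against the API documentation for your version.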

Deploying Firecrawl locally

https://github.com/mendableai/firecrawl/blob/main/CONTRIBUTING.md

For a simpler setup, you can use Docker Compose to run all services:

Prerequisites: Make sure you have Docker and Docker Compose installed
Copy the .env.example file to .env in the /apps/api/ directory and configure as needed
From the root directory, run: docker compose up
This will start Redis, the API server, and workers automatically in the correct configuration.

git clone https://github.com/mendableai/firecrawl.git
cd firecrawl

Create the .env file

cp apps/api/.env.example apps/api/.env

If you need the LLM features, set OPENAI_API_KEY and OPENAI_BASE_URL in that file:

OPENAI_API_KEY=xxx 
OPENAI_BASE_URL=xxx
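Before building, it can help to confirm that the two values were actually filled in. The following is a small sketch, not part of Firecrawl: the helper names read_env and missing_llm_settings are hypothetical, and the parser only handles simple KEY=VALUE lines.

```python
from pathlib import Path


def read_env(path):
    """Parse simple KEY=VALUE lines from a .env file, skipping blanks and comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_llm_settings(env, required=("OPENAI_API_KEY", "OPENAI_BASE_URL")):
    """Return the required keys that are absent, empty, or still the 'xxx' placeholder."""
    return [key for key in required if env.get(key, "") in ("", "xxx")]


if __name__ == "__main__":
    env_path = Path("apps/api/.env")
    if env_path.exists():
        print("Missing LLM settings:", missing_llm_settings(read_env(env_path)))
    else:
        print(f"{env_path} not found; copy .env.example first")
```

Run it from the repository root. An empty list means both settings hold something other than the placeholder.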

Build and start

docker compose build
docker compose up -d

In mainland China the Playwright download can be very slow. If so, edit "apps/playwright-service-ts/Dockerfile" to use mirrors:

RUN echo "deb http://mirrors.aliyun.com/debian/ bookworm main non-free contrib\n\
deb http://mirrors.aliyun.com/debian/ bookworm-updates main non-free contrib\n\
deb http://mirrors.aliyun.com/debian-security bookworm-security main non-free contrib" > /etc/apt/sources.list

# Install Playwright dependencies
ENV PLAYWRIGHT_DOWNLOAD_HOST=https://npmmirror.com/mirrors/playwright/
RUN npx playwright install --with-deps

Test it

curl -X GET http://localhost:3002/test
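The containers may take a little while to come up after docker compose up -d, so a script that depends on the API can poll the same /test endpoint first. This is a minimal sketch using only the standard library. The function names are hypothetical, and it assumes the endpoint answers with HTTP 200 once the API server is ready.

```python
import http.client
import time
import urllib.error
import urllib.request


def is_up(base_url="http://localhost:3002", timeout=3.0):
    """Return True if GET <base_url>/test answers with HTTP 200, else False."""
    url = base_url.rstrip("/") + "/test"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or a malformed reply
        return False


def wait_until_up(base_url="http://localhost:3002", attempts=10, delay=2.0):
    """Poll is_up() up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if is_up(base_url):
            return True
        time.sleep(delay)
    return False


if __name__ == "__main__":
    print("Firecrawl reachable:", wait_until_up(attempts=3, delay=1.0))
```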

Calling it from Python

pip install firecrawl-py

import logging
from firecrawl import FirecrawlApp

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def main():
    try:
        # Point the SDK at the self-hosted instance; no API key is passed for the local deployment
        app = FirecrawlApp(api_key=None, api_url="http://localhost:3002")
        # Request the page content as Markdown
        params = {
            'formats': ['markdown'],
        }
        logger.info("Starting to scrape the page...")
        scrape_status = app.scrape_url('https://www.kujiale.com/', params=params)
        logger.info("Scrape result:")
        print(scrape_status)
    except Exception as e:
        logger.error(f"Error while scraping: {e}")
        raise


if __name__ == "__main__":
    main()
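To feed a knowledge base, the scraped Markdown usually needs to land in a file. The sketch below shows one way to do that. All three helper names and the knowledge_base output folder are hypothetical. Because the shape of the scrape result can vary between firecrawl-py versions, the code assumes it is either a dict (possibly nested under "data") or an object with a markdown attribute.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def extract_markdown(result):
    """Return the Markdown text from a scrape result, or None if there is none."""
    if isinstance(result, dict):
        # Assumption: some versions nest the payload under "data"
        data = result.get("data", result)
        return data.get("markdown") if isinstance(data, dict) else None
    # Assumption: other versions return an object exposing .markdown
    return getattr(result, "markdown", None)


def filename_for(url):
    """Derive a filesystem-safe .md filename from the URL's host and path."""
    parsed = urlparse(url)
    slug = re.sub(r"[^A-Za-z0-9]+", "-", parsed.netloc + parsed.path).strip("-")
    return (slug or "page") + ".md"


def save_markdown(result, url, out_dir="knowledge_base"):
    """Write the result's Markdown under out_dir; return the path, or None if empty."""
    markdown = extract_markdown(result)
    if not markdown:
        return None
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / filename_for(url)
    target.write_text(markdown, encoding="utf-8")
    return target
```

With the script above, this would be used as save_markdown(scrape_status, 'https://www.kujiale.com/') right after the scrape call.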