FireCrawl: Deploying and Optimizing an AI Crawler System with Docker Compose

Latest recommended article published on 2025-11-13 21:26:45

License: CC 4.0 BY-SA

Tags:

#docker #artificial-intelligence #web-crawler #AI #Docker #Python

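Abstract

As AI technology develops rapidly, traditional web crawlers no longer meet modern applications' requirements for data quality and structure. FireCrawl, an emerging AI-driven crawler system, converts web content into structured data and so provides a high-quality data source for large language model (LLM) applications. This article explains how to deploy and optimize FireCrawl with Docker Compose, covering environment setup, configuration, and performance tuning. With it, AI application developers can quickly set up an efficient, stable web data collection system and customize it to their needs.

Main Text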

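1. FireCrawl System Overview

FireCrawl is an open-source AI crawler system designed for large language model applications. It converts web content into clean Markdown and extracts the related metadata, giving AI applications structured input data.

1.1 Core Features

FireCrawl has the following core features:

AI-driven data extraction: uses AI to extract the core content of a web page and filter out irrelevant information
Structured output: converts web content into Markdown, which is easy to process downstream
Distributed architecture: supports horizontal scaling to increase crawl throughput
Multiple crawl modes: supports single-page scraping, sitemap-based crawling, and smart crawling
API: exposes a RESTful API that is easy to integrate into other applications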

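1.2 System Architecture

FireCrawl uses a microservice architecture made up of the following core components:

API service: exposes the external RESTful API
Worker service: executes the actual crawl tasks
Playwright service: handles pages that require JavaScript rendering
Redis: serves as the task queue and cache

[Figure: overall FireCrawl architecture]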

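2. Environment Preparation

Before deploying FireCrawl, prepare the runtime environment.

2.1 System Requirements

FireCrawl's system requirements are modest. The recommended configuration is:

Operating system: Linux (Ubuntu 20.04+ recommended), macOS, or Windows (WSL2)
Memory: at least 4 GB RAM (8 GB or more recommended)
Storage: at least 10 GB of free disk space
Docker: version 19.03 or later
Docker Compose: version 1.27 or later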
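
The version requirements above can also be checked programmatically. The following is a minimal Python sketch, not part of FireCrawl: the helper names `parse_version` and `meets_minimum` are made up for illustration, and the sample strings only approximate what `docker --version` and `docker-compose --version` print.

```python
import re

# Minimum versions taken from the requirements list above.
MIN_DOCKER = (19, 3)
MIN_COMPOSE = (1, 27)


def parse_version(text):
    """Extract the first dotted version number from CLI output as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in: {text!r}")
    return tuple(int(part) for part in match.groups() if part is not None)


def meets_minimum(text, minimum):
    """Return True if the version found in `text` is at least `minimum`."""
    return parse_version(text)[: len(minimum)] >= minimum


# Sample strings (assumed shapes; real output can differ between releases).
print(meets_minimum("Docker version 24.0.5, build ced0996", MIN_DOCKER))
print(meets_minimum("Docker Compose version v2.20.0", MIN_COMPOSE))
print(meets_minimum("Docker version 18.09.7, build 2d0083d", MIN_DOCKER))
```

On a real machine the input strings would come from running the two commands, for example through `subprocess.run(..., capture_output=True, text=True)`.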

2.2 Installing Docker and Docker Compose

Before deploying, make sure Docker and Docker Compose are installed. The steps on Ubuntu are:

# Update the package index
sudo apt-get update

# Install prerequisite packages
sudo apt-get install \
    ca-certificates \
    curl \
    gnupg \
    lsb-release

# Add Docker's official GPG key
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg

# Set up the Docker repository
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# Update the package index again
sudo apt-get update

# Install Docker Engine
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-compose-plugin

# Verify the Docker installation
sudo docker run hello-world

Install Docker Compose as a standalone binary (if it was not installed along with Docker):

# Download Docker Compose
sudo curl -L "https://github.com/docker/compose/releases/download/v2.20.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose

# Make it executable
sudo chmod +x /usr/local/bin/docker-compose

# Verify the installation
docker-compose --version

3. Deploying FireCrawl

FireCrawl is deployed with Docker Compose, which makes it easy to manage and scale.

3.1 Configuring the Docker Compose File

Create a docker-compose.yml file that defines the FireCrawl services:

name: firecrawl

x-common-service: &common-service
  image: swr.cn-north-4.myhuaweicloud.com/ddn-k8s/ghcr.io/mendableai/firecrawl:latest
  ulimits:
    nofile:
      soft: 65535
      hard: 65535
  networks:
    - backend
  extra_hosts:
    - "host.docker.internal:host-gateway"
  deploy:
    resources:
      limits:
        memory: 2G
        cpus: '1.0'

x-common-env: &common-env
  REDIS_URL: ${REDIS_URL:-redis://redis:6381}
  REDIS_RATE_LIMIT_URL: ${REDIS_URL:-redis://redis:6381}
  PLAYWRIGHT_MICROSERVICE_URL: ${PLAYWRIGHT_MICROSERVICE_URL:-http://playwright-service:3000/scrape}
  USE_DB_AUTHENTICATION: ${USE_DB_AUTHENTICATION}
  OPENAI_API_KEY: ${OPENAI_API_KEY}
  LOGGING_LEVEL: ${LOGGING_LEVEL:-DEBUG}

services:
  playwright-service:
    image: swr.cn-north-4.myhuaweicloud.com/ddn-k8s/ghcr.io/mendableai/playwright-service:latest
    environment:
      PORT: 3000
      BLOCK_MEDIA: ${BLOCK_MEDIA:-true}
    networks:
      - backend
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '1.0'
    shm_size: 2gb

  api:
    <<: *common-service
    environment:
      <<: *common-env
      HOST: "0.0.0.0"
      PORT: ${INTERNAL_PORT:-8083}
    depends_on:
      - redis
      - playwright-service
    ports:
      - "${PORT:-8083}:${INTERNAL_PORT:-8083}"
    command: [ "pnpm", "run", "start:production" ]
    deploy:
      resources:
        limits:
          memory: 1G
          cpus: '0.5'

  worker:
    <<: *common-service
    environment:
      <<: *common-env
    depends_on:
      - redis
      - playwright-service
      - api
    command: [ "pnpm", "run", "workers" ]
    deploy:
      replicas: ${NUM_WORKER_REPLICAS:-2}
      resources:
        limits:
          memory: 2G
          cpus: '1.0'

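  redis:
    image: swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/library/redis:7.0.12
    # Redis listens on 6379 by default; start it on 6381 to match REDIS_URL
    command: [ "redis-server", "--port", "6381" ]
    networks:
      - backend
    ports:
      - "6381:6381"
    deploy:
      resources:
        limits:
          memory: 2G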
networks:
  backend:
    driver: bridge

3.2 Configuring Environment Variables

Create a .env file with the required environment variables:

# ===== Required ENVS ======
NUM_WORKER_REPLICAS=2
PORT=8083
HOST=0.0.0.0
REDIS_URL=redis://redis:6381
PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/scrape
USE_DB_AUTHENTICATION=false
LOGGING_LEVEL=DEBUG

# To use the OpenAI API for content processing, set the API key
# OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
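
Before starting the stack, it can help to confirm that the .env file actually defines every variable listed under "Required ENVS". The sketch below is a deliberately simplified parser written for this article (no quoting, `export`, or variable interpolation support); `parse_env` and `missing_keys` are hypothetical helper names, and the sample text is illustrative.

```python
# Keys listed under "Required ENVS" in the .env file above.
REQUIRED_KEYS = (
    "NUM_WORKER_REPLICAS",
    "PORT",
    "HOST",
    "REDIS_URL",
    "PLAYWRIGHT_MICROSERVICE_URL",
    "USE_DB_AUTHENTICATION",
    "LOGGING_LEVEL",
)


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or set to an empty value."""
    return [key for key in required if not values.get(key)]


# Illustrative input: one key is missing and one is left empty.
SAMPLE = """\
# ===== Required ENVS ======
NUM_WORKER_REPLICAS=2
PORT=8083
HOST=0.0.0.0
REDIS_URL=redis://redis:6381
USE_DB_AUTHENTICATION=false
LOGGING_LEVEL=
# OPENAI_API_KEY is optional and left unset here
"""

print(missing_keys(parse_env(SAMPLE)))
```

To check a real file, pass the result of `open(".env", encoding="utf-8").read()` to `parse_env`.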

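3.3 Starting the Services

Run the following commands in the project directory to start the services:

# Start all services
docker-compose up -d

# Check service status
docker-compose ps

Once startup completes, check the service logs with:

# View logs for all services
docker-compose logs

# View logs for a specific service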
docker-compose logs api
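
Containers reported as running are not necessarily ready to answer requests yet. The sketch below shows one way to wait for readiness from Python using only the standard library. It is an illustration under assumptions: `wait_until_ready` and `make_http_probe` are hypothetical helpers, and treating any HTTP answer (including an error status) as "the API is up" is a simplification.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(probe, attempts=10, delay=2.0, sleep=time.sleep):
    """Call `probe()` until it returns True; return the attempt number or None."""
    for attempt in range(1, attempts + 1):
        try:
            if probe():
                return attempt
        except Exception:
            pass  # e.g. connection refused: treat as "not ready yet"
        if attempt < attempts:
            sleep(delay)
    return None


def make_http_probe(url, timeout=3.0):
    """Build a probe that reports True once the server answers at `url`."""
    def probe():
        try:
            with urllib.request.urlopen(url, timeout=timeout):
                return True
        except urllib.error.HTTPError:
            # The server responded, even if with an error status: it is up.
            return True
    return probe


# Offline demonstration with a fake probe that succeeds on the third call.
fake_results = iter([False, False, True])
print(wait_until_ready(lambda: next(fake_results), delay=0))
```

Against the deployment described here, the probe would be `make_http_probe("http://localhost:8083")`, using the port from the .env file.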

4. Verification and Usage

Once the services are running, we can verify the system through its API.

4.1 Basic Scraping

Test basic scraping with curl:

curl -X POST http://localhost:8083/v1/scrape \
     -H "Content-Type: application/json" \
     -d '{"url":"https://example.com"}'

Expected response:

{
  "success": true,
  "data": {
    "markdown": "Example Domain\n==============\n\nThis domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)",
    "metadata": {
      "viewport": "width=device-width, initial-scale=1",
      "title": "Example Domain",
      "scrapeId": "147b760a-fb56-437d-86f6-651167c165b0",
      "sourceURL": "https://example.com",
      "url": "https://example.com",
      "statusCode": 200
    }
  }
}
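
Downstream code usually needs only a few fields from this response. The following sketch pulls them out defensively; it assumes the response shape shown above, and `summarize_scrape` is a hypothetical helper name rather than part of any FireCrawl SDK.

```python
def summarize_scrape(response):
    """Reduce a scrape response (shaped like the example above) to a few fields."""
    if not response.get("success"):
        raise ValueError(f"scrape did not succeed: {response!r}")
    data = response.get("data") or {}
    metadata = data.get("metadata") or {}
    markdown = data.get("markdown") or ""
    return {
        "title": metadata.get("title"),
        "status_code": metadata.get("statusCode"),
        "source_url": metadata.get("sourceURL"),
        "markdown_chars": len(markdown),
    }


# Abbreviated copy of the example response, used as offline test data.
sample = {
    "success": True,
    "data": {
        "markdown": "Example Domain\n==============",
        "metadata": {
            "title": "Example Domain",
            "sourceURL": "https://example.com",
            "statusCode": 200,
        },
    },
}

print(summarize_scrape(sample))
```

Raising on `success: false` keeps failed scrapes from silently entering later processing steps.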

4.2 Python Client Example

To make FireCrawl easier to use, we can write a small Python client:

import requests


class FireCrawlClient:
    """
    FireCrawl API client.
    Used to interact with a FireCrawl service.
    """

    def __init__(self, base_url="http://localhost:8083"):
        """
        Initialize the client.

        Args:
            base_url (str): Base URL of the FireCrawl API
        """
        self.base_url = base_url
        self.session = requests.Session()

    def _post(self, path, payload, timeout=30):
        """
        Send a JSON POST request and return the decoded response.

        Args:
            path (str): API path, such as "/v1/scrape"
            payload (dict): Request body
            timeout (int): Client-side timeout in seconds

        Returns:
            dict | None: Parsed JSON body, or None if the request failed
        """
        try:
            # `json=` serializes the payload and sets the Content-Type header
            response = self.session.post(
                f"{self.base_url}{path}", json=payload, timeout=timeout
            )

            # Raise an exception for 4xx/5xx responses
            response.raise_for_status()

            # Return the JSON data
            return response.json()

        except requests.exceptions.RequestException as e:
            print(f"Request failed: {e}")
            return None
        except ValueError as e:
            # The body was not valid JSON
            print(f"JSON parsing failed: {e}")
            return None

    def scrape_url(self, url, formats=None, headers=None):
        """
        Scrape the content of a single URL.

        Args:
            url (str): URL of the page to scrape
            formats (list): Output formats to return, such as ['markdown', 'html']
            headers (dict): Custom request headers

        Returns:
            dict: Scrape result
        """
        # Build the request body
        data = {"url": url}

        # Include the output formats if specified
        if formats:
            data["formats"] = formats

        # Include the custom headers if specified
        if headers:
            data["headers"] = headers

        return self._post("/v1/scrape", data)

    def crawl_url(self, url, max_depth=1, limit=50):
        """
        Crawl an entire site (based on its sitemap).

        Args:
            url (str): Root URL of the site to crawl
            max_depth (int): Maximum crawl depth
            limit (int): Maximum number of pages to crawl

        Returns:
            dict: Crawl result
        """
        # Build the request body
        data = {
            "url": url,
            "maxDepth": max_depth,
            "limit": limit
        }

        return self._post("/v1/crawl", data)

def main():
    """
    Entry point demonstrating how to use FireCrawlClient.
    """
    # Create a client instance
    client = FireCrawlClient()

    # Example 1: scrape a single page
    print("=== Scraping a single page ===")
    result = client.scrape_url("https://example.com")
    if result and result.get("success"):
        data = result.get("data", {})
        print(f"Title: {data.get('metadata', {}).get('title')}")
        print(f"Status code: {data.get('metadata', {}).get('statusCode')}")
        print("Markdown content:")
        print(data.get('markdown', '')[:200] + "...")
    else:
        print("Scrape failed:", result)

    print("\n" + "="*50 + "\n")

    # Example 2: crawl a site (with a page limit)
    print("=== Crawling a site ===")
    result = client.crawl_url("https://example.com", max_depth=1, limit=5)
    if result and result.get("success"):
        print("Crawl job submitted, job ID:", result.get("id"))
        # The result can be retrieved by polling at this point
    else:
        print("Crawl failed:", result)

if __name__ == "__main__":
    main()
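
The client above submits a crawl job and mentions that its result can be polled. A hedged sketch of such a polling loop follows. It assumes that job status is available from `GET /v1/crawl/<id>` and that the response carries a `status` field with terminal values such as `completed` or `failed`; verify both against the FireCrawl version you deploy. `poll_crawl` is a hypothetical helper, and the canned responses stand in for a live service.

```python
import time

# Assumed terminal values of the `status` field in the crawl status response.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def poll_crawl(fetch_status, interval=2.0, max_polls=30, sleep=time.sleep):
    """Call `fetch_status()` until the job reaches a terminal state.

    `fetch_status` is any zero-argument callable returning the status dict,
    for example (assumed endpoint):
        lambda: session.get(f"{base_url}/v1/crawl/{job_id}", timeout=30).json()
    """
    for poll in range(max_polls):
        status = fetch_status()
        if status.get("status") in TERMINAL_STATES:
            return status
        if poll < max_polls - 1:
            sleep(interval)
    raise TimeoutError(f"crawl still running after {max_polls} polls")


# Offline demonstration with canned responses instead of a live service.
canned = iter([
    {"status": "scraping", "completed": 1, "total": 5},
    {"status": "scraping", "completed": 4, "total": 5},
    {"status": "completed", "completed": 5, "total": 5, "data": []},
])
final = poll_crawl(lambda: next(canned), interval=0)
print(final["status"], final["completed"], "of", final["total"])
```

Bounding the loop with `max_polls` keeps a stalled job from blocking the caller indefinitely.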

5. Performance Optimization Strategies

To get the most out of FireCrawl, tune it for your actual workload.

5.1 Resource Allocation

Adjust the resource limits in docker-compose.yml to match your needs. For example, raise the worker's memory and CPU limits:

worker:
  deploy:
    resources:
      limits:
        memory: 4G
        cpus: '2.0'

5.2 Concurrency

Increase the number of worker replicas to raise concurrent processing capacity:

# Scale the number of workers dynamically
docker-compose up -d --scale worker=4
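
Each extra replica adds its own memory and CPU limit, so it is worth checking the total against the host before scaling. The arithmetic below uses the per-service limits from the docker-compose.yml in section 3.1 (worker at 2G and 1.0 CPU); the `LIMITS` table and `total_limits` function are illustrative names, and Redis is counted as 0 CPUs only because the file sets no CPU limit for it.

```python
# Per-container limits copied from the docker-compose.yml in section 3.1.
LIMITS = {
    "playwright-service": {"memory_gb": 2, "cpus": 1.0},
    "api": {"memory_gb": 1, "cpus": 0.5},
    "worker": {"memory_gb": 2, "cpus": 1.0},
    "redis": {"memory_gb": 2, "cpus": 0.0},  # no CPU limit set in the file
}


def total_limits(worker_replicas):
    """Return (total memory in GB, total CPUs) across all containers."""
    replicas = {name: 1 for name in LIMITS}
    replicas["worker"] = worker_replicas
    memory = sum(LIMITS[name]["memory_gb"] * replicas[name] for name in LIMITS)
    cpus = sum(LIMITS[name]["cpus"] * replicas[name] for name in LIMITS)
    return memory, cpus


print(total_limits(2))  # (9, 3.5)
print(total_limits(4))  # (13, 5.5)
```

With four workers the limits add up to 13 GB, which exceeds the 8 GB recommended in section 2.1, so the host would need more memory or lower per-container limits.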

5.3 Redis Performance

Adjust the Redis memory limit so that it has enough resources to handle the task queue:

redis:
  deploy:
    resources:
      limits:
        memory: 4G

5.4 System Monitoring

Monitor container resource usage in real time to keep the system stable:

# Monitor resource usage for all containers
watch -n 2 docker stats
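
For automated checks, `docker stats --no-stream --format "{{json .}}"` prints one JSON object per container, which is easier to parse than the interactive table. The sketch below flags containers above a memory threshold; `over_threshold` is a hypothetical helper, and the sample rows are made-up values that include only the fields the function reads.

```python
import json


def over_threshold(stats_output, mem_threshold=80.0):
    """Return (name, memory percent) for containers at or above the threshold."""
    hot = []
    for line in stats_output.splitlines():
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        mem_percent = float(row["MemPerc"].rstrip("%"))
        if mem_percent >= mem_threshold:
            hot.append((row["Name"], mem_percent))
    return hot


# Made-up sample rows in the shape of the JSON-formatted stats output.
SAMPLE = "\n".join([
    json.dumps({"Name": "firecrawl-api-1", "CPUPerc": "2.10%", "MemPerc": "34.50%"}),
    json.dumps({"Name": "firecrawl-worker-1", "CPUPerc": "35.20%", "MemPerc": "91.40%"}),
    json.dumps({"Name": "firecrawl-redis-1", "CPUPerc": "0.40%", "MemPerc": "12.00%"}),
])

print(over_threshold(SAMPLE))
```

A container that stays near its memory limit is a candidate for the higher limits described in section 5.1.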

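6. Practical Example

The following example shows how to use FireCrawl to build a knowledge base.

6.1 Background

Suppose we need to build an FAQ knowledge base for an AI customer service system by scraping the relevant pages from the company website.

6.2 Implementation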

import requests
import json
import time
from typing import List, Dict

class KnowledgeBaseBuilder:
    """
    Knowledge base builder.
    Scrapes website content and builds a knowledge base from it.
    """

    def __init__(self, firecrawl_url="http://localhost:8083"):
        """
        Initialize the knowledge base builder.

        Args:
            firecrawl_url (str): Address of the FireCrawl API
        """
        self.firecrawl_url = firecrawl_url
        self.session = requests.Session()

    def scrape_faq_pages(self, urls: List[str]) -> List[Dict]:
        """
        Scrape the content of FAQ pages.

        Args:
            urls (List[str]): List of FAQ page URLs

        Returns:
            List[Dict]: List of scrape results
        """
        results = []

        for url in urls:
            try:
                print(f"Scraping: {url}")

                # Call the FireCrawl API to scrape the page
                response = self.session.post(
                    f"{self.firecrawl_url}/v1/scrape",
                    headers={"Content-Type": "application/json"},
                    data=json.dumps({"url": url}),
                    timeout=30
                )

                response.raise_for_status()
                result = response.json()

                if result.get("success"):
                    data = result.get("data", {})
                    # Keep only the fields the knowledge base needs
                    faq_item = {
                        "url": url,
                        "title": data.get("metadata", {}).get("title", ""),
                        "content": data.get("markdown", ""),
                        "scrape_time": time.time()
                    }
                    results.append(faq_item)
                    print(f"Scraped successfully: {url}")
                else:
                    print(f"Scrape failed: {url}, error: {result}")

            except Exception as e:
                print(f"Error while processing URL {url}: {e}")

            # Pause between requests to avoid sending them too frequently
            time.sleep(1)

        return results

    def save_to_file(self, data: List[Dict], filename: str):
        """
        Save the data to a file.

        Args:
            data (List[Dict]): Data to save
            filename (str): File name
        """
        try:
            with open(filename, 'w', encoding='utf-8') as f:
                json.dump(data, f, ensure_ascii=False, indent=2)
            print(f"Data saved to {filename}")
        except Exception as e:
            print(f"Error while saving the file: {e}")

def main():
    """
    Entry point demonstrating how to build the knowledge base.
    """
    # List of FAQ page URLs
    faq_urls = [
        "https://example.com/faq/general",
        "https://example.com/faq/account",
        "https://example.com/faq/billing",
        "https://example.com/faq/technical"
    ]

    # Create the knowledge base builder
    builder = KnowledgeBaseBuilder()

    # Scrape the FAQ pages
    print("Scraping FAQ pages...")
    faq_data = builder.scrape_faq_pages(faq_urls)

    # Save to a file
    if faq_data:
        builder.save_to_file(faq_data, "faq_knowledge_base.json")
        print(f"Knowledge base built with {len(faq_data)} records")
    else:
        print("No data was retrieved")

if __name__ == "__main__":
    main()

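7. Caveats

Keep the following points in mind when using FireCrawl:

7.1 Resource Monitoring

Monitor container resource usage regularly to keep the system stable:

# Show container resource usage
docker stats

# Follow container logs
docker-compose logs -f --tail 100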

7.2 Error Handling

Common errors such as `Cant accept connection` and `WORKER STALLED` can be resolved by raising resource limits and tuning the configuration.
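
While limits are being tuned on the server side, clients can also tolerate brief overload by retrying with exponential backoff. This is a general client-side technique rather than something FireCrawl prescribes; `with_retries` is a hypothetical helper, and the flaky operation below only simulates a service that is busy on the first two calls.

```python
import time


def with_retries(operation, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run `operation()`, retrying on any exception with doubling delays."""
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as error:
            last_error = error
            if attempt < attempts - 1:
                # Waits of base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Offline demonstration: the operation fails twice, then succeeds.
calls = {"count": 0}
waits = []


def flaky_scrape():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("service busy")
    return {"success": True}


print(with_retries(flaky_scrape, sleep=waits.append))
print(waits)  # [1.0, 2.0]
```

In real use `operation` would wrap the HTTP call, and `sleep` would be left at its default so the delays actually take effect.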

7.3 Compliance

When scraping, comply with each site's robots.txt rules and with applicable laws and regulations:

import urllib.robotparser
from urllib.parse import urlparse

def check_robots_txt(url, user_agent="*"):
    """
    Check a site's robots.txt rules.

    Args:
        url (str): URL of the page to check
        user_agent (str): User agent

    Returns:
        bool: Whether scraping is allowed
    """
    try:
        # Derive the site's base URL from the page URL
        parsed_url = urlparse(url)
        base_url = f"{parsed_url.scheme}://{parsed_url.netloc}"

        # Fetch robots.txt
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"{base_url}/robots.txt")
        rp.read()

        # Check whether access is allowed
        return rp.can_fetch(user_agent, url)
    except Exception as e:
        print(f"Error while checking robots.txt: {e}")
        # Disallow access by default on error
        return False

# Usage example
url = "https://example.com/page"
if check_robots_txt(url):
    print("Scraping this page is allowed")
else:
    print("robots.txt disallows scraping this page")
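
The function above downloads robots.txt on every call. When many URLs from the same site are checked, the rules can be parsed once and reused. The sketch below shows this with `RobotFileParser.parse()` on an in-memory robots.txt, so it runs without network access; the sample rules are invented, and `build_parser` and `allowed_urls` are hypothetical helper names.

```python
import urllib.robotparser


def build_parser(robots_txt):
    """Create a parser from robots.txt content that is already in memory."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent="*"):
    """Keep only the URLs that the parsed rules allow for `user_agent`."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Invented rules for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = build_parser(SAMPLE_ROBOTS)
candidates = [
    "https://example.com/faq/general",
    "https://example.com/private/report",
]
print(allowed_urls(parser, candidates))
print(parser.crawl_delay("*"))
```

The `crawl_delay` value, when a site declares one, is a natural choice for the pause between requests in the knowledge base example.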

8. Best Practices

8.1 Resource Limits

Set sensible resource limits to avoid overloading the system:

# Set sensible resource limits in docker-compose.yml
services:
  worker:
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '1.0'
        reservations:
          memory: 1G
          cpus: '0.5'

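8.2 Concurrency

Adjust the number of worker replicas as needed to increase concurrency:

# Adjust the number of workers to match the load
docker-compose up -d --scale worker=3

8.3 Log Management

Set an appropriate log level to make troubleshooting easier:

# Set the log level in the .env file
LOGGING_LEVEL=INFO

8.4 Data Backup

Back up important data regularly:

# Trigger a Redis snapshot (add -p 6381 after redis-cli if Redis listens on the port used in REDIS_URL)
docker exec firecrawl-redis-1 redis-cli BGSAVE

# Back up a container volume (assumes a named volume firecrawl_data exists; the compose file above does not define one)
docker run --rm -v firecrawl_data:/data -v $(pwd):/backup alpine tar czf /backup/data.tar.gz -C /data .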

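9. Frequently Asked Questions

9.1 How do I resolve the `Cant accept connection` error?

This error is usually caused by insufficient resources. It can be addressed as follows:

Increase the worker's resource limits: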

worker:
  deploy:
    resources:
      limits:
        memory: 4G
        cpus: '2.0'

Increase the number of worker replicas:
```
docker-compose up -d --scale worker=4
```

9.2 How do I optimize Redis performance?

Redis performance can be improved in the following ways:

Increase the memory limit:

redis:
  deploy:
    resources:
      limits:
        memory: 4G

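Inspect memory usage and manage key expiry:

# Show Redis memory usage (add -p 6381 after redis-cli if Redis listens on the port used in REDIS_URL)
docker exec firecrawl-redis-1 redis-cli info memory

# Set a 3600-second TTL on a key so that Redis expires it automatically (replace "key" with a real key name)
docker exec firecrawl-redis-1 redis-cli EXPIRE key 3600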

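9.3 How do I handle timeouts when scraping large pages?

For large files or slow-loading pages, increase the timeout:

import json
import requests

endpoint = "http://localhost:8083/v1/scrape"
data = {"url": "https://example.com"}

# Increase the client-side timeout for the request
response = requests.post(
    endpoint,
    headers={"Content-Type": "application/json"},
    data=json.dumps(data),
    timeout=60  # raised to 60 seconds
)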
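
A longer client timeout only helps if the server is also willing to wait for the page. The sketch below builds both values together. It rests on two assumptions to verify against your deployment: that the scrape endpoint accepts a `timeout` field in milliseconds in the request body, and that a 10-second margin between server and client timeouts is adequate. `build_scrape_request` is a hypothetical helper; the `(connect, read)` tuple form of `timeout` is standard `requests` behaviour.

```python
def build_scrape_request(url, page_timeout_s=60, connect_timeout_s=5, margin_s=10):
    """Return (payload, client_timeout) with consistent timeout settings.

    payload["timeout"] is an assumed server-side page timeout in milliseconds.
    client_timeout is a (connect, read) tuple in seconds for `requests`; the
    read timeout is kept above the server-side value so the client does not
    give up first.
    """
    payload = {"url": url, "timeout": page_timeout_s * 1000}
    client_timeout = (connect_timeout_s, page_timeout_s + margin_s)
    return payload, client_timeout


payload, client_timeout = build_scrape_request("https://example.com")
print(payload)
print(client_timeout)

# With the requests library installed, the values would be used like this:
#   requests.post(endpoint, json=payload, timeout=client_timeout)
```

Separating the connect and read timeouts lets a client fail fast when the service is unreachable while still allowing slow pages to finish.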