Elastic Open Web Crawler as Code

Originally published 2025-09-23 09:19:23

License: CC 4.0 BY-SA

Learn how to manage Elastic Open Crawler configurations with GitHub Actions, so that every change we push to the repository is automatically applied to the deployed crawler instance.

Elasticsearch lets you index data quickly and flexibly. You can try it for free in the cloud or run it locally to see how easy indexing can be.

With Elastic Open Web Crawler and its CLI-driven architecture, versioned crawler configurations and a CI/CD pipeline with local testing are now straightforward to set up.

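Traditionally, managing crawlers has been a manual, error-prone process. It meant editing configurations directly in a UI and struggling with cloning crawl configurations, rolling back, and version control. Treating crawler configuration as code solves these problems because it provides the same benefits we expect in software development: repeatability, traceability, and automation.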

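This workflow makes it easier to bring the Open Web Crawler into a CI/CD pipeline for rollbacks, backups, and migrations. These tasks were much more cumbersome with earlier Elastic crawlers such as the Elastic Web Crawler or the App Search Crawler.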

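In this article, we will learn how to:

Manage our crawl configuration with GitHub
Set up a local environment to test the pipeline before deploying
Create a production environment that runs the web crawler with the new settings every time we push changes to the main branch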

The project repository is linked from the original article. At the time of writing, I was using Elasticsearch 9.1.3 and Open Web Crawler 0.4.2.

Prerequisites

Docker Desktop
An Elasticsearch instance
A virtual machine with SSH access and Docker installed (for example, AWS EC2)

Steps

Folder structure
Crawler configuration
Docker-compose file (local environment)
GitHub Actions
Testing locally
Deploying to production
Making changes and redeploying

Folder structure

For this project, we will use the following file structure:

├── docker-compose.yml # Local elasticsearch + crawler
├── config/crawler-config.yml # Crawler config
├── .github/workflows/deploy.yml # GH Action to deploy changes
└── local.sh # Script to run our local crawler

Crawler configuration

In config/crawler-config.yml, we will put the following:

output_sink: elasticsearch
output_index: web-crawl-index
max_crawl_depth: 1

elasticsearch:
  host: ${ES_HOST}
  api_key: ${ES_API_KEY}
     
domains:
  - url: https://web-scraping.dev
    seed_urls:
      - https://web-scraping.dev/product/1
      - https://web-scraping.dev/product/2
      - https://web-scraping.dev/product/3

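This crawls https://web-scraping.dev/products, a mock site for products. We will only crawl the first three product pages. The max_crawl_depth setting prevents the crawler from discovering pages beyond those defined in seed_urls, because it will not follow the links on them.

The Elasticsearch host and api_key are filled in dynamically, depending on the environment in which we run the script.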
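
As a minimal sketch of that substitution step, the helper below renders the template with envsubst, the same tool the scripts later in this article use. The function name render_config is hypothetical, and passing an explicit variable list is my own assumption rather than something the article does: it limits envsubst to the two placeholders in this file.

```shell
#!/bin/bash
# Sketch: render the crawler config template to stdout.
# Listing the variables limits envsubst to those two placeholders, so any
# other "$" text that ends up in the file later is left untouched.
render_config() {
  envsubst '${ES_HOST} ${ES_API_KEY}' < "$1"
}

# Example (not run here):
#   export ES_HOST="http://es01:9200"
#   render_config config/crawler-config.yml > config/crawl-config-final.yml
```

Keep in mind that envsubst replaces an unset variable with an empty string, so with only ES_HOST exported the rendered file ends up with an empty api_key value.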

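Docker-compose file (local environment)

For the local docker-compose.yml, we will deploy the crawler together with a single-node Elasticsearch cluster and Kibana, so we can easily visualize the crawl results before deploying to production.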

services:
  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:9.1.3
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms1g -Xmx1g
    ports:
      - "9200:9200"
    networks: [esnet]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9200"]
      interval: 5s
      timeout: 5s
      retries: 10

  kibana:
    image: docker.elastic.co/kibana/kibana:9.1.3
    environment:
      - ELASTICSEARCH_HOSTS=http://es01:9200
    ports:
      - "5601:5601"
    networks: [esnet]
    depends_on: [es01]

  crawler:
    image: docker.elastic.co/integrations/crawler:0.4.2
    environment:
      - ES_HOST=http://es01:9200
      - CRAWLER_JRUBY_OPTS=--server
    container_name: crawler
    volumes:
      - ./config:/home/app/config
    networks: [esnet]
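    entrypoint: ["/home/app/bin/crawler", "crawl", "/home/app/config/crawl-config-final.yml"]
    # Start the crawler only after the es01 healthcheck passes
    depends_on:
      es01:
        condition: service_healthy
    stdin_open: true
    tty: true

networks:
  esnet:
    driver: bridge

Note that the crawler waits for Elasticsearch to be ready before it runs: its depends_on condition holds it back until the es01 healthcheck passes.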

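GitHub Actions

Now we need to create a GitHub Action that, on every push to main, copies the new settings to the virtual machine and runs the crawler there. This ensures we always deploy the latest configuration, without having to log in to the VM manually to update files and run the crawler. We will use AWS EC2 as the VM provider.

The first step is to add the host (VM_HOST), machine user (VM_USER), SSH RSA key (VM_KEY), Elasticsearch host (ES_HOST), and Elasticsearch API key (ES_API_KEY) to the GitHub Actions secrets.

This way, the action can access our server to copy the new files and run the crawl.

Now, let's create the .github/workflows/deploy.yml file: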

name: Deploy

on:
  push:
    branches: [main]

jobs:
  Deploy:
    name: Deploy to EC2
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v5

      - name: Deploy crawler
        env:
          HOSTNAME: ${{ secrets.VM_HOST }}
          USER_NAME: ${{ secrets.VM_USER }}
          PRIVATE_KEY: ${{ secrets.VM_KEY }}
          ES_HOST: ${{ secrets.ES_HOST }}
          ES_API_KEY: ${{ secrets.ES_API_KEY }}
        run: |
          # Save private key
          echo "$PRIVATE_KEY" > private_key
          chmod 600 private_key

          # Generate final config locally
          envsubst < config/crawler-config.yml > config/crawl-config-final.yml

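          # Copy the config folder into the VM home directory. Targeting ~/ (not ~/config)
          # keeps later deploys from nesting the files under ~/config/config.
          scp -o StrictHostKeyChecking=no -i private_key -r config ${USER_NAME}@${HOSTNAME}:~/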

          # SSH into VM and run crawler
          ssh -o StrictHostKeyChecking=no -i private_key ${USER_NAME}@${HOSTNAME} << EOF
            docker run --rm \
              -v ~/config:/config \
              docker.elastic.co/integrations/crawler:latest jruby \
              bin/crawler crawl /config/crawl-config-final.yml
          EOF

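Every time we push changes to the crawler configuration file on main, this action performs the following steps:

Fill in the Elasticsearch host and API key in the yml configuration
Copy the config folder to our VM
Connect to the VM over SSH
Run the crawl with the configuration just copied from the repository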
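
Because envsubst substitutes an empty string for any unset variable, a missing secret would produce a config with a blank host or API key instead of failing the workflow. The sketch below shows one way to fail fast before rendering. The require_vars helper is hypothetical and not part of the article's workflow.

```shell
#!/bin/bash
# Sketch: stop early when a variable needed by the config template is empty.
# Returns non-zero and names every missing variable on stderr.
require_vars() {
  local missing=0 name
  for name in "$@"; do
    # ${!name:-} reads the variable whose name is stored in $name
    if [ -z "${!name:-}" ]; then
      echo "missing value for $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example (not run here), placed before the envsubst line of the deploy step:
#   require_vars ES_HOST ES_API_KEY || exit 1
```

Locally, where security is disabled and only ES_HOST is exported, the call would list just ES_HOST.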

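Testing locally

To test our crawler locally, we create a bash script that fills in the configuration with the local Elasticsearch host running in Docker and starts the crawl. You can execute it by running ./local.sh.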

#!/bin/bash

# Exit on any error
set -e

# Load environment variables
export ES_HOST="http://es01:9200"

# Generate final crawler config
envsubst < ./config/crawler-config.yml > ./config/crawl-config-final.yml

# Bring everything up
docker compose up --build

Let's check Kibana DevTools to confirm that web-crawl-index was populated correctly.
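
The same check can be made from the terminal. This is a rough sketch that assumes the local stack from docker-compose.yml is running, with port 9200 published and security disabled as configured above. The helper names parse_count and index_doc_count are hypothetical.

```shell
#!/bin/bash
# Extract the numeric "count" field from an Elasticsearch _count response.
parse_count() {
  sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Ask the local cluster how many documents an index currently holds.
index_doc_count() {
  curl -s "http://localhost:9200/$1/_count" | parse_count
}

# Example (not run here):
#   index_doc_count web-crawl-index
```

With three seed URLs and max_crawl_depth set to 1, a count of three documents is what I would expect, one per product page.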

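Deploying to production

We are now ready to push the changes to the main branch, which will deploy the crawler in your virtual machine and start sending crawled documents to your Serverless Elasticsearch instance.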

git add .
git commit -m "First commit"
git push

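This triggers the GitHub Action, which runs the deployment script in the virtual machine and starts the crawl.

You can confirm that the action ran by opening the GitHub repository and going to the "Actions" tab.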

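Making changes and redeploying

You may have noticed that each product's price is part of the document's body field. Ideally, we would store the price in a separate field so that we can run filters against it.

Let's add this change to the crawler-config.yml file, using an extraction rule to pull the price from the product-price CSS class: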

output_sink: elasticsearch
output_index: web-crawl-index
max_crawl_depth: 1

elasticsearch:
  host: ${ES_HOST}
  api_key: ${ES_API_KEY}
     
  # Index ingest pipeline to process documents before indexing          
  pipeline_enabled: true
  pipeline: pricing-pipeline

domains:
  - url: https://web-scraping.dev
    seed_urls:
      - https://web-scraping.dev/product/1
      - https://web-scraping.dev/product/2
      - https://web-scraping.dev/product/3
    extraction_rulesets:
      - url_filters:
          - type: ends
            pattern: /product/*
        rules:
          - action: extract
            field_name: price
            selector: .product-price
            join_as: string
            source: html

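We also notice that the price includes a dollar sign ($), which we have to remove if we want to run range queries. We can do that with an ingest pipeline. Note that the new crawler configuration above already references it: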

PUT _ingest/pipeline/pricing-pipeline
{
  "processors": [
    {
      "script": {
        "source": """
                ctx['price'] = ctx['price'].replace("$","")
            """
      }
    }
  ]
}

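We can run that command in the production Elasticsearch cluster. The development environment is ephemeral, so there we can make pipeline creation part of the docker-compose.yml file by adding the following service. Note that we also add depends_on to the crawler service, so that it starts only after the pipeline has been created successfully.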

  crawler:
    image: docker.elastic.co/integrations/crawler:0.4.2
    environment:
      - ES_HOST=http://es01:9200
      - CRAWLER_JRUBY_OPTS=--server
    container_name: crawler
    volumes:
      - ./config:/home/app/config
    networks: [esnet]
    entrypoint: ["/home/app/bin/crawler", "crawl", "/home/app/config/crawl-config-final.yml"]
    depends_on:
      pipeline-init:
        condition: service_completed_successfully
    stdin_open: true
    tty: true  


  pipeline-init:
    image: curlimages/curl:latest
    depends_on:
      es01:
        condition: service_healthy
    networks: [esnet]
    entrypoint: >
        sh -c "
        echo 'Creating ingest pipeline...';
        curl -s -X PUT http://es01:9200/_ingest/pipeline/pricing-pipeline \\
          -H 'Content-Type: application/json' \\
          -d '{\"processors\":[{\"script\":{\"source\":\"ctx.price = ctx.price.replace(\\\"$\\\", \\\"\\\")\"}}]}';
        echo 'Pipeline created!';
        "

Now let's run ./local.sh to see the changes locally.
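
To exercise the pipeline on its own, the Elasticsearch simulate endpoint (POST _ingest/pipeline/<name>/_simulate) runs a sample document through it without indexing anything. The sketch below assumes the local stack is up on http://localhost:9200. The helper names and the sample price are made up for illustration.

```shell
#!/bin/bash
# Build a _simulate request body containing one document with the given price.
# Note: the value is inserted as-is, so it must not contain quotes or backslashes.
simulate_body() {
  printf '{"docs":[{"_source":{"price":"%s"}}]}' "$1"
}

# Run the sample document through pricing-pipeline on the local cluster.
simulate_price() {
  curl -s -X POST "http://localhost:9200/_ingest/pipeline/pricing-pipeline/_simulate" \
    -H 'Content-Type: application/json' \
    -d "$(simulate_body "$1")"
}

# Example (not run here); the sample price is illustrative:
#   simulate_price '$9.99'
```

The document in the response should carry the price without the dollar sign. The value is still a string at this point, so numeric range queries would also depend on how the price field is mapped.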

Great! Now let's push the changes:

git add config/crawler-config.yml docker-compose.yml
git commit -m "added price CSS selector"
git push

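To confirm that everything works, check your production Kibana. It should reflect the change and show the price, without the dollar sign, as a new field.

Conclusion

Elastic Open Web Crawler lets you manage your crawler as code. This means you can automate the whole process, from development to deployment, and add ephemeral local environments and programmatic tests on the crawled data, to name just a few possibilities.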

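You can clone the official repository and start indexing your own data with this workflow. You can also read the related article on running semantic search over the indices produced by the crawler.

Original article: https://www.elastic.co/search-labs/blog/elastic-open-crawler-config-as-code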