21. Building a Flexible Scraping Service with Docker and Elastic Cloud

QuietPulse

Published 2025-09-07 13:34:17

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/c2d3e4f/article/details/151335546

1. Creating the Scraper Microservices with Docker Compose

Docker Compose uses a docker-compose.yml file to tell Docker how to compose containers into services. The following docker-compose.yml starts each part of the scraper as a service:

version: '3'
services:
  api:
    image: scraper-rest-api
    ports:
      - "8080:8080"
    networks:
      - scraper-compose-net
  scraper:
    image: scraping-microservice
    depends_on:
      - rabbitmq
    networks:
      - scraper-compose-net
  elastic:
    image: docker.elastic.co/elasticsearch/elasticsearch:6.1.1
    ports:
      - "9200:9200"
      - "9300:9300"
    networks:
      - scraper-compose-net
  rabbitmq:
    image: rabbitmq:3-management
    ports:
      - "15672:15672"
    networks:
      - scraper-compose-net
networks:
  scraper-compose-net:
    driver: bridge

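This file defines four services, api, scraper, elastic, and rabbitmq, and describes how each is created. The image key names the Docker image each service runs, ports maps container ports to host ports, and networks lists the networks the service joins; here the single network is declared as a bridge network. The depends_on key on the scraper service says it depends on rabbitmq, so Compose starts rabbitmq first. Note that depends_on only orders container start-up; it does not wait for RabbitMQ to be ready to accept connections.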
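Since start-up order alone does not mean RabbitMQ is accepting connections, a service that depends on it may want to wait before connecting. The following is a minimal sketch of such a wait loop using only the standard library. The function name wait_for_port and its parameters are illustrative and not part of the original scraper code; the host name rabbitmq and AMQP port 5672 in the usage note are assumptions based on the service name and the default RabbitMQ port.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Inside the scraper container this could be called as wait_for_port("rabbitmq", 5672) before opening the broker connection. An open port is only a rough readiness signal, so a retry around the actual connection is still worth having.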

Start the services as follows:
1. Open a terminal and change to the directory that contains docker-compose.yml.
2. Run the following command to start the services:

$ docker-compose up

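Compose reads the configuration and works out what to do, then prints a large amount of output, because every container's output is streamed to this terminal. The output begins with something like: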

Starting 10_api_1 ...
 Recreating elastic ...
 Starting rabbitmq ...
 Starting rabbitmq
 Recreating elastic
 Starting rabbitmq ... done
 Starting 10_scraper_1 ...
 Recreating elastic ... done
 Attaching to rabbitmq, 10_api_1, 10_scraper_1, 10_elastic_1

In another terminal, run docker ps to see the containers that were started:

$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
2ed0d456ffa0 docker.elastic.co/elasticsearch/elasticsearch:6.1.1 "/usr/local/bin/do..." 3 minutes ago Up 2 minutes 0.0.0.0:9200->9200/tcp, 0.0.0.0:9300->9300/tcp 10_elastic_1
8395989fac8d scraping-microservice "nameko run --brok..." 26 minutes ago Up 3 minutes 10_scraper_1
4e9fe8479db5 rabbitmq:3-management "docker-entrypoint..." 26 minutes ago Up 3 minutes 4369/tcp, 5671-5672/tcp, 15671/tcp, 25672/tcp, 0.0.0.0:15672->15672/tcp rabbitmq
0b0df48a7201 scraper-rest-api "python -u api.py" 26 minutes ago Up 3 minutes 0.0.0.0:8080->8080/tcp 10_api_1

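Each service container name carries two identifiers. The prefix is the project name, which defaults to the name of the folder the composition is run from, here 10 (giving the prefix 10_). You can set a different one with the -p option, which goes before the subcommand, as in docker-compose -p <name> up. The trailing number is the instance number of the container for that service. In this scenario only one container was started per service, so they all end in _1.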
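The naming scheme can be reproduced in a few lines. This is only an illustration of the pattern visible in the output above (project, service, instance number joined by underscores), not Compose's own code; the helper name container_name is made up for this sketch.

```python
def container_name(project, service, index=1):
    """Build a name in the <project>_<service>_<index> pattern seen above."""
    return "{}_{}_{}".format(project, service, index)


# Names for three instances of the scraper service in project "10".
names = [container_name("10", "scraper", i) for i in range(1, 4)]
print(names)
```

Newer Compose releases (the docker compose plugin) join the parts with hyphens by default, so the exact separator depends on the version in use.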

Verify that the network was created:

$ docker network ls | head -n 2
 NETWORK ID NAME DRIVER SCOPE
 0e27be3e30f2 10_scraper-compose-net bridge local

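If no network is specified, Compose creates a default network and attaches every service to it. In this example either approach works, but in more complex scenarios the default network may not be suitable.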

Verify that the services are working:
- Call the REST scraping API:

$ curl localhost:8080/joblisting/122517
 "{\"ID\": \"122517\", \"JSON\": {\"@context\":
\"http://schema.org\", \"@type\": \"JobPosting\", \"title\":
\"SpaceX Enterprise Software Engineer, Full Stack\", \"
...

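- Check that Elasticsearch is running. Nothing has been indexed yet, so an index_not_found_exception is the expected reply, and it still shows the server is answering: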

$ curl localhost:9200/joblisting
{"error":{"root_cause":[{"type":"index_not_found_exception","reason
":"no such
index","resource.type":"index_or_alias","resource.id":"joblisting",
"index_uuid":"_na_","index":"j
...
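The two curl checks can also be scripted. The sketch below, using only the standard library, treats any HTTP reply (including the 404 that Elasticsearch returns for a missing index) as proof the service is up, and a refused connection as down. The function name probe is hypothetical.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return the HTTP status code for url, or None if no server answered."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, just not with 2xx (e.g. 404 for a missing index).
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: nothing usable is listening.
        return None
```

With the composition running, probe("http://localhost:8080/joblisting/122517") and probe("http://localhost:9200/joblisting") should each return a status code rather than None.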

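Scaling up:
To add microservice containers and increase request-handling capacity, scale the service with docker-compose. The following command raises the number of scraper containers to 3: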

docker-compose up --scale scraper=3

Compose starts two more scraper containers and reports what it did:

10_api_1 is up-to-date
10_elastic_1 is up-to-date
10_rabbitmq_1 is up-to-date
Starting 10_scraper_1 ... done
Creating 10_scraper_2 ...
Creating 10_scraper_3 ...
Creating 10_scraper_2 ... done
Creating 10_scraper_3 ... done
Attaching to 10_api_1, 10_elastic_1, 10_rabbitmq_1, 10_scraper_1,
10_scraper_3, 10_scraper_2

Run docker ps again to see the three scraper containers running:

$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b9c2da0c9008 scraping-microservice "nameko run --brok..." About a minute ago Up About a minute 10_scraper_2
643221f85364 scraping-microservice "nameko run --brok..." About a minute ago Up About a minute 10_scraper_3
73dc31fb3d92 scraping-microservice "nameko run --brok..." 6 minutes ago Up 6 minutes 10_scraper_1
5dd0db072483 scraper-rest-api "python api.py" 7 minutes ago Up 7 minutes 0.0.0.0:8080->8080/tcp 10_api_1
d8e25b6ce69a rabbitmq:3-management "docker-entrypoint..." 7 minutes ago Up 7 minutes 4369/tcp, 5671-5672/tcp, 15671/tcp, 25672/tcp, 0.0.0.0:15672->15672/tcp 10_rabbitmq_1
f305f81ae2a3 docker.elastic.co/elasticsearch/elasticsearch:6.1.1 "/usr/local/bin/do..." 7 minutes ago Up 7 minutes 0.0.0.0:9200->9200/tcp, 0.0.0.0:9300->9300/tcp 10_elastic_1

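There are now three containers running, named 10_scraper_1, 10_scraper_2, and 10_scraper_3. The RabbitMQ management UI shows three connections, each from a different IP address. On the bridge network in this run, Compose assigned addresses in 172.23.0.x, starting at .2. Every scrape request from the API is routed to the rabbitmq container, and RabbitMQ distributes the messages across all active connections, which is how the processing capacity scales.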
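To picture how the load spreads, here is a toy model of round-robin dispatch, which is RabbitMQ's default behaviour when several consumers read from the same queue. It is a simulation only, written under the assumption that all three scrapers consume from one queue; the job names are invented and no broker is involved.

```python
from collections import Counter
from itertools import cycle


def dispatch(messages, consumers):
    """Assign each message to the next consumer in turn (round-robin)."""
    assignment = {}
    turn = cycle(consumers)
    for msg in messages:
        assignment[msg] = next(turn)
    return assignment


consumers = ["10_scraper_1", "10_scraper_2", "10_scraper_3"]
jobs = ["job-%d" % n for n in range(6)]  # hypothetical scrape requests
load = Counter(dispatch(jobs, consumers).values())
print(load)  # each scraper ends up with two of the six jobs
```

Adding a fourth name to consumers mirrors what scaling the service does: the same stream of requests is split across more workers.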

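Scaling down:
Pass a smaller container count to scale a service down; Compose removes containers until the requested number remains.
Stopping the services:
When you are finished, stop and remove all containers and the network with: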

$ docker-compose down
Stopping 10_scraper_1 ... done
Stopping 10_rabbitmq_1 ... done
Stopping 10_api_1 ... done
Stopping 10_elastic_1 ... done
Removing 10_scraper_1 ... done
Removing 10_rabbitmq_1 ... done
Removing 10_api_1 ... done
Removing 10_elastic_1 ... done
Removing network 10_scraper-compose-net

Running docker ps now shows that all the containers have been removed.

2. Creating and Configuring an Elastic Cloud Trial Account

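Running an Elasticsearch container on AWS brings complications such as memory requirements and system configuration, so Elastic Cloud can be used as a hosted service instead. To create and configure a trial account:
1. Open a browser and go to the Elastic Cloud sign-up page.
2. Enter your email address and click the "Start Free Trial" button. Complete verification when the confirmation email arrives.
3. Choose a cloud provider and region, for example AWS in Oregon (us-west-2), then click the "Create" button.
4. Record the username and password that are displayed, along with the Elasticsearch URL.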

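3. Accessing the Elastic Cloud Cluster with curl

Elasticsearch is accessed through a REST API, and Elastic Cloud is no different. To reach the cluster with curl:
1. Signing up for Elastic Cloud gives you various endpoints and variables, such as the username, password, and URL. The URL looks like this: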

https://<account-id>.us-west-2.aws.found.io:9243

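The domain and port may differ depending on the cloud provider and region.
2. Communicate and authenticate using a URL that includes the username and password: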

https://<username>:<password>@<account-id>.us-west-2.aws.found.io:9243

For example:

https://elastic:tduhdExunhEWPjSuH73O6yLS@7dc72d3327076cc4daf5528103c46a27.us-west-2.aws.found.io:9243

Use curl to check basic authentication and connectivity:

$ curl https://elastic:tduhdExunhEWPjSuH73O6yLS@7dc72d3327076cc4daf5528103c46a27.us-west-2.aws.found.io:9243
{
  "name": "instance-0000000001",
  "cluster_name": "7dc72d3327076cc4daf5528103c46a27",
  "cluster_uuid": "g9UMPEo-QRaZdIlgmOA7hg",
  "version": {
    "number": "6.1.1",
    "build_hash": "bd92e7f",
    "build_date": "2017-12-17T20:23:25.338Z",
    "build_snapshot": false,
    "lucene_version": "7.1.0",
    "minimum_wire_compatibility_version": "5.6.0",
    "minimum_index_compatibility_version": "5.0.0"
  },
  "tagline": "You Know, for Search"
}
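One practical detail when putting credentials in a URL: characters such as @ or / in a password have to be percent-encoded, or the URL will be parsed incorrectly. A minimal sketch of building the URL safely follows; the helper name cloud_url, the sample password, and the host cluster.example.com are placeholders invented for illustration.

```python
from urllib.parse import quote


def cloud_url(username, password, host, port=9243):
    """Build https://user:pass@host:port, percent-encoding the credentials."""
    return "https://{}:{}@{}:{}".format(
        quote(username, safe=""), quote(password, safe=""), host, port)


# Placeholder values; substitute the ones shown when the cluster was created.
print(cloud_url("elastic", "p@ss/word", "cluster.example.com"))
```

Passwords made only of letters and digits, like the one in this article, come through unchanged.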

4. Connecting to the Elastic Cloud Cluster with Python

The Elasticsearch Python library can connect to the Elastic Cloud cluster. The steps are:
1. Run the 11/01/elasticcloud_starwars.py script:

$ python elasticcloud_starwars.py

The script is as follows:

from elasticsearch import Elasticsearch
import requests
import json

if __name__ == '__main__':
    # Connect using the URL that embeds the username and password.
    es = Elasticsearch(
        [
            "https://elastic:tduhdExunhEWPjSuH73O6yLS@7dc72d3327076cc4daf5528103c46a27.us-west-2.aws.found.io:9243"
        ])
    i = 1
    while i < 20:
        r = requests.get('http://swapi.co/api/people/' + str(i))
        # Compare by value; "is not" tests object identity, not equality.
        if r.status_code != 200:
            print("Got a " + str(r.status_code) + " so stopping")
            break
        j = json.loads(r.content)
        print(i, j["name"])
        # Store the character in the sw index with document type people.
        es.index(index='sw', doc_type='people', id=i, body=j)
        i = i + 1

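The script loops over up to 19 Star Wars characters (IDs 1 to 19) and stores each in the sw index with document type people. It connects with the URL that embeds the username and password, fetches each character with a GET request to swapi.co, and stores it with the .index() method of the Elasticsearch object. Running it produces output like the following; the loop stops early here because ID 17 returns a 404: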

1 Luke Skywalker
2 C-3PO
3 R2-D2
4 Darth Vader
5 Leia Organa
6 Owen Lars
7 Beru Whitesun lars
8 R5-D4
9 Biggs Darklighter
10 Obi-Wan Kenobi
11 Anakin Skywalker
12 Wilhuff Tarkin
13 Chewbacca
14 Han Solo
15 Greedo
16 Jabba Desilijic Tiure
Got a 404 so stopping
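Indexing one document per request is fine for a handful of records. For larger batches, one possible variation is the bulk helper that ships with the Elasticsearch Python client. The sketch below assumes the same 6.x-era client as the script above (hence the _type field); the function names to_actions and index_people are made up here, and people is assumed to be a list of (id, document) pairs already fetched from the API.

```python
def to_actions(people, index="sw", doc_type="people"):
    """Yield one bulk action per (id, document) pair."""
    for doc_id, doc in people:
        yield {
            "_index": index,
            "_type": doc_type,  # 6.x-style document type, as in the script above
            "_id": doc_id,
            "_source": doc,
        }


def index_people(es, people):
    """Send all documents in one bulk request; es is an Elasticsearch client."""
    # Imported lazily so to_actions can be used without the client installed.
    from elasticsearch.helpers import bulk
    return bulk(es, to_actions(people))
```

The helper returns a tuple of the number of successfully indexed documents and a list of errors, which makes it easy to confirm that all characters were stored.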

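5. Visualizing the Data with Kibana

Signing up for Elastic Cloud also gives you a Kibana URL. Kibana is a powerful graphical front end for Elasticsearch. To visualize the data:
1. Open a browser, go to the Kibana URL, and log in with your username and password.
2. Create an index pattern:
- Enter sw* in the index pattern text box and click the "Next step" button.
- Select "I don't want to use the Time Filter" and click the "Create Index Pattern" button.
3. Click the "Discover" menu item to see the data stored in Elasticsearch earlier.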

6. Running Elasticsearch Queries with the Python API

The Elasticsearch Python library can run simple queries against the Star Wars index. The steps are:
1. Run the 11/02/search_starwars_by_haircolor.py script:

from elasticsearch import Elasticsearch
import json

es = Elasticsearch(
    [
        "https://elastic:tduhdExunhEWPjSuH73O6yLS@7dc72d3327076cc4daf5528103c46a27.us-west-2.aws.found.io:9243"
    ])
search_definition = {
    "query": {
        "match": {
            "hair_color": "blond"
        }
    }
}
result = es.search(index="sw", doc_type="people", body=search_definition)
print(json.dumps(result, indent=4))

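The script expresses an Elasticsearch DSL query as a dictionary, asking for every document whose hair_color property matches blond. The dictionary is passed as the body argument of the .search() method, which returns a dictionary holding metadata about the query and the matching documents. The output is: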

{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 1.3112576,
        "hits": [
            {
                "_index": "sw",
                "_type": "people",
                "_id": "1",
                "_score": 1.3112576,
                "_source": {
                    "name": "Luke Skywalker",
                    "height": "172",
                    "mass": "77",
                    "hair_color": "blond",
                    "skin_color": "fair",
                    "eye_color": "blue",
                    "birth_year": "19BBY",
                    "gender": "male",
                    "homeworld": "https://swapi.co/api/planets/1/",
                    "films": [
                        "https://swapi.co/api/films/2/",
                        "https://swapi.co/api/films/6/",
                        "https://swapi.co/api/films/3/",
                        "https://swapi.co/api/films/1/",
                        "https://swapi.co/api/films/7/"
                    ],
                    "species": [
                        "https://swapi.co/api/species/1/"
                    ],
                    "vehicles": [
                        "https://swapi.co/api/vehicles/14/",
                        "https://swapi.co/api/vehicles/30/"
                    ],
                    "starships": [
                        "https://swapi.co/api/starships/12/",
                        "https://swapi.co/api/starships/22/"
                    ],
                    "created": "2014-12-09T13:50:51.644000Z",
                    "edited": "2014-12-20T21:17:56.891000Z",
                    "url": "https://swapi.co/api/people/1/"
                }
            },
            {
                "_index": "sw",
                "_type": "people",
                "_id": "11",
                "_score": 0.80259144,
                "_source": {
                    "name": "Anakin Skywalker",
                    "height": "188",
                    "mass": "84",
                    "hair_color": "blond",
                    "skin_color": "fair",
                    "eye_color": "blue",
                    "birth_year": "41.9BBY",
                    "gender": "male",
                    "homeworld": "https://swapi.co/api/planets/1/",
                    "films": [
                        "https://swapi.co/api/films/5/",
                        "https://swapi.co/api/films/4/",
                        "https://swapi.co/api/films/6/"
                    ],
                    "species": [
                        "https://swapi.co/api/species/1/"
                    ],
                    "vehicles": [
                        "https://swapi.co/api/vehicles/44/",
                        "https://swapi.co/api/vehicles/46/"
                    ],
                    "starships": [
                        "https://swapi.co/api/starships/59/",
                        "https://swapi.co/api/starships/65/",
                        "https://swapi.co/api/starships/39/"
                    ],
                    "created": "2014-12-10T16:20:44.310000Z",
                    "edited": "2014-12-20T21:17:50.327000Z",
                    "url": "https://swapi.co/api/people/11/"
                }
            }
        ]
    }
}

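The result holds metadata about the query execution and the matching documents. Each hit returns the document itself along with the index name, document type, document ID, and score. The score is Lucene's measure of how relevant the document is to the query; even when the query is an exact match, different documents can receive different scores, as with 1.31 and 0.80 above.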
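In practice you rarely print the whole response; you walk result["hits"]["hits"] and pull out the fields you need. A small sketch, using a trimmed-down copy of the response above as sample data (the helper name summarize_hits is illustrative):

```python
def summarize_hits(result):
    """Return (id, name, score) for every hit in a search response dict."""
    return [
        (hit["_id"], hit["_source"].get("name"), hit["_score"])
        for hit in result["hits"]["hits"]
    ]


# Trimmed-down copy of the response shown above.
sample = {
    "hits": {
        "total": 2,
        "hits": [
            {"_id": "1", "_score": 1.3112576,
             "_source": {"name": "Luke Skywalker"}},
            {"_id": "11", "_score": 0.80259144,
             "_source": {"name": "Anakin Skywalker"}},
        ],
    }
}

for doc_id, name, score in summarize_hits(sample):
    print(doc_id, name, score)
```

The same function works unchanged on the dictionary returned by es.search() in the script above.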

With the steps above, you can use Docker and Elastic Cloud to build a flexible scraping service that scrapes, stores, and queries data.

7. Moving Docker Containers to Amazon Elastic Container Registry (ECR)

With Elastic Cloud handling Elasticsearch, the next step is to move the Docker images to Amazon Elastic Container Registry (ECR) so they can run in Amazon Elastic Container Service (ECS). The steps are:

7.1 Creating an AWS IAM User and Key Pair

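1. Log in to the AWS Management Console and go to the IAM (Identity and Access Management) service.
2. In the left navigation pane, choose "Users", then click "Add user".
3. Enter a user name, select "Programmatic access" as the access type, and click "Next".
4. Attach permissions to the user. The AmazonEC2ContainerRegistryFullAccess policy is suggested, since it grants full access to ECR.
5. Click "Next", review the user details, and click "Create user".
6. Record the generated access key ID and secret access key; they are needed later for Docker authentication.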

7.2 Configuring Docker to Authenticate with ECR

Install the AWS CLI if it is not already installed, and configure it with the access key ID and secret access key of the IAM user created above:

$ aws configure
AWS Access Key ID [None]: <your-access-key-id>
AWS Secret Access Key [None]: <your-secret-access-key>
Default region name [None]: <your-aws-region>
Default output format [None]: json

Use the AWS CLI to obtain an ECR authentication token and log Docker in with it:

$ aws ecr get-login-password --region <your-aws-region> | docker login --username AWS --password-stdin <your-account-id>.dkr.ecr.<your-aws-region>.amazonaws.com

Here <your-account-id> is your AWS account ID and <your-aws-region> is the AWS region you chose.

7.3 Pushing Containers to ECR

Tag your Docker image so that it matches the ECR format:

$ docker tag <your-image-name>:<your-image-tag> <your-account-id>.dkr.ecr.<your-aws-region>.amazonaws.com/<your-repository-name>:<your-image-tag>

For example:

$ docker tag scraper-rest-api:latest 123456789012.dkr.ecr.us-west-2.amazonaws.com/scraper-repo:latest

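The target repository must already exist in ECR; if it does not, create it first (for example with aws ecr create-repository --repository-name <your-repository-name>). Then push the tagged image to ECR: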

$ docker push <your-account-id>.dkr.ecr.<your-aws-region>.amazonaws.com/<your-repository-name>:<your-image-tag>
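When several images have to be pushed, it helps to generate the ECR URI rather than type it each time. The sketch below only builds strings in the format shown above and runs nothing; the function names are hypothetical, and the account ID and repository name are the example values from the tagging command.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>."""
    registry = "{}.dkr.ecr.{}.amazonaws.com".format(account_id, region)
    return "{}/{}:{}".format(registry, repository, tag)


def push_commands(local_image, account_id, region, repository, tag="latest"):
    """Return the docker tag and docker push command lines as strings."""
    target = ecr_image_uri(account_id, region, repository, tag)
    return [
        "docker tag {}:{} {}".format(local_image, tag, target),
        "docker push {}".format(target),
    ]


for cmd in push_commands("scraper-rest-api", "123456789012",
                         "us-west-2", "scraper-repo"):
    print(cmd)
```

The printed lines can be pasted into a shell or passed to subprocess once the docker login step has succeeded.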

8. Running Containers in Amazon Elastic Container Service (ECS)

Once the images are in ECR, you can create a cluster and tasks in Amazon Elastic Container Service (ECS) to run them. The steps are:

8.1 Creating an ECS Cluster

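1. Log in to the AWS Management Console and go to the ECS service.
2. In the left navigation pane, choose "Clusters", then click "Create Cluster".
3. Choose a cluster template, for example "Networking only" (powered by Fargate), then click "Next".
4. Enter a cluster name, choose a VPC and subnets, then click "Create".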

8.2 Creating a Task Definition

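1. In the ECS console, choose "Task Definitions", then click "Create new Task Definition".
2. Choose the task definition type, for example "Fargate", then click "Next".
3. Enter a task definition name, choose a task execution role (ecsTaskExecutionRole is suggested), and set the task memory and CPU size.
4. In the "Container Definitions" section, click "Add container".
5. Enter a container name, specify the URI of the image pushed to ECR earlier, set port mappings and other container options, then click "Add".
6. Click "Create" to finish the task definition.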
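The console form corresponds to a JSON task definition. As a rough sketch of what the settings above amount to for the api container, the dictionary below uses the field names ECS expects for a Fargate task. The family name, CPU and memory sizes, and the account ID are assumptions chosen for illustration, not values from this article's deployment.

```python
import json

ACCOUNT_ID = "123456789012"  # example account ID from the tagging step
REGION = "us-west-2"

task_definition = {
    "family": "scraper-rest-api",            # hypothetical task definition name
    "networkMode": "awsvpc",                 # required for Fargate tasks
    "requiresCompatibilities": ["FARGATE"],
    "cpu": "256",                            # 0.25 vCPU (assumed size)
    "memory": "512",                         # MiB (assumed size)
    "executionRoleArn": "arn:aws:iam::{}:role/ecsTaskExecutionRole".format(ACCOUNT_ID),
    "containerDefinitions": [
        {
            "name": "api",
            "image": "{}.dkr.ecr.{}.amazonaws.com/scraper-repo:latest".format(
                ACCOUNT_ID, REGION),
            "essential": True,
            # The API listens on 8080, as in the Compose file.
            "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
        }
    ],
}

print(json.dumps(task_definition, indent=2))
```

Saved to a file, JSON in this shape is what tools that register task definitions outside the console work with.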

8.3 Starting and Accessing the Containers

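1. In the ECS console, select the cluster created earlier.
2. On the cluster page, click "Run Task".
3. Choose the task definition, set the number of tasks, choose the launch type (for example "Fargate"), then click "Run Task".
4. Wait for the task to start; the task list shows its status and public IP address.
5. Use the public IP address and port number to reach the service in the container.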

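9. Summary and Outlook

Through the steps above, we used Docker and Docker Compose to create the scraper microservices, moved Elasticsearch to Elastic Cloud, pushed the Docker images to Amazon ECR, and ran them in Amazon ECS. The process covers scraping, storage, querying, and containerized deployment, and provides a complete solution for building a flexible scraping service.

The following mermaid flowchart shows the whole process: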

graph LR
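    A[Create Docker microservices] --> B[Create Elastic Cloud account]
    B --> C[Access Elastic Cloud with curl]
    C --> D[Connect to Elastic Cloud with Python]
    D --> E[Visualize data with Kibana]
    E --> F[Run queries with the Python API]
    F --> G[Create AWS IAM user and key pair]
    G --> H[Configure Docker authentication]
    H --> I[Push containers to ECR]
    I --> J[Create ECS cluster]
    J --> K[Create task definition]
    K --> L[Start and access containers]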