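Implementing Scraper as a Service
1. Storing containers in Elastic Container Registry (ECR)
1.1 Accessing ECR and listing repositories
After signing in with the account created for ECS, you can access the Elastic Container Registry. List the existing repositories with the following AWS CLI command: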
$ aws ecr describe-repositories
{
"repositories": []
}
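No repositories exist yet, so the next step is to create them.
1.2 Creating the repositories
Create three repositories, one for each container: scraper-rest-api, scraper-microservice, and rabbitmq.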
$ aws ecr create-repository --repository-name scraper-rest-api
{
"repository": {
"repositoryArn": "arn:aws:ecr:us-west-2:414704166289:repository/scraper-rest-api",
"repositoryUri": "414704166289.dkr.ecr.us-west-2.amazonaws.com/scraper-rest-api",
"repositoryName": "scraper-rest-api",
"registryId": "414704166289",
"createdAt": 1515632756.0
}
}
$ aws ecr create-repository --repository-name scraper-microservice
{
"repository": {
"repositoryArn": "arn:aws:ecr:us-west-2:414704166289:repository/scraper-microservice",
"registryId": "414704166289",
"repositoryName": "scraper-microservice",
"repositoryUri": "414704166289.dkr.ecr.us-west-2.amazonaws.com/scraper-microservice",
"createdAt": 1515632772.0
}
}
$ aws ecr create-repository --repository-name rabbitmq
{
"repository": {
"repositoryArn": "arn:aws:ecr:us-west-2:414704166289:repository/rabbitmq",
"repositoryName": "rabbitmq",
"registryId": "414704166289",
"createdAt": 1515632780.0,
"repositoryUri": "414704166289.dkr.ecr.us-west-2.amazonaws.com/rabbitmq"
}
}
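Note the URI of each repository (the repositoryUri value); later steps use it.
1.3 Tagging the local container images
List the local Docker images with the following command: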
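The three URIs in the output above share one pattern: registry ID, region, and repository name. As a minimal sketch (the helper name `ecr_repository_uri` is hypothetical, not part of any AWS tooling), the pattern can be written down once and reused in later steps:

```python
def ecr_repository_uri(account_id: str, region: str, name: str) -> str:
    """Build the URI in the form seen in the create-repository output."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{name}"


# Values copied from the create-repository output above.
ACCOUNT_ID = "414704166289"
REGION = "us-west-2"
REPOSITORIES = ["scraper-rest-api", "scraper-microservice", "rabbitmq"]

REPOSITORY_URIS = {
    name: ecr_repository_uri(ACCOUNT_ID, REGION, name) for name in REPOSITORIES
}

if __name__ == "__main__":
    for name, uri in REPOSITORY_URIS.items():
        print(f"{name}: {uri}")
```

Substitute your own account ID and region; the values here only mirror the sample output.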
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
scraper-rest-api latest b82653e11635 29 seconds ago 717MB
scraper-microservice latest efe19d7b5279 11 minutes ago 4.16GB
rabbitmq 3-management 6cb6e2f951a8 2 weeks ago 151MB
python 3 c1e459c00dc3 3 weeks ago 692MB
Use the docker tag command to tag the three images (the python image does not need a tag):
$ docker tag b8 414704166289.dkr.ecr.us-west-2.amazonaws.com/scraper-rest-api
$ docker tag ef 414704166289.dkr.ecr.us-west-2.amazonaws.com/scraper-microservice
$ docker tag 6c 414704166289.dkr.ecr.us-west-2.amazonaws.com/rabbitmq
List the Docker images again to see the tagged images:
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
414704166289.dkr.ecr.us-west-2.amazonaws.com/scraper-rest-api latest b82653e11635 4 minutes ago 717MB
scraper-rest-api latest b82653e11635 4 minutes ago 717MB
414704166289.dkr.ecr.us-west-2.amazonaws.com/scraper-microservice latest efe19d7b5279 15 minutes ago 4.16GB
scraper-microservice latest efe19d7b5279 15 minutes ago 4.16GB
414704166289.dkr.ecr.us-west-2.amazonaws.com/rabbitmq latest 6cb6e2f951a8 2 weeks ago 151MB
rabbitmq 3-management 6cb6e2f951a8 2 weeks ago 151MB
python 3 c1e459c00dc3 3 weeks ago 692MB
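1.4 Pushing the images to ECR
Docker must be authenticated against the ECR registry before a push is accepted. Once it is, push each tagged image: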
$ docker push 414704166289.dkr.ecr.us-west-2.amazonaws.com/scraper-rest-api
$ docker push 414704166289.dkr.ecr.us-west-2.amazonaws.com/scraper-microservice
$ docker push 414704166289.dkr.ecr.us-west-2.amazonaws.com/rabbitmq
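Tagging and pushing follow the same pattern for every image, so the command lines can be generated from a small mapping. The sketch below only builds and prints the commands; it does not run Docker. The function name `tag_and_push_commands` is hypothetical, and the image ID prefixes are the ones from the `docker images` listing above.

```python
import shlex

REGISTRY = "414704166289.dkr.ecr.us-west-2.amazonaws.com"

# Local image ID prefix for each repository, from the `docker images` listing.
LOCAL_IMAGES = {
    "scraper-rest-api": "b8",
    "scraper-microservice": "ef",
    "rabbitmq": "6c",
}


def tag_and_push_commands(image_ids, registry):
    """Return argv lists: one `docker tag` and one `docker push` per image."""
    commands = []
    for repository, image_id in image_ids.items():
        target = f"{registry}/{repository}"
        commands.append(["docker", "tag", image_id, target])
        commands.append(["docker", "push", target])
    return commands


if __name__ == "__main__":
    for argv in tag_and_push_commands(LOCAL_IMAGES, REGISTRY):
        print(shlex.join(argv))
```

Keeping each command as an argv list means it could later be passed to `subprocess.run` without shell quoting issues.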
1.5 Checking that the images were pushed
$ aws ecr list-images --repository-name scraper-rest-api
{
"imageIds": [
{
"imageTag": "latest",
"imageDigest": "sha256:2fa2ccc0f4141a1473386d3592b751527eaccb37f035aa08ed0c4b6d7abc9139"
}
]
}
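The same check can be scripted by parsing the JSON that `aws ecr list-images` prints. This is a sketch under the assumption that the output has been captured as a string; the function name `pushed_tags` is hypothetical, and the sample is the output shown above.

```python
import json

# Output of `aws ecr list-images --repository-name scraper-rest-api`, as shown above.
SAMPLE_OUTPUT = """
{
  "imageIds": [
    {
      "imageTag": "latest",
      "imageDigest": "sha256:2fa2ccc0f4141a1473386d3592b751527eaccb37f035aa08ed0c4b6d7abc9139"
    }
  ]
}
"""


def pushed_tags(list_images_output: str) -> set:
    """Return the set of tags present in a list-images response."""
    data = json.loads(list_images_output)
    # Untagged images carry only a digest, so skip entries without imageTag.
    return {image["imageTag"] for image in data.get("imageIds", []) if "imageTag" in image}


if __name__ == "__main__":
    tags = pushed_tags(SAMPLE_OUTPUT)
    print("latest" in tags)
```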
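Workflow summary
graph LR
A[Sign in with the ECS account] --> B[List ECR repositories]
B --> C{Any repositories?}
C -- No --> D[Create repositories]
C -- Yes --> E[Skip creation]
D --> F[Tag local images]
E --> F
F --> G[Push images to ECR]
G --> H[Check that the images were pushed]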
2. Creating the ECS cluster
2.1 Creating the ECS cluster
Use the AWS CLI to create an ECS cluster named scraper-cluster:
$ aws ecs create-cluster --cluster-name scraper-cluster
{
"cluster": {
"clusterName": "scraper-cluster",
"registeredContainerInstancesCount": 0,
"clusterArn": "arn:aws:ecs:us-west-2:414704166289:cluster/scraper-cluster",
"status": "ACTIVE",
"activeServicesCount": 0,
"pendingTasksCount": 0,
"runningTasksCount": 0
}
}
2.2 Creating a key pair
$ aws ec2 create-key-pair --key-name ScraperClusterKP --query 'KeyMaterial' --output text > ScraperClusterKP.pem
$ aws ec2 describe-key-pairs --key-name ScraperClusterKP
{
"KeyPairs": [
{
"KeyFingerprint": "4a:8a:22:fa:53:a7:87:df:c5:17:d9:4f:b1:df:4e:22:48:90:27:2d",
"KeyName": "ScraperClusterKP"
}
]
}
2.3 Creating a security group
Create a security group that opens port 22 (SSH), port 80 (HTTP), and the two RabbitMQ ports (5672 and 15672).
$ aws ec2 create-security-group --group-name ScraperClusterSG --description "Scraper Cluster SG"
{
"GroupId": "sg-5e724022"
}
$ aws ec2 authorize-security-group-ingress --group-name ScraperClusterSG --protocol tcp --port 22 --cidr 0.0.0.0/0
$ aws ec2 authorize-security-group-ingress --group-name ScraperClusterSG --protocol tcp --port 80 --cidr 0.0.0.0/0
$ aws ec2 authorize-security-group-ingress --group-name ScraperClusterSG --protocol tcp --port 5672 --cidr 0.0.0.0/0
$ aws ec2 authorize-security-group-ingress --group-name ScraperClusterSG --protocol tcp --port 15672 --cidr 0.0.0.0/0
Confirm the contents of the security group with the following command:
$ aws ec2 describe-security-groups --group-names ScraperClusterSG
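The output of that command is not reproduced here, so the sample below is a trimmed, hypothetical illustration of its shape (a `SecurityGroups` list whose entries carry `IpPermissions` with `FromPort` and `ToPort`). Under that assumption, a short script can confirm that the four required ports are open; `open_tcp_ports` is a made-up helper name.

```python
import json

REQUIRED_PORTS = {22, 80, 5672, 15672}

# Hypothetical, trimmed example of a describe-security-groups response.
SAMPLE_OUTPUT = """
{
  "SecurityGroups": [
    {
      "GroupName": "ScraperClusterSG",
      "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22},
        {"IpProtocol": "tcp", "FromPort": 80, "ToPort": 80},
        {"IpProtocol": "tcp", "FromPort": 5672, "ToPort": 5672},
        {"IpProtocol": "tcp", "FromPort": 15672, "ToPort": 15672}
      ]
    }
  ]
}
"""


def open_tcp_ports(describe_output: str) -> set:
    """Collect every TCP port covered by the ingress rules in the response."""
    data = json.loads(describe_output)
    ports = set()
    for group in data.get("SecurityGroups", []):
        for permission in group.get("IpPermissions", []):
            if permission.get("IpProtocol") != "tcp":
                continue
            ports.update(range(permission["FromPort"], permission["ToPort"] + 1))
    return ports


if __name__ == "__main__":
    missing = REQUIRED_PORTS - open_tcp_ports(SAMPLE_OUTPUT)
    print("missing ports:", sorted(missing))
```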
2.4 Setting up IAM policies
Register the IAM policies using the ecsPolicy.json and rolePolicy.json files:
$ aws iam create-role --role-name ecsRole --assume-role-policy-document file://ecsPolicy.json
$ aws iam put-role-policy --role-name ecsRole --policy-name ecsRolePolicy --policy-document file://rolePolicy.json
$ aws iam create-instance-profile --instance-profile-name ecsRole
$ aws iam add-role-to-instance-profile --instance-profile-name ecsRole --role-name ecsRole
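The contents of ecsPolicy.json are not shown in this article. Because it is passed as the assume-role policy document for a role attached to an EC2 instance, it is presumably a trust policy that lets the EC2 service assume the role. The sketch below generates such a document; treat it as an assumption about the file, not a copy of it. The function name `ec2_trust_policy` is hypothetical.

```python
import json


def ec2_trust_policy() -> dict:
    """Trust policy allowing EC2 instances to assume the role (assumed content)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # Redirect this output to ecsPolicy.json if you do not already have the file.
    print(json.dumps(ec2_trust_policy(), indent=2))
```

rolePolicy.json, which grants the permissions the ECS agent needs, is likewise not shown and is left out of this sketch.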
2.5 Launching the EC2 instance
$ aws ec2 run-instances --image-id ami-c9c87cb1 --count 1 --instance-type m4.large --key-name ScraperClusterKP --iam-instance-profile "Name=ecsRole" --security-groups ScraperClusterSG --user-data file://userdata.txt
2.6 Checking that the instance is running
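The userdata.txt file is not reproduced in this article. On an ECS-optimized AMI, the usual way to make an instance join a named cluster is to write `ECS_CLUSTER=<name>` to `/etc/ecs/ecs.config` at boot, so the file presumably does something like the following. This is a sketch of that assumption; `ecs_user_data` is a hypothetical helper.

```python
def ecs_user_data(cluster_name: str) -> str:
    """Return a boot script that registers the instance with the given ECS cluster."""
    return (
        "#!/bin/bash\n"
        f"echo ECS_CLUSTER={cluster_name} >> /etc/ecs/ecs.config\n"
    )


if __name__ == "__main__":
    # Redirect this output to userdata.txt before running `aws ec2 run-instances`.
    print(ecs_user_data("scraper-cluster"), end="")
```

Without a cluster name in the agent configuration, the instance registers with the default cluster and would not appear under scraper-cluster.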
$ aws ecs list-container-instances --cluster scraper-cluster
{
"containerInstanceArns": [
"arn:aws:ecs:us-west-2:414704166289:container-instance/263d9416-305f-46ff-a344-9e7076ca352a"
]
}
Step summary
| Step | Action | Command |
|---|---|---|
| 1 | Create the ECS cluster | `aws ecs create-cluster --cluster-name scraper-cluster` |
| 2 | Create a key pair | `aws ec2 create-key-pair --key-name ScraperClusterKP --query 'KeyMaterial' --output text > ScraperClusterKP.pem` and `aws ec2 describe-key-pairs --key-name ScraperClusterKP` |
| 3 | Create a security group | A series of `aws ec2` commands |
| 4 | Set up IAM policies | A series of `aws iam` commands |
| 5 | Launch the EC2 instance | `aws ec2 run-instances --image-id ami-c9c87cb1 --count 1 --instance-type m4.large --key-name ScraperClusterKP --iam-instance-profile "Name=ecsRole" --security-groups ScraperClusterSG --user-data file://userdata.txt` |
| 6 | Check that the instance is running | `aws ecs list-container-instances --cluster scraper-cluster` |
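3. Creating a task to run the containers
3.1 The task definition file
The td.json file describes how to run the containers. Register the task with ECS using the following command: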
$ aws ecs register-task-definition --cli-input-json file://td.json
The output looks like this:
{
"taskDefinition": {
"volumes": [],
"family": "scraper",
"memory": "4096",
"placementConstraints": [],
"cpu": "1024",
"containerDefinitions": [
{
"name": "rabbitmq",
"cpu": 0,
"volumesFrom": [],
"mountPoints": [],
"portMappings": [
{
"hostPort": 15672,
"protocol": "tcp",
"containerPort": 15672
},
{
"hostPort": 5672,
"protocol": "tcp",
"containerPort": 5672
}
],
"environment": [],
"image": "414704166289.dkr.ecr.us-west-2.amazonaws.com/rabbitmq",
"memory": 256,
"essential": true
},
{
"name": "scraper-microservice",
"cpu": 0,
"essential": true,
"volumesFrom": [],
"mountPoints": [],
"portMappings": [],
"environment": [
{
"name": "AMQP_URI",
"value": "pyamqp://guest:guest@rabbitmq"
}
],
"image": "414704166289.dkr.ecr.us-west-2.amazonaws.com/scraper-microservice",
"memory": 256,
"links": [
"rabbitmq"
]
},
{
"name": "api",
"cpu": 0,
"essential": true,
"volumesFrom": [],
"mountPoints": [],
"portMappings": [
{
"hostPort": 80,
"protocol": "tcp",
"containerPort": 8080
}
],
"environment": [
{
"name": "AMQP_URI",
"value": "pyamqp://guest:guest@rabbitmq"
},
{
"name": "ES_HOST",
"value": "https://elastic:tduhdExunhEWPjSuH73O6yLS@7dc72d3327076cc4daf5528103c46a27.us-west-2.aws.found.io:9243"
}
],
"image": "414704166289.dkr.ecr.us-west-2.amazonaws.com/scraper-rest-api",
"memory": 128,
"links": [
"rabbitmq"
]
}
],
"requiresCompatibilities": [
"EC2"
],
"status": "ACTIVE",
"taskDefinitionArn": "arn:aws:ecs:us-west-2:414704166289:task-definition/scraper:7",
"requiresAttributes": [
{
"name": "com.amazonaws.ecs.capability.ecr-auth"
}
],
"revision": 7,
"compatibilities": [
"EC2"
]
}
}
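3.2 Anatomy of the task definition
The task definition has two main parts:
- Overall settings: task-wide settings such as the total memory and CPU allowed and whether any volumes are mounted.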
{
"family": "scraper-as-a-service",
"requiresCompatibilities": [
"EC2"
],
"cpu": "1024",
"memory": "4096",
"volumes": []
}
- Container definitions: the three containers to run.
- **rabbitmq container**:
{
"name": "rabbitmq",
"image": "414704166289.dkr.ecr.us-west-2.amazonaws.com/rabbitmq",
"cpu": 0,
"memory": 256,
"portMappings": [
{
"containerPort": 15672,
"hostPort": 15672,
"protocol": "tcp"
},
{
"containerPort": 5672,
"hostPort": 5672,
"protocol": "tcp"
}
],
"essential": true
}
- **scraper-microservice container**:
{
"name": "scraper-microservice",
"image": "414704166289.dkr.ecr.us-west-2.amazonaws.com/scraper-microservice",
"cpu": 0,
"memory": 256,
"essential": true,
"environment": [
{
"name": "AMQP_URI",
"value": "pyamqp://guest:guest@rabbitmq"
}
],
"links": [
"rabbitmq"
]
}
- **api container**:
{
"name": "api",
"image": "414704166289.dkr.ecr.us-west-2.amazonaws.com/scraper-rest-api",
"cpu": 0,
"memory": 128,
"essential": true,
"portMappings": [
{
"containerPort": 8080,
"hostPort": 80,
"protocol": "tcp"
}
],
"environment": [
{
"name": "AMQP_URI",
"value": "pyamqp://guest:guest@rabbitmq"
},
{
"name": "ES_HOST",
"value": "https://elastic:tduhdExunhEWPjSuH73O6yLS@7dc72d3327076cc4daf5528103c46a27.us-west-2.aws.found.io:9243"
}
],
"links": [
"rabbitmq"
]
}
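A few properties of this definition can be checked before registering it: the containers' memory should fit within the task-level limit, every link should name a container in the same task, and no two containers should claim the same host port. The sketch below is a hypothetical pre-flight check, not an ECS feature; it uses a trimmed copy of the definition above with environment variables omitted.

```python
TASK_DEF = {
    "family": "scraper-as-a-service",
    "cpu": "1024",
    "memory": "4096",
    "containerDefinitions": [
        {"name": "rabbitmq", "memory": 256,
         "portMappings": [{"containerPort": 15672, "hostPort": 15672},
                          {"containerPort": 5672, "hostPort": 5672}]},
        {"name": "scraper-microservice", "memory": 256, "links": ["rabbitmq"]},
        {"name": "api", "memory": 128, "links": ["rabbitmq"],
         "portMappings": [{"containerPort": 8080, "hostPort": 80}]},
    ],
}


def check_task_definition(task_def: dict) -> list:
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    containers = task_def.get("containerDefinitions", [])
    names = {c["name"] for c in containers}

    # Container memory should fit inside the task-level limit (both in MiB).
    task_memory = int(task_def.get("memory", 0))
    container_memory = sum(c.get("memory", 0) for c in containers)
    if task_memory and container_memory > task_memory:
        problems.append(
            f"containers request {container_memory} MiB but the task allows {task_memory} MiB")

    # Each link ("name" or "name:alias") must refer to a container in this task.
    for c in containers:
        for link in c.get("links", []):
            target = link.split(":", 1)[0]
            if target not in names:
                problems.append(f"{c['name']} links to unknown container {target!r}")

    # Two containers cannot bind the same fixed host port.
    owners = {}
    for c in containers:
        for mapping in c.get("portMappings", []):
            port = mapping.get("hostPort")
            if not port:
                continue  # no fixed host port requested
            if port in owners:
                problems.append(f"host port {port} used by {owners[port]} and {c['name']}")
            owners[port] = c["name"]
    return problems


if __name__ == "__main__":
    print(check_task_definition(TASK_DEF) or "no problems found")
```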
3.3 Task definition structure summary
graph LR
A[Task definition] --> B[Overall settings]
A --> C[Container definitions]
B --> B1[family]
B --> B2[requiresCompatibilities]
B --> B3[cpu]
B --> B4[memory]
B --> B5[volumes]
C --> C1[rabbitmq container]
C --> C2[scraper-microservice container]
C --> C3[api container]
4. Launching and accessing the containers in AWS
4.1 Getting the latest task revision number
$ aws ecs list-task-definitions
{
"taskDefinitionArns": [
"arn:aws:ecs:us-west-2:414704166289:task-definition/scraper-as-a-service:17"
]
}
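Each ARN in that list ends in `family:revision`, so the newest revision of a family can be picked out programmatically. This is a minimal sketch; the function name `latest_revision` is hypothetical, and the second ARN in the usage example is the earlier registration output, added only to show filtering by family.

```python
def latest_revision(arns: list, family: str) -> int:
    """Return the highest revision number registered for the given task family."""
    revisions = []
    for arn in arns:
        # The last path segment looks like "scraper-as-a-service:17".
        name, _, revision = arn.rsplit("/", 1)[-1].rpartition(":")
        if name == family:
            revisions.append(int(revision))
    if not revisions:
        raise ValueError(f"no task definitions found for family {family!r}")
    return max(revisions)


if __name__ == "__main__":
    arns = [
        "arn:aws:ecs:us-west-2:414704166289:task-definition/scraper-as-a-service:17",
        "arn:aws:ecs:us-west-2:414704166289:task-definition/scraper:7",
    ]
    print(latest_revision(arns, "scraper-as-a-service"))
```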
4.2 Running the task
$ aws ecs run-task --cluster scraper-cluster --task-definition scraper-as-a-service:17 --count 1
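The output includes the task's current status. The first run can take a while because the container images have to be pulled onto the EC2 instance.
4.3 Checking the task status
$ aws ecs describe-tasks --cluster scraper-cluster --tasks 00d7b868-1b99-4b54-9f2a-0d5d0ae75197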
Replace the task GUID with the GUID from the taskArn property in the run-task output. Once all containers are running, you can test the API.
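Both parts of this step can be scripted: taking the GUID from a task ARN, and deciding whether every container has reached RUNNING. The sketch assumes the describe-tasks response has a `tasks` list whose entries contain `containers` with a `lastStatus` field; the sample ARN and JSON are illustrative, and both function names are hypothetical.

```python
import json


def arn_resource_id(arn: str) -> str:
    """Return the last path segment of an ARN, which is the resource GUID here."""
    return arn.rsplit("/", 1)[-1]


def all_containers_running(describe_tasks_output: str) -> bool:
    """True only if at least one container is listed and all report RUNNING."""
    data = json.loads(describe_tasks_output)
    containers = [
        container
        for task in data.get("tasks", [])
        for container in task.get("containers", [])
    ]
    return bool(containers) and all(c.get("lastStatus") == "RUNNING" for c in containers)


if __name__ == "__main__":
    # Illustrative ARN built from the GUID used in the command above.
    task_arn = "arn:aws:ecs:us-west-2:414704166289:task/00d7b868-1b99-4b54-9f2a-0d5d0ae75197"
    print(arn_resource_id(task_arn))

    # Illustrative, trimmed describe-tasks response.
    sample = json.dumps({"tasks": [{"containers": [
        {"name": "rabbitmq", "lastStatus": "RUNNING"},
        {"name": "scraper-microservice", "lastStatus": "PENDING"},
        {"name": "api", "lastStatus": "RUNNING"},
    ]}]})
    print(all_containers_running(sample))
```

Taking the segment after the last slash also works for ARNs that include the cluster name before the GUID.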
4.4 Getting the IP address or DNS name of the cluster instance
- List the cluster instances:
$ aws ecs list-container-instances --cluster scraper-cluster
{
"containerInstanceArns": [
"arn:aws:ecs:us-west-2:414704166289:container-instance/5959fd63-7fd6-4f0e-92aa-ea136dabd762"
]
}
- Look up the EC2 instance ID:
$ aws ecs describe-container-instances --cluster scraper-cluster --container-instances 5959fd63-7fd6-4f0e-92aa-ea136dabd762 | grep "ec2InstanceId"
"ec2InstanceId": "i-08614daf41a9ab8a2",
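Piping through grep works for a quick look; parsing the JSON is sturdier when the value feeds another command. The sketch below assumes the response has a `containerInstances` list whose entries carry `ec2InstanceId`. The sample is trimmed to that one field, and `ec2_instance_ids` is a hypothetical helper name.

```python
import json

# Trimmed illustration of a describe-container-instances response.
SAMPLE_OUTPUT = """
{
  "containerInstances": [
    {"ec2InstanceId": "i-08614daf41a9ab8a2"}
  ]
}
"""


def ec2_instance_ids(describe_output: str) -> list:
    """Return the EC2 instance ID of every container instance in the response."""
    data = json.loads(describe_output)
    return [
        instance["ec2InstanceId"]
        for instance in data.get("containerInstances", [])
        if "ec2InstanceId" in instance
    ]


if __name__ == "__main__":
    print(ec2_instance_ids(SAMPLE_OUTPUT))
```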
4.5 Step summary
| Step | Action | Command |
|---|---|---|
| 1 | Get the latest task revision number | `aws ecs list-task-definitions` |
| 2 | Run the task | `aws ecs run-task --cluster scraper-cluster --task-definition scraper-as-a-service:17 --count 1` |
| 3 | Check the task status | `aws ecs describe-tasks --cluster scraper-cluster --tasks [task-guid]` |
| 4 | Get the cluster instance information | `aws ecs list-container-instances --cluster scraper-cluster` and `aws ecs describe-container-instances --cluster scraper-cluster --container-instances [instance-guid] \| grep "ec2InstanceId"` |
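With these steps, the scraper service is deployed to AWS and can be confirmed to be running. In practice, adjust the task definition and container configuration to fit your own workload.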