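etcd Cluster Deployment Automation: Ansible/Terraform Configuration
Overview
In modern distributed systems, etcd, a highly available key-value store, has become a core component of critical infrastructure such as Kubernetes and other cloud-native platforms. Deploying an etcd cluster by hand is slow and error-prone. This article shows how to automate etcd cluster deployment with Ansible and Terraform so that every deployment is repeatable, consistent, and reliable.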
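etcd Cluster Architecture Basics
Cluster Topology
An etcd cluster replicates its data with the Raft consensus algorithm. A typical deployment runs 3 or 5 members.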
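Raft commits a write only after a majority (quorum) of members has accepted it, which is why odd member counts are preferred. The short bash sketch below computes quorum as floor(n/2) + 1 and the number of member failures an n-member cluster can survive. The function names are illustrative only.

```bash
#!/usr/bin/env bash
# Quorum and fault tolerance for an n-member Raft cluster.
# quorum(n) = floor(n / 2) + 1; tolerated failures = n - quorum(n).

quorum() {
  local n=$1
  echo $(( n / 2 + 1 ))
}

fault_tolerance() {
  local n=$1
  echo $(( n - (n / 2 + 1) ))
}

# Print the numbers for common cluster sizes.
for n in 1 2 3 4 5 6 7; do
  printf 'members=%d quorum=%d tolerated_failures=%d\n' \
    "$n" "$(quorum "$n")" "$(fault_tolerance "$n")"
done
```

The output shows that 4 members tolerate one failure, the same as 3, while every write needs one more vote. 5 members tolerate two failures.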
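Key Configuration Parameters
| Parameter | Description | Example value |
|---|---|---|
| `--name` | Member name | `etcd-node1` |
| `--data-dir` | Data directory | `/var/lib/etcd` |
| `--listen-client-urls` | URLs to listen on for client traffic | `http://0.0.0.0:2379` |
| `--advertise-client-urls` | Client URLs advertised to clients and other members | `http://<node-ip>:2379` |
| `--listen-peer-urls` | URLs to listen on for peer (member-to-member) traffic | `http://0.0.0.0:2380` |
| `--initial-advertise-peer-urls` | Peer URLs this member advertises to the others at bootstrap | `http://<node-ip>:2380` |
| `--initial-cluster` | Initial cluster membership as `name=peer-URL` pairs | `node1=http://ip1:2380,node2=http://ip2:2380` |
| `--initial-cluster-token` | Cluster token, unique per cluster | `etcd-cluster-token` |
| `--initial-cluster-state` | `new` to bootstrap a cluster, `existing` to join one | `new` |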
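To show how these parameters fit together for one member, the bash sketch below prints its bootstrap flags, one per line. It assumes plaintext HTTP and the default ports 2379/2380 from the table. It also includes `--initial-advertise-peer-urls` and `--initial-cluster-state`, which the unit template and user-data script later in this article set as well. The function name `etcd_flags` is hypothetical.

```bash
#!/usr/bin/env bash
# Print etcd bootstrap flags for one member, one flag per line.
# Assumes plaintext HTTP and default ports: 2379 (client), 2380 (peer).
etcd_flags() {
  local name=$1 ip=$2 initial_cluster=$3 token=$4
  printf '%s\n' \
    "--name=${name}" \
    "--data-dir=/var/lib/etcd" \
    "--listen-client-urls=http://0.0.0.0:2379" \
    "--advertise-client-urls=http://${ip}:2379" \
    "--listen-peer-urls=http://0.0.0.0:2380" \
    "--initial-advertise-peer-urls=http://${ip}:2380" \
    "--initial-cluster=${initial_cluster}" \
    "--initial-cluster-token=${token}" \
    "--initial-cluster-state=new"
}
```

On a host, the lines can be passed to etcd as an array, where `CLUSTER` holds the `--initial-cluster` string: `mapfile -t flags < <(etcd_flags etcd-node1 192.168.1.101 "$CLUSTER" etcd-cluster-token) && /usr/local/bin/etcd "${flags[@]}"`.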
Ansible Automated Deployment
Project Layout
```
etcd-ansible/
├── inventories/
│   ├── production/
│   │   └── hosts
│   └── staging/
│       └── hosts
├── group_vars/
│   ├── all.yml
│   └── etcd_cluster.yml
├── roles/
│   └── etcd/
│       ├── tasks/
│       │   ├── main.yml
│       │   ├── install.yml
│       │   ├── configure.yml
│       │   └── service.yml
│       ├── templates/
│       │   ├── etcd.service.j2
│       │   └── etcd.conf.j2
│       └── handlers/
│           └── main.yml
└── site.yml
```
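You can create the layout above in one step. The minimal bash sketch below assumes you want empty placeholder files that you then fill in with the snippets in this article. The function name `scaffold_etcd_ansible` is hypothetical.

```bash
#!/usr/bin/env bash
# Create the etcd-ansible skeleton under a target directory and print its path.
scaffold_etcd_ansible() {
  local root="${1:?usage: scaffold_etcd_ansible TARGET_DIR}/etcd-ansible"
  local -a files=(
    inventories/production/hosts
    inventories/staging/hosts
    group_vars/all.yml
    group_vars/etcd_cluster.yml
    roles/etcd/tasks/main.yml
    roles/etcd/tasks/install.yml
    roles/etcd/tasks/configure.yml
    roles/etcd/tasks/service.yml
    roles/etcd/templates/etcd.service.j2
    roles/etcd/templates/etcd.conf.j2
    roles/etcd/handlers/main.yml
    site.yml
  )
  local f
  for f in "${files[@]}"; do
    mkdir -p "$root/$(dirname "$f")"
    # touch creates missing files; existing files keep their content.
    touch "$root/$f"
  done
  echo "$root"
}
```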
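Ansible Playbook Implementation
Main variables file (group_vars/all.yml)
```yaml
# etcd version
# Note: etcd 3.5.0-3.5.2 have a known data-inconsistency bug that was fixed
# in 3.5.3; pin a later 3.5.x patch release for production.
etcd_version: "3.5.0"
etcd_download_url: "https://github.com/etcd-io/etcd/releases/download/v{{ etcd_version }}/etcd-v{{ etcd_version }}-linux-amd64.tar.gz"

# System settings
etcd_data_dir: "/var/lib/etcd"
etcd_user: "etcd"
etcd_group: "etcd"

# Network settings
client_port: 2379
peer_port: 2380
```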
Cluster variables (group_vars/etcd_cluster.yml)
```yaml
# Cluster settings
etcd_cluster_name: "production-etcd-cluster"
etcd_cluster_token: "etcd-cluster-token-2024"

# Member definitions
etcd_nodes:
  - name: "etcd-node1"
    ip: "192.168.1.101"
    client_url: "http://192.168.1.101:2379"
    peer_url: "http://192.168.1.101:2380"
  - name: "etcd-node2"
    ip: "192.168.1.102"
    client_url: "http://192.168.1.102:2379"
    peer_url: "http://192.168.1.102:2380"
  - name: "etcd-node3"
    ip: "192.168.1.103"
    client_url: "http://192.168.1.103:2379"
    peer_url: "http://192.168.1.103:2380"

# Build the initial-cluster string,
# e.g. "etcd-node1=http://192.168.1.101:2380,etcd-node2=...,etcd-node3=..."
etcd_initial_cluster: "{% for node in etcd_nodes %}{{ node.name }}={{ node.peer_url }}{% if not loop.last %},{% endif %}{% endfor %}"
```
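The unit template below sets `ETCD_NAME` from `inventory_hostname_short`. Each host's inventory name must therefore match one of the `name` values in `etcd_nodes`, and its advertised peer URL must match that entry. If they don't, etcd refuses to bootstrap because it cannot find its own `name=URL` pair in the initial-cluster string. A small bash pre-flight check can catch this early. The helper name `member_in_initial_cluster` is hypothetical.

```bash
#!/usr/bin/env bash
# Succeed only if "<name>=<peer_url>" is one of the comma-separated
# entries of an initial-cluster string.
member_in_initial_cluster() {
  local name=$1 peer_url=$2 initial_cluster=$3
  local -a entries
  local entry
  IFS=',' read -r -a entries <<< "$initial_cluster"
  for entry in "${entries[@]}"; do
    if [[ "$entry" == "${name}=${peer_url}" ]]; then
      return 0
    fi
  done
  return 1
}
```

Before starting etcd, run the check on each host with that host's short inventory name and advertised peer URL.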
Installation tasks (roles/etcd/tasks/install.yml)
```yaml
- name: Create the etcd group
  group:
    name: "{{ etcd_group }}"
    system: yes

- name: Create the etcd user
  user:
    name: "{{ etcd_user }}"
    group: "{{ etcd_group }}"
    system: yes
    create_home: no
    shell: /sbin/nologin

- name: Create the data directory
  file:
    path: "{{ etcd_data_dir }}"
    state: directory
    owner: "{{ etcd_user }}"
    group: "{{ etcd_group }}"
    mode: '0700'   # etcd (v3.4.10+) expects the data dir to be private to its owner

- name: Download the etcd release tarball
  get_url:
    url: "{{ etcd_download_url }}"
    dest: "/tmp/etcd-v{{ etcd_version }}-linux-amd64.tar.gz"
    checksum: "sha256:<SHA256 checksum of the release tarball>"

- name: Extract the etcd binaries
  unarchive:
    src: "/tmp/etcd-v{{ etcd_version }}-linux-amd64.tar.gz"
    dest: "/tmp/"
    remote_src: yes

- name: Install etcd and etcdctl into the system path
  copy:
    src: "/tmp/etcd-v{{ etcd_version }}-linux-amd64/{{ item }}"
    dest: "/usr/local/bin/{{ item }}"
    remote_src: yes   # the files are on the managed host, not the controller
    owner: root
    group: root
    mode: '0755'
  loop:
    - etcd
    - etcdctl
```
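The checksum value in the download task is a placeholder. The bash sketch below shows what that check does and lets you verify a tarball outside Ansible. It compares a file against a checksum list in the `<hash>  <file>` format written by `sha256sum`, and assumes you downloaded the checksum file published on the same release page as the tarball. `verify_sha256` is a hypothetical helper name. Ansible's `get_url` can also do the lookup itself, because its `checksum` parameter accepts `sha256:<URL of a checksum file>` in place of a literal hash.

```bash
#!/usr/bin/env bash
# Check a file against a checksum list in sha256sum's format.
# Returns non-zero if the file is not listed or its digest does not match.
verify_sha256() {
  local tarball=$1 sums_file=$2
  local name expected actual
  name=$(basename "$tarball")
  # Text-mode lines are "<hash>  <file>", binary-mode lines "<hash> *<file>".
  expected=$(awk -v f="$name" '$2 == f || $2 == ("*" f) { print $1; exit }' "$sums_file")
  if [[ -z "$expected" ]]; then
    echo "no checksum entry for $name in $sums_file" >&2
    return 1
  fi
  actual=$(sha256sum "$tarball" | awk '{ print $1 }')
  [[ "$actual" == "$expected" ]]
}
```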
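Configuration tasks (roles/etcd/tasks/configure.yml)
```yaml
- name: Create the etcd configuration directory
  file:
    path: /etc/etcd
    state: directory
    owner: root
    group: root
    mode: '0755'

- name: Render the etcd configuration file
  template:
    src: etcd.conf.j2
    dest: "/etc/etcd/etcd.conf"
    owner: root
    group: root
    mode: '0644'

- name: Render the systemd unit file
  template:
    src: etcd.service.j2
    dest: "/etc/systemd/system/etcd.service"
    owner: root
    group: root
    mode: '0644'

- name: Reload systemd
  systemd:
    daemon_reload: yes
```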
Systemd unit template (roles/etcd/templates/etcd.service.j2)
```ini
[Unit]
Description=etcd key-value store
Documentation=https://github.com/etcd-io/etcd
Wants=network-online.target
After=network-online.target

[Service]
Type=notify
User={{ etcd_user }}
Group={{ etcd_group }}
Environment=ETCD_DATA_DIR={{ etcd_data_dir }}
# Must match one of the names in etcd_nodes (and thus etcd_initial_cluster)
Environment=ETCD_NAME={{ inventory_hostname_short }}
Environment=ETCD_LISTEN_PEER_URLS=http://0.0.0.0:{{ peer_port }}
Environment=ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:{{ client_port }}
Environment=ETCD_INITIAL_ADVERTISE_PEER_URLS=http://{{ ansible_default_ipv4.address }}:{{ peer_port }}
Environment=ETCD_ADVERTISE_CLIENT_URLS=http://{{ ansible_default_ipv4.address }}:{{ client_port }}
Environment=ETCD_INITIAL_CLUSTER={{ etcd_initial_cluster }}
Environment=ETCD_INITIAL_CLUSTER_TOKEN={{ etcd_cluster_token }}
Environment=ETCD_INITIAL_CLUSTER_STATE=new
ExecStart=/usr/local/bin/etcd
Restart=always
RestartSec=5
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
```
Deployment Workflow
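With the role in place, a rollout usually validates the playbook before applying it. The bash sketch below assumes the `site.yml` and `inventories/<env>/hosts` layout shown earlier. `deploy_etcd` and the `DRY_RUN` switch are hypothetical conveniences, not part of Ansible. A full `--check` run is deliberately omitted: on a fresh host the extract task would likely fail in check mode, because the download task has not actually fetched the tarball.

```bash
#!/usr/bin/env bash
# Staged rollout of site.yml against one inventory.
# DRY_RUN=1 prints each command instead of running it.
run() {
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

deploy_etcd() {
  local env=${1:-staging}
  local inventory="inventories/${env}/hosts"

  # 1. Catch YAML/Jinja syntax errors without contacting any host.
  run ansible-playbook -i "$inventory" site.yml --syntax-check || return 1
  # 2. Show which hosts the play will target.
  run ansible-playbook -i "$inventory" site.yml --list-hosts || return 1
  # 3. Apply the role.
  run ansible-playbook -i "$inventory" site.yml
}
```

A typical sequence is `deploy_etcd staging`, then `deploy_etcd production` once staging is healthy.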
Terraform Infrastructure-as-Code Approach
Terraform Project Layout
```
etcd-terraform/
├── main.tf
├── variables.tf
├── outputs.tf
└── modules/
    └── etcd_instance/
        ├── main.tf
        ├── variables.tf
        ├── outputs.tf
        └── user_data.sh
```
Terraform root configuration (main.tf)
```hcl
terraform {
  required_version = ">= 1.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.0"
    }
  }
}

provider "aws" {
  region = var.region
}

# Security group for the etcd cluster
resource "aws_security_group" "etcd_cluster" {
  name_prefix = "etcd-cluster-"
  description = "Security group for etcd cluster"

  # 0.0.0.0/0 exposes the plaintext client API to the internet;
  # restrict this to your clients' CIDR ranges in production.
  ingress {
    from_port   = 2379
    to_port     = 2379
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
    description = "etcd client communication"
  }

  # Peer traffic only needs to flow between members of this group.
  ingress {
    from_port   = 2380
    to_port     = 2380
    protocol    = "tcp"
    self        = true
    description = "etcd peer communication"
  }

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
    description = "SSH access"
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# etcd member instances
module "etcd_instances" {
  source = "./modules/etcd_instance"
  count  = var.cluster_size

  name               = "etcd-node-${count.index + 1}"
  instance_type      = var.instance_type
  ami                = var.ami_id
  key_name           = var.key_name
  subnet_id          = element(var.subnet_ids, count.index)
  security_group_ids = [aws_security_group.etcd_cluster.id]

  # Member settings
  node_name     = "etcd-node-${count.index + 1}"
  cluster_size  = var.cluster_size
  cluster_token = var.cluster_token
}

# Classic load balancer in front of the client port
resource "aws_elb" "etcd_lb" {
  name            = "etcd-cluster-lb"
  subnets         = var.subnet_ids
  security_groups = [aws_security_group.etcd_cluster.id]

  listener {
    instance_port     = 2379
    instance_protocol = "http"
    lb_port           = 2379
    lb_protocol       = "http"
  }

  health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    timeout             = 3
    target              = "HTTP:2379/health"
    interval            = 30
  }

  # Assumes modules/etcd_instance/outputs.tf exports instance_id
  instances = module.etcd_instances[*].instance_id
}
```
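`subnet_id = element(var.subnet_ids, count.index)` wraps the index around the subnet list. With one subnet per availability zone, the members are therefore spread round-robin across zones. The bash sketch below mimics that behaviour for a hypothetical 5-member cluster over three subnets. The `element` function here is a stand-in written for illustration, not Terraform itself.

```bash
#!/usr/bin/env bash
# Mimic Terraform's element(list, index): the index wraps around modulo
# the list length, so members are spread round-robin over the subnets.
element() {
  local index=$1
  shift
  local -a list=("$@")
  echo "${list[index % ${#list[@]}]}"
}

# Hypothetical subnet IDs, one per availability zone.
subnets=(subnet-az-a subnet-az-b subnet-az-c)
for i in 0 1 2 3 4; do
  echo "etcd-node-$(( i + 1 )) -> $(element "$i" "${subnets[@]}")"
done
```

Two zones end up with two members each. Losing either of them still leaves 3 of 5 members, which is a quorum. With 3 members over only 2 subnets, one zone would hold 2 members, and losing that zone would cost the cluster its quorum.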
etcd instance module (modules/etcd_instance/main.tf)
```hcl
resource "aws_instance" "etcd_node" {
  ami                    = var.ami
  instance_type          = var.instance_type
  key_name               = var.key_name
  subnet_id              = var.subnet_id
  vpc_security_group_ids = var.security_group_ids

  tags = {
    Name = var.name
    Role = "etcd"
  }

  # A resource argument cannot reference the resource's own attributes
  # (self.private_ip is only valid in provisioner/connection blocks), so
  # user_data.sh looks up the private IP at boot instead.
  user_data = templatefile("${path.module}/user_data.sh", {
    node_name     = var.node_name
    cluster_size  = var.cluster_size
    cluster_token = var.cluster_token
  })

  root_block_device {
    volume_size = 50
    volume_type = "gp3"
  }
}

# Separate data volume; user_data.sh as written does not format or mount it.
resource "aws_ebs_volume" "etcd_data" {
  availability_zone = aws_instance.etcd_node.availability_zone
  size              = 100
  type              = "gp3"

  tags = {
    Name = "${var.name}-data"
  }
}

resource "aws_volume_attachment" "etcd_data_att" {
  device_name = "/dev/sdf"
  volume_id   = aws_ebs_volume.etcd_data.id
  instance_id = aws_instance.etcd_node.id
}
```
User data script (user_data.sh)
```bash
#!/bin/bash
# This file is rendered by Terraform's templatefile(), which interpolates
# dollar-brace sequences; plain shell variables therefore use the $VAR form.
set -euo pipefail

# Install dependencies
apt-get update
apt-get install -y curl tar

# Download and install etcd
ETCD_VERSION="3.5.0"
ETCD_URL="https://github.com/etcd-io/etcd/releases/download/v$ETCD_VERSION/etcd-v$ETCD_VERSION-linux-amd64.tar.gz"
curl -fsSL "$ETCD_URL" -o /tmp/etcd.tar.gz
tar xzf /tmp/etcd.tar.gz -C /tmp/
cp "/tmp/etcd-v$ETCD_VERSION-linux-amd64/etcd" /usr/local/bin/
cp "/tmp/etcd-v$ETCD_VERSION-linux-amd64/etcdctl" /usr/local/bin/

# Look up this instance's private IP from the metadata service (IMDSv2)
TOKEN=$(curl -fsS -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
PRIVATE_IP=$(curl -fsS -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/local-ipv4)

# Create the etcd system user and group
groupadd -r etcd
useradd -r -g etcd -d /var/lib/etcd -s /sbin/nologin etcd

# Create the data directory
mkdir -p /var/lib/etcd
chown etcd:etcd /var/lib/etcd
chmod 700 /var/lib/etcd

# Write the etcd environment file (node_name and cluster_token come from Terraform)
cat > /etc/etcd.conf <<EOF
ETCD_NAME=${node_name}
ETCD_DATA_DIR=/var/lib/etcd
ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380
ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379
ETCD_ADVERTISE_CLIENT_URLS=http://$PRIVATE_IP:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=http://$PRIVATE_IP:2380
ETCD_INITIAL_CLUSTER_TOKEN=${cluster_token}
ETCD_INITIAL_CLUSTER_STATE=new
EOF

# Initial cluster configuration
# ETCD_INITIAL_CLUSTER must list every member as name=peer-URL, which requires
# the other nodes' IP addresses; etcd will not bootstrap until it is added here.
# In production, supply it via service discovery or a configuration management tool.

# Create the systemd unit
cat > /etc/systemd/system/etcd.service <<EOF
[Unit]
Description=etcd key-value store
Documentation=https://github.com/etcd-io/etcd
After=network.target

[Service]
Type=notify
User=etcd
Group=etcd
EnvironmentFile=/etc/etcd.conf
ExecStart=/usr/local/bin/etcd
Restart=always
RestartSec=5
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF

# Start etcd
systemctl daemon-reload
systemctl enable etcd
systemctl start etcd
```
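The script leaves `ETCD_INITIAL_CLUSTER` unset. One way to fill that gap is to list the members as `name<TAB>private-IP` lines once all instances are running and convert them into the variable. The listing could come, for example, from an `aws ec2 describe-instances` query on the `Role = etcd` tag the module sets, which needs an instance profile allowed to call `ec2:DescribeInstances`. Other instances may not be up yet at boot, so that query usually needs a retry loop.

The bash sketch below covers only the conversion step. It uses the otherwise unused `cluster_size` value to refuse a partial member list and assumes peer port 2380. `build_initial_cluster` is a hypothetical name. If you paste it into user_data.sh, escape every dollar-brace sequence as `$${...}`, because templatefile() would otherwise try to interpolate it.

```bash
#!/usr/bin/env bash
# Read "name<TAB>private_ip" lines on stdin and print an ETCD_INITIAL_CLUSTER
# value. Fails unless exactly <expected_size> members were read, so a node
# never bootstraps with a partial member list.
build_initial_cluster() {
  local expected_size=$1
  local name ip
  local -a entries=()
  while IFS=$'\t' read -r name ip; do
    [[ -z "$name" ]] && continue
    entries+=("${name}=http://${ip}:2380")
  done
  if (( ${#entries[@]} != expected_size )); then
    echo "found ${#entries[@]} members, expected ${expected_size}" >&2
    return 1
  fi
  local IFS=','
  echo "${entries[*]}"
}
```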
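Automated Deployment Best Practices
1. Cluster health checks
Ansible tasks for verifying cluster health:
```yaml
- name: Verify etcd cluster health
  shell: |
    etcdctl endpoint health \
      --endpoints={{ etcd_nodes | map(attribute='client_url') | join(',') }}
  register: etcd_health
  changed_when: false
  run_once: true
  # etcdctl exits non-zero when any endpoint is unhealthy. Matching the
  # string 'healthy' in stdout would also match 'unhealthy'.
  failed_when: etcd_health.rc != 0

- name: List cluster members
  shell: |
    etcdctl member list \
      --endpoints={{ etcd_nodes | map(attribute='client_url') | join(',') }}
  register: etcd_members
  changed_when: false
  run_once: true

- name: Show cluster members
  debug:
    msg: "{{ etcd_members.stdout_lines }}"
  run_once: true
```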
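Right after a rollout, the members may still be electing a leader, so a single health check can fail spuriously. A generic bash retry helper can wait for the cluster before later steps run. The sketch below assumes the wrapped command reports health through its exit status, as `etcdctl endpoint health` does. `wait_for` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_for <attempts> <delay-seconds> <command> [args...]
wait_for() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    echo "attempt ${i}/${attempts} failed: $*" >&2
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}
```

For example: `wait_for 10 5 etcdctl endpoint health --endpoints=http://192.168.1.101:2379,http://192.168.1.102:2379,http://192.168.1.103:2379`.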
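2. Backup and restore automation
```yaml
- name: Create an etcd snapshot backup
  shell: |
    set -e
    mkdir -p /backup
    SNAPSHOT=/backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db
    etcdctl snapshot save "$SNAPSHOT" \
      --endpoints={{ etcd_nodes.0.client_url }}
    # Keep a stable name that points at the newest snapshot
    ln -sf "$SNAPSHOT" /backup/etcd-snapshot-latest.db
  when: inventory_hostname == etcd_nodes.0.name

- name: Verify snapshot integrity
  shell: |
    etcdctl snapshot status /backup/etcd-snapshot-latest.db
  register: snapshot_status
  changed_when: false
  when: inventory_hostname == etcd_nodes.0.name
```
In etcd 3.5, `etcdctl snapshot status` still works but is deprecated in favor of `etcdutl snapshot status`. `etcdutl` ships in the same release tarball, although the install tasks above copy only etcd and etcdctl.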
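Timestamped snapshots accumulate, so backups usually come with a retention policy. The bash sketch below assumes the `etcd-snapshot-YYYYmmdd-HHMMSS.db` naming used above, whose timestamps sort lexicographically in chronological order. `prune_snapshots` is a hypothetical name. It selects regular files only (`find -type f`), so symlinks are left alone.

```bash
#!/usr/bin/env bash
# Keep only the newest <keep> etcd-snapshot-*.db files in <dir>.
prune_snapshots() {
  local dir=$1 keep=$2
  local -a snaps=()
  local f
  # Oldest first, because the timestamp in the name sorts chronologically.
  while IFS= read -r f; do
    snaps+=("$f")
  done < <(find "$dir" -maxdepth 1 -type f -name 'etcd-snapshot-*.db' | sort)
  local excess=$(( ${#snaps[@]} - keep ))
  local i
  for (( i = 0; i < excess; i++ )); do
    rm -f -- "${snaps[i]}"
    echo "removed ${snaps[i]}"
  done
}
```

For example, `prune_snapshots /backup 7` keeps the seven most recent snapshots.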
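3. Monitoring and alerting
Prometheus integration: by default, etcd serves Prometheus metrics at `/metrics` on its client URLs. `--listen-metrics-urls` (`ETCD_LISTEN_METRICS_URLS`) can expose them on a separate listener. The tasks below assume two things: an `etcd-metrics.j2` template that renders such settings, and a unit that actually loads the resulting file (for example through `EnvironmentFile=`). etcd does not read `/etc/etcd/metrics.conf` on its own.
```yaml
- name: Configure etcd metrics settings
  template:
    src: etcd-metrics.j2
    dest: /etc/etcd/metrics.conf

- name: Restart etcd to apply the metrics settings
  systemd:
    name: etcd
    state: restarted
  when: etcd_metrics_enabled | default(false)
```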
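The Prometheus text format is easy to parse for a quick check or a simple alert outside Prometheus. The bash sketch below prints the value of an unlabelled metric read on stdin. One example is `etcd_server_has_leader`, which is 1 when the member sees a leader and 0 otherwise. The helper name `metric_value` is hypothetical, and labelled series (`name{...}`) are not matched.

```bash
#!/usr/bin/env bash
# Print the value of an unlabelled metric from Prometheus text format on
# stdin. Exits non-zero if the metric is absent.
metric_value() {
  local metric=$1
  awk -v m="$metric" '$1 == m { print $2; found = 1; exit } END { exit !found }'
}
```

For example: `curl -s http://192.168.1.101:2379/metrics | metric_value etcd_server_has_leader`. In Prometheus itself, an alert on `etcd_server_has_leader == 0` covers the same condition.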
Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.



