Prometheus High-Availability Deployment Plan

钧澜铭枢╮

Last modified: 2025-02-10 22:38:53



First published: 2025-02-10 22:29:43

Article link: https://blog.youkuaiyun.com/qq_56446725/article/details/145559647


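1. Architecture Design

1.1 Server Planning

This plan uses five servers to build a Prometheus monitoring system that is highly available and processes data efficiently. Each server has a clearly defined role: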

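Server IP	Role	Notes
192.168.47.136	Prometheus Server 1 + Alertmanager 1 + Thanos Sidecar	Primary Prometheus server; handles data collection and alert management, with an integrated Thanos Sidecar for long-term data storage and processing
192.168.47.137	Prometheus Server 2 + Alertmanager 2 + Thanos Sidecar	Standby Prometheus server; works alongside the primary to provide high availability, with its Thanos Sidecar keeping data handling consistent
192.168.47.138	Exporter management node + Grafana + Consul Server	Hosts the Exporters centrally, runs Grafana for data visualization, and runs a Consul Server for service discovery
192.168.47.139	Consul Server	Forms a production cluster with the other Consul nodes to make service discovery more reliable
192.168.47.140	Consul Server	Forms a production cluster with the other Consul nodes to make service discovery more reliable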

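1.2 Architecture Topology

A load balancer distributes requests across the Prometheus Server instances, Thanos Query aggregates queries, and Thanos Store serves historical data. Together they form a highly available monitoring architecture: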

graph LR;
A[Load Balancer] --> B[Thanos Query];
B --> C[Thanos Store];
A --> D[Prometheus Server1];
A --> E[Prometheus Server2];
D --> F[Thanos Sidecar1];
E --> G[Thanos Sidecar2];
style A fill:#f9f,stroke:#333,stroke-width:2px;
style B fill:#def,stroke:#333,stroke-width:2px;
style C fill:#fde,stroke:#333,stroke-width:2px;
style D fill:#efe,stroke:#333,stroke-width:2px;
style E fill:#efe,stroke:#333,stroke-width:2px;
style F fill:#eef,stroke:#333,stroke-width:2px;
style G fill:#eef,stroke:#333,stroke-width:2px;

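1.3 Architecture Design Summary

With clear server roles and a sound topology, the servers work together to make the Prometheus monitoring system highly available and scalable, so that data is collected, stored, and queried reliably.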

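2. Server Performance Requirements

2.1 Prometheus Server (192.168.47.136 and 192.168.47.137)

Prometheus Server handles data collection, processing, and storage, so its hardware requirements are relatively high:

CPU: at least 2 cores; 4 or more recommended. Additional cores improve throughput, especially with many monitoring targets or complex queries.
Memory: at least 8 GB. Memory is used to cache and process monitoring data; more memory keeps the system responsive as data volume and query complexity grow.
Disk: SSD recommended; size it by daily data volume and retention period. For example, at about 1 GB of data per day with 30 days of retention, at least 30 GB of free space is needed. A real deployment should add 50% headroom, which brings the total to about 45 GB.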
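The disk estimate above is plain arithmetic: daily volume times retention days, plus a headroom percentage. A minimal sketch of that calculation follows; the function name `required_disk_gb` is hypothetical, and the inputs are the example figures from the text rather than measurements.

```shell
#!/bin/sh
# required_disk_gb DAILY_GB RETENTION_DAYS HEADROOM_PERCENT
# Prints the disk space to provision, rounded up to a whole GB.
# (Hypothetical helper; whole-number inputs only.)
required_disk_gb() {
  base_gb=$(( $1 * $2 ))
  # Integer ceiling of base_gb * (100 + headroom) / 100
  echo $(( (base_gb * (100 + $3) + 99) / 100 ))
}

# Example from the text: 1 GB/day, 30 days, 50% headroom
required_disk_gb 1 30 50   # prints 45
```

Rounding up rather than down keeps the estimate on the safe side when the headroom does not divide evenly.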

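2.2 Exporter Management Node + Grafana (192.168.47.138)

This node mainly handles data collection and visualization, so its requirements are lower:

CPU: at least 2 cores, which covers basic Grafana visualization and Exporter collection. With many targets and metrics, 4 cores are recommended.
Memory: at least 4 GB. Grafana caching and rendering and the running Exporters all consume memory. If you run several Exporters or build complex dashboards, increase this to 8 GB.
Disk: SSD recommended. Grafana itself stores little data, so the main consumers are Exporter data and logs. Without heavy extra logging, about 20 GB is usually enough; leave headroom here as well.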

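2.3 Consul Server (192.168.47.138, 192.168.47.139, 192.168.47.140)

Consul Server provides service discovery and has relatively low requirements:

CPU: at least 1 core; 2 or more recommended.
Memory: at least 2 GB; 4 GB or more recommended.
Disk: SSD recommended; 10 GB or more is sufficient.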

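2.4 Performance Requirements Summary

CPU, memory, and disk requirements are set according to each server's role so that every component runs stably. Adjust them to your monitoring scale and data volume in an actual deployment.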

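3. Deployment Steps

3.1 Preparation

All five servers run Ubuntu 24.04. Perform the following common steps on each of them:

3.1.1 System Update and Dependency Installation

Run the following commands to update the system and install dependencies:

sudo apt update
sudo apt upgrade -y
sudo apt install -y curl wget unzip openssl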

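3.1.2 Firewall Configuration

Open the ports the components need to communicate with each other. Allow SSH before enabling the firewall so the current session is not cut off:

sudo ufw allow 9090/tcp # Prometheus Server
sudo ufw allow 9093/tcp # Alertmanager
sudo ufw allow 9094/tcp # Alertmanager cluster port
sudo ufw allow 9094/udp # Alertmanager cluster gossip also uses UDP
sudo ufw allow 9100/tcp # Node Exporter
sudo ufw allow 3000/tcp # Grafana
sudo ufw allow 8500/tcp # Consul HTTP API
sudo ufw allow 8300/tcp # Consul server RPC (needed between Consul Servers)
sudo ufw allow 8301 # Consul Serf LAN gossip (TCP and UDP)
sudo ufw allow 10901/tcp # Thanos Sidecar gRPC
sudo ufw allow 10902/tcp # Thanos Sidecar HTTP
sudo ufw allow 10903/tcp # Thanos Query gRPC
sudo ufw allow 10904/tcp # Thanos Query HTTP
sudo ufw allow 22/tcp
sudo ufw enable
sudo ufw status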
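The rules above open each port to any source address. A hedged alternative is to restrict the monitoring ports to the servers' own subnet. The sketch below only prints the commands for review and executes nothing; the subnet `192.168.47.0/24` is an assumption inferred from the server IPs, and `print_ufw_rules` is a hypothetical helper name.

```shell
#!/bin/sh
# Assumption: all five servers sit in this subnet (inferred from their IPs).
MONITORING_SUBNET="192.168.47.0/24"

# Prints one subnet-restricted ufw command per monitoring port.
# Dry run only: review the output before running any of it.
print_ufw_rules() {
  while IFS=: read -r port desc; do
    echo "sudo ufw allow from ${MONITORING_SUBNET} to any port ${port} proto tcp  # ${desc}"
  done <<'EOF'
9090:Prometheus Server
9093:Alertmanager
9094:Alertmanager cluster port
9100:Node Exporter
3000:Grafana
8500:Consul
10901:Thanos Sidecar gRPC
10902:Thanos Sidecar HTTP
10903:Thanos Query gRPC
10904:Thanos Query HTTP
EOF
}

print_ufw_rules
```

Keeping the port list as data makes it easier to audit than a run of individual commands. SSH (22/tcp) is deliberately left out of the list so that remote access is still governed by its own rule.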

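3.1.3 Generate a TLS Certificate (for servers that need secure communication, such as Prometheus Server)

Run the following commands to generate a self-signed TLS certificate. Keep the private key readable by its owner only; the certificate itself can be world-readable:

sudo openssl req -x509 -newkey rsa:2048 -keyout prometheus.key -out prometheus.pem -days 365 -nodes -subj "/CN=prometheus.local"
sudo chmod 600 prometheus.key # private key: owner only; chown it to the service user once that user exists
sudo chmod 644 prometheus.pem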
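Because the certificate is generated with a 365-day lifetime, it is worth checking its subject and remaining validity after creation and again before it expires. The following is a minimal sketch using standard `openssl x509` options; the helper names `cert_subject` and `cert_valid_for_days` are hypothetical.

```shell
#!/bin/sh
# cert_subject CERT_FILE
# Prints the subject line of the certificate.
cert_subject() {
  openssl x509 -in "$1" -noout -subject
}

# cert_valid_for_days CERT_FILE DAYS
# Exits 0 if the certificate is still valid DAYS days from now, non-zero otherwise.
cert_valid_for_days() {
  openssl x509 -in "$1" -noout -checkend "$(( $2 * 86400 ))" >/dev/null
}

# Example usage against the file generated above:
#   cert_subject prometheus.pem
#   cert_valid_for_days prometheus.pem 30 || echo "prometheus.pem expires within 30 days"
```

The 30-day threshold in the usage comment is only an example; choose a window that matches how quickly you can rotate the certificate.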

3.2 Prometheus Server Deployment (192.168.47.136 and 192.168.47.137)

3.2.1 Download and Extract

Use scp to upload the archive to /tmp on both servers. Then, on each server, extract it, create the working and data directories, create a dedicated user, move the extracted files into place, and finally set ownership. Ownership is set after the move so that the moved files also belong to the prometheus user:

# Upload the archive (run on your workstation)
scp ~/Downloads/prometheus-3.2.0-rc.1.linux-amd64.tar.gz root@192.168.47.136:/tmp
scp ~/Downloads/prometheus-3.2.0-rc.1.linux-amd64.tar.gz root@192.168.47.137:/tmp

# Run all remaining steps on each server in turn (192.168.47.136, then 192.168.47.137)
ssh root@192.168.47.136

# Extract the archive
cd /tmp
tar -zxvf prometheus-3.2.0-rc.1.linux-amd64.tar.gz

# Create directories
mkdir -p /opt/prometheus
mkdir -p /var/lib/prometheus

# Create the service user
useradd --no-create-home --shell /bin/false prometheus

# Move the files
mv /tmp/prometheus-3.2.0-rc.1.linux-amd64/* /opt/prometheus/

# Set ownership
chown -R prometheus:prometheus /opt/prometheus
chown -R prometheus:prometheus /var/lib/prometheus
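Since the archive is copied by hand from a workstation, it can be worth confirming its integrity before extracting it. The sketch below compares a file's SHA-256 digest with an expected value; `verify_sha256` is a hypothetical helper, and the expected digest must come from the project's published checksum list for the release, which is not reproduced here.

```shell
#!/bin/sh
# verify_sha256 FILE EXPECTED_HEX
# Exits 0 if FILE's SHA-256 digest equals EXPECTED_HEX, non-zero otherwise.
verify_sha256() {
  actual=$(sha256sum "$1" | cut -d' ' -f1)
  [ "$actual" = "$2" ]
}

# Example usage (replace <expected-digest> with the published value):
#   verify_sha256 /tmp/prometheus-3.2.0-rc.1.linux-amd64.tar.gz "<expected-digest>" \
#     || echo "checksum mismatch, do not install"
```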

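3.2.2 Configuration File Tuning

Set the parameters according to each server's IP so that data collection and federation work correctly.

Because the extracted files were moved directly into /opt/prometheus, the configuration file is /opt/prometheus/prometheus.yml. Configure it as follows (192.168.47.136 shown as the example):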

global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
  external_labels:
    replica: "1"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"]

On 192.168.47.137, set replica to "2" in /opt/prometheus/prometheus.yml and keep the rest of the configuration identical.
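The two servers' configurations differ only in the `replica` external label, so the second file can be derived from the first instead of being edited by hand. This is a minimal sketch, assuming the label is written on its own line as `replica: "<value>"`; `render_replica_config` is a hypothetical helper name.

```shell
#!/bin/sh
# render_replica_config CONFIG_FILE REPLICA_ID
# Prints CONFIG_FILE to stdout with the value of the `replica` label replaced.
# The input file is not modified.
render_replica_config() {
  sed -E "s/^([[:space:]]*replica:[[:space:]]*)\"[^\"]*\"/\1\"$2\"/" "$1"
}

# Example usage:
#   render_replica_config prometheus.yml 2 > prometheus-replica2.yml
#   ./promtool check config prometheus-replica2.yml   # promtool ships in the Prometheus archive
```

Keeping the label distinct on each server is what lets Thanos Query tell the two copies of each series apart; its `--query.replica-label` flag names the label to deduplicate on.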

3.2.3 Create a systemd Unit File and Register Prometheus as a System Service

Create /etc/systemd/system/prometheus.service with the following content:

[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple

Exe
