Background
Used for monitoring containers and the hosts they run on.
Installing Prometheus
Download the desired version from the official site: https://prometheus.io/download/
wget https://github.com/prometheus/prometheus/releases/download/v2.24.0/prometheus-2.24.0.linux-amd64.tar.gz
tar xf prometheus-2.24.0.linux-amd64.tar.gz
cp prometheus-2.24.0.linux-amd64/prometheus /usr/local/bin/
Create the systemd unit file
vim /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Server
After=network.target
Documentation=https://prometheus.io/docs/introduction/overview/
[Service]
Type=simple
WorkingDirectory=/home/data/prometheus/
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--web.read-timeout=5m \
--web.max-connections=512 \
--storage.tsdb.retention=15d \
--storage.tsdb.path=/home/data/prometheus \
--query.timeout=2m
Restart=on-failure
[Install]
WantedBy=multi-user.target
Set up the configuration file
mkdir /etc/prometheus
cp prometheus-2.24.0.linux-amd64/prometheus.yml /etc/prometheus
The bundled default configuration is enough to start with.
Start
systemctl daemon-reload
systemctl start prometheus
systemctl enable prometheus
However, I did not start Prometheus this way here; instead I started it with Docker.
For convenience, I also downloaded node_exporter up front.
node_exporter
Download from: https://prometheus.io/download/
wget https://github.com/prometheus/node_exporter/releases/download/v1.2.0/node_exporter-1.2.0.linux-amd64.tar.gz
tar xf node_exporter-1.2.0.linux-amd64.tar.gz
cp node_exporter-1.2.0.linux-amd64/node_exporter /usr/local/bin/
scp node_exporter-1.2.0.linux-amd64/node_exporter k8s2:/usr/local/bin/
scp node_exporter-1.2.0.linux-amd64/node_exporter k8s3:/usr/local/bin/
Create the systemd unit file
vim /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
After=network.target
Documentation=https://prometheus.io/docs/guides/node-exporter/
[Service]
Type=simple
WorkingDirectory=/tmp/
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
Start it directly on the machines; I placed this unit file on k8s2 and k8s3 as well, and all three machines need to run the following.
systemctl daemon-reload
systemctl start node_exporter
systemctl enable node_exporter
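To confirm each exporter is serving data, a quick check against its metrics endpoint (a sketch, assuming the default port 9100):
curl -s http://localhost:9100/metrics | head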
After it is running, add the targets to the Prometheus configuration file as follows:
# my global config
global:
scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
# scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
# - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
# - "first_rules.yml"
# - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
# The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
- job_name: 'prometheus'
# metrics_path defaults to '/metrics'
# scheme defaults to 'http'.
static_configs:
- targets: ['localhost:9090']
- job_name: 'node_export'
static_configs:
- targets:
- k8s1:9100
- k8s2:9100
- k8s3:9100
Start Prometheus with Docker
docker run --name prometheus -d -p 0.0.0.0:9090:9090 -v /etc/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml -v /etc/hosts:/etc/hosts quay.io/prometheus/prometheus
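Either way, a quick sanity check that Prometheus is reachable and sees its targets (a sketch, assuming the default port 9090):
curl -s http://localhost:9090/-/healthy
curl -s http://localhost:9090/api/v1/targets | head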
grafana
wget https://dl.grafana.com/oss/release/grafana-8.0.6-1.x86_64.rpm
sudo yum install grafana-8.0.6-1.x86_64.rpm
systemctl enable grafana-server
systemctl start grafana-server
The default port is 3000.
Add a data source:
Point it at Prometheus, port 9090 by default.
Import a dashboard template
Import a template by simply entering its ID. For more official dashboards, see:
https://grafana.com/grafana/dashboards?orderBy=name&direction=asc
View metrics
Common node_exporter queries in Prometheus
Inbound (download) bandwidth, in kbit/s (bytes/s * 8 / 1024, hence the /128):
sum by(instance)(irate(node_network_receive_bytes_total{instance="k8s1:9100",device!~"bond.*?|lo"}[5m])/128)
Outbound (upload) bandwidth, in kbit/s:
sum by(instance)(irate(node_network_transmit_bytes_total{instance="k8s1:9100",device!~"bond.*?|lo"}[5m])/128)
NIC inbound packet rate:
sum by(instance)(rate(node_network_receive_packets_total{instance="k8s1:9100",device!~"lo"}[5m]))
NIC outbound packet rate:
sum by(instance)(rate(node_network_transmit_packets_total{instance="k8s1:9100",device!~"lo"}[5m]))
15-minute load average:
node_load15{instance="k8s1:9100"}
Free memory (MiB):
node_memory_MemFree_bytes/1024/1024
Free disk space (MiB):
node_filesystem_free_bytes{instance="k8s1:9100"}/1024/1024
Metric names may differ between node_exporter versions, so some of these may need adjusting. Corrections are welcome, thanks.
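Besides graphing them in Grafana, these expressions can be tested directly against the Prometheus HTTP API (a sketch, assuming Prometheus on localhost:9090):
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=node_load15{instance="k8s1:9100"}'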
pushgateway
Write the pushgateway scrape config into the Prometheus configuration file:
# my global config
global:
scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
# scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
# - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
# - "first_rules.yml"
# - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
# The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
- job_name: 'prometheus'
# metrics_path defaults to '/metrics'
# scheme defaults to 'http'.
static_configs:
- targets: ['localhost:9090']
- job_name: 'node_export'
static_configs:
- targets:
- k8s1:9100
- k8s2:9100
- k8s3:9100
- job_name: 'pushgateway'
honor_labels: true # keep the job/instance labels carried by pushed metrics instead of overwriting them with this scrape job's labels
static_configs:
- targets: ['192.168.0.5:9091']
labels:
instance: pushgateway
I did not hot-reload the configuration here; instead I simply restarted the Prometheus Docker container (commands omitted).
Then start the pushgateway container on its machine.
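For reference, hot reloading is also possible; a sketch (the HTTP endpoint only works if Prometheus was started with --web.enable-lifecycle, otherwise send SIGHUP):
# HTTP reload (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
# or signal the process / the container
kill -HUP $(pidof prometheus)
docker kill --signal=HUP prometheus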
docker run --name pushgateway -d -p 0.0.0.0:9091:9091 -v /etc/hosts:/etc/hosts prom/pushgateway:latest
If the image is not present locally, just pull it first.
Open the Prometheus web UI and the pushgateway target should now appear.
Normally we push data to the pushgateway with a client SDK, but we can also manage it directly through its HTTP API, for example:
echo "some_metric 3.14" | curl --data-binary @- http://localhost:9091/metrics/job/some_job
--data-binary sends the payload exactly as given (binary-safe); note that curl sends it as a POST request.
cat <<EOF | curl --data-binary @- http://localhost:9091/metrics/job/some_job/instance/some_instance
# TYPE some_metric counter
some_metric{label="val1"} 42
# TYPE another_metric gauge
# HELP another_metric Just an example.
another_metric 2398.283
EOF
Note: the payload must follow the Prometheus exposition format.
Delete all metrics of a given instance within a job group:
curl -X DELETE http://localhost:9091/metrics/job/some_job/instance/some_instance
Delete all metrics of a job group:
curl -X DELETE http://localhost:9091/metrics/job/some_job
As you can see, data in the pushgateway is usually grouped by job and instance, so these two path parameters are essential.
When Prometheus scrapes the pushgateway, the scrape config also assigns job and instance labels, but those only identify the pushgateway instance itself and say nothing about the origin of the pushed data. That is why honor_labels: true must be set on the pushgateway job: it keeps the job and instance labels carried by the pushed data from being overwritten.
Note: to avoid losing data when the pushgateway restarts or crashes unexpectedly, you can persist its state with the --persistence.file and --persistence.interval flags.
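For example, the pushgateway container above could be started with persistence enabled; a sketch, where the host directory /home/data/pushgateway is a placeholder of my own choosing:
docker run --name pushgateway -d -p 0.0.0.0:9091:9091 -v /etc/hosts:/etc/hosts -v /home/data/pushgateway:/data prom/pushgateway:latest --persistence.file=/data/pushgateway.data --persistence.interval=5m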
DingTalk alerting
First, prepare the configuration files.
[root@emr-header-1 prometheus]# cat rules/rules.yml
groups:
- name: pushgateway
rules:
- alert: server_status
expr: up{job="pushgateway"} == 0
for: 10s
labels:
severity: page
annotations:
summary: "机器{{$labels.instance}}挂了"
description: "注释信息:{{$labels.instance}}挂了"
[root@emr-header-1 prometheus]# cat prometheus.yml
# my global config
global:
scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
# scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- localhost:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
# - "first_rules.yml"
# - "second_rules.yml"
- "rules/*"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
# The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
- job_name: 'prometheus'
# metrics_path defaults to '/metrics'
# scheme defaults to 'http'.
static_configs:
- targets: ['localhost:9090']
- job_name: 'node_export'
static_configs:
- targets:
- k8s1:9100
- k8s2:9100
- k8s3:9100
- job_name: 'pushgateway'
static_configs:
- targets: ['192.168.0.5:9091']
labels:
instance: pushgateway
[root@emr-header-1 prometheus]# cat /etc/alertmanager/alertmanager.yml
global:
resolve_timeout: 5m
route:
group_by: ['alertname']
group_wait: 10s
group_interval: 10s
repeat_interval: 1m
receiver: 'web.hook'
receivers:
- name: 'web.hook'
webhook_configs:
- send_resolved: true
url: 'http://localhost:8060/dingtalk/webhook1/send'
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'dev', 'instance']
Above we prepared the Prometheus configuration file, which includes the location of the rules directory; in the rule file we defined a custom alert on the pushgateway job. In the Alertmanager configuration we set the webhook receiver's port, which belongs to the DingTalk plugin, so we also need the prometheus-webhook-dingtalk plugin.
You can find it by searching GitHub; I simply pulled the Docker image.
So everything here is started with Docker as well.
# start prometheus
docker run --name prometheus -d --net="host" -p 0.0.0.0:9090:9090 -v /etc/prometheus/:/etc/prometheus/ -v /etc/hosts:/etc/hosts quay.io/prometheus/prometheus
# start alertmanager
docker run --name alertmanager -d --net="host" -p 9093:9093 -v /etc/alertmanager/:/etc/alertmanager/ -v /etc/hosts:/etc/hosts prom/alertmanager:latest --config.file=/etc/alertmanager/alertmanager.yml
# start the DingTalk plugin
docker run -d --name dingtalk --net="host" --restart always -p 8060:8060 timonwong/prometheus-webhook-dingtalk:master --ding.profile="webhook1=https://oapi.dingtalk.com/robot/send?access_token=6dd77b2f7a27fb59fec031e4e03ad1917cc06f0c95760217a0exxxx"
Note that the DingTalk robot's keyword filter must include a keyword that appears in your alert messages, otherwise delivery will fail.
Alert silencing
In the Alertmanager UI, create a silence that matches the labels written in the rules above; you can choose the silence duration.
Cancelling (expiring) a silence is done from the same page (screenshots omitted).
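Silences can also be managed from the command line with amtool, which is included in the Alertmanager release (a sketch, assuming Alertmanager listens on localhost:9093):
# create a 2-hour silence matching the rule's labels
amtool silence add alertname=server_status severity=page --duration=2h --comment="maintenance" --alertmanager.url=http://localhost:9093
# list active silences
amtool silence query --alertmanager.url=http://localhost:9093
# expire (cancel) a silence by ID
amtool silence expire <silence-id> --alertmanager.url=http://localhost:9093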
blackbox_exporter
blackbox_exporter can be used to monitor ports, web pages, and the like. The official documentation explains it far better than I can; below is the concrete setup.
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.14.0/blackbox_exporter-0.14.0.linux-amd64.tar.gz
After extracting the tarball:
mv blackbox_exporter-0.14.0.linux-amd64 /usr/local/blackbox_exporter
vim /lib/systemd/system/blackbox_exporter.service
[Unit]
Description=blackbox_exporter
After=network.target
[Service]
User=root
Type=simple
ExecStart=/usr/local/blackbox_exporter/blackbox_exporter --config.file=/usr/local/blackbox_exporter/blackbox.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target
Start
systemctl daemon-reload
systemctl start blackbox_exporter
[root@emr-header-1 ~]# netstat -lntup|grep 9115
tcp 0 0 0.0.0.0:9115 0.0.0.0:* LISTEN 26214/blackbox_expo
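The scrape jobs below use the http_2xx, icmp, and tcp_connect modules; the bundled blackbox.yml already defines modules of roughly this shape (a sketch, not the complete default file):
modules:
  http_2xx:
    prober: http
    timeout: 5s
  icmp:
    prober: icmp   # ICMP probes need root or CAP_NET_RAW; the unit above runs as root
  tcp_connect:
    prober: tcp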
Prometheus configuration
[root@emr-header-1 rules]# cat ../prometheus.yml
# my global config
global:
scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
# scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- localhost:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
# - "first_rules.yml"
# - "second_rules.yml"
- "rules/*"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
# The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
- job_name: 'prometheus'
# metrics_path defaults to '/metrics'
# scheme defaults to 'http'.
static_configs:
- targets: ['localhost:9090']
- job_name: 'node_export'
static_configs:
- targets:
- k8s1:9100
- k8s2:9100
- k8s3:9100
- job_name: 'pushgateway'
static_configs:
- targets: ['192.168.0.5:9091']
labels:
instance: pushgateway
- job_name: 'http_status'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets: ['http://www.baidu.com']
labels:
instance: http_status
group: web
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- target_label: __address__
replacement: 192.168.0.1:9115
- job_name: 'ping_status'
metrics_path: /probe
params:
module: [icmp]
static_configs:
- targets:
- 192.168.0.1
- 192.168.0.2
- 192.168.0.3
labels:
instance: 'ping_status'
group: 'icmp'
relabel_configs:
- source_labels: [__address__]
regex: (.*)(:80)?
target_label: __param_target
replacement: ${1}
- source_labels: [__param_target]
regex: (.*)
target_label: ping
replacement: ${1}
- source_labels: []
regex: .*
target_label: __address__
replacement: 192.168.0.1:9115
- job_name: 'nodemanager'
metrics_path: /probe
params:
module: [tcp_connect]
static_configs:
- targets:
- 192.168.0.2:8042
- 192.168.0.3:8042
labels:
instance: 'nodemanager_status'
group: 'port'
relabel_configs:
- source_labels: [__address__]
regex: (.*)(:80)?
target_label: __param_target
replacement: ${1}
- source_labels: [__param_target]
regex: (.*)
target_label: nodemanager
replacement: ${1}
- source_labels: []
regex: .*
target_label: __address__
replacement: 192.168.0.1:9115
- job_name: 'datanode'
metrics_path: /probe
params:
module: [tcp_connect]
static_configs:
- targets:
- 192.168.0.2:50010
- 192.168.0.3:50010
labels:
instance: 'datanode_status'
group: 'port'
relabel_configs:
- source_labels: [__address__]
regex: (.*)(:80)?
target_label: __param_target
replacement: ${1}
- source_labels: [__param_target]
regex: (.*)
target_label: datanode
replacement: ${1}
- source_labels: []
regex: .*
target_label: __address__
replacement: 192.168.0.1:9115
Alert rules
[root@emr-header-1 rules]# cat cpu_over.yml
groups:
- name: CPU报警规则
rules:
- alert: CPU使用率告警
expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[1m]) )) * 100 > 50
for: 1m
labels:
severity: warning
annotations:
summary: "CPU使用率正在飙升。"
description: "CPU使用率超过50%(当前值:{{ $value }}%)"
[root@emr-header-1 rules]# cat disk_over.yml
groups:
- name: 磁盘使用率报警规则
rules:
- alert: 磁盘使用率告警
expr: 100 - node_filesystem_free_bytes{fstype=~"xfs|ext4"} / node_filesystem_size_bytes{fstype=~"xfs|ext4"} * 100 > 80
for: 10m
labels:
severity: warning
annotations:
summary: "硬盘分区使用率过高"
description: "分区使用大于80%(当前值:{{ $value }}%)"
[root@emr-header-1 rules]# cat memory_over.yml
groups:
- name: 内存告警规则
rules:
- alert: 内存使用率告警
expr: (1 - (node_memory_MemAvailable_bytes / (node_memory_MemTotal_bytes))) * 100 > 60
for: 30m
labels:
severity: warning
annotations:
summary: "服务器可用内存不足"
description: "机器{{$labels.instance}}内存使用率已经超过60%(当前值{{$value}}%)"
[root@emr-header-1 rules]# cat node_alived.yml
groups:
- name: 实例存活告警规则
rules:
- alert: 实例存活告警
expr: up == 0
for: 1m
labels:
user: prometheus
severity: warning
annotations:
summary: "主机宕机 !!!"
description: "主机{{$labels.instance}}已经宕机超过一分钟了。"
[root@emr-header-1 rules]# cat pushgateway_down.yml
groups:
- name: pushgateway告警
rules:
- alert: pushgateway告警
expr: up{job="pushgateway"} == 0
for: 10s
labels:
severity: page
annotations:
summary: "机器{{$labels.instance}}挂了"
description: "注释信息:{{$labels.instance}}挂了"
[root@emr-header-1 rules]# cat nodemanager_down.yml
groups:
- name: nodemanager告警
rules:
- alert: nodemanager is down
expr: probe_success{group="port",instance="nodemanager_status",job="nodemanager"} == 0
for: 10s
labels:
severity: nodemanager
annotations:
summary: "nodemanager unavailable"
description: "{{$labels.nodemanager}}服务挂了"
[root@emr-header-1 rules]# cat datanode_down.yml
groups:
- name: datanode告警
rules:
- alert: datanode is down
expr: probe_success{group="port",instance="datanode_status",job="datanode"} == 0
for: 10s
labels:
severity: datanode
annotations:
summary: "datanode unavailable"
description: "{{$labels.datanode}}服务挂了"
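Before loading them, the rule files can be validated with promtool, which ships in the Prometheus tarball (a sketch, assuming the rules live under /etc/prometheus/rules):
promtool check rules /etc/prometheus/rules/*.yml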
Alertmanager configuration
[root@emr-header-1 alertmanager]# cat alertmanager.yml
global:
resolve_timeout: 5m
route:
group_by: ['alertname']
group_wait: 10s
group_interval: 10s
repeat_interval: 1m
receiver: 'web.hook'
receivers:
- name: 'web.hook'
webhook_configs:
- send_resolved: true
url: 'http://localhost:8060/dingtalk/webhook1/send'
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'dev', 'instance']
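The Alertmanager configuration can likewise be checked with amtool before starting (a sketch):
amtool check-config /etc/alertmanager/alertmanager.yml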
Start
Start Prometheus
docker run --name prometheus -d --net="host" -p 0.0.0.0:9090:9090 -v /etc/prometheus/:/etc/prometheus/ -v /etc/hosts:/etc/hosts quay.io/prometheus/prometheus
Start Alertmanager
docker run --name alertmanager -d --net="host" -p 9093:9093 -v /etc/alertmanager/:/etc/alertmanager/ -v /etc/hosts:/etc/hosts prom/alertmanager:latest --config.file=/etc/alertmanager/alertmanager.yml
Start the DingTalk plugin
docker run -d --name dingtalk --net="host" --restart always -p 8060:8060 timonwong/prometheus-webhook-dingtalk:master --ding.profile="webhook1=https://oapi.dingtalk.com/robot/send?access_token=6dd77b2f7a27fb59fec031e4e03ad1917cc06f0cxxx217a0edf01fxxxx"
Again, the DingTalk robot will only deliver alerts whose content matches its configured keyword.
mysqld_exporter
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.12.1/mysqld_exporter-0.12.1.linux-amd64.tar.gz
tar xf mysqld_exporter-0.12.1.linux-amd64.tar.gz
mv mysqld_exporter-0.12.1.linux-amd64 /usr/local/mysqld_exporter
cd /usr/local/mysqld_exporter/
# set the connection credentials
vi .my.cnf
[client]
user=exporter
password=1qaz@WSX
# log in to MySQL, create the user, and grant privileges
CREATE USER 'exporter'@'localhost' IDENTIFIED BY '1qaz@WSX';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
Start the service
The default port is 9104.
nohup ./mysqld_exporter --config.my-cnf=.my.cnf &
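A quick check that the exporter can actually reach MySQL (a sketch; mysql_up should be 1 when the connection works):
curl -s http://localhost:9104/metrics | grep '^mysql_up'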
Configure Prometheus
- job_name: 'mysql-exporter'
static_configs:
- targets: ['k8s1:9104']
labels:
instance: mysql
Then restart Prometheus or hot-reload its configuration.
Grafana dashboard template ID: 11323
cadvisor
Used for monitoring Docker containers.
docker pull google/cadvisor:latest
docker run --privileged=true --volume=/:/rootfs:ro --volume=/var/run:/var/run:rw --volume=/sys:/sys:ro --volume=/var/lib/docker/:/var/lib/docker:ro --publish=38080:8080 --detach=true --name=cadvisor google/cadvisor:latest
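cAdvisor serves both a web UI and Prometheus metrics on the published port; a quick check (a sketch):
curl -s http://localhost:38080/metrics | grep '^container_' | head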
Configure Prometheus
- job_name: docker
static_configs:
- targets: ['192.168.0.1:38080']
labels:
instance: cadvisor
Restart Prometheus (kill -HUP <pid>) or hot-reload the configuration.
consul
Service auto-discovery; I will skip the explanation and go straight to the configuration.
docker pull consul
docker run --name consul -d -p 8500:8500 consul
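Once Consul is up, the services registered below can be inspected through its HTTP API (a sketch):
curl -s http://localhost:8500/v1/agent/services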
[root@emr-header-1 prometheus]# cat prometheus.yml
# my global config
global:
scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
# scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- localhost:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
# - "first_rules.yml"
# - "second_rules.yml"
- "rules/*"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
# The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
- job_name: 'prometheus'
# metrics_path defaults to '/metrics'
# scheme defaults to 'http'.
static_configs:
- targets: ['k8s1:9090']
- job_name: 'node_export'
static_configs:
- targets:
- k8s1:9100
- k8s2:9100
- k8s3:9100
# - job_name: 'consul-prometheus'
# consul_sd_configs:
# - server: 'k8s1:8500'
# services: []
# relabel_configs:
# - source_labels: [__meta_consul_tags]
# regex: .*test.*
# action: keep
# - regex: _meta_consul_service_metadata_(.+)
# action: labelmap
- job_name: 'consul-node-exporter'
consul_sd_configs:
- server: 'k8s1:8500'
services: []
relabel_configs:
- source_labels: [__meta_consul_tags]
regex: .*node-exporter.*
action: keep
- regex: __meta_consul_service_metadata_(.+)
action: labelmap
- job_name: 'consul-cadvisor-exporter'
consul_sd_configs:
- server: 'k8s1:8500'
services: []
relabel_configs:
- source_labels: [__meta_consul_tags]
regex: .*cadvisor-exporter.*
action: keep
- regex: __meta_consul_service_metadata_(.+)
action: labelmap
- job_name: 'pushgateway'
honor_labels: true
static_configs:
- targets: ['192.168.0.5:9091']
labels:
instance: pushgateway
- job_name: 'mysql-exporter'
static_configs:
- targets: ['k8s1:9104']
labels:
instance: mysql
- job_name: docker
static_configs:
- targets: ['192.168.0.1:38080']
labels:
instance: cadvisor
- job_name: 'http_status'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets: ['http://www.baidu.com']
labels:
instance: http_status
group: web
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- target_label: __address__
replacement: 192.168.0.1:9115
- job_name: 'ping_status'
metrics_path: /probe
params:
module: [icmp]
static_configs:
- targets:
- 192.168.0.1
- 192.168.0.2
- 192.168.0.3
labels:
instance: 'ping_status'
group: 'icmp'
relabel_configs:
- source_labels: [__address__]
regex: (.*)(:80)?
target_label: __param_target
replacement: ${1}
- source_labels: [__param_target]
regex: (.*)
target_label: ping
replacement: ${1}
- source_labels: []
regex: .*
target_label: __address__
replacement: 192.168.0.1:9115
- job_name: 'nodemanager'
metrics_path: /probe
params:
module: [tcp_connect]
static_configs:
- targets:
- 192.168.0.2:8042
- 192.168.0.3:8042
labels:
instance: 'nodemanager_status'
group: 'port'
relabel_configs:
- source_labels: [__address__]
regex: (.*)(:80)?
target_label: __param_target
replacement: ${1}
- source_labels: [__param_target]
regex: (.*)
target_label: nodemanager
replacement: ${1}
- source_labels: []
regex: .*
target_label: __address__
replacement: 192.168.0.1:9115
- job_name: 'datanode'
metrics_path: /probe
params:
module: [tcp_connect]
static_configs:
- targets:
- 192.168.0.2:50010
- 192.168.0.3:50010
labels:
instance: 'datanode_status'
group: 'port'
relabel_configs:
- source_labels: [__address__]
regex: (.*)(:80)?
target_label: __param_target
replacement: ${1}
- source_labels: [__param_target]
regex: (.*)
target_label: datanode
replacement: ${1}
- source_labels: []
regex: .*
target_label: __address__
replacement: 192.168.0.1:9115
Register a service:
curl -X PUT -d '{"id": "node-exporter","name": "node-exporter-k8s1","address": "192.168.0.1","port": 9100,"tags": ["test-k8s1"],"checks": [{"http": "http://192.168.0.1:9100/metrics", "interval": "5s"}]}' http://192.168.0.1:8500/v1/agent/service/register
Deregister a service:
curl -X PUT http://k8s1:8500/v1/agent/service/deregister/node-exporter
Here we register the services with curl:
[root@emr-header-1 ~]# cat consul-0.json
{
"ID": "node-exporter",
"Name": "node-exporter-k8s1",
"Tags": [
"test"
],
"Address": "192.168.0.1",
"Port": 9100,
"Meta": {
"app": "spring-boot",
"team": "appgroup",
"project": "bigdata"
},
"EnableTagOverride": false,
"Check": {
"HTTP": "http://192.168.0.1:9100/metrics",
"Interval": "10s"
},
"Weights": {
"Passing": 10,
"Warning": 1
}
}
curl --request PUT --data @consul-0.json http://k8s1:8500/v1/agent/service/register?replace-existing-checks=1
[root@emr-header-1 ~]# cat consul-1.json
{
"ID": "node-exporter",
"Name": "node-exporter-k8s1",
"Tags": [
"node-exporter"
],
"Address": "192.168.0.1",
"Port": 9100,
"Meta": {
"app": "spring-boot",
"team": "appgroup",
"project": "bigdata"
},
"EnableTagOverride": false,
"Check": {
"HTTP": "http://192.168.0.1:9100/metrics",
"Interval": "10s"
},
"Weights": {
"Passing": 10,
"Warning": 1
}
}
curl --request PUT --data @consul-1.json http://k8s1:8500/v1/agent/service/register?replace-existing-checks=1
[root@emr-header-1 ~]# cat consul-2.json
{
"ID": "cadvisor-exporter",
"Name": "cadvisor-exporter-k8s1",
"Tags": [
"cadvisor-exporter"
],
"Address": "192.168.0.1",
"Port": 38080,
"Meta": {
"app": "docker",
"team": "cloudgroup",
"project": "docker-service"
},
"EnableTagOverride": false,
"Check": {
"HTTP": "http://192.168.0.1:38080/metrics",
"Interval": "10s"
},
"Weights": {
"Passing": 10,
"Warning": 1
}
}
curl --request PUT --data @consul-2.json http://k8s1:8500/v1/agent/service/register?replace-existing-checks=1