PostgreSQL Cluster监控插件开发:自定义指标采集实践
你是否在管理PostgreSQL集群时遇到过这些问题:数据库连接数突增却无法快速定位原因?主从切换过程中关键指标缺失?资源使用率异常却没有针对性告警?本文将带你通过开发自定义监控插件,解决PostgreSQL High-Availability Cluster(基于Patroni和etcd/consul)的监控盲区,实现业务级指标的精准采集与告警。
监控插件开发基础
PostgreSQL集群的监控体系需要覆盖从基础设施到数据库内核的全栈指标。项目中已集成的Netdata监控框架提供了系统级指标采集能力,其配置文件automation/roles/netdata/templates/netdata.conf.j2定义了基础监控项,包括CPU、内存、磁盘和网络等资源的监控配置。
Netdata通过插件系统支持扩展采集能力,在配置文件的[plugins] section中可以看到多种插件类型的启用状态:
[plugins]
# idlejitter = yes
# netdata pulse = yes
# profile = no
# tc = yes
# diskspace = yes
# proc = yes
# cgroups = yes
# timex = yes
# enable running new plugins = yes
# check for new plugins every = 1m
# slabinfo = no
# nfacct = yes
# statsd = yes
# perf = yes
# network-viewer = yes
# systemd-journal = yes
# ioping = yes
# charts.d = yes
# debugfs = yes
# go.d = yes
# ebpf = yes
# apps = yes
# python.d = yes
其中python.d插件是实现自定义指标采集的理想选择,它允许通过Python脚本扩展监控能力,且与Netdata的集成已在项目中预配置完成。
自定义指标采集架构设计
PostgreSQL集群监控需要关注三类核心指标:
- 数据库层指标:连接数、查询性能、事务吞吐量
- Patroni层指标:集群角色状态、切换次数、复制延迟
- DCS层指标:etcd/consul健康状态、键值存储性能
项目中的Netdata角色automation/roles/netdata/已包含基础部署能力,其元数据文件automation/roles/netdata/meta/main.yml定义了与其他角色的依赖关系:
---
dependencies:
- role: vitabaks.autobase.common
这为我们扩展自定义指标提供了稳定的基础设施层。
开发实战:Patroni集群状态监控插件
插件文件结构
按照Netdata的Python插件规范,在Netdata配置目录下创建插件目录结构:
/etc/netdata/
└── python.d/
├── patroni_cluster.chart.py # 指标采集逻辑
└── patroni_cluster.conf # 插件配置文件
核心采集代码实现
以下是采集Patroni集群状态的Python插件实现(完整代码应保存为patroni_cluster.chart.py):
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import json
import requests
from bases.FrameworkServices.UrlService import UrlService
# 插件配置
update_every = 5
priority = 60000
retries = 3
ORDER = [
'cluster_health',
'replication_lag',
'connections'
]
CHARTS = {
'cluster_health': {
'options': [None, 'Patroni Cluster Health', 'status', 'cluster', 'patroni.health', 'line'],
'lines': [
['master_status', 'master', 'absolute', 1, 1],
['replica_count', 'replicas', 'absolute', 1, 1]
]
},
'replication_lag': {
'options': [None, 'Replication Lag', 'bytes', 'replication', 'patroni.lag', 'line'],
'lines': [
['max_lag', 'max lag', 'absolute', 1, 1]
]
},
'connections': {
'options': [None, 'Database Connections', 'connections', 'postgres', 'patroni.connections', 'line'],
'lines': [
['active_connections', 'active', 'absolute', 1, 1],
['max_connections', 'max', 'absolute', 1, 1]
]
}
}
class Service(UrlService):
def __init__(self, configuration=None, name=None):
UrlService.__init__(self, configuration=configuration, name=name)
self.order = ORDER
self.definitions = CHARTS
self.patroni_url = self.configuration.get('patroni_url', 'http://localhost:8008')
def _get_data(self):
data = {}
# 获取Patroni集群状态
try:
r = requests.get(f'{self.patroni_url}/cluster', timeout=3)
cluster_info = r.json()
# 主节点状态 (1=正常, 0=异常)
data['master_status'] = 1 if cluster_info['state'] == 'running' else 0
# 从节点数量
data['replica_count'] = len(cluster_info['members']) - 1
# 最大复制延迟
max_lag = 0
for member in cluster_info['members']:
if member['role'] == 'replica' and 'lag' in member:
max_lag = max(max_lag, member['lag'])
data['max_lag'] = max_lag
except Exception as e:
self.error(f"Error getting cluster info: {e}")
return None
# 获取数据库连接数
try:
r = requests.get(f'{self.patroni_url}/metrics', timeout=3)
for line in r.text.split('\n'):
if line.startswith('pg_stat_activity_count'):
data['active_connections'] = int(line.split()[-1])
elif line.startswith('pg_settings_max_connections'):
data['max_connections'] = int(line.split()[-1])
except Exception as e:
self.error(f"Error getting metrics: {e}")
return None
return data
插件配置文件
创建插件配置文件patroni_cluster.conf:
update_every: 5
priority: 60000
retries: 3
patroni_url: 'http://localhost:8008' # Patroni API地址
集成到Ansible自动化部署流程
为使自定义插件随集群自动部署,需要修改Netdata角色的Ansible任务文件automation/roles/netdata/tasks/main.yml,添加以下任务:
- name: Create Netdata python.d directory
ansible.builtin.file:
path: /etc/netdata/python.d
state: directory
mode: '0755'
- name: Deploy Patroni cluster monitoring plugin
ansible.builtin.template:
src: templates/patroni_cluster.chart.py.j2
dest: /etc/netdata/python.d/patroni_cluster.chart.py
mode: '0755'
- name: Deploy plugin configuration
ansible.builtin.template:
src: templates/patroni_cluster.conf.j2
dest: /etc/netdata/python.d/patroni_cluster.conf
mode: '0644'
- name: Restart Netdata service
ansible.builtin.service:
name: netdata
state: restarted
enabled: yes
同时需要在Netdata角色中添加对应的模板文件,将自定义插件和配置文件纳入版本控制。
监控效果与告警配置
部署完成后,Netdata将自动发现并加载新插件,提供直观的可视化图表。可以通过访问Netdata dashboard(默认端口19999)查看自定义指标:
为关键指标配置告警规则,编辑Netdata健康配置文件:
# /etc/netdata/health.d/patroni_health.conf
alarm: patroni_master_status
on: patroni.health.master_status
os: linux
hosts: *
warning: $this == 0
critical: $this == 0
delay: down 5m multiplier 1.5 max 1h
info: Patroni master node status
to: sysadmin
alarm: replication_lag_high
on: patroni.lag.max_lag
os: linux
hosts: *
warning: $this > 10485760 # 10MB
critical: $this > 104857600 # 100MB
delay: down 5m multiplier 1.5 max 1h
info: Replication lag is too high
to: sysadmin
alarm: connections_usage_high
on: patroni.connections.active_connections
os: linux
hosts: *
warning: $this > $pg_settings_max_connections * 0.8
critical: $this > $pg_settings_max_connections * 0.9
delay: down 5m multiplier 1.5 max 1h
info: Database connections usage is high
to: sysadmin
扩展与优化建议
-
指标扩展:根据业务需求添加更多指标,如:
- 事务吞吐量(tps)
- 慢查询数量
- 表空间增长率
- WAL生成速率
-
性能优化:
- 对于大规模集群,增加
update_every间隔减少采集开销 - 使用缓存减少重复计算
- 实现指标采样机制降低对数据库性能影响
- 对于大规模集群,增加
-
高可用考虑:
- 实现插件自身健康检查
- 添加采集失败重试机制
- 配置插件监控告警
-
安全增强:
- 为Patroni API添加认证
- 限制插件文件访问权限
- 加密敏感配置信息
通过本文介绍的方法,你可以构建贴合自身业务需求的PostgreSQL集群监控系统。完整的插件代码和部署配置可在项目的automation/roles/netdata/目录中找到,更多高级用法请参考项目文档README.md。
创作声明:本文部分内容由AI辅助生成(AIGC),仅供参考





