PostgreSQL Cluster监控插件开发：自定义指标采集实践-优快云博客

PostgreSQL Cluster监控插件开发：自定义指标采集实践

【免费下载链接】postgresql_cluster PostgreSQL High-Availability Cluster (based on "Patroni" and DCS "etcd" or "consul"). Automating with Ansible. 项目地址: https://gitcode.com/GitHub_Trending/po/postgresql_cluster

你是否在管理PostgreSQL集群时遇到过这些问题：数据库连接数突增却无法快速定位原因？主从切换过程中关键指标缺失？资源使用率异常却没有针对性告警？本文将带你通过开发自定义监控插件，解决PostgreSQL High-Availability Cluster（基于Patroni和etcd/consul）的监控盲区，实现业务级指标的精准采集与告警。

监控插件开发基础

PostgreSQL集群的监控体系需要覆盖从基础设施到数据库内核的全栈指标。项目中已集成的Netdata监控框架提供了系统级指标采集能力，其配置文件automation/roles/netdata/templates/netdata.conf.j2定义了基础监控项，包括CPU、内存、磁盘和网络等资源的监控配置。

Netdata通过插件系统支持扩展采集能力，在配置文件的[plugins] section中可以看到多种插件类型的启用状态：

[plugins]
  # idlejitter = yes
  # netdata pulse = yes
  # profile = no
  # tc = yes
  # diskspace = yes
  # proc = yes
  # cgroups = yes
  # timex = yes
  # enable running new plugins = yes
  # check for new plugins every = 1m
  # slabinfo = no
  # nfacct = yes
  # statsd = yes
  # perf = yes
  # network-viewer = yes
  # systemd-journal = yes
  # ioping = yes
  # charts.d = yes
  # debugfs = yes
  # go.d = yes
  # ebpf = yes
  # apps = yes
  # python.d = yes

其中python.d插件是实现自定义指标采集的理想选择，它允许通过Python脚本扩展监控能力，且与Netdata的集成已在项目中预配置完成。

自定义指标采集架构设计

PostgreSQL集群监控需要关注三类核心指标：

数据库层指标：连接数、查询性能、事务吞吐量
Patroni层指标：集群角色状态、切换次数、复制延迟
DCS层指标：etcd/consul健康状态、键值存储性能

项目中的Netdata角色automation/roles/netdata/已包含基础部署能力，其元数据文件automation/roles/netdata/meta/main.yml定义了与其他角色的依赖关系：

---
dependencies:
  - role: vitabaks.autobase.common

这为我们扩展自定义指标提供了稳定的基础设施层。

开发实战：Patroni集群状态监控插件

插件文件结构

按照Netdata的Python插件规范，在Netdata配置目录下创建插件目录结构：

/etc/netdata/
└── python.d/
    ├── patroni_cluster.chart.py  # 指标采集逻辑
    └── patroni_cluster.conf     # 插件配置文件

核心采集代码实现

以下是采集Patroni集群状态的Python插件实现（完整代码应保存为patroni_cluster.chart.py）：

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import json
import requests
from bases.FrameworkServices.UrlService import UrlService

# 插件配置
update_every = 5
priority = 60000
retries = 3

ORDER = [
    'cluster_health',
    'replication_lag',
    'connections'
]

CHARTS = {
    'cluster_health': {
        'options': [None, 'Patroni Cluster Health', 'status', 'cluster', 'patroni.health', 'line'],
        'lines': [
            ['master_status', 'master', 'absolute', 1, 1],
            ['replica_count', 'replicas', 'absolute', 1, 1]
        ]
    },
    'replication_lag': {
        'options': [None, 'Replication Lag', 'bytes', 'replication', 'patroni.lag', 'line'],
        'lines': [
            ['max_lag', 'max lag', 'absolute', 1, 1]
        ]
    },
    'connections': {
        'options': [None, 'Database Connections', 'connections', 'postgres', 'patroni.connections', 'line'],
        'lines': [
            ['active_connections', 'active', 'absolute', 1, 1],
            ['max_connections', 'max', 'absolute', 1, 1]
        ]
    }
}

class Service(UrlService):
    def __init__(self, configuration=None, name=None):
        UrlService.__init__(self, configuration=configuration, name=name)
        self.order = ORDER
        self.definitions = CHARTS
        self.patroni_url = self.configuration.get('patroni_url', 'http://localhost:8008')

    def _get_data(self):
        data = {}
        
        # 获取Patroni集群状态
        try:
            r = requests.get(f'{self.patroni_url}/cluster', timeout=3)
            cluster_info = r.json()
            
            # 主节点状态 (1=正常, 0=异常)
            data['master_status'] = 1 if cluster_info['state'] == 'running' else 0
            
            # 从节点数量
            data['replica_count'] = len(cluster_info['members']) - 1
            
            # 最大复制延迟
            max_lag = 0
            for member in cluster_info['members']:
                if member['role'] == 'replica' and 'lag' in member:
                    max_lag = max(max_lag, member['lag'])
            data['max_lag'] = max_lag
            
        except Exception as e:
            self.error(f"Error getting cluster info: {e}")
            return None
            
        # 获取数据库连接数
        try:
            r = requests.get(f'{self.patroni_url}/metrics', timeout=3)
            for line in r.text.split('\n'):
                if line.startswith('pg_stat_activity_count'):
                    data['active_connections'] = int(line.split()[-1])
                elif line.startswith('pg_settings_max_connections'):
                    data['max_connections'] = int(line.split()[-1])
                    
        except Exception as e:
            self.error(f"Error getting metrics: {e}")
            return None
            
        return data

插件配置文件

创建插件配置文件patroni_cluster.conf：

update_every: 5
priority: 60000
retries: 3
patroni_url: 'http://localhost:8008'  # Patroni API地址

集成到Ansible自动化部署流程

为使自定义插件随集群自动部署，需要修改Netdata角色的Ansible任务文件automation/roles/netdata/tasks/main.yml，添加以下任务：

- name: Create Netdata python.d directory
  ansible.builtin.file:
    path: /etc/netdata/python.d
    state: directory
    mode: '0755'

- name: Deploy Patroni cluster monitoring plugin
  ansible.builtin.template:
    src: templates/patroni_cluster.chart.py.j2
    dest: /etc/netdata/python.d/patroni_cluster.chart.py
    mode: '0755'

- name: Deploy plugin configuration
  ansible.builtin.template:
    src: templates/patroni_cluster.conf.j2
    dest: /etc/netdata/python.d/patroni_cluster.conf
    mode: '0644'

- name: Restart Netdata service
  ansible.builtin.service:
    name: netdata
    state: restarted
    enabled: yes

同时需要在Netdata角色中添加对应的模板文件，将自定义插件和配置文件纳入版本控制。

监控效果与告警配置

部署完成后，Netdata将自动发现并加载新插件，提供直观的可视化图表。可以通过访问Netdata dashboard（默认端口19999）查看自定义指标：

为关键指标配置告警规则，编辑Netdata健康配置文件：

# /etc/netdata/health.d/patroni_health.conf
alarm: patroni_master_status
on: patroni.health.master_status
os: linux
hosts: *
warning: $this == 0
critical: $this == 0
delay: down 5m multiplier 1.5 max 1h
info: Patroni master node status
to: sysadmin

alarm: replication_lag_high
on: patroni.lag.max_lag
os: linux
hosts: *
warning: $this > 10485760  # 10MB
critical: $this > 104857600  # 100MB
delay: down 5m multiplier 1.5 max 1h
info: Replication lag is too high
to: sysadmin

alarm: connections_usage_high
on: patroni.connections.active_connections
os: linux
hosts: *
warning: $this > $pg_settings_max_connections * 0.8
critical: $this > $pg_settings_max_connections * 0.9
delay: down 5m multiplier 1.5 max 1h
info: Database connections usage is high
to: sysadmin

扩展与优化建议

指标扩展：根据业务需求添加更多指标，如：
- 事务吞吐量（tps）
- 慢查询数量
- 表空间增长率
- WAL生成速率
性能优化：
- 对于大规模集群，增加update_every间隔减少采集开销
- 使用缓存减少重复计算
- 实现指标采样机制降低对数据库性能影响
高可用考虑：
- 实现插件自身健康检查
- 添加采集失败重试机制
- 配置插件监控告警
安全增强：
- 为Patroni API添加认证
- 限制插件文件访问权限
- 加密敏感配置信息

通过本文介绍的方法，你可以构建贴合自身业务需求的PostgreSQL集群监控系统。完整的插件代码和部署配置可在项目的automation/roles/netdata/目录中找到，更多高级用法请参考项目文档README.md。

创作声明：本文部分内容由AI辅助生成（AIGC），仅供参考