PostgreSQL页面锁等待探测机制分析报告

PostgreSQL页面锁等待探测实际机制分析

环境信息

  • PostgreSQL实例: postgres-6699cc25-0-0
  • 版本: PostgreSQL 16.10
  • Exporter版本: Crunchy Data postgres-exporter v1.3.2

一、实际的锁等待探测机制

1.1 监控查询配置

核心锁等待查询
ccp_pg_waitlocks:
  query: "SELECT COUNT(*) FROM pg_locks WHERE granted = 'f';"
  metrics:
    - count:
        usage: "GAUGE"
        description: "number of waitlock."
实际监控指标
# HELP ccp_pg_waitlocks_count number of waitlock.
# TYPE ccp_pg_waitlocks_count gauge
ccp_pg_waitlocks_count{server="127.0.0.1:5432"} 0

1.2 探测机制分析

探测原理
  1. 基础查询: SELECT COUNT(*) FROM pg_locks WHERE granted = 'f'
  2. 探测范围: 全局锁等待计数(不区分页面级、表级、行级)
  3. 更新频率: 默认15秒采集间隔
  4. 指标类型: GAUGE(实时值)
监控表达式示例
# Grafana中的监控表达式
ccp_pg_waitlocks_count{namespace="qfusion-admin",instance="postgres-6699cc25-0-0"}

二、当前环境实际状态

2.1 锁等待状态

SELECT COUNT(*) FROM pg_locks WHERE granted = 'f';
-- 结果: 0 (当前无锁等待)

-- 详细锁状态分析
SELECT
    locktype,
    COUNT(*) FILTER (WHERE granted = 't') AS granted_locks,
    COUNT(*) FILTER (WHERE granted = 'f') AS waiting_locks,
    COUNT(*) AS total_locks
FROM pg_locks
GROUP BY locktype;

2.2 当前活跃锁分布

-- 实际查询结果
 locktype   | granted_locks | waiting_locks | total_locks
------------+---------------+---------------|------------
 relation   |             1 |             0 |          1
 virtualxid |             1 |             0 |          1

2.3 Exporter配置详情

完整的锁相关查询配置
# 锁统计查询(按类型统计)
ccp_locks:
  query: "SELECT pg_database.datname as dbname, tmp.mode, COALESCE(count,0) as count
          FROM (
            VALUES ('accesssharelock'),
                   ('rowsharelock'),
                   ('rowexclusivelock'),
                   ('shareupdateexclusivelock'),
                   ('sharelock'),
                   ('sharerowexclusivelock'),
                   ('exclusivelock'),
                   ('accessexclusivelock')
          ) AS tmp(mode)
          CROSS JOIN pg_catalog.pg_database
          LEFT JOIN (
            SELECT database, lower(mode) AS mode, count(*) AS count
            FROM pg_catalog.pg_locks
            WHERE database IS NOT NULL
            GROUP BY database, lower(mode)
          ) AS tmp2
          ON tmp.mode=tmp2.mode and pg_database.oid = tmp2.database"
  metrics:
    - dbname:
        usage: "LABEL"
        description: "Database name"
    - mode:
        usage: "LABEL"
        description: "Lock type"
    - count:
        usage: "GAUGE"
        description: "Number of locks"

三、页面锁等待的具体实现

3.1 页面锁等待的检测方法

当前机制的局限性
-- 当前查询无法区分页面级锁等待
SELECT COUNT(*) FROM pg_locks WHERE granted = 'f';
-- 问题:无法知道具体是哪种锁类型的等待

-- 改进查询:区分页面级锁等待
SELECT COUNT(*) FROM pg_locks
WHERE granted = 'f' AND locktype = 'page';
页面锁等待的详细检测
-- 页面锁等待检测查询
CREATE OR REPLACE VIEW page_lock_wait_detection AS
SELECT
    COUNT(*) FILTER (WHERE locktype = 'page' AND granted = 'f') AS page_lock_waits,
    COUNT(*) FILTER (WHERE locktype = 'tuple' AND granted = 'f') AS tuple_lock_waits,
    COUNT(*) FILTER (WHERE locktype = 'relation' AND granted = 'f') AS relation_lock_waits,
    COUNT(*) FILTER (WHERE granted = 'f') AS total_lock_waits
FROM pg_locks;

3.2 页面锁等待的诊断查询

诊断查询1:页面锁等待详情
-- 查看具体的页面锁等待
SELECT
    pl.pid,
    pl.database,
    pl.relation,
    pl.page,
    pl.mode,
    pl.granted,
    psa.query,
    psa.wait_event,
    psa.wait_event_type,
    psa.state,
    now() - psa.query_start AS wait_duration
FROM pg_locks pl
JOIN pg_stat_activity psa ON pl.pid = psa.pid
WHERE pl.locktype = 'page'
  AND pl.granted = 'f'
  AND psa.wait_event_type = 'Lock';
诊断查询2:页面锁冲突分析
-- 页面锁冲突分析
WITH page_lock_conflicts AS (
    SELECT
        pl.page,
        pl.relation,
        pg_class.relname AS table_name,
        COUNT(*) FILTER (WHERE pl.granted = 'f') AS waiting_count,
        COUNT(*) FILTER (WHERE pl.granted = 't') AS holding_count,
        array_agg(DISTINCT CASE
            WHEN pl.granted = 'f' THEN pl.pid
            ELSE NULL
        END) FILTER (WHERE pl.granted = 'f') AS waiting_pids,
        array_agg(DISTINCT CASE
            WHEN pl.granted = 't' THEN pl.pid
            ELSE NULL
        END) FILTER (WHERE pl.granted = 't') AS holding_pids
    FROM pg_locks pl
    JOIN pg_class ON pl.relation = pg_class.oid
    WHERE pl.locktype = 'page'
    GROUP BY pl.page, pl.relation, pg_class.relname
    HAVING COUNT(*) FILTER (WHERE pl.granted = 'f') > 0
)
SELECT
    table_name,
    page,
    waiting_count,
    holding_count,
    waiting_pids,
    holding_pids
FROM page_lock_conflicts;

四、监控告警配置

4.1 基于实际指标的告警规则

Prometheus告警规则
groups:
- name: postgres_lock_wait_alerts
  rules:
  - alert: PostgreSQLLockWaiting
    expr: ccp_pg_waitlocks_count > 0
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "PostgreSQL锁等待告警"
      description: "PostgreSQL实例 {{ $labels.instance }} 存在 {{ $value }} 个锁等待"

  - alert: PostgreSQLHighLockWaiting
    expr: ccp_pg_waitlocks_count > 5
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "PostgreSQL高锁等待告警"
      description: "PostgreSQL实例 {{ $labels.instance }} 锁等待数量过高: {{ $value }}"

4.2 Grafana监控面板

锁等待监控面板配置
{
  "title": "PostgreSQL锁等待监控",
  "panels": [
    {
      "title": "锁等待数量",
      "type": "stat",
      "targets": [
        {
          "expr": "ccp_pg_waitlocks_count{instance=\"$instance\"}",
          "legendFormat": "锁等待数量"
        }
      ]
    },
    {
      "title": "锁类型分布",
      "type": "piechart",
      "targets": [
        {
          "expr": "ccp_locks_count{instance=\"$instance\"}",
          "legendFormat": "{{ mode }}"
        }
      ]
    }
  ]
}

五、页面锁等待探测的优化建议

5.1 当前机制的问题

主要局限性
  1. 粒度粗糙: 无法区分页面级、表级、行级锁等待
  2. 信息不足: 缺少等待时间、阻塞链等详细信息
  3. 诊断困难: 无法快速定位具体的锁冲突

5.2 优化方案

方案1:增强监控查询
# 建议的增强配置
ccp_pg_page_lock_waits:
  query: "SELECT COUNT(*) FROM pg_locks WHERE granted = 'f' AND locktype = 'page';"
  metrics:
    - count:
        usage: "GAUGE"
        description: "number of page lock waits"

ccp_pg_lock_wait_details:
  query: "SELECT
            locktype,
            COUNT(*) FILTER (WHERE granted = 'f') as waiting_count,
            MAX(EXTRACT(EPOCH FROM (now() - query_start))) FILTER (WHERE granted = 'f') as max_wait_time
          FROM pg_locks pl
          LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid
          GROUP BY locktype;"
  metrics:
    - locktype:
        usage: "LABEL"
        description: "Lock type"
    - waiting_count:
        usage: "GAUGE"
        description: "Number of waiting locks"
    - max_wait_time:
        usage: "GAUGE"
        description: "Maximum wait time in seconds"
方案2:添加自定义监控函数
-- 创建页面锁等待监控函数
CREATE OR REPLACE FUNCTION monitor_page_lock_waits()
RETURNS TABLE(
    page_wait_count BIGINT,
    tuple_wait_count BIGINT,
    relation_wait_count BIGINT,
    max_wait_seconds INTERVAL,
    blocked_pids BIGINT[]
) AS $$
BEGIN
    RETURN QUERY
    SELECT
        COUNT(*) FILTER (WHERE locktype = 'page' AND granted = 'f'),
        COUNT(*) FILTER (WHERE locktype = 'tuple' AND granted = 'f'),
        COUNT(*) FILTER (WHERE locktype = 'relation' AND granted = 'f'),
        MAX(now() - query_start) FILTER (WHERE wait_event_type = 'Lock'),
        array_agg(DISTINCT pid) FILTER (WHERE granted = 'f')
    FROM pg_locks pl
    LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid
    WHERE pl.granted = 'f';
END;
$$ LANGUAGE plpgsql;

六、总结

6.1 当前状态

  • 基础监控: ✅ 有基本的锁等待计数监控
  • 实时状态: 当前无锁等待(count=0)
  • 监控覆盖: 基础覆盖,但粒度较粗

6.2 改进建议

  1. 短期优化: 增加页面级锁等待的专门监控
  2. 中期优化: 添加等待时间、阻塞链等详细信息
  3. 长期规划: 建立完整的锁等待诊断和优化体系

6.3 实际应用价值

当前的ccp_pg_waitlocks_count指标虽然简单,但对于基础监控已经够用:

  • 可以快速发现锁等待是否存在
  • 适合作为基础告警指标
  • 为深入诊断提供入口点

结论: 25环境的PostgreSQL锁等待探测机制虽然简单,但实用性强,能够有效满足基础的锁监控需求。

评论
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值