PostgreSQL页面锁等待探测实际机制分析
环境信息
- PostgreSQL实例: postgres-6699cc25-0-0
- 版本: PostgreSQL 16.10
- Exporter版本: Crunchy Data postgres-exporter v1.3.2
一、实际的锁等待探测机制
1.1 监控查询配置
核心锁等待查询
ccp_pg_waitlocks:
query: "SELECT COUNT(*) FROM pg_locks WHERE granted = 'f';"
metrics:
- count:
usage: "GAUGE"
description: "number of waitlock."
实际监控指标
# HELP ccp_pg_waitlocks_count number of waitlock.
# TYPE ccp_pg_waitlocks_count gauge
ccp_pg_waitlocks_count{server="127.0.0.1:5432"} 0
1.2 探测机制分析
探测原理
- 基础查询:
SELECT COUNT(*) FROM pg_locks WHERE granted = 'f' - 探测范围: 全局锁等待计数(不区分页面级、表级、行级)
- 更新频率: 默认15秒采集间隔
- 指标类型: GAUGE(实时值)
监控表达式示例
# Grafana中的监控表达式
ccp_pg_waitlocks_count{namespace="qfusion-admin",instance="postgres-6699cc25-0-0"}
二、当前环境实际状态
2.1 锁等待状态
SELECT COUNT(*) FROM pg_locks WHERE granted = 'f';
-- 结果: 0 (当前无锁等待)
-- 详细锁状态分析
SELECT
locktype,
COUNT(*) FILTER (WHERE granted = 't') AS granted_locks,
COUNT(*) FILTER (WHERE granted = 'f') AS waiting_locks,
COUNT(*) AS total_locks
FROM pg_locks
GROUP BY locktype;
2.2 当前活跃锁分布
-- 实际查询结果
locktype | granted_locks | waiting_locks | total_locks
------------+---------------+---------------|------------
relation | 1 | 0 | 1
virtualxid | 1 | 0 | 1
2.3 Exporter配置详情
完整的锁相关查询配置
# 锁统计查询(按类型统计)
ccp_locks:
query: "SELECT pg_database.datname as dbname, tmp.mode, COALESCE(count,0) as count
FROM (
VALUES ('accesssharelock'),
('rowsharelock'),
('rowexclusivelock'),
('shareupdateexclusivelock'),
('sharelock'),
('sharerowexclusivelock'),
('exclusivelock'),
('accessexclusivelock')
) AS tmp(mode)
CROSS JOIN pg_catalog.pg_database
LEFT JOIN (
SELECT database, lower(mode) AS mode, count(*) AS count
FROM pg_catalog.pg_locks
WHERE database IS NOT NULL
GROUP BY database, lower(mode)
) AS tmp2
ON tmp.mode=tmp2.mode and pg_database.oid = tmp2.database"
metrics:
- dbname:
usage: "LABEL"
description: "Database name"
- mode:
usage: "LABEL"
description: "Lock type"
- count:
usage: "GAUGE"
description: "Number of locks"
三、页面锁等待的具体实现
3.1 页面锁等待的检测方法
当前机制的局限性
-- 当前查询无法区分页面级锁等待
SELECT COUNT(*) FROM pg_locks WHERE granted = 'f';
-- 问题:无法知道具体是哪种锁类型的等待
-- 改进查询:区分页面级锁等待
SELECT COUNT(*) FROM pg_locks
WHERE granted = 'f' AND locktype = 'page';
页面锁等待的详细检测
-- 页面锁等待检测查询
CREATE OR REPLACE VIEW page_lock_wait_detection AS
SELECT
COUNT(*) FILTER (WHERE locktype = 'page' AND granted = 'f') AS page_lock_waits,
COUNT(*) FILTER (WHERE locktype = 'tuple' AND granted = 'f') AS tuple_lock_waits,
COUNT(*) FILTER (WHERE locktype = 'relation' AND granted = 'f') AS relation_lock_waits,
COUNT(*) FILTER (WHERE granted = 'f') AS total_lock_waits
FROM pg_locks;
3.2 页面锁等待的诊断查询
诊断查询1:页面锁等待详情
-- 查看具体的页面锁等待
SELECT
pl.pid,
pl.database,
pl.relation,
pl.page,
pl.mode,
pl.granted,
psa.query,
psa.wait_event,
psa.wait_event_type,
psa.state,
now() - psa.query_start AS wait_duration
FROM pg_locks pl
JOIN pg_stat_activity psa ON pl.pid = psa.pid
WHERE pl.locktype = 'page'
AND pl.granted = 'f'
AND psa.wait_event_type = 'Lock';
诊断查询2:页面锁冲突分析
-- 页面锁冲突分析
WITH page_lock_conflicts AS (
SELECT
pl.page,
pl.relation,
pg_class.relname AS table_name,
COUNT(*) FILTER (WHERE pl.granted = 'f') AS waiting_count,
COUNT(*) FILTER (WHERE pl.granted = 't') AS holding_count,
array_agg(DISTINCT CASE
WHEN pl.granted = 'f' THEN pl.pid
ELSE NULL
END) FILTER (WHERE pl.granted = 'f') AS waiting_pids,
array_agg(DISTINCT CASE
WHEN pl.granted = 't' THEN pl.pid
ELSE NULL
END) FILTER (WHERE pl.granted = 't') AS holding_pids
FROM pg_locks pl
JOIN pg_class ON pl.relation = pg_class.oid
WHERE pl.locktype = 'page'
GROUP BY pl.page, pl.relation, pg_class.relname
HAVING COUNT(*) FILTER (WHERE pl.granted = 'f') > 0
)
SELECT
table_name,
page,
waiting_count,
holding_count,
waiting_pids,
holding_pids
FROM page_lock_conflicts;
四、监控告警配置
4.1 基于实际指标的告警规则
Prometheus告警规则
groups:
- name: postgres_lock_wait_alerts
rules:
- alert: PostgreSQLLockWaiting
expr: ccp_pg_waitlocks_count > 0
for: 30s
labels:
severity: warning
annotations:
summary: "PostgreSQL锁等待告警"
description: "PostgreSQL实例 {{ $labels.instance }} 存在 {{ $value }} 个锁等待"
- alert: PostgreSQLHighLockWaiting
expr: ccp_pg_waitlocks_count > 5
for: 1m
labels:
severity: critical
annotations:
summary: "PostgreSQL高锁等待告警"
description: "PostgreSQL实例 {{ $labels.instance }} 锁等待数量过高: {{ $value }}"
4.2 Grafana监控面板
锁等待监控面板配置
{
"title": "PostgreSQL锁等待监控",
"panels": [
{
"title": "锁等待数量",
"type": "stat",
"targets": [
{
"expr": "ccp_pg_waitlocks_count{instance=\"$instance\"}",
"legendFormat": "锁等待数量"
}
]
},
{
"title": "锁类型分布",
"type": "piechart",
"targets": [
{
"expr": "ccp_locks_count{instance=\"$instance\"}",
"legendFormat": "{{ mode }}"
}
]
}
]
}
五、页面锁等待探测的优化建议
5.1 当前机制的问题
主要局限性
- 粒度粗糙: 无法区分页面级、表级、行级锁等待
- 信息不足: 缺少等待时间、阻塞链等详细信息
- 诊断困难: 无法快速定位具体的锁冲突
5.2 优化方案
方案1:增强监控查询
# 建议的增强配置
ccp_pg_page_lock_waits:
query: "SELECT COUNT(*) FROM pg_locks WHERE granted = 'f' AND locktype = 'page';"
metrics:
- count:
usage: "GAUGE"
description: "number of page lock waits"
ccp_pg_lock_wait_details:
query: "SELECT
locktype,
COUNT(*) FILTER (WHERE granted = 'f') as waiting_count,
MAX(EXTRACT(EPOCH FROM (now() - query_start))) FILTER (WHERE granted = 'f') as max_wait_time
FROM pg_locks pl
LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid
GROUP BY locktype;"
metrics:
- locktype:
usage: "LABEL"
description: "Lock type"
- waiting_count:
usage: "GAUGE"
description: "Number of waiting locks"
- max_wait_time:
usage: "GAUGE"
description: "Maximum wait time in seconds"
方案2:添加自定义监控函数
-- 创建页面锁等待监控函数
CREATE OR REPLACE FUNCTION monitor_page_lock_waits()
RETURNS TABLE(
page_wait_count BIGINT,
tuple_wait_count BIGINT,
relation_wait_count BIGINT,
max_wait_seconds INTERVAL,
blocked_pids BIGINT[]
) AS $$
BEGIN
RETURN QUERY
SELECT
COUNT(*) FILTER (WHERE locktype = 'page' AND granted = 'f'),
COUNT(*) FILTER (WHERE locktype = 'tuple' AND granted = 'f'),
COUNT(*) FILTER (WHERE locktype = 'relation' AND granted = 'f'),
MAX(now() - query_start) FILTER (WHERE wait_event_type = 'Lock'),
array_agg(DISTINCT pid) FILTER (WHERE granted = 'f')
FROM pg_locks pl
LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid
WHERE pl.granted = 'f';
END;
$$ LANGUAGE plpgsql;
六、总结
6.1 当前状态
- 基础监控: ✅ 有基本的锁等待计数监控
- 实时状态: 当前无锁等待(count=0)
- 监控覆盖: 基础覆盖,但粒度较粗
6.2 改进建议
- 短期优化: 增加页面级锁等待的专门监控
- 中期优化: 添加等待时间、阻塞链等详细信息
- 长期规划: 建立完整的锁等待诊断和优化体系
6.3 实际应用价值
当前的ccp_pg_waitlocks_count指标虽然简单,但对于基础监控已经够用:
- 可以快速发现锁等待是否存在
- 适合作为基础告警指标
- 为深入诊断提供入口点
结论: 25环境的PostgreSQL锁等待探测机制虽然简单,但实用性强,能够有效满足基础的锁监控需求。
794

被折叠的 条评论
为什么被折叠?



