I recently watched a video introducing the relevant SQL statements, and took the notes below.
Troubleshooting high cpu utilization in PostgreSQL databases
High CPU usage is often connected to:
- long running times
- poor performance
- unexpected crashes
- bad end-customer experience
- limited application scalability
- wrong architecture
- unplanned and uncontrolled increase in cost
Contents
DB Connections
Distribution of non-idle connections per database
Distribution of non-idle connections per database and per query
Non-idle connections detailed
The PostgreSQL parameter statement_timeout
Check connection patterns in the PostgreSQL DB log file(s)
Running frequent SQL queries
Check execution plan of the query/queries
The PostgreSQL parameter random_page_cost
Invalid indexes
Slow reuse of cached prepared statements
PostgreSQL DB Server CPU distribution by queries/databases
PostgreSQL DB statistics
PostgreSQL DB bloat
Proactive PostgreSQL DB Scanner
DB Connections
One of the PostgreSQL DB patterns leading to high CPU utilization is a high number of active connections.
A "connection storm" is a very painful pattern for a DB.
The following SQL query lists:
- the total number of connections
- the number of non-idle connections
- the maximum number of available connections
- the connection utilization percentage
select
  A.total_connections,
  A.non_idle_connections,
  B.max_connections,
  round((100 * A.total_connections::numeric / B.max_connections::numeric), 2)
    as connections_utilization_pctg
from
  (select count(1) as total_connections,
          sum(case when state != 'idle' then 1 else 0 end) as non_idle_connections
   from pg_stat_activity) A,
  (select setting as max_connections
   from pg_settings
   where name = 'max_connections') B;
Here's an example of the SQL output:
total_connections | non_idle_connections | max_connections | connections_utilization_pctg
3457              | 3                    | 9057            | 38.17
To tune the parameter max_connections,
and other key PostgreSQL DB instance parameters,
I recommend using free online tools, such as
the PostgreSQL configuration builder at:
https://www.pgconfig.org
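If you then decide to change max_connections, note that it cannot be changed at runtime; a minimal sketch (the value 200 is only an example, derive yours from the tooling above):
ALTER SYSTEM SET max_connections = 200;
-- max_connections is a postmaster-level parameter:
-- the change takes effect only after a server restart.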
Distribution of non-idle connections per database
The following query checks
the distribution of non-idle connections per database,
sorted in descending order:
select datname as db_name, count(1) as num_non_idle_connections
from pg_stat_activity
where state!='idle'
group by 1
order by 2 desc;
Here's an example of the SQL output:
db_name | num_non_idle_connections
my_db_1 | 133
my_db_2 | 6
my_db_3 | 3
Distribution of non-idle connections per database and per query
The following SQL checks the distribution of non-idle connections
per database and per query, sorted in descending order,
as seen in the example below:
select datname as db_name, query, count(1) as num_non_idle_connections
from pg_stat_activity
where state!='idle'
group by 1,2
order by 3 desc;
Here's an example of the SQL output:
db_name | query                 | num_non_idle_connections
db_1    | select * from table_1 | 40
db_1    | select * from table_2 | 1
Non-idle connections detailed
The following query lists non-idle PostgreSQL sessions
that have been running for more than 5 seconds. The result is sorted by the runtime
in descending order:
select
now()-query_start as runtime,
pid as process_id,
datname as db_name, client_addr, client_hostname,
query
from pg_stat_activity
where state!='idle'
and now() - query_start > '5 seconds'::interval
order by 1 desc;
Here's an example of the SQL output:
runtime  | process_id | db_name | client_addr    | client_hostname | query
00:12:34 | 7770       | db_1    | 192.168.12.208 |                 | select
For each long-running query, the resultset contains the corresponding
runtime, process ID, database name, client address, and hostname.
In case a query runs too long,
causing a high load on the DB CPU and other resources,
you may want to terminate it explicitly.
To terminate a PostgreSQL DB session by <process_id>,
run the following command:
select pg_terminate_backend(<process_id>);
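You can also combine this with the previous query. For example, a minimal sketch that terminates every non-idle session running for more than 5 minutes (the interval is only an example; adjust it to your workload):
select pg_terminate_backend(pid)
from pg_stat_activity
where state != 'idle'
and pid != pg_backend_pid() -- don't terminate our own session
and now() - query_start > '5 minutes'::interval;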
The PostgreSQL parameter statement_timeout
The statement_timeout parameter in Postgres
sets the maximum allowed duration, in milliseconds (ms),
of any statement.
The statement_timeout parameter's default value is 0,
meaning no timeout.
I recommend setting it to 30 seconds
at the role/session level to defend the PostgreSQL DB Server
from long-running queries.
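For example, to apply the recommended 30-second limit at the session level, or for a specific role (the role name app_user is only a placeholder):
SET statement_timeout = '30s'; -- current session only
ALTER ROLE app_user SET statement_timeout = '30s'; -- all new sessions of this role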
Check connection patterns in the PostgreSQL DB log file(s)
Let's check if there is a "connection storming" pattern,
where a lot of connections arrive at the database at a specific time.
This can be the root cause of high CPU utilization on the PostgreSQL
Database Server.
Let's consider, for example,
Azure Database for PostgreSQL Server log file(s).
I will demonstrate how
Azure Database for PostgreSQL Server log file(s)
can be parsed using AWK.
AWK is a text-processing utility. It is very powerful
and uses a simple programming language.
How do we list the top DB users by the number of "connection authorized"
occurrences, sorted in descending order?
grep "connection authorized" postgresql-2023-02-01_100000.log | awk '{
printf("%s %s\n", substr($2,1,5), substr($6,index($6,"user=")+5, index($6,
"database=")-6)); } ' | awk '{count[$2]++} END {for (word in count) print
word, count[word]}'| sort -k 2,2 -n -r
Here's an example of the output:
db_user_1 1234
db_user_2 98
From this output we can conclude that the DB user db_user_1 made most of the connections to the DB,
which makes it a good candidate for investigating high CPU utilization on the PostgreSQL DB Server.
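To see when the storm happens rather than who causes it, a complementary sketch (assuming the same log format, where the second field carries the timestamp) keeps only the HH:MM part and counts "connection authorized" events per minute:
grep "connection authorized" postgresql-2023-02-01_100000.log \
| awk '{ print substr($2,1,5) }' \
| sort | uniq -c | sort -rn | head
The minutes with the highest counts are the candidate "storm" windows.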
Running frequent SQL queries
The root cause of high CPU utilization in PostgreSQL databases
is not necessarily a long-running query.
Quick but too frequent queries, running hundreds of times per
second, can cause high CPU utilization too.
To find the top frequent PostgreSQL queries,
run the following SQL, as shown in the example below:
with
a as (select dbid, queryid, query, calls s from pg_stat_statements),
b as (select dbid, queryid, query, calls s from pg_stat_statements,
pg_sleep(1))
select
pd.datname as db_name,
substr(a.query, 1, 400) as the_query,
sum(b.s-a.s) as runs_per_second
from a, b, pg_database pd
where
a.dbid=b.dbid and a.queryid=b.queryid and pd.oid=a.dbid
group by 1, 2
order by 3 desc;
Here's an example of the output:
db_name | the_query       | runs_per_second
db_1    | select ...      | 10
db_1    | insert into ... | 2
This resultset includes a list of queries
and their corresponding frequency:
how many times each query runs per second.
Recommendation:
Examine the SQL queries and the corresponding application logic.
Try to improve the application architecture to use caching mechanisms.
This will help prevent SQL queries from running too frequently
on the PostgreSQL Database Server.
Check execution plan of the query/queries
To understand why a particular query loads the CPU, check its execution plan
with EXPLAIN (BUFFERS, ANALYZE): it executes the query and reports the actual
plan, row counts, timings, and buffer usage, so expensive operations such as
sequential scans on large tables become visible:
postgres=# explain (buffers, analyze) select count(1) from my_table where
abs(a)=22988504;
postgres=# explain (buffers, analyze) select * from some_table;
The PostgreSQL parameter random_page_cost
Reducing the parameter random_page_cost value
relative to the parameter seq_page_cost
will cause the system to prefer index scans.
Default values:
postgres=# show random_page_cost;
4
postgres=# show seq_page_cost;
1
Recommendation:
If you use SSD storage,
lower the parameter random_page_cost to 1.
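For example, to apply the recommendation cluster-wide (random_page_cost does not require a restart; reloading the configuration is enough):
ALTER SYSTEM SET random_page_cost = 1;
SELECT pg_reload_conf();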
Invalid indexes
One of the reasons for high CPU utilization
may be missing or invalid indexes.
postgres=# \d tab
              Table "public.tab"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 col    | integer |           |          |
Indexes:
    "idx" btree (col) INVALID
The following query checks invalid indexes:
select pg_class.relname
from pg_class, pg_index
where pg_index.indisvalid = false
AND pg_index.indexrelid = pg_class.oid;
If invalid indexes are found, they should be rebuilt or recreated.
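For example, to rebuild the invalid index "idx" from the example above without blocking writes (REINDEX ... CONCURRENTLY is available since PostgreSQL 12; on older versions, drop the index and recreate it with CREATE INDEX CONCURRENTLY):
REINDEX INDEX CONCURRENTLY idx;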
Slow reuse of cached prepared statements
When accessing PostgreSQL via the JDBC driver
with a prepared statements cache,
you may see performance degradation and high CPU utilization.
During the first 5 executions of a prepared statement,
the PostgreSQL engine uses a custom execution plan.
Starting from the 6th execution,
it may switch to a generic execution plan,
which can lead to performance degradation
and a high CPU utilization effect.
Recommendation:
Disable prepared statements caching.
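For example, with the PostgreSQL JDBC driver, server-side prepared statements can be disabled through the prepareThreshold connection parameter (host and database names below are placeholders):
jdbc:postgresql://db_host:5432/my_db?prepareThreshold=0
Alternatively, on PostgreSQL 12 and later, you can keep the statement cache but force custom plans:
SET plan_cache_mode = force_custom_plan;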
PostgreSQL DB Server CPU distribution by queries/databases
The following SQL checks
how much CPU each query in each database
uses.
It provides a resultset
sorted by the most CPU-intensive queries
in descending order:
select
pss.userid, pss.dbid, pd.datname as db_name,
round((pss.total_exec_time + pss.total_plan_time)::numeric,2) as total_time,
pss.calls,
round((pss.mean_exec_time + pss.mean_plan_time)::numeric,2) as mean,
round((100 * (pss.total_exec_time + pss.total_plan_time) /
sum((pss.total_exec_time+pss.total_plan_time)::numeric) OVER ())::numeric,2)
as cpu_portion_pctg,pss.query
from pg_stat_statements pss, pg_database pd
where pd.oid=pss.dbid
order by (pss.total_exec_time + pss.total_plan_time)
desc limit 30;
Here's an example of the SQL output:
db_name | total_time  | calls   | mean    | cpu_portion_pctg | query
db_1    | 27349172.12 | 3905898 | 7.00    | 25.15            | select ... from table_1
db_1    | 391744.00   | 105     | 3731.00 | 16.76            | select ... from table_2
From the output example, you can see that
the first query (i.e., select ... from table_1)
takes the largest portion of the CPU
because of its high number of calls.
I recommend looking at how often
the application is running the queries
connected to this PostgreSQL DB.
Also, from the output example, you can see
the high mean time of the second query (i.e., select ... from table_2),
which exceeds three seconds.
PostgreSQL DB statistics
Outdated PostgreSQL statistics
can be another root cause for high CPU utilization.
When statistical data isn't updated,
the PostgreSQL query planner may generate
inefficient execution plans for queries,
which will lead to bad performance
of the entire PostgreSQL DB server.
To check the last date and time the statistics were updated
for each table of a specific DB
on the PostgreSQL DB Server,
connect to the DB and run the following query:
select
schemaname, relname,
DATE_TRUNC('minute', last_analyze) last_analyze,
DATE_TRUNC('minute', last_autoanalyze) last_autoanalyze
from
pg_stat_all_tables
where
schemaname = 'public'
order by
last_analyze desc NULLS FIRST,
last_autoanalyze desc NULLS FIRST;
Recommendation:
Ensure tables are analyzed regularly.
To collect statistics manually for a specific table
and its associated indexes,
run the command:
ANALYZE <table_name>;
PostgreSQL DB bloat
In cases of intensive data updates,
both with frequent UPDATE and with INSERT/DELETE operations,
PostgreSQL tables and their indexes become bloated.
Bloat refers to disk space
that was allocated by a table or index
and is now available for reuse by the database,
but has not been reclaimed.
Because of this bloat, the performance of the PostgreSQL DB Server is
degraded, which can lead to high CPU utilization scenarios.
Under normal PostgreSQL operations,
tuples (rows) that are deleted or stale because of an update
aren't physically removed from the table;
they're stored there until the VACUUM command is issued.
VACUUM releases the space occupied by "dead" tuples.
It's necessary to perform a VACUUM periodically,
especially for tables that change often.
To check information about dead tuples,
and when vacuum / autovacuum was run
for each table for a specific DB,
connect to the DB and run the following query:
select
schemaname, relname, n_tup_ins, n_tup_upd, n_tup_del, n_live_tup,
n_dead_tup,
DATE_TRUNC('minute', last_vacuum) last_vacuum,
DATE_TRUNC('minute', last_autovacuum) last_autovacuum
from
pg_stat_all_tables
where
schemaname = 'public'
order by
n_dead_tup desc;
Here's an example of the SQL output:
schemaname | relname | n_tup_ins | n_tup_upd | n_tup_del | n_live_tup | n_dead_tup | last_vacuum | last_autovacuum
public     | table_1 | 30237555  | 41784024  | 26858     | 30184398   | 142226     |             |
It indicates the number of inserted, updated, deleted, live, and dead
tuples per table,
as well as the last date and time
the vacuum/autovacuum process was performed.
Recommendation:
Ensure tables are vacuumed regularly.
To run VACUUM (regular, not FULL)
for a specific table and all its associated indexes,
run the command:
VACUUM <table_name>;
VACUUM can be run by an autovacuum process
together with statistics collection.
Autovacuum is a background process
that executes VACUUM and ANALYZE commands automatically.
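For tables that change very often, autovacuum can also be tuned per table; a minimal sketch (the table name and the value are only examples; the default autovacuum_vacuum_scale_factor is 0.2):
ALTER TABLE table_1 SET (autovacuum_vacuum_scale_factor = 0.05);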
To collect DB statistics and to do vacuum (regular, not FULL)
on all the objects of all the DBs,
run the following command:
vacuumdb -h <db_host> -p <db_port> -U <db_user> -j 4 -z -v -a
Here, -j 4 runs 4 parallel jobs, -z also collects statistics,
-v enables verbose output, and -a processes all databases.
Proactive PostgreSQL DB Scanner
On mission-critical systems, it is important to be proactive
and prevent situations where the key parameters of the system,
including CPU utilization, reach maximum values
that directly impact performance, stability, and cost.
Like in medicine, we want to find a cure
as soon as the first symptoms appear.
I'd like to consider an approach that makes it easy to
- identify queries that use the system inefficiently
- find a root cause for performance issues
- understand typical workload patterns and performance bottlenecks
Once found, these patterns and queries can be improved, and the system will work efficiently.
The Proactive PostgreSQL DB Performance Scanner
is a script that connects to a database
and runs a set of probes that can be extended if desired.
All the probes are queries to a database,
unified by a common structure.
Each probe includes:
- the threshold value
- a description of the check
- the issue this check is associated with
- a recommendation on how to troubleshoot the issue
- the SQL query to perform the check
- an additional optional SQL query,
in case there is a need for more evidence
Example of how to run the Proactive PostgreSQL DB Performance Scanner:
proactive_pg_db_performance_scanner.sh -h db_host -p 5432 -U postgres -d postgres
The related code is provided on GitHub; the author is a really nice person.
The source code of the Proactive PostgreSQL DB Scanner:
https://github.com/dm8ry/proactive-postgresql-db-performance-scanner
Only those who read to the end get to see this tool :)