如何检测、清理Greenplum膨胀、垃圾

本文探讨了Pivotal Greenplum数据库中表膨胀的现象及其原因,并提供了识别和消除表膨胀的方法。通过使用gp_toolkit模式下的视图可以有效监控表膨胀情况,并采取VACUUM命令或表重新分布等措施来解决此问题。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

In some cases, due to the underlying storage architecture, Pivotal Greenplum "heap" tables are susceptible to bloat. Bloat can affect table scanning performance and therefore user query performance.

The following will be covered in this article:

  1. What is table bloat?
  2. What causes table bloat?
  3. How to identify table bloat?
  4. How to eliminate table bloat?
  5. Recommendations

Answer

1. What is table bloat?

Table bloat is accumulation of free space used by old data rows within the table data files. This space has been previously used by data rows which are deleted and not accessible any more. Failure to do table maintenance in order to allow reuse of this space causes table data file to grow bigger and therefore scans of the table take longer.

2. What causes table bloat?

The Pivotal Greenplum storage implementation (known as MVCC - MultiVersion Concurrency Control) is borrowed from Postgres. According to this implementation:

There are no in-place updates (updates are implemented by delete + insert) The row stays in the data file until space is marked as "free" space via the VACUUM command after no more transactions can potentially access it.

Once VACUUM registers the deleted row space as "free space", the space can be reused by future inserts and updates. After rows are deleted (and no transactions can access them) and before VACUUM is executed, this space is not marked as free for reuse and is effectively "dead space".

3. How to identify tables bloat?

Use the gp_toolkit administrative schema:

gp_toolkit.gp_bloat_diag - this view shows tables with moderate and significant amount of bloat

Columns:

  • bdirelid - Object ID of the table (pg_class.oid)
  • bdinspname - table schema name
  • bdirelname - table name
  • bdirelpages - number of pages currently in table data files(当前表的数据页数)
  • bdiexppages - number of pages expected according to current statistics(期望的数据页数)
  • bdidiag - diagnosis of bloat (ratio from 1 to 3 -> no bloat, ratio from 4 to 10 -> moderate bloat, ratio > 10 -> significant bloat)(诊断的水位no bloat:无水位  moderate bloat)

   --对于significant应该需要进行清理,使用vacuum进行清理。但是vacuum无法对索引进行回收空间,所以要先删除索引,然后再进行回收

Example:

In this example the table "t1" is severely bloated (the calculated data size for data currently in table is 1 page, but table consists of 97 pages).

gpadmin=# select * from gp_toolkit.gp_bloat_diag;
 bdirelid | bdinspname | bdirelname | bdirelpages | bdiexppages |                bdidiag                
----------+------------+------------+-------------+-------------+---------------------------------------
    21488 | public     | t1         |          97 |           1 | significant amount of bloat suspected
(1 row)

gp_toolkit.gp_bloat_expected_pages - this view shows the page data (current pages/expected pages) for all tables in the database Columns:

  • btdrelid - Object ID of the table (pg_class.oid)
  • btdrelpages - number of pages currently in table data files
  • btdexppages - number of pages expected according to current statistics

Example: In this example the tables shown all have calculated data size of 1 page and actual data file size 1 page. No bloat is detected.

gpadmin=# select * from gp_toolkit.gp_bloat_expected_pages limit 5;
 btdrelid | btdrelpages | btdexppages 
----------+-------------+-------------
    10789 |           1 |           1
    10794 |           1 |           1
    10799 |           1 |           1
     5004 |           1 |           1
     7175 |           1 |           1
(5 rows)

NOTE

  • These views depend on table statistics to calculate the amount of bloat, therefore it is crucial that table statistics are up to date!
  • Pivotal Greenplum database system tables are "heap" tables. Therefore they are susceptible to bloat. As system tables are very important for the database performance, it is recommended that they are vacuumed regularly. Recommended intervals are from one week for system with little object changes (create/drop/alter table) to one day or less for systems with lots of object changes.

4. How to eliminate table bloat?

The VACUUM command is used to mark the deleted rows in table data files as "free space" available for reuse and therefore eliminate bloat. It is normal and healthy for a table with lots of UPDATE/DELETE/INSERT workload to have certain small amount of deleted rows/free space, which will be reused as new data comes in. The regular use of VACUUM ensures that deleted rows are reused as free space as soon as possible.

In certain cases, when VACUUM has not been run for a long time and a huge amount of deleted rows (dead rows) have accumulated in the data files, it comes to situation where the real table rows are just a tiny amount of the table data files (in other words - table is significantly bloated). In these cases in order to eliminate the extra reserved space (if it will never be reused anyway), the VACUUM FULL command can be executed. This command compacts the table data by moving it at the front of the data file and truncates the unused space on the tail.

VACUUM FULL compacts rows one by one and therefore is slow for big tables and also takes exclusive lock on the table.
VACUUM FULL execution is recommended in maintenance window and with careful consideration of the run-time and effects.
VACUUM FULL command should not be terminated by user once started.

Better alternative to VACUUM FULL for user tables is to redistribute the table (cannot be performed on system tables). This effectively rebuilds the table getting rid of the bloat in the process.

The steps are:

  • Write down the current table distribution columns
  • ALTER TABLE SET with (REORGANIZE=false) DISTRIBUTED randomly; -- only "marks" the table, but does not move rows. Finishes instantly.
  • ALTER TABLE SET with (REORGANIZE=true) DISTRIBUTED BY (<column name>); -- rewrites the data files. As the distribution actually never changes on data file level, rewrite happens locally without sending rows over network.

Refer to the article for more option to remove database bloat

Note

  • VACUUM FULL and the better alternative with table redistribution are recommended only in extreme cases.
  • VACUUM is recommended to be run regularly on tables with INSERT/UPDATE/DELETE workload and on system tables.

5. Recommendations

  • Check for bloat regularly in the system tables and user tables using gp_toolkit views.
  • Regularly VACUUM system tables (according to the create/drop/alter object frequency).
  • Identify user tables with lots of INSERT/UPDATE/DELETE activity and VACUUM them in low activity or no activity time windows according to bloat accumulation.
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值