PostgreSQL-3中的MVCC。行版本

最新推荐文章于 2025-12-20 13:10:13 发布

翻译最新推荐文章于 2025-12-20 13:10:13 发布 · 223 阅读

0 ·

CC 4.0 BY-SA版权

原文链接：https://habr.com/en/company/postgrespro/blog/477648/

文章标签：

#数据库 #python #mysql #java #大数据

本文深入探讨了数据库事务处理的核心概念，包括行版本控制、元组头部信息、NULL位图等，解析了插入、更新、删除操作在低级别数据结构上的实现细节。并通过PostgreSQL的具体例子，展示了事务ID、可见性检查、XACT结构、子事务和保存点等功能的运作机制。

isolation and made a digression regarding 隔离，并且 the low-level data structure. And we've finally reached the most fascinating thing, that is, row versions (tuples). 对低级数据结构做了题外话。我们终于达到了最引人入胜的东西，即行版本(元组)。

元组头 (Tuple header)

As already mentioned, several versions of each row can be simultaneously available in the database. And we need to somehow distinguish one version from another one. To this end, each version is labeled with its effective «time» (xmin) and expiration «time» (xmax). Quotation marks denote that a special incrementing counter is used rather than the time itself. And this counter is the transaction identifier.

如前所述，数据库中可以同时使用每行的多个版本。我们需要以某种方式将一个版本与另一个版本区分开。为此，每个版本都标有有效的“时间”( xmin )和有效的“时间”( xmax )。引号表示使用特殊的递增计数器，而不是时间本身。这个计数器是交易标识符 。

(As usual, in reality this is more complicated: the transaction ID cannot always increment due to a limited bit depth of the counter. But we will explore more details of this when our discussion reaches freezing.)

(通常，这实际上更复杂：由于计数器的位深度有限，交易ID不能总是递增。但是，当我们的讨论陷入僵局时，我们将探索其更多细节。)

When a row is created, the value of xmin is set equal to the ID of the transaction that performed the INSERT command, while xmax is not filled in.

创建行时，将xmin的值设置为等于执行INSERT命令的事务的ID，而未填写xmax 。

When a row is deleted, the xmax value of the current version is labeled with the ID of the transaction that performed DELETE.

删除一行后，当前版本的xmax值将标记有执行DELETE的事务的ID。

An UPDATE command actually performs two subsequent operations: DELETE and INSERT. In the current version of the row, xmax is set equal to the ID of the transaction that performed UPDATE. Then a new version of the same row is created, in which the value of xmin is the same as xmax of the previous version.

UPDATE命令实际上执行两个后续操作：DELETE和INSERT。在该行的当前版本中，将xmax设置为等于执行UPDATE的事务的ID。然后，创建同一行的新版本，其中xmin的值与先前版本的xmax相同。

xmin and xmin和 xmax fields are included in the header of a row version. In addition to these fields, the tuple header contains other ones, such as: xmax字段包含在行版本的标题中。除这些字段外，元组头还包含其他字段，例如：

infomask — several bits that determine the properties of a given tuple. There are quite a few of them, and we will discuss each over time.
infomask —几个位，用于确定给定元组的属性。其中有很多，我们将逐步讨论。
ctid — a reference to the next, more recent, version of the same row. ctid of the newest, up-to-date, row version references that very version. The number is in the (x,y) form, where x is the number of the page and y is the order number of the pointer in the array.
ctid对同一行的下一个更新版本的引用。最新的行版本的ctid引用该版本。该数字采用(x,y)形式，其中x是页面的编号，而y是数组中指针的顺序号。
The NULLs bitmap, which marks those columns of a given version that contain a NULL. NULL is not a regular value of data types, and therefore, we have to store this characteristic separately.
NULL位图，用于标记给定版本中包含NULL的列。 NULL不是数据类型的常规值，因此，我们必须单独存储此特征。

As a result, the header appears pretty large: 23 bytes per each tuple at a minimum, but usually larger because of the NULLs bitmap. If a table is «narrow» (that is, it contains few columns), the overhead bytes can occupy more space that the useful information.

结果，标题看起来非常大：每个元组至少23个字节，但是由于NULL位图而通常更大。如果一个表是“ narrow”(窄)(也就是说，它包含几列)，那么开销字节会比有用信息占用更多的空间。

插 (Insert)

Let's look in more detail at how the operations on rows are performed at a low level, and we start with an insert.

让我们更详细地看一下如何在低级别执行行操作，并从插入开始。

To experiment, we will create a new table with two columns and an index on one of them:

为了进行实验，我们将创建一个新表，其中包含两列，并在其中一列上具有索引：

=> CREATE TABLE t(
  id serial,
  s text
);
=> CREATE INDEX ON t(s);

We start a transaction to insert a row.

我们开始一个事务以插入一行。

=> BEGIN;
=> INSERT INTO t(s) VALUES ('FOO');

This is the ID of our current transaction:

这是我们当前交易的ID：

=> SELECT txid_current();

txid_current 
--------------
         3664
(1 row)

Let's look into the contents of the page. The heap_page_items function from the «pageinspect» extension enables us to get information on the pointers and row versions:

让我们看一下页面的内容。 «pageinspect»扩展中的heap_page_items函数使我们能够获取有关指针和行版本的信息：

=> SELECT * FROM heap_page_items(get_raw_page('t',0)) \gx

-[ RECORD 1 ]-------------------
lp          | 1
lp_off      | 8160
lp_flags    | 1
lp_len      | 32
t_xmin      | 3664
t_xmax      | 0
t_field3    | 0
t_ctid      | (0,1)
t_infomask2 | 2
t_infomask  | 2050
t_hoff      | 24
t_bits      | 
t_oid       | 
t_data      | \x0100000009464f4f

Note that the word «heap» in PostgreSQL denotes tables. This is one more weird usage of a term: a heap is a known data structure, which has nothing to do with a table. This word is used here in the sense that «all is heaped up», unlike in ordered indexes.

请注意，PostgreSQL中的“堆”一词表示表。这是术语的另一种怪异用法：堆是已知的数据结构，与表无关。在此使用该词的含义是“全部堆积”，这与有序索引不同。

This function shows the data «as is», in a format that is difficult to comprehend. To clarify the things, we leave only part of the information and interpret it:

此功能以难以理解的格式显示数据“原样”。为了澄清问题，我们仅保留部分信息并进行解释：

=> SELECT '(0,'||lp||')' AS ctid,
       CASE lp_flags
         WHEN 0 THEN 'unused'
         WHEN 1 THEN 'normal'
         WHEN 2 THEN 'redirect to '||lp_off
         WHEN 3 THEN 'dead'
       END AS state,
       t_xmin as xmin,
       t_xmax as xmax,
       (t_infomask & 256) > 0  AS xmin_commited,
       (t_infomask & 512) > 0  AS xmin_aborted,
       (t_infomask & 1024) > 0 AS xmax_commited,
       (t_infomask & 2048) > 0 AS xmax_aborted,
       t_ctid
FROM heap_page_items(get_raw_page('t',0)) \gx

-[ RECORD 1 ]-+-------
ctid          | (0,1)
state         | normal
xmin          | 3664
xmax          | 0
xmin_commited | f
xmin_aborted  | f
xmax_commited | f
xmax_aborted  | t
t_ctid        | (0,1)

We did the following:

我们做了以下工作：

Added a zero to the pointer number to make it look like a t_ctid: (page number, pointer number).
在指针编号上添加了零，使其看起来像t_ctid ：(页面编号，指针编号)。
Interpreted the status of the lp_flags pointer. It is «normal» here, which means that the pointer actually references a row version. We will discuss other values later.
解释了lp_flags指针的状态。这里是“普通”，这意味着指针实际上引用了行版本。稍后我们将讨论其他值。
Of all information bits, we selected only two pairs so far. xmin_committed and xmin_aborted bits show whether the transaction with the ID xmin is committed (rolled back). A pair of similar bits relates to the transaction with the ID xmax.
到目前为止，在所有信息位中，我们仅选择了两对。 xmin_committed和xmin_aborted位显示ID为xmin的事务是否已提交(回退)。一对相似的比特与ID为xmax的交易有关。

What do we observe? When a row is inserted, in the table page a pointer appears that has number 1 and references the first and the only version of the row.

我们观察到什么？当插入一行时，在表格页面中会出现一个指针，该指针的编号为1，并引用该行的第一个和唯一版本。

The xmin field in the tuple is filled with the ID of the current transaction. Because the transaction is still active, both xmin_committed and xmin_aborted bits are unset.

元组中的xmin字段填充有当前事务的ID。由于事务仍处于活动状态，因此xmin_committed和xmin_aborted位均未设置。

The ctid field of the row version references the same row. It means that no newer version is available.

行版本的ctid字段引用同一行。这意味着没有新版本可用。

The xmax field is filled with the conventional number 0 since the tuple is not deleted, that is, up-to-date. Transactions will ignore this number because of the xmax_aborted bit set.

由于没有删除元组(即最新)，因此xmax字段用常规数字0填充。由于设置了xmax_aborted位，事务将忽略该数字。

Let's move one more step to improving the readability by appending information bits to transaction IDs. And let's create the function since we will need the query more than once:

通过将信息位附加到事务ID上，我们又向前迈出了一步，以提高可读性。并且创建函数，因为我们将多次查询：

=> CREATE FUNCTION heap_page(relname text, pageno integer)
RETURNS TABLE(ctid tid, state text, xmin text, xmax text, t_ctid tid)
AS $$
SELECT (pageno,lp)::text::tid AS ctid,
       CASE lp_flags
         WHEN 0 THEN 'unused'
         WHEN 1 THEN 'normal'
         WHEN 2 THEN 'redirect to '||lp_off
         WHEN 3 THEN 'dead'
       END AS state,
       t_xmin || CASE
         WHEN (t_infomask & 256) > 0 THEN ' (c)'
         WHEN (t_infomask & 512) > 0 THEN ' (a)'
         ELSE ''
       END AS xmin,
       t_xmax || CASE
         WHEN (t_infomask & 1024) > 0 THEN ' (c)'
         WHEN (t_infomask & 2048) > 0 THEN ' (a)'
         ELSE ''
       END AS xmax,
       t_ctid
FROM heap_page_items(get_raw_page(relname,pageno))
ORDER BY lp;
$$ LANGUAGE SQL;

What is happening in the header of the row version it is much clearer in this form:

在行版本的标题中发生的事情，这种形式更加清楚：

=> SELECT * FROM heap_page('t',0);

ctid  | state  | xmin | xmax  | t_ctid 
-------+--------+------+-------+--------
 (0,1) | normal | 3664 | 0 (a) | (0,1)
(1 row)

We can get similar information, but far less detailed, from the table itself by using xmin and xmax pseudo-columns:

通过使用xmin和xmax伪列，我们可以从表本身获得类似的信息，但详细程度要低得多：

=> SELECT xmin, xmax, * FROM t;

xmin | xmax | id |  s  
------+------+----+-----
 3664 |    0 |  1 | FOO
(1 row)

承诺 (Commit)

When a transaction is successful, its status must be remembered, that is, the transaction must be marked as committed. To this end, the XACT structure is used. (Before version 10 it was called CLOG (commit log), and you are still likely to come across this name.)

事务成功后，必须记住其状态，即必须将事务标记为已提交。为此，使用了XACT结构。 (在版本10之前，它被称为CLOG(提交日志)，您仍然很可能会遇到此名称。)

XACT is not a table of the system catalog, but files in the PGDATA/pg_xact directory. Two bits are allocated in these files for each transaction — «committed» and «aborted» — exactly the same way as in the tuple header. This information is spread across several files only for convenience; we will get back to this when we discuss freezing. PostgreSQL works with these files page by page, as with all others.

XACT不是系统目录的表，而是PGDATA / pg_xact目录中的文件。在这些文件中为每个事务分配了两位(即“已提交”和“已中止”)，其方式与元组头中的方式完全相同。此信息只是为了方便而散布在几个文件中。在讨论冻结时，我们将回到这一点。 PostgreSQL与所有其他文件一样，逐页处理这些文件。

So, when a transaction is committed, the «committed» bit is set for this transaction in XACT. And this is all that happens when the transaction is committed (although we do not mention the write-ahead log yet).

因此，提交事务后，将在XACT中为此事务设置“ committed”位。这就是提交事务时发生的所有事情(尽管我们还没有提到预写日志)。

When some other transaction accesses the table page we were just looking at, the former will have to answer a few questions.

当其他事务访问我们刚刚查看的表页面时，前者将不得不回答一些问题。

Was the transaction xmin completed? If not, the created tuple must not be visible.
交易xmin是否已完成？如果不是，则创建的元组必须不可见。

This is checked by looking through another structure, which is located in the shared memory of the instance and called ProcArray. This structure holds a list of all active processes, along with the ID of the current (active) transaction for each.
通过查看位于实例的共享内存中的另一个结构ProcArray来检查此情况。此结构包含所有活动进程的列表，以及每个进程的当前(活动)事务的ID。
If the transaction was completed, then was it committed or rolled back? If it was rolled back, the tuple must not be visible either.
如果交易完成，那么是提交还是回滚？如果已回滚，则该元组也不得可见。

This is just what XACT is needed for. But it is expensive to check XACT each time, although last pages of XACT are stored in buffers in the shared memory. Therefore, once figured out, the transaction status is written to
这正是XACT所需要的。但是，尽管XACT的最后一页存储在共享内存的缓冲区中，但每次检查XACT的开销都很高。因此，一旦确定，就将交易状态写入

xmin_committed and xmin_aborted bits of the tuple. If any of these bits is set, the transaction status is treated as known and the next transaction will not need to check XACT.
元组的xmin_committed和xmin_aborted位。如果设置了这些位中的任何一个，则将事务状态视为已知，并且下一个事务将不需要检查XACT。

Why does not the transaction that performs the insert set these bits? When an insert is being performed, the transaction is yet unaware of whether it will be completed successfully. And at the commit time it's already unclear which rows and in which pages were changed. There can be a lot of such pages, and it is impractical to keep track of them. Besides, some of the pages can be evicted to disk from the buffer cache; to read them again in order to change the bits would mean a considerable slowdown of the commit.

为什么执行插入的事务未设置这些位？当执行插入操作时，事务尚不知道它是否将成功完成。在提交时，还不清楚哪些行和哪些页面已更改。这样的页面可能很多，要跟踪它们是不切实际的。此外，某些页面可以从缓冲区高速缓存中逐出到磁盘上。再次读取它们以更改位将意味着提交的速度大大降低。

The reverse side of the cost saving is that after the updates, any transaction (even the one performing SELECT) can begin changing data pages in the buffer cache.

节省成本的反面是，更新之后，任何事务(甚至是执行SELECT的事务)都可以开始更改缓冲区高速缓存中的数据页。

So, we commit the change.

因此，我们提交更改。

=> COMMIT;

Nothing has changed in the page (but we know that the transactions status is already written to XACT):

页面中没有任何更改(但我们知道事务状态已经写入XACT)：

=> SELECT * FROM heap_page('t',0);

ctid  | state  | xmin | xmax  | t_ctid 
-------+--------+------+-------+--------
 (0,1) | normal | 3664 | 0 (a) | (0,1)
(1 row)

Now a transaction that first accesses the page will need to determine the status of the transaction xmin and will write it to the information bits:

现在，首先访问该页面的事务将需要确定事务xmin的状态，并将其写入信息位：

=> SELECT * FROM t;

id |  s  
----+-----
  1 | FOO
(1 row)

=> SELECT * FROM heap_page('t',0);

ctid  | state  |   xmin   | xmax  | t_ctid 
-------+--------+----------+-------+--------
 (0,1) | normal | 3664 (c) | 0 (a) | (0,1)
(1 row)

删除 (Delete)

When a row is deleted, the ID of the current deleting transaction is written to the xmax field of the up-to-date version and the xmax_aborted bit is reset.

当删除一行时，当前删除事务的ID会写入最新版本的xmax字段，并且xmax_aborted位将被重置。

Note that the value of xmax corresponding to the active transaction works as a row lock. If another transaction is going to update or delete this row, it will have to wait until the xmax transaction completes. We will talk about locks in more detail later. At this point, only note that the number of row locks is not limited at all. They do not occupy memory, and the system performance is not affected by that number. However, long lasting transactions have other drawbacks, which will also be discussed later.

请注意，与活动事务相对应的xmax值用作行锁。如果另一个事务要更新或删除该行，则必须等到xmax事务完成。稍后我们将更详细地讨论锁。此时，仅注意行锁的数量根本没有限制。它们不占用内存，并且该数量不会影响系统性能。但是，持久的交易还有其他缺点，稍后将对此进行讨论。

Let's delete a row.

让我们删除一行。

=> BEGIN;
=> DELETE FROM t;
=> SELECT txid_current();

txid_current 
--------------
         3665
(1 row)

We see that the transaction ID is written to the xmax field, but information bits are unset:

我们看到事务ID被写入xmax字段，但是未设置信息位：

=> SELECT * FROM heap_page('t',0);

ctid  | state  |   xmin   | xmax | t_ctid 
-------+--------+----------+------+--------
 (0,1) | normal | 3664 (c) | 3665 | (0,1)
(1 row)

中止 (Abort)

Abort of a transaction works similarly to commit, except that the «aborted» bit is set in XACT. An abort is done as fast as a commit. Although the command is called ROLLBACK, the changes are not rolled back: everything that the transaction has already changed, remains untouched.

事务中止的工作与提交类似，只是在XACT中将“中止”位置1。中止与提交一样快。尽管该命令称为ROLLBACK，但更改不会回滚：事务已更改的所有内容均保持不变。

=> ROLLBACK;
=> SELECT * FROM heap_page('t',0);

ctid  | state  |   xmin   | xmax | t_ctid 
-------+--------+----------+------+--------
 (0,1) | normal | 3664 (c) | 3665 | (0,1)
(1 row)

When accessing the page, the status will be checked and the hint bit xmax_aborted will be set. Although the number xmax itself will be still in the page, it will not be looked at.

访问该页面时，将检查状态并设置提示位xmax_aborted 。尽管数字xmax本身仍将在页面中，但不会看到它。

=> SELECT * FROM t;

id |  s  
----+-----
  1 | FOO
(1 row)

=> SELECT * FROM heap_page('t',0);

ctid  | state  |   xmin   |   xmax   | t_ctid 
-------+--------+----------+----------+--------
 (0,1) | normal | 3664 (c) | 3665 (a) | (0,1)
(1 row)

更新资料 (Update)

An update works as if the current version is deleted first and then a new one is inserted.

更新的工作方式就像先删除当前版本，然后插入新版本一样。

=> BEGIN;
=> UPDATE t SET s = 'BAR';
=> SELECT txid_current();

txid_current 
--------------
         3666
(1 row)

The query returns one row (the new version):

查询返回一行(新版本)：

=> SELECT * FROM t;

id |  s  
----+-----
  1 | BAR
(1 row)

But we can see both versions in the page:

但是我们可以在页面中看到两个版本：

=> SELECT * FROM heap_page('t',0);

ctid  | state  |   xmin   | xmax  | t_ctid 
-------+--------+----------+-------+--------
 (0,1) | normal | 3664 (c) | 3666  | (0,2)
 (0,2) | normal | 3666     | 0 (a) | (0,2)
(2 rows)

The deleted version is labeled with the ID of the current transaction in the xmax field. Moreover, this value has overwritten the old one since the previous transaction was rolled back. And the xmax_aborted bit is reset since the status of the current transaction is unknown yet.

删除的版本在xmax字段中标有当前交易的ID。而且，自从上次事务回滚以来，此值已覆盖了旧值。并且xmax_aborted位被重置，因为当前事务的状态还未知。

The first version of the row is now referencing the second, as a newer one.

该行的第一个版本现在引用第二个版本，作为较新的版本。

The index page now contains the second pointer and second row, which references the second version in the table page.

现在，索引页包含第二个指针和第二行，该行引用表页面中的第二个版本。

The same way as for a delete, the value of xmax in the first version indicates that the row is locked.

与删除操作相同，第一个版本中的xmax值指示该行已锁定。

Lastly, we commit the transaction.

最后，我们提交交易。

=> COMMIT;

指标 (Indexes)

We were talking only about table pages so far. But what happens inside indexes?

到目前为止，我们仅谈论表页面。但是索引内部会发生什么？

Information in index pages highly depends on the specific index type. Moreover, even one type of indexes can have different kinds of pages. For example: a B-tree has the metadata page and «normal» pages.

索引页中的信息高度取决于特定的索引类型。而且，即使一种类型的索引也可以具有不同种类的页面。例如：B树具有元数据页面和“普通”页面。

Nevertheless, an index page usually has an array of pointers to the rows and rows themselves (just like table pages). Besides, some space at the end of a page is allocated for special data.

但是，索引页通常具有一个指向行和行本身的指针数组(就像表页一样)。此外，页面末尾的一些空间还分配给特殊数据。

Rows in indexes can also have different structures depending on the index type. For example: in an B-tree, rows pertinent to leaf pages contain the value of the indexing key and a reference (ctid) to the appropriate table row. In general, an index can be structured quite a different way.

根据索引类型，索引中的行也可以具有不同的结构。例如：在B树中，与叶子页相关的行包含索引键的值和对相应表行的引用( ctid )。通常，索引的结构可以完全不同。

The main point is that in indexes of any type there are no row versions. Or we can consider each row to be represented by only one version. In other words, the header of the index row does not contain the xmin and xmax fields. For now we can assume that references from the index point to all versions of table rows. So to make out which of the row versions are visible to a transaction, PostgreSQL needs to look into the table. (As usual, this is not the whole story. Sometimes the visibility map enables optimizing the process, but we will discuss this later.)

要点是，在任何类型的索引中都没有行版本。或者我们可以考虑每一行仅由一个版本表示。换句话说，索引行的标题不包含xmin和xmax字段。现在，我们可以假设从索引指向表行的所有版本的引用。因此，要确定事务中可见哪些行版本，PostgreSQL需要调查表。 (像往常一样，这不是全部内容。有时可见性图可以优化流程，但我们稍后会讨论。)

Here, in the index page, we find pointers to both versions: the up-to-date and previous:

在这里，在索引页面中，我们找到两个版本的指针：最新版本和之前版本：

=> SELECT itemoffset, ctid FROM bt_page_items('t_s_idx',1);

itemoffset | ctid  
------------+-------
          1 | (0,2)
          2 | (0,1)
(2 rows)

虚拟交易 (Virtual transactions)

In practice, PostgreSQL takes advantage of an optimization that permits to «sparingly» expends transaction IDs.

实际上，PostgreSQL利用了优化的优势，该优化允许“少量”消耗事务ID。

If a transaction only reads data, it does not affect the visibility of tuple at all. Therefore, first the backend process assigns a virtual ID (virtual xid) to the transaction. This ID consists of the process identifier and a sequential number.

如果事务仅读取数据，则根本不影响元组的可见性。因此，首先，后端进程将虚拟ID(虚拟xid)分配给事务。该ID由进程标识符和序列号组成。

Assignment of this virtual ID does not require synchronization between all the processes and is therefore performed very quickly. We will learn another reason of using virtual IDs when we discuss freezing.

分配此虚拟ID不需要所有进程之间的同步，因此可以非常快速地执行。在讨论冻结时，我们将了解使用虚拟ID的另一个原因。

Data snapshots do not take into account virtual ID at all.

数据快照根本不考虑虚拟ID。

At different points in time, the system can have virtual transactions with IDs that were already used, and this is fine. But this ID cannot be written to data pages since when the page is accessed next time, the ID can become meaningless.

在不同的时间点，系统可以具有已使用ID的虚拟事务，这很好。但是此ID无法写入数据页，因为下次访问该页时，该ID可能变得毫无意义。

=> BEGIN;
=> SELECT txid_current_if_assigned();

txid_current_if_assigned 
--------------------------
                         
(1 row)

But if a transaction begins to change data, it receives a true, unique, transaction ID.

但是，如果事务开始更改数据，它将收到一个真实的，唯一的事务ID。

=> UPDATE accounts SET amount = amount - 1.00;
=> SELECT txid_current_if_assigned();

txid_current_if_assigned 
--------------------------
                     3667
(1 row)

=> COMMIT;

子交易 (Subtransactions)

保存点 (Savepoints)

In SQL, savepoints are defined, which permit rolling back some operations of the transaction without its complete abortion. But this is incompatible with the above model since the transaction status is one for all the changes and no data is physically rolled back.

在SQL中，定义了保存点 ，这些保存点允许回滚事务的某些操作而不会完全中止。但这与上述模型不兼容，因为事务状态是所有更改的结果之一，并且没有物理回滚的数据。

To implement this functionality, a transaction with a savepoint is divided into several separate subtransactions whose statuses can be managed separately.

为了实现此功能，带有保存点的事务被分为几个单独的子事务，其状态可以分别进行管理。

Subtrabsactions have their own IDs (greater than the ID of the main transaction). The statuses of subtransactions are written to XACT in a usual way, but the final status depends on the status of the main transaction: if it is rolled back, all subtransactions are rolled back as well.

子交易具有自己的ID(大于主交易的ID)。子事务的状态以通常的方式写入XACT，但最终状态取决于主事务的状态：如果回滚，则所有子事务也将回滚。

Information about subtransactions nesting is stored in files of the PGDATA/pg_subtrans directory. These files are accessed through buffers in the shared memory of the instance, which are structured the same way as XACT buffers.

有关子事务嵌套的信息存储在PGDATA / pg_subtrans目录的文件中。这些文件是通过实例共享内存中的缓冲区访问的，这些缓冲区的结构与XACT缓冲区相同。

Do not confuse subtransactions with autonomous transactions. Autonomous transactions in no way depend on one another, while subtransactions do depend. There are no autonomous transactions in the regular PostgreSQL, which is, perhaps, for the better: they are actually needed extremely rarely, and their availability in other DBMS invites abuse, which everyone suffers.

不要将子事务与自主事务混淆。自主事务绝不相互依赖，而子事务却相互依赖。常规PostgreSQL中没有自治事务，这可能会变得更好：实际上很少需要它们，并且它们在其他DBMS中的可用性会导致滥用，每个人都会遭受痛苦。

Let's clear the table, start a transaction and insert a row:

让我们清除表，开始事务并插入一行：

=> TRUNCATE TABLE t;
=> BEGIN;
=> INSERT INTO t(s) VALUES ('FOO');
=> SELECT txid_current();

txid_current 
--------------
         3669
(1 row)

=> SELECT xmin, xmax, * FROM t;

xmin | xmax | id |  s  
------+------+----+-----
 3669 |    0 |  2 | FOO
(1 row)

=> SELECT * FROM heap_page('t',0);

ctid  | state  | xmin | xmax  | t_ctid 
-------+--------+------+-------+--------
 (0,1) | normal | 3669 | 0 (a) | (0,1)
(1 row)

Now we establish a savepoint and insert another row.

现在我们建立一个保存点并插入另一行。

=> SAVEPOINT sp;
=> INSERT INTO t(s) VALUES ('XYZ');
=> SELECT txid_current();

txid_current 
--------------
         3669
(1 row)

Note that the txid_current function returns the ID of the main transaction rather than of the subtransaction.

请注意， txid_current函数返回主事务的ID，而不是子事务的ID。

=> SELECT xmin, xmax, * FROM t;

xmin | xmax | id |  s  
------+------+----+-----
 3669 |    0 |  2 | FOO
 3670 |    0 |  3 | XYZ
(2 rows)

=> SELECT * FROM heap_page('t',0);

ctid  | state  | xmin | xmax  | t_ctid 
-------+--------+------+-------+--------
 (0,1) | normal | 3669 | 0 (a) | (0,1)
 (0,2) | normal | 3670 | 0 (a) | (0,2)
(2 rows)

Let's rollback to the savepoint and insert the third row.

让我们回滚到保存点并插入第三行。

=> ROLLBACK TO sp;
=> INSERT INTO t VALUES ('BAR');
=> SELECT xmin, xmax, * FROM t;

xmin | xmax | id |  s  
------+------+----+-----
 3669 |    0 |  2 | FOO
 3671 |    0 |  4 | BAR
(2 rows)

=> SELECT * FROM heap_page('t',0);

ctid  | state  |   xmin   | xmax  | t_ctid 
-------+--------+----------+-------+--------
 (0,1) | normal | 3669     | 0 (a) | (0,1)
 (0,2) | normal | 3670 (a) | 0 (a) | (0,2)
 (0,3) | normal | 3671     | 0 (a) | (0,3)
(3 rows)

In the page, we continue to see the row that was added by the rolled back subtransaction.

在页面中，我们继续看到由回滚子事务添加的行。

Committing the changes.

提交更改。

=> COMMIT;
=> SELECT xmin, xmax, * FROM t;

xmin | xmax | id |  s  
------+------+----+-----
 3669 |    0 |  2 | FOO
 3671 |    0 |  4 | BAR
(2 rows)

=> SELECT * FROM heap_page('t',0);

ctid  | state  |   xmin   | xmax  | t_ctid 
-------+--------+----------+-------+--------
 (0,1) | normal | 3669 (c) | 0 (a) | (0,1)
 (0,2) | normal | 3670 (a) | 0 (a) | (0,2)
 (0,3) | normal | 3671 (c) | 0 (a) | (0,3)
(3 rows)

It is clearly seen now that each subtransaction has its own status.

现在可以清楚地看到每个子事务都有其自己的状态。

Note that SQL does not permit explicit use of subtransactions, that is, you cannot start a new transaction before you complete the current one. This technique gets implicitly involved when savepoints are used and also when handling PL/pgSQL exceptions, as well as in some other, more exotic situations.

请注意，SQL不允许显式使用子事务，也就是说，您不能在完成当前事务之前启动新事务。当使用保存点以及处理PL / pgSQL异常时，以及在其他一些更特殊的情况下，将隐式涉及此技术。

=> BEGIN;

BEGIN

=> BEGIN;

WARNING:  there is already a transaction in progress
BEGIN

=> COMMIT;

COMMIT

=> COMMIT;

WARNING:  there is no transaction in progress
COMMIT

错误和操作原子性 (Errors and operation atomicity)

What happens if an error occurs while the operation is being performed? For example, like this:

如果在执行操作时发生错误怎么办？例如，像这样：

=> BEGIN;
=> SELECT * FROM t;

id |  s  
----+-----
  2 | FOO
  4 | BAR
(2 rows)

=> UPDATE t SET s = repeat('X', 1/(id-4));

ERROR:  division by zero

An error occurred. Now the transaction is treated as aborted and no operations are permitted in it:

发生错误。现在，该事务被视为已中止，并且其中不允许任何操作：

=> SELECT * FROM t;

ERROR:  current transaction is aborted, commands ignored until end of transaction block

And even if we try to commit the changes, PostgreSQL will report the rollback:

即使我们尝试提交更改，PostgreSQL也会报告回滚：

=> COMMIT;

ROLLBACK

Why is it impossible to continue execution of the transaction after a failure? The thing is that the error could occur so that we would get access to part of the changes, that is, the atomicity would be broken not only for the transaction, but even for a single operator. For instance, in our example the operator could have updated one row before the error occurred:

为什么失败后无法继续执行事务？问题是可能发生错误，以便我们可以访问部分更改，也就是说，不仅对于事务，甚至对于单个操作员，原子性都将被破坏。例如，在我们的示例中，操作员可以在发生错误之前更新一行：

=> SELECT * FROM heap_page('t',0);

ctid  | state  |   xmin   | xmax  | t_ctid 
-------+--------+----------+-------+--------
 (0,1) | normal | 3669 (c) | 3672  | (0,4)
 (0,2) | normal | 3670 (a) | 0 (a) | (0,2)
 (0,3) | normal | 3671 (c) | 0 (a) | (0,3)
 (0,4) | normal | 3672     | 0 (a) | (0,4)
(4 rows)

It's worth noting that psql has a mode that allows continuing the transaction after failure, as if the effects of the erroneous operator were rolled back.

值得注意的是，psql具有一种模式，该模式允许失败后继续进行事务，就好像错误的运算符的作用已回滚一样。

=> \set ON_ERROR_ROLLBACK on
=> BEGIN;
=> SELECT * FROM t;

id |  s  
----+-----
  2 | FOO
  4 | BAR
(2 rows)

=> UPDATE t SET s = repeat('X', 1/(id-4));

ERROR:  division by zero

=> SELECT * FROM t;

id |  s  
----+-----
  2 | FOO
  4 | BAR
(2 rows)

=> COMMIT;

It's easy to figure out that in this mode, psql actually establishes an implicit savepoint before each command and initiates a rollback to it in the event of failure. This mode is not used by default since establishing savepoints (even without a rollback to them) entails a significant overhead.

很容易弄清楚，在这种模式下，psql实际上在每个命令之前建立一个隐式保存点，并在失败时向其发起回滚。默认情况下不使用此模式，因为建立保存点(即使不回滚保存点)也需要大量的开销。

Read on. 继续阅读。

翻译自: https://habr.com/en/company/postgrespro/blog/477648/