/** The engine that uses the merge tree (see MergeTreeData) and is replicated through ZooKeeper.
 * The ReplicatedMergeTree engine stores data with MergeTree and coordinates the replicas' data through ZooKeeper.
* ZooKeeper is used for the following things:
 * - the structure of the table (/metadata, /columns);
 * - the action log with data (/log/log-..., /replicas/replica_name/queue/queue-...);
* - a replica list (/replicas), and replica activity tag (/replicas/replica_name/is_active), replica addresses (/replicas/replica_name/host);
* - select the leader replica (/leader_election) - this is the replica that assigns the merge;
* - a set of parts of data on each replica (/replicas/replica_name/parts);
* - list of the last N blocks of data with checksum, for deduplication (/blocks);
* - the list of incremental block numbers (/block_numbers) that we are about to insert,
* to ensure the linear order of data insertion and data merge only on the intervals in this sequence;
 * - coordination of writes with quorum (/quorum);
 * - storage of mutation entries (ALTER DELETE, ALTER UPDATE, etc.) to execute (/mutations).
* See comments in StorageReplicatedMergeTree::mutate() for details.
*/
The role of ZooKeeper: it mainly stores the following:
1. The table structure (under /metadata and /columns);
2. The action log for the data (/log/log-..., /replicas/replica_name/queue/queue-...);
3. The replica list (/replicas), the replica activity flag (/replicas/replica_name/is_active), and the replica addresses (/replicas/replica_name/host);
4. Election of the leader among all replicas (/leader_election); the leader replica is the one that assigns merges;
5. The set of data parts on each replica (/replicas/replica_name/parts);
6. The list of the last N data blocks with checksums, used for deduplication (/blocks);
7. The list of incremental block numbers (/block_numbers) about to be inserted, to ensure a linear order of insertion and that merges happen only on intervals of this sequence;
8. Coordination of quorum writes (/quorum);
9. Mutation entries to execute, such as ALTER DELETE and ALTER UPDATE (/mutations).
See the comments in StorageReplicatedMergeTree::mutate() for details.
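To make this layout concrete, here is a minimal sketch that composes the main znode paths for one table and one replica. It is not taken from the ClickHouse sources; the zookeeper_path and replica_name values are made up for illustration, and only plain std::string is used.

#include <iostream>
#include <string>
#include <vector>

int main()
{
    // Hypothetical values; the real path comes from the table's ENGINE arguments.
    std::string zookeeper_path = "/clickhouse/tables/01/visits";
    std::string replica_name   = "replica_1";

    std::string replica_path = zookeeper_path + "/replicas/" + replica_name;

    // Znodes shared by all replicas of the table.
    std::vector<std::string> shared = {
        zookeeper_path + "/metadata",        // table structure
        zookeeper_path + "/columns",         // column list
        zookeeper_path + "/log",             // common action log (log-...)
        zookeeper_path + "/leader_election", // leader replica election
        zookeeper_path + "/blocks",          // checksums of the last N blocks, for deduplication
        zookeeper_path + "/block_numbers",   // incremental block numbers
        zookeeper_path + "/quorum",          // coordination of quorum writes
        zookeeper_path + "/mutations",       // ALTER DELETE / ALTER UPDATE entries
    };

    // Znodes private to one replica.
    std::vector<std::string> per_replica = {
        replica_path + "/is_active",         // replica activity flag
        replica_path + "/host",              // replica address
        replica_path + "/queue",             // this replica's copy of the log (queue-...)
        replica_path + "/parts",             // set of data parts on this replica
    };

    for (const auto & path : shared)
        std::cout << path << '\n';
    for (const auto & path : per_replica)
        std::cout << path << '\n';
}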
/** The replicated tables have a common log (/log/log-...).
* Log - a sequence of entries (LogEntry) about what to do.
* Each entry is one of:
* - normal data insertion (GET),
* - merge (MERGE),
* - delete the partition (DROP).
*
Every replicated table has a common log directory (/log) that stores the sequence of operations to perform: GET, MERGE and DROP (a simplified sketch of such an entry follows this comment block).
* Each replica copies (queueUpdatingTask, pullLogsToQueue) entries from the log to its queue (/replicas/replica_name/queue/queue-...)
* and then executes them (queueTask).
* Despite the name of the "queue", execution can be reordered, if necessary (shouldExecuteLogEntry, executeLogEntry).
* In addition, the records in the queue can be generated independently (not from the log), in the following cases:
* - when creating a new replica, actions are put on GET from other replicas (createReplica);
* - if the part is corrupt (removePartAndEnqueueFetch) or absent during the check (at start - checkParts, while running - searchForMissingPart),
* actions are put on GET from other replicas;
*
 * The replica on which the INSERT was performed will also have a GET entry for this data in its queue.
 * Such an entry is considered executed as soon as the queue handler sees it.
*
 * The log entry has a creation time. This time is generated by the clock of the server that created the entry
 * - the one on which the corresponding INSERT or ALTER query arrived.
*
 * For the entries that a replica put into its queue by itself,
 * the creation time of the corresponding part on any of the replicas is used as the time.
*/
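A rough idea of what one such log entry carries is given by the following simplified stand-in. The type and field names are hypothetical; they only mirror the entry kinds and fields described in the comment above (the real entry type in the ClickHouse sources is richer).

#include <ctime>
#include <string>
#include <vector>

// Simplified, hypothetical stand-in for one replication log entry.
struct LogEntrySketch
{
    enum class Type
    {
        GET,    // fetch a freshly inserted part from another replica
        MERGE,  // merge a set of parts into a new part
        DROP,   // drop a partition (a range of parts)
    };

    Type type = Type::GET;
    std::time_t create_time = 0;             // set by the server that received the INSERT/ALTER
    std::string new_part_name;               // part produced by a GET or MERGE
    std::vector<std::string> source_parts;   // parts consumed by a MERGE
};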
Each replica copies the log entries into its own queue (via queueUpdatingTask and pullLogsToQueue) and then executes them (queueTask).
Note that these operations are not necessarily executed in order; if necessary, execution can be reordered (shouldExecuteLogEntry, executeLogEntry).
In the following two cases, queue entries can also be generated independently of the log:
1. When a new replica is created, GET actions to fetch data from other replicas are queued (createReplica);
2. When a part is corrupt (removePartAndEnqueueFetch) or found missing during a check (at startup - checkParts, at runtime - searchForMissingPart), GET actions from other replicas are queued.
The replica on which the data was inserted also gets a GET entry for that data; the queue handler considers it executed as soon as it sees it.
Each log entry has a creation time, generated by the clock of the server that created the entry.
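As a conceptual illustration of the pull/execute cycle described above, here is a sketch with hypothetical types; it is not the actual StorageReplicatedMergeTree code, and the ZooKeeper interaction is only indicated in comments.

#include <deque>
#include <string>

// Hypothetical, simplified queue entry.
struct Entry
{
    std::string znode_name;   // name of the queue-... znode
    // type (GET / MERGE / DROP), part names, create_time, ...
};

struct ReplicaQueueSketch
{
    std::deque<Entry> queue;              // in-memory mirror of /replicas/<name>/queue
    std::string last_pulled_log_entry;    // position in /log, persisted in ZooKeeper

    // queueUpdatingTask / pullLogsToQueue: copy entries newer than
    // last_pulled_log_entry from the common /log into this replica's queue.
    void pullLogsToQueue()
    {
        // for each /log/log-N newer than last_pulled_log_entry:
        //     create /replicas/<name>/queue/queue-M with the same payload,
        //     queue.push_back(entry), advance last_pulled_log_entry.
    }

    // queueTask: pick an entry that may run now and execute it.
    // Execution is not strictly FIFO: an entry can be postponed
    // (e.g. a merge whose source parts are not on this replica yet).
    void queueTask()
    {
        for (auto it = queue.begin(); it != queue.end(); ++it)
        {
            if (shouldExecute(*it))
            {
                execute(*it);      // GET: fetch part; MERGE: merge parts; DROP: remove range
                queue.erase(it);   // in the real engine, the queue-... znode is removed too
                return;
            }
        }
    }

    bool shouldExecute(const Entry &) const { return true; }  // placeholder predicate
    void execute(const Entry &) {}                            // placeholder action
};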