ClickHouse Source Code Reading (0000 0110): Replica Selection When Using the ReplicatedMergeTree Engine

Article link: https://blog.youkuaiyun.com/B_e_a_u_tiful1205/article/details/103700654

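This article examines how ClickHouse's ReplicatedMergeTree and Distributed engines choose a data source among multiple replicas. It traces the full path from a SQL query to the actual data read, focusing on replica delay, the prefer-local policy, and the fallback to remote replicas.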

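When the ReplicatedMergeTree and Distributed engines are used together, the same table has several replicas on different servers. How does a query choose among these replicas? Let's try to work it out from the source code.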

For a SELECT statement, execution starts from the following call:

    pipeline.streams = storage->read(required_columns, query_info, context, processing_stage, max_block_size, max_streams);

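Inside this method (StorageDistributed::read), the main steps are rewriting the AST (changing the table name), materializing the header, and building select_stream_factory. Finally it calls ClusterProxy::executeQuery(). Simplified code: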

    BlockInputStreams StorageDistributed::read(
            const Names & /*column_names*/,
            const SelectQueryInfo &query_info,
            const Context &context,
            QueryProcessingStage::Enum processed_stage,
            const size_t /*max_block_size*/,
            const unsigned /*num_streams*/) {
        auto cluster = getCluster();

        const Settings &settings = context.getSettingsRef();

        // Rewrite the AST: replace the table name
        const auto &modified_query_ast = rewriteSelectQuery(......);

        // header: presumably the column-name row, i.e. the structure of the query result
        Block header = materializeBlock(......);

        ClusterProxy::SelectStreamFactory select_stream_factory = ......;

        ......

        // The key call
        return ClusterProxy::executeQuery(
                select_stream_factory, cluster, modified_query_ast, context, settings);
    }

ClusterProxy::executeQuery() sets up the network bandwidth limit, then iterates over all shards of the cluster and calls createForShard() for each one. Simplified code:

    BlockInputStreams executeQuery(
            IStreamFactory & stream_factory, const ClusterPtr & cluster,
            const ASTPtr & query_ast, const Context & context, const Settings & settings)
    {
        BlockInputStreams res;

        const std::string query = queryToString(query_ast);

        ......

        /// Network bandwidth limit, if needed.
        ThrottlerPtr throttler;
        if (settings.max_network_bandwidth || settings.max_network_bytes){......}

        // Iterate over all shards and create the stream(s) for each one
        for (const auto & shard_info : cluster->getShardsInfo())
            stream_factory.createForShard(shard_info, query, query_ast, new_context, throttler, res);

        return res;
    }

The logic of createForShard() is fairly clear. It first defines emplace_local_stream and emplace_remote_stream, then uses the two conditions prefer_localhost_replica and shard_info.isLocal() to decide whether to use the local stream or the remote stream. Simplified code:

        // Called once for every shard of the cluster
        void SelectStreamFactory::createForShard(
                const Cluster::ShardInfo &shard_info,
                const String &query, const ASTPtr &query_ast,
                const Context &context, const ThrottlerPtr &throttler,
                BlockInputStreams &res) {
            auto emplace_local_stream = [&]() // read this shard from the local replica
            {
                res.emplace_back(createLocalStream(query_ast, context, processed_stage));
            };

            auto emplace_remote_stream = [&]() // read this shard from remote replicas
            {
                // Build a RemoteBlockInputStream
                auto stream = std::make_shared<RemoteBlockInputStream>(shard_info.pool, query, header, context, nullptr,
                                                                       throttler, external_tables, processed_stage);
                stream->setPoolMode(PoolMode::GET_MANY);
                if (!table_func_ptr)
                    stream->setMainTable(main_table);
                res.emplace_back(std::move(stream));
            };

            const auto &settings = context.getSettingsRef();


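            // prefer_localhost_replica = 1 and the shard has a replica on the local server (shard_info.isLocal() = true): try to use the local replica.
            //                              If the local server holds no replica of this shard, the only option is to fetch the data remotely (emplace_remote_stream).
            // prefer_localhost_replica = 0: fetch the shard's data over a remote connection (emplace_remote_stream).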
            if (settings.prefer_localhost_replica && shard_info.isLocal()) {

            ......

            } else
                emplace_remote_stream();
        }

Note: it is worth looking at how shard_info.isLocal() is actually implemented.
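
To make the two shard predicates used above concrete, here is a minimal sketch. It assumes that a shard counts as "local" when at least one of its replica addresses refers to the current server, and that it "has remote connections" when at least one replica lives elsewhere. The types ReplicaAddress and ShardInfoSketch and their fields are hypothetical stand-ins, not the real Cluster::ShardInfo; check the ClickHouse source for the actual definition.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one replica entry of a shard (not a ClickHouse type).
struct ReplicaAddress
{
    std::string host;
    bool is_local;  // true if this address refers to the current server
};

// Hypothetical, simplified model of the two predicates on Cluster::ShardInfo.
struct ShardInfoSketch
{
    std::vector<ReplicaAddress> replicas;

    // Number of replicas of this shard that live on the current server.
    std::size_t localCount() const
    {
        std::size_t n = 0;
        for (const auto & replica : replicas)
            if (replica.is_local)
                ++n;
        return n;
    }

    // Assumption: the shard is "local" if at least one replica is on this server.
    bool isLocal() const { return localCount() > 0; }

    // Assumption: there are remote connections if some replica is on another server.
    bool hasRemoteConnections() const { return localCount() != replicas.size(); }
};
```

Under this model a shard with one local and one remote replica is both local and has remote connections, which is exactly the situation in which the replica-selection logic below has a real choice to make.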

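The two cases are analyzed separately. First, the case where settings.prefer_localhost_replica && shard_info.isLocal() holds. The main flow is:

1-Check whether the table exists on the local server for this shard;

2-Check whether the table uses a replicated table engine;

3-If the table exists locally and is a replicated table, replica delay has to be taken into account.

4-Get the delay of the local replica and compare it with the configured maximum allowed delay (max_replica_delay_for_distributed_queries). If the local delay is less than max_allowed_delay, the local replica is usable; otherwise it is considered stale.

5-If the local replica is stale, check the fallback_to_stale_replicas_for_distributed_queries setting, which controls whether stale replicas may be used.

6-If stale replicas are not allowed, i.e. fallback_to_stale_replicas_for_distributed_queries=0, check whether the shard has remote replicas: if it does, use them; if not, throw an exception;

7-If stale replicas are allowed, i.e. fallback_to_stale_replicas_for_distributed_queries=1, and the shard has no remote replicas, use the stale data of the local replica;

8-If stale replicas are allowed and the shard also has remote replicas, try the remote replicas first; if none can be reached, or they are even more stale than the local replica, fall back to the local replica. The stream is created lazily so that no connection is made in the main thread (I have not yet read part of the later code closely).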
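
The steps above can be condensed into a single decision function. The following is only a sketch under stated assumptions: ShardState, Source and chooseSource are hypothetical names introduced for illustration, the inputs are plain booleans and integers rather than the real Settings and ShardInfo objects, and the exception type is std::runtime_error instead of the ALL_REPLICAS_ARE_STALE error used by ClickHouse.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical result of the decision (not a ClickHouse type).
enum class Source
{
    Local,                // read from the local replica
    Remote,               // read from remote replicas
    LazyRemoteThenLocal   // try remote replicas lazily, fall back to the stale local one
};

// Hypothetical, flattened view of the inputs that createForShard() looks at.
struct ShardState
{
    bool prefer_localhost_replica;
    bool shard_is_local;            // shard_info.isLocal()
    bool table_exists_locally;
    bool table_is_replicated;       // the table is a StorageReplicatedMergeTree
    bool has_remote_replicas;       // shard_info.hasRemoteConnections()
    std::uint64_t max_allowed_delay;  // max_replica_delay_for_distributed_queries, 0 = ignore delay
    std::uint64_t local_delay;        // absolute delay of the local replica, in seconds
    bool fallback_to_stale;           // fallback_to_stale_replicas_for_distributed_queries
};

Source chooseSource(const ShardState & s)
{
    if (!(s.prefer_localhost_replica && s.shard_is_local))
        return Source::Remote;

    // Step 1: table missing locally. Without remote replicas the local read
    // is still attempted so that the query fails in the usual way.
    if (!s.table_exists_locally)
        return s.has_remote_replicas ? Source::Remote : Source::Local;

    // Step 2: a non-replicated table has no delay to worry about.
    if (!s.table_is_replicated)
        return Source::Local;

    // Steps 3-4: delay is ignored when the limit is 0; otherwise compare.
    if (s.max_allowed_delay == 0 || s.local_delay < s.max_allowed_delay)
        return Source::Local;

    // Steps 5-6: the local replica is stale and fallback is forbidden.
    if (!s.fallback_to_stale)
    {
        if (s.has_remote_replicas)
            return Source::Remote;
        throw std::runtime_error("local replica is stale, but no other replica configured");
    }

    // Step 7: fallback allowed, nothing else to try.
    if (!s.has_remote_replicas)
        return Source::Local;

    // Step 8: fallback allowed and remote replicas exist.
    return Source::LazyRemoteThenLocal;
}
```

For example, with a limit of 300 s, a local delay of 400 s, fallback enabled and remote replicas present, the function returns Source::LazyRemoteThenLocal; switching fallback off turns that into Source::Remote.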

That is the basic decision logic. The annotated code follows:

                // Reaching here means this shard has a replica on the local server
                StoragePtr main_table_storage; // the table to query, looked up on the local server by database and table name

                if (table_func_ptr) {
                    const auto *table_function = table_func_ptr->as<ASTFunction>();
                    main_table_storage = TableFunctionFactory::instance().get(table_function->name, context)->execute(
                            table_func_ptr, context);
                } else
                    main_table_storage = context.tryGetTable(main_table.database, main_table.table);

                // The table does not exist on the local replica of this shard
                if (!main_table_storage) /// Table is absent on a local server.
                {
                    ProfileEvents::increment(ProfileEvents::DistributedConnectionMissingTable);
                    // shard_info.hasRemoteConnections(): whether this shard also has replicas on other servers; if so, use emplace_remote_stream
                    if (shard_info.hasRemoteConnections()) {
                        LOG_WARNING(
                                &Logger::get("ClusterProxy::SelectStreamFactory"),
                                "There is no table " << main_table.database << "." << main_table.table
                                                     << " on local replica of shard " << shard_info.shard_num
                                                     << ", will try remote replicas.");
                        emplace_remote_stream();
                    } else
                        emplace_local_stream();  /// Let it fail the usual way.

                    return;
                }

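                // Reaching here means the table exists on the local server for this shard

                // Use a dynamic_cast to check whether the table uses the replicated engine (StorageReplicatedMergeTree)
                const auto *replicated_storage = dynamic_cast<const StorageReplicatedMergeTree *>(main_table_storage.get());

                // The table is not replicated, so simply read it from the local server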
                if (!replicated_storage) {
                    /// Table is not replicated, use local server.
                    emplace_local_stream();
                    return;
                }

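                // Reaching here means that, for the current shard, the table exists on the local server and is a replicated table.
                // So which replica should be chosen?
                // This is the point where we have to decide between the local replica and the remote ones.

                // If max_replica_delay_for_distributed_queries is set, distributed queries on replicated tables choose servers whose replication delay in seconds is strictly less than that value.
                // Zero means delay is not considered.
                UInt64 max_allowed_delay = settings.max_replica_delay_for_distributed_queries;

                // max_allowed_delay is 0: ignore delay and use the local replica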
                if (!max_allowed_delay) {
                    emplace_local_stream();
                    return;
                }

                // max_allowed_delay is set, so first get the absolute delay (this should be the delay of the local replica)
                UInt32 local_delay = replicated_storage->getAbsoluteDelay();
                // If the local delay is less than max_allowed_delay, the local replica is usable
                if (local_delay < max_allowed_delay) {
                    emplace_local_stream();
                    return;
                }

                /// If we reached this point, local replica is stale.
                // Reaching here means the local replica is stale (replication delay >= max_allowed_delay, which is 300 s by default)
                ProfileEvents::increment(ProfileEvents::DistributedConnectionStaleReplica);
                LOG_WARNING(
                        &Logger::get("ClusterProxy::SelectStreamFactory"),
                        "Local replica of shard " << shard_info.shard_num << " is stale (delay: " << local_delay
                                                  << "s.)");

                // At this point max_replica_delay_for_distributed_queries is set and the local replica is stale.

                // If fallback_to_stale_replicas_for_distributed_queries = 0, stale replicas must not be used:
                // check whether the shard has remote replicas; if it does, use them, otherwise throw
                if (!settings.fallback_to_stale_replicas_for_distributed_queries) {
                    if (shard_info.hasRemoteConnections()) {
                        /// If we cannot fallback, then we cannot use local replica. Try our luck with remote replicas.
                        emplace_remote_stream();
                        return;
                    } else
                        throw Exception(
                                "Local replica of shard " + toString(shard_info.shard_num)
                                + " is stale (delay: " + toString(local_delay) + "s.), but no other replica configured",
                                ErrorCodes::ALL_REPLICAS_ARE_STALE);
                }

                // If fallback_to_stale_replicas_for_distributed_queries = 1, stale replicas may be used.
                // Check whether the shard has remote replicas; if it has none, the only option is the stale local replica
                if (!shard_info.hasRemoteConnections()) {
                    /// There are no remote replicas but we are allowed to fall back to stale local replica.
                    emplace_local_stream();
                    return;
                }

                // Reaching here means stale replicas are allowed and the shard also has remote replicas
                /// Try our luck with remote replicas, but if they are stale too, then fallback to local replica.
                /// Do it lazily to avoid connecting in the main thread.

                // Create the stream lazily (compare Spark, where an action triggers the computation): first find out which replicas are usable, then create the stream
                auto lazily_create_stream = [
                        pool = shard_info.pool, shard_num = shard_info.shard_num, query, header = header, query_ast, context, throttler,
                        main_table = main_table, table_func_ptr = table_func_ptr, external_tables = external_tables, stage = processed_stage,
                        local_delay]()
                        -> BlockInputStreamPtr {
                    std::vector<ConnectionPoolWithFailover::TryResult> try_results;
                    try {
                        if (table_func_ptr)
                            try_results = pool->getManyForTableFunction(&context.getSettingsRef(), PoolMode::GET_MANY);
                        else
                            try_results = pool->getManyChecked(&context.getSettingsRef(), PoolMode::GET_MANY,
                                                               main_table);
                    }
                    catch (const Exception &ex) {
                        if (ex.code() == ErrorCodes::ALL_CONNECTION_TRIES_FAILED)
                            LOG_WARNING(
                                    &Logger::get("ClusterProxy::SelectStreamFactory"),
                                    "Connections to remote replicas of local shard " << shard_num
                                                                                     << " failed, will use stale local replica");
                        else
                            throw;
                    }

                    double max_remote_delay = 0.0;
                    for (const auto &try_result : try_results) {
                        if (!try_result.is_up_to_date)
                            max_remote_delay = std::max(try_result.staleness, max_remote_delay);
                    }

                    if (try_results.empty() || local_delay < max_remote_delay)
                        return createLocalStream(query_ast, context, stage);
                    else {
                        std::vector<IConnectionPool::Entry> connections;
                        connections.reserve(try_results.size());
                        for (auto &try_result : try_results)
                            connections.emplace_back(std::move(try_result.entry));

                        return std::make_shared<RemoteBlockInputStream>(
                                std::move(connections), query, header, context, nullptr, throttler, external_tables,
                                stage);
                    }
                };

                res.emplace_back(std::make_shared<LazyBlockInputStream>("LazyShardWithLocalReplica", header,
                                                                        lazily_create_stream));

For the case where remote replicas are used, first look at the definition of emplace_remote_stream:

            auto emplace_remote_stream = [&]() // read this shard from remote replicas
            {
                // Build a RemoteBlockInputStream
                auto stream = std::make_shared<RemoteBlockInputStream>(shard_info.pool, query, header, context, nullptr,
                                                                       throttler, external_tables, processed_stage);
                stream->setPoolMode(PoolMode::GET_MANY);
                if (!table_func_ptr)
                    stream->setMainTable(main_table);
                res.emplace_back(std::move(stream));
            };

The key is the construction of RemoteBlockInputStream. Note that shard_info.pool has type ConnectionPoolWithFailoverPtr (a connection pool with failover). Next, find the RemoteBlockInputStream constructor:

    RemoteBlockInputStream::RemoteBlockInputStream(
            const ConnectionPoolWithFailoverPtr &pool,
            const String &query_, const Block &header_, const Context &context_, const Settings *settings,
            const ThrottlerPtr &throttler, const Tables &external_tables_, QueryProcessingStage::Enum stage_)
            : header(header_), query(query_), context(context_), external_tables(external_tables_), stage(stage_) {
        if (settings)
            context.setSettings(*settings);

        // Create the multiplexed connections
        create_multiplexed_connections = [this, pool, throttler]() {
            const Settings &current_settings = context.getSettingsRef();

            std::vector<IConnectionPool::Entry> connections;
            if (main_table) { // a main table is set (the remote table function is not used)
                auto try_results = pool->getManyChecked(&current_settings, pool_mode, *main_table);
                connections.reserve(try_results.size());
                for (auto &try_result : try_results)
                    connections.emplace_back(std::move(try_result.entry));
            } else // the remote table function is used
                connections = pool->getMany(&current_settings, pool_mode);

            return std::make_unique<MultiplexedConnections>(
                    std::move(connections), current_settings, throttler);
        };
    }

Of these, for the case where no table function is used, pool->getManyChecked() is the key method.

This article is already quite long, so I will stop here and leave the rest for the next one.