Building a stack trace similarity analysis system on ClickHouse to identify CI crash logs



Author: Misha Shiryaev

This article was first published on the WeChat public account ClickHouseInc.


The problem

Our CI system is good at catching bugs, and it is especially good at capturing crash logs during AST fuzzing, stress tests, and functional tests. But these crash reports are often highly similar, which makes it hard to tell which ones are brand-new problems and which are duplicates of known bugs.

As the paper "TraceSim: A Method for Calculating Stack Trace Similarity" points out:

Many modern software systems come with an automatic crash-reporting subsystem. However, the same bug often produces slightly different reports.

The paper's method rests on the assumption that structurally similar stack traces tend to come from the same bug. The authors propose a way to compute the similarity between stack traces, so that similar crash logs can be clustered into groups.

Our earliest stack-clustering system was built by infrastructure engineer Michael Stetsyuk to analyze crash logs from ClickHouse Cloud. It clustered crash reports by stack trace and managed them centrally in the issue tracker.

Building on that experience, we wanted a stack-similarity system on top of ClickHouse for analyzing CI crash logs, with the following core capabilities:

  • parse crash logs and extract stack frames;

  • compute the similarity between stack traces;

  • automatically cluster new logs with known traces;

  • create or update GitHub issues for bug tracking.

Below, I will walk through the implementation step by step, so you can build a similar solution for your own needs.

Implementation

The overall architecture of the system is shown below:

[Figure: overall system architecture]

Preparation: collecting crash logs and setting up the tables

First, we need to collect crash reports from the CI system. Fortunately, this pipeline has been running reliably for years, so we have a solid base of data.

The system uses materialized views to write new logs into target tables on a remote ClickHouse cluster. The target table name is generated dynamically from the current table structure and the extra columns, which lets us keep complete historical data even when schemas do not match exactly, while sharing a single logical table across CI jobs. The management scripts live in the project's ci directory. A hedged sketch of the idea follows.
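A minimal sketch, not the production script: it assumes a cluster named ci_logs defined in remote_servers, reduces the column set for brevity, and uses 'crash_log_example' as a stand-in for the schema-hash suffix the real scripts compute:

-- A Distributed table pointing at the shared target table on the CI Logs cluster.
CREATE TABLE default.crash_log_export
(
    `check_name` String,
    `event_time` DateTime,
    `timestamp_ns` UInt64,
    `signal` Int32,
    `trace_full` Array(String),
    `version` String
)
ENGINE = Distributed('ci_logs', 'default', 'crash_log_example');

-- Forward each new system.crash_log row, decorated with CI context
-- (in reality the constants come from the CI job's environment).
CREATE MATERIALIZED VIEW default.crash_log_forwarder
TO default.crash_log_export
AS SELECT
    'Stateless tests (example)' AS check_name,
    event_time,
    timestamp_ns,
    signal,
    trace_full,
    version
FROM system.crash_log;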

The data comes mainly from the system.crash_log table. Here is a subset of the tables in the current CI Logs cluster:

SHOW TABLES LIKE 'crash_log_%';

    ┌─name───────────────────────────┐
 1. │ crash_log_12587237819650651296 │
 2. │ crash_log_12670418084883306529 │
 3. │ crash_log_1527066305010279420  │
 4. │ crash_log_15355897847728332522 │
 5. │ crash_log_15557244372725679386 │
 6. │ crash_log_18405985218782237968 │
 7. │ crash_log_2288102012038531617  │
 8. │ crash_log_3310266143589491008  │
 9. │ crash_log_6802555697904881735  │
10. │ crash_log_9016585404038675675  │
11. │ crash_log_9097266775814416937  │
12. │ crash_log_9243005856023138905  │
13. │ crash_log_9768092148702997133  │
14. │ crash_logs                     │
    └────────────────────────────────┘
SHOW CREATE TABLE crash_logs;

CREATE TABLE default.crash_logs
(
    `repo` String,
    `pull_request_number` UInt32,
    `commit_sha` String,
    `check_start_time` DateTime,
    `check_name` String,
    `instance_type` String,
    `instance_id` String,
    `hostname` LowCardinality(String),
    `event_date` Date,
    `event_time` DateTime,
    `timestamp_ns` UInt64,
    `signal` Int32,
    `thread_id` UInt64,
    `query_id` String,
    `trace` Array(UInt64),
    `trace_full` Array(String),
    `version` String,
    `revision` UInt32,
    `build_id` String
)
ENGINE = Merge('default', '^crash_log_')
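Because crash_logs is a Merge-engine table, its virtual _table column reveals which per-schema table each row comes from, which makes for a quick sanity check:

SELECT _table, count()
FROM default.crash_logs
GROUP BY _table
ORDER BY count() DESC;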

An example of the raw crash data stored in this table:

SELECT *
FROM crash_logs
WHERE event_date = today()
LIMIT 1

Row 1:
──────
repo:                ClickHouse/ClickHouse
pull_request_number: 85843
commit_sha:          41215391373ad2e277230939e887c834edeb16ce
check_start_time:    2025-08-19 10:07:50
check_name:          Stateless tests (arm_binary, parallel)
instance_type:       c8g.8xlarge
instance_id:         i-0b53d7dc362ab4cd8
hostname:            16ffef362c03
event_date:          2025-08-19
event_time:          2025-08-19 10:09:04
timestamp_ns:        1755598144624168055
signal:              6
thread_id:           6359
query_id:            5ce98be8-f4cb-4950-82a8-cc2720669cad
trace:               [280796479681009,280796479399548,280796479320368,188029769003464,188029769005984,188029769006680,188029679339096,188029679337384,188029679380404,188029836833040,188029837238400,188029837230992,188029837232044,188029915248680,188029915247868,188029915340828,188029915297288,188029915294080,188029915293160,…]
trace_full:          ['3. ? @ 0x000000000007f1f1','4. ? @ 0x000000000003a67c','5. ? @ 0x0000000000027130','6. ./ci/tmp/build/./src/Common/Exception.cpp:51: DB::abortOnFailedAssertion(String const&, void* const*, unsigned long, unsigned long) @ 0x000000000e33b1c8','7. ./ci/tmp/build/./src/Common/Exception.cpp:84: DB::handle_error_code(String const&, std::basic_string_view>, int, bool, std::vector> const&) @ 0x000000000e33bba0','8. ./ci/tmp/build/./src/Common/Exception.cpp:135: DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000e33be58','9. DB::Exception::Exception(String&&, int, String, bool) @ 0x0000000008db8658','10. DB::Exception::Exception(PreformattedMessage&&, int) @ 0x0000000008db7fa8','11. DB::Exception::Exception<>(int, FormatStringHelperImpl<>) @ 0x0000000008dc27b4','12. ./ci/tmp/build/./src/Storages/ObjectStorage/StorageObjectStorageConfiguration.cpp:212: DB::StorageObjectStorageConfiguration::addDeleteTransformers(std::shared_ptr, DB::QueryPipelineBuilder&, std::optional const&, std::shared_ptr) const @ 0x00000000123eb110','13. ./ci/tmp/build/./src/Storages/ObjectStorage/StorageObjectStorageSource.cpp:555: DB::StorageObjectStorageSource::createReader(unsigned long, std::shared_ptr const&, std::shared_ptr const&, std::shared_ptr const&, DB::ReadFromFormatInfo&, std::optional const&, std::shared_ptr const&, DB::SchemaCache*, std::shared_ptr const&, unsigned long, std::shared_ptr, std::shared_ptr, bool) @ 0x000000001244e080','14.0. inlined from ./ci/tmp/build/./src/Storages/ObjectStorage/StorageObjectStorageSource.cpp:412: DB::StorageObjectStorageSource::createReader()','14. ./ci/tmp/build/./src/Storages/ObjectStorage/StorageObjectStorageSource.cpp:265: DB::StorageObjectStorageSource::lazyInitialize() @ 0x000000001244c390','15. ./ci/tmp/build/./src/Storages/ObjectStorage/StorageObjectStorageSource.cpp:274: DB::StorageObjectStorageSource::generate() @ 0x000000001244c7ac','16. ./ci/tmp/build/./src/Processors/ISource.cpp:144: DB::ISource::tryGenerate() @ 0x0000000016eb3828','17. ./ci/tmp/build/./src/Processors/ISource.cpp:110: DB::ISource::work() @ 0x0000000016eb34fc','18.0. inlined from ./ci/tmp/build/./src/Processors/Executors/ExecutionThreadContext.cpp:53: DB::executeJob(DB::ExecutingGraph::Node*, DB::ReadProgressCallback*)','18. ./ci/tmp/build/./src/Processors/Executors/ExecutionThreadContext.cpp:102: DB::ExecutionThreadContext::executeTask() @ 0x0000000016eca01c','19. ./ci/tmp/build/./src/Processors/Executors/PipelineExecutor.cpp:350: DB::PipelineExecutor::executeStepImpl(unsigned long, DB::IAcquiredSlot*, std::atomic*) @ 0x0000000016ebf608',…]
version:             ClickHouse 25.8.1.1
revision:            54501
build_id:            88351121A8340C83AB8A60BA97765ADC4B9B7786

At this point, we have raw stack traces ready for analysis.

Separating collection from analysis: a dedicated analysis cluster and schema

To avoid affecting the performance of the CI system itself, the log analysis runs on a separate ClickHouse cluster. The tables and the materialized view below periodically sync data from the CI Logs cluster to the local analysis cluster.

-- Create a table to store the raw stack traces with details about where and when they appeared
CREATE TABLE default.stack_traces
(
    `repo` LowCardinality(String),
    `pull_request_number` UInt32,
    `commit_sha` String,
    `check_name` String,
    `check_start_time` DateTime('UTC'),
    `event_time` DateTime,
    `timestamp_ns` UInt64,
    `trace_full` Array(String),
    `Version` UInt32 DEFAULT now()
)
ENGINE = ReplacingMergeTree(Version)
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_time, timestamp_ns, repo, check_start_time)
SETTINGS index_granularity = 8192;

-- Create a refreshable materialized view that collects data from the crash_logs table for the last 100 days
-- https://clickhouse.com/docs/materialized-view/refreshable-materialized-view
CREATE MATERIALIZED VIEW default._stack_traces_mv
REFRESH EVERY 30 MINUTE TO default.stack_traces
AS SELECT
    repo,
    pull_request_number,
    commit_sha,
    check_name,
    check_start_time,
    event_time,
    timestamp_ns,
    trace_full,
    now() AS Version
FROM remoteSecure('[HOST]', 'default', 'crash_logs', 'username', 'password')
WHERE event_date >= today() - INTERVAL 100 DAY;

-- Create a table to store the info about the GitHub issues created for the stack traces
CREATE TABLE default.crash_issues
(
    `created_at` DateTime,
    `updated_at` DateTime,
    `closed_at` DateTime,
    `repo` String,
    `number` UInt32,
    `state` Enum8('open' = 1, 'closed' = 2),
    `stack_traces_full` Array(Array(String)),
    `stack_traces_hash` Array(UInt64)
)
ENGINE = ReplacingMergeTree(updated_at)
PARTITION BY toYYYYMM(created_at)
ORDER BY (repo, number)
SETTINGS index_granularity = 8192;

From this step on, all analysis happens on the analysis cluster. The data is refreshed every 30 minutes, so we are always working with the latest crash events.
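To confirm the refresh actually runs every 30 minutes, the system.view_refreshes table can be inspected (assuming a ClickHouse version where refreshable materialized views are available):

SELECT *
FROM system.view_refreshes
WHERE view = '_stack_traces_mv'
FORMAT Vertical;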

Step one: cleaning the stack traces

Raw stack traces usually contain a lot of irrelevant information. This noise gets in the way of accurate similarity computation, and its common sources include:

  • compiler version differences;

  • build type differences (Debug vs. Release);

  • operating system or hardware platform differences;

  • file paths and source line numbers inside the trace.

To deal with this, we defined two user-defined functions (UDFs):

CREATE FUNCTION cleanStackFrame AS frame -> replaceRegexpOne(
  replaceRegexpAll(
    splitByString(
      -- second, strip the file name and line number, keep the function name
      ': ', splitByString(
        -- first, strip the frame address, keep the frame number, file name, line number, and function name
        ' @ ', frame, 2
      )[1], 2
    )[2],
    -- third, replace all ABI specific information with a placeholder
    '\\[abi:[^]]+\\]', '[$ABI]'),
  -- finally, delete all LLVM specific information
  '(\\s+[(])*[.]llvm[.].+', ''
);

CREATE FUNCTION cleanStackTrace AS trace -> arrayFilter(
  frame -> (
    -- second, keep only meaningful frames
    frame NOT IN ('', '?')), arrayMap(
      -- first, clean each frame of the stack trace
      fr -> cleanStackFrame(fr), trace
  )
);
  • cleanStackFrame cleans a single stack frame, stripping irrelevant context;

  • cleanStackTrace normalizes an entire trace by applying the former to every frame.

These functions are applied on demand, right before the similarity computation, rather than persisting cleaned results to the database. That choice keeps later iteration flexible: we can adjust the cleaning logic and the weighting rules without reprocessing all the historical data.
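For example, applying cleanStackFrame to one of the raw frames shown earlier keeps only the function signature, while frames like '3. ? @ 0x000000000007f1f1' clean down to an empty string that cleanStackTrace then filters out:

SELECT cleanStackFrame('6. ./ci/tmp/build/./src/Common/Exception.cpp:51: DB::abortOnFailedAssertion(String const&, void* const*, unsigned long, unsigned long) @ 0x000000000e33b1c8') AS cleaned;

-- cleaned: DB::abortOnFailedAssertion(String const&, void* const*, unsigned long, unsigned long)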

Step two: grouping traces and determining importance

With the traces cleaned, analysis can begin. The system stores processed traces in the crash_issues table, which is used to match traces against existing issues.

WITH known_hashes AS
    (
        SELECT
            repo,
            -- Avoid default value of 'open' for state
            toString(state) AS state,
            updated_at,
            closed_at,
            arrayFlatten(groupArrayDistinct(stack_traces_hash)) AS known_hashes
        FROM default.crash_issues FINAL
        GROUP BY ALL
    )
SELECT
    repo,
    groupArrayDistinct((pull_request_number, commit_sha, check_name)) AS checks,
    cleanStackTrace(trace_full) AS trace_full,
    sipHash64(trace_full) AS trace_hash,
    length(groupArrayDistinct(pull_request_number) AS PRs) AS trace_count,
    has(PRs, 0) AS is_in_master
FROM default.stack_traces AS st
LEFT JOIN known_hashes AS kh ON (st.repo = kh.repo AND has(kh.known_hashes, trace_hash))
WHERE (st.event_time >= (now() - toIntervalDay(30)))
    AND (length(trace_full) > 5)
GROUP BY
    repo,
    trace_full
HAVING
    groupArrayDistinct(kh.state) = [''] -- new trace, no issues
    OR (groupArrayDistinct(kh.state) = ['closed'] AND max(st.check_start_time) > max(kh.closed_at)) -- new event after the issue is closed
ORDER BY max(st.event_time)

The query groups rows by the cleaned trace and computes a hash per trace, then compares the hashes against the crash_issues table to decide whether each trace is already known. A trace is treated as known if its issue is still open, or if it occurred before the matching issue was closed.

The query also counts the distinct pull requests a trace appeared in, and checks whether it was seen on master (pull request number 0). A trace seen in fewer than three pull requests and never on master/release branches is considered unimportant: it does not trigger a new issue, but if it is similar enough to a known trace it is attached to the existing issue instead. The rule can be sketched directly in SQL, as shown below.
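A hedged sketch of that rule as a standalone query over stack_traces (the three-PR cutoff comes from the description above; the production predicate lives in the surrounding tooling):

-- Mark each cleaned trace as important when it was seen in at least three
-- distinct PRs, or appeared on master (pull_request_number = 0).
SELECT
    cleanStackTrace(trace_full) AS trace,
    length(groupArrayDistinct(pull_request_number) AS PRs) AS pr_count,
    has(PRs, 0) AS is_in_master,
    (pr_count >= 3) OR (is_in_master = 1) AS is_important
FROM default.stack_traces
WHERE event_time >= now() - toIntervalDay(30)
GROUP BY trace;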

Step three: computing stack trace similarity

Each new trace from step two must be compared for similarity against the known traces that already have GitHub issues.

The following query implements the similarity computation from the TraceSim paper. ClickHouse has shipped the arraySimilarity function since version 25.4, and the weights can be computed at query time; new_trace below is one of the new traces from step two.
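In the paper's terms, the weight the query assigns to the frame at 1-based position pos of a trace is

    w(frame, pos) = pos^(-alpha) * sigmoid(beta * (IDF(frame) - gamma)),  where IDF(frame) = ln(N / n_frame)

with N the total number of stored traces and n_frame the number of distinct traces containing that frame. This is a restatement of the stack_frame_weights and get_trace_weights expressions below, not an additional step.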

WITH
    1.97 AS alpha,
    2.0 AS beta,
    3.7 AS gamma,
    0.68 AS threshold,
    stack_frame_weights AS (
        WITH
            (
                SELECT count()
                FROM default.stack_traces
                FINAL
            ) AS total
        SELECT
            arrayJoin(cleanStackTrace(trace_full)) AS frame,
            countDistinct(trace_full) AS count,
            log(total / count) AS IDF,
            sigmoid(beta * (IDF - gamma)) AS weight
        FROM default.stack_traces
        FINAL
        GROUP BY frame
    ),
    (SELECT groupArray(weight) AS w, groupArray(frame) AS f FROM stack_frame_weights) AS weights,
    (trace -> arrayMap((_frame, pos) -> (pow(pos, -alpha) * arrayFirst((w, f) -> (f = _frame), weights.w, weights.f)), trace, arrayEnumerate(trace))) AS get_trace_weights,
    -- one of the new traces from step two
    ['DB::abortOnFailedAssertion(String const&, void* const*, unsigned long, unsigned long)','DB::handle_error_code(String const&, std::basic_string_view>, int, bool, std::vector> const&)','DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool)','DB::Exception::Exception(int, FormatStringHelperImpl::type, std::type_identity::type, std::type_identity::type>, String const&, String&&, String const&)','DB::paranoidCheckForCoveredPartsInZooKeeper(std::shared_ptr const&, String const&, StrongTypedef, String const&, DB::StorageReplicatedMergeTree const&)','DB::StorageReplicatedMergeTree::executeDropRange(DB::ReplicatedMergeTreeLogEntry const&)','DB::StorageReplicatedMergeTree::executeLogEntry(DB::ReplicatedMergeTreeLogEntry&)','operator()','decltype(std::declval)::$_1&>()(std::declval&>())) std::__invoke[$ABI])::$_1&, std::shared_ptr&>(DB::StorageReplicatedMergeTree::processQueueEntry(std::shared_ptr)::$_1&, std::shared_ptr&)','bool std::__invoke_void_return_wrapper::__call[$ABI])::$_1&, std::shared_ptr&>(DB::StorageReplicatedMergeTree::processQueueEntry(std::shared_ptr)::$_1&, std::shared_ptr&)','DB::ReplicatedMergeTreeQueue::processEntry(std::function ()>, std::shared_ptr&, std::function&)>)','DB::StorageReplicatedMergeTree::processQueueEntry(std::shared_ptr)','DB::ExecutableLambdaAdapter::executeStep()','DB::MergeTreeBackgroundExecutor::routine(std::shared_ptr)','DB::MergeTreeBackgroundExecutor::threadFunction()'] AS new_trace,
    get_trace_weights(new_trace) AS new_trace_weights
SELECT
    arraySimilarity(
        new_trace,
        arrayJoin(stack_traces_full) AS trace_full,
        new_trace_weights,
        get_trace_weights(trace_full)
    ) AS similarity,
    repo,
    number,
    created_at,
    closed_at,
    stack_traces_full,
    stack_traces_hash
FROM default.crash_issues FINAL
WHERE repo = 'ClickHouse/ClickHouse'
    AND state = 'open'
    AND threshold <= similarity

The system collects and processes important traces every hour so they are analyzed promptly. If a new important trace matches nothing similar, it automatically opens a new issue in the GitHub repository and writes the trace to the crash_issues table, roughly as sketched below.
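A hedged sketch of that bookkeeping write; the issue number 99999 and the single-frame trace are hypothetical, and the real insert runs only after the GitHub API call succeeds:

INSERT INTO default.crash_issues
    (created_at, updated_at, closed_at, repo, number, state,
     stack_traces_full, stack_traces_hash)
SELECT
    now(), now(), toDateTime(0),  -- closed_at stays at the epoch default
    'ClickHouse/ClickHouse', 99999, 'open',
    [trace], [sipHash64(trace)]
FROM
(
    -- a hypothetical cleaned trace produced by step two
    SELECT ['DB::abortOnFailedAssertion(String const&, void* const*, unsigned long, unsigned long)'] AS trace
);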

Once a day, the system also processes all traces, including unimportant ones, to check whether they resemble known important traces. When a trace's similarity to an important trace exceeds the threshold, it is automatically attached to the corresponding issue.
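Because crash_issues is a ReplacingMergeTree ordered by (repo, number) with updated_at as the version column, attaching a trace to an issue amounts to inserting a newer row whose arrays have been extended. A hedged sketch, with a hypothetical issue number and placeholder trace:

INSERT INTO default.crash_issues
WITH ['DB::ISource::work()'] AS matched_trace  -- placeholder cleaned trace
SELECT
    created_at,
    now() AS updated_at,  -- the newer version wins at merge time
    closed_at,
    repo,
    number,
    state,
    arrayPushBack(stack_traces_full, matched_trace),
    arrayPushBack(stack_traces_hash, sipHash64(matched_trace))
FROM default.crash_issues FINAL
WHERE (repo = 'ClickHouse/ClickHouse') AND (number = 99999);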

We also use a large language model (LLM) to help generate issue titles and to suggest a likely cause of the crash based on the trace.

The results

All automatically created issues are gathered under the crash-ci label in the GitHub repository. If you spot something you can fix, contributions are welcome.

