ClickHouse vs. Elasticsearch: How Count Aggregations Work

Word count: 7,875; estimated reading time: 20 minutes

Reviewed by: 庄晓东 (魏庄)

Introduction

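In another blog post, we compared the performance of ClickHouse and Elasticsearch for large-scale data analytics and observability use cases, specifically for count(*) aggregations over tables with billions of rows. The results showed that ClickHouse significantly outperforms Elasticsearch on aggregation queries over large data volumes. Specifically:

ClickHouse's count(*) aggregation queries use hardware resources efficiently, with latencies on large datasets at least 5 times lower than Elasticsearch. This means that, to reach the same latency, ClickHouse needs smaller hardware that costs 4 times less.

Because of these advantages, more and more users are migrating from Elasticsearch to ClickHouse. Here is some of their feedback: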

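  • Sharply reduced costs in petabyte-scale observability use cases:

“Migrating from Elasticsearch to ClickHouse reduced the cost of our observability hardware by more than 30%.”

- Didi Tech

  • Lifted technical limitations of data analytics applications:

“This unlocked the potential for new features, drove growth, and made scaling easier.”

- Contentsquare

  • Major improvements in the scalability and query latency of monitoring platforms:

“ClickHouse helped us scale from millions of rows per month to billions.” “After the switch, we saw average read latency improve by 100x.”

- The Guild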

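You may be asking, “Why is ClickHouse faster and more efficient than Elasticsearch?” This blog post gives you an in-depth technical answer.

Count aggregations in ClickHouse and Elasticsearch

A common aggregation in data analytics scenarios is counting and ranking the frequency of the values in a dataset. For example, in this screenshot from the ClickPy application (which analyzes nearly 900 billion rows of Python package download events), every data visualization uses a SQL GROUP BY clause with a count(*) aggregation under the hood: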

[Figure: screenshot of the ClickPy application]

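Similarly, in logging (or, more broadly, observability) use cases, one of the most common applications of aggregation is counting how often specific log messages or events occur (and alerting when the frequency is unusual).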
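To make the idea of "count and rank, then alert on unusual frequencies" concrete, here is a minimal Python sketch. It is only an illustration of the aggregation itself, not of how ClickHouse or Elasticsearch implement it; the log messages, the function names, and the alert threshold are all assumptions made up for this example.

```python
from collections import Counter

# Hypothetical log events; the message texts are invented for illustration.
events = [
    "disk full", "timeout", "timeout", "login failed",
    "timeout", "disk full", "timeout",
]


def count_and_rank(values, limit=3):
    """Count how often each value occurs and return the most frequent first."""
    return Counter(values).most_common(limit)


def over_threshold(values, threshold):
    """Return the values whose count exceeds the (assumed) alert threshold."""
    return sorted(v for v, n in Counter(values).items() if n > threshold)


top = count_and_rank(events)
alerts = over_threshold(events, threshold=3)

print(top)
print(alerts)
```

Conceptually, `count_and_rank` plays the role of a grouped count ordered by frequency, and `over_threshold` stands in for a simple alert rule on top of it.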

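The Elasticsearch counterpart of the ClickHouse SQL query SELECT count(*) FROM ... GROUP BY ... is the terms aggregation, which is one of Elasticsearch's bucket aggregations.

ClickHouse's GROUP BY with count(*) and Elasticsearch's terms aggregation are largely equivalent in functionality, but they differ significantly in implementation, performance, and result quality, as described below.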
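As a rough side-by-side view of the two query forms, the sketch below builds a ClickHouse SQL string and an Elasticsearch search request body for the same count aggregation. The table/index name `logs`, the field `status_code`, the aggregation name `top_values`, and the limit of 10 are hypothetical and chosen only for illustration; nothing is sent to a server.

```python
import json

# Hypothetical names: table/index "logs", column/field "status_code".
FIELD = "status_code"
LIMIT = 10

# ClickHouse: GROUP BY with count(*), ranked by frequency.
clickhouse_sql = (
    f"SELECT {FIELD}, count(*) AS c "
    f"FROM logs GROUP BY {FIELD} ORDER BY c DESC LIMIT {LIMIT}"
)

# Elasticsearch: a terms aggregation inside the search request body.
# The top-level "size": 0 suppresses the matching documents themselves;
# "size" inside "terms" limits how many buckets are returned.
elasticsearch_body = {
    "size": 0,
    "aggs": {
        "top_values": {  # aggregation name is chosen by the caller
            "terms": {"field": FIELD, "size": LIMIT}
        }
    },
}

print(clickhouse_sql)
print(json.dumps(elasticsearch_body, indent=2))
```

The terms aggregation orders buckets by descending document count by default, which is why the sketch needs an explicit ORDER BY only on the SQL side.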

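We compare the performance of count aggregations in an accompanying blog post.

Besides bucket aggregations, Elasticsearch also offers metric aggregations. We will compare ClickHouse and Elasticsearch for metric use cases in a separate blog post.

Count aggregation approaches

Parallelization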

ClickHouse

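ClickHouse was designed from the start to process and aggregate internet-scale data quickly and efficiently. To achieve this, ClickHouse parallelizes SELECT queries, including count(*) and more than 90 other aggregate functions, at three levels: column values, table chunks, and table shards: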

[Figure: the three levels of query parallelization in ClickHouse]

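① SIMD parallelization

ClickHouse uses the CPU's SIMD units (for example, AVX512) to apply the same operation to consecutive values in a column. This blog post explains in detail how that works.

② Multi-core parallelization

On a machine with n CPU cores, ClickHouse runs the query over n parallel execution lanes (or, depending on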
