Firebolt whitepaper - 1. Requirements for the modern cloud data warehouse

Technical whitepaper from the official documentation:
https://www.firebolt.io/resources/firebolt-cloud-data-warehouse-whitepaper

Abstract

Companies have usually adopted new cloud technologies for two reasons: to get the latest innovations and to lower costs. It has been nearly a decade since cloud data warehouses were first released, and most companies adopted them to migrate traditional reporting and dashboard-based analytics to the cloud. Cloud data warehouse vendors did innovate technically in ways that helped with this migration: decoupled storage and compute architectures delivered elastic scalability and helped simplify administration.

But cloud adoption has mostly helped modernize IT; cloud data warehouses have not supported the new needs of the business. Operational and customer-facing interactive analytics have become a top priority for helping employees make better decisions faster and for providing greater self-service to customers. These newer use cases require 10x or faster performance for interactive and ad hoc queries; unprecedented data, query, and user scale; and a much lower cost of ownership per user.

During the decade since cloud data warehouses were first released, they have not delivered the order-of-magnitude improvements in speed, scale, and cost required for these new use cases. In fact, cloud data warehouses inherited many of the same limitations that prevented on-premises data warehouses from supporting these use cases. For example, most cloud data warehouses are still mostly batch-centric. They do not support streaming ingestion at scale or provide low-latency visibility, both of which are critical for many operational and customer-facing use cases. Those that do not decouple storage and compute have limited data, query, and user scalability. Even those that do decouple storage and compute improved scalability, but their query performance is slower than that of the older specialized data warehouses. Support for semi-structured data, which is now the dominant form of data, is either incomplete, too slow, or inefficient.

Cloud data warehouses have also often raised costs, not lowered them. There are too many stories of companies blowing through their monthly credits or allotted budgets. The main solution has been to throttle queries, but limiting the use of analytics goes against the main goal of putting analytics and data into the hands of everyone to make faster, better decisions.

The Firebolt cloud data warehouse was designed to deliver the combined improvements in speed, scale, and efficiency needed to cost-effectively support ad hoc, interactive, operational, and customer-facing analytics; in other words, fast analytics for everyone. This whitepaper explains the overall architecture and the key differences compared to previous generations of cloud data warehouses, and provides some numbers to show how much of an improvement in speed, scale, cost, and price-performance is possible.

Requirements for the modern cloud data warehouse

Today’s cloud data warehouse has to support more than traditional reporting and dashboards, and the analyst teams behind them. It has to support ad hoc and interactive analytics against both batch and streaming data for 100-1000x more users as companies push operational analytics directly to their employees, and offer self-service analytics to their customers. It also needs to support data engineers and the data engineering lifecycle.

Because of this, cloud data warehouses need to provide much more than the elastic scalability and simplicity that first-generation cloud data warehouses introduced. They also have to deliver an order of magnitude improvement in performance, scalability, cost efficiency, and agility to support these newer users and their analytics. Today, the requirements are:

  • 100% SQL - SQL is the de facto language of data, especially for analysts and data engineers. Every task, from ELT to queries of any data, should be achievable in SQL.

  • Sub-second query performance - Ad hoc, interactive analytics by employees, and self-service analytics by customers all require queries that run in a few seconds or less.

  • Gigabyte-petabyte scale - The newer types of data - from customer interactions, connected devices or newer applications - are massive in size compared to the data from traditional applications and transactions, and are growing much faster. Most companies have terabytes of data; some have petabytes.

  • Elastic scale - Ingestion and query workloads are less predictable, which makes elastic scale much more important both for efficiency and service level agreements (SLA).

  • Native support for semi-structured data - While some products do support handling JSON, that support is not complete, performant, low-cost, or simple. Part of the challenge is that JSON is stored as text rather than as a native data type. Data engineers still need to be able to easily manipulate JSON in SQL (see the SQL sketch after this list).

  • Native ELT support with SQL - For data engineers to get new analytics out in hours or days, they need to be able to extract, load, and transform (ELT) their own data entirely in SQL, without having to wait on a separate team.

  • High user and query concurrency - Analytics is now being delivered to much larger groups of employees and end customers than the traditional analyst teams. It can require supporting 100s to 1000s of concurrent users and queries.

  • Workload isolation - Unlike reporting, which can be run in batch at any time, multiple workloads and users require isolation to help ensure that high-priority, near-real-time SLAs are met, and to protect workloads from each other.

  • Simplicity for data engineers and DataOps - Data warehouse deployments can no longer be controlled in ways that slow down the changes needed to support new data needs. Cloud data warehouses need to support DataOps in ways that give data engineers more control and enable faster data analytics cycle times.

  • Cost efficiency - Costs need to drop 10x or more to support analytics by 10-100x more users who consume 10x or more data.

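To make the semi-structured and ELT requirements above concrete, here is a minimal sketch of an ELT step written entirely in SQL that flattens a semi-structured payload during load. It is illustrative only: the table and column names are hypothetical, and the JSON_EXTRACT-style functions follow the generic syntax found in many SQL engines rather than Firebolt's documented functions.

    -- Hypothetical ELT step: load yesterday's raw events and flatten the
    -- semi-structured payload into the columns analysts query most often.
    INSERT INTO events_flat
    SELECT
        event_id,
        event_time,
        JSON_EXTRACT(payload, '$.user.id')                AS user_id,
        JSON_EXTRACT(payload, '$.device.type')            AS device_type,
        CAST(JSON_EXTRACT(payload, '$.amount') AS DOUBLE) AS amount
    FROM raw_events
    WHERE event_time >= CURRENT_DATE - INTERVAL '1' DAY;

The requirement is that a statement like this runs quickly and completely whether the payload is stored as text or as a native semi-structured type, and that a data engineer can write and schedule it without waiting on a separate team.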

The earlier generations of cloud data warehouses do not fulfill all these requirements. Those with decoupled storage and compute are SQL-native and provide elastic scale. They also provide adequate performance for reporting and other repetitive query workloads through caching. But they do not deliver the type of performance or efficiency needed for ad hoc, interactive, operational or customer-facing analytics where caching does not help as much.

The costs implied by such benchmarks are too high, and the performance too slow, especially as companies reach petabytes of data.

Employee and customer-facing analytics, as well as automation, require much faster and much more cost-effective analytics. Most people expect data in a few seconds or less when they need to make near-real-time decisions, and 50% of people expect mobile applications to respond in 2 seconds or less.

Satisfying all of these requirements requires a redesign of the modern decoupled storage-compute architecture to address three major performance and efficiency limitations.

  • Data access - Most cloud data warehouses fetch entire segments or partitions of data over the network, despite the network being the biggest bottleneck. In AWS, for example, the 10, 25, or 100 Gbps (gigabits per second) networks transport roughly 1, 2.5, or 10 gigabytes (GB) per second at most. When working with terabytes of data, data access takes seconds or more. Fetching exact data ranges instead of larger segments can cut access times 10x or more (see the back-of-the-envelope arithmetic after this list).

  • Query execution - Query optimization makes a huge difference in performance. It is one of the reasons Amazon originally chose to use ParAccel: building all of that optimization takes time. And yet most cloud data warehouses lack many proven optimization techniques, from indexing to cost-based optimization.

  • Compute efficiency - Decoupled storage and compute architectures have been a blessing and a curse. They allowed vendors to use nearly unlimited scale to improve performance instead of improving efficiency.

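To make the data-access arithmetic concrete, here is a back-of-the-envelope calculation written as a small SQL query so it stays in SQL. The 1 GB/s figure follows the rough 10 Gbps conversion above; the 1 TB table size and the roughly 10 GB of exact ranges are assumed numbers for illustration, not measurements.

    -- Rough transfer-time arithmetic: a 10 Gbps link moves roughly 1 GB per second.
    -- Assumed sizes: a 1,000 GB (1 TB) table versus ~10 GB of exact data ranges.
    SELECT
        1000 / 1.0 AS full_segment_fetch_seconds,  -- ~1,000 s to pull whole segments
        10 / 1.0   AS exact_range_fetch_seconds;   -- ~10 s to pull only the needed ranges

The gap between the two values is what fetching exact data ranges, rather than whole segments or partitions, is meant to close.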
