Introducing the Newly Redesigned Apache HAWQ [Author: Lei Chang]

This article introduces HAWQ, a SQL query engine that combines the merits of the Pivotal Greenplum Database with Hadoop's distributed storage. It aims to address Hadoop users' requirements for interactive query response times, scalability, consistency, extensibility, and standards compliance. Through tight integration with YARN for query resource management, HAWQ achieves dynamic scheduling and massively parallel processing, while providing the elasticity needed to run in public cloud, private cloud, or shared physical cluster environments.


Background

To build the modern, dynamic applications of today directly from Apache Hadoop® Distributed File System (HDFS) data, many customers require SQL, specifically advanced Hadoop Native SQL analytics with enterprise-grade processing capabilities. The following are typical requirements from customers who use Hadoop for their daily analytical work:


  • Interactive queries: Interactive query response time is key to promoting data exploration, rapid prototyping, and other tasks.

  • Scalability: Linear scalability is necessary to cope with the explosive growth of data volumes.

  • Consistency: Transaction support matters, and it takes the responsibility for consistency away from application developers.

  • Extensibility: The system should support popular data formats, such as plain text and SequenceFile, as well as new formats.

  • Standard compliance: Preserving the investment in existing BI and visualization tools requires compliance with standards such as SQL and various other BI interfaces.

  • Productivity: The system should let organizations apply their existing skill sets without losing productivity.


The Hadoop software stack could not satisfy all of the above requirements. In the database community, one recent trend is the wide adoption of Massively Parallel Processing (MPP) systems. MPP databases share many features with Hadoop, such as a scalable, shared-nothing architecture. They particularly excel at fast query processing (orders of magnitude faster than alternative solutions), automatic query optimization, and industry-standard SQL interfaces that work with BI tools, all of which make them easy for data analysts to use.


Motivated by the limitations of Hadoop and the performance advantages of MPP databases, we developed HAWQ, a SQL query engine that combines the merits of the Pivotal Greenplum Database with Hadoop distributed storage. The HAWQ architecture differs from that of the Greenplum Database because of the characteristics of the underlying distributed storage and the requirement for tight integration with the Hadoop stack.


In HAWQ 1.0, significant architectural innovations were made in several components, for example distributed transactions, fault tolerance, the unified catalog service, and metadata dispatch. After the 1.0 release, customer requirements began trending toward greater elasticity, to enhance HAWQ's ability to run in public cloud, private cloud, or shared physical cluster environments. The HAWQ team developed the next generation of HAWQ to address these cloud trends and to continue the architectural innovation of HAWQ 1.0. Apache HAWQ (incubating) is based on HAWQ 2.0 Beta.


Architecture

The high-level architecture of Apache HAWQ is shown in Figure 1. In a typical deployment, each slave node runs a physical HAWQ segment, an HDFS DataNode, and a YARN NodeManager. The masters for HAWQ, HDFS, and YARN run on separate nodes.


In this new release, HAWQ is tightly integrated with YARN for query resource management. HAWQ caches containers obtained from YARN in a resource pool and then manages those resources locally, applying its own finer-grained resource management for users and groups. When a query is to be executed, HAWQ allocates a set of virtual segments according to the cost of the query, the resource queue definitions, data locality, and the current resource usage in the system. The query is then dispatched to the corresponding physical hosts, which can be a subset of the nodes in the cluster. The HAWQ resource enforcer on each node monitors and controls the real-time resources used by the query to avoid resource usage violations.
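
As a quick check of this integration, the resource manager mode is exposed as a server setting. A minimal sketch, assuming the hawq_global_rm_type parameter documented for HAWQ 2.x is visible via SHOW:

    -- Returns 'yarn' when HAWQ obtains containers from YARN,
    -- or 'none' when HAWQ manages resources standalone.
    SHOW hawq_global_rm_type;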



Figure 1: Apache HAWQ high-level architecture


In this new architecture, nodes can be added dynamically without data redistribution; expansion is an operation taking only seconds. When a new node is added, it automatically contacts the HAWQ master, which immediately makes the node's resources available for future queries.
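
One way to verify that a new node has registered is to inspect the segment catalog. This is a sketch only: it assumes HAWQ exposes the Greenplum-style gp_segment_configuration view with hostname and status columns, as described in the HAWQ 2.x catalog documentation:

    -- Lists every registered segment host and the status
    -- reported for it by the fault tolerance service.
    SELECT hostname, status FROM gp_segment_configuration;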


Component Highlights

Figure 2 shows more details on the internal components.

  • Masters: Accept user connections, parse and optimize queries, dispatch queries to segments, and coordinate query execution.

  • Resource Manager (RM): Obtains resources from YARN and serves resource requests. Resources are buffered in the HAWQ RM to support low-latency queries.

  • Fault Tolerance Service (FTS): Responsible for detecting segment failures and accepting heartbeats from segments.

  • Dispatcher/Coordinator: Dispatches query plans to a selected subset of segments and coordinates query execution. The dispatcher and the RM are the main components enabling dynamic scheduling.

  • Catalog service: Stores all metadata, such as UDF/UDT information, relation information, security information, and data file locations (see the example query after Figure 2).

  • YARN: Hadoop resource management framework.

  • Segment: Performs data processing. There is only one physical segment on each host, which makes expansion, shrinking, and resource management much easier. Each segment can start many Query Executors (QEs) for each query slice, so a single physical segment acts like multiple virtual segments, enabling HAWQ 2.0 to better utilize all available resources.


Figure 2: Apache HAWQ components
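
Because HAWQ is PostgreSQL-derived, much of the metadata held by the catalog service can be inspected through familiar system catalogs. A small illustration (the table name my_table is hypothetical):

    -- Look up basic relation metadata held by the catalog service.
    SELECT relname, relkind, relnatts
    FROM pg_class
    WHERE relname = 'my_table';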


Query Execution Flow

After a query is accepted on the master, it is parsed and analyzed, and a query tree is generated. The query tree is handed to the query optimizer, which generates a query plan. Based on the cost information in the plan, resources are requested from the HAWQ Resource Manager. The Resource Enforcer on each node ensures that the allocated resources are used properly and kills offending queries that use more memory than their allocated quota.
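
The plan that the optimizer produces can be inspected with the standard EXPLAIN statement before any resources are requested; the tables in this example are hypothetical:

    -- Shows the parallel plan, including the slices that the
    -- dispatcher will send to the selected segments.
    EXPLAIN
    SELECT c.region, SUM(o.amount)
    FROM orders o
    JOIN customers c ON o.cust_id = c.id
    GROUP BY c.region;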


Elasticity

In a classical MPP execution engine, the number of segments (the carriers of compute resources) used to run a query is fixed. This is true regardless of whether the query is big (e.g., involving large table scans and multi-table joins, probably requiring a lot of CPU, memory, and I/O) or small (e.g., a small table scan requiring few resources). Fixing the parallelism of query execution in this way simplifies the whole system architecture, but it does not utilize system resources efficiently, which becomes very apparent in a cloud or shared cluster environment.


To address this issue, the elasticity feature was introduced. It is based on virtual segments, which are allocated on demand according to query cost: for big queries, a large number of virtual segments are started, while for small queries, fewer virtual segments are started. Query execution is fully pipelined across the query executors started on all virtual segments.
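
The number of virtual segments is normally derived from the cost model, but it can also be pinned per statement. A minimal sketch, assuming the hawq_rm_stmt_nvseg and hawq_rm_stmt_vseg_memory session settings described in the HAWQ 2.x documentation (values and table name are illustrative):

    -- Run subsequent statements on 8 virtual segments,
    -- each with a 256 MB memory quota.
    SET hawq_rm_stmt_vseg_memory = '256mb';
    SET hawq_rm_stmt_nvseg = 8;

    SELECT COUNT(*) FROM orders;

    -- A value of 0 restores cost-based virtual segment allocation.
    SET hawq_rm_stmt_nvseg = 0;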


Resource Management

Resource management is a key part of supporting elasticity. HAWQ supports three levels of resource management: the global level, the internal query level, and the operator level.


  • Global level: responsible for acquiring resources from, and returning them to, a global resource manager, for example YARN or Mesos.

  • Internal query level: responsible for allocating acquired resources to different sessions and queries according to resource queue definitions.

  • Operator level: responsible for allocating resources across query operators in a query plan.


Resources in the HAWQ Resource Manager (RM) are managed within resource queues. When a query is submitted, after query parsing, semantic analysis, and query planning, resources are obtained from the HAWQ RM, which itself acquires resources from YARN through libYARN, a C/C++ client in HAWQ for communicating with YARN. The resource allocation for each query is sent to the segments together with the plan. Consequently, each Query Executor knows the resource quota for the current query and enforces that quota during query execution. When query execution finishes (or is cancelled), the resources are returned to the HAWQ RM.
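
Resource queues themselves are defined with SQL DDL. A minimal sketch, assuming the CREATE RESOURCE QUEUE syntax from the HAWQ 2.x documentation; the queue name, role name, and percentages are illustrative only:

    -- A child queue under the root queue, capped at 25% of cluster
    -- memory and cores, with at most 20 concurrent statements.
    CREATE RESOURCE QUEUE reports_queue WITH (
        PARENT = 'pg_root',
        ACTIVE_STATEMENTS = 20,
        MEMORY_LIMIT_CLUSTER = 25%,
        CORE_LIMIT_CLUSTER = 25%
    );

    -- Bind a role to the queue so its queries draw from this pool.
    CREATE ROLE analyst LOGIN RESOURCE QUEUE reports_queue;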


Further Reading

Read the SIGMOD paper: "HAWQ: A Massively Parallel Processing SQL Engine in Hadoop".


