Introducing the Newly Redesigned Apache HAWQ [Author: Lei Chang]

This article introduces HAWQ, a SQL query engine that combines the merits of the Pivotal Greenplum Database with Hadoop's distributed storage. It aims to address Hadoop users' requirements for interactive query response times, scalability, consistency, extensibility, and standards compliance. Through tight integration with YARN for query resource management, HAWQ achieves dynamic scheduling and massively parallel processing, while providing the elasticity needed to run in public cloud, private cloud, or shared physical cluster environments.


Background

To build the modern, dynamic applications of today directly from Apache Hadoop® Distributed File System (HDFS) data, many customers require SQL, specifically advanced Hadoop Native SQL analytics with enterprise-grade processing capabilities. The following are typical requirements from customers who use Hadoop for their daily analytical work:


  • Interactive queries: Interactive query response time is key to promoting data exploration, rapid prototyping, and other tasks.

  • Scalability: Linear scalability is necessary to cope with the explosive growth of data volumes.

  • Consistency: Transaction support matters, and it takes the responsibility for consistency away from application developers.

  • Extensibility: The system should support popular data formats, such as plain text and SequenceFile, as well as new formats.

  • Standard compliance: Preserving the investment in existing BI and visualization tools requires compliance with standards such as SQL and various other BI interfaces.

  • Productivity: The system should let organizations apply their existing skill sets without losing productivity.


The Hadoop software stack could not satisfy all of the above requirements. In the database community, one recent trend is the wide adoption of Massively Parallel Processing (MPP) systems. MPP databases share many features with Hadoop, such as a scalable, shared-nothing architecture. They particularly excel at fast query processing (orders of magnitude faster than alternative solutions), automatic query optimization, and industry-standard SQL interfaces that work with BI tools, all of which make them easy for data analysts to use.


Motivated by the limitations of Hadoop and the performance advantages of MPP databases, we developed HAWQ, a SQL query engine that combines the merits of the Pivotal Greenplum Database with Hadoop distributed storage. The HAWQ architecture differs from that of the Greenplum Database because of the characteristics of the underlying distributed storage and the requirement for tight integration with the Hadoop stack.


In HAWQ 1.0, significant architectural innovations were made in several components, for example distributed transactions, fault tolerance, the unified catalog service, and metadata dispatch. After the 1.0 release, customer requirements began trending toward greater elasticity, to enhance HAWQ's ability to run in public cloud, private cloud, or shared physical cluster environments. The HAWQ team developed the next generation of HAWQ to address these cloud trends and to continue the architectural innovation of HAWQ 1.0. Apache HAWQ (incubating) is based on HAWQ 2.0 Beta.


Architecture

The high-level architecture of Apache HAWQ is shown in Figure 1. In a typical deployment, each slave node runs a physical HAWQ segment, an HDFS DataNode, and a YARN NodeManager. The masters for HAWQ, HDFS, and YARN run on separate nodes.


In this new release, HAWQ is tightly integrated with YARN for query resource management. HAWQ caches containers obtained from YARN in a resource pool and then manages those resources locally, applying its own finer-grained resource management for users and groups. When a query is to be executed, HAWQ allocates a set of virtual segments according to the cost of the query, the resource queue definitions, data locality, and the current resource usage in the system. The query is then dispatched to the corresponding physical hosts, which can be a subset of the nodes in the cluster. The HAWQ resource enforcer on each node monitors and controls the real-time resources used by the query to avoid resource usage violations.
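
As a quick check of this integration, the resource manager mode is exposed as a server setting. A minimal sketch, assuming the hawq_global_rm_type parameter documented for HAWQ 2.x is visible via SHOW:

    -- Returns 'yarn' when HAWQ obtains containers from YARN,
    -- or 'none' when HAWQ manages resources standalone.
    SHOW hawq_global_rm_type;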



Figure 1: Apache HAWQ high-level architecture


In this new architecture, nodes can be added dynamically without data redistribution; expansion is an operation taking only seconds. When a new node is added, it automatically contacts the HAWQ master, which immediately makes the node's resources available for future queries.
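
One way to verify that a new node has registered is to inspect the segment catalog. This is a sketch only: it assumes HAWQ exposes the Greenplum-style gp_segment_configuration view with hostname and status columns, as described in the HAWQ 2.x catalog documentation:

    -- Lists every registered segment host and the status
    -- reported for it by the fault tolerance service.
    SELECT hostname, status FROM gp_segment_configuration;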


Component Highlights

Figure 2 shows more details on the internal components.

  • Masters: Accept user connections, parse and optimize queries, dispatch queries to segments, and coordinate query execution.

  • Resource Manager (RM): Obtains resources from YARN and serves resource requests. Resources are buffered in the HAWQ RM to support low-latency queries.

  • Fault Tolerance Service (FTS): Responsible for detecting segment failures and accepting heartbeats from segments.

  • Dispatcher/Coordinator: Dispatches query plans to a selected subset of segments and coordinates query execution. The dispatcher and the RM are the main components enabling dynamic scheduling.

  • Catalog service: Stores all metadata, such as UDF/UDT information, relation information, security information, and data file locations (see the example query after Figure 2).

  • YARN: Hadoop resource management framework.

  • Segment: Performs data processing. There is only one physical segment on each host, which makes expansion, shrinking, and resource management much easier. Each segment can start many Query Executors (QEs) for each query slice, so a single physical segment acts like multiple virtual segments, enabling HAWQ 2.0 to better utilize all available resources.


Figure 2: Apache HAWQ components
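
Because HAWQ is PostgreSQL-derived, much of the metadata held by the catalog service can be inspected through familiar system catalogs. A small illustration (the table name my_table is hypothetical):

    -- Look up basic relation metadata held by the catalog service.
    SELECT relname, relkind, relnatts
    FROM pg_class
    WHERE relname = 'my_table';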


Query Execution Flow

After a query is accepted on the master, it is parsed and analyzed, and a query tree is generated. The query tree is handed to the query optimizer, which generates a query plan. Based on the cost information in the plan, resources are requested from the HAWQ Resource Manager. The Resource Enforcer on each node ensures that the allocated resources are used properly and kills offending queries that use more memory than their allocated quota.
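
The plan that the optimizer produces can be inspected with the standard EXPLAIN statement before any resources are requested; the tables in this example are hypothetical:

    -- Shows the parallel plan, including the slices that the
    -- dispatcher will send to the selected segments.
    EXPLAIN
    SELECT c.region, SUM(o.amount)
    FROM orders o
    JOIN customers c ON o.cust_id = c.id
    GROUP BY c.region;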


Elasticity

In a classical MPP execution engine, the number of segments (the carriers of compute resources) used to run a query is fixed. This is true regardless of whether the query is big (e.g., involving large table scans and multi-table joins, probably requiring a lot of CPU, memory, and I/O) or small (e.g., a small table scan requiring few resources). Fixing the parallelism of query execution in this way simplifies the whole system architecture, but it does not utilize system resources efficiently, which becomes very apparent in a cloud or shared cluster environment.


To address this issue, the elasticity feature was introduced. It is based on virtual segments, which are allocated on demand according to query cost: for big queries, a large number of virtual segments are started, while for small queries, fewer virtual segments are started. Query execution is fully pipelined across the query executors started on all virtual segments.
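
The number of virtual segments is normally derived from the cost model, but it can also be pinned per statement. A minimal sketch, assuming the hawq_rm_stmt_nvseg and hawq_rm_stmt_vseg_memory session settings described in the HAWQ 2.x documentation (values and table name are illustrative):

    -- Run subsequent statements on 8 virtual segments,
    -- each with a 256 MB memory quota.
    SET hawq_rm_stmt_vseg_memory = '256mb';
    SET hawq_rm_stmt_nvseg = 8;

    SELECT COUNT(*) FROM orders;

    -- A value of 0 restores cost-based virtual segment allocation.
    SET hawq_rm_stmt_nvseg = 0;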


Resource Management

Resource management is a key part of supporting elasticity. HAWQ supports three levels of resource management: the global level, the internal query level, and the operator level.


  • Global level: responsible for acquiring resources from, and returning them to, a global resource manager, for example YARN or Mesos.

  • Internal query level: responsible for allocating acquired resources to different sessions and queries according to resource queue definitions.

  • Operator level: responsible for allocating resources across query operators in a query plan.


Resources in the HAWQ Resource Manager (RM) are managed within resource queues. When a query is submitted, after query parsing, semantic analysis, and query planning, resources are obtained from the HAWQ RM, which itself acquires resources from YARN through libYARN, a C/C++ client in HAWQ for communicating with YARN. The resource allocation for each query is sent to the segments together with the plan. Consequently, each Query Executor knows the resource quota for the current query and enforces that quota during query execution. When query execution finishes (or is cancelled), the resources are returned to the HAWQ RM.
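
Resource queues themselves are defined with SQL DDL. A minimal sketch, assuming the CREATE RESOURCE QUEUE syntax from the HAWQ 2.x documentation; the queue name, role name, and percentages are illustrative only:

    -- A child queue under the root queue, capped at 25% of cluster
    -- memory and cores, with at most 20 concurrent statements.
    CREATE RESOURCE QUEUE reports_queue WITH (
        PARENT = 'pg_root',
        ACTIVE_STATEMENTS = 20,
        MEMORY_LIMIT_CLUSTER = 25%,
        CORE_LIMIT_CLUSTER = 25%
    );

    -- Bind a role to the queue so its queries draw from this pool.
    CREATE ROLE analyst LOGIN RESOURCE QUEUE reports_queue;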


Further Reading

Read the SIGMOD paper: "HAWQ: A Massively Parallel Processing SQL Engine in Hadoop".


