Impala Concepts and Architecture (1): Impala Server Components (translated from English)

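Impala is a distributed, massively parallel processing (MPP) database engine. It consists of daemon processes running on the DataNodes of the cluster, which read and write data, parallelize and distribute query work, and return query results. The statestore component monitors the health of the daemons, and the catalog service broadcasts metadata changes.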


Components of the Impala Server

The Impala server is a distributed, massively parallel processing (MPP) database engine. It consists of different daemon processes that run on specific hosts within your cluster.



  • The Impala Daemon

The core Impala component is a daemon process that runs on each DataNode of the cluster, physically represented by the impalad process. It reads and writes to data files; accepts queries transmitted from the impala-shell command, Hue, JDBC, or ODBC; parallelizes the queries and distributes work across the cluster; and transmits intermediate query results back to the central coordinator node.


You can submit a query to the Impala daemon running on any DataNode, and that instance of the daemon serves as the coordinator node for that query. The other nodes transmit partial results back to the coordinator, which constructs the final result set for a query. When running experiments with functionality through the impala-shell command, you might always connect to the same Impala daemon for convenience. For clusters running production workloads, you might load-balance by submitting each query to a different Impala daemon in round-robin style, using the JDBC or ODBC interfaces.
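The round-robin idea can be illustrated with a minimal Python sketch. It is not part of the Impala documentation: the host names are hypothetical, and a real deployment would more likely rely on a proxy or the load-balancing support of the JDBC/ODBC layer rather than hand-written code.

```python
from itertools import cycle
from typing import Iterator, List


def round_robin(hosts: List[str]) -> Iterator[str]:
    """Return an endless iterator that yields the hosts in turn."""
    if not hosts:
        raise ValueError("at least one impalad host is required")
    return cycle(hosts)


# Hypothetical impalad hosts, used only for illustration.
IMPALAD_HOSTS = [
    "impalad-1.example.com",
    "impalad-2.example.com",
    "impalad-3.example.com",
]

picker = round_robin(IMPALAD_HOSTS)

# Each new query would be submitted to the next host in the rotation;
# that host then acts as the coordinator for that query.
coordinators = [next(picker) for _ in range(5)]
```

After the last host is used, the rotation wraps around to the first one, so coordinator work is spread evenly across the daemons.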


The Impala daemons are in constant communication with the statestore, to confirm which nodes are healthy and can accept new work.

They also receive broadcast messages from the catalogd daemon (introduced in Impala 1.2) whenever any Impala node in the cluster creates, alters, or drops any type of object, or when an INSERT or LOAD DATA statement is processed through Impala. This background communication minimizes the need for REFRESH or INVALIDATE METADATA statements that were needed to coordinate metadata across nodes prior to Impala 1.2.


In Impala 2.9 and higher, you can control which hosts act as query coordinators and which act as query executors, to improve scalability for highly concurrent workloads on large clusters. See Scalability Considerations for Impala for details.

Related information: Modifying Impala Startup Options, Starting Impala, Setting the Idle Query and Idle Session Timeouts for impalad, Ports Used by Impala, Using Impala through a Proxy for High Availability


  • The Impala Statestore

The Impala component known as the statestore checks on the health of Impala daemons on all the DataNodes in a cluster, and continuously relays its findings to each of those daemons. It is physically represented by a daemon process named statestored; you only need such a process on one host in the cluster. If an Impala daemon goes offline due to hardware failure, network error, software issue, or other reason, the statestore informs all the other Impala daemons so that future queries can avoid making requests to the unreachable node.
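The health-tracking role described above can be pictured with a toy model in Python. This is only a sketch of the general idea (track when each daemon was last seen, and treat daemons that have been silent too long as unreachable). The class name, the timeout value, and the host names are all hypothetical, and the sketch does not model the actual protocol used between statestored and impalad.

```python
from typing import Dict, Set


class ToyStatestore:
    """Toy membership tracker; not Impala's real implementation."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}

    def mark_seen(self, host: str, now: float) -> None:
        # Record the time of the latest successful contact with a daemon.
        self.last_seen[host] = now

    def healthy_hosts(self, now: float) -> Set[str]:
        # A host counts as healthy if it was seen within the timeout window.
        return {
            host
            for host, seen_at in self.last_seen.items()
            if now - seen_at <= self.timeout_s
        }


store = ToyStatestore(timeout_s=10.0)
store.mark_seen("impalad-1", now=0.0)
store.mark_seen("impalad-2", now=0.0)
store.mark_seen("impalad-1", now=8.0)  # impalad-2 has gone silent

# This is the membership view that would be relayed to the other daemons,
# so that future queries avoid the unreachable node.
healthy = store.healthy_hosts(now=15.0)
```

At time 15, only impalad-1 is within the 10-second window, so it is the only host left in the healthy set.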


Because the statestore's purpose is to help when things go wrong, it is not critical to the normal operation of an Impala cluster. If the statestore is not running or becomes unreachable, the Impala daemons continue running and distributing work among themselves as usual; the cluster just becomes less robust if other Impala daemons fail while the statestore is offline. When the statestore comes back online, it re-establishes communication with the Impala daemons and resumes its monitoring function.


Most considerations for load balancing and high availability apply to the impalad daemon. The statestored and catalogd daemons do not have special requirements for high availability, because problems with those daemons do not result in data loss. If those daemons become unavailable due to an outage on a particular host, you can stop the Impala service, delete the Impala StateStore and Impala Catalog Server roles, add the roles on a different host, and restart the Impala service.


Related information:

Scalability Considerations for the Impala Statestore, Modifying Impala Startup Options, Starting Impala, Increasing the Statestore Timeout, Ports Used by Impala


  • The Impala Catalog Service

The Impala component known as the catalog service relays the metadata changes from Impala SQL statements to all the Impala daemons in a cluster. It is physically represented by a daemon process named catalogd; you only need such a process on one host in the cluster. Because the requests are passed through the statestore daemon, it makes sense to run the statestored and catalogd services on the same host.


The catalog service avoids the need to issue REFRESH and INVALIDATE METADATA statements when the metadata changes are performed by statements issued through Impala. When you create a table, load data, and so on through Hive, you do need to issue REFRESH or INVALIDATE METADATA on an Impala node before executing a query there.


This feature touches a number of aspects of Impala:

  • See Installing Impala, Upgrading Impala, and Starting Impala for usage information for the catalogd daemon.

  • The REFRESH and INVALIDATE METADATA statements are not needed when the CREATE TABLE, INSERT, or other table-changing or data-changing operation is performed through Impala. These statements are still needed if such operations are done through Hive or by manipulating data files directly in HDFS, but in those cases the statements only need to be issued on one Impala node rather than on all nodes. See REFRESH Statement and INVALIDATE METADATA Statement for the latest usage information for those statements.
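The rule above can be summarized as a small decision helper. The following Python sketch is an illustration only: the enum, the function name, and the table name are hypothetical, and the mapping assumes the common usage in which INVALIDATE METADATA makes a table newly created in Hive visible to Impala, while REFRESH picks up new data files for a table Impala already knows about.

```python
from enum import Enum
from typing import Optional


class Change(Enum):
    VIA_IMPALA = "made through an Impala statement"
    NEW_TABLE_VIA_HIVE = "table created through Hive"
    NEW_DATA_FILES_OUTSIDE_IMPALA = "data files added through Hive or HDFS"


def metadata_statement(change: Change, table: str) -> Optional[str]:
    """Return the statement to issue on ONE Impala node, or None."""
    if change is Change.VIA_IMPALA:
        # The catalog service broadcasts the change; nothing to do.
        return None
    if change is Change.NEW_TABLE_VIA_HIVE:
        return f"INVALIDATE METADATA {table}"
    return f"REFRESH {table}"


# Hypothetical table name, for illustration only.
after_impala_insert = metadata_statement(Change.VIA_IMPALA, "sales.events")
after_hive_create = metadata_statement(Change.NEW_TABLE_VIA_HIVE, "sales.events")
after_hdfs_copy = metadata_statement(
    Change.NEW_DATA_FILES_OUTSIDE_IMPALA, "sales.events"
)
```

Whichever statement is returned only needs to run on a single Impala node; the catalog service then relays the result to the rest of the cluster.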


Use the --load_catalog_in_background option to control when the metadata of a table is loaded.

  • If set to false, the metadata of a table is loaded when it is referenced for the first time. This means that the first run of a particular query can be slower than subsequent runs. Starting in Impala 2.2, the default for load_catalog_in_background is false.
  • If set to true, the catalog service attempts to load metadata for a table even if no query needs that metadata, so the metadata may already be loaded when the first query that needs it is run. However, for the following reasons, we recommend not setting the option to true.
    • Background load can interfere with query-specific metadata loading. This can happen on startup or after invalidating metadata, with a duration depending on the amount of metadata, and can lead to seemingly random long-running queries that are difficult to diagnose.
    • Impala may load metadata for tables that are possibly never used, potentially increasing catalog size and consequently memory usage for both catalog service and Impala Daemon.
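The trade-off between the two settings can be seen in a toy model. The Python sketch below is not how catalogd is implemented: the class and table names are hypothetical, and eager loading in the constructor merely stands in for background loading. It shows that lazy loading pays the cost on first reference, while eager loading also holds metadata for tables that are never queried.

```python
from typing import Dict, List


class ToyCatalog:
    """Toy metadata cache; not Impala's real catalog service."""

    def __init__(self, tables: List[str], load_in_background: bool) -> None:
        self.loads: List[str] = []      # order in which loads happened
        self.cache: Dict[str, str] = {}
        if load_in_background:
            # Stand-in for background loading: load everything up front.
            for table in tables:
                self._load(table)

    def _load(self, table: str) -> None:
        self.loads.append(table)
        self.cache[table] = f"metadata for {table}"

    def get(self, table: str) -> str:
        # With lazy loading, the first reference triggers the load,
        # which is why the first run of a query can be slower.
        if table not in self.cache:
            self._load(table)
        return self.cache[table]


TABLES = ["orders", "customers", "archive_2009"]

lazy = ToyCatalog(TABLES, load_in_background=False)
lazy.get("orders")
lazy.get("orders")  # second reference is served from the cache

eager = ToyCatalog(TABLES, load_in_background=True)
eager.get("orders")
```

In the lazy case only `orders` is ever loaded; in the eager case all three tables occupy the cache even though two of them were never queried, which mirrors the memory concern described above.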



Note:

In Impala 1.2.4 and higher, you can specify a table name with INVALIDATE METADATA after the table is created in Hive, allowing you to make individual tables visible to Impala without doing a full reload of the catalog metadata. Impala 1.2.4 also includes other changes to make the metadata broadcast mechanism faster and more responsive, especially during Impala startup. See New Features in Impala 1.2.4 for details.

Related information: Modifying Impala Startup Options, Starting Impala, Ports Used by Impala

