Impala join query optimization in practice

This post describes a performance bottleneck hit while running a complex self-join query against a Kudu table, and how it was resolved: by adding the STRAIGHT_JOIN hint, pre-computing table statistics, and giving Impala adequate resources, the query time was cut from more than 24 hours to 5 hours.


The query

CREATE TABLE result AS
SELECT t1.crossing_id  AS cid,
       t1.plate_no     AS pn1,
       t2.plate_no     AS pn2,
       t1.pt_timestamp AS ptts1,
       t2.pt_timestamp AS ptts2
FROM (SELECT * FROM datakudu) AS t1
INNER JOIN (SELECT * FROM datakudu) AS t2
        ON  t1.crossing_id = t2.crossing_id
        AND t1.plate_no <> t2.plate_no
        AND abs(t1.pt_timestamp - t2.pt_timestamp) < 60

Environment

Each server has 32 GB of RAM and 8 CPU cores.
The Kudu cluster has 3 nodes: node1, node2 and node3. All three nodes run impalad; statestored and catalogd run only on node1.
The table datakudu is stored in Kudu and contains about 27 million rows.

Without any optimization the query above ran for more than 24 hours (clearly unacceptable, so it was cancelled before it finished).

Optimization 1: add the STRAIGHT_JOIN hint to the query

CREATE TABLE result AS
SELECT STRAIGHT_JOIN
       t1.crossing_id  AS cid,
       t1.plate_no     AS pn1,
       t2.plate_no     AS pn2,
       t1.pt_timestamp AS ptts1,
       t2.pt_timestamp AS ptts2
FROM (SELECT * FROM datakudu) AS t1
INNER JOIN (SELECT * FROM datakudu) AS t2
        ON  t1.crossing_id = t2.crossing_id
        AND t1.plate_no <> t2.plate_no
        AND abs(t1.pt_timestamp - t2.pt_timestamp) < 60

This brought the query time down to 23 hours 37 minutes, which is already a noticeable improvement since the query now at least runs to completion.
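STRAIGHT_JOIN disables the planner's automatic join reordering, so the tables are joined exactly in the order they are written; without statistics the planner can otherwise pick a very poor order. As a quick check (not from the original post, but standard Impala syntax), the chosen plan can be compared before and after the hint with EXPLAIN:

EXPLAIN SELECT t1.crossing_id, t1.plate_no, t2.plate_no
FROM (SELECT * FROM datakudu) AS t1
INNER JOIN (SELECT * FROM datakudu) AS t2
        ON t1.crossing_id = t2.crossing_id;

EXPLAIN SELECT STRAIGHT_JOIN t1.crossing_id, t1.plate_no, t2.plate_no
FROM (SELECT * FROM datakudu) AS t1
INNER JOIN (SELECT * FROM datakudu) AS t2
        ON t1.crossing_id = t2.crossing_id;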

Optimization 2: run COMPUTE STATS to collect table statistics before querying; this is reported to improve performance significantly.

First run COMPUTE STATS datakudu

Then run:

CREATE TABLE result AS
SELECT STRAIGHT_JOIN
       t1.crossing_id  AS cid,
       t1.plate_no     AS pn1,
       t2.plate_no     AS pn2,
       t1.pt_timestamp AS ptts1,
       t2.pt_timestamp AS ptts2
FROM (SELECT * FROM datakudu) AS t1
INNER JOIN (SELECT * FROM datakudu) AS t2
        ON  t1.crossing_id = t2.crossing_id
        AND t1.plate_no <> t2.plate_no
        AND abs(t1.pt_timestamp - t2.pt_timestamp) < 60

This time the query took 9 hours.
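To confirm that the statistics were actually collected (this check is not in the original post; SHOW TABLE STATS and SHOW COLUMN STATS are standard Impala statements), you can run:

SHOW TABLE STATS datakudu;
SHOW COLUMN STATS datakudu;

With a known row count and per-column NDVs the planner can estimate join cardinalities and size the hash-join build side far more accurately, which is likely why the runtime drops so sharply at this step.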

Optimization 3: allocate enough CPU core resources to Impala

The following steps must be performed on all three nodes.

1. Create the file /etc/impala/conf/fair-scheduler.xml with the following contents:

<allocations>
<queue name="root">
        <aclSubmitApps> </aclSubmitApps>
        <queue name="default">
                <maxResources>50000 mb, 8 vcores</maxResources>
                <aclSubmitApps>*</aclSubmitApps>
        </queue>
        <queue name="development">
                <maxResources>200000 mb, 0 vcores</maxResources>
                <aclSubmitApps>user1,user2 dev,ops,admin</aclSubmitApps>
        </queue>
        <queue name="production">
                <maxResources>1000000 mb, 0 vcores</maxResources>
                <aclSubmitApps> ops,admin</aclSubmitApps>
        </queue>
</queue>
<queuePlacementPolicy>
        <rule name="specified" create="false"/>
        <rule name="default" />
</queuePlacementPolicy>
</allocations>

fair-scheduler.xml plays the same role here as it does in YARN: default is the default queue, and any query submitted without an explicit queue lands in default. Here I assigned all 8 cores of each node to default. A session can also be routed into a specific pool explicitly, as shown in the sketch after this paragraph. For Impala, fair-scheduler.xml only takes effect if llama-site.xml is configured as well, which is the next step.
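For illustration only (this is not part of the original post), the standard REQUEST_POOL query option in impala-shell routes a session into one of the pools defined above, for example the root.default pool:

-- route subsequent queries of this session into root.default
SET REQUEST_POOL=root.default;
-- then run the CREATE TABLE ... AS SELECT from above as usual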

2. Create the file /etc/impala/conf/llama-site.xml with the following contents:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>llama.am.throttling.maximum.placed.reservations.root.default</name>
    <value>10</value>
  </property>
  <property>
    <name>llama.am.throttling.maximum.queued.reservations.root.default</name>
    <value>50</value>
  </property>
  <property>
    <name>impala.admission-control.pool-default-query-options.root.default</name>
    <value>mem_limit=128m,query_timeout_s=20,max_io_buffers=10</value>
  </property>
  <property>
    <name>impala.admission-control.pool-queue-timeout-ms.root.default</name>
    <value>30000</value>
  </property>
  <property>
    <name>llama.am.throttling.maximum.placed.reservations.root.development</name>
    <value>50</value>
  </property>
  <property>
    <name>llama.am.throttling.maximum.queued.reservations.root.development</name>
    <value>100</value>
  </property>
  <property>
    <name>impala.admission-control.pool-default-query-options.root.development</name>
    <value>mem_limit=256m,query_timeout_s=30,max_io_buffers=10</value>
  </property>
  <property>
    <name>impala.admission-control.pool-queue-timeout-ms.root.development</name>
    <value>15000</value>
  </property>
  <property>
    <name>llama.am.throttling.maximum.placed.reservations.root.production</name>
    <value>100</value>
  </property>
  <property>
    <name>llama.am.throttling.maximum.queued.reservations.root.production</name>
    <value>200</value>
  </property>
  <!--
    Default query options for the 'root.production' pool.
    THIS IS A NEW PARAMETER in Impala 2.5.
    Note that the MEM_LIMIT query option still shows up in here even though it is a
    separate box in the UI. We do that because it is the most important query option
    that people will need (everything else is somewhat advanced).
    MEM_LIMIT takes a per-node memory limit which is specified using one of the following:
      '<int>[bB]?'      -> bytes (default if no unit given)
      '<float>[mM(bB)]' -> megabytes
      '<float>[gG(bB)]' -> gigabytes
    E.g. 'MEM_LIMIT=12345' (no unit) means 12345 bytes, and you can append m or g
    to specify megabytes or gigabytes, though that is not required.
  -->
  <property>
    <name>impala.admission-control.pool-default-query-options.root.production</name>
    <value>mem_limit=386m,query_timeout_s=30,max_io_buffers=10</value>
  </property>
  <!--
    Default queue timeout (ms) for the pool 'root.production'.
    If this isn't set, the process-wide flag is used.
    THIS IS A NEW PARAMETER in Impala 2.5.
  -->
  <property>
    <name>impala.admission-control.pool-queue-timeout-ms.root.production</name>
    <value>30000</value>
  </property>
</configuration>

3. Edit /etc/default/impala to point Impala at llama-site.xml and fair-scheduler.xml

Edit /etc/default/impala accordingly (the original post showed the result only as a screenshot).
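Since the screenshot is not available, the following is only a sketch of what the edit typically looks like, assuming the stock CDH packaging where /etc/default/impala defines IMPALA_SERVER_ARGS; keep the flags already present in the file and append -fair_scheduler_allocation_path and -llama_site_path, the impalad startup flags that point at the two files created above:

IMPALA_SERVER_ARGS=" \
    -log_dir=${IMPALA_LOG_DIR} \
    -state_store_host=${IMPALA_STATE_STORE_HOST} \
    -catalog_service_host=${IMPALA_CATALOG_SERVICE_HOST} \
    -fair_scheduler_allocation_path=/etc/impala/conf/fair-scheduler.xml \
    -llama_site_path=/etc/impala/conf/llama-site.xml"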

4. Restart the impalad service:
service impala-server restart
5. Verify that the configuration took effect.
Open http://ip:25000/varz in a browser; the flag values shown there should now include the fair-scheduler.xml and llama-site.xml paths configured above (the original post included a screenshot of the changed page).
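Checking from the command line works as well (my own addition, assuming curl is installed on the node), by grepping the /varz page for the two new flags:

curl -s http://node1:25000/varz | grep -E 'fair_scheduler_allocation_path|llama_site_path'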
With all three optimizations in place, the same SQL now completes in 5 hours.
