Apache Tajo Enters the SQL-on-Hadoop Space

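Apache Tajo is a big data warehouse system designed for Hadoop, with comprehensive SQL support and distributed query processing. The project started in 2010 and joined the Apache Software Foundation in 2013, with South Korean startup Gruter as its primary sponsor. Although it is not widely known, Tajo offers a robust set of features, including fully distributed query processing and an ETL feature set. This article shows how Tajo performs in specific scenarios through a performance comparison with Apache Hive and Cloudera Impala.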


The number of SQL options for Hadoop expanded substantially over the last 18 months. Most get a large amount of attention when announced, but a few slip under the radar. One of these low-flying options is Apache Tajo. I learned about Tajo in November of 2013 at a Hadoop User Group meeting.

Billed as a big data warehouse system for Hadoop, Tajo started development in 2010 and moved to the Apache Software Foundation in March of 2013, where it is currently incubating. Its primary development sponsor is Gruter, a big data infrastructure startup in South Korea. Despite the lack of public awareness, Tajo has a fairly robust feature set:

  • SQL compliance
  • Fully distributed query processing against HDFS and other data sources
  • ETL feature set
  • User-defined functions
  • Compatibility with HiveQL and Hive MetaStore
  • Fault tolerance through a restart mechanism for failed tasks
  • Cost-based query optimization and an extensible query rewrite engine

Things get interesting when comparing performance against Apache Hive and Cloudera Impala. SK Telecom, the largest telecommunications provider in South Korea, tested Tajo, Hive and Impala using five sample queries. The test used Hive 0.10 and Impala 1.1.1 on CDH 4.3.0. The test data set was 1.7TB, and query results were 8GB or less in size. (The chart for each query below was taken from the presentation of those test results.)

QUERY 1: HEAVY SCAN WITH 20 TEXT MATCHING FILTERS

[Chart: tajo_q1]

QUERY 2: 7 UNIONS WITH JOINS

[Chart: tajo_q2]

QUERY 3: SIMPLE JOINS

[Chart: tajo_q3]

QUERY 4: GROUP BY AND ORDER BY

[Chart: tajo_q4]

QUERY 5: 30 PATTERN MATCHING FILTERS WITH OR CONDITIONS USING GROUP BY, HAVING AND SORTING

[Chart: tajo_q5]

What do these results indicate? Clearly, different SQL-on-Hadoop implementations have different performance characteristics. Until these options mature to be truly multi-purpose, selecting a single option may not result in the best overall performance. Also, these benchmarks are for a specific set of use cases – not your use cases. The tested queries may have no relevance to your data and how you’re using it.
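If you want numbers for your own use cases, a small timing harness is one way to start. The sketch below is only an illustration, not how SK Telecom ran its test: the names `benchmark` and `fastest_per_query` are made up for this example, and each engine is assumed to be wrapped in a hypothetical callable that submits a SQL string and blocks until the query finishes. How you build that callable for Tajo, Hive or Impala depends on the client you use and is not shown here.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(
    engines: Dict[str, Callable[[str], object]],
    queries: Dict[str, str],
    runs: int = 3,
) -> Dict[str, Dict[str, float]]:
    """Return the median wall-clock seconds for every (query, engine) pair."""
    results: Dict[str, Dict[str, float]] = {}
    for query_name, sql in queries.items():
        results[query_name] = {}
        for engine_name, run_query in engines.items():
            timings = []
            for _ in range(runs):
                start = time.perf_counter()
                # Assumption: run_query submits the SQL and blocks until done.
                run_query(sql)
                timings.append(time.perf_counter() - start)
            # The median is less sensitive to one slow outlier run than the mean.
            results[query_name][engine_name] = statistics.median(timings)
    return results


def fastest_per_query(results: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Map each query name to the engine with the lowest median time."""
    return {query: min(times, key=times.get) for query, times in results.items()}
```

Running your real queries through something like this, against your real data, will likely tell you more than any published benchmark. If `fastest_per_query` names different engines for different queries, you are seeing the same pattern as the charts above.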

The other important takeaway is the absolute performance of these options. The sample data set and results are small in modern terms, yet none of the results are astounding relative to a modern data warehouse or RDBMS. There’s a difference between “fast” and “fast for Hadoop.” Cloudera appears to be making some headway, but a lot of ground must be covered before any Hadoop distribution is comparable to the systems vendors claim to be replacing.

Ref: http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space/
