Apache Tajo Enters the SQL-on-Hadoop Space

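Apache Tajo is a big data warehouse system designed for Hadoop, with comprehensive SQL support and distributed query processing. Development began in 2010, the project joined the Apache Software Foundation in 2013, and its primary sponsor is the South Korean startup Gruter. Although it is not widely known, Tajo offers a strong set of features, including fully distributed query processing and an ETL feature set. This article compares its performance with Apache Hive and Cloudera Impala to show how Tajo behaves in specific scenarios.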

The number of SQL options for Hadoop expanded substantially over the last 18 months. Most get a large amount of attention when announced, but a few slip under the radar. One of these low-flying options is Apache Tajo. I learned about Tajo in November of 2013 at a Hadoop User Group meeting.

Billed as a big data data warehousing system for Hadoop, Tajo development started in 2010 and moved to the Apache Software Foundation in March of 2013. Tajo is currently incubating. Its primary development sponsor is Gruter, a big data infrastructure startup in South Korea. Despite the lack of public awareness, Tajo has a fairly robust feature set:

SQL compliance
Fully distributed query processing against HDFS and other data sources
ETL feature set
User-defined functions
Compatibility with HiveQL and Hive MetaStore
Fault tolerance through a restart mechanism for failed tasks
Cost-based query optimization and an extensible query rewrite engine

Things get interesting when comparing performance against Apache Hive and Cloudera Impala. SK Telecom, the largest telecommunications provider in South Korea, tested Tajo, Hive and Impala using five sample queries. Hive 0.10 and Impala 1.1.1 on CDH 4.3.0 were used for the test. Test data size was 1.7TB and query results were 8GB or less in size. (The following images were taken from the presentation in the previous link.)

QUERY 1: HEAVY SCAN WITH 20 TEXT MATCHING FILTERS

QUERY 2: 7 UNIONS WITH JOINS

QUERY 3: SIMPLE JOINS

QUERY 4: GROUP BY AND ORDER BY

QUERY 5: 30 PATTERN MATCHING FILTERS WITH OR CONDITIONS USING GROUP BY, HAVING AND SORTING

What do these results indicate? Clearly, different SQL-on-Hadoop implementations have different performance characteristics. Until these options mature to be truly multi-purpose, selecting a single option may not result in the best overall performance. Also, these benchmarks are for a specific set of use cases – not your use cases. The tested queries may have no relevance to your data and how you’re using it.

The other important takeaway is the absolute performance of these options. The sample data set and results are small in modern terms, yet none of the results are astounding relative to a modern data warehouse or RDBMS. There’s a difference between “fast” and “fast for Hadoop.” Cloudera appears to be making some headway, but a lot of ground must be covered before any Hadoop distribution is comparable to the systems vendors claim to be replacing.

Ref: http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space/