| | Hive | Impala | Drill | SparkSQL |
|---|---|---|---|---|
| Project Goal | Offline batch processing: long-running jobs that perform data-heavy operations, such as joins on huge data sets. | Run real-time queries on top of an existing Hadoop warehouse. | Provide distributed query capability across multiple big data platforms. It can query data from any or all of those data sources at the same time and can push queries down into the underlying storage system. | Execute SQL queries, then process the result sets. |
| Similarity (Hive and Impala) | Shares its metadata with Impala. Designed for the Hadoop environment. | Designed on the basis of Hive and uses the same metadata. Designed for the Hadoop environment. | | |
| Similarity (Drill and SparkSQL) | | | Supports querying data from a variety of data sources (RDBMS, NoSQL, files, JSON, ...). Supports JDBC/ODBC drivers. | Supports querying data from a variety of data sources (RDBMS, NoSQL, files, JSON, ...). Supports JDBC/ODBC drivers. |
| Difference | Suitable for offline data processing. | Focuses on online, real-time data processing. | Not only a Hadoop project. Schema-free: all data is internally represented as either a simple or a complex JSON data structure. Fully supports SQL queries (ANSI SQL:2003). Supported by many BI tools. Better security support for data access. | Only has SQL query capabilities, limited to a subset of SQL (SQL-like). |

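
The "schema-free" entry for Drill in the table can be illustrated with a small conceptual sketch. The code below is plain Python and is not Drill's implementation or API. It only shows the idea of querying JSON records without declaring a schema first, where a field missing from a record comes back as `None`. The sample records and the helper names `get_path` and `select` are hypothetical and exist only for this illustration.

```python
import json

# Hypothetical newline-delimited JSON input; note that the fields differ per record.
RAW = """\
{"name": "alice", "address": {"city": "Oslo"}, "tags": ["a", "b"]}
{"name": "bob", "age": 41}
{"name": "carol", "address": {"city": "Lima", "zip": "15001"}}
"""


def get_path(record, path):
    """Follow a dotted path such as 'address.city'; return None if any step is missing."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def select(lines, columns, where=None):
    """Project the requested dotted paths from each JSON line, optionally filtering rows."""
    rows = []
    for line in lines.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if where is None or where(record):
            rows.append(tuple(get_path(record, col) for col in columns))
    return rows


# No schema is declared anywhere: the structure is discovered while reading each record.
print(select(RAW, ["name", "address.city"]))
# prints [('alice', 'Oslo'), ('bob', None), ('carol', 'Lima')]
```

The nested field `address.city` is addressed directly, and the record without an `address` field still produces a row. That is the behavior the "simple or complex JSON data structure" wording points at, under the assumptions stated above.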
References:
https://www.javacodegeeks.com/2015/12/apache-spark-vs-apache-drill.html
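This article compares the characteristics and use cases of big data processing tools such as Hive, Impala, Drill, and SparkSQL. It describes their strengths in offline batch processing and real-time querying, and discusses their similarities and differences.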



