Big Data Systems: Paper Review of "Hive - A Warehousing Solution Over a Map-Reduce Framework"

1. Background

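The paper itself is short, three and a half pages in total, and its authors are from the Facebook Data Infrastructure Team. Here is a walkthrough.
Traditional data warehousing solutions have become prohibitively expensive, and Hadoop can serve as an alternative: it stores and processes very large data sets on clusters of inexpensive commodity machines. MapReduce jobs run on Hadoop, but the map-reduce programming model is very low level, so developers end up writing custom programs that are hard to maintain and reuse.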
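Hive is an open-source data warehousing solution built on top of Hadoop. It supports queries written in HiveQL, a SQL-like declarative language; these queries are compiled into map-reduce jobs that execute on Hadoop. HiveQL also lets users plug custom map-reduce scripts into queries. The language includes a type system with support for tables containing primitive types, collections such as arrays and maps, and nested compositions of the same. The underlying IO libraries can be extended to query data in custom formats. Hive also includes a system catalog, the Hive-Metastore, which holds schemas and statistics and is useful for data exploration and query optimization. At Facebook, the Hive warehouse contains several thousand tables holding over 700 TB of data and is used extensively by more than 100 users for both reporting and ad-hoc analysis.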
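To make the idea of compiling a declarative query into map-reduce work more concrete, here is a minimal in-memory sketch in Python. It evaluates something like `SELECT status, COUNT(1) FROM t GROUP BY status` as a map phase, a shuffle, and a reduce phase. All function names, the sample rows, and the column names are hypothetical; this is not Hive's compiler or the Hadoop API, only an illustration of the general pattern.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, str]


def map_phase(rows: Iterable[Row], key_col: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (group key, 1) for every input row."""
    for row in rows:
        yield row[key_col], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: collect all values that share the same key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: sum the values of each key to get the per-group count."""
    return {key: sum(values) for key, values in groups.items()}


def count_group_by(rows: Iterable[Row], key_col: str) -> Dict[str, int]:
    """Chain the three phases, as a single map-reduce job would."""
    return reduce_phase(shuffle(map_phase(rows, key_col)))


# Hypothetical sample data standing in for a table.
rows = [
    {"user": "a", "status": "ok"},
    {"user": "b", "status": "error"},
    {"user": "c", "status": "ok"},
]
print(count_group_by(rows, "status"))  # {'ok': 2, 'error': 1}
```

Writing even this small aggregation by hand takes three functions, which hints at why a declarative layer on top of map-reduce is attractive.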

2. Data Model

Data in Hive is organized into:

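  1. Tables: analogous to tables in relational databases. Each table has a corresponding HDFS directory. (HDFS paths resemble Linux paths, with "/" as the root directory.) The data in a table is serialized and stored in files within that directory. Users can associate a table with the serialization format of the underlying data. Hive provides built-in serialization formats that exploit compression and lazy de-serialization. Users can also add support for new data formats by defining custom serialize and de-serialize methods (called SerDe's) written in Java. The serialization format of each table is stored in the system catalog and is used automatically by Hive during query compilation and execution. Hive also supports external tables on data stored in HDFS, NFS, or local directories.
  2. Partitions: each table can have one or more partitions, which determine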
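The table description above mentions two mechanisms that a short sketch can illustrate: each table maps to a directory, and built-in formats use lazy de-serialization. The Python below is a minimal sketch under stated assumptions: the warehouse root `/warehouse`, the tab delimiter, and the names `table_dir` and `LazyRow` are all hypothetical and are not Hive's actual layout or SerDe interface (real SerDes are written in Java).

```python
import posixpath
from typing import Dict, List, Optional


def table_dir(warehouse_root: str, table: str) -> str:
    """Return the directory that would hold a table's files (hypothetical layout)."""
    return posixpath.join(warehouse_root, table)


class LazyRow:
    """Keeps the raw serialized record and parses it only on first field access."""

    def __init__(self, raw: str, columns: List[str], delimiter: str = "\t") -> None:
        self._raw = raw
        self._columns = columns
        self._delimiter = delimiter  # assumed delimiter, for illustration only
        self._fields: Optional[Dict[str, str]] = None
        self.parse_count = 0  # how many times the record was actually parsed

    def __getitem__(self, column: str) -> str:
        if self._fields is None:
            # De-serialize lazily: split the raw record once, then cache it.
            values = self._raw.split(self._delimiter)
            self._fields = dict(zip(self._columns, values))
            self.parse_count += 1
        return self._fields[column]


print(table_dir("/warehouse", "page_views"))  # /warehouse/page_views

row = LazyRow("42\tok", ["id", "status"])
print(row.parse_count)  # 0, nothing parsed yet
print(row["status"])    # ok
print(row.parse_count)  # 1, parsed once on first access
```

The point of the lazy approach is that rows a query never touches are never split into fields, which saves work on large scans.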
The size of data sets being collected and analyzed in the industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop [3] is a popular open-source map-reduce implementation which is being used as an alternative to store and process extremely large data sets on commodity hardware. However, the map-reduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse.
In this paper, we present Hive, an open-source data warehousing solution built on top of Hadoop. Hive supports queries expressed in a SQL-like declarative language - HiveQL, which are compiled into map-reduce jobs executed on Hadoop. In addition, HiveQL supports custom map-reduce scripts to be plugged into queries. The language includes a type system with support for tables containing primitive types, collections like arrays and maps, and nested compositions of the same. The underlying IO libraries can be extended to query data in custom formats. Hive also includes a system catalog, Hive-Metastore, containing schemas and statistics, which is useful in data exploration and query optimization. In Facebook, the Hive warehouse contains several thousand tables with over 700 terabytes of data and is being used extensively for both reporting and ad-hoc analyses by more than 100 users.
The rest of the paper is organized as follows. Section 2 describes the Hive data model and the HiveQL language with an example. Section 3 describes the Hive system architecture and an overview of the query life cycle. Section 4 provides a walk-through of the demonstration. We conclude with future work in Section 5.