Hive: Architecture

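This article walks through the Hive architecture and its five components: UI, Driver, Compiler, Metastore, and Execution Engine. It looks at why the metastore is kept in a relational database and at the drawbacks of using a separate data store for metadata instead of HDFS. It then covers the three units of Hive's data model: tables, partitions, and buckets.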

Official documentation on the Hive architecture

Here is the architecture diagram:

[Figure: Hive architecture diagram]

The diagram shows Hive's five components: UI, Driver, Compiler, Metastore, and Execution Engine. It also lays out the query flow fairly clearly, in nine steps.
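
To make the hand-offs between the components more concrete, here is a minimal Python sketch. It is a toy model, not Hive code: every class, method, and field name is hypothetical, the "plan" is just a dictionary, and a Python dict stands in for HDFS. It also collapses the nine steps of the diagram into a few calls, so it should be read as an illustration of who talks to whom rather than as a step-by-step mapping.

```python
from dataclasses import dataclass, field


@dataclass
class Metastore:
    # Hypothetical in-memory catalog: table name -> list of column names.
    tables: dict = field(default_factory=dict)

    def get_metadata(self, table):
        return self.tables[table]


@dataclass
class Compiler:
    metastore: Metastore

    def get_plan(self, table, columns):
        # Fetch the schema from the metastore, validate the requested
        # columns, and build a toy "plan" (a plain dict in this sketch).
        schema = self.metastore.get_metadata(table)
        missing = [c for c in columns if c not in schema]
        if missing:
            raise ValueError(f"unknown columns: {missing}")
        return {"scan": table, "project": list(columns)}


@dataclass
class ExecutionEngine:
    # Stand-in for data in HDFS: table name -> list of row dicts.
    storage: dict

    def execute(self, plan):
        rows = self.storage[plan["scan"]]
        return [{c: row[c] for c in plan["project"]} for row in rows]


@dataclass
class Driver:
    compiler: Compiler
    engine: ExecutionEngine

    def execute_query(self, table, columns):
        # The driver asks the compiler for a plan, then hands the plan
        # to the execution engine and returns the results to the caller.
        plan = self.compiler.get_plan(table, columns)
        return self.engine.execute(plan)


# The code below plays the role of the UI: it submits a query to the driver.
metastore = Metastore({"T": ["id", "name", "ds"]})
storage = {
    "T": [
        {"id": 1, "name": "a", "ds": "2008-09-01"},
        {"id": 2, "name": "b", "ds": "2008-09-02"},
    ]
}
driver = Driver(Compiler(metastore), ExecutionEngine(storage))
result = driver.execute_query("T", ["id", "name"])
print(result)
```

The point of the sketch is the dependency direction: the UI only knows the driver, the compiler is the component that consults the metastore, and the execution engine is the only one that touches the stored data.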
Two points deserve a closer look here:
Metastore (metadata)

Metastore is an object store with a database or file backed store.
The database backed store is implemented using an object-relational mapping (ORM) solution called the DataNucleus. 
The prime motivation for storing this in a relational database is queriability of metadata.
Some disadvantages of using a separate data store for metadata instead of using HDFS are synchronization and scalability issues. 
Additionally there is no clear way to implement an object store on top of HDFS due to lack of random updates to files. 
This, coupled with the advantages of queriability of a relational store, made our approach a sensible one.

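The metastore is an object store backed by either a database or files. The database-backed store is implemented with DataNucleus, an object-relational mapping (ORM) solution.
The main motivation for keeping the metadata in a relational database is that it can be queried. The drawbacks of using a separate data store for metadata, rather than HDFS, are synchronization and scalability issues. In addition, because HDFS does not support random updates to files, there is no clear way to implement an object store on top of it. Combined with the queriability of a relational store, this makes a relational database the sensible choice.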
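
The queriability argument can be illustrated with a small sketch. The example below uses Python's built-in `sqlite3` module as a stand-in relational store. The tables `toy_tables` and `toy_partitions` are invented for this illustration and are not the real metastore schema, and the `/warehouse/T` location is likewise an assumption.

```python
import sqlite3

# In-memory relational database standing in for the metastore backend.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE toy_tables (
        tbl_id   INTEGER PRIMARY KEY,
        name     TEXT,
        location TEXT
    );
    CREATE TABLE toy_partitions (
        tbl_id    INTEGER,
        part_spec TEXT
    );
    """
)

# Register one table with two date partitions (all values are made up).
conn.execute("INSERT INTO toy_tables VALUES (1, 'T', '/warehouse/T')")
conn.executemany(
    "INSERT INTO toy_partitions VALUES (?, ?)",
    [(1, "ds=2008-09-01"), (1, "ds=2008-09-02")],
)

# Because the metadata is relational, "which directory holds this
# partition?" is a plain SQL join rather than a scan over files.
rows = conn.execute(
    """
    SELECT t.location || '/' || p.part_spec
    FROM toy_tables AS t
    JOIN toy_partitions AS p ON p.tbl_id = t.tbl_id
    WHERE t.name = ? AND p.part_spec = ?
    """,
    ("T", "ds=2008-09-01"),
).fetchall()

partition_dirs = [row[0] for row in rows]
print(partition_dirs)
```

Answering the same question from metadata kept as files in HDFS would mean reading and parsing those files, and updating a single entry would mean rewriting a file, which is the "no random updates" problem the quoted passage mentions.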

Data model

Tables – These are analogous to Tables in Relational Databases. Tables can be filtered, projected, joined and unioned. Additionally all the data of a table is stored in a directory in HDFS. Hive also supports the notion of external tables wherein a table can be created on pre-existing files or directories in HDFS by providing the appropriate location to the table creation DDL. The rows in a table are organized into typed columns similar to Relational Databases.
Partitions – Each Table can have one or more partition keys which determine how the data is stored, for example a table T with a date partition column ds has files with data for a particular date stored in the <table location>/ds=<date> directory in HDFS. Partitions allow the system to prune data to be inspected based on query predicates, for example a query that is interested in rows from T that satisfy the predicate T.ds = '2008-09-01' would only have to look at files in <table location>/ds=2008-09-01/ directory in HDFS.
Buckets – Data in each partition may in turn be divided into Buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory. Bucketing allows the system to efficiently evaluate queries that depend on a sample of data (these are queries that use the SAMPLE clause on the table).

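The data model has three units: tables, partitions, and buckets, each finer-grained than the one before. A table can have one or more partition keys, which determine how its data is stored. Each partition may in turn be divided into buckets based on the hash of a column in the table, and each bucket is stored as a file in the partition directory.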
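
The layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the `bucket_<n>` file naming are hypothetical, and `zlib.crc32` is used as a convenient stable hash, not as the hash function Hive actually applies.

```python
import zlib


def partition_dir(table_location, ds):
    # Follows the <table location>/ds=<date> layout from the quoted text.
    return f"{table_location}/ds={ds}"


def prune_partitions(table_location, all_ds, wanted_ds):
    # Partition pruning: for a predicate such as T.ds = '2008-09-01',
    # only the matching partition directory needs to be read.
    return [partition_dir(table_location, ds) for ds in all_ds if ds == wanted_ds]


def bucket_for(value, num_buckets):
    # Stand-in hash (crc32), assumed for this sketch only.
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def bucket_file(table_location, ds, value, num_buckets):
    # Hypothetical file naming: one file per bucket inside the partition.
    bucket = bucket_for(value, num_buckets)
    return f"{partition_dir(table_location, ds)}/bucket_{bucket}"


all_dates = ["2008-08-31", "2008-09-01", "2008-09-02"]
print(prune_partitions("/warehouse/T", all_dates, "2008-09-01"))
print(bucket_file("/warehouse/T", "2008-09-01", "user_42", 4))
```

Since a row's bucket depends only on the hashed column value, reading a single bucket file gives a slice of the partition selected by hash, which is why bucketing suits queries that work on a sample of the data.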
