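Official documentation on Hive architecture
Let's go straight to the architecture diagram.
The diagram clearly shows that Hive consists of five components: UI, Driver, Compiler, Metastore, and Execution Engine. The workflow is also fairly clear, with nine steps in total.
The two points below are the main focus here:
Metastore (metadata)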
Metastore is an object store with a database or file backed store.
The database backed store is implemented using an object-relational mapping (ORM) solution called the DataNucleus.
The prime motivation for storing this in a relational database is queriability of metadata.
Some disadvantages of using a separate data store for metadata instead of using HDFS are synchronization and scalability issues.
Additionally there is no clear way to implement an object store on top of HDFS due to lack of random updates to files.
This, coupled with the advantages of queriability of a relational store, made our approach a sensible one.
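In other words, the Metastore is an object store backed by either a database or a file store. The database-backed store is implemented with an object-relational mapping (ORM) solution called DataNucleus.
The main motivation for keeping the metadata in a relational database is queriability. The downsides of using a separate data store for metadata, rather than HDFS, are synchronization and scalability issues. In addition, because HDFS does not support random updates to files, there is no clear way to implement an object store on top of it. Combined with the queriability of a relational store, this makes the relational database the more sensible choice.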
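To make the "queriability" argument concrete, here is a minimal sketch of metadata kept in a relational store. It uses Python's built-in sqlite3 purely for illustration. The table names (toy_tables, toy_partitions), their columns, and the helper functions are all hypothetical and are not the real Metastore schema or API.

```python
import sqlite3


def build_toy_metastore():
    """Create an in-memory relational store for table and partition metadata.

    The schema is a made-up illustration, not Hive's actual Metastore schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE toy_tables (
            tbl_id   INTEGER PRIMARY KEY,
            name     TEXT NOT NULL UNIQUE,
            location TEXT NOT NULL
        );
        CREATE TABLE toy_partitions (
            part_id  INTEGER PRIMARY KEY,
            tbl_id   INTEGER NOT NULL REFERENCES toy_tables(tbl_id),
            spec     TEXT NOT NULL,
            location TEXT NOT NULL
        );
        """
    )
    return conn


def add_table(conn, name, location):
    """Register a table and return its generated id."""
    cur = conn.execute(
        "INSERT INTO toy_tables (name, location) VALUES (?, ?)",
        (name, location),
    )
    return cur.lastrowid


def add_partition(conn, tbl_id, spec):
    """Register a partition such as 'ds=2008-09-01' under the table's location."""
    row = conn.execute(
        "SELECT location FROM toy_tables WHERE tbl_id = ?", (tbl_id,)
    ).fetchone()
    location = f"{row[0]}/{spec}"
    conn.execute(
        "INSERT INTO toy_partitions (tbl_id, spec, location) VALUES (?, ?, ?)",
        (tbl_id, spec, location),
    )
    return location


def partitions_of(conn, table_name):
    """Query the metadata with plain SQL: list a table's partitions in order."""
    return conn.execute(
        """
        SELECT p.spec, p.location
        FROM toy_partitions AS p
        JOIN toy_tables AS t ON p.tbl_id = t.tbl_id
        WHERE t.name = ?
        ORDER BY p.spec
        """,
        (table_name,),
    ).fetchall()
```

With this layout, a question like "which partitions does table T have?" is a single join, and a metadata change is an ordinary in-place row update, which is exactly what is awkward to do on files that cannot be randomly updated.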
Data model
Tables – These are analogous to Tables in Relational Databases. Tables can be filtered, projected, joined and unioned. Additionally all the data of a table is stored in a directory in HDFS. Hive also supports the notion of external tables wherein a table can be created on pre-existing files or directories in HDFS by providing the appropriate location to the table creation DDL. The rows in a table are organized into typed columns similar to Relational Databases.
Partitions – Each Table can have one or more partition keys which determine how the data is stored, for example a table T with a date partition column ds had files with data for a particular date stored in the <table location>/ds=<date> directory in HDFS. Partitions allow the system to prune data to be inspected based on query predicates, for example a query that is interested in rows from T that satisfy the predicate T.ds = '2008-09-01' would only have to look at files in <table location>/ds=2008-09-01/ directory in HDFS.
Buckets – Data in each partition may in turn be divided into Buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory. Bucketing allows the system to efficiently evaluate queries that depend on a sample of data (these are queries that use the SAMPLE clause on the table).
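To sum up, there are three data model units: tables, partitions, and buckets, each finer-grained than the one before. A table can have one or more partition keys, which determine how its data is stored. The data in each partition may in turn be divided into buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory.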
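The directory layout and the two lookups described above (partition pruning and hash bucketing) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the bucket_NNNNN file naming, and the use of zlib.crc32 as the hash are all hypothetical stand-ins, not what Hive actually does internally.

```python
import zlib


def partition_dir(table_location, key, value):
    """Build the <table location>/<key>=<value> directory for one partition."""
    return f"{table_location}/{key}={value}"


def prune_partitions(table_location, key, all_values, predicate):
    """Keep only the partition directories whose key value satisfies the predicate.

    This mimics pruning: partitions that cannot match are never read.
    """
    return [
        partition_dir(table_location, key, value)
        for value in all_values
        if predicate(value)
    ]


def bucket_for(column_value, num_buckets):
    """Map a column value to a bucket index via hash modulo bucket count.

    crc32 is an assumed stand-in hash chosen because it is deterministic.
    """
    digest = zlib.crc32(str(column_value).encode("utf-8"))
    return digest % num_buckets


def bucket_file(partition_location, column_value, num_buckets):
    """Return the (hypothetically named) bucket file inside a partition directory."""
    index = bucket_for(column_value, num_buckets)
    return f"{partition_location}/bucket_{index:05d}"


if __name__ == "__main__":
    dates = ["2008-08-31", "2008-09-01", "2008-09-02"]
    # Equivalent in spirit to the predicate T.ds = '2008-09-01'.
    kept = prune_partitions("/warehouse/T", "ds", dates, lambda v: v == "2008-09-01")
    print(kept)
    print(bucket_file(kept[0], "user42", 4))
```

Since every row with the same column value lands in the same bucket file, reading one bucket gives a consistent sample of the partition without scanning the rest.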