A First Look at the Hive Metastore

This article introduces the Hive Metastore: what it is, why it is used, and how it works. As one of Hive's core components, the metastore stores the metadata for Hive tables and partitions and provides clients access to it through the metastore API. The article also compares how the metastore behaves under different configurations, such as the embedded, local, and remote setups, along with their trade-offs.

1. What is Hive Metastore?

The Hive metastore service stores the metadata for Hive tables and partitions in a relational database, and provides clients (including Hive) access to this information via the metastore service API.

The Hive architecture diagram below shows that the metastore is part of the Hive services and that it connects to an RDBMS to store the metadata of Hive tables.

The metastore is the central repository of Hive metadata. The metastore is divided into two pieces: a service and the backing store for the data. By default, the metastore service runs in the same JVM as the Hive service and contains an embedded Derby database instance backed by the local disk. This is called the embedded metastore configuration.

Using an embedded metastore is a simple way to get started with Hive; however, only one embedded Derby database can access the database files on disk at any one time, which means you can only have one Hive session open at a time that shares the same metastore. Trying to start a second session gives the error Failed to start database 'metastore_db' when it attempts to open a connection to the metastore.

The solution to supporting multiple sessions (and therefore multiple users) is to use a standalone database. This configuration is referred to as a local metastore, since the metastore service still runs in the same process as the Hive service, but connects to a database running in a separate process, either on the same machine or on a remote machine.

MySQL is a popular choice for the standalone metastore. In this case, javax.jdo.option.ConnectionURL is set to jdbc:mysql://host/dbname?createDatabaseIfNotExist=true, and javax.jdo.option.ConnectionDriverName is set to com.mysql.jdbc.Driver. (The user name and password should be set, too, of course.) The JDBC driver JAR file for MySQL (Connector/J) must be on Hive's classpath, which is simply achieved by placing it in Hive's lib directory.
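The local-metastore settings described above go into hive-site.xml. A minimal sketch follows; the host, database name, user, and password are placeholders to adapt to your environment:

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://host/dbname?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepass</value>
</property>
```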

Going a step further, there's another metastore configuration called a remote metastore, where one or more metastore servers run in separate processes to the Hive service. This brings better manageability and security, since the database tier can be completely firewalled off, and the clients no longer need the database credentials. A Hive service is configured to use a remote metastore by setting hive.metastore.local to false, and hive.metastore.uris to the metastore server URIs, separated by commas if there is more than one. Metastore server URIs are of the form thrift://host:port, where the port corresponds to the one set by METASTORE_PORT when starting the metastore server.
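A client-side hive-site.xml fragment for the remote configuration can be sketched like this (the hostnames are placeholders; 9083 is the conventional metastore Thrift port):

```xml
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore1:9083,thrift://metastore2:9083</value>
</property>
```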

2. Why use Hive Metastore?

In a traditional database, a table's schema is enforced at data load time. If the data being loaded doesn't conform to the schema, then it is rejected. This design is sometimes called schema on write, since the data is checked against the schema when it is written into the database.

Hive, on the other hand, doesn't verify the data when it is loaded, but rather when a query is issued. This is called schema on read.

There are trade-offs between the two approaches. Schema on read makes for a very fast initial load, since the data does not have to be read, parsed, and serialized to disk in the database's internal format. The load operation is just a file copy or move. It is more flexible, too: consider having two schemas for the same underlying data, depending on the analysis being performed.

Schema on write makes query-time performance faster, since the database can index columns and perform compression on the data. The trade-off, however, is that it takes longer to load data into the database. Furthermore, there are many scenarios where the schema is not known at load time, so there are no indexes to apply, since the queries have not been formulated yet. These scenarios are where Hive shines.
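The contrast can be illustrated with a small sketch (plain Python, not Hive code; the schema and data are made up): a schema-on-write store validates and converts every record at load time, while a schema-on-read store just keeps the raw lines and applies the schema only when a query runs.

```python
# Hypothetical two-column schema: (name, type).
schema = [("id", int), ("name", str)]

def load_schema_on_write(rows):
    """Validate and convert each record at load time; non-conforming rows raise."""
    table = []
    for row in rows:
        table.append([typ(val) for (_, typ), val in zip(schema, row)])
    return table

def load_schema_on_read(lines):
    """Load is just a copy of the raw data; nothing is checked or parsed."""
    return list(lines)

def query_schema_on_read(raw):
    """The schema is applied only when a query is issued."""
    for line in raw:
        vals = line.split(",")
        yield [typ(val) for (_, typ), val in zip(schema, vals)]

raw = load_schema_on_read(["1,alice", "2,bob"])  # fast: effectively a file copy
rows = list(query_schema_on_read(raw))           # parsing happens here, at query time
```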

3. How does the Hive Metastore work?

UI - The user interface for users to submit queries and other operations to the system. Currently the system has a command line interface, and a web-based GUI is being developed.

Driver - The component which receives the queries. This component implements the notion of session handles and provides execute and fetch APIs modeled on JDBC/ODBC interfaces.

Compiler - The component that parses the query, does semantic analysis on the different query blocks and query expressions, and eventually generates an execution plan with the help of the table and partition metadata looked up from the metastore.

Metastore - The component that stores all the structure information of the various tables and partitions in the warehouse, including column and column type information, the serializers and deserializers necessary to read and write data, and the corresponding HDFS files where the data is stored.

Execution Engine - The component which executes the execution plan created by the compiler. The plan is a DAG of stages. The execution engine manages the dependencies between these different stages of the plan and executes these stages on the appropriate system components.
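The execution engine's job of running a plan that is a DAG of stages can be sketched as a dependency-ordered walk. This is a toy illustration in plain Python, not Hive's actual engine; the stage names are hypothetical:

```python
# Hypothetical plan DAG: stage -> list of stages it depends on.
plan = {
    "map1": [],
    "map2": [],
    "reduce1": ["map1", "map2"],
    "fetch": ["reduce1"],
}

def run_plan(plan):
    """Execute each stage only after all of its dependencies have run."""
    done, order = set(), []

    def run(stage):
        if stage in done:
            return
        for dep in plan[stage]:
            run(dep)                # recurse into dependencies first
        order.append(stage)         # a real engine would launch the stage here
        done.add(stage)

    for stage in plan:
        run(stage)
    return order

order = run_plan(plan)
```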


4. Some details

4.1. The process of a DDL SQL statement

When we execute a DDL statement such as CREATE TABLE in the CLI, the process is as follows:

cli.CliDriver gets the command from the user and performs some checking and parsing of the command.

It then invokes ql.Driver, the entry point of the ql component, which holds the program context and the query plan. In ql.Driver, the SQL is compiled and executed.

The compile method consists of roughly three steps: parse, analyze, and getSchema. It is getSchema that contacts the metastore we are concerned with.

The executor method invokes ql.exec.TaskRunner, which launches a ql.exec.DDLTask to execute the DDL command. The DDLTask invokes ql.metadata.Hive, which uses metastore.HiveMetaStoreClient to connect to metastore.HiveMetaStore.

metastore.HiveMetaStore is the entry point of the metastore component. It provides a set of clearly defined methods for the storage service, such as create table, add partition, etc.

metastore.HiveMetaStore invokes ObjectStore to persist the model objects in metastore.model into the database.
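The delegation chain above (CliDriver → Driver → DDLTask → HiveMetaStoreClient → HiveMetaStore → ObjectStore) can be caricatured as a chain of objects, each forwarding to the next layer. This is only a Python sketch of the layering; the real classes are Java and far richer:

```python
class ObjectStore:
    """Persists model objects into the backing store (here just a dict)."""
    def __init__(self):
        self.db = {}
    def persist(self, name, model):
        self.db[name] = model

class HiveMetaStore:
    """Entry point of the metastore component; exposes create_table etc."""
    def __init__(self, store):
        self.store = store
    def create_table(self, name, columns):
        self.store.persist(name, {"columns": columns})

class HiveMetaStoreClient:
    """Client used by the Hive side to talk to the metastore service."""
    def __init__(self, metastore):
        self.metastore = metastore
    def create_table(self, name, columns):
        self.metastore.create_table(name, columns)

class DDLTask:
    """Task launched by the driver to execute a DDL command."""
    def __init__(self, client):
        self.client = client
    def execute(self, name, columns):
        self.client.create_table(name, columns)

store = ObjectStore()
task = DDLTask(HiveMetaStoreClient(HiveMetaStore(store)))
task.execute("t1", ["id", "name"])  # CREATE TABLE flows down to the store
```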

4.2. The process of a SQL query with MapReduce

The difference between a SQL query with MapReduce and a DDL statement is that ql.exec.TaskRunner launches a ql.exec.MapRedTask instead. The ql.exec.MapRedTask starts a MapReduce job through a shell command to perform the computation.

The ql.exec.MapRedTask gets the result from MapReduce and sets it into the Context. Then ql.Driver returns the result to cli.CliDriver.

4.3. Column statistics

Initially, we plan to support statistics gathering through an explicit ANALYZE command. We have extended the ANALYZE command to compute statistics on columns both within a table and a partition. The basic requirement to compute statistics is to stream the entire table data through mappers. Since some operations such as INSERT and CTAS already stream the table data through mappers, we could piggyback on such operations and compute statistics as part of them. We believe that by doing so, Hive may be able to compute statistics more efficiently than most RDBMSs, which don't combine data loading and computing statistics into a single operation.

A statistical summary of the column data can be logically viewed as an aggregation of the column values rolled up by either the table or the partition. Hence we have implemented statistics computation using the generic user-defined aggregate function (UDAF) framework in Hive. UDAFs in Hive are executed in two stages: as the records are streamed through the mappers, a partial aggregate that is maintained in each mapper is updated. The reducers then combine the results of the partial aggregation to produce a final aggregate output.
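The two-stage execution can be sketched as partial aggregates merged by a reducer. Here is a plain-Python illustration (not Hive's UDAF API) computing min, max, and non-null count for one column split across two "mappers":

```python
def partial_aggregate(values):
    """Mapper side: fold one split of column values into a small partial state."""
    state = {"count": 0, "min": None, "max": None}
    for v in values:
        if v is None:
            continue
        state["count"] += 1
        state["min"] = v if state["min"] is None else min(state["min"], v)
        state["max"] = v if state["max"] is None else max(state["max"], v)
    return state

def merge(partials):
    """Reducer side: combine partial states into the final aggregate."""
    final = {"count": 0, "min": None, "max": None}
    for p in partials:
        final["count"] += p["count"]
        for key, fn in (("min", min), ("max", max)):
            if p[key] is not None:
                final[key] = p[key] if final[key] is None else fn(final[key], p[key])
    return final

# Two mappers, each seeing one split of the column data.
p1 = partial_aggregate([3, 1, None, 7])
p2 = partial_aggregate([5, 2])
stats = merge([p1, p2])  # {'count': 5, 'min': 1, 'max': 7}
```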

Even though the stats task is run as a batch job, we want it to be executed as efficiently as possible. We expect to compute statistics on terabytes of data at a given time; hence we expect to scan billions of records. Therefore it is very important that the algorithms we use for computing statistics use constant memory. We have invested significant time researching algorithms for this task and have selected ones which we think provide a good trade-off between accuracy and efficiency, and at the same time provide knobs for tuning. We have also attempted to structure the code in a pluggable fashion so that new algorithms can be easily substituted in the future.
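"Constant memory" here means each statistic is a fixed-size running state updated once per record, never a buffer of the records themselves. A minimal sketch in plain Python for a running mean over a stream (real column statistics such as distinct-value counts use fixed-size probabilistic sketches in the same spirit):

```python
def streaming_mean(stream):
    """O(1) memory: only a count and a running mean, regardless of stream size."""
    count, mean = 0, 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count  # incremental (Welford-style) mean update
    return mean

m = streaming_mean(iter(range(1, 101)))  # mean of 1..100
```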

