Big Data: Installing Hive and an Example of Web Log Analysis
Data search and analysis is an essential skill for any ops engineer.
"In reality, 90+% of MR jobs are generated by Hive SQL."
Part One: Theory Warm-up
One-sentence introduction: a distributed, column-oriented data warehouse architecture built on top of the Hadoop file system, providing data ETL (extract, transform, load) tools, data storage management, and the ability to query and analyze large data sets.
In plain terms: Hive manages the data stored in HDFS and provides a SQL-based query language on top of it.
Development essence: Hive is designed to operate on HDFS data through MapReduce.
Hive core: Hive data management covers:
(1) Metadata storage: Hive stores its metadata in an RDBMS such as Derby or MySQL.
Clients access the metadata database through the MetaStoreServer using the Thrift protocol.
(2) Data storage: all Hive data is stored in HDFS, and most queries are executed by MapReduce. Users are free to organize their Hive tables; they only need to tell Hive the column and row delimiters when creating a table, and Hive can then parse the data.
Hive has four data models: tables, external tables, partitions, and buckets.
For example, suppose there is a table named accesslog partitioned by event_day, and Hive's data storage location on HDFS is /user/hive/warehouse.
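A minimal sketch of what such a partitioned table could look like (the table name accesslog, the event_day partition column and the warehouse path come from the example above; the columns and the sample file path are hypothetical):

-- Each partition maps to its own HDFS directory,
-- e.g. /user/hive/warehouse/accesslog/event_day=20150512
create table accesslog (
    remote_addr string,
    request_uri string,
    status      string
)
partitioned by (event_day string)
row format delimited fields terminated by '\t';

-- Loading a file into one partition creates the corresponding directory
load data local inpath '/tmp/access-20150512.log'
overwrite into table accesslog partition (event_day = '20150512');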
(3) Data exchange: there are three kinds of interfaces: the client, the database interface, and the web UI.
Part Two: Installation
Step 1: Install Hadoop
See "Setting up a Hadoop pseudo-distributed cluster and getting started with MapReduce programming" (to be continued).
Step 2: Install Hive
[hadoop@hadoop datas]$ wget http://mirror.bit.edu.cn/apache/hive/stable/apache-hive-1.1.0-bin.tar.gz
[hadoop@hadoop datas]$ tar zxvf apache-hive-1.1.0-bin.tar.gz
The unpacked package contains bin (the hive, HiveServer2 and beeline clients), conf (hive-site.xml), lib (jars) and hcatalog (the HCatalog server); Hive logs go to /tmp/root/hive.log by default.
[hadoop@hadoop datas]$ cp -r apache-hive-1.1.0-bin hive
[hadoop@hadoop datas]$ chmod g+w hive/
Step 3: Hive configuration - point Hive at the Hadoop installation
[hadoop@hadoop datas]$ cp conf/hive-env.sh.template conf/hive-env.sh
[hadoop@hadoop datas]$ vim conf/hive-env.sh
HADOOP_HOME=/home/tanjiti/hadoop-2.6.0   # replace with the directory where Hadoop is installed
Step 4: Hive configuration - set where Hive stores its data in HDFS
cp conf/hive-default.xml.template conf/hive-site.xml
vim conf/hive-site.xml

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>   <!-- Hive data directory; this is a location on HDFS -->
  <description>location of default database for the warehouse</description>
</property>
<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/hive</value>   <!-- Hive temporary/scratch data directory -->
  <description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/<username> is created, with ${hive.scratch.dir.permission}.</description>
</property>

Create the corresponding directories in HDFS:
bin/hdfs dfs -mkdir /user/hive/
bin/hdfs dfs -mkdir /user/hive/warehouse
bin/hdfs dfs -chmod g+w /user/hive/warehouse
bin/hdfs dfs -chmod g+w /tmp
Step 5: Hive configuration - store Hive metadata in MySQL. Hive needs to keep its metadata in an RDBMS; by default it is configured to use the Derby database.
vim conf/hive-site.xml

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>   <!-- connection string Hive uses to reach the metastore database -->
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>   <!-- entry class of the JDBC driver -->
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>   <!-- database username -->
  <value>APP</value>
  <description>Username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>   <!-- database password -->
  <value>mine</value>
  <description>password to use against metastore database</description>
</property>

The Derby JDBC driver ships with Hive at lib/derby-10.11.1.1.jar.

Next, we configure MySQL to store the metastore.
1. Install MySQL
apt-get install mysql-server
2. Create an account with a username and password
create user 'hive'@'%' identified by 'hive';
3. Grant privileges
grant all privileges on *.* to 'hive'@'%' with grant option;
4. Make the changes take effect
flush privileges;
5. Create the database
create database hive;
A lesson learned the hard way, three times over (added July 9, 2015):
Be sure to change the hive database's character set to latin1, and do it when Hive starts up for the very first time (otherwise expect things to die the moment you run a delete operation).
alter database hive character set latin1;
6. Configure Hive
vim conf/hive-site.xml

<property>
  <name>hive.metastore.local</name>
  <value>true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
  <description>Username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
  <description>password to use against metastore database</description>
</property>

7. Download the MySQL JDBC driver and place it under the lib directory
lib/mysql-connector-java-5.1.7-bin.jar
Step 6: Run the Hive command-line interface
bin/hive
Step 7: A first taste of HiveQL
hive> create table test(id int, name string) row format delimited FIELDS TERMINATED BY ',';
OK
Time taken: 0.201 seconds
hive> load data local inpath '/home/tanjiti/apache-hive-1.1.0-bin/test.data' overwrite into table test;
Loading data to table default.test
Table default.test stats: [numFiles=1, numRows=0, totalSize=25, rawDataSize=0]
OK
Time taken: 0.463 seconds
hive> select * from test;
OK
1 tanjiti
2 kokcc
3 dani
Time taken: 0.218 seconds, Fetched: 3 row(s)

We can also see the data file in Hadoop HDFS, which shows that every Hive table corresponds to a storage directory in Hadoop:
/hadoop-2.6.0/bin/hdfs dfs -cat /user/hive/warehouse/test/test.data
1,tanjiti
2,kokcc
3,dani
I seem to have a talent for attracting bugs; I ran into a pile of errors.
Error 1:
[ERROR] Terminal initialization failed; falling back to unsupported
java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
at jline.TerminalFactory.create(TerminalFactory.java:101)
at jline.TerminalFactory.get(TerminalFactory.java:158)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:229)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:221)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:209)
at org.apache.hadoop.hive.cli.CliDriver.getConsoleReader(CliDriver.java:773)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:715)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
at jline.console.ConsoleReader.<init>(ConsoleReader.java:230)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:221)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:209)
at org.apache.hadoop.hive.cli.CliDriver.getConsoleReader(CliDriver.java:773)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:715)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Cause: jline version conflict
find ../ -name jline*
../hadoop-2.6.0/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jline-0.9.94.jar
../hadoop-2.6.0/share/hadoop/yarn/lib/jline-0.9.94.jar
../hadoop-2.6.0/share/hadoop/kms/tomcat/webapps/kms/WEB-INF/lib/jline-0.9.94.jar
../apache-hive-1.1.0-bin/lib/jline-2.12.jar
Solution: http://stackoverflow.com/questions/28997441/hive-startup-error-terminal-initialization-failed-falling-back-to-unsupporte
export HADOOP_USER_CLASSPATH_FIRST=true
Error 2:
Exception in thread "main" java.lang.RuntimeException: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: {system:java.io.tmpdir%7D/%7Bsystem:user.name%7D
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:472)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: {system:java.io.tmpdir%7D/%7Bsystem:user.name%7D
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:515)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:458)
… 8 more
Caused by: java.net.URISyntaxException: Relative path in absolute URI: {system:java.io.tmpdir%7D/%7Bsystem:user.name%7D
at java.net.URI.checkPath(URI.java:1804)
at java.net.URI.<init>(URI.java:752)
at org.apache.hadoop.fs.Path.initialize(Path.java:203)
… 11 more

Solution: replace ${system:java.io.tmpdir}/${system:user.name} with absolute paths
vim conf/hive-site.xml

<property>
  <name>hive.exec.local.scratchdir</name>
  <value>/tmp/hive</value>
  <description>Local scratch space for Hive jobs</description>
</property>
<property>
  <name>hive.downloaded.resources.dir</name>
  <value>/tmp/${hive.session.id}_resources</value>
  <description>Temporary local directory for added resources in the remote file system.</description>
</property>
<property>
  <name>hive.querylog.location</name>
  <value>/tmp/hive</value>
  <description>Location of Hive run time structured log file</description>
</property>
<property>
  <name>hive.server2.logging.operation.log.location</name>
  <value>/tmp/hive/operation_logs</value>
  <description>Top level directory where operation logs are stored if logging functionality is enabled</description>
</property>

Error 3:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:javax.jdo.JDODataStoreException: An exception was thrown while adding/validating class(es) : Specified key was too long; max key length is 767 bytes
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes
Solution:
http://zh.hortonworks.com/community/forums/topic/hive-check-execute-failed-hdp1-3/
alter database hive character set latin1;
Error 4:

An error caused by a Java version that is too old; use Java 7 or later (Java 6's place in the Java world is roughly that of IE6 among browsers).
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/hadoop/hdfs/server/namenode/NameNode : Unsupported major.minor version 51.0
Part Three: Common HiveQL and Writing UDFs
In a word: it is very, very similar to MySQL, so learning it is painless.
1. Common Hive client commands
hive -e "SQL statement";    run a query given on the command line
hive -f test.hql    run a Hive query from a file
hive> ! pwd;    run a simple shell command
hive> dfs -ls /user/hive/warehouse;    run a Hadoop dfs command
2. Data types supported by Hive
They include primitive types and collection types; for log analysis, string, bigint, double, and map are usually all you need.
3. Hive's default delimiters for splitting text files
\n separates rows and Ctrl-A separates fields (columns), but in the end you will usually specify the delimiters yourself, as in the sketch below.
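A minimal sketch combining the two items above (the table and column names are hypothetical): it declares string/bigint/double/map columns and spells out the delimiters explicitly instead of relying on the \n / Ctrl-A defaults:

-- Hypothetical table using the common log-analysis types and explicit delimiters
create table demo_log (
    host     string,
    bytes    bigint,
    duration double,
    params   map<string, string>
)
row format delimited
    fields terminated by '\t'             -- column delimiter (the default is Ctrl-A, \001)
    collection items terminated by ','    -- separates the entries of the map
    map keys terminated by '='            -- separates key from value inside an entry
lines terminated by '\n';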
4. Hive session settings
set hive.cli.print.current.db=true;    show the current database in the prompt
set hive.cli.print.header=true;    print column headers
set hive.exec.mode.local.auto=true;    local mode, which avoids launching MapReduce jobs; useful when the data volume is small
set hive.mapred.mode=strict;    strict mode; outside local mode it enforces query patterns that protect performance, e.g. WHERE must filter on a partition column and ORDER BY must be paired with LIMIT, as in the example below
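A query satisfying both strict-mode constraints might look like this (it assumes the partitioned accesslog table sketched in Part One; the date value is made up):

-- Partition filter plus ORDER BY with LIMIT, as strict mode requires
select remote_addr, count(*) as hits
from accesslog
where event_day = '20150512'    -- strict mode: must filter on the partition column
group by remote_addr
order by hits desc
limit 10;                       -- strict mode: ORDER BY requires LIMIT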
5. HiveQL: we will get familiar with HiveQL by analyzing web logs
Sample log line:
127.0.0.1 [12/May/2015:15:16:30 +0800] sqli(194) BAN(226) 403 174 POST "/wp-content/plugins/store-locator-le/downloadcsv.php" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0" "-" "-" "-" "query=addr,,1%26%2339;union(select*from((select%20md5(3.1415))a1%20join(select%202)a2))#" "application/x-www-form-urlencoded"
remote_addr    client IP address
time_local    time
attack_type    attack type (type ID)
ban_type    event handling type (event response)
status    HTTP response code
body_bytes_sent    number of body bytes sent
request_method    HTTP request method
request_uri    HTTP request URI
http_user_agent    User-Agent request header
http_x_forwarded_for    X-Forwarded-For request header
http_referer    Referer request header
http_cookie    Cookie request header
request_body    request body
http_content_type    Content-Type request header
Step 1: Create the database weblog
hive> create database if not exists weblog comment 'holds all web logs';
The database is stored at:
hive> dfs -ls /user/hive/warehouse/;
drwxrwxr-x - root supergroup 0 2015-05-12 15:01 /user/hive/warehouse/weblog.db
Step 2: Create the table nginxlog to hold the raw logs
hive> use default;
hive> create table nginxlog(remote_addr string, time_local string, attack_type string, ban_type string, status string, body_bytes_sent string, request_method string, request_uri string, http_user_agent string, http_x_forwarded_for string, http_referer string, http_cookie string, request_body string, http_content_type string) row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' with serdeproperties("input.regex" = "(\d+\.\d+\.\d+\.\d+)\s+(\[[^\]]+\])\s+(\w+\(\d*\))\s+(\w+\(\d*\))\s+(\d{3})\s+(\d+)\s+([A-Z]+)\s+\\"([^\"]+)\\"\s+\\"([^\"]+)\\"\s+\\"([^\"]+)\\"\s+\\"([^\"]+)\\"\s+\\"([^\"]+)\\"\s+\\"([^\"]+)\\"\s+\\"([^\"]+)\\"") stored as textfile;
This input regex cut my HP by 99%.
Lesson: double escaping, double escaping, double escaping. Important things are said three times.
It also taught me how to change a table's SerDe properties.
Background: Hive uses an InputFormat object to split the input stream into records and an OutputFormat object to format records into the output stream, and it uses a SerDe (serializer/deserializer configuration) to parse records into columns when reading and to encode columns into records when writing.
hive> alter table nginxlog
    set serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
    with serdeproperties("input.regex" = "(\d+\.\d+\.\d+\.\d+)\s+(\[[^\]]+\])\s+(\w*\(\d*\))\s+(\w*\(\d*\))\s+(\d{3})\s+(\d+)\s+([A-Z]+)\s+\\"([^\"]+)\\"\s+\\"([^\"]+)\\"\s+\\"([^\"]+)\\"\s+\\"([^\"]+)\\"\s+\\"([^\"]+)\\"\s+\\"([^\"]+)\\"\s+\\"([^\"]+)\\"");
Once the table has been created, we can see its storage location in Hadoop:
hive> dfs -ls /user/hive/warehouse/weblog.db/nginxlog;
Found 1 items
-rwxrwxr-x 1 root supergroup 1861896 2015-05-12 20:22 /user/hive/warehouse/weblog.db/nginxlog/access.log
We can inspect the table's structure:
hive> describe nginxlog;
OK
remote_addr string
time_local string
attack_type string
ban_type string
status string
body_bytes_sent string
request_method string
request_uri string
http_user_agent string
http_x_forwarded_for string
http_referer string
http_cookie string
request_body string
http_content_type string
Time taken: 0.055 seconds, Fetched: 14 row(s)
Step 3: Load the raw log file
load data local inpath “/home/tanjiti/nginx/logs/access.log” overwrite into table nginxlog;
Step 4: Create another table to hold the URL-parsed data
create table urlparse(
    request_uri string,
    requestfilename string,
    param map
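The original statement is cut off at this point. A hedged guess at how the table and the URL-parsing query might be completed (the map type parameters and the whole insert query are assumptions, not the original; parse_url, str_to_map and concat are standard Hive built-ins):

-- Hypothetical completion of the urlparse table
create table urlparse(
    request_uri string,
    requestfilename string,
    param map<string, string>
);

-- One possible way to populate it from nginxlog: parse_url needs a full URL,
-- so a placeholder scheme and host are prepended to the request URI
insert overwrite table urlparse
select request_uri,
       parse_url(concat('http://localhost', request_uri), 'PATH'),
       str_to_map(parse_url(concat('http://localhost', request_uri), 'QUERY'), '&', '=')
from nginxlog;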
