Big Data: Installing Hive and an Example of Web Log Analysis
Data search and analysis is an essential skill for any ops engineer.
"In reality, 90+% of MR jobs are generated by Hive SQL."
Part One: Theory Warm-up
One-sentence introduction: a distributed, column-oriented data warehouse architecture built on top of the Hadoop file system, providing data ETL (extract, transform, load) tools, data storage management, and the ability to query and analyze large data sets.
In plain terms: Hive manages the data stored in HDFS and provides a SQL-based query language on top of it.
Development essence: Hive is designed to operate on HDFS data through MapReduce.
Hive core: Hive data management covers:
(1) Metadata storage: Hive stores its metadata in an RDBMS such as Derby or MySQL.
Clients access the metadata database through the MetaStoreServer using the Thrift protocol.
(2) Data storage: all Hive data is stored in HDFS, and most queries are executed by MapReduce. Users are free to organize their Hive tables; they only need to tell Hive the column and row delimiters when creating a table, and Hive can then parse the data.
Hive has four data models: tables, external tables, partitions, and buckets.
For example, suppose there is a table named accesslog partitioned by event_day, and Hive's data storage location on HDFS is /user/hive/warehouse.
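A minimal sketch of what such a partitioned table could look like (the table name accesslog, the event_day partition column and the warehouse path come from the example above; the columns and the sample file path are hypothetical):

-- Each partition maps to its own HDFS directory,
-- e.g. /user/hive/warehouse/accesslog/event_day=20150512
create table accesslog (
    remote_addr string,
    request_uri string,
    status      string
)
partitioned by (event_day string)
row format delimited fields terminated by '\t';

-- Loading a file into one partition creates the corresponding directory
load data local inpath '/tmp/access-20150512.log'
overwrite into table accesslog partition (event_day = '20150512');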
(3) Data exchange: there are three kinds of interfaces: the client, the database interface, and the web UI.
Part Two: Installation
Step 1: Install Hadoop
See "Setting up a Hadoop pseudo-distributed cluster and getting started with MapReduce programming" (to be continued).
Step 2: Install Hive
[hadoop@hadoop datas]$ wget http://mirror.bit.edu.cn/apache/hive/stable/apache-hive-1.1.0-bin.tar.gz
[hadoop@hadoop datas]$ tar zxvf apache-hive-1.1.0-bin.tar.gz
The unpacked package contains bin (the hive, HiveServer2 and beeline clients), conf (hive-site.xml), lib (jars) and hcatalog (the HCatalog server); Hive logs go to /tmp/root/hive.log by default.
[hadoop@hadoop datas]$ cp -r apache-hive-1.1.0-bin hive
[hadoop@hadoop datas]$ chmod g+w hive/
Step 3: Hive configuration - point Hive at the Hadoop installation
[hadoop@hadoop datas]$ cp conf/hive-env.sh.template conf/hive-env.sh
[hadoop@hadoop datas]$ vim conf/hive-env.sh
HADOOP_HOME=/home/tanjiti/hadoop-2.6.0   # replace with the directory where Hadoop is installed
Step 4: Hive configuration - set where Hive stores its data in HDFS
cp conf/hive-default.xml.template conf/hive-site.xml
vim conf/hive-site.xml

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>   <!-- Hive data directory; this is a location on HDFS -->
  <description>location of default database for the warehouse</description>
</property>
<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/hive</value>   <!-- Hive temporary/scratch data directory -->
  <description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/<username> is created, with ${hive.scratch.dir.permission}.</description>
</property>

Create the corresponding directories in HDFS:
bin/hdfs dfs -mkdir /user/hive/
bin/hdfs dfs -mkdir /user/hive/warehouse
bin/hdfs dfs -chmod g+w /user/hive/warehouse
bin/hdfs dfs -chmod g+w /tmp
Step 5: Hive configuration - store Hive metadata in MySQL. Hive needs to keep its metadata in an RDBMS; by default it is configured to use the Derby database.
vim conf/hive-site.xml

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>   <!-- connection string Hive uses to reach the metastore database -->
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>   <!-- entry class of the JDBC driver -->
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>   <!-- database username -->
  <value>APP</value>
  <description>Username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>   <!-- database password -->
  <value>mine</value>
  <description>password to use against metastore database</description>
</property>

The Derby JDBC driver ships with Hive at lib/derby-10.11.1.1.jar.

Next, we configure MySQL to store the metastore.
1. Install MySQL
apt-get install mysql-server
2. Create an account with a username and password
create user 'hive'@'%' identified by 'hive';
3. Grant privileges
grant all privileges on *.* to 'hive'@'%' with grant option;
4. Make the changes take effect
flush privileges;
5. Create the database
create database hive;
A lesson learned the hard way, three times over (added July 9, 2015):
Be sure to change the hive database's character set to latin1, and do it when Hive starts up for the very first time (otherwise expect things to die the moment you run a delete operation).
alter database hive character set latin1;
6. Configure Hive
vim conf/hive-site.xml

<property>
  <name>hive.metastore.local</name>
  <value>true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
  <description>Username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
  <description>password to use against metastore database</description>
</property>

7. Download the MySQL JDBC driver and place it under the lib directory
lib/mysql-connector-java-5.1.7-bin.jar
Step 6: Run the Hive command-line interface
bin/hive
Step 7: A first taste of HiveQL
hive> create table test(id int, name string) row format delimited FIELDS TERMINATED BY ',';
OK
Time taken: 0.201 seconds
hive> load data local inpath '/home/tanjiti/apache-hive-1.1.0-bin/test.data' overwrite into table test;
Loading data to table default.test
Table default.test stats: [numFiles=1, numRows=0, totalSize=25, rawDataSize=0]
OK
Time taken: 0.463 seconds
hive> select * from test;
OK
1 tanjiti
2 kokcc
3 dani
Time taken: 0.218 seconds, Fetched: 3 row(s)

We can also see the data file in Hadoop HDFS, which shows that every Hive table corresponds to a storage directory in Hadoop:
/hadoop-2.6.0/bin/hdfs dfs -cat /user/hive/warehouse/test/test.data
1,tanjiti
2,kokcc
3,dani
I seem to have a talent for attracting bugs; I ran into a pile of errors.
Error 1:
[ERROR] Terminal initialization failed; falling back to unsupported
java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
at jline.TerminalFactory.create(TerminalFactory.java:101)
at jline.TerminalFactory.get(TerminalFactory.java:158)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:229)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:221)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:209)
at org.apache.hadoop.hive.cli.CliDriver.getConsoleReader(CliDriver.java:773)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:715)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
at jline.console.ConsoleReader.<init>(ConsoleReader.java:230)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:221)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:209)
at org.apache.hadoop.hive.cli.CliDriver.getConsoleReader(CliDriver.java:773)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:715)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Cause: jline version conflict
find ../ -name jline*
../hadoop-2.6.0/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jline-0.9.94.jar
../hadoop-2.6.0/share/hadoop/yarn/lib/jline-0.9.94.jar
../hadoop-2.6.0/share/hadoop/kms/tomcat/webapps/kms/WEB-INF/lib/jline-0.9.94.jar
../apache-hive-1.1.0-bin/lib/jline-2.12.jar
Solution: http://stackoverflow.com/questions/28997441/hive-startup-error-terminal-initialization-failed-falling-back-to-unsupporte
export HADOOP_USER_CLASSPATH_FIRST=true
Error 2:
Exception in thread "main" java.lang.RuntimeException: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: {system:java.io.tmpdir%7D/%7Bsystem:user.name%7D
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:472)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: {system:java.io.tmpdir%7D/%7Bsystem:user.name%7D
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:515)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:458)
… 8 more
Caused by: java.net.URISyntaxException: Relative path in absolute URI: {system:java.io.tmpdir%7D/%7Bsystem:user.name%7D
at java.net.URI.checkPath(URI.java:1804)
at java.net.URI.<init>(URI.java:752)
at org.apache.hadoop.fs.Path.initialize(Path.java:203)
… 11 more

Solution: replace ${system:java.io.tmpdir}/${system:user.name} with absolute paths
vim conf/hive-site.xml

<property>
  <name>hive.exec.local.scratchdir</name>
  <value>/tmp/hive</value>
  <description>Local scratch space for Hive jobs</description>
</property>
<property>
  <name>hive.downloaded.resources.dir</name>
  <value>/tmp/${hive.session.id}_resources</value>
  <description>Temporary local directory for added resources in the remote file system.</description>
</property>
<property>
  <name>hive.querylog.location</name>
  <value>/tmp/hive</value>
  <description>Location of Hive run time structured log file</description>
</property>
<property>
  <name>hive.server2.logging.operation.log.location</name>
  <value>/tmp/hive/operation_logs</value>
  <description>Top level directory where operation logs are stored if logging functionality is enabled</description>
</property>

Error 3:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:javax.jdo.JDODataStoreException: An exception was thrown while adding/validating class(es) : Specified key was too long; max key length is 767 bytes
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes
Solution:
http://zh.hortonworks.com/community/forums/topic/hive-check-execute-failed-hdp1-3/
alter database hive character set latin1;
Error 4:

An error caused by a Java version that is too old; use Java 7 or later (Java 6's place in the Java world is roughly that of IE6 among browsers).
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/hadoop/hdfs/server/namenode/NameNode : Unsupported major.minor version 51.0
Part Three: Common HiveQL and Writing UDFs
In a word: it is very, very similar to MySQL, so learning it is painless.
1. Common Hive client commands
hive -e "SQL statement";    run a query given on the command line
hive -f test.hql    run a Hive query from a file
hive> ! pwd;    run a simple shell command
hive> dfs -ls /user/hive/warehouse;    run a Hadoop dfs command
2. Data types supported by Hive
They include primitive types and collection types; for log analysis, string, bigint, double, and map are usually all you need.
3. Hive's default delimiters for splitting text files
\n separates rows and Ctrl-A separates fields (columns), but in the end you will usually specify the delimiters yourself, as in the sketch below.
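A minimal sketch combining the two items above (the table and column names are hypothetical): it declares string/bigint/double/map columns and spells out the delimiters explicitly instead of relying on the \n / Ctrl-A defaults:

-- Hypothetical table using the common log-analysis types and explicit delimiters
create table demo_log (
    host     string,
    bytes    bigint,
    duration double,
    params   map<string, string>
)
row format delimited
    fields terminated by '\t'             -- column delimiter (the default is Ctrl-A, \001)
    collection items terminated by ','    -- separates the entries of the map
    map keys terminated by '='            -- separates key from value inside an entry
lines terminated by '\n';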
4. Hive session settings
set hive.cli.print.current.db=true;    show the current database in the prompt
set hive.cli.print.header=true;    print column headers
set hive.exec.mode.local.auto=true;    local mode, which avoids launching MapReduce jobs; useful when the data volume is small
set hive.mapred.mode=strict;    strict mode; outside local mode it enforces query patterns that protect performance, e.g. WHERE must filter on a partition column and ORDER BY must be paired with LIMIT, as in the example below
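A query satisfying both strict-mode constraints might look like this (it assumes the partitioned accesslog table sketched in Part One; the date value is made up):

-- Partition filter plus ORDER BY with LIMIT, as strict mode requires
select remote_addr, count(*) as hits
from accesslog
where event_day = '20150512'    -- strict mode: must filter on the partition column
group by remote_addr
order by hits desc
limit 10;                       -- strict mode: ORDER BY requires LIMIT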
5. HiveQL: we will get familiar with HiveQL by analyzing web logs
Sample log line:
127.0.0.1 [12/May/2015:15:16:30 +0800] sqli(194) BAN(226) 403 174 POST "/wp-content/plugins/store-locator-le/downloadcsv.php" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0" "-" "-" "-" "query=addr,,1%26%2339;union(select*from((select%20md5(3.1415))a1%20join(select%202)a2))#" "application/x-www-form-urlencoded"
remote_addr    client IP address
time_local    time
attack_type    attack type (type ID)
ban_type    event handling type (event response)
status    HTTP response code
body_bytes_sent    number of body bytes sent
request_method    HTTP request method
request_uri    HTTP request URI
http_user_agent    User-Agent request header
http_x_forwarded_for    X-Forwarded-For request header
http_referer    Referer request header
http_cookie    Cookie request header
request_body    request body
http_content_type    Content-Type request header
Step 1: Create the database weblog
hive> create database if not exists weblog comment 'holds all web logs';
The database is stored at:
hive> dfs -ls /user/hive/warehouse/;
drwxrwxr-x - root supergroup 0 2015-05-12 15:01 /user/hive/warehouse/weblog.db
Step 2: Create the table nginxlog to hold the raw logs
hive> use default;
hive> create table nginxlog(remote_addr string, time_local string, attack_type string, ban_type string, status string, body_bytes_sent string, request_method string, request_uri string, http_user_agent string, http_x_forwarded_for string, http_referer string, http_cookie string, request_body string, http_content_type string) row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' with serdeproperties("input.regex" = "(\d+\.\d+\.\d+\.\d+)\s+(\[[^\]]+\])\s+(\w+\(\d*\))\s+(\w+\(\d*\))\s+(\d{3})\s+(\d+)\s+([A-Z]+)\s+\\"([^\"]+)\\"\s+\\"([^\"]+)\\"\s+\\"([^\"]+)\\"\s+\\"([^\"]+)\\"\s+\\"([^\"]+)\\"\s+\\"([^\"]+)\\"\s+\\"([^\"]+)\\"") stored as textfile;
This input regex cut my HP by 99%.
Lesson: double escaping, double escaping, double escaping. Important things are said three times.
It also taught me how to change a table's SerDe properties.
Background: Hive uses an InputFormat object to split the input stream into records and an OutputFormat object to format records into the output stream, and it uses a SerDe (serializer/deserializer configuration) to parse records into columns when reading and to encode columns into records when writing.
hive> alter table nginxlog
    set serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
    with serdeproperties("input.regex" = "(\d+\.\d+\.\d+\.\d+)\s+(\[[^\]]+\])\s+(\w*\(\d*\))\s+(\w*\(\d*\))\s+(\d{3})\s+(\d+)\s+([A-Z]+)\s+\\"([^\"]+)\\"\s+\\"([^\"]+)\\"\s+\\"([^\"]+)\\"\s+\\"([^\"]+)\\"\s+\\"([^\"]+)\\"\s+\\"([^\"]+)\\"\s+\\"([^\"]+)\\"");
Once the table has been created, we can see its storage location in Hadoop:
hive> dfs -ls /user/hive/warehouse/weblog.db/nginxlog;
Found 1 items
-rwxrwxr-x 1 root supergroup 1861896 2015-05-12 20:22 /user/hive/warehouse/weblog.db/nginxlog/access.log
We can inspect the table's structure:
hive> describe nginxlog;
OK
remote_addr string
time_local string
attack_type string
ban_type string
status string
body_bytes_sent string
request_method string
request_uri string
http_user_agent string
http_x_forwarded_for string
http_referer string
http_cookie string
request_body string
http_content_type string
Time taken: 0.055 seconds, Fetched: 14 row(s)
Step 3: Load the raw log file
load data local inpath “/home/tanjiti/nginx/logs/access.log” overwrite into table nginxlog;
Step 4: Create another table to hold the URL-parsed data
create table urlparse(
    request_uri string,
    requestfilename string,
    param map
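The original statement is cut off at this point. A hedged guess at how the table and the URL-parsing query might be completed (the map type parameters and the whole insert query are assumptions, not the original; parse_url, str_to_map and concat are standard Hive built-ins):

-- Hypothetical completion of the urlparse table
create table urlparse(
    request_uri string,
    requestfilename string,
    param map<string, string>
);

-- One possible way to populate it from nginxlog: parse_url needs a full URL,
-- so a placeholder scheme and host are prepended to the request URI
insert overwrite table urlparse
select request_uri,
       parse_url(concat('http://localhost', request_uri), 'PATH'),
       str_to_map(parse_url(concat('http://localhost', request_uri), 'QUERY'), '&', '=')
from nginxlog;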
