Hive, HDFS, MySQL

Latest recommended article published 2024-10-10 19:25:11

License: CC 4.0 BY-SA

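This article shows how to load data into Hive from the local filesystem or from HDFS with Hive SQL statements on a Cloudera Manager cluster, and covers creating tables and partitioned tables, using dynamic partitions, and writing UDF and UDAF functions.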

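Loading a local file into Hive

I run Cloudera Manager in a virtual machine. Even with 8 GB of memory allocated it did not run well, so a VM with less than 10 GB is probably hard to work with.
Before loading any data, create the table.

Table creation statement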

create table order_mulit_partition(order_number string,event_time string)
PARTITIONED BY(event_month string, step string) 
row format delimited 
fields terminated by '\t';

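order_mulit_partition is the table name.

order_number string,event_time string declares two columns; in each pair the column name comes first and the column type second.
PARTITIONED BY(event_month string, step string) declares the partition columns. A query that filters on a partition column only has to read the matching partitions, which speeds it up.
row format delimited tells Hive that each row is a line of delimited text.
fields terminated by '\t'; sets the tab character as the separator between fields.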
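As a rough illustration of what the row format means for the data file, the sketch below splits one line on the tab character the way the table definition describes. The class DelimitedRowSketch, the method splitFields, and the sample line are all hypothetical and not part of Hive. Note that only order_number and event_time are stored in the file; the partition columns come from the PARTITION clause of the load statement.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hive: mimics how one line of a
// tab-delimited text file maps onto the two data columns of the table.
class DelimitedRowSketch {

    // Split one line on the tab character; the -1 limit keeps trailing empty fields.
    static List<String> splitFields(String line) {
        return Arrays.asList(line.split("\t", -1));
    }

    public static void main(String[] args) {
        // Made-up sample line: order_number <TAB> event_time
        List<String> fields = splitFields("10703007267488\t2014-05-01 06:01:12");
        System.out.println("order_number = " + fields.get(0));
        System.out.println("event_time   = " + fields.get(1));
    }
}
```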

Loading files

Loading from the local filesystem

load data local inpath '/usr/local/test/order_created.txt' overwrite into table order_mulit_partition 
PARTITION(event_month='201903', step='1');

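Loading from HDFS

Without the local keyword the path is resolved on HDFS, and the file is moved into the table's directory rather than copied.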

load data  inpath '/usr/local/test/order_created.txt' overwrite into table order_mulit_partition 
PARTITION(event_month='201903', step='1');

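How Hive maps to HDFS

I ran the load twice with the same event_month and with step set to 1 and then 2.
It is often said that loading data into Hive is the same as loading it into HDFS. The screenshot above shows why: I only ran the statements that load the file into the Hive table, yet the corresponding files also appeared in Hadoop. They are under /user/hive/warehouse/. Because the table is partitioned, the order_mulit_partition directory contains one subdirectory per partition, and the data files sit inside those. This is also where the benefit of partitioning shows: records are grouped by partition value, so a query spends less time filtering.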
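The directory layout described above can be sketched as a small path builder. It assumes the default database and the warehouse root /user/hive/warehouse mentioned in the text; the class PartitionPathSketch and the method partitionDir are hypothetical names used only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hive: builds the HDFS directory that
// holds the files of one partition of a table in the default database.
class PartitionPathSketch {

    // Assumed warehouse root, taken from the path mentioned in the text.
    static final String WAREHOUSE = "/user/hive/warehouse";

    // Each partition column adds one "name=value" directory level, in declaration order.
    static String partitionDir(String table, Map<String, String> partitionSpec) {
        StringBuilder path = new StringBuilder(WAREHOUSE).append('/').append(table);
        for (Map.Entry<String, String> e : partitionSpec.entrySet()) {
            path.append('/').append(e.getKey()).append('=').append(e.getValue());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        Map<String, String> spec = new LinkedHashMap<>(); // keeps insertion order
        spec.put("event_month", "201903");
        spec.put("step", "1");
        System.out.println(partitionDir("order_mulit_partition", spec));
        // prints /user/hive/warehouse/order_mulit_partition/event_month=201903/step=1
    }
}
```

Loading the same file with step='2' would, under the same assumptions, land in a sibling directory ending in step=2.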

The records in the Hive table are shown below.
[Image: records in the Hive table]

Dynamic partitions

Create the table

create table order1(order_number string,event_time string)
PARTITIONED BY(event_month string, step string) 
row format delimited 
fields terminated by '\t';

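Note: run these two settings before the insert. The first allows every partition column to be dynamic, and the second enables dynamic partitioning.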


 set hive.exec.dynamic.partition.mode=nonstrict;
 set hive.exec.dynamic.partition=true;

Copying data from one table into another

from order_mulit_partition
insert into table order1
partition(event_month,step)
select order_number, event_time,event_month,step;
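With dynamic partitioning, the trailing columns of the select list (event_month and step here) decide which partition each row goes to. The sketch below imitates that routing in plain Java; DynamicPartitionSketch, route, and the sample rows in the usage note are hypothetical and only illustrate the idea, not Hive's actual implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Hive code: groups rows by the values of
// their trailing partition columns, as a dynamic-partition insert does.
class DynamicPartitionSketch {

    // Each input row is {order_number, event_time, event_month, step},
    // the same column order as the select list in the statement above.
    static Map<String, List<String[]>> route(List<String[]> rows) {
        Map<String, List<String[]>> partitions = new LinkedHashMap<>();
        for (String[] row : rows) {
            // The partition is named after the values found in the row itself.
            String key = "event_month=" + row[2] + "/step=" + row[3];
            // Only the data columns are kept inside the partition.
            partitions.computeIfAbsent(key, k -> new ArrayList<>())
                      .add(new String[] { row[0], row[1] });
        }
        return partitions;
    }
}
```

Calling route with rows whose step values are 1 and 2 yields two keys, event_month=201903/step=1 and event_month=201903/step=2, without either value being named in advance.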

Writing a UDF


package HiveUDF.HiveUDF;

import org.apache.hadoop.hive.ql.exec.UDF;

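/**
 * Simple Hive UDF with three overloads of evaluate.
 */
public class App extends UDF {

	// Return the sum of two integers
	public int evaluate(int a, int b) {
		return a + b;
	}

	// Overload: convert a string to lower case
	public String evaluate(String input) {
		if (input == null) {
			return null; // a NULL argument stays NULL instead of throwing
		}
		return input.toLowerCase();
	}

	// Overload: concatenate two strings
	public String evaluate(String a, String b) {
		if (a == null || b == null) {
			return null; // avoid producing the literal text "null"
		}
		return a + b;
	}
}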

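Every method is named evaluate, because Hive calls the UDF through its evaluate method and picks the overload that matches the argument types.

In the Hive shell, run add jar /usr/local/fuchanghai/HiveUDF.jar

Then create the function: create temporary function addValue as 'HiveUDF.HiveUDF.App';
HiveUDF.HiveUDF.App is the package name plus the class name from the code above. If it is wrong, Hive reports that the class cannot be found.
My machine ran out of memory and froze, so I could not get the result of the run.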
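Since the cluster run did not finish, the logic of the three overloads can at least be checked locally. The sketch below is a plain-Java copy of their core logic with no Hive dependency; the class name EvaluateOverloadsSketch is hypothetical, and it only shows how Java selects an overload by argument types, not how Hive invokes the function.

```java
// Hypothetical stand-in for the UDF class, without the Hive dependency,
// so the overloads can be exercised on a machine with no cluster.
class EvaluateOverloadsSketch {

    // Sum of two integers
    public int evaluate(int a, int b) {
        return a + b;
    }

    // Lower-case a string
    public String evaluate(String input) {
        return input.toLowerCase();
    }

    // Concatenate two strings
    public String evaluate(String a, String b) {
        return a + b;
    }

    public static void main(String[] args) {
        EvaluateOverloadsSketch udf = new EvaluateOverloadsSketch();
        System.out.println(udf.evaluate(1, 2));          // prints 3
        System.out.println(udf.evaluate("HELLO"));       // prints hello
        System.out.println(udf.evaluate("foo", "bar"));  // prints foobar
    }
}
```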

UDAF


package HiveUDF.HiveUDF;

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

public class UDAFTest extends UDAF {
	public static class Evaluator implements UDAFEvaluator {

		// Running total score per student. This is per-instance state (not static),
		// so separate evaluator instances do not share one map.
		private Map<String, Integer> ret;

		public Evaluator() {
			super();
			init();
		}

		// Reset the aggregation state
		public void init() {
			ret = new HashMap<String, Integer>();
		}

		// Map side: called once per input row, adds the score to the student's total
		public boolean iterate(String strStudent, int nScore) {
			Integer nValue = ret.get(strStudent);
			ret.put(strStudent, nValue == null ? nScore : nValue + nScore);
			return true;
		}

		// Partial result, the equivalent of a combiner's output in Hadoop MapReduce
		public Map<String, Integer> terminatePartial() {
			return ret;
		}

		// Reduce side: fold another partial result into this one,
		// summing the totals of students that appear in both
		public boolean merge(Map<String, Integer> other) {
			if (other == null) {
				return true;
			}
			for (Map.Entry<String, Integer> e : other.entrySet()) {
				Integer nValue = ret.get(e.getKey());
				ret.put(e.getKey(), nValue == null ? e.getValue() : nValue + e.getValue());
			}
			return true;
		}

		// Return the final result
		public Map<String, Integer> terminate() {
			return ret;
		}
	}
}

Run: select function_name(studentNum, score) from table_name
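To see how iterate, terminatePartial, merge, and terminate fit together, here is a plain-Java sketch that simulates two map-side evaluators and one reduce-side evaluator. It has no Hive dependency; ScoreAggregatorSketch and the sample students are hypothetical, and the sketch assumes that merge should add partial totals together rather than overwrite them.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the evaluator, without the Hive dependency.
class ScoreAggregatorSketch {

    private final Map<String, Integer> totals = new HashMap<>();

    // Called once per input row on the map side.
    void iterate(String student, int score) {
        totals.merge(student, score, Integer::sum);
    }

    // Partial result handed from the map side to the reduce side.
    Map<String, Integer> terminatePartial() {
        return totals;
    }

    // Fold another partial result into this one by adding the totals.
    void merge(Map<String, Integer> other) {
        other.forEach((student, total) -> totals.merge(student, total, Integer::sum));
    }

    // Final result.
    Map<String, Integer> terminate() {
        return totals;
    }

    public static void main(String[] args) {
        ScoreAggregatorSketch taskA = new ScoreAggregatorSketch();
        taskA.iterate("s1", 80);
        taskA.iterate("s2", 70);

        ScoreAggregatorSketch taskB = new ScoreAggregatorSketch();
        taskB.iterate("s1", 15);

        ScoreAggregatorSketch reducer = new ScoreAggregatorSketch();
        reducer.merge(taskA.terminatePartial());
        reducer.merge(taskB.terminatePartial());

        System.out.println(reducer.terminate().get("s1")); // prints 95
        System.out.println(reducer.terminate().get("s2")); // prints 70
    }
}
```

If merge overwrote values instead of adding them, s1 would end up with the total from whichever partial result was merged last, which is why the summing variant is used here.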

Sqoop

Basic syntax

From MySQL to HDFS

Put the import arguments in an options file (test.opt here), one option or value per line:


import
--connect
jdbc:mysql://cm1:3306/mysql
--username
root
--password
123456
--table
db
--target-dir
/user/sqoop/aa
--where
"Db='scm'"
--num-mappers
1
--null-string
'0'
--null-non-string
'0'

Then run the import with the options file:

sqoop --options-file ./test.opt
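An options file is read as a list of command-line arguments, one per line. The sketch below shows that idea by turning such lines into an argument list, skipping blank lines and lines that start with #. OptionsFileSketch and toArgs are hypothetical names; this is not Sqoop's own parser, and it does not handle the quoting rules that Sqoop applies to values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not Sqoop code: turns the lines of an
// options file into the equivalent list of command-line arguments.
class OptionsFileSketch {

    static List<String> toArgs(List<String> lines) {
        List<String> args = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            // Skip blank lines and comment lines.
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            args.add(trimmed);
        }
        return args;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "import", "--connect", "jdbc:mysql://cm1:3306/mysql",
            "--table", "db", "--target-dir", "/user/sqoop/aa");
        System.out.println("sqoop " + String.join(" ", toArgs(lines)));
        // prints sqoop import --connect jdbc:mysql://cm1:3306/mysql --table db --target-dir /user/sqoop/aa
    }
}
```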