When building a data warehouse with Hive on EMR, you will need to write user-defined functions (UDFs). In my experience, pulling the hive-exec jar that matches the EMR Hive version from Maven Central is problematic: a UDF built against the Maven Central jar runs fine when the execution engine is MapReduce, but throws errors as soon as the engine is switched to Tez.
So instead, I take the hive-exec jar from the Hive installation directory on the EMR cluster and use that in the Java project. The steps are as follows.
2. Log in to the master node of the EMR cluster, since Hive is installed on the EMR master node.
ssh -i ~/Downloads/dfwarehouse-test.pem hadoop@ec2-54-169-197-246.ap-southeast-1.compute.amazonaws.com
5. Under /usr/lib/hive/lib there appear to be two hive-exec jars, so which one should we download? Run this command:
ls -lh /usr/lib/hive/lib/hive-exec*
You will find that /usr/lib/hive/lib/hive-exec-2.3.3-amzn-1.jar is actually the only real jar file (the other entry is merely a link pointing to it).
6. Download /usr/lib/hive/lib/hive-exec-2.3.3-amzn-1.jar. On a Mac, run the following command to download the jar to the Documents directory:
scp -i ~/Downloads/dfwarehouse-test.pem hadoop@ec2-54-169-197-246.ap-southeast-1.compute.amazonaws.com:/usr/lib/hive/lib/hive-exec-2.3.3-amzn-1.jar ~/Documents/
7. Create a Maven project and declare the dependency below. At this point Maven will of course fail to resolve the jar, since the amzn build does not exist in Maven Central:
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>2.3.3-amzn-1</version>
</dependency>
8. In your local Maven repository (~/.m2/repository), go to the org/apache/hive/hive-exec directory and open the 2.3.3-amzn-1 directory (it should already exist after the failed resolution in step 7). Delete its contents and copy the downloaded hive-exec-2.3.3-amzn-1.jar into it.
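As an alternative to copying files by hand, the maven-install-plugin can install the jar with the correct repository layout (and a generated minimal POM). A sketch, assuming the jar was downloaded to ~/Documents in step 6:
# Install the EMR build of hive-exec into the local repository (~/.m2/repository)
mvn install:install-file \
  -Dfile=$HOME/Documents/hive-exec-2.3.3-amzn-1.jar \
  -DgroupId=org.apache.hive \
  -DartifactId=hive-exec \
  -Dversion=2.3.3-amzn-1 \
  -Dpackaging=jar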
9. In Eclipse, right-click the Maven project and run Maven --> Update Project. The project will now resolve the jar we downloaded.
10. Write the UDF class. The UDF extends org.apache.hadoop.hive.ql.exec.UDF and implements an evaluate method that upper-cases its input; the full source of MyUdf.java is in the appendix.
11. Configure packaging in pom.xml: set the finalName to dfwarehouse, declare src/main/java as a resource directory, and add the maven-shade-plugin so that the package phase produces a jar bundling the dependencies. The shade filter excludes the META-INF/*.SF, META-INF/*.DSA, and META-INF/*.RSA signature files of signed dependencies (leaving them in can make the shaded jar fail to load with signature-verification errors). The full pom.xml is in the appendix.
12. In Eclipse, choose Run As --> Maven build and enter clean package as the goals. Two jars are produced in the target directory; dfwarehouse.jar is the one that bundles the dependencies.
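The same packaging can be done outside Eclipse; a minimal sketch, assuming mvn is on the PATH and the working directory is the project root:
# Build the shaded jar; output lands in target/dfwarehouse.jar
mvn clean package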
13. Upload dfwarehouse.jar to the master node of the EMR cluster.
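For example, using scp as in step 6 (assuming the jar goes to /home/hadoop, the path used in step 14):
scp -i ~/Downloads/dfwarehouse-test.pem target/dfwarehouse.jar hadoop@ec2-54-169-197-246.ap-southeast-1.compute.amazonaws.com:/home/hadoop/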
14. On the EMR master node, run the hive command to open the Hive CLI, then register the function:
add jar /home/hadoop/dfwarehouse.jar;
create temporary function myudf as 'dfwarehouse.udf.MyUdf';
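A quick smoke test in the Hive CLI confirms the function is registered (the string literal is arbitrary):
-- should return HELLO, since the UDF upper-cases its input
select myudf('hello');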
Appendix:
pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>df</groupId>
  <artifactId>dfwarehouse</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>dfwarehouse</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-exec</artifactId>
      <version>2.3.3-amzn-1</version>
    </dependency>
  </dependencies>

  <build>
    <finalName>dfwarehouse</finalName>
    <resources>
      <resource>
        <directory>src/main/java</directory>
      </resource>
    </resources>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.3</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
MyUdf.java
package dfwarehouse.udf;

import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;

public class MyUdf extends UDF {
    // UDF that converts the input string to upper case
    public String evaluate(String str) {
        if (StringUtils.isBlank(str)) {
            return str;
        }
        return str.toUpperCase();
    }
}