Adding Elasticsearch Connectivity to a Hadoop Cluster

This article explains how to resolve the exceptions Hive throws when connecting to Elasticsearch without the required elasticsearch-hadoop-xxx.jar, walks through the manual configuration, and provides a test query.

    Without the corresponding elasticsearch-hadoop-xxx.jar on the classpath, running operations on an Elasticsearch external table in Hive throws the following exception:

Exception in thread "main" java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.; Host Details : local host is: "ip-172-17-30-146/172.17.30.146"; destination host is: "ip-172-17-30-146":9000; 
    Checking the MR logs of the job executed through Spark shows the following error:
    
.hadoop.mapreduce.v2.app.MRAppMaster failed in state INITED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.elasticsearch.hadoop.mr.EsOutputFormat not found
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.elasticsearch.hadoop.mr.EsOutputFormat not found

    This means the Yarn servers cannot find the ES-Hadoop classes, so elasticsearch-hadoop-xxx.jar needs to be added to the environment of the relevant Hadoop applications. In the environment I am currently using, the applications that need it are:

    1.Hive
    2.Spark
    3.Yarn

    Add elasticsearch-hadoop-xxx.jar to the environment of these applications on every server, then re-run the job and the error no longer appears. For one-off jobs, a lighter-weight alternative is sketched below.
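    The following is only a minimal sketch, not the method used in this article: standard Spark and Hive CLI options can supply the jar per job or per session instead of installing it cluster-wide. The class name com.example.MyJob and the file my-job.jar are placeholders, and the jar path assumes it has already been downloaded to /data/share_libs; in my case installing the jar cluster-wide is what ultimately fixed the error, so treat this only as something to try for one-off jobs.

# Pass the ES-Hadoop jar to a single Spark job (application class and jar file are placeholders)
spark-submit --jars /data/share_libs/elasticsearch-hadoop-6.2.4.jar \
    --class com.example.MyJob my-job.jar
# Start a Hive CLI session with the jar on its auxiliary classpath for this session only
hive --auxpath /data/share_libs/elasticsearch-hadoop-6.2.4.jar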

    Note: at the moment I copy elasticsearch-hadoop-6.2.4.jar to the lib directory of the Yarn servers one machine at a time. I do not know whether CDH offers a simpler management feature that lets you upload the jar directly; failing that, the copy can at least be scripted, as sketched below.
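    This is only a rough sketch, assuming passwordless SSH from the current host; node1, node2 and node3 are placeholders for your own Yarn servers, and the paths match the CDH parcel layout used in the batch of commands below.

# Copy the jar to each node and link it into the Yarn lib directory (hostnames are placeholders)
PARCEL=/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3
for host in node1 node2 node3; do
  scp /data/share_libs/elasticsearch-hadoop-6.2.4.jar ${host}:${PARCEL}/jars/
  ssh ${host} "ln -s ${PARCEL}/jars/elasticsearch-hadoop-6.2.4.jar ${PARCEL}/lib/hadoop-yarn/lib/"
done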

    To make the operation easier, I prepared a batch of commands; they are recorded below:

# /data/share_libs is my directory for shared third-party jars
cd /data/share_libs
wget https://artifacts.elastic.co/downloads/elasticsearch-hadoop/elasticsearch-hadoop-6.2.4.zip
unzip elasticsearch-hadoop-6.2.4.zip
cd elasticsearch-hadoop-6.2.4/dist
# Note: do not copy all of the elasticsearch-hadoop*.jar files over, otherwise Yarn will report version conflicts between the different jars
mv elasticsearch-hadoop-6.2.4.jar /opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/jars/
cd /data/share_libs
# Remove resources that are no longer needed
rm -f elasticsearch-hadoop-6.2.4.zip
rm -rf elasticsearch-hadoop-6.2.4
# Note: /data/share_libs is configured as the third-party library directory for Spark and as the auxlib directory for Hive,
# so once the symbolic link is created here, both Spark and Hive can use the jar
# For how to configure the third-party library directory in Spark, see my earlier article: https://blog.youkuaiyun.com/fenglibing/article/details/80437246
ln -s /opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/jars/elasticsearch-hadoop-6.2.4.jar elasticsearch-hadoop-6.2.4.jar
cd /opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/hadoop-yarn/lib
ln -s /opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/jars/elasticsearch-hadoop-6.2.4.jar elasticsearch-hadoop-6.2.4.jar
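    After running the commands, it is worth verifying on each node that the link actually resolves and that only one ES-Hadoop jar is visible to Yarn. The quick checks below are a sketch and simply reuse the paths from the commands above.

# Verify the symlink points at a real file (ls -L fails if the link is broken)
ls -lL /opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/hadoop-yarn/lib/elasticsearch-hadoop-6.2.4.jar
# Confirm that only one elasticsearch-hadoop jar is present, to avoid the version conflicts mentioned above
ls /opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/hadoop-yarn/lib/ | grep elasticsearch-hadoop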

    The following statements create an external table and then run a test query against it:

create external table test_in_es
(
    id string,
    k string,
    v string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
'es.nodes' = 'http://vpc-es-xxxxxxxxx.eu-west-1.es.amazonaws.com:80',
'es.index.auto.create' = 'false',
'es.nodes.wan.only' = 'true',
'es.resource' = 'test/test',
'es.read.metadata' = 'true',
'es.mapping.names' = 'id:_metadata._id, k:k, v:v');

select * from test_in_es;
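    A quick way to confirm the whole chain works end to end is to run the query non-interactively from the shell. This is a sketch that assumes the Hive CLI is on the PATH and that test_in_es was created in the current default database.

# Run the test query from the command line; a LIMIT keeps the output small
hive -e "select * from test_in_es limit 10;"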

    If you run into an exception like "EsHadoopIllegalArgumentException: No data nodes with HTTP-enabled available", see this article: https://blog.youkuaiyun.com/fenglibing/article/details/80478551