[5] Using HiveContext in Spark SQL (querying Hive tables, running on a server, Hadoop HA)

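This article shows how to run SQL queries with Spark's HiveContext and how to configure the environment so that Spark can use HiveQL syntax, read data from Hive tables, and call Hive UDFs. It also fixes the UnknownHostException that appears when HDFS is configured for HA.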


HiveContext adds a few features on top of the basic SQLContext: queries can be written in HiveQL, data can be read from Hive tables, and Hive UDFs are supported.

Copy hive/conf/hive-site.xml into spark/conf on every node

cd /app/hive/conf

scp hive-site.xml root@node1:/app/spark/spark-2.2.0-bin-2.9.0/conf/

scp hive-site.xml root@node2:/app/spark/spark-2.2.0-bin-2.9.0/conf/

scp hive-site.xml root@node3:/app/spark/spark-2.2.0-bin-2.9.0/conf/

scp hive-site.xml root@node4:/app/spark/spark-2.2.0-bin-2.9.0/conf/
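
The four scp commands differ only in the host name, so they can be collapsed into a loop. The following is a minimal sketch, assuming the node names node1 to node4 and the Spark path used above. The function name push_conf and the DRY_RUN switch are hypothetical helpers introduced here, not part of Hive or Spark.

```shell
# Copy one configuration file into the Spark conf directory on every node.
# push_conf and DRY_RUN are hypothetical helpers for this sketch.
SPARK_CONF_DIR=/app/spark/spark-2.2.0-bin-2.9.0/conf
NODES="node1 node2 node3 node4"

push_conf() {
  file=$1
  for node in $NODES; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Only print the command so it can be reviewed before copying.
      echo "scp $file root@$node:$SPARK_CONF_DIR/"
    else
      scp "$file" "root@$node:$SPARK_CONF_DIR/"
    fi
  done
}

# Usage (run from /app/hive/conf):
#   DRY_RUN=1 push_conf hive-site.xml   # preview the commands
#   push_conf hive-site.xml             # actually copy
```

Previewing with DRY_RUN=1 first makes it easy to spot a wrong host name or path before anything is overwritten.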

Prepare the data for the Hive emp table

cd /app/hive/testData

vi emploaddata.txt

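Separate the fields with a single tab character, since the table is created with FIELDS TERMINATED BY '\t':

1	sid	12	cq
2	zhangsan	13	bj
3	lisi	14	sh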
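
Tab characters are easy to lose when typing or pasting into an editor. As an alternative to vi, the same three rows can be generated with printf, which writes the separators explicitly. This is only a sketch; write_emp_data is a hypothetical helper name.

```shell
# Write the three sample rows with explicit tab separators.
# write_emp_data is a hypothetical helper; pass the target file as $1.
write_emp_data() {
  # printf reuses the format string for every group of four arguments,
  # so each group becomes one tab-separated line.
  printf '%s\t%s\t%s\t%s\n' \
    1 sid 12 cq \
    2 zhangsan 13 bj \
    3 lisi 14 sh > "$1"
}

# Usage:
#   write_emp_data /app/hive/testData/emploaddata.txt
```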

Start ZooKeeper

Start Hadoop

Start Hive

Create the emp table in Hive

cd /app/hive/bin

hive

CREATE TABLE IF NOT EXISTS emp (
id int, 
name String,
salary String,
destination String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Load the data into the emp table

LOAD DATA LOCAL inpath '/app/hive/testData/emploaddata.txt' OVERWRITE INTO TABLE emp;  

 

Project directory

pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.sid.com</groupId>
  <artifactId>sparksqltrain</artifactId>
  <version>1.0-SNAPSHOT</version>
  <inceptionYear>2008</inceptionYear>
  <properties>
    <scala.version>2.11.8</scala.version>
    <spark.version>2.2.0</spark.version>
  </properties>

  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
  </repositories>

  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>

  <dependencies>
    <!-- Scala dependency -->
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <!-- Spark dependency -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <!-- Required for HiveContext -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>


  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
          <args>
            <arg>-target:jvm-1.5</arg>
          </args>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-eclipse-plugin</artifactId>
        <configuration>
          <downloadSources>true</downloadSources>
          <buildcommands>
            <buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
          </buildcommands>
          <additionalProjectnatures>
            <projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
          </additionalProjectnatures>
          <classpathContainers>
            <classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
            <classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
          </classpathContainers>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <reporting>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
        </configuration>
      </plugin>
    </plugins>
  </reporting>
</project>
HiveContext.scala
package com.sid.com

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveContext {
  def main(args: Array[String]): Unit = {
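    // Create the contexts
    val sparkConf = new SparkConf()
    // Master and app name come from spark-submit; for a local run: sparkConf.setAppName("SQLContext").setMaster("local[3]")
    val sc = new SparkContext(sparkConf)
    // HiveContext is the Spark 1.x entry point; it is deprecated in 2.x in favor of SparkSession with enableHiveSupport()
    val hiveContext = new HiveContext(sc)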

    hiveContext.table("emp").show()


    sc.stop()
  }
}

Build the jar

Upload it to the server and run it

cd /app/spark/spark-2.2.0-bin-2.9.0/bin

./spark-submit --class com.sid.com.HiveContext --master local[2] --name HiveContext --jars /app/mysql-connector-java-5.1.46.jar /app/spark/test_data/sparksqltrain-1.0-SNAPSHOT.jar

The job fails with the following error

Exception in thread "main" java.lang.IllegalArgumentException: java.net.UnknownHostException: hadoopcluster
	at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
	at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
	at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:668)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:604)
	at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2598)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
	at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:314)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2853)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2153)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2153)
	at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2837)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2836)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2153)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2366)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:245)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:644)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:603)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:612)
	at com.sid.com.HiveContext$.main(HiveContext.scala:15)
	at com.sid.com.HiveContext.main(HiveContext.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.UnknownHostException: hadoopcluster
	... 65 more

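The cause is that my HDFS is configured for HA (high availability). hadoopcluster is not a real host name; it is the value of dfs.nameservices in Hadoop's hdfs-site.xml. Spark cannot see that configuration, so the HDFS client treats hadoopcluster as an ordinary host (note createNonHAProxy in the stack trace) and fails to resolve it.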
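
To check which nameservice a given hdfs-site.xml defines, the property can be read straight from the file. The sketch below uses a hypothetical helper, get_hadoop_prop, and assumes each <name> and <value> element sits on a single line, which is the usual layout of Hadoop configuration files. On a node with the Hadoop client installed, hdfs getconf -confKey dfs.nameservices reports the value from Hadoop's own configuration.

```shell
# Print the <value> of one property from a Hadoop-style XML config file.
# get_hadoop_prop is a hypothetical helper: $1 = XML file, $2 = property name.
# Assumption: every <name> and <value> element fits on a single line.
get_hadoop_prop() {
  awk -v key="$2" '
    /<name>/ {
      n = $0
      sub(/.*<name>[ \t]*/, "", n)      # strip everything up to the name
      sub(/[ \t]*<\/name>.*/, "", n)    # strip everything after the name
      found = (n == key)
    }
    found && /<value>/ {
      v = $0
      sub(/.*<value>[ \t]*/, "", v)
      sub(/[ \t]*<\/value>.*/, "", v)
      print v
      exit
    }
  ' "$1"
}

# Usage:
#   get_hadoop_prop /app/hadoop/hadoop-2.9.0/etc/hadoop/hdfs-site.xml dfs.nameservices
```

If the function prints nothing for a copy of hdfs-site.xml, that copy does not define the property and a client using it cannot resolve the nameservice.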

The Hadoop configuration files hdfs-site.xml and core-site.xml also have to be copied into the conf directory of every Spark installation

cd /app/hadoop/hadoop-2.9.0/etc/hadoop

scp hdfs-site.xml root@node1:/app/spark/spark-2.2.0-bin-2.9.0/conf/

scp hdfs-site.xml root@node2:/app/spark/spark-2.2.0-bin-2.9.0/conf/

scp hdfs-site.xml root@node3:/app/spark/spark-2.2.0-bin-2.9.0/conf/

scp hdfs-site.xml root@node4:/app/spark/spark-2.2.0-bin-2.9.0/conf/

scp core-site.xml root@node1:/app/spark/spark-2.2.0-bin-2.9.0/conf/

scp core-site.xml root@node2:/app/spark/spark-2.2.0-bin-2.9.0/conf/

scp core-site.xml root@node3:/app/spark/spark-2.2.0-bin-2.9.0/conf/

scp core-site.xml root@node4:/app/spark/spark-2.2.0-bin-2.9.0/conf/

cd /app/spark/spark-2.2.0-bin-2.9.0/conf

cp spark-defaults.conf.template spark-defaults.conf

vi spark-defaults.conf

Append the following line

spark.files file:///app/spark/spark-2.2.0-bin-2.9.0/conf/hdfs-site.xml,file:///app/spark/spark-2.2.0-bin-2.9.0/conf/core-site.xml
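
Appending by hand in vi works, but rerunning the setup can leave duplicate entries. The sketch below adds the line only when the key is not already present. set_spark_default is a hypothetical helper, and it assumes the whitespace-separated "key value" format used in spark-defaults.conf.template.

```shell
# Append "key value" to a Spark properties file unless the key is already set.
# set_spark_default is a hypothetical helper: $1 = file, $2 = key, $3 = value.
set_spark_default() {
  conf_file=$1 key=$2 value=$3
  # awk exits with status 0 only if some line has the key as its first field;
  # commented-out lines start with "#" and therefore do not match.
  if awk -v k="$key" '$1 == k { found = 1 } END { exit !found }' "$conf_file" 2>/dev/null; then
    echo "$key already set in $conf_file, leaving it unchanged"
  else
    printf '%s %s\n' "$key" "$value" >> "$conf_file"
  fi
}

# Usage:
#   SPARK_CONF=/app/spark/spark-2.2.0-bin-2.9.0/conf
#   set_spark_default "$SPARK_CONF/spark-defaults.conf" spark.files \
#     "file://$SPARK_CONF/hdfs-site.xml,file://$SPARK_CONF/core-site.xml"
```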

Run the Spark job again
