Using SQLContext in Spark SQL 1.x

This article walks through creating a Scala project in IntelliJ IDEA, configuring the Spark SQL dependency, and reading and processing JSON data. Concrete steps and code examples cover the whole process, from project setup to data processing.

Creating the Scala Project

Create the project in IDEA, specifying <groupId>cn.ac.iie.spark</groupId> and <artifactId>sql</artifactId>.
The dependencies in pom.xml are as follows:

 <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>cn.ac.iie.spark</groupId>
  <artifactId>sql</artifactId>
  <version>1.0-SNAPSHOT</version>
  <inceptionYear>2008</inceptionYear>
  <properties>
    <scala.version>2.11.12</scala.version>
    <spark.version>2.4.5</spark.version>
  </properties>

  <dependencies>
    <!-- Scala -->
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <!-- Spark -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <version>2.15.0</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
            <configuration>
              <args>
                <arg>-dependencyfile</arg>
                <arg>${project.build.directory}/.scala_dependencies</arg>
              </args>
            </configuration>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <version>2.6</version>
        <configuration>
          <useFile>false</useFile>
          <disableXmlReport>true</disableXmlReport>
          <!-- If you have classpath issues like NoClassDefFoundError,... -->
          <!-- useManifestOnlyJar>false</useManifestOnlyJar -->
          <includes>
            <include>**/*Test.*</include>
            <include>**/*Suite.*</include>
          </includes>
        </configuration>
      </plugin>

      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <mainClass></mainClass>
            </manifest>
          </archive>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
      </plugin>
    </plugins>
  </build>

</project>

Creating the Object

Create an object with the following content:

package cn.ac.iie.spark

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

/**
 * Using SQLContextApp.
 * Note: IDEA runs locally while the test data sits on the server. Can we still
 * develop and test locally? Yes.
 */
object SQLContextApp {
  def main(args: Array[String]): Unit = {

    val path = args(0)

    // 1. Create the corresponding contexts
    val sparkConf = new SparkConf()
    // In production or test environments, the app name and master are set by the submit script
    sparkConf.setAppName("SQLContextApp")
    sparkConf.setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)

    // 2. Process the JSON file
    val people = sqlContext.read.format("json").load(path)
    people.printSchema()
    people.show()
    // 3. Release resources
    sc.stop()
  }
}
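
Since this runs on Spark 2.4.5, the same logic can also be expressed with SparkSession, which replaced SQLContext as the entry point from Spark 2.0 onward. A minimal sketch for comparison (the object name SparkSessionApp is illustrative and not part of the original code):

package cn.ac.iie.spark

import org.apache.spark.sql.SparkSession

object SparkSessionApp {
  def main(args: Array[String]): Unit = {
    val path = args(0)

    // SparkSession wraps SparkContext and SQLContext in a single entry point
    val spark = SparkSession.builder()
      .appName("SparkSessionApp")
      .master("local[2]")
      .getOrCreate()

    val people = spark.read.format("json").load(path)
    people.printSchema()
    people.show()

    spark.stop()
  }
}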


Because the program takes the file path as an argument, open Run → Edit Configurations and set it under Program arguments (for example, a local path to a people.json file).
The input data is as follows (one JSON object per line, the format the Spark JSON reader expects):

{"name":"Michael", "salary":3000}
{"name":"Andy", "salary":4500}
{"name":"Justin", "salary":3500}
{"name":"Berta", "salary":4000}
{"name":"vincent", "salary":90000}

The console output is as follows:

root
 |-- name: string (nullable = true)
 |-- salary: long (nullable = true)

+-------+------+
|   name|salary|
+-------+------+
|Michael|  3000|
|   Andy|  4500|
| Justin|  3500|
|  Berta|  4000|
|vincent| 90000|
+-------+------+

We can see that the schema and the data are printed correctly.
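
Beyond printSchema and show, the SQLContext is also the entry point for running SQL against a registered temporary view. A minimal sketch of what this could look like (the view name and the query are illustrative, not part of the original example):

// Register the DataFrame as a temporary view; in Spark 1.x the equivalent call was registerTempTable
people.createOrReplaceTempView("people")
// Run SQL through the SQLContext and show the result
sqlContext.sql("SELECT name, salary FROM people WHERE salary > 4000").show()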
Package the project with Maven from the project root: mvn clean package -DskipTests. With the artifactId and version above, the jar is produced as target/sql-1.0-SNAPSHOT.jar.

Submitting the Job to the Environment

In production or test environments, the app name and master are usually specified by the submit script, so sparkConf.setAppName("SQLContextApp") and sparkConf.setMaster("local[2]") need to be commented out.
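The change in SQLContextApp is just a matter of commenting out those two lines (a sketch of the edit described above):

    val sparkConf = new SparkConf()
    // sparkConf.setAppName("SQLContextApp")  // the name is passed via --name on spark-submit
    // sparkConf.setMaster("local[2]")        // the master is passed via --master on spark-submit
    val sc = new SparkContext(sparkConf)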
Submit it on the server:

./spark-submit \
  --name SQLContextApp \
  --class cn.ac.iie.spark.SQLContextApp \
  --master local[2] \
  /home/iie4bu/lib/sql-1.0-SNAPSHOT.jar \
  /home/iie4bu/app/spark-2.4.5-bin-2.6.0-cdh5.15.1/examples/src/main/resources/people.json
