Spark SQL includes a data source that can read tables from other databases over JDBC. This approach should be preferred over JdbcRDD, because the results are returned as a DataFrame, which Spark SQL can process directly or join with other data sources.
The approach: store the driving data in a MySQL database, fetch it over JDBC, and splice the values into SQL statements that operate on Hive tables.
Here we drop, create, and load (insert data into) Hive partitioned tables. Each row of the MySQL student table is assumed to carry a Hive table name in its name column and the SELECT statement that feeds the insert in its score column.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import scala.util.control._

object Spark_Sql {
  // Spark initialization; enableHiveSupport() is required for the Hive DDL/DML below
  val sconf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("ETL")
  val spark: SparkSession = SparkSession.builder().config(sconf).enableHiveSupport().getOrCreate()

  def main(args: Array[String]): Unit = {
    // Define the JDBC reader for the MySQL source table
    val frameReader = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://192.168.8.172/Test")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("user", "root")
      .option("password", "Spark123!")
      .option("dbtable", "student")
    val dataFrame = frameReader.load()

    // Register a temporary view (registerTempTable is deprecated since Spark 2.0)
    dataFrame.createOrReplaceTempView("test")
    val test = spark.sql("select name from test").rdd.collect().toList
    val list = spark.sql("select score from test").rdd.collect().toList

    // Breaks for early exit from the loops
    val outer = new Breaks
    val inner = new Breaks
    outer.breakable {
      // For each Hive table name pulled from MySQL, rebuild and load the partition
      for (i <- test) {
        inner.breakable {
          for (l <- list) {
            val table = i.mkString("")
            // 1. Drop the partition if it exists
            spark.sql(s"alter table $table drop if exists partition (ds='20200521')")
            // 2. Recreate the partition
            spark.sql(s"alter table $table add partition (ds='20200521')")
            // 3. Load data: l is expected to hold the SELECT feeding the insert
            spark.sql(s"insert overwrite table $table partition (ds='20200521') " + l.mkString(""))
            inner.break
          }
        }
      }
    }
  }
}
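One caveat with the Breaks logic above: inner.break fires after the first iteration, so every table name is paired with only the first element of list. If the intent is one statement per MySQL row (each name with the score from the same row), a simpler sketch is to select both columns together; this assumes, as above, that score holds the feeding SELECT:

// Pull name and query together so each table is paired with its own statement
val rows = spark.sql("select name, score from test").collect()
for (r <- rows) {
  val table = r.getString(0) // Hive table name
  val query = r.getString(1) // SELECT statement feeding the insert
  spark.sql(s"alter table $table drop if exists partition (ds='20200521')")
  spark.sql(s"alter table $table add partition (ds='20200521')")
  spark.sql(s"insert overwrite table $table partition (ds='20200521') $query")
}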
The required pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>SparkETL</groupId>
    <artifactId>SparkETL</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>
        <!-- Needed for enableHiveSupport() in the code above -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.27</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.1.1</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>Spark_Sql</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
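With this pom, mvn clean package produces the shaded jar target/SparkETL-1.0-SNAPSHOT.jar (the shade plugin replaces the default artifact) with Spark_Sql as the main class; this is the jar path used in the submit command below.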
Also sharing an alternative packaging plugin (maven-assembly-plugin):
<plugin>
    <artifactId>maven-assembly-plugin</artifactId>
    <configuration>
        <archive>
            <manifest>
                <!-- Set your program's entry point here: in IDEA, select your main class,
                     right-click, choose Copy Reference, and paste it into mainClass -->
                <mainClass>your.entry.Class</mainClass>
            </manifest>
        </archive>
        <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
    </configuration>
</plugin>
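Note that as written the plugin's single goal is not bound to a phase, so run mvn package assembly:single; it produces target/SparkETL-1.0-SNAPSHOT-jar-with-dependencies.jar, which bundles all dependencies and can be submitted the same way.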
Submitting to Spark: spark2-submit --class Spark_Sql --master spark://cdh01:7077 /data/SparkETL-1.0-SNAPSHOT.jar
- --class: the entry class of your application (e.g. org.apache.spark.examples.SparkPi)
- --master: the master URL of the cluster (e.g. spark://node01:7077)
- the final argument: the absolute path of your uploaded jar
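For reference, a fuller submit command with explicit resource options might look like this (the memory and core values are illustrative, not from the original setup):

spark2-submit --class Spark_Sql \
  --master spark://cdh01:7077 \
  --executor-memory 2g \
  --total-executor-cores 4 \
  /data/SparkETL-1.0-SNAPSHOT.jar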
For detailed parameter configuration, see: https://blog.youkuaiyun.com/Dkey_775/article/details/107063208