Import the Maven dependency
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.35</version>
</dependency>
The read code
// Build the JDBC URL from the connection parameters
val url = "jdbc:mysql://" + p.server_ip + ":" + p.server_port + "/" + p.database_name
// predicates drives the partitioning; prop carries user, password and driver
val row_data = spark.read.jdbc(url, table, predicates, prop)
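This snippet assumes p, table, predicates, and prop are already defined. For context, a minimal self-contained sketch of the same read (host, database, table name, and credentials here are hypothetical):

import java.util.Properties

val url = "jdbc:mysql://127.0.0.1:3306/mydb"  // hypothetical host and database
val table = "ORDERS"                          // hypothetical table name

val prop = new Properties()
prop.setProperty("user", "db_user")           // hypothetical credentials
prop.setProperty("password", "db_password")
prop.setProperty("driver", "com.mysql.jdbc.Driver")

// One predicate per partition: each string becomes the WHERE clause
// of a separate query, so this read runs as two parallel queries
val predicates = Array(
  "STARTDATE >= '2014-01-01' AND STARTDATE < '2014-07-01'",
  "STARTDATE >= '2014-07-01' AND STARTDATE < '2015-01-01'")

val row_data = spark.read.jdbc(url, table, predicates, prop)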
The jdbc function
def jdbc(url: scala.Predef.String,
         table: scala.Predef.String,
         predicates: scala.Array[scala.Predef.String],
         connectionProperties: java.util.Properties): org.apache.spark.sql.DataFrame = { /* compiled code */ }
When calling this function you need to set the values of predicates and properties:
predicates = []
datelist = {"2014-11-01": "2015-01-01",
            "2014-09-01": "2014-11-01",
            "2014-07-01": "2014-09-01",
            "2014-05-01": "2014-07-01",
            "2014-03-01": "2014-05-01",
            "2014-01-01": "2014-03-01"}
for startdate, enddate in datelist.items():
    # Each entry produces one predicate, e.g.
    # STARTDATE >= to_date('2014-11-01', 'yyyy-MM-dd') AND STARTDATE < to_date('2015-01-01', 'yyyy-MM-dd')
    predicates.append("STARTDATE >= to_date('" + startdate + "', 'yyyy-MM-dd') "
                      + "AND STARTDATE < to_date('" + enddate + "', 'yyyy-MM-dd')")
properties = {"user": db_user_name,
              "password": password,
              "driver": driver}
df = spark.read.jdbc(url=url, table=dbtable, predicates=predicates, properties=properties)
In the end, the number of partitions of the resulting RDD equals predicates.length.
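This is easy to check directly on the DataFrame that jdbc returns (row_data from the read snippet above):

// Each predicate became the WHERE clause of one partition's query
println(row_data.rdd.getNumPartitions)  // equals predicates.length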
One thing to note is that the driver is passed inside properties. Most examples online connect to MySQL, which, unlike Oracle, does not require an extra driver entry (the class names are com.mysql.jdbc.Driver for the connector imported above and oracle.jdbc.driver.OracleDriver for Oracle).
Also, STARTDATE in the data is a DATE column, which is why to_date() is needed to convert the string bounds in the predicates.
Writing
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
val dataList: List[(Double, String, Double, Double, String)] =
  List((1.0, "2014-01-01", 2.0, 3.0, "a"))  // hypothetical sample row
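From here the write path mirrors the read: turn the list into a DataFrame with toDF (enabled by spark.implicits._) and write it back over JDBC. A minimal sketch, reusing the url and prop from the read section; the column names and target table WRITE_TABLE are hypothetical:

import org.apache.spark.sql.SaveMode

// Column names are hypothetical placeholders
val df = dataList.toDF("c1", "c2", "c3", "c4", "c5")

// Append the rows to the target table through the same connection properties
df.write.mode(SaveMode.Append).jdbc(url, "WRITE_TABLE", prop)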