Scala in Practice: Modifying the Spark Source (Writing a DataFrame Incrementally into a MySQL Table by Column)

Licensed under CC 4.0 BY-SA

Tags: #spark #jdbc #mysql #scala #spark-source-modification

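This post looks at a problem that shows up when Spark writes a DataFrame to MySQL: the table is dropped and recreated instead of being updated incrementally. Reading the source, the author finds that the mode setting has no effect and that the insert logic requires the table's columns to match the DataFrame exactly. To address this, the author modifies the Spark 1.5.2 source: the insertStatement algorithm is changed, a DataFrame parameter is added, and the default saveMode is adjusted. The post ends with part of the modified JdbcUtils class.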


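In the previous post we used a few of the interfaces documented on the official site to load data from MySQL tables into Spark, and to store Spark results back into MySQL.

However, when Spark stores a DataFrame into MySQL, it does not seem to matter which mode you choose: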

jdbcDF.write.mode(SaveMode.Overwrite).jdbc(url,"zfs_test",prop)
jdbcDF.write.mode(SaveMode.Append).jdbc(url,"zbh_test",prop)

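In both cases the table ends up being rebuilt.

As a result, whatever data the table held before is gone. And if the table has other columns that are not in the DataFrame (for example, an auto-increment primary key id), there is no way to make the write work.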

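The environment for this post is the same as in http://blog.youkuaiyun.com/zfszhangyuan/article/details/52593521

The Spark version is 1.5.2. This time we need to download the Spark source from the official site: http://www.apache.org/dist/spark/spark-1.5.2/

Choose spark-1.5.2.tgz.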

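Add the source to the existing project.

Let's trace through the source to see why, whatever mode is set, the result is always the same: the table is dropped, recreated, and then filled with the data.

The root cause turns out to be:

The mode is hard-coded. Whether you set Append or anything else beforehand, it always ends up as Overwrite.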
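To make the intended behavior concrete, here is a minimal, Spark-free sketch of how a save mode should map to actions on the target table. The `Mode` and `Action` types and the `SaveModePlan.plan` function are hypothetical names introduced only for illustration; they model the decision logic and are not part of Spark's API.

```scala
// Hypothetical model of the save modes; these are NOT Spark's own types.
sealed trait Mode
case object Append extends Mode
case object Overwrite extends Mode
case object ErrorIfExists extends Mode
case object Ignore extends Mode

// Hypothetical actions a JDBC writer could take for a given mode.
sealed trait Action
case object SkipWrite extends Action        // leave the table untouched
case object Fail extends Action             // report that the table already exists
case object DropCreateInsert extends Action // drop, recreate, then insert
case object CreateInsert extends Action     // create the missing table, then insert
case object InsertOnly extends Action       // keep existing rows and add new ones

object SaveModePlan {
  // Decide what to do from the requested mode and whether the table exists.
  def plan(mode: Mode, tableExists: Boolean): Action = (mode, tableExists) match {
    case (Ignore, true)        => SkipWrite
    case (ErrorIfExists, true) => Fail
    case (Overwrite, true)     => DropCreateInsert
    case (_, false)            => CreateInsert
    case (Append, true)        => InsertOnly
  }
}
```

Under this model, only Overwrite on an existing table should ever drop it. If the mode is effectively always Overwrite, every write takes the `DropCreateInsert` path, which matches the behavior described above.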

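In addition, the way Spark inserts data into MySQL is not ideal either:

It issues a plain insert into table values(...); statement. This requires the target table's column names and column order to match the data in the DataFrame exactly. When we want to insert the DataFrame's data into specific columns of a MySQL table, this approach cannot do it.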
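The difference between the two statement shapes can be sketched without Spark. The `InsertSql` object and its two helper methods below are hypothetical names used only for illustration; they simply build the SQL text for a JDBC PreparedStatement with `?` placeholders.

```scala
// Hypothetical helpers, for illustration only; not part of Spark.
object InsertSql {
  // Positional form: values are matched to table columns purely by order,
  // so the table must have exactly these columns in exactly this order.
  def positional(table: String, columnCount: Int): String = {
    val marks = Seq.fill(columnCount)("?").mkString(", ")
    s"INSERT INTO $table VALUES ($marks)"
  }

  // Named form: only the listed columns are written, so columns that are
  // not listed (such as an auto-increment id) are left to the database.
  def named(table: String, columns: Seq[String]): String = {
    require(columns.nonEmpty, "at least one column is required")
    val cols = columns.mkString(", ")
    val marks = columns.map(_ => "?").mkString(", ")
    s"INSERT INTO $table ($cols) VALUES ($marks)"
  }
}
```

For example, `InsertSql.named("zfs_test", Seq("name", "age"))` yields `INSERT INTO zfs_test (name, age) VALUES (?, ?)`, where `name` and `age` are made-up column names. Taking the column list from the DataFrame's schema is the idea behind the insertStatement change in this post.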

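Now that the causes are known, let's start improving the source.

The main changes are: the insertStatement algorithm, a new df: DataFrame parameter on the jdbc method, and the default value of the save mode.

To avoid touching the Spark source itself, we write our own JdbcUtils object that extends Logging. The code is as follows: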

package JDBC_MySql

import java.sql.{Connection, PreparedStatement}
import java.util.Properties

//import com.besttone.utils.{JDBCRDD, JdbcDialects}
import org.apache.spark.Logging
import org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Row, SaveMode}

import scala.util.Try

/**
  * Util functions for JDBC tables.
  */
object JdbcUtils extends Logging {

  // The save mode is fixed to Append so that existing rows in the target table are kept.
  val mode = SaveMode.Append

  def jdbc(url: String, df: DataFrame, table: String, connectionProperties: Properties): Unit = {
    val props = new Properties()
    props.putAll(connectionProperties)
    val conn = JdbcUtils.createConnection(url, props)

    try {
      var tableExists = JdbcUtils.tableExists(conn, table)

      if (mode == SaveMode.Ignore && tableExists) {
        return
      }

      if (mode == SaveMode.ErrorIfExists && tableExists) {
        sys.error(s"Table $table already exists.")
      }

      if (mode == SaveMode.Overwrite && tableExists) {
        JdbcUtils.dropTable(conn, table)
        tableExists = false
      }

      // Create the table if the table didn't exist.
      if (!tableExists) {
        val schema = JdbcUtils.schemaString(df, url)
        val sql = s"CREATE TABLE $table ($schema)"
        conn.prepareStatement(sql).executeUpdate()
      }
    } finally {
      conn.close()
    }

    JdbcUtils.saveTable(df, url, table, props)
  }

  /**
    * Establishes a JDBC connection.
    */
  def createConnection(url: String, connectionProperties: Properties): Connection = {
    JDBCRDD.getConnector(connectionProperties.getProperty("driver"), url, connectionProperties)()
  }

  /**
    * Returns true if the table already exists in the JDBC database.
    */
  def tableExists(conn: Connection, table: String): Boolean = {
    // Somewhat hacky, but there isn't a good way to identify whether a table exists for all
    // SQL database systems, considering "table" could also include the database name.
    Try(conn.prepareStatement(s"SELECT 1 FROM $table LIMIT 1").executeQuery().next()).isSuccess
  }

  /**
    * Drops a table from the JDBC database.
    */
  def dropTable(conn: Connection, table: String): Unit = {
    conn.prepareStatement(s"DROP TABLE $table").executeUpdate()
  }

  /**
    * Returns a PreparedStatement that inserts a row into table via conn.
    */
  def insertStatement(conn: Connection, table: String, rddSchema: StructType): PreparedStatement = {
    val fields = rddSchema.fields
    val fieldsSql = new StringBuilder(s"(")
    var i=0;
    for(f <- fields){
      fieldsSql.append(f.name)

      if(i==fields.length-1){
        fieldsSql.append("