In 《spark sql 写入hive较慢原因分析》 ("Analysis of why Spark SQL writes to Hive slowly") we examined why Spark SQL is slow when writing Hive partition files. Here the author offers several optimization approaches for reference:
(1) Have Spark write the data files directly into the Hive table's underlying partition directories, then register the partitions with an ALTER TABLE ... ADD PARTITION statement:
spark.sql(s"alter table legend.test_log_hive_text add partition (name_par='${dirName}')")
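Putting the pieces together, here is a minimal sketch of this approach. The partition value dirName and the warehouse path below are hypothetical placeholders; only the table name and the ADD PARTITION statement come from this post, so adjust the path to your cluster's actual warehouse location:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DirectPartitionFiles")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

val dirName = "test1" // hypothetical partition value
val partitionPath =
  s"/user/hive/warehouse/legend.db/test_log_hive_text/name_par=$dirName"

// Stand-in data; .text() requires a single string column.
val df = Seq("log line 1", "log line 2").toDF("value")

// Write the files straight into the partition directory, avoiding the
// per-partition metastore round-trips of a normal insertInto write.
df.write.mode("overwrite").text(partitionPath)

// Register the new partition with the metastore in a single DDL call.
spark.sql(
  s"alter table legend.test_log_hive_text add partition (name_par='$dirName')")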
(2) Have Spark write the files to a staging directory on HDFS, then use the hive CLI to LOAD DATA them into the table:
hive -e "load data inpath '/test/test_log_hive/name_par=test$i' overwrite into table legend.test_log_hive_text partition(name_par='test$i') "
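The Spark side of this approach only needs to produce files under the staging layout that the load command above expects, one directory per partition value under /test/test_log_hive. A minimal sketch, with stand-in data:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("StageForHiveLoad")
  .getOrCreate()
import spark.implicits._

val i = 1
val stagingPath = s"/test/test_log_hive/name_par=test$i"

val df = Seq("log line 1", "log line 2").toDF("value")
df.write.mode("overwrite").text(stagingPath)

// hive -e "load data inpath ..." (as above) then moves these files into the
// table's own partition directory and registers the partition, so no
// metastore calls happen during the Spark write itself.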
(3) Modify the Spark configuration to use a newer Hive metastore client: set spark.sql.hive.metastore.version to your metastore's version and spark.sql.hive.metastore.jars to the location of the matching Hive jars. The Spark source shows that the supported Hive versions range from 0.12.0 through 2.3.3:
private[spark] object HiveUtils extends Logging {

  def withHiveExternalCatalog(sc: SparkContext): SparkContext = {
    sc.conf.set(CATALOG_IMPLEMENTATION.key, "hive")
    sc
  }

  /** The version of hive used internally by Spark SQL. */
  val builtinHiveVersion: String = "1.2.1"

  val HIVE_METASTORE_VERSION = buildConf("spark.sql.hive.metastore.version")
    .doc("Version of the Hive metastore. Available options are " +
      "0.12.0 through 2.3.3.")
    .stringConf
    .createWithDefault(builtinHiveVersion)

  // A fake config which is only here for backward compatibility reasons. This config has no effect
  // to Spark, just for reporting the builtin Hive version of Spark to existing applications that
  // already rely on this config.
  val FAKE_HIVE_VERSION = buildConf("spark.sql.hive.version")
    .doc(s"deprecated, please use ${HIVE_METASTORE_VERSION.key} to get the Hive version in Spark.")
    .stringConf
    .createWithDefault(builtinHiveVersion)

  val HIVE_METASTORE_JARS = buildConf("spark.sql.hive.metastore.jars")
    .doc(s"""
      | Location of the jars that should be used to instantiate the HiveMetastoreClient.
      | This property can be one of three options: "
      | 1. "builtin"
      |   Use Hive ${builtinHiveVersion}, which is bundled with the Spark assembly when
      |   -Phive is enabled. When this option is chosen,
      |   spark.sql.hive.metastore.version must be either
      |   ${builtinHiveVersion} or not defined.
      | 2. "maven"
      |   Use Hive jars of specified version downloaded from Maven repositories.
      | 3. A classpath in the standard format for both Hive and Hadoop.
      """.stripMargin)
    .stringConf
    .createWithDefault("builtin")
  // ...
}
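For example, to point the metastore client at Hive 2.3.3, both properties can be set on the session builder. A minimal sketch; the jar path is hypothetical and must contain the Hive 2.3.3 jars plus their Hadoop dependencies, in standard Java classpath format:

import org.apache.spark.sql.SparkSession

// Pin the Hive metastore client to 2.3.3 and load the matching jars from a
// local directory (hypothetical path; a classpath wildcard picks up all jars).
val spark = SparkSession.builder()
  .appName("NewerMetastoreClient")
  .config("spark.sql.hive.metastore.version", "2.3.3")
  .config("spark.sql.hive.metastore.jars", "/opt/hive-2.3.3/lib/*")
  .enableHiveSupport()
  .getOrCreate()

Both properties are read when the Hive client is first instantiated, so they must be set before the first Hive-backed query runs, either on the builder as above or in spark-defaults.conf.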