Spark Optimization: Partitioned Inserts

spark.sql.sources.partitionOverwriteMode (default: STATIC, since 2.3.0)

When INSERT OVERWRITE writes to a partitioned data source table, Spark currently supports 2 modes: static and dynamic. In static mode, Spark deletes all the partitions that match the partition specification (e.g. PARTITION(a=1, b)) in the INSERT statement before overwriting. In dynamic mode, Spark doesn't delete partitions ahead of time, and only overwrites those partitions that have data written into them at runtime. By default, static mode is used to keep the same behavior as Spark prior to 2.3. Note that this config doesn't affect Hive serde tables, as they are always overwritten in dynamic mode. This can also be set as an output option for a data source using the key partitionOverwriteMode (which takes precedence over this setting), e.g. dataframe.write.option("partitionOverwriteMode", "dynamic").save(path).
// Overwrite only the partitions that receive data (dynamic mode)
import org.apache.spark.sql.SaveMode
spark.table("tv_group").write.option("partitionOverwriteMode", "dynamic")
  .partitionBy("store_id", "group_id").mode(SaveMode.Overwrite).save("xxx")

// Delete all matching partitions first, then insert (static mode, the default)
spark.table("tv_group").write.option("partitionOverwriteMode", "static")
  .partitionBy("store_id", "group_id").mode(SaveMode.Overwrite).save("xx")
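
The same switch can also be set at the session level and used with a SQL-style INSERT OVERWRITE. A minimal sketch, assuming hypothetical data source tables tv_group_target and tv_group_staging with matching schemas (the names are placeholders, not from the examples above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PartitionOverwriteDemo").enableHiveSupport().getOrCreate()

// Session-level default; the per-write option("partitionOverwriteMode", ...) shown above takes precedence over it.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

// In dynamic mode, only the partitions that actually receive rows from the SELECT are overwritten;
// with the default static mode, every partition matching the partition spec would be dropped first.
spark.sql("""
  INSERT OVERWRITE TABLE tv_group_target
  PARTITION (store_id, group_id)
  SELECT * FROM tv_group_staging
""")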

Save Spark dataframe as dynamic partitioned table in Hive

I have a sample application that reads CSV files into a dataframe. The dataframe can be stored to a Hive table in Parquet format using the method df.saveAsTable(tablename, mode).

The above code works fine, but I have so much data for each day that I want to dynamically partition the Hive table based on creationdate (a column in the table).

Is there any way to dynamically partition the dataframe and store it in the Hive warehouse? I want to refrain from hard-coding the insert statement using hivesqlcontext.sql(insert into table partition by(date)....).

The question can be considered an extension of: How to load DataFrame directly to Hive in Spark

Any help is much appreciated.

------------------------

I believe it works something like this:

df is a dataframe with year, month and other columns

df.write.partitionBy('year', 'month').saveAsTable(...)

or

df.write.partitionBy('year', 'month').insertInto(...)
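
For reference, a self-contained Scala sketch of the same idea (the table name and columns are made up for illustration). The practical difference is that saveAsTable can create the partitioned table from the dataframe, while insertInto writes into an existing table and resolves columns by position, with partition columns expected last:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
import spark.implicits._

// Toy dataframe with year/month partition columns (illustrative only).
val df = Seq(
  (2014, 11, "event-a"),
  (2014, 12, "event-b")
).toDF("year", "month", "payload")

// Creates (or appends to) a managed table partitioned by year and month.
df.write.mode(SaveMode.Append).partitionBy("year", "month").saveAsTable("events_by_month")

// Writes into the already-existing table; columns are matched by position,
// so reorder explicitly to put the partition columns last.
df.select("payload", "year", "month")
  .write.mode(SaveMode.Append).insertInto("events_by_month")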

I tried this partitionBy method. It only works at the RDD level; once the dataframe is created, most of the methods are DBMS-styled, e.g. groupBy and orderBy, but they don't serve the purpose of writing to different partition folders in Hive.

OK, so I was able to work it out with version 1.4: df.write().mode(SaveMode.Append).partitionBy("date").saveAsTable("Tablename"). However, this changes my date field to an integer value and removes the actual date. E.g. there are 9 unique dates in the column, but they are now stored as 1, 2, 3, ... and the folder names are date=1, date=2, date=3, ... instead of date=20141121. Let me know if there is a way to do this.

------------------

I was able to write to a partitioned Hive table using df.write().mode(SaveMode.Append).partitionBy("colname").saveAsTable("Table").

I had to enable the following properties to make it work.

hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

Where should I set the above two parameters? I tried logging into the Hive shell and running the above commands, but it failed. I'm sure I'm doing something wrong. Could you please tell me where to set these properties?

@VrushankDoshi You would set them in the Spark program, right after you create your hiveContext:

val sparkConf = new SparkConf()
val sc = new SparkContext(sparkConf)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
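
On Spark 2.x and later, HiveContext is deprecated in favor of SparkSession; a roughly equivalent sketch (source_table is a placeholder, and setting the Hive properties through spark.conf is one common way to do it):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("DynamicPartitionWrite")
  .enableHiveSupport()                 // replaces HiveContext
  .getOrCreate()

// Same Hive settings as above, now set on the session.
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

// As in the answer above: append into a Hive table, one folder per value of colname.
spark.table("source_table")
  .write.mode(SaveMode.Append).partitionBy("colname").saveAsTable("Table")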

----------------------

$SPARK_HOME/bin/spark-submit --files $SPARK_HOME/conf/hive-site.xml --num-executors 2 --executor-memory 1g --executor-cores 1 --class com.bl.bigdata.cdp.execservice.service.batch.TestApp --name TestApp_wsw /var/lib/hive/ww/TestApp.jar

Dynamic partition insert in Spark SQL means creating the partitions of a partitioned table dynamically, based on the query results. This matters especially for large-scale data, because it lets you store data efficiently in a properly partitioned table, which in turn speeds up subsequent reads. Without proper tuning, however, dynamic partition inserts can become a performance bottleneck. Here are some strategies for optimizing them (a combined sketch of the first three follows this list):

1. Enable parallel execution: in Spark SQL, the spark.sql.shuffle.partitions property controls task parallelism. Set the partition count appropriately for your cluster resources to improve write parallelism.
2. Mitigate data skew: data skew is a common cause of slow dynamic partition inserts. It can be reduced by increasing the number of partitions, using salting, or preprocessing the data.
3. Configure the dynamic partition insert parameters: the spark.sql.sources.partitionOverwriteMode setting specifies the partition overwrite mode. For example, the 'dynamic' mode overwrites only the partitions that actually receive data, avoiding unnecessary deletes and rewrites.
4. Use broadcast variables and persistence wisely: for small tables or frequently referenced static data, broadcast variables can reduce cross-node data transfer; for large datasets that are reused, persisting them avoids repeated reads and recomputation.
5. Control transaction size: when doing dynamic partition inserts with Spark SQL, reducing the amount of data written per insert reduces per-insert overhead.
6. Disable checkpointing where it is not needed: Spark's checkpoint mechanism adds extra I/O, so turning it off avoids unnecessary overhead.
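
A minimal sketch that combines points 1 to 3 above, assuming the tv_group example from earlier with store_id and group_id partition columns (the shuffle partition count, salt width, and output path are placeholders to tune for your cluster):

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, rand}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// 1. Parallelism: size the shuffle partition count to the cluster.
spark.conf.set("spark.sql.shuffle.partitions", "200")

// 3. Only overwrite the partitions that actually receive data.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

// 2. Skew mitigation: add a random salt and repartition by the partition columns plus the salt,
//    so one hot store_id is spread across several tasks (at the cost of more output files).
val salted = spark.table("tv_group").withColumn("salt", (rand() * 16).cast("int"))

salted.repartition(col("store_id"), col("group_id"), col("salt"))
  .drop("salt")
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy("store_id", "group_id")
  .save("/path/to/output")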