Spark Optimization: Partitioned Inserts

spark.sql.sources.partitionOverwriteMode (default: STATIC, since 2.3.0)

When INSERT OVERWRITE writes to a partitioned data source table, Spark currently supports 2 modes: static and dynamic. In static mode, Spark deletes all the partitions that match the partition specification (e.g. PARTITION(a=1, b)) in the INSERT statement before overwriting. In dynamic mode, Spark doesn't delete partitions ahead of time, and only overwrites those partitions that have data written into them at runtime. By default, static mode is used to keep the same behavior as Spark prior to 2.3. Note that this config doesn't affect Hive serde tables, as they are always overwritten in dynamic mode. This can also be set as an output option for a data source using the key partitionOverwriteMode (which takes precedence over this setting), e.g. dataframe.write.option("partitionOverwriteMode", "dynamic").save(path).
// Overwrite only the partitions that receive data (dynamic mode)
import org.apache.spark.sql.SaveMode
spark.table("tv_group").write.option("partitionOverwriteMode", "dynamic")
  .partitionBy("store_id", "group_id").mode(SaveMode.Overwrite).save("xxx")

// Delete all matching partitions first, then insert (static mode, the default)
spark.table("tv_group").write.option("partitionOverwriteMode", "static")
  .partitionBy("store_id", "group_id").mode(SaveMode.Overwrite).save("xx")
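
The same switch can also be set at the session level and used with a SQL-style INSERT OVERWRITE. A minimal sketch, assuming hypothetical data source tables tv_group_target and tv_group_staging with matching schemas (the names are placeholders, not from the examples above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PartitionOverwriteDemo").enableHiveSupport().getOrCreate()

// Session-level default; the per-write option("partitionOverwriteMode", ...) shown above takes precedence over it.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

// In dynamic mode, only the partitions that actually receive rows from the SELECT are overwritten;
// with the default static mode, every partition matching the partition spec would be dropped first.
spark.sql("""
  INSERT OVERWRITE TABLE tv_group_target
  PARTITION (store_id, group_id)
  SELECT * FROM tv_group_staging
""")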

Save Spark dataframe as dynamic partitioned table in Hive

I have a sample application that reads CSV files into a dataframe. The dataframe can be stored to a Hive table in Parquet format using the method df.saveAsTable(tablename, mode).

The above code works fine, but I have so much data for each day that I want to dynamically partition the Hive table based on creationdate (a column in the table).

Is there any way to dynamically partition the dataframe and store it in the Hive warehouse? I want to refrain from hard-coding the insert statement using hivesqlcontext.sql(insert into table partition by(date)....).

The question can be considered an extension of: How to load DataFrame directly to Hive in Spark

Any help is much appreciated.

------------------------

I believe it works something like this:

df is a dataframe with year, month and other columns

df.write.partitionBy('year', 'month').saveAsTable(...)

or

df.write.partitionBy('year', 'month').insertInto(...)
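
For reference, a self-contained Scala sketch of the same idea (the table name and columns are made up for illustration). The practical difference is that saveAsTable can create the partitioned table from the dataframe, while insertInto writes into an existing table and resolves columns by position, with partition columns expected last:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
import spark.implicits._

// Toy dataframe with year/month partition columns (illustrative only).
val df = Seq(
  (2014, 11, "event-a"),
  (2014, 12, "event-b")
).toDF("year", "month", "payload")

// Creates (or appends to) a managed table partitioned by year and month.
df.write.mode(SaveMode.Append).partitionBy("year", "month").saveAsTable("events_by_month")

// Writes into the already-existing table; columns are matched by position,
// so reorder explicitly to put the partition columns last.
df.select("payload", "year", "month")
  .write.mode(SaveMode.Append).insertInto("events_by_month")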

I tried this partitionBy method. It only works at the RDD level; once the dataframe is created, most of the methods are DBMS-styled, e.g. groupBy and orderBy, but they don't serve the purpose of writing to different partition folders in Hive.

OK, so I was able to work it out with version 1.4: df.write().mode(SaveMode.Append).partitionBy("date").saveAsTable("Tablename"). However, this changes my date field to an integer value and removes the actual date. E.g. there are 9 unique dates in the column, but they are now stored as 1, 2, 3, ... and the folder names are date=1, date=2, date=3, ... instead of date=20141121. Let me know if there is a way to do this.

------------------

I was able to write to a partitioned Hive table using df.write().mode(SaveMode.Append).partitionBy("colname").saveAsTable("Table").

I had to enable the following properties to make it work.

hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

Where should I set the above two parameters? I tried logging into the Hive shell and running the above commands, but it failed. I'm sure I'm doing something wrong. Could you please tell me where to set these properties?

@VrushankDoshi You would set them in the Spark program, right after you create your hiveContext:

val sparkConf = new SparkConf()
val sc = new SparkContext(sparkConf)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
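
On Spark 2.x and later, HiveContext is deprecated in favor of SparkSession; a roughly equivalent sketch (source_table is a placeholder, and setting the Hive properties through spark.conf is one common way to do it):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("DynamicPartitionWrite")
  .enableHiveSupport()                 // replaces HiveContext
  .getOrCreate()

// Same Hive settings as above, now set on the session.
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

// As in the answer above: append into a Hive table, one folder per value of colname.
spark.table("source_table")
  .write.mode(SaveMode.Append).partitionBy("colname").saveAsTable("Table")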

----------------------

$SPARK_HOME/bin/spark-submit --files $SPARK_HOME/conf/hive-site.xml --num-executors 2 --executor-memory 1g --executor-cores 1 --class com.bl.bigdata.cdp.execservice.service.batch.TestApp --name TestApp_wsw /var/lib/hive/ww/TestApp.jar

Dynamic partition insert in Spark SQL means creating the partitions of a partitioned table dynamically, based on the query results. This matters especially for large-scale data, because it lets you store data efficiently in a properly partitioned table, which in turn speeds up subsequent reads. Without proper tuning, however, dynamic partition inserts can become a performance bottleneck. Here are some strategies for optimizing them (a combined sketch of the first three follows this list):

1. Enable parallel execution: in Spark SQL, the spark.sql.shuffle.partitions property controls task parallelism. Set the partition count appropriately for your cluster resources to improve write parallelism.
2. Mitigate data skew: data skew is a common cause of slow dynamic partition inserts. It can be reduced by increasing the number of partitions, using salting, or preprocessing the data.
3. Configure the dynamic partition insert parameters: the spark.sql.sources.partitionOverwriteMode setting specifies the partition overwrite mode. For example, the 'dynamic' mode overwrites only the partitions that actually receive data, avoiding unnecessary deletes and rewrites.
4. Use broadcast variables and persistence wisely: for small tables or frequently referenced static data, broadcast variables can reduce cross-node data transfer; for large datasets that are reused, persisting them avoids repeated reads and recomputation.
5. Control transaction size: when doing dynamic partition inserts with Spark SQL, reducing the amount of data written per insert reduces per-insert overhead.
6. Disable checkpointing where it is not needed: Spark's checkpoint mechanism adds extra I/O, so turning it off avoids unnecessary overhead.
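
A minimal sketch that combines points 1 to 3 above, assuming the tv_group example from earlier with store_id and group_id partition columns (the shuffle partition count, salt width, and output path are placeholders to tune for your cluster):

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, rand}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// 1. Parallelism: size the shuffle partition count to the cluster.
spark.conf.set("spark.sql.shuffle.partitions", "200")

// 3. Only overwrite the partitions that actually receive data.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

// 2. Skew mitigation: add a random salt and repartition by the partition columns plus the salt,
//    so one hot store_id is spread across several tasks (at the cost of more output files).
val salted = spark.table("tv_group").withColumn("salt", (rand() * 16).cast("int"))

salted.repartition(col("store_id"), col("group_id"), col("salt"))
  .drop("salt")
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy("store_id", "group_id")
  .save("/path/to/output")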