An Unconventional Use of Spark Transformations: Calling SparkContext.runJob()

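This article describes how to manually trigger a Spark job that consists purely of transformations, specifically by using SparkContext.runJob(). Normally, transformations rely on an action such as count() or show() to start the computation, but sometimes a pipeline needs to be computed ahead of time or has no natural action. With runJob(), the computation can be triggered, and information gathered while it runs, without the extra cost or the awkwardness of calling an unrelated action.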


Background

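    In the typical case, the transformations in a Spark job are triggered by an action, most commonly write, collect, or show. Wide transformations such as groupByKey, called in the middle of the data processing, add a shuffle that redistributes the data; when a job is submitted, the transformations upstream of the shuffle are submitted as earlier stages. These operations are still lazy themselves, though: transformations alone cannot trigger a Spark job.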

    So how should a Spark job built purely from transformation APIs be triggered?

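    Some scenarios involve a pipeline made up purely of transformations, or require part of the pipeline to be computed ahead of time to save time. For example: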

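    1. The pipeline consists purely of transformations. When a data processing flow spans heterogeneous computing platforms, the RDD data of one Spark job has to be handed on to the next platform. This can be done with mapPartitions, passing the data along one partition at a time. Within that Spark job, some operation is still required to trigger mapPartitions and its per-partition function. A simple count() or show() can of course do this.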

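    2. Part of the pipeline needs to be computed ahead of time, to save as much time as possible for the later part. When the later job is triggered, it can directly use the data already computed by the earlier part instead of placeholder data.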
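The laziness described above can be seen without a cluster. The following is a minimal plain-JDK analogy, not Spark code: Stream.map stands in for a transformation such as mapPartitions, and collect() stands in for an action. The class name LazyTransformDemo and the handedOff counter are hypothetical names introduced only for this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Plain-JDK analogy only: no Spark classes are involved.
final class LazyTransformDemo {
    // Counts how many records the hypothetical hand-off step has processed.
    static final AtomicInteger handedOff = new AtomicInteger();

    // Builds the pipeline; the mapping function is registered but not executed here.
    static Stream<String> buildPipeline(List<String> records) {
        return records.stream().map(r -> {
            handedOff.incrementAndGet(); // stand-in for sending r to another platform
            return r.toUpperCase();
        });
    }

    public static void main(String[] args) {
        Stream<String> pipeline = buildPipeline(Arrays.asList("a", "b", "c"));
        System.out.println("handed off after building: " + handedOff.get());

        // The terminal operation plays the role of the action that triggers the work.
        List<String> out = pipeline.collect(Collectors.toList());
        System.out.println("handed off after collect: " + handedOff.get() + " " + out);
    }
}
```

The hand-off side effect only happens once something consumes the pipeline, which is why scenario 1 still needs a trigger of some kind.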

 

Main Text

1. Triggering with count() / show()

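    This is the simplest approach, but it adds computation that is not otherwise needed: count() aggregates a row count, and show() internally collects rows to the driver. Because show() only needs the first few rows, it may also stop before every partition has been evaluated. Calling an action whose result is thrown away is not very elegant either.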
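One caveat with a preview-style trigger such as show() is that it only needs the first rows, so it may not evaluate the whole pipeline. The sketch below models that with plain JDK streams, not Spark: limit() stands in for the preview and forEach() for a full pass. TriggerCoverageDemo and its method are hypothetical names for this illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

// Plain-JDK model only: no Spark classes are involved.
final class TriggerCoverageDemo {
    static final int TOTAL_RECORDS = 100;

    // Runs a 100-record pipeline and returns how many records were actually evaluated.
    // previewRows > 0 models a preview of that many rows; otherwise a full pass is made.
    static int evaluatedRecords(int previewRows) {
        AtomicInteger evaluated = new AtomicInteger();
        Stream<Integer> pipeline = IntStream.range(0, TOTAL_RECORDS).boxed()
            .map(i -> { evaluated.incrementAndGet(); return i * 2; });
        if (previewRows > 0) {
            pipeline.limit(previewRows).collect(Collectors.toList()); // short-circuits
        } else {
            pipeline.forEach(x -> { }); // consumes every record
        }
        return evaluated.get();
    }

    public static void main(String[] args) {
        System.out.println("evaluated by a 20-row preview: " + evaluatedRecords(20));
        System.out.println("evaluated by a full pass: " + evaluatedRecords(0));
    }
}
```

forEach is used for the full pass rather than Stream.count(), because on recent JDKs count() may skip the pipeline entirely when the size is known from the source.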

2. Triggering with SparkContext.runJob()

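    runJob() is the method that every action eventually calls in order to trigger and submit a Spark job. It is defined on SparkContext and is available in several overloads.

Some of these overloads return a value (one result per partition), which can carry information gathered during the computation. Below is a simple wrapper class that takes a Dataset, triggers its computation, and then returns a new Dataset: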

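import java.io.Serializable;

import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructType;
import scala.collection.Iterator;
import scala.reflect.ClassTag$;
import scala.runtime.AbstractFunction1;

public class SparkJobSubmiter {
    // Static helper: runs a job over ds and returns a Dataset with the same schema
    public static Dataset<Row> runJob(Dataset<Row> ds) {
        RDD<Row> rdd = ds.rdd();
        // Keep the schema so the Dataset can be rebuilt from the RDD
        StructType sche = ds.schema();
        // runJob only accepts an RDD; its per-partition results are ignored here
        ds.sparkSession().sparkContext().runJob(rdd, new JobFunc(), ClassTag$.MODULE$.apply(Void.class));

        // Rebuild a Dataset from the RDD and the saved schema.
        // Note: unless the RDD is persisted, later actions recompute its lineage.
        return ds.sparkSession().createDataFrame(rdd, sche);
    }
}

// Helper function applied to each partition; can collect auxiliary information
class JobFunc extends AbstractFunction1<Iterator<Row>, Void> implements Serializable {

    @Override
    public Void apply(Iterator<Row> iterator) {
        // Add custom logic here to monitor the computation.
        // Drain the iterator: upstream operators are evaluated lazily, row by row,
        // so rows that are never pulled are never computed.
        while (iterator.hasNext()) {
            iterator.next();
        }
        return null;
    }
}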
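To make the per-partition contract easier to see, here is a small plain-Java model of a runJob-style call. It is not the Spark API: RunJobModel, lazily, and drainAndCount are hypothetical names, and the partitions are ordinary lists. The sketch assumes, as in Spark, that the function receives a lazy iterator per partition and that one result per partition comes back.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Plain-Java model only: no Spark classes are involved.
final class RunJobModel {
    // Applies func once per partition and returns one result per partition, in order.
    static <T, U> List<U> runJob(List<List<T>> partitions, Function<Iterator<T>, U> func) {
        List<U> results = new ArrayList<>();
        for (List<T> partition : partitions) {
            results.add(func.apply(partition.iterator()));
        }
        return results;
    }

    // Wraps an iterator so that per-row work happens only when next() is called,
    // modelling a lazy upstream transformation.
    static <T> Iterator<T> lazily(Iterator<T> source, AtomicInteger workDone) {
        return new Iterator<T>() {
            @Override public boolean hasNext() { return source.hasNext(); }
            @Override public T next() { workDone.incrementAndGet(); return source.next(); }
        };
    }

    // Drains the iterator and reports how many rows the partition held.
    static <T> long drainAndCount(Iterator<T> rows) {
        long n = 0;
        while (rows.hasNext()) { rows.next(); n++; }
        return n;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("a", "b"), Arrays.asList("c"), Arrays.<String>asList());
        AtomicInteger work = new AtomicInteger();

        // A function that ignores its iterator: it is called, but no row is processed.
        runJob(partitions, it -> { lazily(it, work); return null; });
        System.out.println("work after ignoring the iterator: " + work.get());

        // A function that drains its iterator: every row is processed.
        List<Long> counts = runJob(partitions, it -> drainAndCount(lazily(it, work)));
        System.out.println("rows per partition: " + counts + ", work: " + work.get());
    }
}
```

Returning a small summary per partition, such as a row count, is one way to collect information about the computation without pulling the data itself back to the caller.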

Summary

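     When a job made purely of transformations is triggered manually, the result usually matters less than the operations performed while it runs. Besides this unconventional use of transformations, it is sometimes useful to construct a Dataset by hand. A Dataset consists of two parts, data and computation logic, and either one can be swapped out to reuse the other.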

     Here is a recommended Java code search site: https://www.codota.com/code . Examples of many rarely used API calls can be found there.

    good luck~

 
