How-to: control the number of tasks for each stage (partitions for each RDD)

To control the number of tasks, control the parallelism (the number of partitions) of each stage. There are two ways to change Spark's parallelism: repartition and coalesce. The difference between the two is that repartition performs a shuffle, while coalesce does not.
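
As a rough illustration of that difference, here is a toy model in plain Python. It is not Spark's implementation: the functions coalesce and repartition below are hypothetical stand-ins that operate on lists of lists, and the round-robin placement in repartition is an assumption made only to show records being moved individually.

```python
from itertools import chain


def coalesce(partitions, n):
    """Toy model: merge neighbouring partitions into at most n groups.

    Each parent partition is moved as a whole, so no record-level
    redistribution (shuffle) is needed.
    """
    n = min(n, len(partitions))  # without a shuffle the count cannot grow
    groups = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        groups[i * n // len(partitions)].extend(part)
    return groups


def repartition(partitions, n):
    """Toy model: place every record individually into one of n partitions.

    Records from one parent partition end up spread over several new
    partitions, which is what makes this a shuffle.
    """
    out = [[] for _ in range(n)]
    for i, record in enumerate(chain.from_iterable(partitions)):
        out[i % n].append(record)  # round-robin placement (an assumption)
    return out


parts = [[1, 2], [3], [4, 5, 6], [7]]
merged = coalesce(parts, 2)       # parents 0-1 and 2-3 are glued together
reshuffled = repartition(parts, 2)  # every record is placed on its own
```

In this model, coalesce keeps the records of each parent partition together, while repartition splits them up. That is the reason the second one needs an extra stage in the logs.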

Here is an example that controls the number of output files by controlling the number of RDD partitions. By default, each RDD save location can end up with thousands of part-00* files. This can be dangerous if the files are small. My goal is to generate just one part-00* file for each RDD.
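
A minimal sketch of that goal, assuming a PySpark RDD (any object with coalesce, repartition and saveAsTextFile methods will do). The helper name save_as_single_file and its shuffle parameter are hypothetical and only serve this illustration.

```python
import time


def save_as_single_file(rdd, output_dir, shuffle=False):
    """Save rdd under a timestamped directory with a single partition.

    shuffle=False uses coalesce(1); shuffle=True uses repartition(1).
    Returns the directory path that was passed to saveAsTextFile.
    """
    single = rdd.repartition(1) if shuffle else rdd.coalesce(1)
    # time.time() returns a float, so convert it before concatenating
    path = output_dir + "/" + str(time.time())
    single.saveAsTextFile(path)
    return path


# Possible use in a streaming job, where `lines` is a DStream and
# `output` is the base output directory:
#   lines.foreachRDD(lambda rdd: save_as_single_file(rdd, output))
```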

The total times are for reference only and are not a final judgment of performance. A real performance comparison should be based on many more job runs.

Original task count for the saveAsTextFile stage (1020 tasks), total time 84.933 s for a job:
15/08/07 15:40:00 INFO scheduler.DAGScheduler: Final stage: Stage 3(saveAsTextFile at NativeMethodAccessorImpl.java:-2)
15/08/07 15:40:00 INFO cluster.YarnClusterScheduler: Adding task set 3.0 with 1020 tasks
15/08/07 15:41:25 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at NativeMethodAccessorImpl.java:-2) finished in 84.933 s

The influence of the task count on the saveAsTextFile stage is:
  • lines.foreachRDD(lambda rdd: rdd.coalesce(1).saveAsTextFile(output+"/"+str(time.time())))
    Total 64.971 s for a job:
    15/08/07 17:50:00 INFO scheduler.DAGScheduler: Final stage: Stage 4(saveAsTextFile at NativeMethodAccessorImpl.java:-2)......
    ./slave11.dc.tj_49877:15/08/07 17:50:00 INFO cluster.YarnClusterScheduler: Adding task set 4.0 with 1 tasks
    15/08/07 17:51:05 INFO scheduler.DAGScheduler: Stage 4 (saveAsTextFile at NativeMethodAccessorImpl.java:-2) finished in 64.971 s
  • lines.foreachRDD(lambda rdd: rdd.repartition(1).saveAsTextFile(output+"/"+str(time.time())))
    Total 72.229 s for a job:
    15/08/07 17:25:00 INFO scheduler.DAGScheduler: Final stage: Stage 4(saveAsTextFile at NativeMethodAccessorImpl.java:-2)
    15/08/07 17:26:05 INFO cluster.YarnClusterScheduler: Adding task set 4.0 with 1 tasks
    15/08/07 17:26:05 INFO scheduler.DAGScheduler: Stage 3 (repartition at NativeMethodAccessorImpl.java:-2) finished in 65.344 s
    15/08/07 17:26:12 INFO scheduler.DAGScheduler: Stage 4 (saveAsTextFile at NativeMethodAccessorImpl.java:-2) finished in 6.885 s 
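
The per-stage times in these logs can be pulled out and summed with a short script. This is a sketch that assumes the "Stage N (name at ...) finished in X s" line format shown above; the names STAGE_RE and stage_durations are hypothetical. For the repartition run, the two stage durations add up to the 72.229 s total quoted for the job.

```python
import re

# Matches lines such as:
#   Stage 4 (saveAsTextFile at ...) finished in 6.885 s
STAGE_RE = re.compile(r"Stage (\d+) \((\w+) at .*?\) finished in ([\d.]+) s")


def stage_durations(log_text):
    """Return {(stage_id, operation): seconds} for every finished stage."""
    durations = {}
    for match in STAGE_RE.finditer(log_text):
        key = (int(match.group(1)), match.group(2))
        durations[key] = float(match.group(3))
    return durations


# The two "finished" lines from the repartition run above
REPARTITION_LOG = """\
15/08/07 17:26:05 INFO scheduler.DAGScheduler: Stage 3 (repartition at NativeMethodAccessorImpl.java:-2) finished in 65.344 s
15/08/07 17:26:12 INFO scheduler.DAGScheduler: Stage 4 (saveAsTextFile at NativeMethodAccessorImpl.java:-2) finished in 6.885 s
"""

durations = stage_durations(REPARTITION_LOG)
total = round(sum(durations.values()), 3)  # rounded to the log's precision
print(durations, total)
```

With repartition(1), almost all of the time is spent in the extra repartition stage, and the single-task saveAsTextFile stage itself takes only 6.885 s.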
