The way to control the number of tasks is to control the parallelism of each stage. There are two ways to control Spark parallelism: repartition and coalesce. The difference between the two is that repartition always performs a shuffle, while coalesce by default does not.
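To make the difference concrete, here is a toy model in plain Python. It is only a sketch and not Spark's real algorithm: partitions are modeled as lists, and the names toy_coalesce and toy_repartition are made up for illustration. The coalesce model glues whole neighbouring partitions together, so records never mix across groups; the repartition model deals every record out again, which is what the shuffle costs.

```python
def toy_coalesce(partitions, n):
    """Merge whole neighbouring partitions into at most n groups (no shuffle)."""
    p = len(partitions)
    n = min(n, p)  # without a shuffle the partition count cannot grow
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Each input partition goes to exactly one output partition, intact.
        merged[i * n // p].extend(part)
    return merged


def toy_repartition(partitions, n):
    """Redistribute every record round-robin over n partitions (a shuffle)."""
    out = [[] for _ in range(n)]
    k = 0
    for part in partitions:
        for rec in part:
            # Every single record is moved individually.
            out[k % n].append(rec)
            k += 1
    return out


parts = [[1, 2], [3], [4, 5, 6], [7]]
print(toy_coalesce(parts, 2))     # [[1, 2, 3], [4, 5, 6, 7]]
print(toy_repartition(parts, 2))  # [[1, 3, 5, 7], [2, 4, 6]]
```

With n = 1 both models end with a single partition holding all records; the difference is only in how the data gets there.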
Here is an example of controlling the number of output files by controlling the number of RDD partitions. By default, there will be thousands of part-00* files in each RDD save location. This can be dangerous if the files are small. My goal is to generate just one part-00* file for each RDD.
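Since saveAsTextFile writes one part file per partition, the goal comes down to shrinking the RDD to one partition before saving. A minimal sketch of that idea as a helper follows; the name save_with_n_files and its parameters are hypothetical, and it only assumes that the object passed in offers the RDD methods coalesce, repartition and saveAsTextFile.

```python
import time


def save_with_n_files(rdd, output, num_files=1, shuffle=False):
    """Save rdd under a timestamped directory using num_files partitions.

    Hypothetical helper: rdd is assumed to be a PySpark RDD (or anything
    with the same coalesce / repartition / saveAsTextFile methods).
    """
    # time.time() returns a float, so it must be converted before concatenation.
    path = output + "/" + str(time.time())
    if shuffle:
        shrunk = rdd.repartition(num_files)  # full shuffle
    else:
        shrunk = rdd.coalesce(num_files)     # merges partitions, no shuffle
    shrunk.saveAsTextFile(path)
    return path
```

In a streaming job it could be used as lines.foreachRDD(lambda rdd: save_with_n_files(rdd, output)), assuming lines and output are defined as in the rest of this post.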
The total time is just for your reference and is not a judgment of final performance. Performance should be assessed over more job runs.
Original tasks for saveAsTextFile stage, total time 84.933 s for a job:
15/08/07 15:40:00 INFO scheduler.DAGScheduler: Final stage: Stage 3(saveAsTextFile at NativeMethodAccessorImpl.java:-2)
15/08/07 15:40:00 INFO cluster.YarnClusterScheduler: Adding task set 3.0 with 1020 tasks
15/08/07 15:41:25 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at NativeMethodAccessorImpl.java:-2) finished in 84.933 s
The influence of the number of tasks on the saveAsTextFile stage is:
- lines.foreachRDD(lambda rdd: rdd.coalesce(1).saveAsTextFile(output+"/"+str(time.time())))
Total 64.971 s for a job
15/08/07 17:50:00 INFO scheduler.DAGScheduler: Final stage: Stage 4(saveAsTextFile at NativeMethodAccessorImpl.java:-2)......
15/08/07 17:50:00 INFO cluster.YarnClusterScheduler: Adding task set 4.0 with 1 tasks
15/08/07 17:51:05 INFO scheduler.DAGScheduler: Stage 4 (saveAsTextFile at NativeMethodAccessorImpl.java:-2) finished in 64.971 s
- lines.foreachRDD(lambda rdd: rdd.repartition(1).saveAsTextFile(output+"/"+str(time.time())))
Total 72.229 s for a job (65.344 s for the repartition stage plus 6.885 s for the saveAsTextFile stage):
15/08/07 17:25:00 INFO scheduler.DAGScheduler: Final stage: Stage 4(saveAsTextFile at NativeMethodAccessorImpl.java:-2)
15/08/07 17:26:05 INFO cluster.YarnClusterScheduler: Adding task set 4.0 with 1 tasks
15/08/07 17:26:05 INFO scheduler.DAGScheduler: Stage 3 (repartition at NativeMethodAccessorImpl.java:-2) finished in 65.344 s
15/08/07 17:26:12 INFO scheduler.DAGScheduler: Stage 4 (saveAsTextFile at NativeMethodAccessorImpl.java:-2) finished in 6.885 s
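The totals quoted above can be recomputed from the "finished in" log lines. The following is a small sketch that assumes the DAGScheduler log format shown in this post; the regular expression and the function name stage_durations are my own and not part of Spark.

```python
import re

# Matches lines such as:
#   "... Stage 3 (repartition at NativeMethodAccessorImpl.java:-2) finished in 65.344 s"
STAGE_DONE = re.compile(r"Stage (\d+) \((\w+) at .*?\) finished in ([\d.]+) s")


def stage_durations(log_text):
    """Return a list of (stage_id, operation, seconds) for every finished stage."""
    return [
        (int(m.group(1)), m.group(2), float(m.group(3)))
        for m in STAGE_DONE.finditer(log_text)
    ]


REPARTITION_LOG = """\
15/08/07 17:26:05 INFO scheduler.DAGScheduler: Stage 3 (repartition at NativeMethodAccessorImpl.java:-2) finished in 65.344 s
15/08/07 17:26:12 INFO scheduler.DAGScheduler: Stage 4 (saveAsTextFile at NativeMethodAccessorImpl.java:-2) finished in 6.885 s
"""

stages = stage_durations(REPARTITION_LOG)
total = round(sum(seconds for _, _, seconds in stages), 3)
print(stages)
print(total)  # 72.229
```

In the repartition run the single-task saveAsTextFile stage itself is short (6.885 s); most of the time moves into the extra repartition stage, whereas in the coalesce run everything is counted in one stage (64.971 s).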