As we saw in the previous article "CSV Parser", we may need to create a new object for each record of an RDD, as in:
import au.com.bytecode.opencsv.CSVParser  // opencsv

def mLine(line: String) = {
  val parser = new CSVParser('\t')  // a fresh parser is built for every record
  parser.parseLine(line)
}
...
myRDD.map(mLine(_).size)
...
The mLine function is passed to the map method of the RDD. In this case a parser object is created for every single record, even though all of those parsers are exactly the same.
In fact, whenever we need to apply a complicated operation to each record, there is a good chance we need to create some helper object inside map.
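The pattern is not specific to CSV parsing. As an illustration (this date-parsing job is my own hypothetical example, not from the original article), a per-record SimpleDateFormat has exactly the same cost profile:

import java.text.SimpleDateFormat

// Hypothetical job: a new SimpleDateFormat is allocated for every record.
def toEpoch(line: String): Long = {
  val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")  // built once per record
  fmt.parse(line.split('\t')(0)).getTime
}
...
myRDD.map(toEpoch)
...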
By combining mapPartitions with Scala's map on the partition iterator, we can get rid of this unnecessary object creation. Let's rewrite the CSV example above with mapPartitions:
def pLines(lines: Iterator[String]) = {
  val parser = new CSVParser('\t')  // one parser per partition, reused for every record in it
  lines.map(parser.parseLine(_).size)
}
...
myRDD.mapPartitions(pLines)
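To make the comparison easy to run end to end, here is a minimal, self-contained sketch of both versions. It is an illustration only: the local[*] master, the data.tsv input path, and the use of sum() as the action are my assumptions, and opencsv (au.com.bytecode.opencsv) must be on the classpath.

import au.com.bytecode.opencsv.CSVParser
import org.apache.spark.{SparkConf, SparkContext}

object MapPartitionsDemo {
  // Per-record version: one CSVParser per line.
  def mLine(line: String) = {
    val parser = new CSVParser('\t')
    parser.parseLine(line)
  }

  // Per-partition version: one CSVParser per partition.
  def pLines(lines: Iterator[String]) = {
    val parser = new CSVParser('\t')
    lines.map(parser.parseLine(_).size)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("mapPartitions-demo").setMaster("local[*]"))
    val myRDD = sc.textFile("data.tsv")  // hypothetical input path

    val slow = myRDD.map(mLine(_).size).sum()     // parser created per record
    val fast = myRDD.mapPartitions(pLines).sum()  // parser created per partition
    println(s"field counts: map=$slow, mapPartitions=$fast")
    sc.stop()
  }
}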
On my single-box test machine, the execution time of the same task dropped from 65 seconds to 35 seconds. Surprisingly, the opencsv parser with the mapPartitions optimization is even significantly faster than map(_.split('\t')).
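If you want to reproduce the measurement, a rough wall-clock harness along these lines should do (the time helper is my own illustration; count() is the action that forces the lazy transformations to actually run):

def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label took ${(System.nanoTime() - start) / 1e9}%.1f s")
  result
}

// Consider calling myRDD.cache() and running an action once beforehand,
// so disk I/O does not dominate the numbers.
time("map")           { myRDD.map(mLine(_).size).count() }
time("mapPartitions") { myRDD.mapPartitions(pLines).count() }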