I have 450K JSON files in HDFS, and I want to rename them based on certain rules. For the sake of simplicity I just add a suffix .finished to each of them.
I managed to do this with the following code:
import org.apache.hadoop.fs._
val hdfs = FileSystem.get(sc.hadoopConfiguration)
val files = hdfs.listStatus(new Path(pathToJson))
val originalPath = files.map(_.getPath())
for (i <- originalPath.indices) {
  hdfs.rename(originalPath(i), originalPath(i).suffix(".finished"))
}
But it takes 12 minutes to rename all of them. Is there a way to make it faster? (Perhaps by parallelizing?)
I use Spark 1.6.0.
Solution
originalPath.par.foreach(e => hdfs.rename(e, e.suffix(".finished")))
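By default, a `.par` collection uses a fork-join pool sized to the number of CPU cores. Renames are I/O-bound RPC calls to the NameNode, not CPU work, so you can often go faster by giving the parallel collection a larger explicit pool. A sketch under the assumptions that this runs in a Spark 1.6 shell (Scala 2.10 parallel collections), that `pathToJson` is defined as in the question, and that the pool size of 64 is just a starting point to tune:

```scala
import org.apache.hadoop.fs._
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

val hdfs = FileSystem.get(sc.hadoopConfiguration)
val originalPath = hdfs.listStatus(new Path(pathToJson)).map(_.getPath())

// Give the parallel collection an explicit pool: rename is an I/O-bound
// NameNode RPC, so more threads than cores can help. Tune the size (64 here)
// to what your NameNode tolerates.
val parPaths = originalPath.par
parPaths.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(64))
parPaths.foreach(p => hdfs.rename(p, p.suffix(".finished")))
```

Note that all threads share one `FileSystem` instance; `DistributedFileSystem` is generally safe to use concurrently, but too many threads just queue up on the NameNode, so raising the pool size has diminishing returns.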