HadoopRDD getPartitions

Latest recommended article published on 2021-05-13 19:39:12

houzhizhen


License: CC 4.0 BY-SA

Category: spark

Article link: https://blog.youkuaiyun.com/houzhizhen/article/details/64128477


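This article describes how HadoopRDD partitions its input, including how the number of partitions is derived from the file size and the default block size so that the data can be processed in parallel. It also includes the relevant implementation code.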


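HadoopRDD.getPartitions returns one partition per input split reported by the InputFormat. In other words, for file-based input:
If the file is smaller than the default block size, it yields 1 split.
If the file is larger than the default block size, it yields file_size / default_block_size + (file_size % default_block_size == 0 ? 0 : 1) splits, that is, the file size divided by the block size, rounded up.

On top of that, minPartitions (default 2, or 1 when the default parallelism is 1) is passed to getSplits as the requested number of splits, so the result is normally at least minPartitions. For example, a single file smaller than the default block size is divided into minPartitions splits rather than 1. The InputFormat treats this value as a hint, so a file that cannot be split may still produce fewer partitions.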
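The rules above can be sketched as a small standalone calculation. This is a simplified model of how the old-API FileInputFormat is understood to size its splits, not the actual Hadoop code: the object and method names (SplitCountSketch, splitSize, splitsForFile, countSplits) are made up for illustration, the minimum split size is assumed to be 1 byte, and zero-length or non-splittable files are not modelled.

```scala
object SplitCountSketch {
  // Slack factor: the last split of a file may be up to 10% larger than splitSize.
  val SplitSlop = 1.1

  // Target split size: the smaller of the block size and totalSize / minPartitions,
  // but never below minSize (assumed to be 1 byte here).
  def splitSize(totalSize: Long, blockSize: Long, minPartitions: Int, minSize: Long = 1L): Long = {
    val goalSize = totalSize / math.max(minPartitions, 1)
    math.max(minSize, math.min(goalSize, blockSize))
  }

  // Number of splits produced for one file of fileSize bytes (fileSize > 0 assumed).
  def splitsForFile(fileSize: Long, splitSize: Long): Int = {
    var remaining = fileSize
    var count = 0
    while (remaining.toDouble / splitSize > SplitSlop) {
      count += 1
      remaining -= splitSize
    }
    if (remaining != 0) count += 1 // the tail becomes the last split
    count
  }

  // Total splits, and therefore partitions, for a set of files.
  def countSplits(fileSizes: Seq[Long], blockSize: Long, minPartitions: Int): Int = {
    val size = splitSize(fileSizes.sum, blockSize, minPartitions)
    fileSizes.map(splitsForFile(_, size)).sum
  }
}
```

With a 128 MB block size and minPartitions = 2, a single 10 MB file gives 2 splits in this model and a 300 MB file gives 3. Because of the slack factor, the rounded-up formula is only approximate: a 130 MB file with minPartitions = 1 stays in a single split, since the 2 MB tail is within 10% of the split size.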

override def getPartitions: Array[Partition] = {
  val jobConf = getJobConf()
  // add the credentials here as this can be called before SparkContext initialized
  SparkHadoopUtil.get.addCredentials(jobConf)
  val inputFormat = getInputFormat(jobConf)
  // minPartitions is passed to the InputFormat as the requested number of splits
  val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
  // wrap each InputSplit in a HadoopPartition, one partition per split
  val array = new Array[Partition](inputSplits.size)
  for (i <- 0 until inputSplits.size) {
    array(i) = new HadoopPartition(id, i, inputSplits(i))
  }
  array
}