Partitioning and Compute Strategy of Spark's textFile

Latest recommended article published on 2024-07-09 07:15:00

zhou12314456

License: CC 4.0 BY-SA

Category: java. Tags: textFile, RDD, spark, partition

Article link: https://blog.youkuaiyun.com/zhou12314/article/details/88748230

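This article discusses how textFile is used in Spark and lists common RDD types such as JDBCRDD, HBaseRDD, and sequenceFile. It focuses on the partitioning flow of textFile, starting from SparkContext's textFile method and covering the minPartitions parameter and TextInputFormat. It then looks closely at HadoopRDD's getPartitions() method and FileInputFormat's getSplits() method, and finally outlines the RDD compute strategy.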

Haha, feel free to follow me first; let's discuss Spark and Flink usage and performance tuning together.

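Some common RDDs

In day-to-day development we often use the following RDDs:
textFile: an RDD built from text files
JDBCRDD: an RDD built by reading a relational database
HBaseRDD: an RDD built by reading HBase
sequenceFile: an RDD built from serialized (sequence) files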

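I will go through these RDDs one by one, focusing on two of the five core RDD properties: partitioning and the compute strategy. These are the functions you mainly have to implement when writing a custom RDD, and I will try to annotate the code with clear comments.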
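
To make "partitioning" and "compute strategy" concrete before reading the Spark source, here is a minimal sketch in plain Java. It is a toy model only: ToyRdd, RangeRdd, and ToyRddDemo are hypothetical names, not Spark classes, and the real RDD API has more to it (for example a TaskContext argument on compute). It only shows that one hook decides how the data is cut up and the other produces the records of a single piece.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy stand-in for an RDD (NOT Spark's API): only the two hooks discussed here.
abstract class ToyRdd<T> {
    // Partitioning strategy: how the dataset is cut into independent pieces.
    abstract int[] getPartitions();

    // Compute strategy: how the records of one partition are produced.
    abstract Iterator<T> compute(int partitionIndex);

    // Helper that walks every partition in order and gathers the records.
    List<T> collect() {
        List<T> out = new ArrayList<>();
        for (int p : getPartitions()) {
            compute(p).forEachRemaining(out::add);
        }
        return out;
    }
}

// Hypothetical example: the integers [0, n) cut into `slices` contiguous ranges.
class RangeRdd extends ToyRdd<Integer> {
    private final int n;
    private final int slices;

    RangeRdd(int n, int slices) {
        this.n = n;
        this.slices = slices;
    }

    @Override
    int[] getPartitions() {
        int[] ids = new int[slices];
        for (int i = 0; i < slices; i++) {
            ids[i] = i;
        }
        return ids;
    }

    @Override
    Iterator<Integer> compute(int partitionIndex) {
        // Long arithmetic avoids int overflow when computing the range bounds.
        int start = (int) ((long) partitionIndex * n / slices);
        int end = (int) ((long) (partitionIndex + 1) * n / slices);
        List<Integer> values = new ArrayList<>();
        for (int v = start; v < end; v++) {
            values.add(v);
        }
        return values.iterator();
    }
}

class ToyRddDemo {
    public static void main(String[] args) {
        RangeRdd rdd = new RangeRdd(10, 3);
        System.out.println(rdd.getPartitions().length + " partitions: " + rdd.collect());
    }
}
```

In this toy, changing `slices` changes only how the work is divided, never the collected result, which is the separation the two functions are meant to provide.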

Partitioning flow

We start from the textFile method in SparkContext.

  /**
   * Read a text file from HDFS, a local file system (available on all nodes), or any
   * Hadoop-supported file system URI, and return it as an RDD of Strings.
   */
  def textFile(
      path: String,
      minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
    assertNotStopped()
    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      minPartitions).map(pair => pair._2.toString).setName(path)
  }

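path: the path of the file(s) to read.
minPartitions: a suggested minimum number of partitions. The default is defaultMinPartitions, which is math.min(defaultParallelism, 2), so it is at most 2; you can also pass your own value. It does not fix the number of splits by itself: it is handed to getSplits as a hint, and the final split count depends on the number and sizes of the files together with that hint.
Arguments passed to hadoopFile: TextInputFormat is the input format, LongWritable is the input key type (the byte offset of each line), Text is the input value type (the line content), and minPartitions is the minimum number of partitions.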
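
To see why minPartitions is only a hint, the sketch below reproduces, as far as I understand it, the split-size arithmetic of the old-API (org.apache.hadoop.mapred) FileInputFormat for a single non-empty, splittable file. SplitSizeSketch and its method layout are hypothetical names for illustration; the block size and file size are made-up inputs, and the real method also handles multiple files, empty files, and non-splittable formats.

```java
class SplitSizeSketch {
    // A trailing chunk up to 10% larger than splitSize is kept as one split.
    static final double SPLIT_SLOP = 1.1;

    // The hint (numSplits, i.e. minPartitions) only yields a goal size,
    // which is then clamped by the block size and the minimum split size.
    static long computeSplitSize(long totalSize, int numSplits, long minSize, long blockSize) {
        long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Number of splits produced for ONE non-empty, splittable file.
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // the last, possibly short, split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1L << 20;
        long fileLength = 300 * mb;   // assumed file size
        long blockSize = 128 * mb;    // assumed block size
        for (int hint : new int[] {2, 10}) {
            long splitSize = computeSplitSize(fileLength, hint, 1, blockSize);
            System.out.println("hint=" + hint + " splits=" + countSplits(fileLength, splitSize));
        }
    }
}
```

Under these assumed numbers, a hint of 2 gives a goal size of 150 MB, which is capped at the 128 MB block size and yields 3 splits, while a hint of 10 gives 30 MB splits and yields 10. So the hint can raise the split count, but a small hint does not force a small count.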

Enter the hadoopFile function

 def hadoopFile[K, V](
      path: String,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
    assertNotStopped()

    // This is a hack to enforce loading hdfs-site.xml.
    // See SPARK-11227 for details.
    // Get the local file system (called only for the side effect described above)
    FileSystem.getLocal(hadoopConfiguration)

    // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
    // Broadcast hadoopConfiguration so it is shipped to executors once; it is needed
    // later when the file system for the path's scheme is loaded.
    val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
    val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
    new HadoopRDD(
      this,
      confBroadcast,
      Some(setInputPathsFunc),
      inputFormatClass,
      keyClass,
      valueClass,
      minPartitions).setName(path)
  }

The code above does three things:
1. Takes hadoopConfiguration and broadcasts it
2. Builds the function that sets the job's input path
3. Instantiates HadoopRDD
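
The three steps can be sketched in plain Java to show the pattern: share one base configuration, and keep the path setting as a function that is applied later to a fresh copy. Everything here is a hypothetical stand-in (DeferredConfSketch, ToyHadoopRdd, the Map-based configuration, and the "sketch.input.dir" key are my own names), and I am assuming that the real HadoopRDD applies the function to a copy of the configuration in getJobConf.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

class DeferredConfSketch {
    // Hypothetical key; real Hadoop stores the input path under its own property name.
    static final String INPUT_DIR_KEY = "sketch.input.dir";

    // Stand-in for HadoopRDD: holds the shared base config and the deferred setter.
    static final class ToyHadoopRdd {
        private final Map<String, String> sharedConf;
        private final Consumer<Map<String, String>> initLocalConf;

        ToyHadoopRdd(Map<String, String> sharedConf, Consumer<Map<String, String>> initLocalConf) {
            this.sharedConf = sharedConf;
            this.initLocalConf = initLocalConf;
        }

        // Each call copies the shared config, then applies the setter to the copy,
        // so the shared config itself is never modified.
        Map<String, String> getJobConf() {
            Map<String, String> jobConf = new HashMap<>(sharedConf);
            initLocalConf.accept(jobConf);
            return jobConf;
        }
    }

    static ToyHadoopRdd hadoopFile(Map<String, String> hadoopConfiguration, String path) {
        // Step 1: an immutable snapshot plays the role of the broadcast configuration.
        Map<String, String> confBroadcast = Map.copyOf(hadoopConfiguration);
        // Step 2: the input path is captured in a function instead of being set right away.
        Consumer<Map<String, String>> setInputPathsFunc = jobConf -> jobConf.put(INPUT_DIR_KEY, path);
        // Step 3: both are handed to the RDD-like object.
        return new ToyHadoopRdd(confBroadcast, setInputPathsFunc);
    }

    public static void main(String[] args) {
        Map<String, String> base = new HashMap<>();
        base.put("fs.defaultFS", "file:///");
        ToyHadoopRdd rdd = hadoopFile(base, "/data/in.txt");
        System.out.println(rdd.getJobConf());
    }
}
```

Deferring the path setting keeps the shared configuration free of per-RDD state, so one broadcast value could serve many RDDs that read different paths.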

Next we go into HadoopRDD and find getPartitions(); this is the key part.

  override def getPartitions: Array[Partition] = {
    // Get the job configuration
    val jobConf = getJobConf()
    // add the credentials here as this can be called before SparkContext initialized
    SparkHadoopUtil.get.addCredentials(jobConf)
    // Get the input format, i.e. the input format class passed in earlier
    val inputFormat = getInputFormat(jobConf)
    // Compute the input splits
    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    val array = new Array[Partition](inputSplits.size)
    for (i <- 0 until inputSplits.size) {
      // Wrap each split in a HadoopPartition so later stages can use it directly
      array(i) = new HadoopPartition(id, i, inputSplits(i))
    }
    array
  }

This method mainly relies on the input format's getSplits (here, the one inherited from FileInputFormat) to compute the splits.
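
The wrapping step in getPartitions is simple enough to model without Spark: one partition per split, numbered by position. The sketch below is plain Java with hypothetical classes (PartitionWrapSketch, ToySplit, ToyPartition) standing in for InputSplit and HadoopPartition; the paths and offsets are made up.

```java
import java.util.Arrays;
import java.util.List;

class PartitionWrapSketch {
    // Hypothetical stand-in for a file split: a byte range of one file.
    static final class ToySplit {
        final String path;
        final long start;
        final long length;

        ToySplit(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }
    }

    // Hypothetical stand-in for HadoopPartition: RDD id, position, and the wrapped split.
    static final class ToyPartition {
        final int rddId;
        final int index;
        final ToySplit split;

        ToyPartition(int rddId, int index, ToySplit split) {
            this.rddId = rddId;
            this.index = index;
            this.split = split;
        }
    }

    // One partition per split; the partition index is the split's position in the list.
    static ToyPartition[] getPartitions(int rddId, List<ToySplit> inputSplits) {
        ToyPartition[] array = new ToyPartition[inputSplits.size()];
        for (int i = 0; i < inputSplits.size(); i++) {
            array[i] = new ToyPartition(rddId, i, inputSplits.get(i));
        }
        return array;
    }

    public static void main(String[] args) {
        List<ToySplit> splits = Arrays.asList(
            new ToySplit("/data/in.txt", 0, 128),
            new ToySplit("/data/in.txt", 128, 72));
        for (ToyPartition p : getPartitions(7, splits)) {
            System.out.println("partition " + p.index + " -> bytes from " + p.split.start);
        }
    }
}
```

The point of the model is that the partition count of the RDD equals the number of splits returned by the input format, so all of the sizing decisions live in getSplits.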

Now enter the getSplits method in FileInputFormat, the parent class of TextInputFormat.

 /** Splits files returned by {@link #listStatus(JobConf)} when
   * they're too big.*/ 
  public InputSplit[] getSplits(JobConf job, int numSplits)
    throws IOException {
    // List the FileStatus of every input file
    FileStatus[] files = listStatus(job);

    // Save the number of input files for metrics/loadgen
    job.setLong(NUM_INPUT_FILES, files.length);
    long totalSize = 0;