A key concept in big-data computation is distributed parallel computing: the original data is split into several pieces, which are distributed to multiple machines (or to multiple executor containers virtualized on a single machine) and processed with the same logic in parallel, first a distribution step (map) and then an aggregation step (reduce).
The question, then, is how the original file actually gets split. When Spark reads different data sources, the split logic is different.
First of all, Spark does provide functions for changing the number of partitions, namely coalesce() and repartition(), but those two methods only take effect once the data has already been read (they work through the shuffle machinery), and the spark.default.parallelism setting likewise only applies to shuffles. So there is no single parameter that fixes the number of partitions Spark uses when reading data; it is the result of a fairly involved calculation.
This post looks at how Spark reads a [local] [splittable] [single] file. Note the bracketed words: Spark handles different kinds of data sources differently.
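For example, a quick spark-shell sketch (the file path is hypothetical, and the printed counts depend on your environment) of what these methods do and do not control:

// repartition()/coalesce() reshape an RDD that has already been read;
// they do not decide how many partitions textFile() produces in the first place.
val lines = sc.textFile("/tmp/input.txt")   // partition count comes from the split logic discussed below
println(lines.getNumPartitions)             // e.g. 2 -- not fixed by any single config value

val more  = lines.repartition(8)            // shuffle into 8 partitions
val fewer = more.coalesce(4)                // merge down to 4 partitions without a shuffle
println(more.getNumPartitions + " " + fewer.getNumPartitions)   // 8 4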
1. The Spark read code
Here we simply reuse the JavaWordCount example from the Spark source code.
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;

import scala.Tuple2;

public final class JavaWordCount {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) throws Exception {
    if (args.length < 1) {
      System.err.println("Usage: JavaWordCount <file>");
      System.exit(1);
    }

    SparkSession spark = SparkSession
      .builder()
      .appName("JavaWordCount")
      .getOrCreate();
    spark.sparkContext().setLogLevel("WARN");

    // JavaRDD<String> lines = spark.read().textFile(args[0]).javaRDD();
    // Read through SparkContext.textFile() with minPartitions = 2.
    JavaRDD<String> lines = spark.sparkContext().textFile(args[0], 2).toJavaRDD();
    System.out.println("lines.getNumPartitions(): " + lines.getNumPartitions());

    JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(SPACE.split(s)).iterator());
    JavaPairRDD<String, Integer> ones = words.mapToPair(s -> new Tuple2<>(s, 1));
    JavaPairRDD<String, Integer> counts = ones.reduceByKey((i1, i2) -> i1 + i2);

    List<Tuple2<String, Integer>> output = counts.collect();
    // for (Tuple2<?,?> tuple : output) {
    //   System.out.println(tuple._1() + ": " + tuple._2());
    // }
    spark.stop();
  }
}
2. The textFile() method
textFile() is the file-reading API Spark provides; following the call chain, textFile ultimately returns a new HadoopRDD.
textFile takes two parameters:
path: the path to read; it can be an exact file name or a directory name.
minPartitions: the minimum number of partitions. This is not necessarily the final partition count; it is used later when the partition count is computed (a short sketch follows).
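Before diving into the source, a small sketch (hypothetical paths; the actual numbers depend on file and block sizes) of what "minimum, not final" means in practice:

// minPartitions is only a lower-bound hint, not the final partition count.
val a = sc.textFile("/tmp/words.txt")        // minPartitions defaults to defaultMinPartitions
val b = sc.textFile("/tmp/words.txt", 2)     // ask for at least 2 partitions
val c = sc.textFile("/tmp/big-file.txt", 2)  // a large file can still produce many more
println(a.getNumPartitions + " " + b.getNumPartitions + " " + c.getNumPartitions)

Now the source of textFile():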
# SparkContext.scala
/**
 * Read a text file from HDFS, a local file system (available on all nodes), or any
 * Hadoop-supported file system URI, and return it as an RDD of Strings.
 * @param path path to the text file on a supported file system
 * @param minPartitions suggested minimum number of partitions for the resulting RDD
 * @return RDD of lines of the text file
 */
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  // Debug output added for this walkthrough ("pgx code")
  print("\n>>>>>> pgx code <<<<<<\n")
  print("defaultMinPartitions: " + defaultMinPartitions)
  print("\n")
  print("minPartitions: " + minPartitions)
  print("\n>>>>>> pgx code <<<<<<\n")
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
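For reference, the defaultMinPartitions used as the default value above is defined in SparkContext (in current Spark versions) as the smaller of defaultParallelism and 2:

# SparkContext.scala
/** Default min number of partitions for Hadoop RDDs when not given by user */
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
// defaultParallelism depends on the master/scheduler (e.g. the number of cores in local mode),
// so defaultMinPartitions is typically just 2.

Next, textFile hands off to hadoopFile: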
/** Get an RDD for a Hadoop file with an arbitrary InputFormat
 *
 * @note Because Hadoop's RecordReader class re-uses the same Writable object for each
 * record, directly caching the returned RDD or directly passing it to an aggregation or shuffle
 * operation will create many references to the same object.
 * If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first
 * copy them using a `map` function.
 * @param path directory to the input data files, the path can be comma separated paths
 * as a list of inputs
 * @param inputFormatClass storage format of the data to be read
 * @param keyClass `Class` of the key associated with the `inputFormatClass` parameter
 * @param valueClass `Class` of the value associated with the `inputFormatClass` parameter
 * @param minPartitions suggested minimum number of partitions for the resulting RDD
 * @return RDD of tuples of key and corresponding value
 */
def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
  assertNotStopped()

  // This is a hack to enforce loading hdfs-site.xml.
  // See SPARK-11227 for details.
  FileSystem.getLocal(hadoopConfiguration)

  // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
  val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
  val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
  new HadoopRDD(
    this,
    confBroadcast,
    Some(setInputPathsFunc),
    inputFormatClass,
    keyClass,
    valueClass,
    minPartitions).setName(path)
}
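In other words, textFile() is just hadoopFile() specialized to TextInputFormat, keeping only the line text of each (byte offset, line) record. Written out by hand (hypothetical path), it is roughly equivalent to:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Roughly what textFile("/tmp/input.txt", 2) expands to:
// each Hadoop record is (byte offset in file, line of text); keep only the line.
val lines = sc.hadoopFile("/tmp/input.txt",
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 2)
  .map(pair => pair._2.toString)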
3. Tracing the code from the action
Because Spark's transformation operators do not trigger any computation, we jump straight to the action operator that finally does: collect().
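As a tiny illustration of that laziness (hypothetical path):

// Nothing is read while we only chain transformations; the job is submitted
// only when the action collect() is called.
val lines = sc.textFile("/tmp/input.txt")                                 // no job yet
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)   // still no job
val result = counts.collect()                                             // the job runs here

With that in mind, here is collect() as exposed through the Java API: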
# JavaRDDLike.scala
/**
 * Return an array that contains all of the elements in this RDD.
 *
 * @note this method should only be used if the resulting array is expected to be small, as
 * all the data is loaded into the driver's memory.
 */
def collect(): JList[T] =
  rdd.collect().toSeq.asJava
The Java collect() just delegates to RDD.collect(), which calls the runJob method on the SparkContext object:
# RDD.scala
/**
 * Return an array that contains all of the elements in this RDD.
 *
 * @note This method should only be used if the resulting array is expected to be small, as
 * all the data is loaded into the driver's memory.
 */
def collect(): Array[T] = withScope {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}
This runJob method on SparkContext calls another runJob overload on itself, which starts executing the job.
Since we mainly care about the partition count, we keep following the third argument of that call, 0 until rdd.partitions.length.
# SparkContext.scala
/**
 * Run a job on all partitions in an RDD and return the results in an array.
 *
 * @param rdd target RDD to run tasks on
 * @param func a function to run on each partition of the RDD
 * @return in-memory collection with a result of the job (each collection element will contain
 * a result from one partition)
 */
def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
  runJob(rdd, func, 0 until rdd.partitions.length)
}
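Because func is executed once per partition, the returned array has exactly rdd.partitions.length elements. A quick sketch (assuming an existing SparkContext sc and a hypothetical file):

// One task per partition: the result array length equals the partition count.
val rdd = sc.textFile("/tmp/input.txt", 2)
val sizes: Array[Int] = sc.runJob(rdd, (iter: Iterator[String]) => iter.size)
println(sizes.length == rdd.partitions.length)   // true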
When SparkContext executes runJob, it has to fetch the partitions from the RDD object (0 until rdd.partitions.length).
The partitions method on the RDD calls getPartitions to obtain them:
# RDD.scala
/**
 * Get the array of partitions of this RDD, taking into account whether the
 * RDD is checkpointed or not.
 */
final def partitions: Array[Partition] = {
  checkpointRDD.map(_.partitions).getOrElse {
    if (partitions_ == null) {
      stateLock.synchronized {
        if (partitions_ == null) {
          partitions_ = getPartitions
          partitions_.zipWithIndex.foreach { case (partition, index) =>
            require(partition.index == index,
              s"partitions($index).partition == ${partition.index}, but it should equal $index")
          }
        }
      }
    }
    partitions_
  }
}
This getPartitions method has multiple implementations. Since we saw earlier that textFile ultimately creates a HadoopRDD, the one to look at here is getPartitions() in HadoopRDD.scala.
In getPartitions() we can see that the returned array is determined by inputSplits,
and inputSplits in turn comes from allInputSplits.
So