Feel free to follow along; in this series we'll discuss Spark and Flink usage and performance tuning together.
Several common kinds of RDDs
In day-to-day development we frequently use the following kinds of RDDs (a short sketch of how each one is created follows this list):
textFile: an RDD created from a text file
JDBCRDD: an RDD created by reading a relational database over JDBC
HBaseRDD: an RDD created by reading from HBase
sequenceFile: an RDD created from a Hadoop SequenceFile
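Here is a minimal sketch (not from the original post) of how each of these RDDs is typically created; the paths, JDBC URL, table and column names are placeholders, and the HBase case is only mentioned in a comment because it needs extra dependencies:

import java.sql.DriverManager

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD

object CommonRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("common-rdds").setMaster("local[*]"))

    // textFile: one String element per line of the input file(s)
    val lines = sc.textFile("hdfs:///data/input.txt")

    // sequenceFile: the key/value types must match what was written into the SequenceFile
    val pairs = sc.sequenceFile[String, Int]("hdfs:///data/input.seq")

    // JdbcRDD: the SQL must contain exactly two '?' placeholders, which Spark fills
    // with the per-partition lower/upper bounds derived from lowerBound/upperBound
    val rows = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:mysql://dbhost:3306/testdb", "user", "pwd"),
      "SELECT id, name FROM users WHERE id >= ? AND id <= ?",
      lowerBound = 1L, upperBound = 100000L, numPartitions = 4,
      mapRow = rs => (rs.getLong(1), rs.getString(2)))

    // HBase is usually read with sc.newAPIHadoopRDD and HBase's TableInputFormat
    // (needs the hbase-client/hbase-mapreduce dependencies), so it is omitted here.

    println(lines.count() + pairs.count() + rows.count())
    sc.stop()
  }
}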
I will analyze each of these RDDs in turn, focusing on two of the five core properties of an RDD: the partitioning strategy and the compute strategy. I will try to keep the code well annotated.
The emphasis is on an RDD's partitioning and compute logic, because these are the key functions you have to implement when writing a custom RDD; a minimal custom-RDD sketch follows.
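As a concrete illustration (not from the Spark source; the class names RangeRDD/RangePartition and the range-splitting logic are made up for this sketch), here is a minimal custom RDD that implements exactly those two methods: getPartitions decides how the data is cut into partitions, and compute produces the records of one partition.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// One partition = one contiguous sub-range of [start, end)
class RangePartition(val index: Int, val start: Long, val end: Long) extends Partition

class RangeRDD(sc: SparkContext, start: Long, end: Long, numSlices: Int)
  extends RDD[Long](sc, Nil) {

  // Partitioning strategy: cut [start, end) into numSlices contiguous ranges
  override def getPartitions: Array[Partition] = {
    val step = math.max(1L, (end - start) / numSlices)
    (0 until numSlices).map { i =>
      val s = start + i * step
      val e = if (i == numSlices - 1) end else s + step
      new RangePartition(i, s, e): Partition
    }.toArray
  }

  // Compute strategy: materialize the numbers of one range when the task actually runs
  override def compute(split: Partition, context: TaskContext): Iterator[Long] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }
}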
The partitioning flow
We start the analysis from the textFile method in SparkContext.
/**
 * Read a text file from HDFS, a local file system (available on all nodes), or any
 * Hadoop-supported file system URI, and return it as an RDD of Strings.
 */
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
path: the path of the file(s) to read
minPartitions: the minimum number of partitions, a hint rather than an exact count. It defaults to defaultMinPartitions (math.min(defaultParallelism, 2)), and you can pass your own value. It does not fix the number of splits by itself; the actual split count is driven by the number and size of the input files together with this hint. A short usage sketch follows these notes.
The hadoopFile call takes a few parameters: TextInputFormat is the input format class, LongWritable is the input key type,
Text is the input value type, and minPartitions is the minimum number of partitions.
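A quick usage sketch (the path is hypothetical) to make the minPartitions hint concrete:

val rdd = sc.textFile("hdfs:///data/access.log", minPartitions = 8)
// The hint raises the lower bound, but the actual partition count still
// depends on how many input files there are and how large they are.
println(rdd.getNumPartitions)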
Now step into the hadoopFile function.
def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
  assertNotStopped()
  // This is a hack to enforce loading hdfs-site.xml.
  // See SPARK-11227 for details.
  // Touch the local file system so the Hadoop configuration gets loaded
  FileSystem.getLocal(hadoopConfiguration)

  // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
  // hadoopConfiguration is wrapped in a broadcast variable because it ultimately has to be
  // shipped to the executors; it is needed when the file system for the URI scheme is loaded.
  val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
  val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
  new HadoopRDD(
    this,
    confBroadcast,
    Some(setInputPathsFunc),
    inputFormatClass,
    keyClass,
    valueClass,
    minPartitions).setName(path)
}
The code above mainly does three things (a sketch of the equivalent direct hadoopFile call follows this list):
1. Gets hadoopConfiguration and broadcasts it
2. Builds the function that sets the job's input path(s)
3. Instantiates a HadoopRDD
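For reference, here is a sketch of calling hadoopFile directly, which is essentially what textFile does before mapping the values to String (the path is a placeholder):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Same types textFile uses: key = byte offset within the file, value = the line itself
val raw = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/input.txt", 2)
val lines = raw.map(_._2.toString)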
Next, let's go into HadoopRDD and look at getPartitions(); this is the key part.
override def getPartitions: Array[Partition] = {
  // Get the job configuration
  val jobConf = getJobConf()
  // add the credentials here as this can be called before SparkContext initialized
  SparkHadoopUtil.get.addCredentials(jobConf)
  // Get the input format; it is the InputFormat class specified earlier (TextInputFormat here)
  val inputFormat = getInputFormat(jobConf)
  // Compute the input splits
  val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
  val array = new Array[Partition](inputSplits.size)
  for (i <- 0 until inputSplits.size) {
    // Wrap each InputSplit in a HadoopPartition so it can be used directly later on
    array(i) = new HadoopPartition(id, i, inputSplits(i))
  }
  array
}
This method mainly delegates to FileInputFormat.getSplits to compute the splits.
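For intuition before diving into the code, here is a rough back-of-the-envelope sketch of the split math. It is based on the standard Hadoop formula max(minSize, min(goalSize, blockSize)), which appears further down inside getSplits and is not yet visible in the excerpt below; the numbers are made up:

val totalSize     = 1024L << 20                              // 1 GB of input in total
val blockSize     = 128L << 20                               // HDFS block size
val minPartitions = 2                                        // the numSplits hint passed in by Spark
val goalSize  = totalSize / minPartitions                    // 512 MB
val splitSize = math.max(1L, math.min(goalSize, blockSize))  // 128 MB
val numSplits = math.ceil(totalSize.toDouble / splitSize).toInt // about 8 splits
println(numSplits)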
Now step into the getSplits method of FileInputFormat, which is the parent class of TextInputFormat.
/** Splits files returned by {@link #listStatus(JobConf)} when
 * they're too big.*/
public InputSplit[] getSplits(JobConf job, int numSplits)
    throws IOException {
  // List all input files as FileStatus objects
  FileStatus[] files = listStatus(job);

  // Save the number of input files for metrics/loadgen
  job.setLong(NUM_INPUT_FILES, files.length);
  long totalSize = 0;