
Spark
GScallion
Spark: Summary of Technical Points
I. Solutions when an OOM occurs
  1. Driver out of memory
     1) Reading too much data: increase Driver memory with --driver-memory.
     2) Data pulled back to the Driver: collect returns large volumes of data to the Driver; use foreach instead (see the sketch after this entry).
  2. Executor out of memory
     1) map-type operations (map, flatMap, filter, mapPartitions, etc.) generate large amounts of data.
        - Use repartition to reduce the data each task computes, and thereby each task's output.
        - Reduce intermediate output: use mapPartitions in place of multiple map…
Original post: 2022-04-25 11:12:02
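A minimal sketch of the collect-versus-foreach point above, assuming a spark-shell session with a live SparkContext named sc (the data set is an illustrative placeholder):

val big = sc.parallelize(1 to 10000000)
// Risky: collect() would materialize all ten million elements in the Driver heap.
// val all = big.collect()
// Safer: foreach runs on the Executors; nothing is shipped back to the Driver.
big.foreach { x => if (x % 1000000 == 0) println(x) }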
Spark: coalesce / repartition Source Code Analysis
Spark version: 2.4.0
Source location: org/apache/spark/rdd/RDD.scala
Example:
scala> val x = (1 to 10).toList
x: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> val df1 = x.toDF("number")
df1: org.apache.spark.sql.DataFrame = [number: int]
scala> df1.rdd.partitions.size…
Original post: 2021-02-01 15:46:38
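For context, in Spark 2.4.0's RDD.scala, repartition(n) is defined as coalesce(n, shuffle = true). A REPL sketch of the practical difference, again assuming a spark-shell session with sc available:

scala> val rdd = sc.parallelize(1 to 10, 4)
scala> rdd.coalesce(2).getNumPartitions    // narrows partitions without a shuffle
res0: Int = 2
scala> rdd.repartition(8).getNumPartitions // increases partitions; always shuffles
res1: Int = 8

Note that coalesce(8) with the default shuffle = false would stay at 4 partitions: coalesce can only increase the partition count when a shuffle is allowed.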
Spark: accumulator Source Code Analysis
Spark version: 2.4.0
Source location: org/apache/spark/util/AccumulatorV2.scala
Source code: when a Long accumulator is needed, the following class extending the AccumulatorV2 abstract base is used:
/**
 * An [[AccumulatorV2 accumulator]] for computing sum, count, and average of 64-bit integers.
 *
 * @since 2.0.0
 */
class LongAccumulator extends AccumulatorV2[jl…
Original post: 2021-01-28 16:54:26
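A minimal usage sketch of the built-in LongAccumulator, assuming a live SparkContext named sc:

val errorCount = sc.longAccumulator("errorCount")
sc.parallelize(1 to 100).foreach { i =>
  if (i % 10 == 0) errorCount.add(1) // add() executes on the Executors
}
println(errorCount.value) // read on the Driver: 10

Accumulators are effectively write-only on the Executors; only the Driver should read .value.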
Spark: broadcast Source Code Analysis
Spark version: 2.4.0
Source location: org/apache/spark/SparkContext.scala
Example:
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
Original post: 2021-01-28 15:16:57
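A sketch of the typical map-side-lookup use of a broadcast variable, assuming a live SparkContext sc and an illustrative lookup table:

// Ship the small lookup table to every Executor once, instead of once per task.
val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))
val named = sc.parallelize(Seq(1, 2, 1, 3)).map { k =>
  lookup.value.getOrElse(k, "unknown")
}
named.collect()    // Array(a, b, a, unknown)
lookup.unpersist() // release Executor-side copies when done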
Spark: sortByKey Source Code Analysis
Spark version: 2.4.0
Source location: org.apache.spark.rdd.OrderedRDDFunctions
Source code:
/**
 * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
 * `collect` or `save` on the resulting RDD will return or output an ordered list of…
Original post: 2021-01-27 16:42:40
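A minimal usage sketch, assuming a live SparkContext sc. As the doc comment says, sortByKey range-partitions the data so that partition i holds keys that order before those in partition i + 1:

val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
pairs.sortByKey().collect()                  // Array((a,1), (b,2), (c,3))
pairs.sortByKey(ascending = false).collect() // Array((c,3), (b,2), (a,1))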
Spark: groupByKey Source Code Analysis
Spark version: 2.4.0
Source location: org.apache.spark.rdd.PairRDDFunctions
groupByKey(): RDD[(K, Iterable[V])]
groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
Example:
val source: RDD[(Int, Int)] = sc.parallelize(Seq((1, 1), (1, 2), (2, 2), (2, 3)))
val groupByKeyRDD: RDD…
Original post: 2021-01-26 18:40:33
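A sketch completing an example of this shape, assuming a live SparkContext sc:

import org.apache.spark.rdd.RDD

val source: RDD[(Int, Int)] = sc.parallelize(Seq((1, 1), (1, 2), (2, 2), (2, 3)))
val grouped: RDD[(Int, Iterable[Int])] = source.groupByKey()
grouped.mapValues(_.toList).collect() // e.g. Array((1, List(1, 2)), (2, List(2, 3)))

Because groupByKey shuffles every value, prefer reduceByKey or aggregateByKey when all you need is a per-key reduction.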
Spark: foldByKey Source Code Analysis
Spark version: 2.4.0
Source location: org.apache.spark.rdd.PairRDDFunctions
foldByKey(zeroValue: V, numPartitions: Int)(func: (V, V) => V): RDD[(K, V)]
foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)]
Example:
object FoldByKeyDemo {
  def main(args: Array[String]):…
Original post: 2021-01-26 17:59:03
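A minimal usage sketch, assuming a live SparkContext sc. Keep in mind that zeroValue is applied once per key per partition, not once per key overall:

val rdd = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
rdd.foldByKey(0)(_ + _).collect()  // Array((a,3), (b,3))
// With a non-neutral zero the partitioning leaks into the result:
rdd.foldByKey(10)(_ + _).collect() // every partition that holds a key adds an extra 10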
Spark: aggregateByKey Source Code Analysis
Spark version: 2.4.0
Source location: org.apache.spark.rdd.PairRDDFunctions
The relevant code is as follows. The first two methods delegate to method 3 for the actual computation.
Method 1:
/**
 * Aggregate the values of each key, using given combine functions and a neutral "zero value".
 * This function can return a different result type, U, than the type…
Original post: 2021-01-26 15:56:26
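A sketch of the result-type-changing behavior the doc comment mentions, computing a per-key average (assuming a live SparkContext sc):

val rdd = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
// zeroValue (0, 0) has type (sum, count); seqOp folds a value into a partition-local
// accumulator, while combOp merges accumulators from different partitions.
val sumCount = rdd.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),
  (a, b)   => (a._1 + b._1, a._2 + b._2)
)
sumCount.mapValues { case (s, c) => s.toDouble / c }.collect() // Array((a,1.5), (b,3.0))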
Spark: combineByKey Source Code Analysis
Spark version: 2.4.0
Source location: org.apache.spark.rdd.PairRDDFunctions
Code snippet:
/**
 * Generic function to combine the elements for each key using a custom set of aggregation
 * functions. This method is here for backward compatibility. It does not provide combiner…
Original post: 2021-01-26 15:09:20
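A minimal usage sketch of the three functions combineByKey takes, assuming a live SparkContext sc: createCombiner runs on the first value seen for a key within a partition, mergeValue folds later values in, and mergeCombiners merges per-partition results:

val rdd = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val combined = rdd.combineByKey(
  (v: Int) => List(v),                        // createCombiner
  (c: List[Int], v: Int) => v :: c,           // mergeValue
  (c1: List[Int], c2: List[Int]) => c1 ::: c2 // mergeCombiners
)
combined.collect() // e.g. Array((a, List(2, 1)), (b, List(3)))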
Spark: reduceByKey Source Code Analysis
Source location: org.apache.spark.rdd.PairRDDFunctions
The three related method snippets:
Method 1…
Original post: 2021-01-26 11:26:52
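A minimal usage sketch, assuming a live SparkContext sc. In Spark 2.4.0, reduceByKey is built on combineByKeyWithClassTag, so values are combined map-side before the shuffle, unlike groupByKey:

val rdd = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
rdd.reduceByKey(_ + _).collect() // Array((a,3), (b,3))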