Spark-Dependency/Aggregator

License: CC 4.0 BY-SA

Tags: #dependency #spark

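This article introduces the basics of RDD dependencies in Apache Spark, covering narrow, one-to-one, range, and shuffle dependencies, and then shows the Aggregator class and the three functions it is built from.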

Dependencies: one of the core concepts of an RDD

/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 * Base class for dependencies.                                                                                                                                         
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
abstract class Dependency[T] extends Serializable {                                                                                                                     
  def rdd: RDD[T]                                                                                                                                                       
}

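Product2 is a trait from the Scala standard library. Its Scaladoc reads: "Product2 is a cartesian product of 2 components." Tuple2 extends it, so an RDD of (K, V) pairs satisfies the RDD[_ <: Product2[K, V]] type required by ShuffleDependency further down.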
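As a small illustration of the point about Product2, the sketch below passes an ordinary tuple where a Product2 is expected. The object name `Product2Demo` and the helper `keyOf` are hypothetical names used only for this example; nothing here touches Spark.

```scala
object Product2Demo {
  // Hypothetical helper: returns the first component (the "key") of any Product2.
  def keyOf[K, V](record: Product2[K, V]): K = record._1

  def main(args: Array[String]): Unit = {
    // A Tuple2 is a Product2, so no conversion is needed.
    val pair: Product2[String, Int] = ("spark", 1)
    println(keyOf(pair)) // first component
    println(pair._2)     // second component
  }
}
```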

NarrowDependency

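A relatively simple kind of dependency: each partition of the child RDD depends on only a small number of partitions of the parent RDD, which allows pipelined execution.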

/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 * Base class for dependencies where each partition of the child RDD depends on a small number                                                                          
 * of partitions of the parent RDD. Narrow dependencies allow for pipelined execution.                                                                                  
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {    
  /**                                                                                                                                                                   
   * Get the parent partitions for a child partition.                                                                                                                   
   * @param partitionId a partition of the child RDD                                                                                                                    
   * @return the partitions of the parent RDD that the child partition depends upon                                                                                     
   */                                                                                                                                                                   
  def getParents(partitionId: Int): Seq[Int]                                                                                                                            

  override def rdd: RDD[T] = _rdd                                                                                                                                       
}

OneToOneDependency

A 1:1 mapping: child partition i depends only on parent partition i.

/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 * Represents a one-to-one dependency between partitions of the parent and child RDDs.                                                                                  
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {                                                                                             
  override def getParents(partitionId: Int) = List(partitionId)                                                                                                         
}
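To make the 1:1 mapping concrete, here is a minimal sketch that models only the `getParents` contract. `NarrowDep` and `OneToOneDep` are simplified stand-ins invented for this example (they take no RDD), not Spark's classes.

```scala
// Simplified stand-in for NarrowDependency: only the partition mapping is modelled.
trait NarrowDep {
  def getParents(partitionId: Int): Seq[Int]
}

// Same mapping as OneToOneDependency.getParents: child partition i reads parent partition i.
object OneToOneDep extends NarrowDep {
  override def getParents(partitionId: Int): Seq[Int] = List(partitionId)
}

object OneToOneDemo {
  def main(args: Array[String]): Unit = {
    // For a child with 4 partitions, print the parent partitions of each one.
    (0 until 4).foreach { p =>
      println(s"child $p <- parents ${OneToOneDep.getParents(p)}")
    }
  }
}
```

Because every child partition has exactly one parent partition, a task can compute the parent partition and the child partition in the same pass, which is what the Scaladoc means by pipelined execution.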

RangeDependency

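The dependency is determined by ranges of partitions: a contiguous range of parent partitions maps one-to-one onto a contiguous range of child partitions. Presumably there is one dependency per range?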

/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 * Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.                                                                        
 * @param rdd the parent RDD                                                                                                                                            
 * @param inStart the start of the range in the parent RDD                                                                                                              
 * @param outStart the start of the range in the child RDD                                                                                                              
 * @param length the length of the range                                                                                                                                
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
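class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
  extends NarrowDependency[T](rdd) {
  // Child partitions in [outStart, outStart + length) map to the parent partition at the same offset.
  override def getParents(partitionId: Int) = {
    if (partitionId >= outStart && partitionId < outStart + length) {
      List(partitionId - outStart + inStart)
    } else {
      Nil
    }
  }
}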
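The three parameters documented above (inStart, outStart, length) can be read as an offset mapping between two partition ranges. The sketch below works through that reading for a union-like case with two parents. `RangeDep` is a simplified stand-in without the RDD parameter, and the two-parent scenario is an assumption chosen for illustration.

```scala
// Simplified stand-in for RangeDependency (no RDD parameter).
final case class RangeDep(inStart: Int, outStart: Int, length: Int) {
  // A child partition inside [outStart, outStart + length) depends on the parent
  // partition at the same offset from inStart; any other child partition has no
  // parent through this dependency.
  def getParents(partitionId: Int): Seq[Int] =
    if (partitionId >= outStart && partitionId < outStart + length)
      List(partitionId - outStart + inStart)
    else Nil
}

object RangeDepDemo {
  // Hypothetical scenario: parent A has 3 partitions and parent B has 2.
  // Child partitions 0-2 come from A, child partitions 3-4 come from B.
  val depOnA = RangeDep(inStart = 0, outStart = 0, length = 3)
  val depOnB = RangeDep(inStart = 0, outStart = 3, length = 2)

  def main(args: Array[String]): Unit = {
    println(depOnA.getParents(1)) // child 1 lies in A's range
    println(depOnB.getParents(4)) // child 4 is the second partition of B
    println(depOnB.getParents(1)) // child 1 is outside B's range
  }
}
```

In this reading there is one dependency object per parent range, which matches the guess in the text above.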

ShuffleDependency

/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 * Represents a dependency on the output of a shuffle stage. Note that in the case of shuffle,                                                                          
 * the RDD is transient since we don't need it on the executor side.                                                                                                    
 *                                                                                                                                                                      
 * @param _rdd the parent RDD                                                                                                                                           
 * @param partitioner partitioner used to partition the shuffle output                                                                                                  
 * @param serializer [[org.apache.spark.serializer.Serializer Serializer]] to use. If set to None,                                                                      
 *                   the default serializer, as specified by `spark.serializer` config option, will                                                                     
 *                   be used.                                                                                                                                           
 * @param keyOrdering key ordering for RDD's shuffles                                                                                                                   
 * @param aggregator map/reduce-side aggregator for RDD's shuffle                                                                                                       
 * @param mapSideCombine whether to perform partial aggregation (also known as map-side combine)                                                                        
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
class ShuffleDependency[K, V, C](                                                                                                                                       
    @transient _rdd: RDD[_ <: Product2[K, V]],                                                                                                                          
    val partitioner: Partitioner,                                                                                                                                       
    val serializer: Option[Serializer] = None,                                                                                                                          
    val keyOrdering: Option[Ordering[K]] = None,                                                                                                                        
    val aggregator: Option[Aggregator[K, V, C]] = None,                                                                                                                 
    val mapSideCombine: Boolean = false)                                                                                                                                
  extends Dependency[Product2[K, V]] {
  // ... (class body omitted in this excerpt)
}
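Unlike the narrow dependencies above, a shuffle dependency has no `getParents`: the partitioner decides where each record goes, so a child partition may receive records from every parent partition. The sketch below imitates that with plain collections. `ShuffleSketch`, `partitionFor`, and `shuffle` are hypothetical names, and the hash rule is a simplified stand-in for a hash partitioner, not Spark's shuffle machinery.

```scala
object ShuffleSketch {
  // Stand-in partitioner: non-negative key.hashCode modulo the number of partitions.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Collect the records of all parent partitions and bucket them by target child partition.
  def shuffle[K, V](parents: Seq[Seq[(K, V)]], numPartitions: Int): Map[Int, Seq[(K, V)]] =
    parents.flatten.groupBy { case (k, _) => partitionFor(k, numPartitions) }

  def main(args: Array[String]): Unit = {
    // Two parent partitions; key 1 appears in both.
    val parents = Seq(Seq(1 -> "a", 2 -> "b"), Seq(1 -> "c", 3 -> "d"))
    val out = shuffle(parents, numPartitions = 2)
    out.toSeq.sortBy(_._1).foreach { case (p, recs) => println(s"child $p: $recs") }
  }
}
```

Both records with key 1 end up in the same child partition even though they started in different parent partitions, which is why this kind of dependency cannot be pipelined the way a narrow one can.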

Aggregator

/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 * A set of functions used to aggregate data.                                                                                                                           
 *                                                                                                                                                                      
 * @param createCombiner function to create the initial value of the aggregation.                                                                                       
 * @param mergeValue function to merge a new value into the aggregation result.                                                                                         
 * @param mergeCombiners function to merge outputs from multiple mergeValue function.                                                                                   
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
case class Aggregator[K, V, C] (                                                                                                                                        
    createCombiner: V => C,                                                                                                                                             
    mergeValue: (C, V) => C,                                                                                                                                            
    mergeCombiners: (C, C) => C) {
  // ... (class body omitted in this excerpt)
}
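The three functions in the Aggregator signature can be tried out without Spark. The sketch below is a simplified stand-in: `SimpleAggregator`, `combineValues`, and `combineCombiners` are names made up for this example, and the in-memory maps only approximate what a real shuffle would do. It computes a per-key (sum, count) in two steps, first within each partition and then across partitions.

```scala
import scala.collection.mutable

// Simplified stand-in carrying the same three functions as the Aggregator excerpt.
final case class SimpleAggregator[K, V, C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C) {

  // Within one partition: the first value of a key creates a combiner,
  // later values of that key are merged into it.
  def combineValues(records: Iterator[(K, V)]): Map[K, C] = {
    val combiners = mutable.LinkedHashMap.empty[K, C]
    records.foreach { case (k, v) =>
      combiners(k) = combiners.get(k) match {
        case Some(c) => mergeValue(c, v)
        case None    => createCombiner(v)
      }
    }
    combiners.toMap
  }

  // Across partitions: merge the partial combiners that share a key.
  def combineCombiners(partials: Iterator[(K, C)]): Map[K, C] = {
    val combiners = mutable.LinkedHashMap.empty[K, C]
    partials.foreach { case (k, c) =>
      combiners(k) = combiners.get(k).map(mergeCombiners(_, c)).getOrElse(c)
    }
    combiners.toMap
  }
}

object AggregatorDemo {
  // Combiner type is (sum, count), which could later be turned into an average.
  val agg = SimpleAggregator[String, Int, (Int, Int)](
    createCombiner = v => (v, 1),
    mergeValue = (c, v) => (c._1 + v, c._2 + 1),
    mergeCombiners = (a, b) => (a._1 + b._1, a._2 + b._2))

  def main(args: Array[String]): Unit = {
    val part0 = agg.combineValues(Iterator("a" -> 1, "b" -> 2, "a" -> 3))
    val part1 = agg.combineValues(Iterator("a" -> 5))
    val merged = agg.combineCombiners(part0.iterator ++ part1.iterator)
    println(merged)
  }
}
```

The first step corresponds to what the ShuffleDependency Scaladoc calls map-side combine (partial aggregation before the shuffle), and the second step corresponds to merging those partial results on the reduce side.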