The Way of Spark Mastery (Advanced): Spark from Beginner to Expert, Section 5: The Spark Programming Model (Part 2)

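This article looks at how key components of big data development are applied in practice, covering usage scenarios and optimization strategies for mainstream technologies such as Hadoop, Spark, and Flink, with the aim of helping developers solve data processing and analysis problems efficiently.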

Contents of this article

  1. Common RDD Transformation functions

1. Common RDD Transformation functions

(1) union
union merges the elements of two RDDs. It resembles the union of two sets, except that duplicate elements are kept.
Signature of union:

```scala
/**
 * Return the union of this RDD and another one. Any identical elements will appear multiple
 * times (use `.distinct()` to eliminate them).
 */
def union(other: RDD[T]): RDD[T]
```

After an RDD is unioned with another RDD, elements that occur in both datasets appear more than once in the result.
The code is as follows:

```scala
scala> val rdd1=sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[15] at parallelize at <console>:21

scala> val rdd2=sc.parallelize(4 to 8)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at parallelize at <console>:21
// the result contains duplicate elements
scala> rdd1.union(rdd2).collect
res13: Array[Int] = Array(1, 2, 3, 4, 5, 4, 5, 6, 7, 8)
```
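To illustrate the behaviour above outside spark-shell, here is a minimal standalone sketch. It checks two things: that `union` keeps duplicates, and that for RDDs created with `parallelize` the result simply concatenates the partitions of its inputs. The object name `UnionSketch`, the helper `run`, the partition counts, and the `local[2]` master are assumptions made for this example and do not come from the original article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object UnionSketch {
  // Hypothetical helper: returns (collected union, number of partitions in the union).
  def run(sc: SparkContext): (Array[Int], Int) = {
    val rdd1 = sc.parallelize(1 to 5, 2) // 2 partitions
    val rdd2 = sc.parallelize(4 to 8, 3) // 3 partitions
    val merged = rdd1.union(rdd2)        // rdd1 ++ rdd2 is an alias for the same call
    (merged.collect(), merged.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumed master URL for running on a single machine.
    val sc = new SparkContext(
      new SparkConf().setAppName("union-sketch").setMaster("local[2]"))
    try {
      val (elems, parts) = run(sc)
      println(elems.mkString(", "))   // prints "1, 2, 3, 4, 5, 4, 5, 6, 7, 8"
      println(s"partitions = $parts") // prints "partitions = 5"
    } finally sc.stop()
  }
}
```

Since no data is moved between partitions here, union is cheap; removing the duplicates afterwards is what costs a shuffle.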


(2) intersection
Returns the intersection of two RDDs.
Signature:

```scala
/**
 * Return the intersection of this RDD and another one. The output will not contain any duplicate
 * elements, even if the input RDDs did.
 *
 * Note that this method performs a shuffle internally.
 */
def intersection(other: RDD[T]): RDD[T]
```

Usage example:

```scala
scala> rdd1.intersection(rdd2).collect
res14: Array[Int] = Array(4, 5)
```
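The doc comment states that the output contains no duplicates even when the inputs do. The sketch below tries this on two small RDDs that both contain repeated values. The names `IntersectionSketch` and `run`, the sample data, and the `local[2]` master are assumptions for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IntersectionSketch {
  // Hypothetical helper: intersects two RDDs that both contain duplicates.
  def run(sc: SparkContext): Array[Int] = {
    val left  = sc.parallelize(Seq(1, 2, 2, 3, 3, 3))
    val right = sc.parallelize(Seq(2, 2, 3, 4))
    // The order of the result depends on partitioning, so sort on the driver.
    left.intersection(right).collect().sorted
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("intersection-sketch").setMaster("local[2]"))
    try println(run(sc).mkString(", ")) // prints "2, 3"
    finally sc.stop()
  }
}
```

Sorting after `collect` is only there to make the printed result stable; intersection itself does not promise any ordering.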

(3) distinct
distinct removes duplicate elements.
Signature of distinct:

```scala
/**
 * Return a new RDD containing the distinct elements in this RDD.
 */
def distinct(): RDD[T]
```

```scala
scala> val rdd1=sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21

scala> val rdd2=sc.parallelize(4 to 8)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:21

scala> rdd1.union(rdd2).distinct.collect
res0: Array[Int] = Array(6, 1, 7, 8, 2, 3, 4, 5)
```
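The unsorted output above suggests that distinct redistributes elements by hash. One way to think about it, offered as a rough mental model and not as the exact library implementation, is "turn every element into a key, then keep one entry per key". The sketch compares the built-in call with that hand-written version. `DistinctSketch`, `run`, and the `local[2]` master are assumed names and settings, and the code assumes Spark 1.3 or later, where pair-RDD methods are available without an extra import.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DistinctSketch {
  // Hypothetical helper: returns (built-in distinct, hand-written equivalent), both sorted.
  def run(sc: SparkContext): (Array[Int], Array[Int]) = {
    val merged = sc.parallelize(1 to 5).union(sc.parallelize(4 to 8))
    val builtIn = merged.distinct().collect().sorted
    val byHand = merged
      .map(x => (x, true))      // use each element as a key with a dummy value
      .reduceByKey((a, _) => a) // keep a single entry per key; this step shuffles
      .map(_._1)                // drop the dummy value again
      .collect()
      .sorted
    (builtIn, byHand)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("distinct-sketch").setMaster("local[2]"))
    try {
      val (builtIn, byHand) = run(sc)
      println(builtIn.mkString(", ")) // prints "1, 2, 3, 4, 5, 6, 7, 8"
      println(byHand.mkString(", "))  // prints "1, 2, 3, 4, 5, 6, 7, 8"
    } finally sc.stop()
  }
}
```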


(4) groupByKey([numTasks])
The input is a dataset of (K, V) pairs and the result is a dataset of (K, Iterable[V]) pairs. numTasks sets the number of tasks and is optional. The no-argument groupByKey method is shown below:

```scala
/**
 * Group the values for each key in the RDD into a single sequence. Hash-partitions the
 * resulting RDD with the existing partitioner/parallelism level. The ordering of elements
 * within each group is not guaranteed, and may even differ each time the resulting RDD is
 * evaluated.
 *
 * Note: This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
 * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
 */
def groupByKey(): RDD[(K, Iterable[V])]
```

```scala
scala> val rdd1=sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21

scala> val rdd2=sc.parallelize(4 to 8)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:21

scala> rdd1.union(rdd2).map((_,1)).groupByKey.collect
res2: Array[(Int, Iterable[Int])] = Array((6,CompactBuffer(1)), (1,CompactBuffer(1)), (7,CompactBuffer(1)), (8,CompactBuffer(1)), (2,CompactBuffer(1)), (3,CompactBuffer(1)), (4,CompactBuffer(1, 1)), (5,CompactBuffer(1, 1)))
```
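The doc comment recommends aggregateByKey or reduceByKey when the grouping only serves an aggregation. As a sketch of that advice, the example below computes a per-key average twice: once with groupByKey and once with aggregateByKey, which only keeps a running (sum, count) per key. The names `GroupVsAggregateSketch` and `run`, the sample scores, and the `local[2]` master are assumptions made for this illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GroupVsAggregateSketch {
  // Hypothetical helper: per-key averages as (via groupByKey, via aggregateByKey).
  def run(sc: SparkContext): (Map[String, Double], Map[String, Double]) = {
    val scores = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 4)))

    // groupByKey collects every value of a key before the average is computed.
    val grouped = scores.groupByKey()
      .mapValues(vs => vs.sum.toDouble / vs.size)

    // aggregateByKey folds values into a (sum, count) accumulator per key.
    val aggregated = scores
      .aggregateByKey((0, 0))(
        (acc, v) => (acc._1 + v, acc._2 + 1),  // fold one value into the accumulator
        (x, y) => (x._1 + y._1, x._2 + y._2))  // merge accumulators across partitions
      .mapValues { case (sum, count) => sum.toDouble / count }

    (grouped.collect().toMap, aggregated.collect().toMap)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("group-vs-aggregate-sketch").setMaster("local[2]"))
    try {
      val (grouped, aggregated) = run(sc)
      println(grouped == aggregated) // prints "true"
    } finally sc.stop()
  }
}
```

Both versions give the same averages; the difference lies in how much data crosses the network during the shuffle.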

(5) reduceByKey(func, [numTasks])
reduceByKey takes (K, V) pairs as input and also returns (K, V) pairs, except that each V is now the result of aggregating all values for that key.

```scala
/**
 * Merge the values for each key using an associative reduce function. This will also perform
 * the merging locally on each mapper before sending results to a reducer, similarly to a
 * "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.
 */
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
```

Usage example:

```scala
scala> rdd1.union(rdd2).map((_,1)).reduceByKey(_+_).collect
res4: Array[(Int, Int)] = Array((6,1), (1,1), (7,1), (8,1), (2,1), (3,1), (4,2), (5,2))
```
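The signature shown takes an explicit numPartitions, while the shell example relies on the default. The sketch below passes the second argument and checks that the output has that many partitions. `ReduceByKeySketch`, `run`, the sample words, the value 4, and the `local[2]` master are all assumptions chosen for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReduceByKeySketch {
  // Hypothetical helper: returns (word counts, number of partitions of the reduced RDD).
  def run(sc: SparkContext): (Map[String, Int], Int) = {
    val words = sc.parallelize(Seq("a", "b", "a", "c", "a"))
    val counts = words
      .map(w => (w, 1))
      .reduceByKey(_ + _, 4) // second argument: number of output partitions
    (counts.collect().toMap, counts.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("reduce-by-key-sketch").setMaster("local[2]"))
    try {
      val (counts, parts) = run(sc)
      println(counts("a"))            // prints "3"
      println(s"partitions = $parts") // prints "partitions = 4"
    } finally sc.stop()
  }
}
```

Because values are merged on each mapper first and in no fixed order, the function passed to reduceByKey should be associative and commutative, as addition is here.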


(6) sortByKey([ascending], [numTasks])
Sorts the input dataset by key.
Definition of sortByKey:

```scala
/**
 * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
 * `collect` or `save` on the resulting RDD will return or output an ordered list of records
 * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
 * order of the keys).
 */
// TODO: this currently doesn't work on P other than Tuple2!
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
    : RDD[(K, V)]
```

Usage example:

```scala
scala> var data = sc.parallelize(List((1,3),(1,2),(1, 4),(2,3),(7,9),(2,4)))
data: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[20] at parallelize at <console>:21

scala> data.sortByKey(true).collect
res10: Array[(Int, Int)] = Array((1,3), (1,2), (1,4), (2,3), (2,4), (7,9))
```
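In the output above the keys are ordered but the values under key 1 are not (3, 2, 4), because sortByKey compares keys only. The following sketch uses the same data to show a descending sort, and one possible way to order by key and then value by sorting on the whole pair with `sortBy`. `SortByKeySketch`, `run`, and the `local[2]` master are assumptions for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SortByKeySketch {
  // Hypothetical helper: (keys after a descending sortByKey, pairs sorted by key then value).
  def run(sc: SparkContext): (Array[Int], Array[(Int, Int)]) = {
    val data = sc.parallelize(List((1, 3), (1, 2), (1, 4), (2, 3), (7, 9), (2, 4)))
    // Descending order by key; only the keys are inspected afterwards.
    val keysDesc = data.sortByKey(ascending = false).keys.collect()
    // Sorting on the whole tuple orders by key first and by value second.
    val byKeyThenValue = data.sortBy(pair => pair).collect()
    (keysDesc, byKeyThenValue)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sort-by-key-sketch").setMaster("local[2]"))
    try {
      val (keysDesc, byKeyThenValue) = run(sc)
      println(keysDesc.mkString(", "))       // prints "7, 2, 2, 1, 1, 1"
      println(byKeyThenValue.mkString(", ")) // prints "(1,2), (1,3), (1,4), (2,3), (2,4), (7,9)"
    } finally sc.stop()
  }
}
```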


(7) join(otherDataset, [numTasks])
For RDDs of type (K, V) and (K, W), join returns an RDD of type (K, (V, W)). There are three join functions:

```scala
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
def leftOuterJoin[W](
    other: RDD[(K, W)],
    partitioner: Partitioner): RDD[(K, (V, Option[W]))]
def rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Option[V], W))]
```

使用示例:

<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">scala> val rdd1=sc<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.parallelize</span>(Array((<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>),(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>))
     | )
<span class="hljs-label" style="box-sizing: border-box;">rdd1:</span> org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.apache</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.spark</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.rdd</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.RDD</span>[(Int, Int)] = ParallelCollectionRDD[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">24</span>] at parallelize at <console>:<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">21</span>

scala> val rdd2=sc<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.parallelize</span>(Array((<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>)))
<span class="hljs-label" style="box-sizing: border-box;">rdd2:</span> org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.apache</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.spark</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.rdd</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.RDD</span>[(Int, Int)] = ParallelCollectionRDD[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">32</span>] at parallelize at <console>:<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">21</span>

scala> rdd1<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.join</span>(rdd2)<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.collect</span>
<span class="hljs-label" style="box-sizing: border-box;">res13:</span> Array[(Int, (Int, Int))] = Array((<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>,(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>)), (<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>,(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>)))
</code>

<code class="hljs vbscript has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">scala> rdd1.leftOuterJoin(rdd2).collect
res15: <span class="hljs-built_in" style="color: rgb(102, 0, 102); box-sizing: border-box;">Array</span>[(<span class="hljs-built_in" style="color: rgb(102, 0, 102); box-sizing: border-box;">Int</span>, (<span class="hljs-built_in" style="color: rgb(102, 0, 102); box-sizing: border-box;">Int</span>, <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">Option</span>[<span class="hljs-built_in" style="color: rgb(102, 0, 102); box-sizing: border-box;">Int</span>]))] = <span class="hljs-built_in" style="color: rgb(102, 0, 102); box-sizing: border-box;">Array</span>((<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>,(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>,Some(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>))), (<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>,(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>,Some(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>))))</code>

<code class="hljs vbscript has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">scala> rdd1.rightOuterJoin(rdd2).collect
res16: <span class="hljs-built_in" style="color: rgb(102, 0, 102); box-sizing: border-box;">Array</span>[(<span class="hljs-built_in" style="color: rgb(102, 0, 102); box-sizing: border-box;">Int</span>, (<span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">Option</span>[<span class="hljs-built_in" style="color: rgb(102, 0, 102); box-sizing: border-box;">Int</span>], <span class="hljs-built_in" style="color: rgb(102, 0, 102); box-sizing: border-box;">Int</span>))] = <span class="hljs-built_in" style="color: rgb(102, 0, 102); box-sizing: border-box;">Array</span>((<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>,(Some(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>),<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>)), (<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>,(Some(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>),<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>)))
</code>
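
In the transcripts above, every key in rdd1 also appears in rdd2, so the Option values are always Some. The following is a minimal sketch, assuming Spark is on the classpath and using made-up sample data (the names `left` and `right` are illustrative, not from the original), that adds one unmatched key on each side to show where None appears. Results are compared as sets because collect does not promise any particular ordering.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the spark-shell context if one exists; otherwise starts a local one
// (assumption: Spark 1.4+ is on the classpath).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("outer-join-sketch").setMaster("local[2]"))

// Hypothetical sample data: key 2 exists only on the left, key 3 only on the right.
val left  = sc.parallelize(Seq((1, "a"), (2, "b")))
val right = sc.parallelize(Seq((1, "x"), (3, "y")))

// Inner join keeps only keys present on both sides.
val inner = left.join(right).collect().toSet

// leftOuterJoin keeps every left key; a missing right value becomes None.
val leftOuter = left.leftOuterJoin(right).collect().toSet

// rightOuterJoin keeps every right key; a missing left value becomes None.
val rightOuter = left.rightOuterJoin(right).collect().toSet

println(inner)
println(leftOuter)
println(rightOuter)
```

With this data, `inner` holds only the entry for key 1, `leftOuter` additionally holds (2,(b,None)), and `rightOuter` additionally holds (3,(None,y)).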

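(8)cogroup(otherDataset, [numTasks])
Given input RDDs of type (K, V) and (K, W), cogroup returns an RDD of type (K, (Iterable[V], Iterable[W])). This operation is equivalent to groupWith.

Method definition (the overload that takes a Partitioner; overloads that take a partition count or no extra argument also exist):
/**
* For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
* list of values for that key in `this` as well as `other`.
*/
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W]))]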

<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">scala> val rdd1=sc<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.parallelize</span>(Array((<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>),(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>))
     | )
<span class="hljs-label" style="box-sizing: border-box;">rdd1:</span> org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.apache</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.spark</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.rdd</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.RDD</span>[(Int, Int)] = ParallelCollectionRDD[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">24</span>] at parallelize at <console>:<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">21</span>

scala> val rdd2=sc<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.parallelize</span>(Array((<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>)))
<span class="hljs-label" style="box-sizing: border-box;">rdd2:</span> org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.apache</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.spark</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.rdd</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.RDD</span>[(Int, Int)] = ParallelCollectionRDD[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">32</span>] at parallelize at <console>:<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">21</span>

scala> rdd1<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.cogroup</span>(rdd2)<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.collect</span>
<span class="hljs-label" style="box-sizing: border-box;">res17:</span> Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>,(CompactBuffer(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>),CompactBuffer(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>))))

scala> rdd1<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.groupWith</span>(rdd2)<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.collect</span>
<span class="hljs-label" style="box-sizing: border-box;">res18:</span> Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>,(CompactBuffer(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>),CompactBuffer(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>))))
</code>
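
To make the relationship between cogroup and join concrete, the sketch below cogroups two pair RDDs in which one key is missing on each side, then rebuilds an inner join by pairing the two value groups per key. The data and the names `a`, `b`, `grouped`, and `joinViaCogroup` are illustrative assumptions. Spark's own join is built on top of cogroup in a similar way, but treat this as a model of the semantics rather than the library's exact code.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the spark-shell context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("cogroup-sketch").setMaster("local[2]"))

// Hypothetical sample data: key 2 is only in `a`, key 3 is only in `b`.
val a = sc.parallelize(Seq((1, 2), (1, 3), (2, 4)))
val b = sc.parallelize(Seq((1, 3), (3, 9)))

// cogroup emits one record per distinct key; a side without that key
// contributes an empty Iterable. Values are sorted only to make the
// result independent of partition ordering.
val grouped = a.cogroup(b)
  .mapValues { case (vs, ws) => (vs.toList.sorted, ws.toList.sorted) }
  .collect()
  .toMap

// Pairing every left value with every right value per key yields the inner join;
// keys with an empty side produce no pairs and therefore drop out.
val joinViaCogroup = a.cogroup(b)
  .flatMapValues { case (vs, ws) => for (v <- vs; w <- ws) yield (v, w) }
  .collect()
  .toSet

println(grouped)
println(joinViaCogroup == a.join(b).collect().toSet)
```

Keys 2 and 3 appear in `grouped` with one empty side each, yet neither appears in `joinViaCogroup`, which is exactly the difference between cogroup and an inner join.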

(9)cartesian(otherDataset) 
Computes the Cartesian product of two RDDs.
Function definition:
/**
* Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of
* elements (a, b) where a is in `this` and b is in `other`.
*/
def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]

<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">scala> val rdd1=sc<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.parallelize</span>(Array(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">4</span>))
<span class="hljs-label" style="box-sizing: border-box;">rdd1:</span> org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.apache</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.spark</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.rdd</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.RDD</span>[Int] = ParallelCollectionRDD[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">52</span>] at parallelize at <console>:<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">21</span>

scala> val rdd2=sc<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.parallelize</span>(Array(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">5</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">6</span>))
<span class="hljs-label" style="box-sizing: border-box;">rdd2:</span> org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.apache</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.spark</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.rdd</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.RDD</span>[Int] = ParallelCollectionRDD[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">53</span>] at parallelize at <console>:<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">21</span>

scala> rdd1<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.cartesian</span>(rdd2)<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.collect</span>
<span class="hljs-label" style="box-sizing: border-box;">res21:</span> Array[(Int, Int)] = Array((<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">5</span>), (<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">6</span>), (<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">5</span>), (<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">6</span>), (<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">5</span>), (<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">4</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">5</span>), (<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">6</span>), (<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">4</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">6</span>))
</code>
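
Because cartesian pairs every element of one RDD with every element of the other, the result size is the product of the two input sizes. The sketch below, using illustrative data and names (`xs`, `ys`, `pairs`), checks that count and then filters the pairs to express a simple non-equality join. This is only a sketch of one possible use; on large inputs the product grows quickly, so this pattern is best kept to small datasets.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the spark-shell context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("cartesian-sketch").setMaster("local[2]"))

// Hypothetical sample data.
val xs = sc.parallelize(Seq(1, 2, 3, 4))
val ys = sc.parallelize(Seq(2, 3))

// Every (x, y) combination: 4 * 2 = 8 pairs.
val pairs = xs.cartesian(ys)
val pairCount = pairs.count()

// Keeping only pairs with x < y acts like a join on an inequality condition.
val lessThan = pairs.filter { case (x, y) => x < y }.collect().toSet

println(pairCount)
println(lessThan)
```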

(10)coalesce(numPartitions) 
Reduces the number of partitions of the RDD to the specified numPartitions.

Function definition:
/**
* Return a new RDD that is reduced into numPartitions partitions.
*
* This results in a narrow dependency, e.g. if you go from 1000 partitions
* to 100 partitions, there will not be a shuffle, instead each of the 100
* new partitions will claim 10 of the current partitions.
*
* However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
* this may result in your computation taking place on fewer nodes than
* you like (e.g. one node in the case of numPartitions = 1). To avoid this,
* you can pass shuffle = true. This will add a shuffle step, but means the
* current upstream partitions will be executed in parallel (per whatever
* the current partitioning is).
*
* Note: With shuffle = true, you can actually coalesce to a larger number
* of partitions. This is useful if you have a small number of partitions,
* say 100, potentially with a few partitions being abnormally large. Calling
* coalesce(1000, shuffle = true) will result in 1000 partitions with the
* data distributed using a hash partitioner.
*/
def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]

Example code:

<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">scala> val rdd1=sc<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.parallelize</span>(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span> to <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">100</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>)
<span class="hljs-label" style="box-sizing: border-box;">rdd1:</span> org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.apache</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.spark</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.rdd</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.RDD</span>[Int] = ParallelCollectionRDD[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">55</span>] at parallelize at <console>:<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">21</span>

scala> val rdd2=rdd1<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.coalesce</span>(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>)
<span class="hljs-label" style="box-sizing: border-box;">rdd2:</span> org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.apache</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.spark</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.rdd</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.RDD</span>[Int] = CoalescedRDD[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">56</span>] at coalesce at <console>:<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">23</span>
</code>
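
The transcript above creates the coalesced RDD but does not inspect it. A minimal sketch, assuming a local-mode context and using `partitions.length` to read the partition count, shows the three cases the Scaladoc describes: shrinking without a shuffle, asking for more partitions without a shuffle (the count does not grow), and growing with shuffle = true. The variable names are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the spark-shell context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("coalesce-sketch").setMaster("local[2]"))

// Same starting point as the transcript: 100 elements in 3 partitions.
val rdd = sc.parallelize(1 to 100, 3)

// Shrinking: 3 -> 2 partitions through a narrow dependency, no shuffle.
val fewer = rdd.coalesce(2)

// Asking for more partitions without a shuffle leaves the count at 3.
val notGrown = rdd.coalesce(6)

// With shuffle = true the data is redistributed, so 6 partitions are possible.
val grown = rdd.coalesce(6, shuffle = true)

println(fewer.partitions.length)
println(notGrown.partitions.length)
println(grown.partitions.length)
println(grown.count())
```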

repartition(numPartitions) serves the same purpose as coalesce. Internally it simply calls coalesce with shuffle = true, which means it may incur significant network overhead.
Method definition:
/**
* Return a new RDD that has exactly numPartitions partitions.
*
* Can increase or decrease the level of parallelism in this RDD. Internally, this uses
* a shuffle to redistribute data.
*
* If you are decreasing the number of partitions in this RDD, consider using coalesce,
* which can avoid performing a shuffle.
*/
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}
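
To see the shuffle that repartition introduces, the sketch below reduces the same RDD to 2 partitions in both ways and looks for a ShuffledRDD in each lineage via `toDebugString`. The string check is an illustrative heuristic assumed here for demonstration, not an official way to detect shuffles. `glom()` is used to expose per-partition sizes.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the spark-shell context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("repartition-sketch").setMaster("local[2]"))

// Hypothetical sample data: 100 elements in 4 partitions.
val rdd = sc.parallelize(1 to 100, 4)

val viaRepartition = rdd.repartition(2) // always shuffles
val viaCoalesce    = rdd.coalesce(2)    // narrow dependency, no shuffle

// Both end up with 2 partitions, but only repartition's lineage
// contains a ShuffledRDD.
val repartitionShuffles = viaRepartition.toDebugString.contains("ShuffledRDD")
val coalesceShuffles    = viaCoalesce.toDebugString.contains("ShuffledRDD")

// glom() turns each partition into an array; the sizes still add up to 100.
val sizes = viaRepartition.glom().map(_.length).collect()

println(repartitionShuffles)
println(coalesceShuffles)
println(sizes.mkString(", "))
```

When only reducing the partition count, coalesce is usually the cheaper choice, as the Scaladoc above suggests; repartition is the option when the count must grow or the data needs rebalancing.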

Reposted from: http://blog.youkuaiyun.com/lovehuangjiaju/article/details/48603009
