Preface
This note looks at Spark as an evolution of MapReduce, covering its programming model, execution strategy, and fault tolerance.
1. Programming model
**val lines = spark.read.textFile("in").rdd**
– reads the file: this line only builds the computation graph (lineage); nothing is actually read until an action runs
lines.collect()
– collect() forces evaluation and yields a list of strings, one per line of input
– if we run lines.collect() again, it re-reads file "in" (the RDD is not cached by default)
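For reference, here is a minimal, runnable setup that these notes appear to assume: a SparkSession named spark and an input file "in" with one "from to" link per line (the file name comes from the notes; the sample links are hypothetical).

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local setup; the notes assume an already-created SparkSession called `spark`.
val spark = SparkSession.builder().appName("pagerank-notes").master("local[*]").getOrCreate()

// Hypothetical contents of "in" (one "from to" link per line):
//   u1 u2
//   u1 u3
//   u2 u1
//   u3 u1
val lines = spark.read.textFile("in").rdd
lines.collect()   // Array("u1 u2", "u1 u3", "u2 u1", "u3 u1") for the file above
```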
val links1 = lines.map{ s => val parts = s.split("\\s+"); (parts(0), parts(1)) }
links1.collect()
– map acts on each line in turn: split it on whitespace and build a tuple
– parses each line "x y" into the tuple ("x", "y")
val links2 = links1.distinct()
– distinct() brings identical pairs together and keeps only one copy of each, removing duplicate links
val links3 = links2.groupByKey()
– groupByKey() sorts or hashes to bring instances of each key together
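On the hypothetical input from the setup sketch above, the intermediate RDDs would contain roughly the following (shown as comments; exact ordering and collection types will differ):

```scala
links1.collect()  // Array(("u1","u2"), ("u1","u3"), ("u2","u1"), ("u3","u1"))
links2.collect()  // the same here, since the hypothetical input contains no duplicate links
links3.collect()  // roughly: (u1 -> [u2, u3]), (u2 -> [u1]), (u3 -> [u1])
```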
After these steps each record has the shape (page, list of pages it links to), e.g. ("u1", ["u2", "u3"]) for the hypothetical input above.
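The iteration below uses links4 and ranks, which are not defined anywhere in these notes. A plausible initialization, following the standard Spark PageRank example (the cached link table plus an initial rank of 1.0 per page; the exact form is an assumption, not from the original):

```scala
// Assumed definitions: links4 is the cached link table, ranks gives every page an initial rank of 1.0.
val links4 = links3.cache()              // reused in every iteration, so keep it in memory
var ranks  = links4.mapValues(_ => 1.0)  // `var` because it is reassigned at the end of each iteration
```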
Start the iterations:
val jj = links4.join(ranks)
– the join brings each page’s link list and current rank together
Here is the MapReduce-style part of the logic:
val contribs = jj.values.flatMap{ case (urls, rank) => urls.map(url => (url, rank / urls.size)) }
– for each link, the "from" page's rank divided by the number of its links
ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
– sums the contributions arriving at each page, then applies the PageRank update: new rank = 0.15 + 0.85 × (sum of contributions)
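As a tiny worked example on the hypothetical graph above (u1 → u2, u1 → u3, u2 → u1, u3 → u1, all ranks starting at 1.0): u1 has two outgoing links, so it contributes 0.5 to each of u2 and u3, while u2 and u3 each contribute their full 1.0 to u1. The new ranks are therefore u1 = 0.15 + 0.85 × (1.0 + 1.0) = 1.85 and u2 = u3 = 0.15 + 0.85 × 0.5 = 0.575.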
The second iteration:
val jj2 = links4.join(ranks)
– join() brings together equal keys; must sort or hash
val contribs2 = jj2.values.flatMap{ case (urls, rank) => urls.map(url => (url, rank / urls.size)) }
ranks = contribs2.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
– reduceByKey() brings together equal keys
In MapReduce, each of these iterations would be a separate MapReduce program; a consolidated sketch of the whole loop is shown below.
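Putting the pieces together, here is a minimal sketch of the whole iterative job (the 10-iteration count is an arbitrary choice; the point is that every iteration just extends one lineage graph over cached, in-memory data instead of being a separate job that writes to and reads from storage):

```scala
// Minimal PageRank loop, assuming `lines` was read as in the setup sketch above.
val links = lines
  .map { s => val parts = s.split("\\s+"); (parts(0), parts(1)) }
  .distinct()
  .groupByKey()
  .cache()                               // the link table is reused in every iteration

var ranks = links.mapValues(_ => 1.0)    // every page starts with rank 1.0

for (_ <- 1 to 10) {                     // 10 iterations, chosen arbitrarily
  val contribs = links.join(ranks).values.flatMap {
    case (urls, rank) => urls.map(url => (url, rank / urls.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}

ranks.collect().foreach(println)         // the action that actually triggers execution
```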
2. Execution strategy
2.1 Computation process

2.2 Execution

3. Fault tolerance
– In general, if a worker fails its computation is repeated; the recomputation can be spread over many workers, so recovery itself runs in parallel.
– Checkpoints reduce the amount of recomputation needed during recovery (see the sketch below).
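A sketch of how checkpointing could be added to the loop above so that recovery does not have to replay the whole lineage from the input file. setCheckpointDir and checkpoint() are real Spark calls, but the directory and the every-5-iterations policy here are arbitrary choices:

```scala
spark.sparkContext.setCheckpointDir("/tmp/pagerank-ckpt")   // hypothetical directory on stable storage

for (i <- 1 to 10) {
  val contribs = links.join(ranks).values.flatMap {
    case (urls, rank) => urls.map(url => (url, rank / urls.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
  if (i % 5 == 0) {
    ranks.checkpoint()   // truncate the lineage; a later failure replays at most 5 iterations
    ranks.count()        // checkpointing takes effect when an action materializes the RDD
  }
}
```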
Summary
Limitations
– all records in an RDD are treated the same way; there are no fine-grained per-record updates
– transformations are functional: they turn an input RDD into an output RDD
– no notion of modifying data in place
Advantages
– better than MapReduce, especially for iterative, multi-pass computations like PageRank
– introduces an intuitive dataflow view of the computation
– performance can be improved by keeping data in memory between steps (see the sketch below)
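As a concrete illustration of the last point (persist/cache are real Spark calls; applying them to the link table is the natural choice in this example, since it is read again in every iteration):

```scala
import org.apache.spark.storage.StorageLevel

links.persist(StorageLevel.MEMORY_ONLY)   // same as links.cache(): keep the link table in memory
// Without this, the link table would be recomputed from the input file whenever an iteration's join runs.
```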

Overall, this note treats Spark as an evolution of MapReduce, focusing on its programming model and RDD operations such as map, reduceByKey, and join; on the execution strategy; and on fault tolerance, where failed work can be recomputed in parallel and checkpointing reduces the amount of recomputation during recovery. Although Spark has no notion of modifying data in place, its dataflow view and in-memory optimizations give it a clear advantage over MapReduce.