Preface
This note looks at Spark as an evolution of MapReduce, covering its programming model, execution strategy, and fault tolerance.
1. Programming model
**val lines = spark.read.textFile("in").rdd**
– reads the file: this line only builds the computation graph (lineage); nothing is actually read until an action runs
lines.collect()
– collect() forces evaluation and yields a list of strings, one per line of input
– if we run lines.collect() again, it re-reads file "in" (the RDD is not cached by default)
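For reference, here is a minimal, runnable setup that these notes appear to assume: a SparkSession named spark and an input file "in" with one "from to" link per line (the file name comes from the notes; the sample links are hypothetical).

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local setup; the notes assume an already-created SparkSession called `spark`.
val spark = SparkSession.builder().appName("pagerank-notes").master("local[*]").getOrCreate()

// Hypothetical contents of "in" (one "from to" link per line):
//   u1 u2
//   u1 u3
//   u2 u1
//   u3 u1
val lines = spark.read.textFile("in").rdd
lines.collect()   // Array("u1 u2", "u1 u3", "u2 u1", "u3 u1") for the file above
```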
val links1 = lines.map{ s => val parts = s.split("\\s+"); (parts(0), parts(1)) }
links1.collect()
– map acts on each line in turn: split it on whitespace and build a tuple
– parses each line "x y" into the tuple ("x", "y")
val links2 = links1.distinct()
– distinct() brings identical pairs together and keeps only one copy of each, removing duplicate links
val links3 = links2.groupByKey()
– groupByKey() sorts or hashes to bring instances of each key together
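On the hypothetical input from the setup sketch above, the intermediate RDDs would contain roughly the following (shown as comments; exact ordering and collection types will differ):

```scala
links1.collect()  // Array(("u1","u2"), ("u1","u3"), ("u2","u1"), ("u3","u1"))
links2.collect()  // the same here, since the hypothetical input contains no duplicate links
links3.collect()  // roughly: (u1 -> [u2, u3]), (u2 -> [u1]), (u3 -> [u1])
```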
After these steps each record has the shape (page, list of pages it links to), e.g. ("u1", ["u2", "u3"]) for the hypothetical input above.
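The iteration below uses links4 and ranks, which are not defined anywhere in these notes. A plausible initialization, following the standard Spark PageRank example (the cached link table plus an initial rank of 1.0 per page; the exact form is an assumption, not from the original):

```scala
// Assumed definitions: links4 is the cached link table, ranks gives every page an initial rank of 1.0.
val links4 = links3.cache()              // reused in every iteration, so keep it in memory
var ranks  = links4.mapValues(_ => 1.0)  // `var` because it is reassigned at the end of each iteration
```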
Start the iterations:
val jj = links4.join(ranks)
– the join brings each page’s link list and current rank together
Here is the MapReduce-style part of the logic:
val contribs = jj.values.flatMap{ case (urls, rank) => urls.map(url => (url, rank / urls.size)) }
– for each link, the "from" page's rank divided by the number of its links
ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
– sums the contributions arriving at each page, then applies the PageRank update: new rank = 0.15 + 0.85 × (sum of contributions)
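As a tiny worked example on the hypothetical graph above (u1 → u2, u1 → u3, u2 → u1, u3 → u1, all ranks starting at 1.0): u1 has two outgoing links, so it contributes 0.5 to each of u2 and u3, while u2 and u3 each contribute their full 1.0 to u1. The new ranks are therefore u1 = 0.15 + 0.85 × (1.0 + 1.0) = 1.85 and u2 = u3 = 0.15 + 0.85 × 0.5 = 0.575.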
The second iteration:
val jj2 = links4.join(ranks)
– join() brings together equal keys; must sort or hash
val contribs2 = jj2.values.flatMap{ case (urls, rank) => urls.map(url => (url, rank / urls.size)) }
ranks = contribs2.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
– reduceByKey() brings together equal keys
In MapReduce, each of these iterations would be a separate MapReduce program; a consolidated sketch of the whole loop is shown below.
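Putting the pieces together, here is a minimal sketch of the whole iterative job (the 10-iteration count is an arbitrary choice; the point is that every iteration just extends one lineage graph over cached, in-memory data instead of being a separate job that writes to and reads from storage):

```scala
// Minimal PageRank loop, assuming `lines` was read as in the setup sketch above.
val links = lines
  .map { s => val parts = s.split("\\s+"); (parts(0), parts(1)) }
  .distinct()
  .groupByKey()
  .cache()                               // the link table is reused in every iteration

var ranks = links.mapValues(_ => 1.0)    // every page starts with rank 1.0

for (_ <- 1 to 10) {                     // 10 iterations, chosen arbitrarily
  val contribs = links.join(ranks).values.flatMap {
    case (urls, rank) => urls.map(url => (url, rank / urls.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}

ranks.collect().foreach(println)         // the action that actually triggers execution
```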
2. Execution strategy
2.1 Computation process

2.2 Execution

3. Fault tolerance
– In general, if a worker fails its computation is repeated; the recomputation can be spread over many workers, so recovery itself runs in parallel.
– Checkpoints reduce the amount of recomputation needed during recovery (see the sketch below).
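A sketch of how checkpointing could be added to the loop above so that recovery does not have to replay the whole lineage from the input file. setCheckpointDir and checkpoint() are real Spark calls, but the directory and the every-5-iterations policy here are arbitrary choices:

```scala
spark.sparkContext.setCheckpointDir("/tmp/pagerank-ckpt")   // hypothetical directory on stable storage

for (i <- 1 to 10) {
  val contribs = links.join(ranks).values.flatMap {
    case (urls, rank) => urls.map(url => (url, rank / urls.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
  if (i % 5 == 0) {
    ranks.checkpoint()   // truncate the lineage; a later failure replays at most 5 iterations
    ranks.count()        // checkpointing takes effect when an action materializes the RDD
  }
}
```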
Summary
Limitations
– all records in an RDD are treated the same way; there are no fine-grained per-record updates
– transformations are functional: they turn an input RDD into an output RDD
– no notion of modifying data in place
Advantages
– better than MapReduce, especially for iterative, multi-pass computations like PageRank
– introduces an intuitive dataflow view of the computation
– performance can be improved by keeping data in memory between steps (see the sketch below)
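As a concrete illustration of the last point (persist/cache are real Spark calls; applying them to the link table is the natural choice in this example, since it is read again in every iteration):

```scala
import org.apache.spark.storage.StorageLevel

links.persist(StorageLevel.MEMORY_ONLY)   // same as links.cache(): keep the link table in memory
// Without this, the link table would be recomputed from the input file whenever an iteration's join runs.
```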

Overall, this note treats Spark as an evolution of MapReduce, focusing on its programming model and RDD operations such as map, reduceByKey, and join; on the execution strategy; and on fault tolerance, where failed work can be recomputed in parallel and checkpointing reduces the amount of recomputation during recovery. Although Spark has no notion of modifying data in place, its dataflow view and in-memory optimizations give it a clear advantage over MapReduce.