Spark Study Notes 4 (RDD Persistency)

Article link: https://blog.youkuaiyun.com/OnlyQi/article/details/50687649

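Why Persistency
Once a job has been defined with transformations and triggered by an action, Spark starts processing the data. The intermediate RDDs produced along the way are created and discarded by Spark automatically, so the user does not need to manage them.
Sometimes, however, we want to reuse an RDD for efficiency. Suppose we create RDD1, call map() on it to get RDD2, and then call both count() and reduce() on RDD2. Spark computes RDD2 from RDD1 to produce the count() result, then computes RDD2 from RDD1 again to produce the reduce() result. In other words, the RDD1 >> RDD2 step runs twice.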
To avoid computing an RDD multiple times, we can ask Spark to persist the data. When we ask Spark to persist an RDD, the nodes that compute the RDD store their partitions.
Spark also automatically persists some intermediate data in shuffle operations (e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend users call persist on the resulting RDD if they plan to reuse it.
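
The recomputation described above can be imitated without a cluster. The following is a minimal pure-Python sketch, not Spark code: ToyRDD, double and the calls counter are hypothetical names invented for this illustration, and the class only mimics the lazy, recompute-per-action behaviour described in this section. It counts how many times the map function runs with and without persist().

```python
from functools import reduce as fold


class ToyRDD:
    """Toy stand-in for an RDD: lazy, recomputed per action unless persisted."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function that produces the data
        self._persist = False     # set by persist(); nothing is stored yet
        self._cached = None       # filled in by the first action after persist()

    def map(self, f):
        # Transformation: builds a new lazy dataset and runs nothing.
        return ToyRDD(lambda: [f(x) for x in self._materialize()])

    def persist(self):
        # Lazy: only marks the dataset; the next action stores the data.
        self._persist = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        data = self._compute()
        if self._persist:
            self._cached = data
        return data

    def count(self):
        # Action: forces the computation.
        return len(self._materialize())

    def reduce(self, f):
        # Action: forces the computation.
        return fold(f, self._materialize())


calls = {"n": 0}


def double(x):
    calls["n"] += 1   # count how often the map function actually runs
    return x * 2


rdd1 = ToyRDD(lambda: [1, 2, 3])

# Without persist: each action recomputes rdd1 >> rdd2.
rdd2 = rdd1.map(double)
rdd2.count()
rdd2.reduce(lambda a, b: a + b)
calls_without_persist = calls["n"]      # 6: three elements, mapped twice

# With persist: the second action reuses the stored result.
calls["n"] = 0
rdd2 = rdd1.map(double).persist()
calls_after_persist_call = calls["n"]   # 0: persist() alone ran nothing
n = rdd2.count()
total = rdd2.reduce(lambda a, b: a + b)
calls_with_persist = calls["n"]         # 3: mapped once, then served from the cache
```

In real PySpark the equivalent pattern is to call persist() (or cache()) on the RDD that several actions share, before the first of those actions runs.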

How to Persist

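from pyspark import StorageLevel
# sc is an existing SparkContext (for example, the one provided by the pyspark shell)
lines = sc.parallelize(["test1", 5, "test2", 3, 2])
lines.persist(StorageLevel.MEMORY_ONLY)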

Characteristics of Persistency:
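1, persist() is lazy, like a transformation: calling it only marks the RDD to be persisted and does not make Spark start executing. The data is actually stored the first time an action computes the RDD.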
2, An RDD can be persisted in memory or on disk, but there are some differences between languages:
In Scala and Java, the default persist() will store the data in the JVM heap as unserialized objects. In Python, we always serialize the data that persist stores, so the default is instead stored in the JVM heap as pickled objects.
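3, Persistence has different storage levels: you can choose whether to persist to memory or to disk, whether to serialize the data, and so on. The example above persists the data at the MEMORY_ONLY level. MEMORY_ONLY is the default level for RDDs; see the official Spark documentation for the other levels.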
4, If we explicitly persist some data, we do not need to explicitly delete it afterwards. Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
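
To make the LRU behaviour in point 4 concrete, here is a small pure-Python sketch. It is an assumption-laden toy, not Spark's actual block manager: ToyBlockCache and its methods are hypothetical names, and capacity is counted in partitions rather than bytes. It shows the two ways a cached partition can disappear: eviction of the least recently used entry when the cache is full, and manual removal in the spirit of RDD.unpersist().

```python
from collections import OrderedDict


class ToyBlockCache:
    """Toy LRU cache of partitions; capacity counts partitions, not bytes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._blocks = OrderedDict()   # key: (rdd_id, partition), least recent first

    def put(self, rdd_id, partition, data):
        key = (rdd_id, partition)
        self._blocks[key] = data
        self._blocks.move_to_end(key)          # the newest use goes to the end
        while len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)   # evict the least recently used entry

    def get(self, rdd_id, partition):
        key = (rdd_id, partition)
        if key not in self._blocks:
            return None                        # a caller would recompute the partition
        self._blocks.move_to_end(key)          # a read also counts as a use
        return self._blocks[key]

    def unpersist(self, rdd_id):
        # Manual removal: drop every cached partition of one dataset.
        for key in [k for k in self._blocks if k[0] == rdd_id]:
            del self._blocks[key]

    def keys(self):
        return list(self._blocks)


cache = ToyBlockCache(capacity=3)
cache.put("rdd2", 0, [2, 4])
cache.put("rdd2", 1, [6, 8])
cache.put("rdd3", 0, [1])
cache.get("rdd2", 0)         # touch: ("rdd2", 1) is now the least recently used
cache.put("rdd3", 1, [5])    # over capacity, so ("rdd2", 1) is evicted
after_eviction = cache.keys()
cache.unpersist("rdd3")      # manual removal of everything cached for "rdd3"
after_unpersist = cache.keys()
```

An evicted partition is not lost for good: as described earlier, Spark can recompute it from the RDD's lineage the next time an action needs it, which is why eviction affects speed rather than correctness.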