What is an RDD: the main abstraction Spark provides is the resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of a cluster so it can be operated on in parallel. An RDD is created when the driver program reads a file from the Hadoop file system (or any other Hadoop-supported file system), i.e. at file-read time, or by transforming an existing Scala collection in the driver program. Users can also ask Spark to persist an RDD in memory, so that it can be reused efficiently across parallel operations.
In short, RDDs recover automatically when a node in the cluster goes down.
Excerpt from the official documentation:
At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.
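A minimal Scala sketch of the points above, assuming a local Spark setup; the app name and the path hdfs:///data/input.txt are placeholders, not part of the original text:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RDDBasics {
  def main(args: Array[String]): Unit = {
    // The driver program holds the SparkContext.
    val conf = new SparkConf().setAppName("RDDBasics").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // 1) Create an RDD from a file in a Hadoop-supported file system
    //    (hypothetical path, for illustration only).
    val lines = sc.textFile("hdfs:///data/input.txt")

    // 2) Create an RDD from an existing Scala collection in the driver.
    val nums = sc.parallelize(1 to 100)

    // 3) Ask Spark to persist an RDD in memory so it can be reused
    //    efficiently across parallel operations.
    lines.cache()

    // Two parallel operations below reuse the cached RDD.
    val total    = lines.count()
    val nonEmpty = lines.filter(_.nonEmpty).count()
    println(s"total=$total nonEmpty=$nonEmpty sum=${nums.sum()}")

    sc.stop()
  }
}
```

Note that cache() only marks the RDD for in-memory storage; the data is actually materialized the first time an action such as count() runs, and later actions on the same RDD read from the cache.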
Types of RDD operations (the following is my own summary, based on the offi