The best way to learn the characteristics of RDDs is from the official documentation and the source code. First, let's look at the official explanation: http://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds
Resilient Distributed Datasets (RDDs)
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
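The two creation paths described above can be sketched as follows. This is a minimal example assuming a local `SparkContext`; the HDFS URI and file path are illustrative placeholders, not real endpoints.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: running locally with all available cores
val conf = new SparkConf().setAppName("rdd-creation").setMaster("local[*]")
val sc = new SparkContext(conf)

// 1) Parallelize an existing collection in the driver program
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 2)

// 2) Reference a dataset in an external storage system
//    (any URI supported by a Hadoop InputFormat; hypothetical path)
val fromFile = sc.textFile("hdfs://namenode:8020/data/input.txt")

println(fromCollection.count()) // 5

sc.stop()
```

Note that `parallelize` is mainly useful for testing and prototyping, since the entire collection must first fit in the driver's memory; real workloads usually start from external storage.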
After reading the official explanation, it still feels rather general; the specific characteristics of an RDD are not yet clear. OK, next let's look at the source code:
/**
 * A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
 * partitioned collection of elements that can be operated on in parallel. This class contains the
 * basic operations available on all RDDs, such as `map`, `filter`, and `persist`. In addition,
 * [[org.ap
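Before digging further into the source, the basic operations named in the scaladoc (`map`, `filter`, `persist`) can be sketched as below. This is a minimal local example, not taken from the Spark source itself; the expected values follow directly from applying the transformations to `1 to 10`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf().setAppName("rdd-basic-ops").setMaster("local[*]")
val sc = new SparkContext(conf)

val nums = sc.parallelize(1 to 10)

// Transformations are lazy: nothing executes until an action is called
val evens   = nums.filter(_ % 2 == 0)  // keep even numbers: 2, 4, 6, 8, 10
val squared = evens.map(n => n * n)    // square each element

// persist() marks the RDD for caching, so later actions
// reuse the computed partitions instead of recomputing the lineage
squared.persist(StorageLevel.MEMORY_ONLY)

println(squared.collect().mkString(", ")) // 4, 16, 36, 64, 100

sc.stop()
```

The immutability mentioned in the comment is visible here: `filter` and `map` each return a new RDD rather than modifying `nums` in place.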