Overview
Before Spark 2.0, the main programming interface of Spark was the Resilient Distributed Dataset (RDD).
Since Spark 2.0, Dataset has replaced the RDD as the main programming interface. A Dataset is strongly typed like an RDD, but with richer optimizations under the hood. The RDD interface is still supported; however, switching to Dataset is highly recommended, since it performs better than RDD.
Datasets and DataFrames
- A Dataset is a distributed collection of data. The Dataset interface, added in Spark 1.6, combines the benefits of RDDs (strong typing, the ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.
- A DataFrame is a Dataset organized into named co