Creating a Dataset
Datasets are similar to RDDs; however, instead of using Java serialization or Kryo (official documentation: https://github.com/EsotericSoftware/kryo/blob/master/README.md), they use a specialized Encoder to serialize the objects for processing or transmitting over the network. While both encoders and standard serialization are responsible for turning an object into bytes, encoders are code generated dynamically and use a format that allows Spark to perform many operations such as filtering, sorting, and hashing without deserializing the bytes back into an object.
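To see concretely what an encoder carries, here is a minimal sketch (Person is just an illustrative case class, not part of the original example) comparing the schema of a product encoder with that of a Kryo encoder. The former exposes the fields Spark can operate on directly; the latter is a single opaque binary column:

import org.apache.spark.sql.{Encoder, Encoders}

case class Person(name: String, age: Long)

// A product encoder exposes the object's structure as a schema, which is
// what lets Spark filter/sort/hash on the serialized format directly.
val personEncoder: Encoder[Person] = Encoders.product[Person]
println(personEncoder.schema.simpleString)
// struct<name:string,age:bigint>

// A Kryo encoder, by contrast, stores each object as one opaque binary
// column, so Spark cannot touch individual fields without deserializing.
val kryoEncoder: Encoder[Person] = Encoders.kryo[Person]
println(kryoEncoder.schema.simpleString)
// struct<value:binary>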
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work
// around this limit, you can use custom classes that implement the Product
// interface.
case class Person(name: String, age: Long)

// Create a Dataset from a case class; the Encoder is generated automatically.
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+

// Encoders for most common types are provided automatically by importing
// spark.implicits._
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)

// DataFrames can be converted to a Dataset by providing a class.
// Mapping will be done by name. Here people.json contains:
// {"name":"tom", "age":12}
// {"name":"jerry", "age":23}
// {"name":"roky", "age":45}
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
// +---+-----+
// |age| name|
// +---+-----+
// | 12|  tom|
// | 23|jerry|
// | 45| roky|
// +---+-----+
In addition, converting back and forth between DataFrame and Dataset is done through SparkSession's implicit conversions (import spark.implicits._). A case class is still required; it effectively supplies the schema for the table.
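As a minimal sketch of that round trip, assuming a running SparkSession named spark and the Person case class defined above:

import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

val ds: Dataset[Person] = Seq(Person("tom", 12), Person("jerry", 23)).toDS()

// Dataset -> DataFrame: the case class fields supply the column schema.
val df: DataFrame = ds.toDF()

// DataFrame -> Dataset: columns are matched to case class fields by name.
val ds2: Dataset[Person] = df.as[Person]
ds2.map(p => p.name + " is " + p.age).show()

Note that as[Person] fails with an AnalysisException if a required column is missing or has an incompatible type, which is exactly the schema check the case class provides.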