Creating a Dataset
Datasets are similar to RDDs; however, instead of using Java serialization or Kryo (official documentation: https://github.com/EsotericSoftware/kryo/blob/master/README.md), they use a specialized Encoder to serialize the objects for processing or transmitting over the network. While both encoders and standard serialization are responsible for turning an object into bytes, encoders are code generated dynamically and use a format that allows Spark to perform many operations such as filtering, sorting, and hashing without deserializing the bytes back into an object.
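To see concretely what an encoder carries, here is a minimal sketch (Person is just an illustrative case class, not part of the original example) comparing the schema of a product encoder with that of a Kryo encoder. The former exposes the fields Spark can operate on directly; the latter is a single opaque binary column:

import org.apache.spark.sql.{Encoder, Encoders}

case class Person(name: String, age: Long)

// A product encoder exposes the object's structure as a schema, which is
// what lets Spark filter/sort/hash on the serialized format directly.
val personEncoder: Encoder[Person] = Encoders.product[Person]
println(personEncoder.schema.simpleString)
// struct<name:string,age:bigint>

// A Kryo encoder, by contrast, stores each object as one opaque binary
// column, so Spark cannot touch individual fields without deserializing.
val kryoEncoder: Encoder[Person] = Encoders.kryo[Person]
println(kryoEncoder.schema.simpleString)
// struct<value:binary>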
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work
// around this limit, you can use custom classes that implement the Product
// interface.
case class Person(name: String, age: Long)

// Create a Dataset from a case class; the Encoder is generated automatically.
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+

// Encoders for most common types are provided automatically by importing
// spark.implicits._
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)

// DataFrames can be converted to a Dataset by providing a class.
// Mapping will be done by name. Here people.json contains:
// {"name":"tom", "age":12}
// {"name":"jerry", "age":23}
// {"name":"roky", "age":45}
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
// +---+-----+
// |age| name|
// +---+-----+
// | 12|  tom|
// | 23|jerry|
// | 45| roky|
// +---+-----+
In addition, converting back and forth between DataFrame and Dataset is done through SparkSession's implicit conversions (import spark.implicits._). A case class is still required; it effectively supplies the schema for the table.
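As a minimal sketch of that round trip, assuming a running SparkSession named spark and the Person case class defined above:

import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

val ds: Dataset[Person] = Seq(Person("tom", 12), Person("jerry", 23)).toDS()

// Dataset -> DataFrame: the case class fields supply the column schema.
val df: DataFrame = ds.toDF()

// DataFrame -> Dataset: columns are matched to case class fields by name.
val ds2: Dataset[Person] = df.as[Person]
ds2.map(p => p.name + " is " + p.age).show()

Note that as[Person] fails with an AnalysisException if a required column is missing or has an incompatible type, which is exactly the schema check the case class provides.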