1. RDD & Dataset
Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize the objects for processing or for transmitting them over the network.
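As a rough sketch of what this looks like in Spark 1.6 (the Person case class and the sample data below are made up for illustration), importing sqlContext.implicits._ brings in the implicit Encoders that let Spark serialize case-class objects into its compact binary format:

// Hypothetical case class, used only for illustration
case class Person(name: String, age: Long)

import sqlContext.implicits._  // provides Encoder[Person], Encoder[String], etc.

// The implicit Encoder[Person] (not Java or Kryo serialization) describes how
// Person objects are encoded into Spark's internal binary format
val people = Seq(Person("Andy", 32), Person("Justin", 19)).toDS()
people.filter(_.age > 20).collect()  // Array(Person("Andy", 32))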
2. Dataset & DataFrame
// DataFrame
import sqlContext.implicits._            // for the $"..." column syntax and .as[String]
import org.apache.spark.sql.functions._  // for count()

// Load a text file and interpret each line as a java.lang.String
val ds = sqlContext.read.text("/home/spark/1.6/lines").as[String]
val result = ds
  .flatMap(_.split(" "))               // Split on whitespace
  .filter(_ != "")                     // Filter empty words
  .toDF()                              // Convert to DataFrame to perform aggregation / sorting
  .groupBy($"value")                   // Count number of occurrences of each word
  .agg(count("*") as "numOccurrences")
  .orderBy($"numOccurrences".desc)     // Show most common words first
// Dataset: stays entirely in Scala, with no need to switch to a DataFrame
val wordCount =
  ds.flatMap(_.split(" "))
    .filter(_ != "")
    .groupBy(_.toLowerCase())  // Instead of grouping on a column expression (i.e. $"value") we pass a lambda function
    .count()
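The payoff is that wordCount stays strongly typed: in Spark 1.6 this groupBy(...).count() yields a Dataset[(String, Long)], so collecting it gives plain Scala tuples rather than Rows. A small usage sketch, assuming the ds defined above:

// wordCount is a Dataset[(String, Long)]: (lowercased word, count)
val counts = wordCount.collect()                 // Array[(String, Long)]
counts.sortBy(-_._2).take(5).foreach { case (word, n) => println(s"$word -> $n") }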
A DataFrame and a Dataset can be converted back and forth:
df.as[ElementType]
turns a DataFrame into a Dataset, while
ds.toDF()
turns a Dataset back into a DataFrame.
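For example, a round trip between the two (a small sketch reusing the hypothetical Person case class from above; the people.json path is made up for illustration):

import sqlContext.implicits._

// DataFrame -> Dataset: .as[Person] attaches an Encoder and resolves columns by name
val peopleDF = sqlContext.read.json("/home/spark/1.6/people.json")  // columns: name, age
val peopleDS = peopleDF.as[Person]

// Dataset -> DataFrame: .toDF() drops the static type and goes back to untyped Rows
val backToDF = peopleDS.toDF()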