2. Spark SQL, Dataset, and DataFrame Basics

Spark SQL is Spark's core module for structured data processing, providing both a SQL interface and the Dataset API. You can run data operations through SQL queries or compute with the strongly typed Dataset. A DataFrame is a kind of Dataset, analogous to a table in a relational database. Spark SQL can load data from a variety of sources, such as JSON files, and can be used from multiple programming languages. You can also create and register temporary views, run SQL queries, and convert RDDs into DataFrames. In addition, Spark SQL supports scalar and aggregate functions, enabling more complex data analysis.

1. Overview

Spark SQL is the Spark module for structured data processing.

  1. The Spark SQL APIs provide more information about the structure of the data and the computation being performed.
  2. Internally, Spark SQL uses this extra information to perform additional optimizations.
  3. There are several ways to interact with Spark SQL, including SQL and the Dataset API.
  4. When computing a result, the same execution engine is used, regardless of which API or language you use to express the computation.

1.1. SQL

One use of Spark SQL is to execute SQL queries.
When running SQL from within another programming language, the results are returned as a Dataset/DataFrame.

1.2. Datasets and DataFrames

A Dataset is a distributed collection of data.
The Dataset interface, added in Spark 1.6, combines the benefits of RDDs (strong typing and the ability to use powerful lambda functions) with the benefits of Spark SQL's optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, and so on).
A DataFrame is a Dataset organized into named columns (Dataset<Row>).
It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs.
We refer to a Dataset<Row> as a DataFrame.
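To make the relationship concrete, the two views can be converted into each other. A minimal sketch, assuming the SparkSession spark and the Person JavaBean introduced later in this article:

// Untyped view: a DataFrame is just a Dataset<Row>
Dataset<Row> df = spark.read().json("examples/src/main/resources/people.json");

// Typed view: attach a bean encoder to obtain a Dataset<Person>
Dataset<Person> people = df.as(Encoders.bean(Person.class));

// And back again: drop the static type to return to a DataFrame
Dataset<Row> dfAgain = people.toDF();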

2. Getting Started

2.1. Creating a DataFrame

Dataset<Row> df = spark.read().json("examples/src/main/resources/people.json");
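
The snippet above assumes an existing SparkSession named spark, as do the later examples. A minimal sketch of the setup (the application name is an illustrative assumption):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// The SparkSession is the entry point for all Spark SQL functionality
SparkSession spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")  // illustrative application name
  .getOrCreate();

// With the session in place, the DataFrame read above can be displayed
df.show();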

2.2. Untyped Dataset Operations (aka DataFrame Operations)

// col("...") is preferable to df.col("...")
import static org.apache.spark.sql.functions.col;

// Print the schema in a tree format
df.printSchema();

// Select only the "name" column
df.select("name").show();

// Select everybody, but increment the age by 1
df.select(col("name"), col("age").plus(1)).show();

// Select people older than 21
df.filter(col("age").gt(21)).show();

// Count people by age
df.groupBy("age").count().show();

For a complete list of the types of operations that can be performed on a Dataset, refer to the API documentation.

2.3. Running SQL Queries Programmatically

df.createOrReplaceTempView("people");

Dataset<Row> sqlDF = spark.sql("SELECT * FROM people");
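
The returned sqlDF is an ordinary DataFrame, so it can be displayed or transformed further. A small usage sketch (col requires the static import shown in section 2.2):

sqlDF.show();  // print the query result as a table

// The result also supports the DataFrame operations from section 2.2
sqlDF.filter(col("age").gt(21)).show();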

2.4. Creating Datasets

Datasets use a specialized Encoder to serialize objects, which allows Spark to perform many operations (such as filtering, sorting, and hashing) without deserializing the bytes back into objects.
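
The examples in this and the following sections use a Person JavaBean whose definition is not shown here. A minimal sketch, modeled on the class used in the official Spark examples (the exact field types are an assumption):

import java.io.Serializable;

// A plain JavaBean: private fields plus public getters/setters,
// which is exactly what Encoders.bean(...) inspects via reflection
public class Person implements Serializable {
  private String name;
  private long age;

  public String getName() { return name; }
  public void setName(String name) { this.name = name; }
  public long getAge() { return age; }
  public void setAge(long age) { this.age = age; }
}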

import java.util.Arrays;
import java.util.Collections;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

// Create an instance of a Bean class
Person person = new Person();
person.setName("Andy");
person.setAge(32);

// Encoders are created for Java beans
Encoder<Person> personEncoder = Encoders.bean(Person.class);
Dataset<Person> javaBeanDS = spark.createDataset(
  Collections.singletonList(person),
  personEncoder
);

// Encoders for most common types are provided in class Encoders
Encoder<Long> longEncoder = Encoders.LONG();
Dataset<Long> primitiveDS = spark.createDataset(Arrays.asList(1L, 2L, 3L), longEncoder);
Dataset<Long> transformedDS = primitiveDS.map(
    (MapFunction<Long, Long>) value -> value + 1L,
    longEncoder);
transformedDS.collect(); 

// DataFrames can be converted to a Dataset by providing a class. Mapping based on name
String path = "examples/src/main/resources/people.json";
Dataset<Person> peopleDS = spark.read().json(path).as(personEncoder);
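
Because peopleDS is strongly typed, it can also be processed with typed lambdas instead of column expressions. A minimal sketch (the age threshold is just an example):

import org.apache.spark.api.java.function.FilterFunction;

// Work directly with Person objects rather than Row/Column expressions
Dataset<Person> adults = peopleDS.filter((FilterFunction<Person>) p -> p.getAge() > 21);
adults.show();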

2.5. Interoperating with RDDs

Spark SQL supports two different methods for converting existing RDDs into Datasets:

  • The first method uses reflection to infer the schema of an RDD that contains objects of a specific type.
  • The second method is a programmatic interface that lets you construct a schema and then apply it to an existing RDD.

2.5.1. Inferring the Schema Using Reflection

Spark SQL supports automatically converting an RDD of JavaBeans into a DataFrame. The BeanInfo, obtained through reflection, defines the schema of the table.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Create an RDD of Person objects from a text file
JavaRDD<Person> peopleRDD = spark.read()
  .textFile("examples/src/main/resources/people.txt")
  .javaRDD()
  .map(line -> {
    String[] parts = line.split(",");
    Person person = new Person();
    person.setName(parts[0]);
    person.setAge(Integer.parseInt(parts[1].trim()));
    return person;
  });

// Apply a schema to an RDD of JavaBeans to get a DataFrame
Dataset<Row> peopleDF = spark.createDataFrame(peopleRDD, Person.class);
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people");

// SQL statements can be run by using the sql methods provided by spark
Dataset<Row> teenagersDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19");

// The columns of a row in the result can be accessed by field index
Encoder<String> stringEncoder = Encoders.STRING();
Dataset<String> teenagerNamesByIndexDF = teenagersDF.map(
    (MapFunction<Row, String>) row -> "Name: " + row.getString(0),
    stringEncoder);

// or by field name
Dataset<String> teenagerNamesByFieldDF = teenagersDF.map(
    (MapFunction<Row, String>) row -> "Name: " + row.<String>getAs("name"),
    stringEncoder);

2.5.2. Programmatically Specifying the Schema

A Dataset<Row> can be created programmatically in three steps:

  1. Create an RDD<Row> from the original RDD;
  2. Create the schema, represented by a StructType, that matches the structure of the Rows in the RDD created in step 1;
  3. Apply that schema to the RDD<Row> via the createDataFrame method.

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Create an RDD
JavaRDD<String> peopleRDD = spark.sparkContext()
  .textFile("examples/src/main/resources/people.txt", 1)
  .toJavaRDD();

// The schema is encoded in a string
String schemaString = "name age";

// Generate the schema based on the string of schema
List<StructField> fields = new ArrayList<>();
for (String fieldName : schemaString.split(" ")) {
  StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
  fields.add(field);
}
StructType schema = DataTypes.createStructType(fields);

// Convert records of the RDD (people) to Rows
JavaRDD<Row> rowRDD = peopleRDD.map((Function<String, Row>) record -> {
  String[] attributes = record.split(",");
  return RowFactory.create(attributes[0], attributes[1].trim());
});

// Apply the schema to the RDD
Dataset<Row> peopleDataFrame = spark.createDataFrame(rowRDD, schema);

// Creates a temporary view using the DataFrame
peopleDataFrame.createOrReplaceTempView("people");

// SQL can be run over a temporary view created using DataFrames
Dataset<Row> results = spark.sql("SELECT name FROM people");

// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
Dataset<String> namesDS = results.map(
    (MapFunction<Row, String>) row -> "Name: " + row.getString(0),
    Encoders.STRING());

3. Scalar Functions

A scalar function is a function that returns a single value per row, as opposed to an aggregate function, which returns a single value for a group of rows. Spark SQL supports a variety of built-in scalar functions. It also supports user-defined scalar functions.
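
As an illustration, a user-defined scalar function can be registered on the SparkSession and then mixed with built-in scalar functions in SQL. A minimal sketch; the function name shout and its logic are made-up examples:

import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// Register a scalar UDF that upper-cases a string (null-safe)
spark.udf().register("shout",
    (UDF1<String, String>) s -> s == null ? null : s.toUpperCase(),
    DataTypes.StringType);

// Built-in (length) and user-defined (shout) scalar functions each produce one value per row
spark.sql("SELECT name, length(name), shout(name) FROM people").show();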

4. Aggregate Functions

Aggregate functions are functions that return a single value for a group of rows. The built-in aggregate functions provide common aggregations such as count(), count_distinct(), avg(), max(), and min(). Users are not limited to the predefined aggregate functions and can create their own. For more details, refer to the documentation on user-defined aggregate functions.
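
For example, several built-in aggregates can be applied at once, either in SQL over the people view or through the DataFrame API on the df from section 2.1. A small sketch; the DataFrame form assumes a static import of org.apache.spark.sql.functions.*:

// Aggregate functions collapse a group of rows into a single value
spark.sql("SELECT count(*), avg(age), max(age), min(age) FROM people").show();

// The same aggregation through the DataFrame API
// (count("name") counts non-null values of that column)
df.agg(count("name"), avg("age"), max("age"), min("age")).show();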
