Tutorial:
http://spark.apache.org/docs/latest/sql-programming-guide.html
Spark SQL is the Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of the data and the computation being performed. One use of Spark SQL is to execute SQL queries; Spark SQL can also be used to read data from an existing Hive installation.
SparkSession
The entry point to all functionality in Spark is the SparkSession class. To create a basic SparkSession, just use SparkSession.builder():
SparkSession spark = SparkSession
    .builder()
    .appName("Java Spark SQL basic example")
    .master("local")
    .config("spark.some.config.option", "some-value")
    .getOrCreate();
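The overview above mentions reading data from an existing Hive installation. A minimal sketch of the builder call for that case, assuming Hive is already configured (e.g. a hive-site.xml on the classpath); this is not part of the original example:
// Hedged sketch: enableHiveSupport() lets spark.sql() query existing Hive tables.
// Assumes Hive configuration (hive-site.xml) is available on the classpath.
SparkSession sparkWithHive = SparkSession
    .builder()
    .appName("Java Spark SQL Hive example")
    .master("local")
    .enableHiveSupport()
    .getOrCreate();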
DataFrame
1. RDD, DataFrame and Dataset are all resilient distributed datasets on the Spark platform, built to make processing very large data convenient.
2. RDDs do not support Spark SQL operations.
3. Both DataFrame and Dataset support Spark SQL operations such as select and groupBy, and both can be registered as temporary tables/views so that SQL statements can be run against them.
4. In the Java API, a DataFrame is represented as Dataset<Row>.
5. DataFrame and Dataset support some very convenient save formats, e.g. saving as CSV with a header row so that the field name of each column is clear at a glance (see the sketch after this list).
6. A DataFrame can also be seen as Dataset<Row>: each row has type Row and is left unparsed, so which fields a row contains and what their types are cannot be known statically; specific fields can only be pulled out with the getAs method or via pattern matching.
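A minimal sketch of items 5 and 6 above: writing a DataFrame out as CSV with a header row, and pulling a field out of an untyped Row by name with getAs. The output path is hypothetical; the JSON file is the same one used in the next section.
// Item 5: save as CSV with a header line so every column carries its field name
Dataset<Row> df = spark.read().json("/Users/kexin/work/projects/TestSpark/src/main/resources/people.json");
df.write().option("header", "true").mode("overwrite").csv("/tmp/people_csv");  // hypothetical output path
// Item 6: retrieve a specific field from a Row by name instead of by index
Row first = df.select("name", "age").first();
String name = first.getAs("name");
System.out.println(name);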
Using Dataset<Row>
// Requires: import static org.apache.spark.sql.functions.col;
Dataset<Row> df = spark.read().json("/Users/kexin/work/projects/TestSpark/src/main/resources/people.json");
df.show();
df.printSchema();
List<Row> list = df.filter(col("age").gt(18)).select("name", "age").collectAsList();
list.forEach(row -> {
    System.out.print(row.get(0) + "\t");
    System.out.println(row.get(1));
});
Dataset
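The Dataset examples below rely on a Person JavaBean whose definition is not shown in this note. A minimal sketch, assuming only name and age fields (following the official Spark examples):
import java.io.Serializable;

// Assumed JavaBean; the real class may carry more fields.
public class Person implements Serializable {
    private String name;
    private int age;

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}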
1. Creating a Dataset directly
// Create an instance of a Bean class
Person person = new Person();
person.setName("Andy");
person.setAge(32);
List<Person> l = new ArrayList<>();
l.add(person);
Person p = new Person();
p.setName("tom");
p.setAge(44);
l.add(p);
// Encoders are created for Java beans
Encoder<Person> personEncoder = Encoders.bean(Person.class);
Dataset<Person> javaBeanDS = spark.createDataset(l, personEncoder);
javaBeanDS.show();
List<Row> list = javaBeanDS.filter(col("age").lt(33)).select("name", "age").collectAsList();
list.forEach(row -> {
    System.out.println(row.get(0) + "\t" + row.get(1));
});
2. Creating a Dataset from a data file
// DataFrames can be converted to a Dataset by providing a class. Mapping based on name
String path = "/Users/kexin/work/projects/TestSpark/src/main/resources/people.json";
Encoder<Person> personEncoder = Encoders.bean(Person.class);
Dataset<Person> peopleDS = spark.read().json(path).as(personEncoder);
peopleDS.show();
3. Converting to and from an RDD
// Create an RDD of Person objects from a text file
JavaRDD<Person> peopleRDD = spark.read().textFile("/Users/kexin/work/projects/TestSpark/src/main/resources/people.txt")
    .javaRDD()
    .map(line -> {
        String[] parts = line.split(",");
        Person person = new Person();
        person.setName(parts[0]);
        person.setAge(Integer.parseInt(parts[1].trim()));
        return person;
    });
// Apply a schema to an RDD of JavaBeans to get a DataFrame
Dataset<Row> peopleDF = spark.createDataFrame(peopleRDD, Person.class);
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people");
// SQL statements can be run by using the sql methods provided by spark
Dataset<Row> teenagersDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19");
// The columns of a row in the result can be accessed by field index
Encoder<String> stringEncoder = Encoders.STRING();
Dataset<String> teenagerNamesByIndexDF = teenagersDF.map(
(MapFunction<Row, String>) row -> "Name: " + row.getString(0), stringEncoder);
teenagerNamesByIndexDF.show();
// or by field name
Dataset<String> teenagerNamesByFieldDF = teenagersDF.map((MapFunction<Row, String>) row -> "Name: " + row.<String>getAs("name"), stringEncoder);
teenagerNamesByFieldDF.show();
4. Creating a DataFrame from an RDD and a schema
// Create an RDD
JavaRDD<String> peopleRDD = spark.sparkContext().textFile("/Users/kexin/work/projects/TestSpark/src/main/resources/people.txt", 1).toJavaRDD();
// The schema is encoded in a string
String schemaString = "name age";
// Generate the schema based on the string of schema
List<StructField> fields = new ArrayList<>();
for (String fieldName : schemaString.split(" ")) {
    StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
    fields.add(field);
}
StructType schema = DataTypes.createStructType(fields);
// Convert records of the RDD (people) to Rows
JavaRDD<Row> rowRDD = peopleRDD.map((Function<String, Row>) record -> {
    String[] attributes = record.split(",");
    return RowFactory.create(attributes[0], attributes[1].trim());
});
// Apply the schema to the RDD
Dataset<Row> peopleDataFrame = spark.createDataFrame(rowRDD, schema);
// Creates a temporary view using the DataFrame
peopleDataFrame.createOrReplaceTempView("people");
// SQL can be run over a temporary view created using DataFrames
Dataset<Row> results = spark.sql("SELECT name FROM people");
// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
Dataset<String> namesDS = results.map((MapFunction<Row, String>) row -> "Name: " + row.getString(0), Encoders.STRING());
namesDS.show();
Temporary Views
Registering a DataFrame as a temporary view lets you run SQL queries over its data.
1. Temporary views
SQL statements can be run directly against a temporary view. Temporary views in Spark SQL are session-scoped: they disappear when the session that created them terminates.
// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people");
Dataset<Row> sqlDF = spark.sql("SELECT * FROM people");
sqlDF.show();
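A quick illustration of the session scope described above (a sketch, not output from the tutorial): the view is visible only in the session that registered it.
// Visible in the session that created the view:
spark.sql("SELECT count(*) FROM people").show();
// A brand-new session has no "people" view, so the line below would fail with an
// analysis error ("Table or view not found"); that is what the global temporary
// view in the next subsection addresses.
// spark.newSession().sql("SELECT * FROM people").show();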
2. Global temporary views
If you want a temporary view that is shared across all sessions and stays alive until the Spark application terminates, create a global temporary view. A global temporary view is tied to the system-preserved database global_temp, and you must use the qualified name to refer to it, e.g. SELECT * FROM global_temp.view1.
df.createGlobalTempView("people");
// Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show();
// Global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show();