Querying Array[Struct] types in spark.sql

This article shows how the LATERAL VIEW SQL operation simplifies filtering rows by a specified key from complex nested data in big-data scenarios. Through examples, it covers the LATERAL VIEW syntax and usage, including single and multiple LATERAL VIEW clauses, as well as direct field access on struct-typed data. This approach collapses the data flattening and querying that would otherwise take several steps into a single statement, improving query efficiency.
Background

We want to query data shaped like the following, but keep only the rows with a specified key value.
[Figure: a sample table with an id column and a covers column of type array<struct<key, type>>]

Solution

The usual approach is to apply explode() to the covers array, generating a temporary table with one row per struct element of covers, and then expand each field of that struct column, finally producing a table with id, key, and type columns. Once that table exists, you can filter on key. But this approach takes three steps, which is cumbersome (a sketch follows below).
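For reference, here is a minimal sketch of that multi-step approach, assuming a source table named tablename with columns id and covers (the same names used in the final query of this article):

-- Step 1: explode covers so each struct element becomes its own row.
-- Step 2: expand the struct's key and type fields into plain columns.
-- Step 3: filter the flattened result on key.
select id, key, type
from (
    select id, cover.key as key, cover.type as type
    from (
        select id, explode(covers) as cover
        from tablename
    ) exploded
) flattened
where key = 'special';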

With LATERAL VIEW explode(covers) adTable AS cover, the same result is achieved in a single step.

Introduction to LATERAL VIEW

Syntax:

LATERAL VIEW [ OUTER ] generator_function ( expression [ , ... ] ) [ table_alias ] AS column_alias [ , ... ]
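The optional OUTER keyword keeps input rows for which the generator produces no output (for example, an empty or NULL array), emitting NULL for the generated column instead of dropping the row. A minimal self-contained sketch (the inline VALUES table here is made up for illustration):

-- id = 1 yields two rows (v = 10 and v = 20);
-- id = 2 has an empty array: without OUTER it would vanish from the output,
-- with OUTER it survives as one row with v = NULL.
select id, v
from values (1, array(10, 20)), (2, cast(array() as array<int>)) as t(id, arr)
lateral view outer explode(arr) g as v;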

Suppose we already have the following table, pageAds:

+--------------+-----------+-----------------+
| pageid       | col1      | col2            |
+--------------+-----------+-----------------+
| front_page   | [1, 2, 3] | ["a", "b", "c"] |
| contact_page | [3, 4, 5] | ["d", "e", "f"] |
+--------------+-----------+-----------------+
  • Single LATERAL VIEW clause
select pageid, col1_new, col2 from pageAds lateral view explode(col1) adTable as col1_new;

+--------------+------------+---------------+
| pageid       | col1_new   | col2          |
+--------------+------------+---------------+
| front_page   | 1          | ["a","b","c"] |
| front_page   | 2          | ["a","b","c"] |
| front_page   | 3          | ["a","b","c"] |
| contact_page | 3          | ["d","e","f"] |
| contact_page | 4          | ["d","e","f"] |
| contact_page | 5          | ["d","e","f"] |
+--------------+------------+---------------+
  • Explode col1 and compute an aggregate count.
select col1_new, count(1) as count from pageAds lateral view explode(col1) adTable as col1_new group by col1_new;

+------------+------------+
| col1_new   | count      |
+------------+------------+
| 1          | 1          |
| 2          | 1          |
| 3          | 2          |
| 4          | 1          |
| 5          | 1          |
+------------+------------+
  • Multiple LATERAL VIEW clauses (the generated rows combine as a Cartesian product, as the output below shows)
select pageid, mycol1, mycol2 from pageAds 
    lateral view explode(col1) myTable1 as mycol1 
    lateral view explode(col2) myTable2 as mycol2;
    
+--------------+----------+----------+
| pageid       | mycol1   | mycol2   |
+--------------+----------+----------+
| front_page   | 1        | a        |
| front_page   | 1        | b        |
| front_page   | 1        | c        |
| front_page   | 2        | a        |
| front_page   | 2        | b        |
| front_page   | 2        | c        |
| front_page   | 3        | a        |
| front_page   | 3        | b        |
| front_page   | 3        | c        |
| contact_page | 3        | d        |
| contact_page | 3        | e        |
| contact_page | 3        | f        |
| contact_page | 4        | d        |
| contact_page | 4        | e        |
| contact_page | 4        | f        |
| contact_page | 5        | d        |
| contact_page | 5        | e        |
| contact_page | 5        | f        |
+--------------+----------+----------+
  • For struct types, individual fields can be accessed directly with "."
select id, cover.key as k, cover.type as t from tablename lateral view explode(covers) myTable1 as cover where cover.key = 'special';
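As a related alternative, Spark SQL also provides the inline generator, which expands an array of structs into one column per struct field in a single clause; a sketch equivalent to the query above, assuming covers has type array<struct<key, type>>:

select id, key as k, type as t from tablename 
    lateral view inline(covers) myTable1 as key, type 
    where key = 'special';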