头歌 (EduCoder) Spark SQL Multi-Data-Source Operations (Scala)

Level 1: Load and Save Operations

Programming Requirements

Open the code file window on the right and fill in the code between Begin and End to complete the program. Read the local file file:///data/bigfiles/demo.json, sort it by the age field in descending order, and print the result.

The contents of demo.json are as follows:

    {"name": "zhangsan", "age": 20, "sex": "m"},
    {"name": "lisi", "age": 21, "sex": "m"},
    {"name": "tiantian", "age": 22, "sex": "f"},
    {"name": "lihua", "age": 23, "sex": "f"},
    {"name": "zhaoliu", "age": 24, "sex": "m"},
    {"name": "liguanqing", "age": 25, "sex": "f"},
    {"name": "zhangqi", "age": 26, "sex": "m"},
    {"name": "zhaoai", "age": 27, "sex": "m"},
    {"name": "wangjiu", "age": 28, "sex": "f"}

Before starting the task, make sure the Hadoop and Hive environments are running (start-all.sh, nohup hive --service metastore &).

Run the following commands in the terminal:

start-all.sh
nohup hive --service metastore &

If `nohup hive --service metastore &` seems to hang without returning to the prompt, just press Enter.

Code:

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object First_Question {


  def main(args: Array[String]): Unit = {
    
    val spark: SparkSession = SparkSession
      .builder()
      .appName("First_Question")
      .master("local[*]")
      .getOrCreate()


    /******************* Begin *******************/  
    // Read the local JSON file into a DataFrame
    val dataFrame: DataFrame = spark.read.json("file:///data/bigfiles/demo.json")
    // Sort by the age column in descending order and print the result
    dataFrame.orderBy(dataFrame.col("age").desc).show()

        
    /******************* End *******************/

    spark.stop()

  }
}
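Since this level is about both loading and saving, a save step is a natural follow-up. The sketch below would go inside the Begin/End region above and reuses the spark session; the output path file:///data/bigfiles/demo_sorted/ is a hypothetical example, not part of the grader's requirements.

    // Hypothetical extra step: persist the sorted data as JSON (output path is illustrative only)
    spark.read.json("file:///data/bigfiles/demo.json")
      .orderBy(org.apache.spark.sql.functions.desc("age"))
      .write.mode(SaveMode.Overwrite)
      .json("file:///data/bigfiles/demo_sorted/")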

Level 2: Parquet Format Files

Open the code file window on the right and fill in the code between Begin and End according to the following requirements.

    Read the local file file:///data/bigfiles/demo.json and save it as Parquet under the local path file:///result/, in the partition directory student=1.

    Read the local file file:///data/bigfiles/demo2.json and save it as Parquet under the local path file:///result/, in the partition directory student=2.

The contents of demo.json are as follows:

    {"name": "zhangsan", "age": 20, "sex": "m"},
    {"name": "lisi", "age": 21, "sex": "m"},
    {"name": "tiantian", "age": 22, "sex": "f"},
    {"name": "lihua", "age": 23, "sex": "f"},
    {"name": "zhaoliu", "age": 24, "sex": "m"},
    {"name": "liguanqing", "age": 25, "sex": "f"},
    {"name": "zhangqi", "age": 26, "sex": "m"},
    {"name": "zhaoai", "age": 27, "sex": "m"},
    {"name": "wangjiu", "age": 28, "sex": "f"}

The contents of demo2.json are as follows:

    {"name": "hongkong", "age": 20, "sex": "m"},
    {"name": "kulu", "age": 21, "sex": "m"},
    {"name": "huxiaotian", "age": 22, "sex": "f"},
    {"name": "yueming", "age": 23, "sex": "f"},
    {"name": "wangsan", "age": 24, "sex": "m"},
    {"name": "zhaojiu", "age": 25, "sex": "f"},
    {"name": "wangqiqi", "age": 26, "sex": "m"},
    {"name": "wangxiantian", "age": 27, "sex": "m"},
    {"name": "zhaoba", "age": 28, "sex": "f"}

   

Before starting the task, make sure the Hadoop and Hive environments are running (start-all.sh, nohup hive --service metastore &).

Run the following commands in the terminal:

start-all.sh
nohup hive --service metastore &

Code:

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object Second_Question {


  def main(args: Array[String]): Unit = {
    
    val spark: SparkSession = SparkSession
      .builder()
      .appName("Second_Question")
      .master("local[*]")
      .getOrCreate()


    /******************* Begin *******************/  
    spark.read.json("file:///data/bigfiles/demo.json").write.parquet("file:///result/student=1")
    spark.read.json("file:///data/bigfiles/demo2.json").write.parquet("file:///result/student=2")





    /******************* End *******************/

    spark.stop()

  }
}
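A quick way to check the result is to read the parent directory back: because the sub-directories are named student=1 and student=2, Spark's partition discovery exposes student as an extra column. The sketch below assumes it is placed inside the Begin/End region above, after the two writes, and reuses the spark session.

    // Sketch: read the whole /result/ tree; Spark infers the partition column "student"
    val merged: DataFrame = spark.read.parquet("file:///result/")
    merged.printSchema()   // age, name, sex plus the inferred partition column student
    merged.show()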

Level 3: ORC Format Files

Programming Requirements

Complete the program according to the following requirements.

    Create an ORC-format Hive table student with the fields id(int), name(string), age(int), class(string).

    Insert the following data in order:

    1001,"王刚",19,"大数据一班"
    1002,"李虹",18,"大数据一班"
    1003,"张子萱",20,"大数据一班"
    1004,"赵云",18,"大数据一班"
    1005,"李晓玲",19,"大数据一班"
    1006,"张惠",18,"大数据二班"
    1007,"秦散",19,"大数据二班"
    1008,"王丽",18,"大数据二班"
    1009,"田忌",20,"大数据二班"
    1010,"张花",18,"大数据二班"

    Open the code file window on the right, fill in the code between Begin and End, and write a Spark SQL program that reads the student table created above and prints it in ascending order of id.

Terminal commands:

# Start Hadoop
start-all.sh
# Start the Hive metastore service
nohup hive --service metastore &
# Enter the Hive CLI
hive
# Create the ORC-format Hive table
create table student(id int, name string, age int, class string) stored as orc;

# Insert the data
insert into table student values(1001,"王刚",19,"大数据一班");
insert into table student values(1002,"李虹",18,"大数据一班");
insert into table student values(1003,"张子萱",20,"大数据一班");
insert into table student values(1004,"赵云",18,"大数据一班");
insert into table student values(1005,"李晓玲",19,"大数据一班");
insert into table student values(1006,"张惠",18,"大数据二班");
insert into table student values(1007,"秦散",19,"大数据二班");
insert into table student values(1008,"王丽",18,"大数据二班");
insert into table student values(1009,"田忌",20,"大数据二班");
insert into table student values(1010,"张花",18,"大数据二班");

The data above is inserted from inside the Hive CLI.

When a Time taken message is shown, the insert has completed.

Code:

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object Third_Question {


  def main(args: Array[String]): Unit = {
    
    val spark: SparkSession = SparkSession
      .builder()
      .appName("Third_Question")
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()


    /******************* Begin *******************/  
 
    spark.sql("select * from student").orderBy("id").show()




    /******************* End *******************/

    spark.stop()

  }
}
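Spark can also write ORC files directly, which complements the Hive-side table creation above. The sketch below would sit inside the Begin/End region and reuses the spark session; the output path file:///data/bigfiles/student_orc/ is a hypothetical example.

    // Illustrative only: save the sorted query result as ORC files (path is hypothetical)
    val students: DataFrame = spark.sql("select * from student").orderBy("id")
    students.write.mode(SaveMode.Overwrite).orc("file:///data/bigfiles/student_orc/")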

Level 4: JSON Format Files

Programming Requirements

Open the code file window on the right and fill in the code between Begin and End to complete the program. Read the local file file:///data/bigfiles/test.json and print it without changing the original row order.

The contents of test.json are as follows:

    {"id":1001,"name":"王刚","age":19,"class":"大数据一班"},
    {"id":1002,"name":"李虹","age":18,"class":"大数据一班"},
    {"id":1003,"name":"张子萱","age":20,"class":"大数据一班"},
    {"id":1004,"name":"赵云","age":18,"class":"大数据一班"},
    {"id":1005,"name":"李晓玲","age":19,"class":"大数据一班"},
    {"id":1006,"name":"张惠","age":18,"class":"大数据二班"},
    {"id":1007,"name":"秦散","age":19,"class":"大数据二班"},
    {"id":1008,"name":"王丽","age":18,"class":"大数据二班"},
    {"id":1009,"name":"田忌","age":20,"class":"大数据二班"},
    {"id":1010,"name":"张花","age":18,"class":"大数据二班"}

Run the following commands in the terminal:

start-all.sh
nohup hive --service metastore &

Code:

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object Forth_Question {


  def main(args: Array[String]): Unit = {
    
    val spark: SparkSession = SparkSession
      .builder()
      .appName("Forth_Question")
      .master("local[*]")
      .getOrCreate()


    /******************* Begin *******************/  
    // Read the local JSON file into a DataFrame
    val dataFrame: DataFrame = spark.read.json("file:///data/bigfiles/test.json")
    // Register a temp view so the columns can be selected in their original order
    dataFrame.createOrReplaceTempView("data")
    // Sorting by id reproduces the original row order of the file
    spark.sql("select id,name,age,class from data").orderBy("id").show()




    /******************* End *******************/

    spark.stop()

  }
}
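The temp view mainly serves to restore the column order: spark.read.json infers the schema with columns sorted alphabetically (age, class, id, name), so an explicit projection is needed to print id, name, age, class. The same can be done with the DataFrame API; the sketch below assumes it replaces the SQL query inside the Begin/End region and reuses the dataFrame value.

    // Alternative sketch: the explicit select restores the original column order,
    // and sorting by id reproduces the original row order of the file
    dataFrame.select("id", "name", "age", "class").orderBy("id").show()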

Level 5: Operating a Database over JDBC

Programming Requirements

Open the code file window on the right and fill in the code between Begin and End to complete the program. Read the local CSV file file:///data/bigfiles/job58_data.csv (it has a header row) and save the loaded data, in overwrite mode, to the table work.job_data in the local MySQL database. The connection details are:

    Username: root

    Password: 123123

    Port: 3306

    Remember to set useSSL=false.

Run the following commands in the terminal:

start-all.sh
nohup hive --service metastore &

Code:

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object Fifth_Question {


  def main(args: Array[String]): Unit = {
    
    val spark: SparkSession = SparkSession
      .builder()
      .appName("Fifth_Question")
      .master("local[*]")
      .getOrCreate()


    /******************* Begin *******************/  
    // Read the CSV file, treating the first row as a header
    val dataFrame: DataFrame = spark.read.option("header", "true").csv("file:///data/bigfiles/job58_data.csv")
    // Overwrite the MySQL table work.job_data via JDBC
    dataFrame.write.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/work?useSSL=false")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("user", "root")
      .option("password", "123123")
      .option("dbtable", "job_data")
      .mode(SaveMode.Overwrite)
      .save()





        
    /******************* End *******************/

    spark.stop()

  }
}
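To verify the write, the same connection options can be used to read the table back. The sketch below assumes it is placed inside the Begin/End region above, after the save, and reuses the spark session.

    // Verification sketch: read work.job_data back over JDBC and print a few rows
    spark.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/work?useSSL=false")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("user", "root")
      .option("password", "123123")
      .option("dbtable", "job_data")
      .load()
      .show()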

Level 6: Hive Table Operations

Programming Requirements

Open the code file window on the right and fill in the code between Begin and End according to the following requirements.

    Create a Hive table employee with the fields eid(string), ename(string), age(int), part(string).

    Insert the following data:

    "A568952","王晓",25,"财务部"
    "B256412","张天",28,"人事部"
    "C125754","田笑笑",23,"销售部"
    "D265412","赵云",24,"研发部"
    "F256875","李姿姿",26,"后勤部"

    Write a Spark SQL program that reads the employee table directly in Spark on Hive mode and prints it in ascending order of eid.

Terminal commands:

# Start Hadoop
start-all.sh
# Start the Hive metastore service
nohup hive --service metastore &
# Enter the Hive CLI
hive
# Create the Hive table
create table employee(eid string, ename string, age int, part string);
# Insert the data
insert into table employee values("A568952","王晓",25,"财务部");
insert into table employee values("B256412","张天",28,"人事部");
insert into table employee values("C125754","田笑笑",23,"销售部");
insert into table employee values("D265412","赵云",24,"研发部");
insert into table employee values("F256875","李姿姿",26,"后勤部");

When `Time taken..` is shown, the insert has completed.

Code:

import org.apache.spark.sql.{DataFrame, SparkSession}

object Sixth_Question {


  def main(args: Array[String]): Unit = {
    

    /******************* Begin *******************/  

    // Build a SparkSession with Hive support enabled (Spark on Hive)
    val spark: SparkSession = SparkSession.builder()
      .appName("Sixth_Question")
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()
    // Read the Hive table and sort by eid in ascending order
    val dataFrame: DataFrame = spark.sql("select * from employee order by eid")
    dataFrame.show()
    spark.stop()








    /******************* End *******************/

  }
}
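The same query can be expressed with the DataFrame API instead of SQL text. The sketch below would go inside the Begin/End region above (before spark.stop()) and reuses the spark session.

    // Equivalent sketch: load the Hive table as a DataFrame and sort by eid ascending
    spark.table("employee").orderBy("eid").show()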

Good luck with the assessment :)
