1. Data
The data has the structure shown below; these are a few sample records:
Aaron,OperatingSystem,100
Aaron,Python,50
Aaron,ComputerNetwork,30
Aaron,Software,94
Abbott,DataBase,18
Abbott,Python,82
Abbott,ComputerNetwork,76
Abel,Algorithm,30
Abel,DataStructure,38
Abel,OperatingSystem,38
Abel,ComputerNetwork,92
Abraham,DataStructure,12
Abraham,ComputerNetwork,78
Abraham,Software,98
Code
Copy the file into the container: docker cp <local file path> <full container ID>:<path inside the container>
docker cp D:/JupyterNotebook/测试数据/spark/Data01.txt a89262625a0a:/spark/data
To find the container ID: docker ps -a
Alternatively, open the Scala shell and read the file directly with textFile().
docker cp D:\JupyterNotebook\测试数据\spark\课程数据 a89262625a0a:/spark/data
docker cp D:\Project\Spark\test7.py a89262625a0a:/spark/data
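Scala code:
Read the data (use whichever path matches where the shell is running):
val lines = sc.textFile("D:///JupyterNotebook/测试数据/spark/Data01.txt") // local Windows path
val lines = sc.textFile("/spark/data/Data01.txt") // path as read from the cluster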
Inspect the data:
lines.foreach(elem=>println(elem))
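(1) How many students are there in the department in total? Uses map + distinct + count.
Extract the student name from each line, remove the duplicates, and count what remains.
Scala code: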
lines.map(line=>line.split(",")(0)).distinct().count()
Output:
res2: Long = 265
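Notes:
- lines.map(line=>line.split(",")(0)): splits each line on commas and keeps the first field (index 0), which is the student name
- distinct: removes duplicates. The mapped RDD contains a name once for every course that student took, and distinct keeps a single copy of each
- count: returns the number of elements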
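The map + distinct + count chain can also be tried without a cluster. The following is a minimal sketch that assumes a plain Scala Seq stands in for the RDD and uses only a few of the sample rows from section 1; the names `sample` and `studentCount` are introduced here for illustration.

```scala
// Illustration only: a local Seq stands in for the RDD `lines`.
// The rows are copied from the sample data in section 1.
val sample = Seq(
  "Aaron,OperatingSystem,100",
  "Aaron,Python,50",
  "Abbott,DataBase,18",
  "Abbott,Python,82",
  "Abel,Algorithm,30"
)

// Field 0 is the student name; distinct drops the repeated names.
val studentCount = sample.map(line => line.split(",")(0)).distinct.size

println(s"students: $studentCount")
```

Changing the index from 0 to 1 selects the course name instead, which is how question (2) is answered.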
(2) How many courses are offered in total?
Scala code:
lines.map(line=>line.split(",")(1)).distinct().count()
Output:
res3: Long = 8
(3) What is Tom's average score across all of his courses?
val Tom = lines.filter(line=>line.split(",")(0)=="Tom")
val Tom_1 = Tom.map(t=>(t.split(",")(0),(t.split(",")(2).toInt, 1)))
val Tom_2 = Tom_1.reduceByKey((a,b)=>(a._1+b._1, a._2+b._2))
Tom_2.mapValues(a=>a._1/a._2).first()
Output:
res4: (String, Int) = (Tom,30)
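Because the scores are parsed with toInt, a._1/a._2 is integer division, so any fractional part of the average is dropped. A minimal sketch of the difference, assuming plain Scala collections in place of the RDD and using Aaron's four sample rows from section 1 (Tom's rows are not part of the excerpt); `aaronScores`, `truncated` and `exact` are names made up for this example.

```scala
// Illustration only: Aaron's scores from the sample rows in section 1.
val aaronScores = Seq(100, 50, 30, 94)

// Same (sum, count) accumulation that reduceByKey performs for one key.
val (total, n) = aaronScores.map(s => (s, 1)).reduce((a, b) => (a._1 + b._1, a._2 + b._2))

val truncated = total / n          // Int / Int: the fractional part is dropped
val exact     = total.toDouble / n // convert first to keep the fraction

println(s"truncated = $truncated, exact = $exact")
```

In the RDD version the same effect would come from writing a._1.toDouble/a._2 inside mapValues.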
(4) How many courses has each student taken?
val student = lines.map(a=>(a.split(",")(0),1))
val student_1 = student.reduceByKey((a,b)=>a+b).foreach(println)
Output (only part of it is listed):
(Bartholomew,5)
(Ford,3)
(Lennon,4)
(Joshua,4)
(Tom,5)
……
student_1: Unit = ()
Note: foreach returns Unit, so student_1 holds no data; the counts are only printed.
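To keep the counts for later use rather than only printing them, the aggregation result can be bound to a name before anything is printed. A rough sketch, assuming plain Scala collections in place of the RDD (groupBy plus size plays the role of reduceByKey here) and a few of the sample rows from section 1; `rows` and `coursesPerStudent` are illustrative names.

```scala
// Illustration only: a local Seq stands in for the RDD `lines`.
val rows = Seq(
  "Aaron,OperatingSystem,100",
  "Aaron,Python,50",
  "Aaron,ComputerNetwork,30",
  "Abbott,DataBase,18",
  "Abbott,Python,82"
)

// Group the rows by student name, then count the rows in each group.
val coursesPerStudent: Map[String, Int] =
  rows.groupBy(line => line.split(",")(0)).map { case (name, group) => (name, group.size) }

// Printing is a separate step, so the Map is still available afterwards.
coursesPerStudent.foreach(println)
```

In the Spark shell the analogous change would be to assign the result of reduceByKey to a val and call foreach(println) on that val in a second statement.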
(5) How many students in the department take the DataBase course?
val db = lines.filter(a=>a.split(",")(1)=="DataBase").map(a=>(a.split(",")(1),1)).reduceByKey((a,b)=>a+b).foreach(println)
Output:
(DataBase,126)
(6) What is the average score of each course?
val course = lines.map(a=>(a.split(",")(1), (a.split(",")(2).toInt, 1))).reduceByKey((a,b)=>(a._1+b._1, a._2+b._2)).mapValues(a=>a._1/a._2).foreach(println)
Output:
(CLanguage,50)
(Python,57)
(OperatingSystem,54)
(Software,50)
(Algorithm,48)
(DataStructure,47)
(DataBase,50)
(ComputerNetwork,51)
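The averages above are whole numbers because both the sum and the count are Int. Here is a sketch of the same per-course computation with fractional results, assuming plain Scala collections in place of the RDD and a handful of the sample rows from section 1; `courseRows`, `parsed` and `courseAvg` are hypothetical names.

```scala
// Illustration only: a local Seq stands in for the RDD `lines`.
val courseRows = Seq(
  "Aaron,Python,50",
  "Abbott,Python,82",
  "Aaron,ComputerNetwork,30",
  "Abbott,ComputerNetwork,76",
  "Abraham,ComputerNetwork,78"
)

// Parse each row into a (course, score) pair.
val parsed = courseRows.map { line =>
  val fields = line.split(",")
  (fields(1), fields(2).toInt)
}

// Per course: sum the scores and divide by the row count as a Double.
val courseAvg: Map[String, Double] =
  parsed.groupBy(_._1).map { case (course, pairs) =>
    (course, pairs.map(_._2).sum.toDouble / pairs.size)
  }

courseAvg.foreach(println)
```

With the RDD, converting the sum with toDouble inside mapValues would give the same kind of result.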