1. Number sorting
Data:
D:\测试数据\排序\
sortFile1 contents:
2
32
654
32
15
756
65223
sortFile2 contents:
5956
22
650
92
Result:
(1,2)
(2,15)
(3,22)
(4,32)
(5,32)
(6,92)
(7,650)
(8,654)
(9,756)
(10,5956)
(11,65223)
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object MySort {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MySort").setMaster("local")
    val sc = new SparkContext(conf)
    val dataFile = "file:///D:/测试数据/排序/*"
    val data = sc.textFile(dataFile)
    var index = 0
    val result = data.filter(_.trim.length > 0)
      .map(n => (n.trim.toInt, ""))            // use the number itself as the key
      .partitionBy(new HashPartitioner(1))     // one partition, so the counter below stays consistent
      .sortByKey()
      .map(t => { index += 1; (index, t._1) }) // attach a 1-based rank to each value
    result.saveAsTextFile("file:///D:/测试数据/排序/result")
    sc.stop()
  }
}
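As a variation (not part of the original exercise), the shared mutable counter can be replaced with zipWithIndex, which numbers elements by their position in the already-sorted RDD. A minimal sketch of just the ranking step, meant to sit inside the same main method and reusing the `data` RDD from above; the output directory name here is made up:
// Alternative ranking: zipWithIndex gives each element its 0-based position in the sorted RDD.
val sorted = data.filter(_.trim.nonEmpty).map(_.trim.toInt).sortBy(x => x)
val ranked = sorted.zipWithIndex().map { case (n, i) => (i + 1, n) } // (rank, value); rank is a Long
ranked.saveAsTextFile("file:///D:/测试数据/排序/result_zip")          // hypothetical output path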
2. Given the key-value pairs ("spark",2), ("hadoop",6), ("hadoop",4), ("spark",6), where the key is a book title and the value is that book's sales on a given day, compute the average value for each key, i.e. the average daily sales of each book.
val rdd = sc.parallelize(Array(("spark",2),("hadoop",6),("hadoop",4),("spark",6)))
//rdd.mapValues(x => (x,1)) --> ("spark",(2,1)),("hadoop",(6,1)),("hadoop",(4,1)),("spark",(6,1))
//rdd.mapValues(x => (x,1)).reduceByKey((x,y) => (x._1+y._1, x._2+y._2))
//  --> ("spark",(2+6,1+1)),("hadoop",(6+4,1+1))
rdd.mapValues(x => (x,1))
  .reduceByKey((x,y) => (x._1+y._1, x._2+y._2)) // (sum of sales, count of days) per book
  .mapValues(x => x._1 / x._2)                  // average = sum / count
  .collect()
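Note that `x._1 / x._2` is integer division; with this data it happens to be exact (spark: 8/2 = 4, hadoop: 10/2 = 5), but sums that do not divide evenly would be truncated. A small variation, not part of the original, that casts to Double for fractional averages:
// Fractional averages: cast the sum to Double before dividing.
rdd.mapValues(x => (x, 1))
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
  .mapValues(x => x._1.toDouble / x._2)
  .collect() // e.g. Array((spark,4.0), (hadoop,5.0)); element order may vary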
3. Secondary sort
Problem: sort first by account, then by amount.
hadoop@apache 200
hive@apache 550
yarn@apache 580
hive@apache 159
hadoop@apache 300
hive@apache 258
hadoop@apache 150
yarn@apache 560
yarn@apache 260
Result:
(hadoop@apache,List(150, 200, 300))