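Requirements
Given a TXT file in which each line contains a URI and its visit count, find the URI with the highest visit count under each domain.
Program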
package www.ruozedata.bigdata.homework

import org.apache.spark.{SparkConf, SparkContext}

object URIApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("URIApp")
    val sc = new SparkContext(sparkConf)
    val lines = sc.textFile("file:///C:\\Users\\HJ\\Desktop/secondhomework.txt")

    // Parse each line into (domain, (uri, visits)) and group the pairs by domain.
    val uri = lines.map(x => {
      val uritemp = x.split("\t")
      val uritemp1 = uritemp(0)
      // Lines whose visit count cannot be parsed keep the default of 0.
      var number: Long = 0
      try {
        number = uritemp(2).toLong
      } catch {
        case e: Exception => println("error number")
      }
      // "https://host/path" -> "host"
      val net = uritemp1.split("//")(1).split("/")(0)
      (net, (uritemp1, number))
    }).groupByKey()

    uri.map(x => {
      // sortBy cannot take a `false` argument here: the sortBy that accepts an
      // ascending flag is the RDD operator, but x._2 has already been converted
      // to a List. The collection method reverse gives descending order instead.
      val maxtemp = x._2.toList.sortBy(_._2).reverse
      (x._1, maxtemp(0)) // take the first element, i.e. the maximum
    }).foreach(println)

    sc.stop()
  }
}
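The comment about sortBy and reverse can be illustrated on a plain Scala List, without Spark. The sketch below uses a few (uri, visits) pairs as the program would see them (the malformed count `j` becomes 0). The object name `DescendingSortSketch` and its method names are made up for illustration. It shows three ways to get the entry with the largest count: `sortBy(...).reverse` as in the program, `sortBy` with an explicitly reversed `Ordering`, and `maxBy`, which needs no sort at all.

```scala
object DescendingSortSketch {
  // Sample (uri, visits) pairs; names in this object are illustrative only.
  val visits: List[(String, Long)] = List(
    ("http://ruozedata.com/teacher.html", 0L),
    ("http://ruozedata.com/student.html", 56L),
    ("http://ruozedata.com/advanced.html", 8L)
  )

  // Same approach as the program: ascending sort, then reverse.
  def topByReverse(xs: List[(String, Long)]): (String, Long) =
    xs.sortBy(_._2).reverse.head

  // Descending sort in one step by passing a reversed Ordering explicitly.
  def topByReversedOrdering(xs: List[(String, Long)]): (String, Long) =
    xs.sortBy(_._2)(Ordering[Long].reverse).head

  // No sort: a single pass that keeps the element with the largest count.
  def topByMax(xs: List[(String, Long)]): (String, Long) =
    xs.maxBy(_._2)

  def main(args: Array[String]): Unit = {
    println(topByReverse(visits))
    println(topByReversedOrdering(visits))
    println(topByMax(visits))
  }
}
```

All three agree on this sample. When several entries share the maximum count, they may pick different ones, so the choice matters only if ties need a defined winner.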
Result
(segmentfault.com,(https://segmentfault.com/q/1010000000318379,50))
(blog.youkuaiyun.com,(https://blog.youkuaiyun.com/bitcarmanlee/article/details/75949268 ,40))
(www.baidu.com,(https://www.baidu.com/baidu?tn=monline_3_dg&ie=utf-8&wd=%E6%9C%89%E9%81%93%E7%BF%BB%E8%AF%91,5))
(www.cnblogs.com,(https://www.cnblogs.com/MOBIN/p/5384543.html,40))
(ruozedata.com,(http://ruozedata.com/student.html ,56))
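Per-domain maxima like these can also be computed by merging entries pairwise, keeping only the current best per domain instead of collecting every value of a key first. This is similar in spirit to what Spark's `reduceByKey` does with a merge function. The sketch below mimics the idea on a plain Scala collection. The names `PairwiseMaxSketch`, `larger`, and `topPerDomain` are hypothetical, and the sample pairs are a small subset of the data with malformed counts already mapped to 0.

```scala
object PairwiseMaxSketch {
  type Visit = (String, Long) // (uri, visits)

  // Keeps whichever of two visits has the larger count. A function of this
  // shape is the kind of merge one would pass to reduceByKey.
  def larger(a: Visit, b: Visit): Visit = if (a._2 >= b._2) a else b

  // Folds (domain, visit) pairs into a Map holding only the current maximum per domain.
  def topPerDomain(pairs: Seq[(String, Visit)]): Map[String, Visit] =
    pairs.foldLeft(Map.empty[String, Visit]) { case (acc, (domain, visit)) =>
      val best = acc.get(domain).map(current => larger(current, visit)).getOrElse(visit)
      acc.updated(domain, best)
    }

  def main(args: Array[String]): Unit = {
    val pairs: Seq[(String, Visit)] = Seq(
      ("ruozedata.com", ("http://ruozedata.com/teacher.html", 0L)),
      ("ruozedata.com", ("http://ruozedata.com/student.html", 56L)),
      ("ruozedata.com", ("http://ruozedata.com/advanced.html", 8L)),
      ("segmentfault.com", ("https://segmentfault.com/q/1010000000318379", 50L))
    )
    topPerDomain(pairs).foreach(println)
  }
}
```

Whether this is worth doing in the Spark job itself depends on the data size. For a file as small as this one, the grouping approach in the program is perfectly adequate.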
Input file (fields are tab-separated)
https://segmentfault.com/q/1010000000318379 [2018-1202:00] 50
http://ruozedata.com/teacher.html 201802:00 j
http://ruozedata.com/student.html 201802:00 56
https://www.cnblogs.com/MOBIN/p/5384543.html [2018-12-12 22:00:00] 40
https://www.cnblogs.com/huxiuqian/p/10152166.html 201802:00 4
https://www.cnblogs.com/littleorange7/p/10152286.html [2018-12-12 22:00:00] 7
http://ruozedata.com/advanced.html [2018-12-14 22:02:00] 8
https://www.baidu.com/baidu?tn=monline_3_dg&ie=utf-8&wd=%E6%9C%89%E9%81%93%E7%BF%BB%E8%AF%91 [2018-1202:00] 5
https://blog.youkuaiyun.com/maybe_fly/article/details/77979867 201802:00 h
https://blog.youkuaiyun.com/bitcarmanlee/article/details/75949268 [2018-12-13 22:02:00] 40
https://blog.youkuaiyun.com/tswisdom/article/details/79882308 [2018-12-13 22:02:00] 30
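Two of the input lines above carry a non-numeric count (`j` and `h`), which the program maps to 0. As a rough sketch of that parsing step outside Spark, the hypothetical helper `parseLine` below splits a tab-separated line, extracts the domain, and falls back to 0 via `scala.util.Try` rather than a mutable `var` with try/catch. The object and method names are assumptions for illustration and are not part of the original program.

```scala
import scala.util.Try

object ParseLineSketch {
  // Returns (domain, (uri, visits)); a missing or non-numeric count becomes 0.
  def parseLine(line: String): (String, (String, Long)) = {
    val fields = line.split("\t")
    val uri = fields(0)
    // fields(2) may be absent or non-numeric; Try turns either failure into the default.
    val visits = Try(fields(2).toLong).getOrElse(0L)
    // "https://host/path" -> "host"
    val domain = uri.split("//")(1).split("/")(0)
    (domain, (uri, visits))
  }

  def main(args: Array[String]): Unit = {
    println(parseLine("http://ruozedata.com/student.html\t201802:00\t56"))
    println(parseLine("http://ruozedata.com/teacher.html\t201802:00\tj"))
  }
}
```

Like the original code, this still assumes every line starts with a URI containing `//`. A line without it would fail at the domain extraction step.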