The data looks like this:
http://bigdata.cn/laozhang
http://bigdata.cn/laozhang
http://bigdata.cn/laozhao
http://bigdata.cn/laozhao
http://bigdata.cn/laozhao
http://bigdata.cn/laozhao
http://bigdata.cn/laozhao
http://bigdata.cn/laoduan
http://bigdata.cn/laoduan
http://javaee.cn/xiaoxu
http://javaee.cn/xiaoxu
http://javaee.cn/laoyang
http://javaee.cn/laoyang
http://javaee.cn/laoyang
http://php.cn/laoli
http://php.cn/laoliu
http://php.cn/laoli
http://php.cn/laoli
The requirement: for each subject, find the three most popular teachers.
Split each line on one or more '/', split the middle part (the domain) on '.', map every record to ((subject, teacher), 1) for counting, and then aggregate by key.
The splitting and aggregation are basically the same in every approach; what differs is the per-subject (in-group) sorting that follows.
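For instance, a minimal sketch of the parsing step on one sample line (the same split logic used in the code below):

val line = "http://bigdata.cn/laozhao"
val arr = line.split("/+")            // Array("http:", "bigdata.cn", "laozhao")
val subject = arr(1).split("\\.")(0)  // "bigdata"
val record = ((subject, arr(2)), 1)   // ((bigdata,laozhao),1)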
Approach 1: convert each group to a List with toList, which effectively holds the whole group in memory, then sort it with Scala's sortBy and take the first three, giving the top-three (subject, teacher) tuples per subject.
This fits subjects with a small amount of data; for something like the comments under a trending Weibo topic, the data is too large, memory can overflow, and data gets lost.
package cn.spark.teacherText

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object TeacherTopN {
  def main(args: Array[String]): Unit = {
    val isLocal: Boolean = args(0).toBoolean
    val conf = new SparkConf().setAppName(this.getClass.getSimpleName)
    if (isLocal) {
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)
    val lines = sc.textFile(args(1))
    val wordAndCount: RDD[((String, String), Int)] = lines.map(line => {
      val arr: Array[String] = line.split("/+")   // split on one or more '/'
      val subject = arr(1).split("\\.")(0)        // split on '.' and take index 0
      ((subject, arr(2)), 1)                      // map to ((subject, teacher), 1)
    }).reduceByKey(_ + _)                          // aggregate counts by key
    // group by subject
    val grouped: RDD[(String, Iterable[((String, String), Int)])] = wordAndCount.groupBy(_._1._1)
    // sort each group's iterator in memory and keep the top 3
    val res: RDD[(String, List[(String, Int)])] = grouped.mapValues(it => {
      it.toList.sortBy(-_._2).take(3).map(t => (t._1._2, t._2))
    })
    println(res.collect().toBuffer)
    sc.stop()
  }
}
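For the sample data above, the printed result should look roughly like this (the order of subjects, and of teachers with equal counts, may differ between runs):

// ArrayBuffer((javaee,List((laoyang,3), (xiaoxu,2))),
//             (php,List((laoli,3), (laoliu,1))),
//             (bigdata,List((laozhao,5), (laozhang,2), (laoduan,2))))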
Approach 2: iterate over the subjects, filter the RDD by subject, sort each filtered RDD with RDD methods, and bring the result back to the Driver; however, this triggers multiple actions (one job per subject).
When the data volume is very large it is still not a good fit, and the Driver side may overflow.
package cn.spark.teacherText

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object TeacherTopN1 {
  def main(args: Array[String]): Unit = {
    val isLocal: Boolean = args(0).toBoolean
    val conf = new SparkConf().setAppName(this.getClass.getSimpleName)
    if (isLocal) {
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)
    val lines = sc.textFile(args(1))
    // the known subjects, kept in an array
    val subjects = Array("bigdata", "php", "javaee")
    val wordAndCount: RDD[((String, String), Int)] = lines.map(line => {
      val arr: Array[String] = line.split("/+")   // split on one or more '/'
      val subject = arr(1).split("\\.")(0)        // split on '.' and take index 0
      ((subject, arr(2)), 1)                      // map to ((subject, teacher), 1)
    }).reduceByKey(_ + _)                          // aggregate counts by key
    // one pass per subject
    for (elem <- subjects) {
      // filter by subject
      val sub: RDD[((String, String), Int)] = wordAndCount.filter(_._1._1.equals(elem))
      // sort by count descending with an RDD method and take the top 3 (one action per subject)
      val res: Array[((String, String), Int)] = sub.sortBy(-_._2).take(3)
      println(res.toBuffer)
    }
    sc.stop()
  }
}
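A small alternative for the per-subject top 3 (a sketch, not from the original listing), replacing the sort line inside the loop: takeOrdered with an explicit ordering returns only three records per subject to the Driver without fully sorting the filtered RDD first.

val res: Array[((String, String), Int)] = sub.takeOrdered(3)(Ordering.by(-_._2))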
Approach 3: first trigger one action to bring the distinct subjects back to the Driver and determine how many subjects there are, then apply a custom partitioner so that one partition corresponds to one subject. After that, call mapPartitions and sort within each partition; since every partition holds exactly one subject, a single pass with an RDD method produces the ranking, which is quite elegant.
package cn.spark.teacherText

import org.apache.spark.rdd.RDD
import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import scala.collection.mutable

object TeacherTopN2 {
  def main(args: Array[String]): Unit = {
    val isLocal: Boolean = args(0).toBoolean
    val conf = new SparkConf().setAppName(this.getClass.getSimpleName)
    if (isLocal) {
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)
    val lines = sc.textFile(args(1))
    val wordAndCount: RDD[((String, String), Int)] = lines.map(line => {
      val arr: Array[String] = line.split("/+")   // split on one or more '/'
      val subject = arr(1).split("\\.")(0)        // split on '.' and take index 0
      ((subject, arr(2)), 1)                      // map to ((subject, teacher), 1)
    }).reduceByKey(_ + _)                          // aggregate counts by key
    // collect the distinct subjects to the Driver to determine the number of partitions
    val subjects: Array[String] = wordAndCount.map(_._1._1).distinct().collect()
    // repartition with the custom partitioner: one partition per subject
    val partitioned: RDD[((String, String), Int)] = wordAndCount.partitionBy(new SubjectsPartitions(subjects))
    // sort inside each partition and keep the top 3
    val res = partitioned.mapPartitions(it => {
      it.toList.sortBy(-_._2).take(3).iterator
    })
    // write out the result
    res.saveAsTextFile(args(2))
    // release resources
    sc.stop()
  }
}

class SubjectsPartitions(val subjects: Array[String]) extends Partitioner {
  // build a map from subject to partition index
  val rules = new mutable.HashMap[String, Int]()
  var i = 0
  for (sub <- subjects) {
    rules(sub) = i
    i += 1
  }
  // number of partitions
  override def numPartitions: Int = subjects.length

  override def getPartition(key: Any): Int = {
    // extract the subject from the key
    val subject = key.asInstanceOf[(String, String)]._1
    // return the partition index for this subject
    rules(subject)
  }
}
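A tiny illustration (a sketch with an assumed subjects array, not part of the job itself) of how SubjectsPartitions maps keys to partition indexes:

val p = new SubjectsPartitions(Array("bigdata", "php", "javaee"))
p.numPartitions                          // 3
p.getPartition(("bigdata", "laozhao"))   // 0
p.getPartition(("php", "laoli"))         // 1
p.getPartition(("javaee", "laoyang"))    // 2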
Approach 4: optimize the previous approach a bit to ease the resource pressure. We turn multiple shuffles into fewer shuffles by merging the reduceByKey shuffle and the partitionBy shuffle into a single shuffle: pass the custom partitioner directly to reduceByKey.
package cn.spark.teacherText

import org.apache.spark.rdd.RDD
import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import scala.collection.mutable

object TeacherTopN3 {
  def main(args: Array[String]): Unit = {
    val isLocal: Boolean = args(0).toBoolean
    val conf = new SparkConf().setAppName(this.getClass.getSimpleName)
    if (isLocal) {
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)
    val lines = sc.textFile(args(1))
    val wordAndOne: RDD[((String, String), Int)] = lines.map(line => {
      val arr: Array[String] = line.split("/+")   // split on one or more '/'
      val subject = arr(1).split("\\.")(0)        // split on '.' and take index 0
      ((subject, arr(2)), 1)                      // map to ((subject, teacher), 1)
    })
    // collect the distinct subjects to the Driver to determine the number of partitions
    val subjects: Array[String] = wordAndOne.map(_._1._1).distinct().collect()
    // aggregate and partition in one shuffle: reduceByKey takes the custom partitioner,
    // so one partition corresponds to one subject
    val reduced: RDD[((String, String), Int)] = wordAndOne.reduceByKey(new SubjectsPartitions(subjects), _ + _)
    // sort inside each partition and keep the top 3
    val res = reduced.mapPartitions(it => {
      it.toList.sortBy(-_._2).take(3).iterator
    })
    // write out the result
    res.saveAsTextFile(args(2))
    // release resources
    sc.stop()
  }
}

class SubjectsPartitions(val subjects: Array[String]) extends Partitioner {
  // build a map from subject to partition index
  val rules = new mutable.HashMap[String, Int]()
  var i = 0
  for (sub <- subjects) {
    rules(sub) = i
    i += 1
  }
  // number of partitions
  override def numPartitions: Int = subjects.length

  override def getPartition(key: Any): Int = {
    // extract the subject from the key
    val subject = key.asInstanceOf[(String, String)]._1
    // return the partition index for this subject
    rules(subject)
  }
}
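As a quick sanity check (a sketch, not part of the original job), one can print which partition each aggregated record landed in and confirm that every partition holds exactly one subject:

reduced.mapPartitionsWithIndex((idx, it) => it.map(t => (idx, t))).collect().foreach(println)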
For the sorting you can also use a TreeSet (automatically kept sorted, with deduplication) or a bounded priority queue; there are many options, so tune the choice to the actual requirements.
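For example, a minimal sketch of the TreeSet idea inside mapPartitions (assuming the same one-partition-per-subject layout and the partitioned RDD from TeacherTopN2; topN is an assumed local constant): instead of materializing the whole partition with toList, only the current top 3 records are kept at any time.

import scala.collection.mutable

val topN = 3
val res = partitioned.mapPartitions(it => {
  // order by count descending; include the key as a tiebreaker so distinct records with equal counts are not dropped
  implicit val ord: Ordering[((String, String), Int)] = Ordering.by(t => (-t._2, t._1._1, t._1._2))
  val sorted = mutable.TreeSet.empty[((String, String), Int)]
  it.foreach(t => {
    sorted += t
    if (sorted.size > topN) {
      sorted -= sorted.last   // evict the record with the smallest count
    }
  })
  sorted.iterator
})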