Spark Notes 03

白菜banger

License: CC 4.0 BY-SA

Category: spark

Article link: https://blog.youkuaiyun.com/weixin_40105543/article/details/97163218

1. Spark Resource Scheduling and Task Scheduling

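Resource scheduling

1. When the cluster starts, each Worker reports its resources to the Master, so the Master knows the resources available in the cluster.
2. The client submits a Spark application, and two objects are created: DAGScheduler and TaskScheduler. TaskScheduler requests resources from the Master.
3. The Master finds Workers that have enough resources and starts Executors on them.
4. Once started, each Executor registers back with the Driver, so the Driver now holds a set of compute resources.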

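Task scheduling

5. Each Action operator produces one job in the current application; the dependencies between the RDDs in the job form a DAG (directed acyclic graph).
6. DAGScheduler cuts the job's DAG into stages according to the wide and narrow dependencies between RDDs, and submits each stage to TaskScheduler as a TaskSet.
7. TaskScheduler iterates over the TaskSet, takes out the tasks one by one, and sends each task to the thread pool of an Executor for execution.
8. TaskScheduler monitors task execution and collects the results.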
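
Step 6 can be pictured with a small model. The following is a minimal plain-Java sketch (no Spark dependency) that cuts a linear chain of operators into stages, starting a new stage at every wide dependency. The class StageSplitSketch, its Op type, and the sample lineage are hypothetical names chosen for illustration; Spark's real DAGScheduler works on a full DAG and walks it backwards from the final RDD, so this is only a simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StageSplitSketch {

    /** One operator in a linear lineage; wide = true means it needs a shuffle. */
    public static final class Op {
        final String name;
        final boolean wide;

        public Op(String name, boolean wide) {
            this.name = name;
            this.wide = wide;
        }
    }

    /** Hypothetical lineage: textFile -> map -> reduceByKey -> sortBy. */
    public static List<Op> exampleLineage() {
        return Arrays.asList(
                new Op("textFile", false),
                new Op("map", false),
                new Op("reduceByKey", true),
                new Op("sortBy", true));
    }

    /** Cuts the lineage into stages: each wide operator starts a new stage. */
    public static List<List<String>> splitIntoStages(List<Op> lineage) {
        List<List<String>> stages = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (Op op : lineage) {
            if (op.wide && !current.isEmpty()) {
                stages.add(current);
                current = new ArrayList<>();
            }
            current.add(op.name);
        }
        if (!current.isEmpty()) {
            stages.add(current);
        }
        return stages;
    }

    public static void main(String[] args) {
        // Prints [[textFile, map], [reduceByKey], [sortBy]]
        System.out.println(splitIntoStages(exampleLineage()));
    }
}
```

Narrow operators such as map stay in the same stage as their parent, which is why the tasks of one stage can run them back to back without moving data between Executors.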

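Notes

1. TaskScheduler retries a failed task, by default up to 3 times (spark.task.maxFailures defaults to 4, counting the first attempt). DAGScheduler is responsible for retrying a failed stage, by default up to 4 attempts (spark.stage.maxConsecutiveAttempts); in current Spark versions a stage is resubmitted when its tasks cannot fetch shuffle data. If a task or stage still fails after these retries, the job that contains it fails, and when the job fails the application fails.

2. TaskScheduler can also re-run a slow task. This is Spark's speculative execution mechanism, which is disabled by default (spark.speculation = false). Do not enable it for ETL jobs, because a speculative copy of a task can write the same data a second time.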
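
The bounded task retry in note 1 can be modelled in a few lines. This is a minimal plain-Java sketch of the idea, not Spark's scheduler code: TaskRetrySketch and runWithRetries are hypothetical names, and MAX_TASK_FAILURES only mirrors the default value of spark.task.maxFailures.

```java
import java.util.function.IntPredicate;

public class TaskRetrySketch {

    /** Mirrors the default of spark.task.maxFailures: 4 attempts, i.e. 1 run plus 3 retries. */
    public static final int MAX_TASK_FAILURES = 4;

    /**
     * Runs a task until it succeeds or has failed maxFailures times.
     * attemptSucceeds receives the 1-based attempt number.
     * Returns the number of attempts used, or -1 if every attempt failed.
     */
    public static int runWithRetries(IntPredicate attemptSucceeds, int maxFailures) {
        for (int attempt = 1; attempt <= maxFailures; attempt++) {
            if (attemptSucceeds.test(attempt)) {
                return attempt;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // A flaky task that only succeeds on its third attempt: prints 3.
        System.out.println(runWithRetries(attempt -> attempt == 3, MAX_TASK_FAILURES));
        // A task that never succeeds uses up all 4 attempts: prints -1.
        System.out.println(runWithRetries(attempt -> false, MAX_TASK_FAILURES));
    }
}
```

In the model, the -1 result corresponds to the point where the scheduler gives up on the task and the failure is passed up to the stage and job level.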

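2. Coarse-Grained and Fine-Grained Resource Allocation

Coarse-grained resource allocation (Spark)

All resources are requested before the application starts running. If the resources are not available, the application waits; it only starts once they have all been granted. The jobs inside the application therefore do not request resources themselves, so each job runs fast and the application as a whole runs fast. The resources are released only after the last job has finished.

Advantage: jobs run fast, so the application runs fast.

Disadvantage: cluster resources cannot be fully utilized, because they stay reserved until the last job finishes.

Fine-grained resource allocation (MapReduce)

No resources are requested before the application starts. Each job requests its own resources and releases them when it finishes, so each job runs slowly and the application as a whole runs slowly.

Advantage: cluster resources can be fully utilized.
Disadvantage: the application runs slowly.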

3. Transformation & Action Operators
Transformation:
mapPartitionsWithIndex, repartition, coalesce, groupByKey, zip, zipWithIndex
Action:
countByKey, countByValue, reduce

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable.ListBuffer

object Day03 {
  def main(args: Array[String]): Unit = {

    val conf = new SparkConf()
    conf.setMaster("local")
    conf.setAppName("test")
    val sc = new SparkContext(conf)

    /**
      * mapPartitionsWithIndex
      * Processes one whole partition at a time and also receives that partition's index.
      * index: index of the current partition
      * iter:  iterator over the records in that partition
      */
    val rdd1: RDD[String] = sc.parallelize(List[String](
      "love1", "love2", "love3", "love4",
      "love5", "love6", "love7", "love8",
      "love9", "love10", "love11", "love12"
    ), 3)

    val rdd2: RDD[String] = rdd1.mapPartitionsWithIndex((index, iter) => {
      val list = new ListBuffer[String]()
      while (iter.hasNext) {
        val one: String = iter.next()
        list += s"rdd1 partition = [$index], value = [$one]"
      }
      list.iterator
    })
    rdd2.foreach(println)

    /**
      * repartition (usually used to increase the number of partitions)
      * Can increase or decrease the number of partitions; always produces a shuffle.
      *
      * coalesce (usually used to decrease the number of partitions)
      * Can increase or decrease the number of partitions.
      * By default coalesce does not shuffle, so asking for more partitions than the
      * RDD already has has no effect. Passing shuffle = true makes it take effect,
      * but coalesce(n, shuffle = true) is exactly what repartition(n) does.
      */
    val rdd3: RDD[String] = rdd2.repartition(4)

    val rdd4: RDD[String] = rdd3.mapPartitionsWithIndex((index, iter) => {
      val list = new ListBuffer[String]()
      while (iter.hasNext) {
        val one: String = iter.next()
        list += s"rdd3 partition = [$index], value = [$one]"
      }
      list.iterator
    })

    val repartitioned: Array[String] = rdd4.collect()
    repartitioned.foreach(println)

    /**
      * zip
      * Pairs the elements of two RDDs by position. Both RDDs must have the same number
      * of partitions and the same number of elements in each partition, otherwise an
      * error is thrown.
      *
      * zipWithIndex
      * Pairs each element with its index in the RDD (a Long, starting at 0).
      */
    val names: RDD[String] = sc.parallelize(List[String]("张三", "李四", "王五", "赵六"))
    val scores: RDD[String] = sc.parallelize(List[String]("100", "200", "300", "400"))

    // zip
    val zipped: RDD[(String, String)] = names.zip(scores)
    zipped.foreach(println)

    // zipWithIndex
    val indexed: RDD[(String, Long)] = names.zipWithIndex()
    indexed.foreach(println)

    /**
      * groupByKey
      * Collects all values that share a key into one Iterable; produces a shuffle.
      */
    val pairs: RDD[(String, Int)] = sc.parallelize(List[(String, Int)](("a", 100), ("a", 200), ("b", 100), ("b", 200), ("c", 200)))
    pairs.groupByKey().foreach(println)

    sc.stop()
  }

  /**
    * countByKey   counts how many records each key has (pair RDDs only)
    * countByValue counts how many times each element occurs; for a pair RDD the
    *              whole tuple is treated as one element
    * reduce       combines all elements into one result with a binary function
    */
}
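
The three actions named in the closing comment above can be modelled with ordinary Java collections, which makes the difference between countByKey and countByValue easy to see. This is a minimal sketch with no Spark dependency: CountActionsSketch, its methods, and the sample pairs are hypothetical and only imitate the semantics of the Spark actions of the same name.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class CountActionsSketch {

    /** Builds one (key, value) pair; stands in for a Scala tuple. */
    public static Map.Entry<String, Integer> pair(String key, int value) {
        return new SimpleEntry<String, Integer>(key, value);
    }

    /** Hypothetical sample data, not taken from the original post. */
    public static List<Map.Entry<String, Integer>> samplePairs() {
        return Arrays.asList(pair("a", 100), pair("a", 100), pair("a", 200), pair("b", 100));
    }

    /** countByKey: how many records each key has; the values are ignored. */
    public static Map<String, Long> countByKey(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(
                Collectors.groupingBy(Map.Entry::getKey, LinkedHashMap::new, Collectors.counting()));
    }

    /** countByValue: how many times each whole element (here the whole pair) occurs. */
    public static <T> Map<T, Long> countByValue(List<T> elements) {
        return elements.stream().collect(
                Collectors.groupingBy(Function.identity(), LinkedHashMap::new, Collectors.counting()));
    }

    /** reduce: folds all elements into one result with a binary function (a sum here). */
    public static int sumValues(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().map(Map.Entry::getValue).reduce(Integer::sum).orElse(0);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = samplePairs();
        System.out.println(countByKey(pairs));   // key "a" has 3 records, key "b" has 1
        System.out.println(countByValue(pairs)); // the pair (a, 100) occurs twice
        System.out.println(sumValues(pairs));    // 100 + 100 + 200 + 100 = 500
    }
}
```

In Spark all three are actions, so each call triggers a job and brings its result back to the Driver.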

package com.zmd.testJava.day03;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

import java.util.Arrays;
import java.util.List;

public class zipTestJava {

    public static void main(String[] args) {

        SparkConf conf =  new SparkConf();
        conf.setMaster("local");
        conf.setAppName("test");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> rdd1 = sc.parallelize(Arrays.asList(
                "love1", "love2", "love3", "love4",
                "love1", "love2", "love7", "love8",
                "love1", "love2", "love3", "love8"
        ));

        JavaRDD<String> rdd2 = sc.parallelize(Arrays.asList(
                "love1", "love2", "love3", "love4",
                "love5", "love6", "love7", "love8",
                "love9", "love10", "love11", "love12"
        ));

        /**
         * zip
         * Pairs the elements of rdd1 and rdd2 by position.
         */
        JavaPairRDD<String, String> rdd3 = rdd1.zip(rdd2);
        List<Tuple2<String, String>> zipResult = rdd3.collect();
        for (Tuple2<String, String> s : zipResult) {
            System.out.println(s);
        }

        /**
         * zipWithIndex
         * Pairs each element with its own index in the RDD.
         */
        JavaPairRDD<String, Long> rdd4 = rdd1.zipWithIndex();
        List<Tuple2<String, Long>> indexResult = rdd4.collect();
        for (Tuple2<String, Long> s : indexResult) {
            System.out.println(s);
        }

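        /**
         * groupByKey
         * Groups the values that share a key. Here the keys are the strings of rdd1
         * and the values are the indexes assigned by zipWithIndex.
         */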

        JavaPairRDD<String, Long> rdd5 = rdd1.zipWithIndex();
        rdd5.groupByKey().foreach(new VoidFunction<Tuple2<String, Iterable<Long>>>() {
            @Override
            public void call(Tuple2<String, Iterable<Long>> stringIterableTuple2) throws Exception {
                System.out.println(stringIterableTuple2);
            }
        });


    }
}

4. PV and UV, with a Code Implementation

package com.zmd.testSpark

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable

object SparkPvUv {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local")
    conf.setAppName("test")
    val sc = new SparkContext(conf)
    val lines: RDD[String] = sc.textFile("./data/pvnvdata")

    // Visits per region for each site, sorted from most to least visited.
    // Field 5 of each tab-separated line is the site, field 1 is the region.
    val site_local: RDD[(String, String)] = lines.map(line => {
      val fields = line.split("\t")
      (fields(5), fields(1))
    })
    val site_localIterable: RDD[(String, Iterable[String])] = site_local.groupByKey()
    val result: RDD[(String, List[(String, Int)])] = site_localIterable.map(one => {
      val localMap = mutable.Map[String, Int]()
      val site = one._1
      val localIter = one._2.iterator
      // Count how many times each region appears for this site.
      while (localIter.hasNext) {
        val local: String = localIter.next()
        localMap.put(local, localMap.getOrElse(local, 0) + 1)
      }
      val tuples: List[(String, Int)] = localMap.toList.sortBy(t => {
        -t._2
      })

      // Keep at most the 3 regions with the most visits.
      (site, tuples.take(3))
    })
    result.foreach(println)

    // PV: the number of visits to each site, sorted in descending order.
    lines.map(line => {(line.split("\t")(5), 1)}).reduceByKey((v1: Int, v2: Int) => {
      v1 + v2
    }).sortBy(tp => {tp._2}, false).foreach(println)

    // UV: the number of distinct visitors (field 0) of each site, sorted in descending order.
    lines.map(line => {
      val fields = line.split("\t")
      fields(0) + "_" + fields(5)
    })
      .distinct()
      .map(one => {(one.split("_")(1), 1)})
      .reduceByKey(_ + _)
      .sortBy(_._2, false)
      .foreach(println)

    sc.stop()
  }

}
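
PV (page views) counts every visit to a site, while UV (unique visitors) counts each distinct visitor of a site only once. The difference is easiest to see on a handful of records, so here is a minimal plain-Java sketch with no Spark dependency. PvUvSketch and the sample lines are hypothetical; only the field positions (0 for the visitor, 5 for the site) follow the Spark job above, and the remaining fields are filled with placeholders.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class PvUvSketch {

    /** Hypothetical tab-separated log lines: field 0 = visitor, field 5 = site. */
    public static List<String> sampleLines() {
        return Arrays.asList(
                "10.0.0.1\tbeijing\t-\t-\t-\twww.a.com",
                "10.0.0.1\tbeijing\t-\t-\t-\twww.a.com",
                "10.0.0.2\tshanghai\t-\t-\t-\twww.a.com",
                "10.0.0.1\tbeijing\t-\t-\t-\twww.b.com");
    }

    /** PV: every line counts as one page view of its site. */
    public static Map<String, Integer> pv(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String site = line.split("\t")[5];
            counts.merge(site, 1, Integer::sum);
        }
        return counts;
    }

    /** UV: each distinct (visitor, site) combination counts only once. */
    public static Map<String, Integer> uv(List<String> lines) {
        Set<String> seen = new HashSet<>();
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            String site = fields[5];
            // Set.add returns false when this visitor was already counted for this site.
            if (seen.add(fields[0] + "_" + site)) {
                counts.merge(site, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(pv(sampleLines())); // {www.a.com=3, www.b.com=1}
        System.out.println(uv(sampleLines())); // {www.a.com=2, www.b.com=1}
    }
}
```

The sketch orders its output by site name because it uses a TreeMap, whereas the Spark job sorts by count in descending order; the counts themselves follow the same definitions.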

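5. spark-submit Parameters

Parameters for submitting a Spark application:

 --master                 URL of the cluster manager to connect to
 --name                   name of the application
 --deploy-mode            client or cluster
 --conf xx=xxx            an arbitrary Spark configuration property
 --jars                   comma-separated list of extra jars for the driver and executors
 --driver-class-path      extra classpath entries for the driver
 --files                  comma-separated list of files shipped to each executor
 --driver-cores           number of cores for the driver
 --driver-memory          memory for the driver
 --executor-cores         number of cores per executor
 --executor-memory        memory per executor
 --total-executor-cores   total number of cores for all executors (standalone)
 --num-executors          number of executors to launch (YARN)
 --supervise              restart the driver if it fails

Example with --jars:

 ./spark-submit --master spark://mynode1:7077 --jars xxx,xxx,xxx --class .xxxx /xxx.jar args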

6. Spark Source Code: Master Startup

Spark source code