Flink Notes

Day 1 - Flink: A Streaming Computation Framework

Course outline:

Introduction to Flink (features, integrations), Flink environment setup (standalone, YARN), Flink DataSet (batch processing)

Introduction to Flink

Features

  • High throughput, low latency
  • Window functions: event time (key topic)
  • Exactly-once consistency semantics (understand)
  • Fault tolerance (checkpoint, key topic)
  • Its own memory management
  • Watermarks (waterMark: out-of-order and delayed events on the network)
  • State management

Flink's core computation module: the runtime
Roles:

  • JobManager: master node, monitors the worker nodes
  • TaskManager: worker node, executes the actual tasks
    Library support:
  • Graph processing
  • CEP
  • Flink SQL
  • Machine learning
  • DataStream (key topic)
  • DataSet
    Integration support:
  • Flink on YARN
  • HDFS
  • input data from Kafka (day 3)
  • Apache HBase (used in the project)

Two Flink-on-YARN modes

yarn-session mode
  • Resources are requested once, up front
  • Suited to a large number of small jobs
  • The requested resources are not released automatically; the session must be shut down manually
    Command:
bin/yarn-session.sh -n 2 -tm 800 -jm 800 -s 1 -d

-n : number of TaskManager containers

-s : number of slots

-tm: memory of each TaskManager container

-jm: memory of the JobManager container

-d: detached mode

Three containers in total: 2 TaskManagers and 1 JobManager

Submit a job:

 bin/flink run examples/batch/WordCount.jar 

List yarn-session applications:

yarn application -list

Kill a specific yarn-session:

yarn application -kill application_1571196306040_0002

Other options:

yarn-session -help

yarn-cluster mode
  • Resources are requested when the job is submitted and released automatically when it finishes
  • Suited to large jobs; used for batch and offline workloads
  • The job is submitted directly to the YARN cluster

Submission command:

 bin/flink run -m yarn-cluster -yn 2 -ys 2 -ytm 1024 -yjm 1024 /export/servers/flink-1.7.0/examples/batch/WordCount.jar 

-m : execution mode (here yarn-cluster)

-yn : number of TaskManager containers

-ys: number of slots

-ytm: TaskManager memory

-yjm: JobManager memory

/export/servers/flink-1.7.0/examples/batch/WordCount.jar : the jar to execute

Show help:

 bin/flink run -m yarn-cluster -help

Extra: custom Flume source

https://www.cnblogs.com/nstart/p/7699904.html

(image: photo/1571198075931.png)

Flink application development

1. WordCount

package cn.itcast

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
  * @Date 2019/10/16
  */
object WordCount {

  def main(args: Array[String]): Unit = {

    /**
      * 1. Get the batch execution environment
      * 2. Load the data source
      * 3. Transform the data: split, group, aggregate
      * 4. Print the result
      * 5. Trigger execution
      */
    //1. Get the batch execution environment
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2. Load the data source
    val source: DataSet[String] = env.fromElements("ss dd dd ss ff")

    source.flatMap(line=>line.split("\\W+"))//the regex \W+ splits on one or more non-word characters (e.g. spaces)
      .map(line=>(line,1))  //(ss,1)
      //group by the word
      .groupBy(0)
      .sum(1)  //sum the counts
      .print()//print the result; in batch jobs print() is a sink that also triggers execution

    //env.execute()  //explicit trigger, not needed after print()
  }
}

2. Packaging and deployment

Option 1: package with Maven

Option 2: package with IDEA

(image: photo/1571208964476.png)

(image: photo/1571209075510.png)

Run the job:

bin/flink run -m yarn-cluster -yn 1 -ys 1 -ytm 1024 -yjm 1024 /export/servers/tmp/flink-1016.jar 

Keeping dependency jars out of the application jar:

the jar stays small

upgrades and maintenance are easier

Operators

1. map

Transforms each element into exactly one other element

(image: photo/1571209951856.png)

package cn.itcast

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
  * @Date 2019/10/16
  */
object MapDemo {

  def main(args: Array[String]): Unit = {

    /**
      * 1. Get the ExecutionEnvironment
      * 2. Build the data source with fromCollection
      * 3. Create a User case class
      * 4. Transform with map
      * 5. Print to test
      */
    //1. Get the ExecutionEnvironment
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

    //2. Build the data source with fromCollection
    val source: DataSet[String] = env.fromCollection(List("1,张三", "2,李四", "3,王五", "4,赵六"))

    //3. Transform the data
    source.map(line=>{
      val arr: Array[String] = line.split(",")
      User(arr(0).toInt,arr(1))
    }).print()
  }
}

case class User(id:Int,userName:String)

2. flatMap

Transforms each element into 0, 1, or n elements

package cn.itcast

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
  * @Date 2019/10/16
  */
object FlatMap {

  def main(args: Array[String]): Unit = {
    /**
      * 1. Get the ExecutionEnvironment
      * 2. Build the data source with fromElements
      * 3. Transform with flatMap
      * 4. Group with groupBy
      * 5. Aggregate with sum
      * 6. Print to test
      */
    //1. Get the ExecutionEnvironment
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2. Build the data source with fromElements
    val source: DataSet[List[(String, Int)]] = env.fromElements(List(("java", 1), ("scala", 1), ("java", 1)) )
    source.flatMap(line=>line)
      .groupBy(0)  //group by the first field
      .sum(1)  //sum the second field
      .print()  //print and trigger execution
  }
}

3. mapPartition

Partition-wise transformation: processes all elements of a partition in one function call

package cn.itcast

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
  * @Date 2019/10/16
  */
object MapPartition {

  def main(args: Array[String]): Unit = {

    /**
      * 1. Get the ExecutionEnvironment
      * 2. Build the data source with fromElements
      * 3. Transform with mapPartition
      * 4. Print to test
      */
    //1. Get the ExecutionEnvironment
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2. Build the data source with fromElements
    val source: DataSet[(String, Int)] = env.fromElements(("java", 1), ("scala", 1), ("java", 1) )
    //3. Transform the data
    source.mapPartition(line=>{
      line.map(y=>(y._1,y._2))
    }).print()
  }
}

4. filter

Keeps only the elements for which the predicate returns true

package cn.itcast

import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._
/**
  * @Date 2019/10/16
  */
object Filter {


  def main(args: Array[String]): Unit = {

    //1. Get the ExecutionEnvironment
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2. Build the data source with fromElements
    val source: DataSet[(String, Int)] = env.fromElements(("java", 1), ("scala", 1), ("java", 1) )
    //3. Filter the data
    source.filter(line=>line._1.contains("java"))
      .print()

  }
}

5. reduce

Incremental aggregation: reduces the data set (or each group) to a single element

package cn.itcast

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
  * @Date 2019/10/16
  */
object Reduce {

  def main(args: Array[String]): Unit = {

    /**
      * 1. Get the ExecutionEnvironment
      * 2. Build the data source with fromElements
      * 3. Group the data with groupBy
      * 4. Aggregate with reduce
      * 5. Print to test
      */
    //1. Get the ExecutionEnvironment
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2. Build the data source with fromElements
    val source: DataSet[(String, Int)] = env.fromElements(("java", 1), ("scala", 1), ("java", 1)   )
    //3. Group the data
    source.groupBy(0)
      //4. Aggregate with reduce
      .reduce((x,y)=>(x._1,x._2+y._2))
      //5. Print to test
      .print()
  }

}

6. reduce vs. reduceGroup

(image: photo/1571213462938.png)

package cn.itcast

import java.lang

import akka.stream.impl.fusing.Collect
import org.apache.flink.api.common.functions.{GroupCombineFunction, GroupReduceFunction}
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

/**
  * @Date 2019/10/16
  */
object Reduce {

  def main(args: Array[String]): Unit = {

    /**
      * 1. Get the ExecutionEnvironment
      * 2. Build the data source with fromElements
      * 3. Group the data with groupBy
      * 4. Aggregate with reduce / reduceGroup / combineGroup
      * 5. Print to test
      */
    //1. Get the ExecutionEnvironment
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2. Build the data source with fromElements
    val source: DataSet[(String, Int)] = env.fromElements(("java", 1), ("scala", 1), ("java", 1))
    //3. Group the data
    source.groupBy(0)
      //4. Aggregate (reduce variant)
      //.reduce((x,y)=>(x._1,x._2+y._2))
      //reduceGroup variant
//      .reduceGroup(line => {
//      line.reduce((x, y) => (x._1, x._2 + y._2))
//    })

//      .reduceGroup{
//      (in:Iterator[(String,Int)],out:Collector[(String,Int)])=>{
//        val tuple: (String, Int) = in.reduce((x,y)=>(x._1,x._2+y._2))
//        out.collect(tuple)
//      }
//    }

      //combineGroup variant with a custom combine + reduce function
      .combineGroup(new GroupCombineAndReduce)

      //5. Print to test
      .print()
  }

}

//import converters so the Java Iterable can be iterated with Scala syntax
import collection.JavaConverters._
class GroupCombineAndReduce extends GroupReduceFunction[(String,Int),(String,Int)]
  with GroupCombineFunction[(String,Int),(String,Int)] {

  //runs second (after combine)
  override def reduce(values: lang.Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {
    for(line<- values.asScala){
      out.collect(line)
    }
  }

  //runs first; pre-aggregates (combines) the data
  override def combine(values: lang.Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {

    var key= ""
    var sum:Int =0
    for(line<- values.asScala){
      key =line._1
      sum = sum+ line._2
    }
    out.collect((key,sum))

  }
}

Note: the function receives an entire group at once, so the amount of data per group must not be too large.

7. Aggregation functions

package cn.itcast

import org.apache.flink.api.java.aggregation.Aggregations
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._

import scala.collection.mutable

/**
  * @Date 2019/10/16
  */
object Aggregate {

  def main(args: Array[String]): Unit = {

    //1. Get the execution environment
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2. Load the data source
    val data = new mutable.MutableList[(Int, String, Double)]
    data.+=((1, "yuwen", 89.0))
    data.+=((2, "shuxue", 92.2))
    data.+=((3, "yingyu", 89.99))
    data.+=((4, "wuli", 98.9))
    data.+=((5, "yuwen", 88.88))
    data.+=((6, "wuli", 93.00))
    data.+=((7, "yuwen", 94.3))

    val source: DataSet[(Int, String, Double)] = env.fromCollection(data)

    //3. Group the data
    val groupData: GroupedDataSet[(Int, String, Double)] = source.groupBy(1)

    //4. Aggregate the data
    groupData
      //take the minimum by the third field
//      .minBy(2)
      //.maxBy(2)  //returns the whole tuple that contains the extreme value
        //.min(2)
//      .max(2) //returns only the extreme value of that field
        .aggregate(Aggregations.MAX,2)
      .print()
  }
}

8. distinct

Deduplicates the data

package cn.itcast

import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._
import scala.collection.mutable

/**
  * @Date 2019/10/16
  */
object DistinctDemo {

  def main(args: Array[String]): Unit = {

    //1. Get the execution environment
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2. Load the data source
    val data = new mutable.MutableList[(Int, String, Double)]
    data.+=((1, "yuwen", 89.0))
    data.+=((2, "shuxue", 92.2))
    data.+=((3, "yingyu", 89.99))
    data.+=((4, "wuli", 93.00))
    data.+=((5, "yuwen", 89.0))
    data.+=((6, "wuli", 93.00))

    val source: DataSet[(Int, String, Double)] = env.fromCollection(data)
    source.distinct(1) //deduplicate on the second field
      .print()

  }

}

9. Left / right / full outer join

package cn.itcast

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
  * @Date 2019/10/16
  */
object LeftAndRightAndFull {

  def main(args: Array[String]): Unit = {

    //1. Get the execution environment
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2. Load the data sources
    val s1: DataSet[(Int, String)] = env.fromElements((1, "zhangsan") , (2, "lisi") ,(3 , "wangwu") ,(4 , "zhaoliu"))
    val s2: DataSet[(Int, String)] = env.fromElements((1, "beijing"), (2, "shanghai"), (4, "guangzhou"))

    //3. Join the two data sets
    //leftOuterJoin
//    s1.leftOuterJoin(s2).where(0).equalTo(0){
//      (s1,s2)=>{
//        if(s2 == null){
//          (s1._1,s1._2,null)
//        }else{
//          (s1._1,s1._2,s2._2)
//        }
//      }
//    }

      //rightJoin
//    s1.rightOuterJoin(s2).where(0).equalTo(0) {
//      (s1, s2) => {
//        if (s1 == null) {
//          (s2._1, null, s2._2)
//        } else {
//          (s2._1, s1._2, s2._2)
//        }
//      }
//    }

      //fullJoin
      s1.fullOuterJoin(s2).where(0).equalTo(0){
        (s1,s2)=>{
          if (s1 == null) {
            (s2._1, null, s2._2)
          }else if(s2 == null){
            (s1._1,s1._2,null)
          } else {
            (s2._1, s1._2, s2._2)
          }
        }
      }
      .print()
  }
}

10. union

Merges multiple data sets into one

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._

object Union {

  def main(args: Array[String]): Unit = {

    //1. Get the execution environment
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

    val s1: DataSet[String] = env.fromElements("java")
    val s2: DataSet[String] = env.fromElements("scala")
    val s3: DataSet[String] = env.fromElements("java")

    //union: merge the data sets
    s1.union(s2).union(s3).print()
  }

}

11. rebalance

package cn.itcast

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
/**
  * @Date 2019/10/16
  */
object Rebalance {

  def main(args: Array[String]): Unit = {

    /**
      * 1. Get the ExecutionEnvironment
      * 2. Generate a sequence data source
      * 3. Use filter to keep the numbers greater than 50
      * 4. Apply rebalance
      * 5. Use map with a RichMapFunction to build a tuple of (subtask index, number)
      * 6. Print to test
      */

    //1. Get the ExecutionEnvironment
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2. Generate a sequence data source
    val source: DataSet[Long] = env.generateSequence(0,100)
    //3. Use filter to keep the numbers greater than 50
    val filterData: DataSet[Long] = source.filter(_>50)

    //4. Rebalance to avoid data skew
    val rebData: DataSet[Long] = filterData.rebalance()

    //5. Transform the data
    rebData.map(new RichMapFunction[Long,(Int,Long)] {
      var subtask: Int = 0
      //open() runs before map()
      override def open(parameters: Configuration): Unit = {
        //get the index of this subtask
        //via the runtime context
        subtask = getRuntimeContext.getIndexOfThisSubtask

      }

      override def map(value: Long): (Int, Long) = {
        (subtask,value)
      }
    })
    //print the data and trigger execution
      .print()
  }
}

12. partition

Partitioning operators

package cn.itcast

import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode

import scala.collection.mutable

/**
  * @Date 2019/10/16
  */
object PartitionDemo {

  def main(args: Array[String]): Unit = {

    //1. Get the execution environment
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

    //2. Load the data source
    val data = new mutable.MutableList[(Int, Long, String)]
    data.+=((1, 1L, "Hi"))
    data.+=((2, 2L, "Hello"))
    data.+=((3, 2L, "Hello world"))
    data.+=((4, 3L, "Hello world, how are you?"))
    data.+=((5, 3L, "I am fine."))
    data.+=((6, 3L, "Luke Skywalker"))
    data.+=((7, 4L, "Comment#1"))
    data.+=((8, 4L, "Comment#2"))
    data.+=((9, 4L, "Comment#3"))
    data.+=((10, 4L, "Comment#4"))
    data.+=((11, 5L, "Comment#5"))
    data.+=((12, 5L, "Comment#6"))
    data.+=((13, 5L, "Comment#7"))
    data.+=((14, 5L, "Comment#8"))
    data.+=((15, 5L, "Comment#9"))
    data.+=((16, 6L, "Comment#10"))
    data.+=((17, 6L, "Comment#11"))
    data.+=((18, 6L, "Comment#12"))
    data.+=((19, 6L, "Comment#13"))
    data.+=((20, 6L, "Comment#14"))
    data.+=((21, 6L, "Comment#15"))
    val source = env.fromCollection(data)

    //3. partitionByHash partitioning
//    val result: DataSet[(Int, Long, String)] = source.partitionByHash(0).setParallelism(2).mapPartition(line => {
//      line.map(line => (line._1, line._2, line._3))
//    })

    //partitionByRange
//    val result: DataSet[(Int, Long, String)] = source.partitionByRange(0).setParallelism(2).mapPartition(line => {
//      line.map(line => (line._1, line._2, line._3))
//    })

    //sortPartition
    val result: DataSet[(Int, Long, String)] = source.sortPartition(0,Order.DESCENDING).setParallelism(2).mapPartition(line => {
      line.map(line => (line._1, line._2, line._3))
    })

    //4. Write the result out as text
    result.writeAsText("sort",WriteMode.OVERWRITE)

    //5. Trigger execution
    env.execute("partition")
  }
}

13. first

Takes the first N elements; N counts from 1

package cn.itcast

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import scala.collection.mutable

/**
  * @Date 2019/10/16
  */
object FirstDemo {

  def main(args: Array[String]): Unit = {

    //1. Get the execution environment
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    //2. Load the data
    val data = new mutable.MutableList[(Int, Long, String)]
    data.+=((1, 1L, "Hi"))
    data.+=((2, 2L, "Hello"))
    data.+=((3, 2L, "Hello world"))
    data.+=((4, 3L, "Hello world, how are you?"))
    data.+=((5, 3L, "I am fine."))
    data.+=((6, 3L, "Luke Skywalker"))
    data.+=((7, 4L, "Comment#1"))
    data.+=((8, 4L, "Comment#2"))
    data.+=((9, 4L, "Comment#3"))
    data.+=((10, 4L, "Comment#4"))
    data.+=((11, 5L, "Comment#5"))
    data.+=((12, 5L, "Comment#6"))
    data.+=((13, 5L, "Comment#7"))
    data.+=((14, 5L, "Comment#8"))
    data.+=((15, 5L, "Comment#9"))
    data.+=((16, 6L, "Comment#10"))
    data.+=((17, 6L, "Comment#11"))
    data.+=((18, 6L, "Comment#12"))
    data.+=((19, 6L, "Comment#13"))
    data.+=((20, 6L, "Comment#14"))
    data.+=((21, 6L, "Comment#15"))
    val ds = env.fromCollection(data)
    //    ds.first(10).print()
    //you can also group the data first (groupBy) and then call first on each group, as sketched after this example
    ds.first(2).print()

  }
}
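A small sketch of that grouped variant, assuming a tuple data set like the one above (the object name, sample data, and sort order are illustrative, not from the original notes): group by the second field, sort inside each group, then take the first element of every group.

package cn.itcast

import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._

object GroupedFirstDemo {

  def main(args: Array[String]): Unit = {

    //1. Get the execution environment
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    //2. Illustrative sample data
    val ds = env.fromElements((1, 1L, "Hi"), (2, 2L, "Hello"), (3, 2L, "Hello world"), (4, 3L, "I am fine."))

    //3. Group by the second field, sort each group by the first field, take the first element per group
    ds.groupBy(1)
      .sortGroup(0, Order.DESCENDING)
      .first(1)
      .print()
  }
}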

source

  • local collections
  • file-based
    • hdfs
      • txt
      • csv
    • local files
Local files and HDFS
package cn.itcast

import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode
/**
  * @Date 2019/10/16
  */
object TeadTxtDemo {

  def main(args: Array[String]): Unit = {

    //1. Get the execution environment
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2. Read a file from local disk
    //val source: DataSet[String] = env.readTextFile("C:\\Users\\zhb09\\Desktop\\tmp\\user.txt")
    //read a file from HDFS
    val source: DataSet[String] = env.readTextFile("hdfs://node01:8020/tmp/user.txt")

    //3. Transform the data: word count
    val result: AggregateDataSet[(String, Int)] = source.flatMap(_.split(","))
      .map((_, 1))
      .groupBy(0)
      .sum(1)

    //4. Write the result to HDFS; OVERWRITE replaces existing data
    result.writeAsText("hdfs://node01:8020/tmp/user2.txt",WriteMode.OVERWRITE)

    env.execute()
  }
}

CSV files
package cn.itcast

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
  * @Date 2019/10/16
  */
object ReadCsv {


  def main(args: Array[String]): Unit = {

    //1. Get the execution environment
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    //2. Read the CSV file
    val result: DataSet[(String, String, Int)] = env.readCsvFile[(String, String, Int)](
      "C:\\Users\\zhb09\\Desktop\\write\\test\\test.csv",
      lineDelimiter = "\n", //line delimiter
      fieldDelimiter = ",", //field delimiter
      ignoreFirstLine = true, //skip the header line
      lenient = false, //do not silently skip malformed lines
      includedFields = Array(0, 1, 2) //columns to read
    )
    )
    result.first(5).print()
  }
}

Day 2 - Flink: A Streaming Computation Framework

Review:

Flink environment setup
  • standalone

    • submit a job: flink run …/*.jar
    • HA mode
  • flink on yarn

    • yarn-session
      • yarn-session -n 2 -tm 1024 -jm 1024 -s 1 -d
      • flink run …/*.jar
      • suited to a large number of small jobs
      • YARN resources are requested once up front
      • the yarn-session must be shut down manually (yarn application -kill id)
    • yarn-cluster
      • flink run -m yarn-cluster -yn 1 -yjm 1024 -ytm 1024 -ys 1 …/*.jar
      • used for batch and offline jobs
      • suited to large jobs
      • each job requests its own resources and releases them automatically when it finishes
  • Packaging: separate the code from its dependency jars and from its configuration files

  • Operators

    • map/flatmap/reduce/reduceGroup/filter/union
    • source
      • local files
        • csv
        • txt
      • hdfs
        • csv
        • txt

Course outline

1. Batch processing

  • Broadcast variables
  • Distributed cache

2. Stream processing

  • Operators
    • keyBy
    • connect
    • split and select
  • source
    • hdfs
    • local
    • custom source (kafka)
    • mysql
  • sink
    • mysql
    • kafka
    • hbase
    • redis
  • window and time

DataSet

(image: photo/1571277131670.png)

1. Broadcast variables

package cn.itcast.dataset

import java.util

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

import scala.collection.mutable

/**
  * @Date 2019/10/17
  */
object BrocastDemo {

  def main(args: Array[String]): Unit = {

    /**
      * 1. Get the batch execution environment
      * 2. Load the data sources
      * 3. Transform the data
      *   (1) register the broadcast variable
      *   (2) fetch the broadcast variable
      *   (3) merge the data
      * 4. Print the data / trigger execution
      * Requirement: fetch the broadcast data of data2 and combine it with data1 on the second field into (Int, Long, String, String)
      */
    //1. Get the batch execution environment
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2. Load the data sources
    val data1 = new mutable.MutableList[(Int, Long, String)]
    data1 .+=((1, 1L, "xiaoming"))
    data1 .+=((2, 2L, "xiaoli"))
    data1 .+=((3, 2L, "xiaoqiang"))
    val ds1 = env.fromCollection(data1)

    val data2 = new mutable.MutableList[(Int, Long, Int, String, Long)]
    data2 .+=((1, 1L, 0, "Hallo", 1L))
    data2 .+=((2, 2L, 1, "Hallo Welt", 2L))
    data2 .+=((2, 3L, 2, "Hallo Welt wie", 1L))
    val ds2 = env.fromCollection(data2)

    //3. Transform the data
    ds1.map(new RichMapFunction[(Int,Long,String),(Int, Long, String, String)] {

      var ds: util.List[(Int, Long, Int, String, Long)] = null
      //open() runs before map()
      override def open(parameters: Configuration): Unit = {
        //(2) fetch the broadcast variable
        ds = getRuntimeContext.getBroadcastVariable[(Int, Long, Int, String, Long)]("ds2")

      }

      //(3) merge the data
      import collection.JavaConverters._
      override def map(value: (Int, Long, String)): (Int, Long, String, String) = {
        var tuple: (Int, Long, String, String) = null
        for(line<- ds.asScala){
          if(line._2 == value._2){
            tuple = (value._1,value._2,value._3,line._4)
          }
        }
        tuple
      }
    }).withBroadcastSet(ds2,"ds2")  //(1) register the broadcast variable
    //4. Print the data / trigger execution
      .print()
  }
}

2. Distributed cache

package cn.itcast.dataset

import java.io.File

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

import scala.collection.mutable.ArrayBuffer
import scala.io.Source
/**
  * @Date 2019/10/17
  */
object DistributeCache {

  def main(args: Array[String]): Unit = {

    /**
      * 1. Get the execution environment
      * 2. Load the data source
      * 3. Register the distributed cache file
      * 4. Transform the data
      *   (1) fetch the cached file
      *   (2) parse the file
      *   (3) transform the data
      * 5. Print the data / trigger execution
      */

    //1. Get the execution environment
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

    //2. Load the data source
    val clazz:DataSet[Clazz] = env.fromElements(
      Clazz(1,"class_1"),
      Clazz(2,"class_1"),
      Clazz(3,"class_2"),
      Clazz(4,"class_2"),
      Clazz(5,"class_3"),
      Clazz(6,"class_3"),
      Clazz(7,"class_4"),
      Clazz(8,"class_1")
    )

    //3. Register the distributed cache file
    val url = "hdfs://node01:8020/tmp/subject.txt"
    env.registerCachedFile(url,"cache")

    //4. Transform the data
    clazz.map(new RichMapFunction[Clazz,Info] {

      val buffer = new ArrayBuffer[String]()
      override def open(parameters: Configuration): Unit = {
        //(1) fetch the cached file
        val file: File = getRuntimeContext.getDistributedCache.getFile("cache")
        //(2) parse the file
        val strs: Iterator[String] = Source.fromFile(file.getAbsoluteFile).getLines()
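For reference, a minimal self-contained sketch of the registerCachedFile / getDistributedCache pattern used above, assuming a hypothetical cache file whose lines look like `1,xiaoming` (the object name, file path, and tuple fields are illustrative only, not from the original notes):

package cn.itcast.dataset

import java.io.File

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

import scala.collection.mutable
import scala.io.Source

object DistributeCacheSketch {

  def main(args: Array[String]): Unit = {

    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

    //hypothetical scores keyed by id
    val scores: DataSet[(Int, Double)] = env.fromElements((1, 89.0), (2, 92.5), (3, 75.0))

    //register the cached file (path and name are assumptions for this sketch)
    env.registerCachedFile("hdfs://node01:8020/tmp/subject.txt", "cache")

    scores.map(new RichMapFunction[(Int, Double), (Int, String, Double)] {

      //id -> name mapping parsed from the cached file
      val names = mutable.Map[Int, String]()

      override def open(parameters: Configuration): Unit = {
        //fetch the cached file on the TaskManager
        val file: File = getRuntimeContext.getDistributedCache.getFile("cache")
        //parse lines assumed to look like "1,xiaoming"
        for (line <- Source.fromFile(file.getAbsolutePath).getLines()) {
          val arr = line.split(",")
          names += (arr(0).toInt -> arr(1))
        }
      }

      override def map(value: (Int, Double)): (Int, String, Double) = {
        //enrich each element with the name looked up from the cached file
        (value._1, names.getOrElse(value._1, "unknown"), value._2)
      }
    }).print()
  }
}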