Day 1 - Flink: A Streaming Computation Framework
Course plan:
Introduction to Flink (features, integrations), Flink environment setup (standalone, YARN), Flink DataSet (batch processing)
Introduction to Flink
Features
- High throughput, low latency
- Window functions: event time (key point)
- Exactly-once consistency semantics (understand the concept)
- Fault tolerance (checkpoints, key point; see the sketch after this list)
- Manages its own memory
- Watermarks (handle out-of-order and delayed data on the network)
- State management
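Checkpointing and exactly-once semantics are covered in depth on a later day; purely as an illustration, a minimal sketch of switching on checkpointing for a streaming job could look like the following (the 5000 ms interval and the trivial job body are arbitrary example values, not from the course material):
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala._

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // take a checkpoint every 5 seconds (example value)
    env.enableCheckpointing(5000)
    // request exactly-once checkpointing semantics (this is also the default)
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    // a trivial pipeline so the job has something to execute
    env.fromElements(1, 2, 3).map(_ * 2).print()
    env.execute("checkpoint sketch")
  }
}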
Flink's core computation module: runtime
Roles:
- JobManager: the master node; monitors the worker nodes
- TaskManager: a worker node; executes the actual tasks
Library support:
- Graph processing
- CEP
- Flink SQL
- Machine learning
- DataStream (key point)
- DataSet
Integration support:
- Flink on YARN
- HDFS
- Input data from Kafka (Day 3)
- Apache HBase (used in the course project)
Two deployment modes on YARN
yarn-session mode
- The resources are requested once, up front
- Suited to running many small jobs
- The requested resources are not released automatically; the session has to be shut down manually
Command:
bin/yarn-session.sh -n 2 -tm 800 -jm 800 -s 1 -d
-n: the number of TaskManager containers
-s: the number of slots per TaskManager
-tm: the memory of each TaskManager container (MB)
-jm: the memory of the JobManager container (MB)
-d: detached mode
In total three containers are started: 2 TaskManagers and 1 JobManager.
Submit a job:
bin/flink run examples/batch/WordCount.jar
List the running YARN applications (including the yarn-session):
yarn application -list
Kill a specific yarn-session:
yarn application -kill application_1571196306040_0002
See the other options:
bin/yarn-session.sh --help
yarn-cluster mode
- Resources are requested when the job is submitted and released automatically when it finishes
- Suited to large jobs, and to batch/offline workloads
- The job is submitted directly to the YARN cluster
Submission command:
bin/flink run -m yarn-cluster -yn 2 -ys 2 -ytm 1024 -yjm 1024 /export/servers/flink-1.7.0/examples/batch/WordCount.jar
-m: the execution mode (here yarn-cluster)
-yn: the number of TaskManager containers
-ys: the number of slots per TaskManager
-ytm: the TaskManager memory (MB)
-yjm: the JobManager memory (MB)
/export/servers/flink-1.7.0/examples/batch/WordCount.jar : the JAR to execute
Show the help:
bin/flink run -m yarn-cluster -help
Extension: writing a custom Flume source
[Image: photo/1571198075931.png]
Flink application development
1. WordCount
package cn.itcast
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
* @Date 2019/10/16
*/
object WordCount {
def main(args: Array[String]): Unit = {
/**
* 1. Get the batch execution environment
* 2. Load the data source
* 3. Transform the data: split, group, aggregate
* 4. Print the result
* 5. Trigger execution
*/
//1. Get the batch execution environment
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2. Load the data source
val source: DataSet[String] = env.fromElements("ss dd dd ss ff")
source.flatMap(line=>line.split("\\W+")) // split on the regex \W+ (one or more non-word characters, e.g. spaces)
.map(line=>(line,1)) //(ss,1)
//group by the word
.groupBy(0)
.sum(1) // sum the counts
.print() // print the result; in the batch API print() is a sink that also triggers execution
//env.execute() // explicit trigger, not needed after print()
}
}
2. Packaging and deployment
Option 1: package with Maven
Option 2: package from IDEA
[Image: photo/1571208964476.png]
[Image: photo/1571209075510.png]
Run the job:
bin/flink run -m yarn-cluster -yn 1 -ys 1 -ytm 1024 -yjm 1024 /export/servers/tmp/flink-1016.jar
Keeping the dependency JARs out of the application JAR:
- the JAR stays small
- upgrades and maintenance are easier
Operators
1. map
Transforms each element into exactly one other element.
[Image: photo/1571209951856.png]
package cn.itcast
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
* @Date 2019/10/16
*/
object MapDemo {
def main(args: Array[String]): Unit = {
/**
* 1. Get the ExecutionEnvironment
* 2. Build the data source with fromCollection
* 3. Create a User case class
* 4. Transform with map
* 5. Print to verify
*/
//1. Get the ExecutionEnvironment
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2. Build the data source with fromCollection
val source: DataSet[String] = env.fromCollection(List("1,张三", "2,李四", "3,王五", "4,赵六"))
//3. Transform the data
source.map(line=>{
val arr: Array[String] = line.split(",")
User(arr(0).toInt,arr(1))
}).print()
}
}
case class User(id:Int,userName:String)
2. flatMap
Transforms each element into 0, 1, or n elements.
package cn.itcast
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
* @Date 2019/10/16
*/
object FlatMap {
def main(args: Array[String]): Unit = {
/**
* 1. Get the ExecutionEnvironment
* 2. Build the data source with fromElements
* 3. Transform with flatMap
* 4. Group with groupBy
* 5. Sum with sum
* 6. Print to verify
*/
//1. Get the ExecutionEnvironment
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2. Build the data source with fromElements
val source: DataSet[List[(String, Int)]] = env.fromElements(List(("java", 1), ("scala", 1), ("java", 1)) )
source.flatMap(line=>line)
.groupBy(0) // group by the first tuple field
.sum(1) // sum the second tuple field
.print() // print and trigger execution
}
}
3. mapPartition
Partition-wise transformation: the function is called once per partition and receives all elements of that partition as an iterator.
package cn.itcast
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
* @Date 2019/10/16
*/
object MapPartition {
def main(args: Array[String]): Unit = {
/**
* 1. Get the ExecutionEnvironment
* 2. Build the data source with fromElements
* 3. Create a Demo case class
* 4. Transform with mapPartition
* 5. Print to verify
*/
//1. Get the ExecutionEnvironment
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2. Build the data source with fromElements
val source: DataSet[(String, Int)] = env.fromElements(("java", 1), ("scala", 1), ("java", 1) )
//3. Transform the data
source.mapPartition(line=>{
line.map(y=>(y._1,y._2))
}).print()
}
}
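The example above only passes each tuple through unchanged. The usual reason to reach for mapPartition instead of map is to pay an expensive setup cost (for example, opening a database connection) once per partition rather than once per element. A minimal sketch of that pattern, with the real resource replaced by a hypothetical placeholder string:
package cn.itcast
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._

object MapPartitionSetup {
  def main(args: Array[String]): Unit = {
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    val source: DataSet[(String, Int)] = env.fromElements(("java", 1), ("scala", 1), ("java", 1))
    source.mapPartition(iter => {
      // hypothetical per-partition setup (stands in for e.g. opening a connection);
      // it runs once for the whole partition, not once per element
      val resource = "connection-placeholder"
      // reuse the resource for every element of the partition
      iter.map(t => (resource + ":" + t._1, t._2))
    }).print()
  }
}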
4. filter
Keeps only the elements for which the predicate returns true.
package cn.itcast
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._
/**
* @Date 2019/10/16
*/
object Filter {
def main(args: Array[String]): Unit = {
//1. Get the ExecutionEnvironment
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2. Build the data source with fromElements
val source: DataSet[(String, Int)] = env.fromElements(("java", 1), ("scala", 1), ("java", 1) )
//3. Filter the data
source.filter(line=>line._1.contains("java"))
.print()
}
}
5. reduce
Incremental aggregation: combines the elements of a (grouped) data set pairwise until a single result per group remains.
package cn.itcast
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
* @Date 2019/10/16
*/
object Reduce {
def main(args: Array[String]): Unit = {
/**
* 1. Get the ExecutionEnvironment
* 2. Build the data source with fromElements
* 3. Transform with map and groupBy
* 4. Aggregate with reduce
* 5. Print to verify
*/
//1. Get the ExecutionEnvironment
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2. Build the data source with fromElements
val source: DataSet[(String, Int)] = env.fromElements(("java", 1), ("scala", 1), ("java", 1) )
//3. Transform the data
source.groupBy(0)
//4. Aggregate with reduce
.reduce((x,y)=>(x._1,x._2+y._2))
//5. Print to verify
.print()
}
}
6. reduce and reduceGroup
reduce combines the elements of a group pairwise, while reduceGroup hands all elements of a group to the function at once through an iterator; a group combiner (combineGroup) can pre-aggregate the data locally before it is shuffled.
[Image: photo/1571213462938.png]
package cn.itcast
import java.lang
import org.apache.flink.api.common.functions.{GroupCombineFunction, GroupReduceFunction}
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector
/**
* @Date 2019/10/16
*/
object ReduceGroupDemo {
def main(args: Array[String]): Unit = {
/**
* 1. Get the ExecutionEnvironment
* 2. Build the data source with fromElements
* 3. Transform with map and groupBy
* 4. Aggregate with reduce / reduceGroup / combineGroup
* 5. Print to verify
*/
//1. Get the ExecutionEnvironment
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2. Build the data source with fromElements
val source: DataSet[(String, Int)] = env.fromElements(("java", 1), ("scala", 1), ("java", 1))
//3. Transform the data
source.groupBy(0)
//4. Aggregate: reduce, reduceGroup, or combineGroup
//.reduce((x,y)=>(x._1,x._2+y._2))
//reduceGroup variant:
// .reduceGroup(line => {
// line.reduce((x, y) => (x._1, x._2 + y._2))
// })
// .reduceGroup{
// (in:Iterator[(String,Int)],out:Collector[(String,Int)])=>{
// val tuple: (String, Int) = in.reduce((x,y)=>(x._1,x._2+y._2))
// out.collect(tuple)
// }
// }
//combineGroup with a custom combine + reduce function
.combineGroup(new GroupCombineAndReduce)
//5. Print to verify
.print()
}
}
// converters so the java.lang.Iterable can be traversed with Scala syntax
import collection.JavaConverters._
class GroupCombineAndReduce extends GroupReduceFunction[(String,Int),(String,Int)]
with GroupCombineFunction[(String,Int),(String,Int)] {
// runs second, after combine, on the pre-aggregated values
override def reduce(values: lang.Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {
for(line<- values.asScala){
out.collect(line)
}
}
// runs first; pre-aggregates (combines) the data locally before it is shuffled
override def combine(values: lang.Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {
var key= ""
var sum:Int =0
for(line<- values.asScala){
key =line._1
sum = sum+ line._2
}
out.collect((key,sum))
}
}
Note: reduceGroup/combineGroup receive all elements of a group at once, so the amount of data per group must not be too large.
7. Aggregation functions
package cn.itcast
import org.apache.flink.api.java.aggregation.Aggregations
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._
import scala.collection.mutable
/**
* @Date 2019/10/16
*/
object Aggregate {
def main(args: Array[String]): Unit = {
//1. Get the execution environment
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2. Load the data source
val data = new mutable.MutableList[(Int, String, Double)]
data.+=((1, "yuwen", 89.0))
data.+=((2, "shuxue", 92.2))
data.+=((3, "yingyu", 89.99))
data.+=((4, "wuli", 98.9))
data.+=((5, "yuwen", 88.88))
data.+=((6, "wuli", 93.00))
data.+=((7, "yuwen", 94.3))
val source: DataSet[(Int, String, Double)] = env.fromCollection(data)
//3. Group the data
val groupData: GroupedDataSet[(Int, String, Double)] = source.groupBy(1)
//4. Aggregate the data
groupData
//minBy(2) below: return the whole tuple whose third field is smallest
// .minBy(2)
//.maxBy(2) // maxBy/minBy return the complete tuple holding the extreme value
//.min(2)
// .max(2) // min/max only return the extreme value of the selected field
.aggregate(Aggregations.MAX,2)
.print()
}
}
8. distinct
Removes duplicate records.
package cn.itcast
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._
import scala.collection.mutable
/**
* @Date 2019/10/16
*/
object DistinctDemo {
def main(args: Array[String]): Unit = {
//1. Get the execution environment
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2. Load the data source
val data = new mutable.MutableList[(Int, String, Double)]
data.+=((1, "yuwen", 89.0))
data.+=((2, "shuxue", 92.2))
data.+=((3, "yingyu", 89.99))
data.+=((4, "wuli", 93.00))
data.+=((5, "yuwen", 89.0))
data.+=((6, "wuli", 93.00))
val source: DataSet[(Int, String, Double)] = env.fromCollection(data)
source.distinct(1) // deduplicate on the second field
.print()
}
}
9. Left / right / full outer join
package cn.itcast
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
* @Date 2019/10/16
*/
object LeftAndRightAndFull {
def main(args: Array[String]): Unit = {
//1. Get the execution environment
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2. Load the data sources
val s1: DataSet[(Int, String)] = env.fromElements((1, "zhangsan") , (2, "lisi") ,(3 , "wangwu") ,(4 , "zhaoliu"))
val s2: DataSet[(Int, String)] = env.fromElements((1, "beijing"), (2, "shanghai"), (4, "guangzhou"))
//3. Join the two data sets
//left outer join
// s1.leftOuterJoin(s2).where(0).equalTo(0){
// (s1,s2)=>{
// if(s2 == null){
// (s1._1,s1._2,null)
// }else{
// (s1._1,s1._2,s2._2)
// }
// }
// }
//right outer join
// s1.rightOuterJoin(s2).where(0).equalTo(0) {
// (s1, s2) => {
// if (s1 == null) {
// (s2._1, null, s2._2)
// } else {
// (s2._1, s1._2, s2._2)
// }
// }
// }
//full outer join
s1.fullOuterJoin(s2).where(0).equalTo(0){
(s1,s2)=>{
if (s1 == null) {
(s2._1, null, s2._2)
}else if(s2 == null){
(s1._1,s1._2,null)
} else {
(s2._1, s1._2, s2._2)
}
}
}
.print()
}
}
10. union
Merges multiple data sets of the same type.
package cn.itcast
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
object Union {
def main(args: Array[String]): Unit = {
//1. Get the execution environment
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
val s1: DataSet[String] = env.fromElements("java")
val s2: DataSet[String] = env.fromElements("scala")
val s3: DataSet[String] = env.fromElements("java")
//merge the data sets with union
s1.union(s2).union(s3).print()
}
}
11. rebalance
Round-robin repartitioning that evens the data out across the parallel subtasks (counters data skew).
package cn.itcast
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
/**
* @Date 2019/10/16
*/
object Rebalance {
def main(args: Array[String]): Unit = {
/**
* 1. Get the ExecutionEnvironment
* 2. Generate a sequence data source
* 3. Keep only the numbers greater than 50 with filter
* 4. Apply rebalance
* 5. Map with a RichMapFunction that pairs each number with the ID of the subtask that processed it
* 6. Print to verify
*/
//1. Get the ExecutionEnvironment
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2. Generate a sequence data source
val source: DataSet[Long] = env.generateSequence(0,100)
//3. Keep only the numbers greater than 50
val filterData: DataSet[Long] = source.filter(_>50)
//4. Rebalance to avoid data skew
val rebData: DataSet[Long] = filterData.rebalance()
//5. Transform the data
rebData.map(new RichMapFunction[Long,(Int,Long)] {
var subtask: Int = 0
//open() runs once before map() is called
override def open(parameters: Configuration): Unit = {
//get the index of this parallel subtask
//via the runtime context
subtask = getRuntimeContext.getIndexOfThisSubtask
}
override def map(value: Long): (Int, Long) = {
(subtask,value)
}
})
//print the data and trigger execution
.print()
}
}
12. partition
Partitioning operators (partitionByHash, partitionByRange, sortPartition).
package cn.itcast
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode
import scala.collection.mutable
/**
* @Date 2019/10/16
*/
object PartitionDemo {
def main(args: Array[String]): Unit = {
//1. Get the execution environment
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2. Load the data source
val data = new mutable.MutableList[(Int, Long, String)]
data.+=((1, 1L, "Hi"))
data.+=((2, 2L, "Hello"))
data.+=((3, 2L, "Hello world"))
data.+=((4, 3L, "Hello world, how are you?"))
data.+=((5, 3L, "I am fine."))
data.+=((6, 3L, "Luke Skywalker"))
data.+=((7, 4L, "Comment#1"))
data.+=((8, 4L, "Comment#2"))
data.+=((9, 4L, "Comment#3"))
data.+=((10, 4L, "Comment#4"))
data.+=((11, 5L, "Comment#5"))
data.+=((12, 5L, "Comment#6"))
data.+=((13, 5L, "Comment#7"))
data.+=((14, 5L, "Comment#8"))
data.+=((15, 5L, "Comment#9"))
data.+=((16, 6L, "Comment#10"))
data.+=((17, 6L, "Comment#11"))
data.+=((18, 6L, "Comment#12"))
data.+=((19, 6L, "Comment#13"))
data.+=((20, 6L, "Comment#14"))
data.+=((21, 6L, "Comment#15"))
val source = env.fromCollection(data)
//3. partitionByHash: partition by the hash of field 0
// val result: DataSet[(Int, Long, String)] = source.partitionByHash(0).setParallelism(2).mapPartition(line => {
// line.map(line => (line._1, line._2, line._3))
// })
//partitionByRange
// val result: DataSet[(Int, Long, String)] = source.partitionByRange(0).setParallelism(2).mapPartition(line => {
// line.map(line => (line._1, line._2, line._3))
// })
//sortPartition
val result: DataSet[(Int, Long, String)] = source.sortPartition(0,Order.DESCENDING).setParallelism(2).mapPartition(line => {
line.map(line => (line._1, line._2, line._3))
})
//4. Write the result out
result.writeAsText("sort",WriteMode.OVERWRITE)
//5. Trigger execution
env.execute("partition")
}
}
13. first
Takes the first N records; N is a count starting at 1.
package cn.itcast
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import scala.collection.mutable
/**
* @Date 2019/10/16
*/
object FirstDemo {
def main(args: Array[String]): Unit = {
//1. Get the execution environment
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//2. Load the data
val data = new mutable.MutableList[(Int, Long, String)]
data.+=((1, 1L, "Hi"))
data.+=((2, 2L, "Hello"))
data.+=((3, 2L, "Hello world"))
data.+=((4, 3L, "Hello world, how are you?"))
data.+=((5, 3L, "I am fine."))
data.+=((6, 3L, "Luke Skywalker"))
data.+=((7, 4L, "Comment#1"))
data.+=((8, 4L, "Comment#2"))
data.+=((9, 4L, "Comment#3"))
data.+=((10, 4L, "Comment#4"))
data.+=((11, 5L, "Comment#5"))
data.+=((12, 5L, "Comment#6"))
data.+=((13, 5L, "Comment#7"))
data.+=((14, 5L, "Comment#8"))
data.+=((15, 5L, "Comment#9"))
data.+=((16, 6L, "Comment#10"))
data.+=((17, 6L, "Comment#11"))
data.+=((18, 6L, "Comment#12"))
data.+=((19, 6L, "Comment#13"))
data.+=((20, 6L, "Comment#14"))
data.+=((21, 6L, "Comment#15"))
val ds = env.fromCollection(data)
// ds.first(10).print()
//you can also group first and then take the first N records per group (see the sketch after this example)
ds.first(2).print()
}
}
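As the comment in the code above suggests, first can also be applied after groupBy (optionally combined with sortGroup) to take the first N records of every group. A small sketch of that combination on a reduced data set:
package cn.itcast
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._

object GroupFirstDemo {
  def main(args: Array[String]): Unit = {
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    val ds: DataSet[(Int, Long, String)] = env.fromElements(
      (1, 1L, "Hi"), (2, 2L, "Hello"), (3, 2L, "Hello world"),
      (4, 3L, "Hello world, how are you?"), (5, 3L, "I am fine."), (6, 3L, "Luke Skywalker"))
    // group by the second field, sort each group by the first field in descending order,
    // then keep the first 2 records of every group
    ds.groupBy(1)
      .sortGroup(0, Order.DESCENDING)
      .first(2)
      .print()
  }
}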
source
- Local collections
- File based
  - txt and csv
  - from the local file system or from HDFS
Reading from the local file system and from HDFS:
package cn.itcast
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode
/**
* @Date 2019/10/16
*/
object TeadTxtDemo {
def main(args: Array[String]): Unit = {
//1. Get the execution environment
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2. Read a file from the local disk
//val source: DataSet[String] = env.readTextFile("C:\\Users\\zhb09\\Desktop\\tmp\\user.txt")
//read a file from HDFS
val source: DataSet[String] = env.readTextFile("hdfs://node01:8020/tmp/user.txt")
//3. Transform the data: word count
val result: AggregateDataSet[(String, Int)] = source.flatMap(_.split(","))
.map((_, 1))
.groupBy(0)
.sum(1)
//4. Write the result to HDFS; WriteMode.OVERWRITE overwrites existing data
result.writeAsText("hdfs://node01:8020/tmp/user2.txt",WriteMode.OVERWRITE)
env.execute()
}
}
Reading CSV files
package cn.itcast
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
* @Date 2019/10/16
*/
object ReadCsv {
def main(args: Array[String]): Unit = {
//1. Get the execution environment
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//2. Read the CSV file
val result: DataSet[(String, String, Int)] = env.readCsvFile[(String, String, Int)](
"C:\\Users\\zhb09\\Desktop\\write\\test\\test.csv",
lineDelimiter = "\n", // line delimiter
fieldDelimiter = ",", // delimiter between fields
ignoreFirstLine = true, // skip the header line
lenient = false, // do not silently skip malformed lines
includedFields = Array(0, 1, 2) // which columns to read
)
result.first(5).print()
}
}
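readCsvFile can also map each row directly onto a case class instead of a tuple; a minimal sketch, assuming the same hypothetical test.csv whose three columns line up with the case-class fields:
package cn.itcast
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._

case class Score(name: String, subject: String, points: Int)

object ReadCsvAsCaseClass {
  def main(args: Array[String]): Unit = {
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    // each CSV row becomes one Score instance (fields are matched by position)
    val result: DataSet[Score] = env.readCsvFile[Score](
      "C:\\Users\\zhb09\\Desktop\\write\\test\\test.csv",
      fieldDelimiter = ",",
      ignoreFirstLine = true
    )
    result.first(5).print()
  }
}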
Day 2 - Flink: A Streaming Computation Framework
Review:
Flink environment setup
- standalone
  - Submit a job: flink run …/*.jar
  - HA mode
- Flink on YARN
  - yarn-session
    - yarn-session -n 2 -tm 1024 -jm 1024 -s 1 -d
    - flink run …/*.jar
    - suited to many small jobs
    - the YARN resources are requested once, up front
    - the yarn-session has to be closed manually (yarn application -kill <id>)
  - yarn-cluster
    - flink run -m yarn-cluster -yn 1 -yjm 1024 -ytm 1024 -ys 1 …/*.jar
    - suited to batch and offline jobs
    - suited to large jobs
    - each job requests its own resources and releases them automatically when it finishes
- Packaging: separate the code from its dependency JARs and from the configuration files
- Operators
  - map / flatMap / reduce / reduceGroup / filter / union
  - source
    - local files: csv, txt
    - HDFS: csv, txt
    - local collections
Course plan
1. Batch processing
- Broadcast variables
- Distributed cache
2. Stream processing
- Operators
  - keyBy
  - connect
  - split and select
- source
  - HDFS
  - local
  - custom sources (Kafka)
  - MySQL
- sink
  - MySQL
  - Kafka
  - HBase
  - Redis
- window and time
DataSet
[Image: photo/1571277131670.png]
1. Broadcast variables
package cn.itcast.dataset
import java.util
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import scala.collection.mutable
/**
* @Date 2019/10/17
*/
object BrocastDemo {
def main(args: Array[String]): Unit = {
/**
* 1. Get the batch execution environment
* 2. Load the data sources
* 3. Transform the data
* (1) register the broadcast variable
* (2) fetch the broadcast variable
* (3) merge the data
* 4. Print the data / trigger execution
* Requirement: fetch the broadcast data set data2 from memory and combine it with data1 on the second field into (Int, Long, String, String)
*/
//1. Get the batch execution environment
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2. Load the data sources
val data1 = new mutable.MutableList[(Int, Long, String)]
data1 .+=((1, 1L, "xiaoming"))
data1 .+=((2, 2L, "xiaoli"))
data1 .+=((3, 2L, "xiaoqiang"))
val ds1 = env.fromCollection(data1)
val data2 = new mutable.MutableList[(Int, Long, Int, String, Long)]
data2 .+=((1, 1L, 0, "Hallo", 1L))
data2 .+=((2, 2L, 1, "Hallo Welt", 2L))
data2 .+=((2, 3L, 2, "Hallo Welt wie", 1L))
val ds2 = env.fromCollection(data2)
//3. Transform the data
ds1.map(new RichMapFunction[(Int,Long,String),(Int, Long, String, String)] {
var ds: util.List[(Int, Long, Int, String, Long)] = null
//open() runs once before map() is called
override def open(parameters: Configuration): Unit = {
//(2) fetch the broadcast variable
ds = getRuntimeContext.getBroadcastVariable[(Int, Long, Int, String, Long)]("ds2")
}
//(3) merge the data
import collection.JavaConverters._
override def map(value: (Int, Long, String)): (Int, Long, String, String) = {
var tuple: (Int, Long, String, String) = null
for(line<- ds.asScala){
if(line._2 == value._2){
tuple = (value._1,value._2,value._3,line._4)
}
}
tuple
}
}).withBroadcastSet(ds2,"ds2") //(1) register ds2 as a broadcast set
//4. Print the data / trigger execution
.print()
}
}
2. Distributed cache
package cn.itcast.dataset
import java.io.File
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import scala.collection.mutable.ArrayBuffer
import scala.io.Source
/**
* @Date 2019/10/17
*/
object DistributeCache {
def main(args: Array[String]): Unit = {
/**
* 1. Get the execution environment
* 2. Load the data source
* 3. Register the distributed cache file
* 4. Transform the data
* (1) get the cached file
* (2) parse the file
* (3) transform the data
* 5. Print the data / trigger execution
*/
//1. Get the execution environment
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//2. Load the data source
val clazz:DataSet[Clazz] = env.fromElements(
Clazz(1,"class_1"),
Clazz(2,"class_1"),
Clazz(3,"class_2"),
Clazz(4,"class_2"),
Clazz(5,"class_3"),
Clazz(6,"class_3"),
Clazz(7,"class_4"),
Clazz(8,"class_1")
)
//3. Register the distributed cache file
val url = "hdfs://node01:8020/tmp/subject.txt"
env.registerCachedFile(url,"cache")
//4. Transform the data
clazz.map(new RichMapFunction[Clazz,Info] {
val buffer = new ArrayBuffer[String]()
override def open(parameters: Configuration): Unit = {
//(1) get the cached file
val file: File = getRuntimeContext.getDistributedCache.getFile("cache")
//(2) parse the file
val strs: Iterator[String] = Source.fromFile(file.getAbsoluteFile).g