I. Environment
- When a Flink job is submitted for execution, it first has to establish a connection with the Flink framework, i.e. obtain the current Flink runtime environment. Only once this environment object has been obtained can tasks be scheduled to the different TaskManagers for execution. Getting the environment object is quite simple:
val env1 = ExecutionEnvironment.getExecutionEnvironment
val env2 = StreamExecutionEnvironment.getExecutionEnvironment
- Whenever you write a Flink program, it is a good idea to add the following import first; it provides the implicit conversions (implicit TypeInformation) that later calls such as addSource, addSink, map and so on rely on:
import org.apache.flink.streaming.api.scala._
II. Source
- Flink can ingest data from a variety of places and hand it to the framework for processing; wherever the data comes from is referred to as a source.
1. Reading data from a collection: fromCollection
- In simple cases the data can be held temporarily in memory in some data structure and used as a source; using a collection for this is the most common approach.
package cn.kgc.kb09.source
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
object SourceTest {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
val stream1: DataStream[String] = env.fromCollection(List(
"hello java",
"hello flink"
))
stream1.print("demo")
env.execute("demo")
}
}
2. Reading data from a file: readTextFile
- More commonly, data is read from a storage medium; a typical case is using log files as the data source.
package cn.kgc.kb09.source
import org.apache.flink.api.common.functions.{MapFunction, RichMapFunction}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
object SourceFile {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val fileDS: DataStream[String] = env.readTextFile("input/data.log")
val waterStream1: DataStream[WaterSensor] = fileDS.map(x => WaterSensor(x.split(",")(0), x.split(",")(1).toLong, x.split(",")(2).toDouble))
waterStream1.print("anonymous function")
val waterStream2: DataStream[WaterSensor] = fileDS.map(new MyMapFunction)
waterStream2.print("custom class")
env.execute("sensor")
}
}
class MyMapFunction extends MapFunction[String,WaterSensor]{
override def map(t: String): WaterSensor = {
val strings: Array[String] = t.split(",")
WaterSensor(strings(0),strings(1).toLong,strings(2).toDouble)
}
}
// Skeleton of the "rich" variant: RichMapFunction additionally offers the open()/close()
// lifecycle hooks and access to the runtime context (not used by the main method above).
class MyRichMapFunction extends RichMapFunction[String,WaterSensor]{
override def open(parameters: Configuration): Unit = super.open(parameters)
override def close(): Unit = super.close()
override def map(in: String): WaterSensor = {
null // left unimplemented in this skeleton
}
}
3. Reading data from Kafka
- As a message transport queue, Kafka is a distributed, high-throughput, easily scalable publish/subscribe messaging system based on topics. In today's enterprise development, Kafka plus Flink has become a first choice for building real-time data processing systems.
- Add the dependency for the Kafka connector:
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka_2.11</artifactId>
<version>1.7.2</version>
</dependency>
package cn.kgc.kb09.source
import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.kafka.clients.consumer.ConsumerConfig
object SourceKafka {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val prop = new Properties()
prop.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,"192.168.198.201:9092")
prop.setProperty(ConsumerConfig.GROUP_ID_CONFIG,"flink-kafka-demo")
prop.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,"org.apache.kafka.common.serialization.StringDeserializer")
prop.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,"org.apache.kafka.common.serialization.StringDeserializer")
prop.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG,"latest")
val kafkaDStream: DataStream[String] = env.addSource(
new FlinkKafkaConsumer[String]("FlinkKafka", new SimpleStringSchema(), prop)
)
kafkaDStream.print()
env.execute("kafkademo")
}
}
4. Custom source
- In most cases the sources above are sufficient, but special situations do come up, so Flink also provides a way to define a custom source.
package cn.kgc.kb09.source
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala._
import scala.util.Random
case class WaterSensor(id:String,ts:Long,vc:Double)
class MySensorSource extends SourceFunction[WaterSensor] {
var flag=true
override def run(sourceContext: SourceFunction.SourceContext[WaterSensor]): Unit = {
while (flag){
sourceContext.collect(
WaterSensor(
"sensor_"+new Random().nextInt(3),
System.currentTimeMillis(),
new Random().nextInt(5)+40
)
)
Thread.sleep(1000)
}
}
override def cancel(): Unit ={
flag=false
}
}
object SourceMy{
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val mydefDStream: DataStream[WaterSensor] = env.addSource(new MySensorSource)
mydefDStream.print()
env.execute("mydefsource")
}
}
III. Transform
- In Spark, operators are divided into transformations and actions; calling a transformation turns one RDD into another RDD. Flink has the same kind of operation: one data stream can be transformed into another data stream.
- During a transformation the type of the data stream can change as well. So which data types does Flink actually support? In practice, all the commonly used ones: Long, String, Integer, Int, tuples, case classes, List, Map, and so on (see the sketch below).
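- A minimal sketch of streams over several of the supported element types (the object name TypeDemo and the Sensor case class are illustrative only, not part of the original examples):
import org.apache.flink.streaming.api.scala._
object TypeDemo {
  case class Sensor(id: String, vc: Double) // case class elements
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // basic types, tuples, case classes and Scala collections all get their TypeInformation
    // derived implicitly through the org.apache.flink.streaming.api.scala._ import
    val longs: DataStream[Long] = env.fromElements(1L, 2L, 3L)
    val tuples: DataStream[(String, Int)] = env.fromElements(("a", 1), ("b", 2))
    val sensors: DataStream[Sensor] = env.fromElements(Sensor("s1", 40.0))
    val lists: DataStream[List[Int]] = env.fromElements(List(1, 2), List(3))
    longs.print(); tuples.print(); sensors.print(); lists.print()
    env.execute("types")
  }
}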
1. map
- Mapping: transforms each element of the stream to form a new stream; it consumes one element and produces one element.
- Parameter: a Scala anonymous function or a MapFunction
- Returns: DataStream
package cn.kgc.kb09.transform
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
case class WaterSensor(id: String, ts: Long, vc: Double)
object Transform_Map {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val sensorDS: DataStream[WaterSensor] = env.fromCollection(
List(
WaterSensor("sensor_0", 1609142261216l, 44.0),
WaterSensor("sensor_1", 1609142261289l, 43.0),
WaterSensor("sensor_2", 1609142261223l, 45.0)
)
)
val mapDStream: DataStream[(String, String, String)] = sensorDS.map(x=>(x.id+"_bak",x.ts+"_bak",x.vc+"_bak"))
mapDStream.print()
env.execute("kafkademo")
}
}
MapFunction
- For every operator, Flink provides at least two ways to pass the parameter: a Scala anonymous function or a function class. When a function class is used, the custom function has to extend the designated parent class or implement the corresponding interface, e.g. MapFunction.
import org.apache.flink.api.common.functions.MapFunction
import org.apache.flink.streaming.api.scala._
object Transform_MapFunction {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val sensorDS: DataStream[String] = env.readTextFile("input/sensor-data.log")
sensorDS.map(new MyMapFunction)
env.execute("map")
}
class MyMapFunction extends MapFunction[String,WaterSensor]{
override def map(t: String): WaterSensor = {
val datas: Array[String] = t.split(",")
WaterSensor(datas(0),datas(1).toLong,datas(2).toDouble)
}
}
case class WaterSensor(id: String, ts: Long, vc: Double)
}
RichMapFunction
- Every Flink function class has a Rich version. It differs from the regular function in that it can access the runtime context and has lifecycle methods, which makes it possible to implement more complex logic; in other words, it provides richer functionality. Example: RichMapFunction.
import org.apache.flink.api.common.functions.{MapFunction, RichMapFunction}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
object Transform_RichMapFunction {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val sensorDS: DataStream[String] = env.readTextFile("input/sensor-data.log")
val myMapDS: DataStream[WaterSensor] = sensorDS.map(new MyRichMapFunction)
myMapDS.print()
env.execute("map")
}
class MyRichMapFunction extends RichMapFunction[String,WaterSensor]{
override def map(value: String): WaterSensor = {
val datas: Array[String] = value.split(",")
WaterSensor(getRuntimeContext.getTaskName, datas(1).toLong, datas(2).toDouble)
}
// lifecycle hooks, left empty in this example (see the notes below)
override def open(parameters: Configuration): Unit = {
}
override def close(): Unit = {
}
}
case class WaterSensor(id: String, ts: Long, vc: Double)
}
- Rich functions have a notion of a lifecycle. The typical lifecycle methods are listed below (a small sketch follows the list):
- open() is the initialization method of a rich function; it is called before the operator (e.g. map or filter) starts processing
- close() is the last method called in the lifecycle and is used for cleanup work
- getRuntimeContext() provides information from the function's RuntimeContext, such as the parallelism the function is executed with, the name of the task, and access to state
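- A minimal sketch of these lifecycle hooks in use (the record counter and the printed messages are illustrative only, not part of the original example):
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
// Illustrative only: counts how many records each parallel subtask has processed.
class CountingMapFunction extends RichMapFunction[String, String] {
  private var seen: Long = 0L
  override def open(parameters: Configuration): Unit = {
    // called once per parallel subtask before any map() call;
    // a typical place to open connections or initialize resources
    println(s"open: task=${getRuntimeContext.getTaskName}, subtask=${getRuntimeContext.getIndexOfThisSubtask}")
  }
  override def map(value: String): String = {
    seen += 1
    s"${getRuntimeContext.getTaskNameWithSubtasks} processed $seen records: $value"
  }
  override def close(): Unit = {
    // called once when the task finishes or is cancelled; used for cleanup
    println(s"close: processed $seen records in total")
  }
}
- It is used just like the function classes above, e.g. sensorDS.map(new CountingMapFunction).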
2. KeyBy
- Spark has a groupBy operator that groups data according to a specified rule; Flink has a similar capability, keyBy, which splits the stream according to a specified key.
- Splitting the stream: elements are sent to different partitions according to the specified key, and elements with the same key end up in the same partition (a partition here is one of the parallel instances of the downstream operator). keyBy() partitions by hashing the key.
- Parameter: a Scala anonymous function, a POJO field, or a tuple index; arrays cannot be used as keys (see the sketch below)
- Returns: KeyedStream
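- Before the full example, a quick sketch of the three ways of specifying the key (the object and variable names are illustrative only):
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala._
object KeyByForms {
  case class WaterSensor(id: String, ts: Long, vc: Double)
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val sensorDS: DataStream[WaterSensor] =
      env.fromElements(WaterSensor("sensor_1", 1L, 40.0), WaterSensor("sensor_2", 2L, 41.0))
    val tupleDS: DataStream[(String, Int)] = env.fromElements(("a", 1), ("b", 2), ("a", 3))
    // 1. Scala anonymous function (type-safe key)
    val byFun: KeyedStream[WaterSensor, String] = sensorDS.keyBy(_.id)
    // 2. field name of a case class / POJO
    val byField: KeyedStream[WaterSensor, Tuple] = sensorDS.keyBy("id")
    // 3. tuple index
    val byIndex: KeyedStream[(String, Int), Tuple] = tupleDS.keyBy(0)
    byFun.print("byFun")
    env.execute("keyBy forms")
  }
}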
package cn.kgc.kb09.transform
import cn.kgc.kb09.source.WaterSensor
import org.apache.flink.api.java.functions.KeySelector
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
object Transform_KeyBy {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val sensorDS: DataStream[WaterSensor] = env.fromCollection(
List(
WaterSensor("sensor_1", 1609142261216l