Flink distinguishes two kinds of state: keyed state and operator state.
1 Keyed State
Keyed state is used on keyed streams: for every key, Flink maintains one state instance. Keyed state is only available on a KeyedStream. Flink supports state variables of the following data types:
• ValueState[T] holds a single value of type T.
– get: ValueState.value()
– set: ValueState.update(value: T)
• ListState[T] holds a list whose elements have type T. Basic operations:
– ListState.add(value: T)
– ListState.addAll(values: java.util.List[T])
– ListState.get() returns an Iterable[T]
– ListState.update(values: java.util.List[T])
• MapState[K, V] holds key-value pairs.
– MapState.get(key: K)
– MapState.put(key: K, value: V)
– MapState.contains(key: K)
– MapState.remove(key: K)
• ReducingState[T] (similar to ListState, except that add() reduces new values into a single aggregate)
• AggregatingState[I, O] (similar to ReducingState, but the input and output types may differ)
Every state type supports State.clear() to empty the state.
Keyed state belonging to different keys is isolated: each key only sees its own state instances.
When a function registers a StateDescriptor, Flink checks whether the state backend already holds state under that descriptor. This typically happens when the application is restarted from a checkpoint or a savepoint; in both cases Flink connects the registered state to the existing state. If no matching state exists, an empty state is initialized.
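ReducingState and AggregatingState are only mentioned in passing above, so here is a minimal sketch of how a ReducingState is registered through its descriptor and used. The MaxTemperature class and the running per-sensor maximum are illustrative assumptions, reusing the SensorReading case class from the examples below:

import org.apache.flink.api.common.functions.{ReduceFunction, RichFlatMapFunction}
import org.apache.flink.api.common.state.{ReducingState, ReducingStateDescriptor}
import org.apache.flink.api.scala.typeutils.Types
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

// Hypothetical: tracks the running maximum temperature per sensor key.
class MaxTemperature extends RichFlatMapFunction[SensorReading, (String, Double)] {
  private var maxTemp: ReducingState[Double] = _

  override def open(parameters: Configuration): Unit = {
    // the descriptor carries the ReduceFunction that folds new values into the aggregate
    maxTemp = getRuntimeContext.getReducingState(
      new ReducingStateDescriptor[Double]("max-temp",
        new ReduceFunction[Double] {
          override def reduce(a: Double, b: Double): Double = a.max(b)
        },
        Types.of[Double]))
  }

  override def flatMap(in: SensorReading, out: Collector[(String, Double)]): Unit = {
    maxTemp.add(in.temperature)         // add() reduces the new value into the stored aggregate
    out.collect((in.id, maxTemp.get())) // get() returns the current aggregate
  }
}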
1.1 ValueState Example
package org.example.state
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.api.scala.typeutils.Types
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector
import org.example.source.self.{SensorReading, SensorSource}
/**
* Emit the current reading when the difference between the last
* temperature and this one exceeds 1.7 degrees
*/
object ValueStateInFlatMap {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
env.addSource(new SensorSource)
.keyBy(_.id)
.flatMap(new TemperatureAlert(1.7))
.print()
env.execute()
}
class TemperatureAlert(val diff:Double) extends RichFlatMapFunction[SensorReading,(String,Double,Double)] {
var lastTemp: ValueState[Double] = _
override def open(parameters: Configuration): Unit = {
/**
* Initialize a state variable that stores the last temperature
*/
lastTemp = getRuntimeContext.getState(
new ValueStateDescriptor[Double]("last-temp", Types.of[Double])
)
}
override def flatMap(in: SensorReading, collector: Collector[(String, Double, Double)]): Unit = {
val last = lastTemp.value() // fetch the last temperature
// absolute difference between this reading and the last one
val tempDiff = (in.temperature - last).abs
// emit if the difference exceeds the configured threshold
if (tempDiff > diff){
collector.collect((in.id,in.temperature,tempDiff))
}
// update the state variable
lastTemp.update(in.temperature)
}
}
}
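A subtlety of this example: before the first reading arrives, the ValueState[Double] is empty and value() returns null, which Scala unboxes to 0.0. The very first reading is therefore compared against 0.0 and can trigger a spurious alert; the flatMapWithState variant below avoids this by pattern-matching on None for the first element.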
1.2 Implementing the Same Requirement with flatMapWithState
package org.example.state
import org.apache.flink.streaming.api.scala._
import org.example.source.self.{SensorReading, SensorSource}
object FlatMapWithStateExample {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
env.addSource(new SensorSource)
.keyBy(_.id)
// the first type parameter is the type of the output elements, the second is the type of the state variable
.flatMapWithState[(String,Double,Double),Double]{
case (in:SensorReading,None) =>{ // first element for this key
(List.empty,Some(in.temperature)) // only update the state variable, emit nothing downstream
}
case (in:SensorReading,lastTemp:Some[Double]) =>{ // the state already holds a value
val tempDiff = (in.temperature - lastTemp.get).abs
if(tempDiff > 1.7){
(List((in.id,in.temperature,tempDiff)),Some(in.temperature)) // emit the result downstream
}else{
(List.empty,Some(in.temperature)) // only update the state
}
}
}
.print()
env.execute()
}
}
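flatMapWithState is a shortcut that the Scala KeyedStream API provides over explicit state handling: the supplied function receives each element together with an Option holding the per-key state (None for the first element of a key) and returns the elements to emit plus the new state value; returning None as the second component clears the state.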
1.3 ListState
package org.example.state
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor, ValueState, ValueStateDescriptor}
import org.apache.flink.api.scala.typeutils.Types
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector
import org.example.source.self.{SensorReading, SensorSource}
import scala.collection.mutable.ListBuffer
/**
* Count the number of readings every 10 seconds and emit the count
*/
object ListStateExample {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
env.addSource(new SensorSource)
.filter(_.id.equals("sensor_1"))
.keyBy(_.id)
.process(new MyKeyedProcess)
.print()
env.execute()
}
class MyKeyedProcess extends KeyedProcessFunction[String, SensorReading, String] {
var listState: ListState[SensorReading] = _
var timeTs: ValueState[Long] = _
override def open(parameters: Configuration): Unit = {
// declare a list state
listState = getRuntimeContext.getListState(
new ListStateDescriptor[SensorReading]("list-state",
Types.of[SensorReading]
)
)
// declare a value state that stores the timer timestamp
timeTs = getRuntimeContext.getState(
new ValueStateDescriptor[Long](
"timer",
Types.of[Long]
)
)
}
override def processElement(value: SensorReading, ctx: KeyedProcessFunction[String, SensorReading, String]#Context, out: Collector[String]): Unit = {
listState.add(value) // append the element to the list state
if(timeTs.value() == 0L){ // first element for this key: no timer registered yet
val ts = ctx.timerService().currentProcessingTime() + 10 * 1000L
ctx.timerService().registerProcessingTimeTimer(ts) // register a processing-time timer
timeTs.update(ts)
}
}
override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[String, SensorReading, String]#OnTimerContext, out: Collector[String]): Unit = {
val list: ListBuffer[SensorReading] = ListBuffer() // initialize an empty buffer
import scala.collection.JavaConversions._ // required to iterate over the Java Iterable
// copy the contents of the list state into the buffer
for (r <- listState.get()) {
list += r
}
listState.clear() // empty the list state
out.collect("the list state holds " + list.size + " elements")
timeTs.clear()
}
}
}
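Two notes on this example: value() on an empty ValueState[Long] returns null, which unboxes to 0L, which is why the == 0L check identifies the first element of each 10-second cycle; and scala.collection.JavaConversions is deprecated in newer Scala versions, where importing scala.collection.JavaConverters._ and iterating over listState.get().asScala is the preferred replacement.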
1.4 Using Connected Broadcast State
A common requirement is that a streaming application must distribute the same events to all parallel instances of an operator, and that this distribution is recoverable.
As an example, consider two streams: one carries rules (say, "two consecutive temperatures above a threshold within 5 seconds"), the other carries the events to be matched, i.e. a rule stream and an event stream. Every parallel instance of the operator has to keep the rules in its operator state, so the rule stream must be broadcast to all parallel instances.
In Flink, this kind of state is called broadcast state. A broadcast stream can be connected to either a DataStream or a KeyedStream.
The following example implements a temperature-alert application whose thresholds can be changed dynamically through a broadcast stream:
val sensorData: DataStream[SensorReading] = ...
val thresholds: DataStream[ThresholdUpdate] = ...
val keyedSensorData: KeyedStream[SensorReading, String] = sensorData
.keyBy(_.id)
// the descriptor of the broadcast state
val broadcastStateDescriptor =
new MapStateDescriptor[String, Double](
"thresholds", classOf[String], classOf[Double])
val broadcastThresholds: BroadcastStream[ThresholdUpdate] = thresholds
.broadcast(broadcastStateDescriptor)
// connect keyed sensor stream and broadcasted rules stream
val alerts: DataStream[(String, Double, Double)] = keyedSensorData
.connect(broadcastThresholds)
.process(new UpdatableTemperatureAlertFunction())
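ThresholdUpdate is not defined in the snippet; judging from its use in the function below, it can be assumed to be a simple case class along these lines:

case class ThresholdUpdate(id: String, threshold: Double)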
Applying a function with broadcast state to two streams involves three steps:
• Define one or more MapStateDescriptor objects and call DataStream.broadcast() to create a BroadcastStream.
• Connect the BroadcastStream with a DataStream or KeyedStream.
• Apply a KeyedBroadcastProcessFunction or BroadcastProcessFunction to the connected streams.
The following example implements the dynamically adjustable temperature threshold:
class UpdatableTemperatureAlertFunction()
extends KeyedBroadcastProcessFunction[String,
SensorReading, ThresholdUpdate, (String, Double, Double)] {
// the descriptor of the broadcast state
private lazy val thresholdStateDescriptor =
new MapStateDescriptor[String, Double](
"thresholds", classOf[String], classOf[Double])
// the keyed state handle
private var lastTempState: ValueState[Double] = _
override def open(parameters: Configuration): Unit = {
// create keyed state descriptor
val lastTempDescriptor = new ValueStateDescriptor[Double](
"lastTemp", classOf[Double])
// obtain the keyed state handle
lastTempState = getRuntimeContext
.getState[Double](lastTempDescriptor)
}
override def processBroadcastElement(
update: ThresholdUpdate,
ctx: KeyedBroadcastProcessFunction[String,
SensorReading, ThresholdUpdate,
(String, Double, Double)]#Context,
out: Collector[(String, Double, Double)]): Unit = {
// get broadcasted state handle
val thresholds = ctx
.getBroadcastState(thresholdStateDescriptor)
if (update.threshold != 0.0d) {
// configure a new threshold for the sensor
thresholds.put(update.id, update.threshold)
} else {
// remove threshold for the sensor
thresholds.remove(update.id)
}
}
override def processElement(
reading: SensorReading,
readOnlyCtx: KeyedBroadcastProcessFunction
[String, SensorReading, ThresholdUpdate,
(String, Double, Double)]#ReadOnlyContext,
out: Collector[(String, Double, Double)]): Unit = {
// get read-only broadcast state
val thresholds = readOnlyCtx
.getBroadcastState(thresholdStateDescriptor)
// check if we have a threshold
if (thresholds.contains(reading.id)) {
// get threshold for sensor
val sensorThreshold: Double = thresholds.get(reading.id)
// fetch the last temperature from state
val lastTemp = lastTempState.value()
// check if we need to emit an alert
val tempDiff = (reading.temperature - lastTemp).abs
if (tempDiff > sensorThreshold) {
// temperature changed by more than the threshold
out.collect((reading.id, reading.temperature, tempDiff))
}
}
// update lastTemp state
this.lastTempState.update(reading.temperature)
}
}
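Note the asymmetry in the function above: processBroadcastElement receives a Context with read-write access to the broadcast state, whereas processElement only gets a ReadOnlyContext. Because every parallel instance receives all broadcast elements and applies the same updates, the broadcast state stays identical across instances and can be checkpointed and restored consistently.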
2 Operator State (not covered in detail)
- List state
Represents the state as a list of entries.
- Union list state
Also represents the state as a list of entries. It differs from the regular list state in how it is restored after a failure or when the application is started from a savepoint: regular list state is split across the parallel instances, while union list state hands the complete list to every instance.
- Broadcast state
Intended for the special case in which every parallel task of an operator holds exactly the same state.
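Even though operator state is not covered in detail here, a minimal sketch of the usual pattern may still be useful: a function implements CheckpointedFunction and obtains its list state from the OperatorStateStore. The CountingFunction class and the per-instance counter are illustrative assumptions:

import org.apache.flink.api.common.functions.FlatMapFunction
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.api.scala.typeutils.Types
import org.apache.flink.runtime.state.{FunctionInitializationContext, FunctionSnapshotContext}
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction
import org.apache.flink.util.Collector

// Hypothetical: counts the elements seen by each parallel instance.
class CountingFunction extends FlatMapFunction[SensorReading, Long] with CheckpointedFunction {
  @transient private var countState: ListState[Long] = _
  private var count: Long = 0L

  override def flatMap(in: SensorReading, out: Collector[Long]): Unit = {
    count += 1
    out.collect(count)
  }

  // called on every checkpoint: copy the local counter into the managed state
  override def snapshotState(ctx: FunctionSnapshotContext): Unit = {
    countState.clear()
    countState.add(count)
  }

  // called on (re)start: getListState redistributes the entries across instances
  // on rescaling; getUnionListState would hand every instance the full list instead
  override def initializeState(ctx: FunctionInitializationContext): Unit = {
    countState = ctx.getOperatorStateStore.getListState(
      new ListStateDescriptor[Long]("count", Types.of[Long]))
    val it = countState.get().iterator()
    while (it.hasNext) { count += it.next() }
  }
}

Unlike keyed state, this state belongs to the parallel operator instance as a whole, which is why the counter has to be copied into the managed list on every snapshot.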
3 Checkpoints
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.environment.CheckpointConfig

env.enableCheckpointing(10000) // enable checkpointing, one checkpoint every 10 seconds
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE) // set the checkpointing mode to at-least-once
/**
Two modes: exactly-once vs. at-least-once.
The default is CheckpointingMode.EXACTLY_ONCE.
*/
// a checkpoint that takes longer than this is considered failed
env.getCheckpointConfig.setCheckpointTimeout(60000)
// minimum pause between the end of one checkpoint and the start of the next
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(500)
// maximum number of checkpoints that may be in progress at the same time
env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
/**
enableExternalizedCheckpoints makes checkpoints persistent outside the job
lifecycle; they are not cleaned up automatically when the job fails, so the
state has to be removed manually. ExternalizedCheckpointCleanup controls what
happens to the externalized checkpoint when the job is canceled:
DELETE_ON_CANCELLATION deletes the externalized state on cancellation (but
keeps it if the job reached the FAILED state), while RETAIN_ON_CANCELLATION
keeps the externalized checkpoint state even when the job is canceled.
*/
env.getCheckpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
// where checkpoints are stored
val backend = new FsStateBackend("hdfs://xiaoai07:9000/flink/flink1/checkouts")
env.setStateBackend(backend)
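FsStateBackend, which keeps working state on the TaskManager heap and writes checkpoints to a file system such as HDFS, is only one of the backends Flink ships: MemoryStateBackend stores checkpoints on the JobManager heap (useful for local testing), and RocksDBStateBackend keeps state in an embedded RocksDB instance on disk, trading access speed for much larger state sizes.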
4 Savepoints
4.1 Trigger a savepoint
bin/flink savepoint jobId path
YARN mode:
bin/flink savepoint jobId path -yid yarnAppId
4.2 Cancel a job and trigger a savepoint at the same time
bin/flink cancel -s path jobId
4.3 Start from a savepoint
bin/flink run -s Path [:runArgs]
4.4 Delete a savepoint
bin/flink savepoint -d savepointPath
5 Common Flink Commands
Submit from the local machine with flink run against a remote cluster:
./flink run -m remote_ip:8090 -p 1 -c com.test.TestLocal /home/hdp/flink-local.jar
# -m address of the Flink cluster (JobManager)
# -p parallelism of the job
# -c main class of the job
When submitting on a node of the Flink cluster itself, the master address does not need to be specified:
# submit in the foreground
./flink run -p 1 -c com.test.TestLocal /home/hdp/flink-local.jar
# submit in the background with -d
./flink run -p 1 -c com.test.TestLocal -d /home/hdp/flink-local.jar
# -d,--detached : run detached (in the background)
# -s,--fromSavepoint : restore the job from the given savepoint path
flink list
flink list: list the Flink jobs.
flink list -r/--running : list the running jobs
flink list -s/--scheduled : list the scheduled jobs
flink cancel
flink cancel [options] <job_id> : cancel the job with the given id
flink cancel -s/--withSavepoint : take a savepoint before cancelling
flink stop: applies to streaming jobs only
flink stop [options] <job_id>
flink stop <job_id>: stop the given job
flink modify (removed in Flink 1.10.0)
flink modify <job_id> [options]
flink modify <job_id> -p/--parallelism p : change the parallelism of the job
Start a socket service on the master node
Command: nc -lk 8888 (if you get "nc: command not found", install it with yum install nc)