Flink DataStream: Tasks, Operator Chains and Resource Groups; ProcessFunction; Iterations

1. Tasks, Operator Chains and Resource Groups

What a Task Slot Is

  • A task slot is a resource unit inside a TaskManager; each slot runs one parallel pipeline of (chained) tasks, and each task in it runs as a thread
  • With the default slot sharing, the number of slots a job needs equals its maximum operator parallelism
  • The default parallelism (and hence the slot demand) can be set with the -p flag when submitting the job; a sketch follows below
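
A minimal sketch (hedged: the parallelism values and operator are illustrative, not from the original) of how parallelism settings drive the slot demand:

package devBase

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}

object SlotDemandSketch {

  def main(args: Array[String]): Unit = {

    val senv = StreamExecutionEnvironment.getExecutionEnvironment
    senv.setParallelism(4) // job-wide default, equivalent to submitting with -p 4

    val input = senv.fromElements(1, 2, 3)

    // per-operator override: with default slot sharing the job now needs
    // max(parallelism over all operators) = 8 task slots
    input.map(_ * 2).setParallelism(8).print()

    senv.execute()
  }

}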

Customizing Operator Chaining

package devBase

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}


object TranformationOperatorTest {

  def main(args: Array[String]): Unit = {

    val senv = StreamExecutionEnvironment.getExecutionEnvironment
    // senv.disableOperatorChaining()   // disable all operator chaining for this job
    val input = senv.fromElements(("key1", "value1"), ("key2", "value2"))

    // source and filter are chained into one task; startNewChain puts the first map, the second map and print into a new chain
    val output1 = input.filter(x => true)
      .map(x => x).startNewChain().map(x => x)
    output1.print("output1")
    /*
    output1:5> (key1,value1)
    output1:6> (key2,value2)
     */

    // source and filter are chained into one task; disableChaining cuts map off from both neighbors, so map and print run as separate tasks
    val output2 = input.filter(x => true)
      .map(x => x).disableChaining()
    output2.print("output2")
    /*
    output2:5> (key2,value2)
    output2:4> (key1,value1)
     */

    // source and filter belong to the "default" slot sharing group; with parallelism 1 they need one task slot
    // map and print belong to the "no_default" group; with parallelism 1 they need another task slot
    // so the application needs two task slots in total
    // (todo) Running this in IDEA hangs and finally fails with: Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not acquire the minimum required resources (a possible workaround follows this listing)
    val output3 = input.filter(x => true)
      .map(x => x).slotSharingGroup("no_default")
    output3.print("output3")


    senv.execute()


  }

}
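
Regarding the (todo) above: a plausible cause is that the embedded mini-cluster sizes its slot pool from the default parallelism only, without accounting for the extra slot sharing group. A hedged workaround for IDE runs is to configure the local environment with more slots explicitly (the slot count 16 and parallelism 4 are arbitrary assumptions):

import org.apache.flink.configuration.{Configuration, TaskManagerOptions}
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

// give the embedded mini-cluster enough slots to cover both sharing groups
val conf = new Configuration()
conf.setInteger(TaskManagerOptions.NUM_TASK_SLOTS, 16)
val senv = StreamExecutionEnvironment.createLocalEnvironment(4, conf)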

2. ProcessFunction

  1. Basic usage
package devBase

import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}
import org.apache.flink.util.Collector


object TranformationOperatorTest {

  def main(args: Array[String]): Unit = {

    val senv = StreamExecutionEnvironment.getExecutionEnvironment
    val input = senv.fromElements(
      ("key1", "value1"),
      ("key2", "value2"),
      ("key3", "value3")
    )


    val output = input.process(new ProcessFunction[(String,String), (String,String)] {

      override def processElement(i: (String, String), context: ProcessFunction[(String, String), (String, String)]#Context, collector: Collector[(String, String)]): Unit = {

        collector.collect(i)


      }

    })

    output.print()


    senv.execute()


  }

}

Execution result

7> (key2,value2)
8> (key3,value3)
6> (key1,value1)
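
Besides emitting to the main output through the Collector, the Context also supports side outputs via OutputTag (declared in the class listing below). A minimal hedged sketch, with an illustrative tag name:

package devBase

import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala.{OutputTag, StreamExecutionEnvironment, createTypeInformation}
import org.apache.flink.util.Collector

object SideOutputSketch {

  def main(args: Array[String]): Unit = {

    val senv = StreamExecutionEnvironment.getExecutionEnvironment
    val input = senv.fromElements(("key1", "value1"), ("key2", "value2"))

    // records with key1 go to the side output, everything else downstream
    val sideTag = OutputTag[(String, String)]("key1-side")

    val mainStream = input.process(new ProcessFunction[(String, String), (String, String)] {
      override def processElement(i: (String, String), context: ProcessFunction[(String, String), (String, String)]#Context, collector: Collector[(String, String)]): Unit = {
        if (i._1 == "key1") context.output(sideTag, i)
        else collector.collect(i)
      }
    })

    mainStream.print("main")
    mainStream.getSideOutput(sideTag).print("side")

    senv.execute()
  }

}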
  2. The ProcessFunction class
package org.apache.flink.streaming.api.functions;

import org.apache.flink.annotation.PublicEvolving;
import org.apache.flink.api.common.functions.AbstractRichFunction;
import org.apache.flink.streaming.api.TimeDomain;
import org.apache.flink.streaming.api.TimerService;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

@PublicEvolving
public abstract class ProcessFunction<I, O> extends AbstractRichFunction {

    private static final long serialVersionUID = 1L;

    public abstract void processElement(I value, Context ctx, Collector<O> out) throws Exception;

    public void onTimer(long timestamp, OnTimerContext ctx, Collector<O> out) throws Exception {}

    public abstract class Context {

        public abstract Long timestamp();

        public abstract TimerService timerService();

        public abstract <X> void output(OutputTag<X> outputTag, X value);
    }

    public abstract class OnTimerContext extends Context {
        public abstract TimeDomain timeDomain();
    }
}
  • processElement, onTimer and the inner Context/OnTimerContext classes are all introduced by ProcessFunction itself
  • getRuntimeContext (inherited from AbstractRichFunction) returns the RuntimeContext, through which all kinds of state can be obtained (a sketch appears after the RuntimeContext listing below)
  • Through the Context passed to processElement you can access the TimerService and register a timestamp under either processing time or event time; once the watermark (or the processing-time clock) passes that timestamp, onTimer is invoked. Registering the same timestamp several times for the same key fires onTimer only once, and onTimer never runs concurrently with processElement — see the sketch below
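
A minimal timer sketch (hedged: timers require a keyed stream, so this uses KeyedProcessFunction, and the 5-second delay is an arbitrary choice):

package devBase

import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}
import org.apache.flink.util.Collector

object TimerSketch {

  def main(args: Array[String]): Unit = {

    val senv = StreamExecutionEnvironment.getExecutionEnvironment
    val input = senv.fromElements(("key1", "value1"), ("key2", "value2"))

    input.keyBy(_._1)
      .process(new KeyedProcessFunction[String, (String, String), String] {

        override def processElement(value: (String, String), ctx: KeyedProcessFunction[String, (String, String), String]#Context, out: Collector[String]): Unit = {
          // register a processing-time timer 5 seconds from now; duplicate
          // registrations for the same key and timestamp fire only once
          val fireAt = ctx.timerService().currentProcessingTime() + 5000L
          ctx.timerService().registerProcessingTimeTimer(fireAt)
          out.collect(value._2)
        }

        override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[String, (String, String), String]#OnTimerContext, out: Collector[String]): Unit = {
          // invoked synchronously with processElement, never concurrently
          out.collect(s"timer fired for key ${ctx.getCurrentKey} at $timestamp")
        }
      })
      .print()

    senv.execute()
  }

}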
  3. The TimerService interface
package org.apache.flink.streaming.api;

import org.apache.flink.annotation.PublicEvolving;

@PublicEvolving
public interface TimerService {

    String UNSUPPORTED_REGISTER_TIMER_MSG = "Setting timers is only supported on a keyed streams.";

    String UNSUPPORTED_DELETE_TIMER_MSG = "Deleting timers is only supported on a keyed streams.";

    long currentProcessingTime();

    long currentWatermark();

    void registerProcessingTimeTimer(long time);

    void registerEventTimeTimer(long time);

    void deleteProcessingTimeTimer(long time);

    void deleteEventTimeTimer(long time);
}
  • Checkpoints also persist timers registered through the TimerService
  • Timestamps can be coalesced to second granularity before registration, which reduces the number of timers (see the snippet below)
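
A hedged sketch of that coalescing trick; coalesceToSecond is a hypothetical helper name:

// round a timestamp down to whole seconds so that all registrations within
// the same second collapse into a single timer per key
def coalesceToSecond(timestampMs: Long): Long = (timestampMs / 1000) * 1000

// usage inside processElement (event-time variant):
// ctx.timerService().registerEventTimeTimer(coalesceToSecond(ctx.timestamp()))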
  4. The TimeDomain enum
package org.apache.flink.streaming.api;

public enum TimeDomain {

    EVENT_TIME,

    PROCESSING_TIME
}
  5. The AbstractRichFunction class
package org.apache.flink.api.common.functions;

import org.apache.flink.annotation.Public;
import org.apache.flink.configuration.Configuration;

import java.io.Serializable;

@Public
public abstract class AbstractRichFunction implements RichFunction, Serializable {

    private static final long serialVersionUID = 1L;

    private transient RuntimeContext runtimeContext;

    @Override
    public void setRuntimeContext(RuntimeContext t) {
        this.runtimeContext = t;
    }

    @Override
    public RuntimeContext getRuntimeContext() {
        if (this.runtimeContext != null) {
            return this.runtimeContext;
        } else {
            throw new IllegalStateException("The runtime context has not been initialized.");
        }
    }

    @Override
    public IterationRuntimeContext getIterationRuntimeContext() {
        if (this.runtimeContext == null) {
            throw new IllegalStateException("The runtime context has not been initialized.");
        } else if (this.runtimeContext instanceof IterationRuntimeContext) {
            return (IterationRuntimeContext) this.runtimeContext;
        } else {
            throw new IllegalStateException("This stub is not part of an iteration step function.");
        }
    }

    @Override
    public void open(Configuration parameters) throws Exception {}

    @Override
    public void close() throws Exception {}
}
  • AbstractRichFunction implements all of the methods declared by RichFunction; a usage sketch of the open/close lifecycle follows
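
A small usage sketch of the lifecycle methods (hedged: the accumulator name "processed" is an arbitrary choice):

package devBase

import org.apache.flink.api.common.accumulators.IntCounter
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration

// counts processed records via an accumulator obtained from the RuntimeContext
class CountingMap extends RichMapFunction[String, String] {

  @transient private var processed: IntCounter = _

  override def open(parameters: Configuration): Unit = {
    // the runtime context is guaranteed to be initialized by the time open() runs
    processed = getRuntimeContext.getIntCounter("processed")
  }

  override def map(value: String): String = {
    processed.add(1)
    value
  }

  override def close(): Unit = {
    // release any resources acquired in open()
  }
}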
  6. The RuntimeContext interface
package org.apache.flink.api.common.functions;

import org.apache.flink.annotation.Public;
import org.apache.flink.annotation.PublicEvolving;
import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.JobID;
import org.apache.flink.api.common.accumulators.Accumulator;
import org.apache.flink.api.common.accumulators.DoubleCounter;
import org.apache.flink.api.common.accumulators.Histogram;
import org.apache.flink.api.common.accumulators.IntCounter;
import org.apache.flink.api.common.accumulators.LongCounter;
import org.apache.flink.api.common.cache.DistributedCache;
import org.apache.flink.api.common.externalresource.ExternalResourceInfo;
import org.apache.flink.api.common.state.AggregatingState;
import org.apache.flink.api.common.state.AggregatingStateDescriptor;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.metrics.MetricGroup;

import java.io.Serializable;
import java.util.List;
import java.util.Set;

@Public
public interface RuntimeContext {

    JobID getJobId();

    String getTaskName();

    @PublicEvolving
    MetricGroup getMetricGroup();

    int getNumberOfParallelSubtasks();

    @PublicEvolving
    int getMaxNumberOfParallelSubtasks();

    int getIndexOfThisSubtask();

    int getAttemptNumber();

    String getTaskNameWithSubtasks();

    ExecutionConfig getExecutionConfig();

    ClassLoader getUserCodeClassLoader();

    @PublicEvolving
    void registerUserCodeClassLoaderReleaseHookIfAbsent(
            String releaseHookName, Runnable releaseHook);

    <V, A extends Serializable> void addAccumulator(String name, Accumulator<V, A> accumulator);

    <V, A extends Serializable> Accumulator<V, A> getAccumulator(String name);

    @PublicEvolving
    IntCounter getIntCounter(String name);

    @PublicEvolving
    LongCounter getLongCounter(String name);

    @PublicEvolving
    DoubleCounter getDoubleCounter(String name);

    @PublicEvolving
    Histogram getHistogram(String name);

    @PublicEvolving
    Set<ExternalResourceInfo> getExternalResourceInfos(String resourceName);

    @PublicEvolving
    boolean hasBroadcastVariable(String name);

    <RT> List<RT> getBroadcastVariable(String name);

    <T, C> C getBroadcastVariableWithInitializer(
            String name, BroadcastVariableInitializer<T, C> initializer);

    DistributedCache getDistributedCache();

    @PublicEvolving
    <T> ValueState<T> getState(ValueStateDescriptor<T> stateProperties);

    @PublicEvolving
    <T> ListState<T> getListState(ListStateDescriptor<T> stateProperties);

    @PublicEvolving
    <T> ReducingState<T> getReducingState(ReducingStateDescriptor<T> stateProperties);

    @PublicEvolving
    <IN, ACC, OUT> AggregatingState<IN, OUT> getAggregatingState(
            AggregatingStateDescriptor<IN, ACC, OUT> stateProperties);

    @PublicEvolving
    <UK, UV> MapState<UK, UV> getMapState(MapStateDescriptor<UK, UV> stateProperties);
}
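
A hedged sketch of fetching keyed state through the RuntimeContext (the state name "count" is arbitrary; java.lang.Long is used so uninitialized state can be detected as null):

package devBase

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

// emits a running count of records seen per key
class CountPerKey extends KeyedProcessFunction[String, (String, String), (String, Long)] {

  @transient private var count: ValueState[java.lang.Long] = _

  override def open(parameters: Configuration): Unit = {
    count = getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Long]("count", classOf[java.lang.Long]))
  }

  override def processElement(value: (String, String), ctx: KeyedProcessFunction[String, (String, String), (String, Long)]#Context, out: Collector[(String, Long)]): Unit = {
    val next = (if (count.value() == null) 0L else count.value().longValue()) + 1L
    count.update(next)
    out.collect((value._1, next))
  }
}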

3. Iterations

3.1 iterate

  • Bulk iteration: stops once the maximum number of iterations is reached. The example below estimates π by Monte Carlo sampling, using π ≈ 4 × hit_sum / test_sum
package devBase

import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment, createTypeInformation}


object TranformationOperatorTest {

  def main(args: Array[String]): Unit = {

    val test_sum = 1000000
    val hit_sum = 0
    val env = ExecutionEnvironment.getExecutionEnvironment
    val input = env.fromElements(hit_sum)

    val hit_sum_ds_final:DataSet[Int] = input.iterate(test_sum)(
      // input is handed to the step function; the DataSet returned by map is fed back into map again, for test_sum iterations in total
      hit_sum_ds => hit_sum_ds.map(hit_sum => {
        val x:Double = math.random
        val y:Double =math.random
        hit_sum + (if((x * x + y * y) < 1) 1 else 0)
      })
    )

    val pi_output = hit_sum_ds_final.map(hit_sum => hit_sum.toDouble / test_sum * 4)
    pi_output.print()


  }

}

Execution result

3.14442

3.2 iterateWithTermination

  • Bulk iteration with a termination criterion: stops when term_ds is empty or the maximum number of iterations is reached. Below, term_ds becomes empty once hit_sum reaches 700000, so the loop stops around that point and the output is roughly 4 × 0.7 = 2.8
package devBase

import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment, createTypeInformation}


object TranformationOperatorTest {

  def main(args: Array[String]): Unit = {

    val test_sum = 1000000
    val hit_sum = 0
    val env = ExecutionEnvironment.getExecutionEnvironment
    val input = env.fromElements(hit_sum)

    val hit_sum_ds_final:DataSet[Int] = input.iterateWithTermination(test_sum)(
      // input is handed to the step function; the DataSet returned by map is fed back for the next iteration
      hit_sum_ds => {
        val next_ds = hit_sum_ds.map(hit_sum => {
          val x:Double = math.random
          val y:Double =math.random
          hit_sum + (if((x * x + y * y) < 1) 1 else 0)
        })

        // iteration terminates when term_ds is empty or the maximum iteration count test_sum is reached
        val term_ds = next_ds.filter(hit_sum => hit_sum < 700000)

        (next_ds, term_ds)
      }
    )

    val pi_output = hit_sum_ds_final.map(hit_sum => hit_sum.toDouble / test_sum * 4)
    pi_output.print()


  }

}

Execution result

2.8

3.3 iterateDelta

Diagram (figure omitted): the iterateDelta delta-iteration data flow

  • Delta iteration: stops when nextWorkSet is empty or the maximum number of iterations is reached, and then emits the (updated) initialSolutionSet
  • The third argument of iterateDelta specifies the key field(s) on which solutionUpdate overwrites entries of initialSolutionSet
  • The fourth argument defaults to false, meaning initialSolutionSet is kept in Flink-managed memory; with true it is kept on the Java object heap, in which case this program prints (LiMing,20), (ZhaoSi,32), (ZhangSan,22)
  • The step function receives (initialSolutionSet, initialWorkSet) on the first round and returns (solutionUpdate, nextWorkSet); solutionUpdate first overwrites the matching solution-set entries (similar to an insert overwrite), the updated solution set is then carried into the next round, and nextWorkSet is passed on directly (the workset typically shrinks from round to round)
  • If initialSolutionSet contains several records with the same key, only the last one is kept
package devBase

import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment, createTypeInformation}
import org.apache.flink.util.Collector

object TranformationOperatorTest3 {

  def main(args: Array[String]): Unit = {
    
    val env = ExecutionEnvironment.getExecutionEnvironment

    val initialSolutionSet:DataSet[(String, Int)] = env.fromElements(
      ("LiMing", 10),
      ("LiMing", 20),
      ("ZhangSan", 20),
      ("ZhaoSi", 30)
    )

    val initialWorkSet:DataSet[(String, Int)] = env.fromElements(
      ("ZhangSan", 0),
      ("ZhaoSi", 1),
      ("ZhaoSi", 2),
      ("WangWu", 0)
    )

    val output = initialSolutionSet.iterateDelta(initialWorkSet, 100, Array(0), false)(
      (solutionSet, workSet) => {
        val candidateUpdate:DataSet[(String,Int)] = workSet.map(
          tuple => (tuple._1, tuple._2 + 2)
        )

        // when joining with the solution set inside iterateDelta, only an inner join is offered, but the join function is also called with null on the solution side when there is no match, giving it leftOuterJoin-like behavior
        val solutionUpdate:DataSet[(String,Int)] =
          candidateUpdate.join(solutionSet)
          .where(0).equalTo(0)
          .apply((left:(String,Int), right:(String,Int), collector:Collector[(String,Int)]) =>{
            if (right != null && left._2 > right._2) {
              collector.collect(left)
            }
          })

        val nextWorkSet:DataSet[(String,Int)] =
          candidateUpdate.leftOuterJoin(solutionUpdate)
            .where(0).equalTo(0)
            .apply((left:(String,Int), right:(String,Int), collector:Collector[(String,Int)]) =>{
              if(right == null) {
                collector.collect(left)
              }
            })

        (solutionUpdate, nextWorkSet)
      }
    )

    output.print()
    
  }

}

Execution result:

(LiMing,20)
(ZhaoSi,31)
(ZhangSan,22)

3.4 iterate(DataStream)

package devBase

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment, createTypeInformation}


object TranformationOperatorTest {

  def main(args: Array[String]): Unit = {

    val senv = StreamExecutionEnvironment.getExecutionEnvironment

    val input:DataStream[Long] = senv.fromSequence(0L, 5L)   // 6 numbers
    val output:DataStream[Long]=input.iterate(iter => {
      val minusOne:DataStream[Long] = iter.map(_ - 1)
      val gt0 = minusOne.filter(_ > 0)
      val lt0 = minusOne.filter(_ <= 0)
      // gt0 is fed back to iter for the next round; lt0 is emitted to output after each round; with no timeout the job never terminates (see the sketch after the output below)
      (gt0,lt0)
    })

    output.print("output")


    senv.execute()


  }

}

Execution result:

output:2> 0
output:7> -1
output:6> 0
output:8> 0
output:4> 0
output:5> 0
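
To let such a job terminate, the Scala DataStream.iterate also accepts a maxWaitTimeMillis argument: once the feedback channel has been idle for that long, the iteration shuts down. A hedged variant of the loop above (5000 ms is an arbitrary choice):

package devBase

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment, createTypeInformation}

object TerminatingIterateSketch {

  def main(args: Array[String]): Unit = {

    val senv = StreamExecutionEnvironment.getExecutionEnvironment
    val input: DataStream[Long] = senv.fromSequence(0L, 5L)

    // same loop body as above, but the iteration shuts down once the
    // feedback channel has been idle for 5 seconds
    val output: DataStream[Long] = input.iterate((iter: DataStream[Long]) => {
      val minusOne = iter.map(_ - 1)
      (minusOne.filter(_ > 0), minusOne.filter(_ <= 0))
    }, maxWaitTimeMillis = 5000L)

    output.print("output")
    senv.execute()
  }

}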