flink

License: CC 4.0 BY-SA

flatmap

.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
    // split each line on non-word characters and emit (token, 1) for every non-empty token
    @Override
    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
        for (String token : value.split("\\W+")) {
            if (token.length() > 0) {
                out.collect(new Tuple2<>(token, 1));
            }
        }
    }
})

Flink has no reduceBy operator; group the stream with keyBy and then apply reduce or an aggregation such as sum.
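
To make the keyBy-then-aggregate idea concrete, here is a minimal plain-Java sketch of what grouping by key and summing per key computes for the word-count tokens above. It uses no Flink classes; KeyedSumSketch and countWords are hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration only; no Flink classes are used here.
final class KeyedSumSketch {

    // Groups tokens by key and keeps a running sum per key, which is the
    // final result of keyBy(word) followed by a sum over the count field.
    static Map<String, Integer> countWords(String line) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : line.split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("to be or not to be"));
    }
}
```

In a streaming job the sum is emitted incrementally as records arrive; the map here only shows the per-key totals after all input has been seen.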
window
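A tumbling window splits the stream into non-overlapping windows, so each event belongs to exactly one window.

  // tumbling time window of 1 minute length
  // (timeWindow is deprecated since Flink 1.12; newer code passes an explicit assigner
  // such as TumblingEventTimeWindows.of(Time.minutes(1)) to .window(...))
  .timeWindow(Time.minutes(1))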
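
The "exactly one window" property follows from simple arithmetic on the timestamp. The sketch below is plain Java, not Flink code, and assumes a window offset of 0 and non-negative timestamps in milliseconds; TumblingWindowSketch and windowStart are hypothetical names.

```java
// Plain-Java illustration only; assumes offset 0 and non-negative timestamps.
final class TumblingWindowSketch {

    // Start (inclusive) of the single tumbling window that contains the timestamp.
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    public static void main(String[] args) {
        long size = 60_000L; // 1 minute
        for (long ts : new long[] {0L, 59_999L, 60_000L, 125_000L}) {
            long start = windowStart(ts, size);
            // each timestamp maps to one half-open window [start, start + size)
            System.out.println(ts + " -> [" + start + ", " + (start + size) + ")");
        }
    }
}
```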

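A sliding time window has a fixed length and a slide interval. When the slide is shorter than the length, windows overlap and an event can belong to more than one window.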

  // sliding time window of 1 minute length and 30 secs trigger interval
  .timeWindow(Time.minutes(1), Time.seconds(30))
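
With a 1 minute length and a 30 second slide, every event lands in two windows. A minimal plain-Java sketch of that assignment, assuming offset 0 and non-negative millisecond timestamps (SlidingWindowSketch and windowStarts are hypothetical names, not Flink API):

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; assumes offset 0 and non-negative timestamps.
final class SlidingWindowSketch {

    // Returns the start of every window [start, start + sizeMs) that contains the timestamp,
    // newest window first.
    static List<Long> windowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestampMs - (timestampMs % slideMs);
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // an event at 70 s falls into the windows starting at 60 s and at 30 s
        System.out.println(windowStarts(70_000L, 60_000L, 30_000L));
    }
}
```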

join
datastream

stream.join(otherStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(<WindowAssigner>)
    .apply(<JoinFunction>)
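
What the windowed join produces can be pictured as: pair every left and right element that share the same key and fall into the same window. The following plain-Java sketch illustrates that with tumbling windows; it is not Flink code, and Event, join and demo are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; Event and join(...) are hypothetical names.
final class WindowJoinSketch {

    static final class Event {
        final String key;
        final long timestampMs;
        final String value;

        Event(String key, long timestampMs, String value) {
            this.key = key;
            this.timestampMs = timestampMs;
            this.value = value;
        }
    }

    // Pairs up elements that share a key and fall into the same tumbling window.
    static List<String> join(List<Event> left, List<Event> right, long windowSizeMs) {
        List<String> joined = new ArrayList<>();
        for (Event l : left) {
            for (Event r : right) {
                boolean sameKey = l.key.equals(r.key);
                boolean sameWindow = l.timestampMs / windowSizeMs == r.timestampMs / windowSizeMs;
                if (sameKey && sameWindow) {
                    joined.add(l.value + "," + r.value);
                }
            }
        }
        return joined;
    }

    // Sample input: only L1 and R1 share both key "a" and the first one-minute window.
    static List<String> demo() {
        List<Event> left = new ArrayList<>();
        left.add(new Event("a", 1_000L, "L1"));
        left.add(new Event("a", 61_000L, "L2"));
        List<Event> right = new ArrayList<>();
        right.add(new Event("a", 2_000L, "R1"));
        right.add(new Event("b", 3_000L, "R2"));
        return join(left, right, 60_000L);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Elements without a partner in the same window (L2 and R2 in the sample) produce no output, which matches inner-join behaviour.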

dataset

weightedRatings =
    ratings.join(weights)
           // key of the first input
           .where("category")
           // key of the second input
           .equalTo("f0")
           // applying the JoinFunction on joining pairs
           .with(new PointWeighter());

With Periodic Watermarks
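Watermarks are generated either from the timestamps of arriving elements or purely from processing time.
Watermark generation and emission is triggered periodically. The interval is set with ExecutionConfig.setAutoWatermarkInterval (in milliseconds; 200 ms by default when event time is enabled).
On each trigger Flink calls getCurrentWatermark(); if the returned watermark is non-null and larger than the previous one, it is emitted into the stream.
A maximum out-of-orderness (allowed delay) can be defined.
Implement the AssignerWithPeriodicWatermarks interface.
Watermark = maximum event timestamp seen so far - t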

/**
 * This generator generates watermarks assuming that elements arrive out of order,
 * but only to a certain degree. The latest elements for a certain timestamp t will arrive
 * at most n milliseconds after the earliest elements for timestamp t.
 */
public class BoundedOutOfOrdernessGenerator implements AssignerWithPeriodicWatermarks<MyEvent> {

    private final long maxOutOfOrderness = 3500; // 3.5 seconds

    private long currentMaxTimestamp;

    @Override
    public long extractTimestamp(MyEvent element, long previousElementTimestamp) {
        long timestamp = element.getCreationTime();
        currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
        return timestamp;
    }

    @Override
    public Watermark getCurrentWatermark() {
        // return the watermark as current highest timestamp minus the out-of-orderness bound
        return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
    }
}
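
To see how the watermark trails the highest timestamp, here is a small plain-Java simulation of the same rule. It does not use Flink; BoundedOutOfOrdernessSketch, onEvent and isLate are hypothetical names, and treating an event as late when its timestamp is at or below the current watermark is the assumption made here.

```java
// Plain-Java illustration only; mirrors the "max timestamp minus bound" rule.
final class BoundedOutOfOrdernessSketch {
    private final long maxOutOfOrdernessMs;
    private long currentMaxTimestamp;

    BoundedOutOfOrdernessSketch(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
        // start low enough that subtracting the bound cannot underflow
        this.currentMaxTimestamp = Long.MIN_VALUE + maxOutOfOrdernessMs;
    }

    // Called for every element: remember the highest timestamp seen so far.
    void onEvent(long timestampMs) {
        currentMaxTimestamp = Math.max(timestampMs, currentMaxTimestamp);
    }

    // Called periodically: watermark = highest timestamp - out-of-orderness bound.
    long currentWatermark() {
        return currentMaxTimestamp - maxOutOfOrdernessMs;
    }

    // Assumption: an element is late if its timestamp is not above the watermark.
    boolean isLate(long timestampMs) {
        return timestampMs <= currentWatermark();
    }

    public static void main(String[] args) {
        BoundedOutOfOrdernessSketch generator = new BoundedOutOfOrdernessSketch(3500);
        for (long ts : new long[] {1000L, 5000L, 3000L, 9000L}) {
            generator.onEvent(ts);
            System.out.println("event " + ts + " -> watermark " + generator.currentWatermark());
        }
    }
}
```

The out-of-order event at 3000 does not move the watermark backwards, because only the maximum timestamp matters.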

Watermark = current processing time - t. This generator depends only on processing time, not on the event timestamps.

/**
 * This generator generates watermarks that are lagging behind processing time by a fixed amount.
 * It assumes that elements arrive in Flink after a bounded delay.
 */
public class TimeLagWatermarkGenerator implements AssignerWithPeriodicWatermarks<MyEvent> {

	private final long maxTimeLag = 5000; // 5 seconds

	@Override
	public long extractTimestamp(MyEvent element, long previousElementTimestamp) {
		return element.getCreationTime();
	}

	@Override
	public Watermark getCurrentWatermark() {
		// return the watermark as current time minus the maximum time lag
		return new Watermark(System.currentTimeMillis() - maxTimeLag);
	}
}

tuple
Creating a tuple

 Tuple3.of(input1.f0, input1.f1, input1.f2 + input2.f2)
new Tuple3<>(tuple.f0, true, c); // or, with the constructor

Tuple2.of(closestCentroidId, p);
new Tuple2<>(tuple.f0, true); // or, with the constructor

Reading fields: X.f0, X.f1

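flink iteration
Create an IterativeDataSet named initial with iterate(maxIterations), apply transformations to it to obtain dataset2, and then call initial.closeWith(dataset2). On every pass the result replaces the original initial data set as the input of the next pass, and Flink checks whether the loop should end.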
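
The feedback loop described above can be sketched in plain Java as follows. This is not Flink code: BulkIterationSketch and iterate are hypothetical names, and the converged predicate is a stand-in for Flink's optional termination criterion, which is a data set that ends the loop when it is empty.

```java
import java.util.function.BiPredicate;
import java.util.function.UnaryOperator;

// Plain-Java illustration only; iterate(...) is a hypothetical helper.
final class BulkIterationSketch {

    // Applies step at most maxIterations times, feeding each result back in as the next input.
    // Stops early when converged(previous, next) returns true.
    static <T> T iterate(T initial, int maxIterations,
                         UnaryOperator<T> step, BiPredicate<T, T> converged) {
        T current = initial;
        for (int i = 0; i < maxIterations; i++) {
            T next = step.apply(current);
            if (converged.test(current, next)) {
                return next;
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        // fixed number of passes, no early termination
        Integer count = iterate(0, 10, i -> i + 1, (a, b) -> false);
        System.out.println(count);

        // early termination once a pass changes the value by less than 1.0
        Double halved = iterate(100.0, 1000, v -> v / 2, (a, b) -> Math.abs(a - b) < 1.0);
        System.out.println(halved);
    }
}
```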

estimating pi

final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Create initial IterativeDataSet
IterativeDataSet<Integer> initial = env.fromElements(0).iterate(10000);

DataSet<Integer> iteration = initial.map(new MapFunction<Integer, Integer>() {
    @Override
    public Integer map(Integer i) throws Exception {
        double x = Math.random();
        double y = Math.random();

        return i + ((x * x + y * y < 1) ? 1 : 0);
    }
});

// Iteratively transform the IterativeDataSet
DataSet<Integer> count = initial.closeWith(iteration);

count.map(new MapFunction<Integer, Double>() {
    @Override
    public Double map(Integer count) throws Exception {
        return count / (double) 10000 * 4;
    }
}).print();

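// print() already triggers execution in the DataSet API, so no separate env.execute() call is needed
// (calling it here would fail because no new sink has been defined since the last execution)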

kmeans

// get input data:
// read the points and centroids from the provided paths or fall back to default data
DataSet<Point> points = getPointDataSet(params, env);
DataSet<Centroid> centroids = getCentroidDataSet(params, env);

// set number of bulk iterations for KMeans algorithm
IterativeDataSet<Centroid> loop = centroids.iterate(10000);

DataSet<Centroid> newCentroids = points
    // compute closest centroid for each point
    .map(new SelectNearestCenter()).withBroadcastSet(loop, "centroids")
    // count and sum point coordinates for each centroid
    .map(new CountAppender())
    .groupBy(0).reduce(new CentroidAccumulator())
    // compute new centroids from point counts and coordinate sums
    .map(new CentroidAverager());

// feed new centroids back into next iteration;
// the loop stops early once the termination data set is empty
DataSet<Centroid> finalCentroids = loop.closeWith(
    newCentroids,
    new TerminationCriterionImpl().getTerminatedDataSet(newCentroids, loop));

DataSet<Tuple2<Integer, Point>> clusteredPoints = points
    // assign points to final clusters
    .map(new SelectNearestCenter()).withBroadcastSet(finalCentroids, "centroids");

public class TerminationCriterionImpl extends TerminationCriterion {
    // Pairs each new centroid with its previous position (joined on id) and keeps only the pairs
    // that moved farther than EPSILON. An empty result means every centroid has converged.
    // EPSILON is assumed to be defined elsewhere (for example in TerminationCriterion).
    public FilterOperator<Tuple2<Centroid, Centroid>> getTerminatedDataSet(
            DataSet<Centroid> newCentroids, DataSet<Centroid> oldCentroids) throws Exception {

        return newCentroids.join(oldCentroids).where("id").equalTo("id")
                .filter(new FilterFunction<Tuple2<Centroid, Centroid>>() {
                    @Override
                    public boolean filter(Tuple2<Centroid, Centroid> value) {
                        double dx = value.f0.x - value.f1.x;
                        double dy = value.f0.y - value.f1.y;
                        return Math.sqrt(dx * dx + dy * dy) > EPSILON;
                    }
                });
    }
}
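
The convergence test used as the termination criterion can be tried out without a cluster. The plain-Java sketch below applies the same distance check to arrays of {x, y} coordinates; ConvergenceCheckSketch, stillMoving and the EPSILON value of 1e-3 are assumptions made for illustration, not values from the original code.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; the EPSILON value is a hypothetical threshold.
final class ConvergenceCheckSketch {
    static final double EPSILON = 1e-3;

    // Returns the indices of centroids that moved farther than EPSILON.
    // oldCentroids[i] and newCentroids[i] hold the {x, y} position of centroid i.
    static List<Integer> stillMoving(double[][] oldCentroids, double[][] newCentroids) {
        List<Integer> moving = new ArrayList<>();
        for (int i = 0; i < oldCentroids.length; i++) {
            double dx = newCentroids[i][0] - oldCentroids[i][0];
            double dy = newCentroids[i][1] - oldCentroids[i][1];
            if (Math.sqrt(dx * dx + dy * dy) > EPSILON) {
                moving.add(i);
            }
        }
        return moving;
    }

    public static void main(String[] args) {
        double[][] before = {{0.0, 0.0}, {1.0, 1.0}};
        double[][] after = {{0.0, 0.0001}, {2.0, 1.0}};
        // a non-empty list means another iteration is needed
        System.out.println(stillMoving(before, after));
    }
}
```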

state
ValueState

 The value can be set using x.update(value) and retrieved using x.value().

public class CountWindowAverage extends RichFlatMapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>> {

    /**
     * The ValueState handle. The first field is the count, the second field a running sum.
     */
    private transient ValueState<Tuple2<Long, Long>> sum;

    @Override
    public void flatMap(Tuple2<Long, Long> input, Collector<Tuple2<Long, Long>> out) throws Exception {

        // access the state value
        Tuple2<Long, Long> currentSum = sum.value();

        // update the count
        currentSum.f0 += 1;

        // add the second field of the input value
        currentSum.f1 += input.f1;

        // update the state
        sum.update(currentSum);

        // if the count reaches 2, emit the average and clear the state
        if (currentSum.f0 >= 2) {
            out.collect(new Tuple2<>(input.f0, currentSum.f1 / currentSum.f0));
            sum.clear();
        }
    }

    @Override
    public void open(Configuration config) {
        ValueStateDescriptor<Tuple2<Long, Long>> descriptor =
                new ValueStateDescriptor<>(
                        "average", // the state name
                        TypeInformation.of(new TypeHint<Tuple2<Long, Long>>() {}), // type information
                        Tuple2.of(0L, 0L)); // default value of the state, if nothing was set
        sum = getRuntimeContext().getState(descriptor);
    }
}

// this can be used in a streaming program like this (assuming we have a StreamExecutionEnvironment env)
env.fromElements(Tuple2.of(1L, 3L), Tuple2.of(1L, 5L), Tuple2.of(1L, 7L), Tuple2.of(1L, 4L), Tuple2.of(1L, 2L))
        .keyBy(0)
        .flatMap(new CountWindowAverage())
        .print();
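
The effect of the keyed state is easiest to follow by tracing the five input records by hand. The plain-Java sketch below replays them with a HashMap standing in for the per-key ValueState; CountWindowAverageSketch, process and demo are hypothetical names and no Flink classes are involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; the map stands in for keyed ValueState.
final class CountWindowAverageSketch {
    // per-key state: element 0 is the count, element 1 is the running sum
    private final Map<Long, long[]> state = new HashMap<>();

    // Processes one (key, value) record; returns {key, average} once two values
    // have been seen for the key and clears the state, otherwise returns null.
    long[] process(long key, long value) {
        long[] acc = state.computeIfAbsent(key, k -> new long[] {0L, 0L});
        acc[0] += 1;
        acc[1] += value;
        if (acc[0] >= 2) {
            state.remove(key);
            return new long[] {key, acc[1] / acc[0]};
        }
        return null;
    }

    // Replays the sample input from the streaming program above.
    static List<String> demo() {
        long[][] input = {{1, 3}, {1, 5}, {1, 7}, {1, 4}, {1, 2}};
        CountWindowAverageSketch function = new CountWindowAverageSketch();
        List<String> out = new ArrayList<>();
        for (long[] record : input) {
            long[] result = function.process(record[0], record[1]);
            if (result != null) {
                out.add("(" + result[0] + "," + result[1] + ")");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [(1,4), (1,5)]
    }
}
```

The first pair (3, 5) averages to 4 and the second pair (7, 4) to 5 with integer division; the last record (2) stays in state because no second value arrives.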

sink
A sink writes out the results. You can use the sinks Flink already provides, such as Kafka, JDBC, or Elasticsearch, or implement a custom sink of your own.
process
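ProcessFunction is a low-level stream processing operation that gives access to the basic building blocks of all (acyclic) streaming applications: events (stream elements), state (fault-tolerant, consistent, only on keyed streams), and timers (event time and processing time, only on keyed streams).
A ProcessFunction can be thought of as a FlatMapFunction with access to keyed state and timers. It handles events by being invoked for each event received on the input stream.
Customize it by overriding open, close, and processElement (and onTimer for timer callbacks).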
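
How keyed state and event-time timers work together can be sketched with a small plain-Java simulation: count elements per key and emit the count once no new element has arrived for a timeout. This is not Flink code. KeyedTimeoutSketch, its methods, and the 60 second timeout are hypothetical, and the sketch simplifies timers to one pending timer per key that each new element replaces.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; simulates keyed state plus one event-time timer per key.
final class KeyedTimeoutSketch {
    private static final long TIMEOUT_MS = 60_000L;

    private final Map<String, Long> counts = new TreeMap<>(); // keyed state: count per key
    private final Map<String, Long> timers = new TreeMap<>(); // pending timer per key
    private final List<String> output = new ArrayList<>();

    // Plays the role of processElement: update state and (re)register the key's timer.
    void processElement(String key, long timestampMs) {
        counts.merge(key, 1L, Long::sum);
        timers.put(key, timestampMs + TIMEOUT_MS);
    }

    // Plays the role of onTimer: a timer fires once the watermark reaches its timestamp.
    void advanceWatermark(long watermarkMs) {
        Iterator<Map.Entry<String, Long>> it = timers.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> timer = it.next();
            if (timer.getValue() <= watermarkMs) {
                output.add(timer.getKey() + "=" + counts.remove(timer.getKey()));
                it.remove();
            }
        }
    }

    List<String> output() {
        return output;
    }

    public static void main(String[] args) {
        KeyedTimeoutSketch function = new KeyedTimeoutSketch();
        function.processElement("a", 1_000L);
        function.processElement("a", 2_000L);
        function.processElement("b", 50_000L);
        function.advanceWatermark(62_000L); // fires the timer for key "a" only
        System.out.println(function.output());
    }
}
```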