Trident Stream Processing Framework Explained

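This article walks through the core operations of the Trident stream processing framework: functions, filters, map and flatMap, peek, min and max, windowing, and partition aggregation. It uses concrete examples to show how the different aggregators process data and how streams are merged and joined.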

Reposted from: https://blog.youkuaiyun.com/opensure/article/details/45847545
https://blog.youkuaiyun.com/hjw199089/article/details/72026815

Functions

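• A function takes a set of input fields and emits zero or more tuples as output.
• The fields of each output tuple are appended to the original input tuple.
• If a function emits nothing for an input tuple, that input tuple is filtered out.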

public class MyFunction extends BaseFunction {
    public void execute(TridentTuple tuple, TridentCollector collector) {
        for(int i=0; i < tuple.getInteger(0); i++) {
            collector.emit(new Values(i));
        }
    }
}

mystream.each(new Fields("b"), new MyFunction(), new Fields("d"));

Suppose the input stream "mystream" has the fields ["a", "b", "c"] and contains the following tuples:

[1, 2, 3]
[4, 1, 6]
[3, 0, 8]

The output tuples have four fields, ["a", "b", "c", "d"]:

[1, 2, 3, 0]
[1, 2, 3, 1]
[4, 1, 6, 0]

The tuple [3, 0, 8] is filtered out, because the function emits nothing when "b" is 0.
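The append-and-filter behaviour described above can be reproduced without Storm. The following is a minimal plain-Java sketch, not Trident code: the class `FunctionAppendSketch` and its methods are hypothetical names, and tuples are modelled as lists of integers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical plain-Java model of each() with a function; this is not Storm API.
class FunctionAppendSketch {
    // Mirrors MyFunction: for an input value b, emit 0, 1, ..., b - 1.
    static List<Integer> emitFor(int b) {
        List<Integer> emitted = new ArrayList<>();
        for (int i = 0; i < b; i++) {
            emitted.add(i);
        }
        return emitted;
    }

    // Appends each emitted value to a copy of the input tuple.
    // An input tuple with no emitted values yields no output tuples.
    static List<List<Integer>> each(List<List<Integer>> input, int fieldIndex) {
        List<List<Integer>> result = new ArrayList<>();
        for (List<Integer> tuple : input) {
            for (int value : emitFor(tuple.get(fieldIndex))) {
                List<Integer> extended = new ArrayList<>(tuple);
                extended.add(value);
                result.add(extended);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> input = Arrays.asList(
                Arrays.asList(1, 2, 3), Arrays.asList(4, 1, 6), Arrays.asList(3, 0, 8));
        // Field index 1 plays the role of "b".
        System.out.println(each(input, 1)); // [[1, 2, 3, 0], [1, 2, 3, 1], [4, 1, 6, 0]]
    }
}
```

The third input tuple disappears only because the inner loop never runs for it, which is exactly how a function doubles as a filter.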

Filters

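A function can act as a filter, but the more general way to filter is a dedicated filter, which takes a tuple and decides whether to keep it:

public class MyFilter extends BaseFilter {
    public boolean isKeep(TridentTuple tuple) {
        return tuple.getInteger(0) == 1 && tuple.getInteger(1) == 2;
    }
}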

Suppose the input stream has the fields ["a", "b", "c"] and contains:

[1, 2, 3]
[2, 1, 1]
[2, 3, 4]

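Run the following code. The filter receives the fields in the order ["b", "a"], so it keeps only tuples where b == 1 and a == 2: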

mystream.each(new Fields("b","a"), new MyFilter())

The result is:

[2, 1, 1]

map and flatMap

map applies a one-to-one transformation to each input tuple. For example, to convert words to uppercase:

public class UpperCase implements MapFunction {
 @Override
 public Values execute(TridentTuple input) {
   return new Values(input.getString(0).toUpperCase());
 }
}

Apply it to a stream as follows:

mystream.map(new UpperCase())

flatMap applies a one-to-many transformation and flattens the results into a new stream. For example, to turn a stream of sentences into a stream of words:

public class Split implements FlatMapFunction {
  @Override
  public Iterable<Values> execute(TridentTuple input) {
    List<Values> valuesList = new ArrayList<>();
    for (String word : input.getString(0).split(" ")) {
      valuesList.add(new Values(word));
    }
    return valuesList;
  }
}

The two operations can be chained:

mystream.flatMap(new Split()).map(new UpperCase())

The output fields can also be named explicitly:

mystream.map(new UpperCase(), new Fields("uppercased"));

mystream.flatMap(new Split(), new Fields("word"));

peek

peek receives each tuple as it flows through the stream without changing the stream. For example, to print each tuple:

mystream.peek(new Consumer() {
    @Override
    public void accept(TridentTuple input) {
        System.out.println(input.getString(0));
    }
});

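min & minBy and max & maxBy

min and minBy return the minimum tuple in each partition of a batch, and max and maxBy return the maximum. The By variants compare tuples on a named field.
Suppose the input stream has the fields ["device-id", "count"] and is partitioned as follows: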

Partition 0:
[123, 2]
[113, 54]

Partition 1:
[64,  18]
[62,  12]

Partition 2:
[27,  94]
[82,  23]

Run the following code:

mystream.minBy("count")

The result is:

Partition 0:
[123, 2]

Partition 1:
[62,  12]

Partition 2:
[82,  23]
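The per-partition behaviour of minBy can be checked with a small simulation. This is a plain-Java sketch under the assumption that each partition is handled independently, as the example shows; `MinByPartitionSketch` and its method are hypothetical names, not Storm API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical plain-Java model of minBy on a partitioned batch; not Storm API.
class MinByPartitionSketch {
    // For each non-empty partition, keeps the tuple with the smallest value at fieldIndex.
    static List<List<Integer>> minByPerPartition(List<List<List<Integer>>> partitions, int fieldIndex) {
        Comparator<List<Integer>> byField =
                Comparator.comparingInt((List<Integer> tuple) -> tuple.get(fieldIndex));
        List<List<Integer>> result = new ArrayList<>();
        for (List<List<Integer>> partition : partitions) {
            if (!partition.isEmpty()) {
                result.add(Collections.min(partition, byField));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<List<Integer>>> partitions = Arrays.asList(
                Arrays.asList(Arrays.asList(123, 2), Arrays.asList(113, 54)),
                Arrays.asList(Arrays.asList(64, 18), Arrays.asList(62, 12)),
                Arrays.asList(Arrays.asList(27, 94), Arrays.asList(82, 23)));
        // Field index 1 plays the role of "count".
        System.out.println(minByPerPartition(partitions, 1)); // [[123, 2], [62, 12], [82, 23]]
    }
}
```

Swapping `Collections.min` for `Collections.max` gives the corresponding maxBy behaviour.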

Windowing

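Trident can combine tuples from multiple batches into one window and process them together. Windows can be defined by time duration or by tuple count, and two kinds are supported: tumbling windows and sliding windows.
In a tumbling window, each tuple is processed in exactly one window: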

    /**
     * Returns a stream of tuples which are aggregated results of a tumbling window with every {@code windowCount} of tuples.
     */
    public Stream tumblingWindow(int windowCount, WindowsStoreFactory windowStoreFactory,
                                 Fields inputFields, Aggregator aggregator, Fields functionFields);

    /**
     * Returns a stream of tuples which are aggregated results of a window that tumbles at duration of {@code windowDuration}
     */
    public Stream tumblingWindow(BaseWindowedBolt.Duration windowDuration, WindowsStoreFactory windowStoreFactory,
                                 Fields inputFields, Aggregator aggregator, Fields functionFields);

In a sliding window, the window moves forward at a fixed interval, so a tuple can belong to more than one window:

    /**
     * Returns a stream of tuples which are aggregated results of a sliding window with every {@code windowCount} of tuples
     * and slides the window after {@code slideCount}.
     */
    public Stream slidingWindow(int windowCount, int slideCount, WindowsStoreFactory windowStoreFactory,
                                Fields inputFields, Aggregator aggregator, Fields functionFields);

    /**
     * Returns a stream of tuples which are aggregated results of a window which slides at duration of {@code slidingInterval}
     * and completes a window at {@code windowDuration}
     */
    public Stream slidingWindow(BaseWindowedBolt.Duration windowDuration, BaseWindowedBolt.Duration slidingInterval,
                                WindowsStoreFactory windowStoreFactory, Fields inputFields, Aggregator aggregator, Fields functionFields);
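The difference between count-based tumbling and sliding windows can be illustrated with a simplified model. The sketch below is plain Java, not Storm code: `CountWindowSketch` is a hypothetical name, and it assumes that only complete windows are produced, which may differ from how Trident treats a partially filled window.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simplified model of count-based windows; not Storm API.
// Assumption: only full windows are produced.
class CountWindowSketch {
    // A tumbling window is a sliding window whose slide equals its size,
    // so every tuple lands in at most one window.
    static List<List<Integer>> tumbling(List<Integer> tuples, int windowCount) {
        return sliding(tuples, windowCount, windowCount);
    }

    // Windows of windowCount tuples that start every slideCount tuples.
    // With slideCount < windowCount a tuple appears in several windows.
    static List<List<Integer>> sliding(List<Integer> tuples, int windowCount, int slideCount) {
        if (windowCount <= 0 || slideCount <= 0) {
            throw new IllegalArgumentException("windowCount and slideCount must be positive");
        }
        List<List<Integer>> windows = new ArrayList<>();
        for (int start = 0; start + windowCount <= tuples.size(); start += slideCount) {
            windows.add(new ArrayList<>(tuples.subList(start, start + windowCount)));
        }
        return windows;
    }

    public static void main(String[] args) {
        List<Integer> tuples = Arrays.asList(1, 2, 3, 4, 5, 6);
        System.out.println(tumbling(tuples, 3));   // [[1, 2, 3], [4, 5, 6]]
        System.out.println(sliding(tuples, 4, 2)); // [[1, 2, 3, 4], [3, 4, 5, 6]]
    }
}
```

In the sliding case the tuples 3 and 4 are aggregated twice, once per window, which is the overlap the prose describes.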

partitionAggregate

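partitionAggregate performs aggregation within each partition. Three kinds of aggregator can be defined for it:

Aggregator: methods init, aggregate, and complete. init is called once per batch, aggregate is called for each tuple and may emit any number of tuples, and complete is called after all tuples in the batch have been processed;
CombinerAggregator: methods init, combine, and zero. init and combine are called for each tuple, and zero is returned when the partition contains no tuples;
ReducerAggregator: methods init and reduce. init produces the initial value, and reduce is called for each tuple.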

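partitionAggregate runs a function on each partition of a batch of tuples. Unlike a function, the tuples emitted by partitionAggregate replace the input tuples instead of being appended to them. For example: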

mystream.partitionAggregate(new Fields("b"), new Sum(), new Fields("sum"))

Suppose the input stream has the fields ["a", "b"] and the following partitions:

Partition 0:
["a", 1]
["b", 2]
 
Partition 1:
["a", 3]
["c", 8]
 
Partition 2:
["e", 1]
["d", 9]
["d", 10]

The output contains a single field, "sum":

Partition 0:
[3]
 
Partition 1:
[11]
 
Partition 2:
[20]

There are three aggregator interfaces: CombinerAggregator, ReducerAggregator, and Aggregator.

(1) CombinerAggregator

public interface CombinerAggregator<T> extends Serializable {
    T init(TridentTuple tuple);
    T combine(T val1, T val2);
    T zero();
}

A CombinerAggregator returns a single tuple with a single field. It runs init() on each input tuple and uses combine() to merge the resulting values. If the partition contains no tuples, it emits the output of zero(). For example:

public class Count implements CombinerAggregator<Long> {
    public Long init(TridentTuple tuple) {
        return 1L;
    }
 
    public Long combine(Long val1, Long val2) {
        return val1 + val2;
    }
 
    public Long zero() {
        return 0L;
    }
}
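To see how init, combine, and zero cooperate, the following plain-Java sketch folds one partition at a time. It is an illustration only: the nested `Combiner` interface is a hypothetical stand-in for CombinerAggregator that takes a plain int instead of a TridentTuple, and `aggregatePartition` models the behaviour described in the text rather than Trident's internal implementation.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical plain-Java model of combiner-style aggregation; not Storm API.
class CombinerSketch {
    // Stand-in for CombinerAggregator<T>, with an int in place of a TridentTuple.
    interface Combiner<T> {
        T init(int value);
        T combine(T val1, T val2);
        T zero();
    }

    static final Combiner<Long> COUNT = new Combiner<Long>() {
        public Long init(int value) { return 1L; }
        public Long combine(Long val1, Long val2) { return val1 + val2; }
        public Long zero() { return 0L; }
    };

    static final Combiner<Long> SUM = new Combiner<Long>() {
        public Long init(int value) { return (long) value; }
        public Long combine(Long val1, Long val2) { return val1 + val2; }
        public Long zero() { return 0L; }
    };

    // init() maps every tuple to a value, combine() merges the values pairwise,
    // and zero() is the result for an empty partition.
    static <T> T aggregatePartition(List<Integer> partition, Combiner<T> agg) {
        if (partition.isEmpty()) {
            return agg.zero();
        }
        T acc = agg.init(partition.get(0));
        for (int i = 1; i < partition.size(); i++) {
            acc = agg.combine(acc, agg.init(partition.get(i)));
        }
        return acc;
    }

    public static void main(String[] args) {
        // The "b" values of the three partitions from the partitionAggregate example.
        System.out.println(aggregatePartition(Arrays.asList(1, 2), SUM));       // 3
        System.out.println(aggregatePartition(Arrays.asList(3, 8), SUM));       // 11
        System.out.println(aggregatePartition(Arrays.asList(1, 9, 10), SUM));   // 20
        System.out.println(aggregatePartition(Arrays.asList(1, 9, 10), COUNT)); // 3
        System.out.println(aggregatePartition(Collections.<Integer>emptyList(), COUNT)); // 0
    }
}
```

Because combine() only ever sees two already-initialized values, partial results from different partitions can be merged later in the same way.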

(2) ReducerAggregator

public interface ReducerAggregator<T> extends Serializable {
    T init();
    T reduce(T curr, TridentTuple tuple);
}

A ReducerAggregator produces an initial value with init(), then folds each input tuple into that value with reduce() to produce a single output value. For example:

public class Count implements ReducerAggregator<Long> {
    public Long init() {
        return 0L;
    }
    
    public Long reduce(Long curr, TridentTuple tuple) {
        return curr + 1;
    }
}
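The fold that a ReducerAggregator performs can be written out in a few lines. This is a plain-Java sketch, not Storm code: the nested `Reducer` interface is a hypothetical stand-in for ReducerAggregator that takes a String instead of a TridentTuple.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical plain-Java model of reducer-style aggregation; not Storm API.
class ReducerSketch {
    // Stand-in for ReducerAggregator<T>, with a String in place of a TridentTuple.
    interface Reducer<T> {
        T init();
        T reduce(T curr, String tuple);
    }

    static final Reducer<Long> COUNT = new Reducer<Long>() {
        public Long init() { return 0L; }
        public Long reduce(Long curr, String tuple) { return curr + 1; }
    };

    // Starts from init() and folds every tuple into the running value.
    static <T> T aggregatePartition(List<String> partition, Reducer<T> agg) {
        T curr = agg.init();
        for (String tuple : partition) {
            curr = agg.reduce(curr, tuple);
        }
        return curr;
    }

    public static void main(String[] args) {
        System.out.println(aggregatePartition(Arrays.asList("e", "d", "d"), COUNT)); // 3
    }
}
```

Unlike the combiner form, reduce() needs the raw tuples, so there is no intermediate value that could be merged across partitions.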

(3) Aggregator

public interface Aggregator<T> extends Operation {
    T init(Object batchId, TridentCollector collector);
    void aggregate(T state, TridentTuple tuple, TridentCollector collector);
    void complete(T state, TridentCollector collector);
}

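Aggregator is the most general aggregation interface. init() is called before a batch is processed; its return value represents the aggregation state and is passed to aggregate() and complete(). aggregate() is called for each tuple in the batch partition and can update the state and optionally emit tuples. complete() is called once all tuples in the batch partition have been processed. Here is Count implemented as an Aggregator: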

public class CountAgg extends BaseAggregator<CountState> {
    static class CountState {
        long count = 0;
    }
 
    public CountState init(Object batchId, TridentCollector collector) {
        return new CountState();
    }
 
    public void aggregate(CountState state, TridentTuple tuple, TridentCollector collector) {
        state.count+=1;
    }
 
    public void complete(CountState state, TridentCollector collector) {
        collector.emit(new Values(state.count));
    }
}

To run several aggregations at once, chain them as follows:

mystream.chainedAgg()
        .partitionAggregate(new Count(), new Fields("count"))
        .partitionAggregate(new Fields("b"), new Sum(), new Fields("sum"))
        .chainEnd();

This code runs the Count and Sum aggregators on each partition. The output is a single tuple with the fields ["count", "sum"].

stateQuery and partitionPersist

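stateQuery queries a source of state and partitionPersist updates one. Both add per-partition operation on top of Trident's state management mechanism.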

projection

project keeps only a subset of the fields in each tuple. For a stream with the fields ["a", "b", "c", "d"]:

mystream.project(new Fields("b", "d"))

The output stream contains only the fields ["b", "d"].

Repartitioning operations

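Data is distributed across the cluster in partitions and processed in parallel. Repartitioning changes how tuples are assigned to partitions and involves network transfer. The available operations are:

shuffle: distributes tuples evenly across partitions in round-robin fashion;
broadcast: replicates every tuple to all partitions;
partitionBy: partitions by the given fields;
global: sends all tuples to the same partition;
batchGlobal: sends all tuples of the same batch to the same partition;
partition: takes a custom partitioning function.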

Aggregation operations

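There are two aggregation operations: aggregate and persistentAggregate.
Both accept the aggregators described above. Note that:

With a ReducerAggregator or an Aggregator, a global aggregation first gathers all tuples into one place, so parallelism is low.
With a CombinerAggregator, each partition is aggregated locally before the final computation, so parallelism is higher.

aggregate runs on each batch independently, while persistentAggregate aggregates across all batches of the stream and stores the result.

For a global aggregation with a ReducerAggregator or an Aggregator, the stream is first repartitioned into a single partition, and the aggregation function then runs on that partition. With a CombinerAggregator, Trident first computes a partial aggregation on each partition, then repartitions into a single partition and finishes the aggregation after the network transfer. CombinerAggregators are much more efficient and should be used whenever possible.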
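The efficiency difference between the two strategies comes down to how many values have to be moved into the single final partition. The plain-Java sketch below makes that concrete for a global count. It is a hypothetical cost model, not Storm code: `GlobalAggregationSketch` and the "values transferred" counts are illustrative and do not correspond to any Storm metric.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical plain-Java model of global aggregation strategies; not Storm API.
class GlobalAggregationSketch {
    // Reducer or Aggregator style: gather every tuple into one partition, then count.
    static long countAfterGathering(List<List<String>> partitions) {
        List<String> single = new ArrayList<>();
        for (List<String> partition : partitions) {
            single.addAll(partition); // every tuple is moved
        }
        return single.size();
    }

    // Combiner style: count locally, then merge one partial result per partition.
    static long countWithPartials(List<List<String>> partitions) {
        List<Long> partials = new ArrayList<>();
        for (List<String> partition : partitions) {
            partials.add((long) partition.size()); // only this value is moved
        }
        long total = 0L;
        for (long partial : partials) {
            total += partial;
        }
        return total;
    }

    // Illustrative cost: number of values sent to the final partition.
    static int valuesTransferredWhenGathering(List<List<String>> partitions) {
        int transferred = 0;
        for (List<String> partition : partitions) {
            transferred += partition.size();
        }
        return transferred;
    }

    static int valuesTransferredWithPartials(List<List<String>> partitions) {
        return partitions.size();
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "b"), Arrays.asList("a", "c"), Arrays.asList("e", "d", "d"));
        System.out.println(countAfterGathering(partitions));            // 7
        System.out.println(countWithPartials(partitions));              // 7
        System.out.println(valuesTransferredWhenGathering(partitions)); // 7
        System.out.println(valuesTransferredWithPartials(partitions));  // 3
    }
}
```

Both strategies give the same count, but the combiner form moves one value per partition regardless of how many tuples each partition holds.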

Here is an example that aggregates within each batch:

mystream.aggregate(new Count(), new Fields("count"))

Like partitionAggregate, calls to aggregate can be chained. However, if a CombinerAggregator is chained with a non-CombinerAggregator, Trident cannot apply the partial aggregation optimization.

grouped streams

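The groupBy operation changes how the data is partitioned.
It repartitions the stream on the specified fields, so that tuples whose grouping fields are equal land in the same partition.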
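One way to picture this repartitioning is a hash of the grouping field taken modulo the number of partitions. The sketch below is plain Java under that assumption; `GroupBySketch` is a hypothetical name and the hash-modulo rule is chosen for illustration, not taken from Trident's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plain-Java model of repartitioning by a grouping field; not Storm API.
class GroupBySketch {
    // Assumption: the target partition is the key's hash modulo the partition count.
    // floorMod keeps the result non-negative even for negative hash codes.
    static int targetPartition(String groupKey, int numPartitions) {
        return Math.floorMod(groupKey.hashCode(), numPartitions);
    }

    // Assigns each key to its target partition; equal keys always end up together.
    static Map<Integer, List<String>> repartition(List<String> keys, int numPartitions) {
        Map<Integer, List<String>> partitions = new TreeMap<>();
        for (String key : keys) {
            partitions.computeIfAbsent(targetPartition(key, numPartitions), p -> new ArrayList<>())
                      .add(key);
        }
        return partitions;
    }

    public static void main(String[] args) {
        System.out.println(repartition(Arrays.asList("a", "b", "a", "c", "a"), 3));
    }
}
```

Because the partition depends only on the key, every tuple with the same grouping value is routed to the same place, whichever partition it started in.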

Merges and joins

The simplest way to combine streams is to merge them into a single stream with TridentTopology#merge:

topology.merge(stream1, stream2, stream3);

Trident names the fields of the merged stream after the fields of the first stream.

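Another way to combine streams is a join. A SQL-style join operates on a fixed, finite input, whereas the input of a stream is unbounded, so streams cannot be joined the way SQL tables are. Here is an example of a join between one stream with the fields ["key", "val1", "val2"] and another stream with the fields ["x", "val1"]:

topology.join(stream1, new Fields("key"),
              stream2, new Fields("x"),
              new Fields("key", "a", "b", "c"));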

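The "key" field of stream1 is joined with the "x" field of stream2. Trident requires new names to be declared for all output fields, because the original field names are replaced. The tuples emitted by the join contain:

First, the join fields. In this example, "key" in stream1 corresponds to "x" in stream2.
Next, the non-join fields of each stream, in the order the streams were passed to join. In this example, "a" and "b" correspond to "val1" and "val2" from stream1, and "c" corresponds to "val1" from stream2.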
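The field ordering of the joined tuples can be reproduced with a nested-loop join over one batch. This is a plain-Java sketch, not Storm code: `JoinSketch` is a hypothetical name, tuples are modelled as lists of strings, and an inner join within a single batch is assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical plain-Java model of the output layout of a join; not Storm API.
class JoinSketch {
    // Inner join of two batches. Each output tuple is:
    // [join value, non-join fields of stream1..., non-join fields of stream2...].
    static List<List<String>> join(List<List<String>> stream1, int joinIndex1,
                                   List<List<String>> stream2, int joinIndex2) {
        List<List<String>> output = new ArrayList<>();
        for (List<String> left : stream1) {
            for (List<String> right : stream2) {
                if (!left.get(joinIndex1).equals(right.get(joinIndex2))) {
                    continue;
                }
                List<String> joined = new ArrayList<>();
                joined.add(left.get(joinIndex1));
                addAllExcept(joined, left, joinIndex1);
                addAllExcept(joined, right, joinIndex2);
                output.add(joined);
            }
        }
        return output;
    }

    // Copies every field of the tuple except the join field.
    private static void addAllExcept(List<String> target, List<String> tuple, int skipIndex) {
        for (int i = 0; i < tuple.size(); i++) {
            if (i != skipIndex) {
                target.add(tuple.get(i));
            }
        }
    }

    public static void main(String[] args) {
        // stream1 has fields [key, val1, val2]; stream2 has fields [x, val1].
        List<List<String>> stream1 = Arrays.asList(
                Arrays.asList("k1", "p", "q"), Arrays.asList("k2", "r", "s"));
        List<List<String>> stream2 = Arrays.asList(Arrays.asList("k1", "z"));
        // Output fields correspond to [key, a, b, c].
        System.out.println(join(stream1, 0, stream2, 0)); // [[k1, p, q, z]]
    }
}
```

The tuple with key "k2" produces no output because nothing in the second batch matches it.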

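When the joined streams come from different spouts, those spouts emit their batches in sync; that is, each processed batch contains tuples from every spout.
How, then, can you do a windowed join, where tuples from one side are joined against the last hour of tuples from the other side? This can be done with partitionPersist and stateQuery: the last hour of tuples from one stream is stored in a source of state, keyed by the join field, and the join looks up the stored data by that key as tuples from the other stream arrive.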
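The store-then-look-up idea behind a windowed join can be sketched with an in-memory map. This is plain Java, not Storm code: `WindowedJoinSketch`, `persist`, and `query` are hypothetical names that only play the roles of partitionPersist and stateQuery, and timestamps are passed in explicitly as an assumption to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of a one-hour windowed join; not Storm API.
class WindowedJoinSketch {
    static final long WINDOW_MILLIS = 60L * 60L * 1000L;

    private static final class Stored {
        final long timestamp;
        final String value;

        Stored(long timestamp, String value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Join key -> values recently seen on the stored side of the join.
    private final Map<String, List<Stored>> state = new HashMap<>();

    // Plays the role of partitionPersist: store a tuple under its join key.
    void persist(String key, String value, long timestamp) {
        state.computeIfAbsent(key, k -> new ArrayList<>()).add(new Stored(timestamp, value));
    }

    // Plays the role of stateQuery: return the values stored for the key
    // within the last hour, dropping entries that have expired.
    List<String> query(String key, long now) {
        List<Stored> entries = state.getOrDefault(key, new ArrayList<>());
        entries.removeIf(entry -> now - entry.timestamp > WINDOW_MILLIS);
        List<String> values = new ArrayList<>();
        for (Stored entry : entries) {
            values.add(entry.value);
        }
        return values;
    }

    public static void main(String[] args) {
        WindowedJoinSketch join = new WindowedJoinSketch();
        join.persist("k", "old", 0L);
        join.persist("k", "new", 3_000_000L);
        // At t = 4,000,000 ms only "new" is less than one hour old.
        System.out.println(join.query("k", 4_000_000L)); // [new]
    }
}
```

A real topology would keep this state in a Trident state implementation rather than a local map, so that it survives failures and is shared per partition.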