2021SC@SDUSC
Stream Grouping
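In Storm, a developer can specify which task or tasks of a downstream bolt process the tuples emitted by an upstream spout or bolt.
For each bolt you declare which streams it receives as input, and the stream grouping defines how the tuples of each stream are distributed among that bolt's tasks.
This is stream grouping. The seven main grouping types are listed below.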
// InputStream here is the topology-description type that carries fromComponent and id,
// not java.io.InputStream.
public enum GroupingType {
    SHUFFLE {
        @Override
        public void assign(BoltDeclarer declarer, InputStream stream) {
            declarer.shuffleGrouping(stream.fromComponent, stream.id);
        }
    },
    FIELDS {
        @Override
        public void assign(BoltDeclarer declarer, InputStream stream) {
            declarer.fieldsGrouping(stream.fromComponent, stream.id, new Fields("key"));
        }
    },
    ALL {
        @Override
        public void assign(BoltDeclarer declarer, InputStream stream) {
            declarer.allGrouping(stream.fromComponent, stream.id);
        }
    },
    GLOBAL {
        @Override
        public void assign(BoltDeclarer declarer, InputStream stream) {
            declarer.globalGrouping(stream.fromComponent, stream.id);
        }
    },
    LOCAL_OR_SHUFFLE {
        @Override
        public void assign(BoltDeclarer declarer, InputStream stream) {
            declarer.localOrShuffleGrouping(stream.fromComponent, stream.id);
        }
    },
    NONE {
        @Override
        public void assign(BoltDeclarer declarer, InputStream stream) {
            declarer.noneGrouping(stream.fromComponent, stream.id);
        }
    },
    PARTIAL_KEY {
        @Override
        public void assign(BoltDeclarer declarer, InputStream stream) {
            declarer.partialKeyGrouping(stream.fromComponent, stream.id, new Fields("key"));
        }
    };

    // Each constant above overrides this method with the matching declarer call.
    public abstract void assign(BoltDeclarer declarer, InputStream stream);
}
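The enum above gives every constant its own assign body, so a grouping selected by name at runtime can be applied without a switch statement. The following is a minimal stand-alone sketch of that pattern, not Storm code: the `GroupingDemo` class, the `Declarer` interface, the `Kind` enum and the `apply` helper are all hypothetical names introduced for illustration, and only two of the seven groupings are modeled.

```java
import java.util.ArrayList;
import java.util.List;

public class GroupingDemo {
    // Hypothetical stand-in for a declarer; it only records what was requested.
    interface Declarer {
        void shuffle(String component, String streamId);
        void fields(String component, String streamId, String field);
    }

    // Each constant supplies its own assign body (constant-specific methods).
    enum Kind {
        SHUFFLE {
            @Override
            void assign(Declarer d, String component, String streamId) {
                d.shuffle(component, streamId);
            }
        },
        FIELDS {
            @Override
            void assign(Declarer d, String component, String streamId) {
                d.fields(component, streamId, "key");
            }
        };

        abstract void assign(Declarer d, String component, String streamId);
    }

    // Looks up the grouping by name, applies it, and returns the recorded calls.
    static List<String> apply(String name, String component, String streamId) {
        final List<String> calls = new ArrayList<>();
        Declarer recorder = new Declarer() {
            @Override
            public void shuffle(String c, String s) {
                calls.add("shuffle(" + c + ", " + s + ")");
            }

            @Override
            public void fields(String c, String s, String f) {
                calls.add("fields(" + c + ", " + s + ", " + f + ")");
            }
        };
        Kind.valueOf(name).assign(recorder, component, streamId);
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(apply("SHUFFLE", "spout", "default"));
        System.out.println(apply("FIELDS", "spout", "default"));
    }
}
```

Because the dispatch lives in the enum constant, supporting another grouping only requires adding a constant with its own assign body.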
ShuffleGrouping
Shuffle grouping: the tuples of the stream are dispatched randomly, so that each task of the bolt receives roughly the same number of tuples.
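import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.storm.generated.GlobalStreamId;
import org.apache.storm.grouping.CustomStreamGrouping;
import org.apache.storm.task.WorkerTopologyContext;

public class ShuffleGrouping implements CustomStreamGrouping, Serializable {
    private ArrayList<List<Integer>> choices;
    private AtomicInteger current;

    @Override
    public void prepare(WorkerTopologyContext context, GlobalStreamId stream, List<Integer> targetTasks) {
        // Wrap each target task id in a one-element list, then shuffle the order once.
        choices = new ArrayList<List<Integer>>(targetTasks.size());
        for (Integer i : targetTasks) {
            choices.add(Arrays.asList(i));
        }
        current = new AtomicInteger(0);
        Collections.shuffle(choices, new Random());
    }

    @Override
    public List<Integer> chooseTasks(int taskId, List<Object> values) {
        int rightNow;
        int size = choices.size();
        while (true) {
            rightNow = current.incrementAndGet();
            if (rightNow < size) {
                return choices.get(rightNow);
            } else if (rightNow == size) {
                current.set(0);
                return choices.get(0);
            }
            // race condition with another thread, and we lost. try again
        }
    }
}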
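Shuffle grouping is the most commonly used stream grouping. It distributes tuples randomly across the tasks of the bolt, so each task receives roughly the same number of tuples. Declaring a stream with shuffle grouping means the input from the spout or upstream bolt is shuffled, that is, spread randomly over the tasks of this bolt, which keeps the tuple load across tasks fairly even.
Here, prepare randomizes the order of the ArrayList<List<Integer>> choices once.
chooseTasks then uses current.incrementAndGet() to get a round-robin effect: if the incremented index is still below size, the entry at that index is returned; when it equals size, the counter is reset to 0 and the first entry is returned. A value above size means another thread won the race, so the loop tries again.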
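The shuffle-once, then round-robin behavior described above can be tried outside Storm. The following is a minimal sketch under stated assumptions: `ShuffleSketch`, `chooseTask` and `countPerTask` are hypothetical names, the fixed seed is added only to make runs reproducible, and the sketch returns a bare task id instead of a one-element list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicInteger;

public class ShuffleSketch {
    private final List<Integer> choices;
    private final AtomicInteger current = new AtomicInteger(0);

    // Shuffle the task ids once; the seed parameter is an assumption of this sketch.
    public ShuffleSketch(List<Integer> targetTasks, long seed) {
        choices = new ArrayList<>(targetTasks);
        Collections.shuffle(choices, new Random(seed));
    }

    // Walk the shuffled list cyclically, retrying when another thread has
    // already pushed the counter past the end of the list.
    public int chooseTask() {
        int size = choices.size();
        while (true) {
            int rightNow = current.incrementAndGet();
            if (rightNow < size) {
                return choices.get(rightNow);
            } else if (rightNow == size) {
                current.set(0);
                return choices.get(0);
            }
        }
    }

    // Counts how many of `calls` selections went to each task id.
    public static Map<Integer, Integer> countPerTask(List<Integer> tasks, int calls) {
        ShuffleSketch grouping = new ShuffleSketch(tasks, 42L);
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int n = 0; n < calls; n++) {
            counts.merge(grouping.chooseTask(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // 400 calls over 4 tasks: prints {3=100, 5=100, 7=100, 9=100}
        System.out.println(countPerTask(Arrays.asList(3, 5, 7, 9), 400));
    }
}
```

In a single thread the split is exactly even whenever the number of calls is a multiple of the task count, because only the initial order is random and the selection afterwards is cyclic.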
GlobalGrouping
Global grouping: the entire stream goes to a single task of the bolt. More specifically, it goes to the task with the lowest id.
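import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import org.apache.storm.generated.GlobalStreamId;
import org.apache.storm.grouping.CustomStreamGrouping;
import org.apache.storm.task.WorkerTopologyContext;

public class GlobalGrouping implements CustomStreamGrouping {
    List<Integer> target;

    @Override
    public void prepare(WorkerTopologyContext context, GlobalStreamId stream, List<Integer> targets) {
        // Sort a copy of the task ids and keep only the lowest one.
        List<Integer> sorted = new ArrayList<>(targets);
        Collections.sort(sorted);
        target = Arrays.asList(sorted.get(0));
    }

    @Override
    public List<Integer> chooseTasks(int i, List<Object> list) {
        return target;
    }
}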
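The target is fixed in prepare: the task ids are sorted and the first element, sorted.get(0), is kept. This is the lowest task id, which is not necessarily targets.get(0) of the unsorted list.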
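The difference between the lowest task id and the first element of the unsorted list is easy to check in isolation. Below is a minimal sketch, assuming nothing beyond the JDK; `GlobalSketch` and `pickTarget` are hypothetical names that follow the same sort-and-take-first steps as the prepare method above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class GlobalSketch {
    // Sort a copy of the task ids and return the lowest one as a one-element list.
    static List<Integer> pickTarget(List<Integer> targets) {
        List<Integer> sorted = new ArrayList<>(targets);
        Collections.sort(sorted);
        return Collections.singletonList(sorted.get(0));
    }

    public static void main(String[] args) {
        List<Integer> targets = Arrays.asList(12, 7, 9);
        // The unsorted list starts with 12, but the chosen target is 7.
        System.out.println("first in list: " + targets.get(0));
        System.out.println("chosen target: " + pickTarget(targets));
    }
}
```

Sorting a copy leaves the caller's list untouched, which matters when the same list of task ids is shared with other groupings.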
FieldsGrouping
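Fields grouping: the stream is partitioned by the specified fields. For example, when grouping by UserId, tuples with the same UserId are always sent to the same task of the bolt, while tuples with different UserIds may be sent to different tasks.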
// Excerpt: a static nested class in the Storm source. The class name must match its constructor.
public static class FieldsGrouper implements CustomStreamGrouping {
    private Fields outFields;
    private List<List<Integer>> targetTasks;
    private Fields groupFields;
    private int numTasks;

    public FieldsGrouper(Fields outFields, Grouping thriftGrouping) {
        this.outFields = outFields;
        this.groupFields = new Fields(Thrift.fieldGrouping(thriftGrouping));
    }

    @Override
    public void prepare(WorkerTopologyContext context, GlobalStreamId stream, List<Integer> targetTasks) {
        // Wrap each target task id in a one-element list so chooseTasks can return it directly.
        this.targetTasks = new ArrayList<List<Integer>>();
        for (Integer targetTask : targetTasks) {
            this.targetTasks.add(Collections.singletonList(targetTask));
        }
        this.numTasks = targetTasks.size();
    }

    @Override
    public List<Integer> chooseTasks(int taskId, List<Object> values) {
        // Hash only the values of the grouping fields, then map the hash to a task index.
        int targetTaskIndex = TupleUtils.chooseTaskIndex(outFields.select(groupFields, values), numTasks);
        return targetTasks.get(targetTaskIndex);
    }
}
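The values of the selected fields are passed to TupleUtils.chooseTaskIndex to pick the index of the target task. chooseTaskIndex essentially computes a hash with Arrays.deepHashCode and then takes the floor modulo of that hash by numTasks, so the index is never negative.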
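The hash-then-floor-modulo step can be reproduced with the JDK alone. The following is a minimal sketch of the computation as described above, not a copy of Storm's TupleUtils: `FieldsSketch` is a hypothetical class, and its `chooseTaskIndex` assumes a non-null list of key values.

```java
import java.util.Arrays;
import java.util.List;

public class FieldsSketch {
    // Deep hash of the selected field values, mapped into [0, numTasks).
    static int chooseTaskIndex(List<?> keys, int numTasks) {
        return Math.floorMod(Arrays.deepHashCode(keys.toArray()), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;

        // The same key always yields the same task index.
        int first = chooseTaskIndex(Arrays.asList("user-42"), numTasks);
        int second = chooseTaskIndex(Arrays.asList("user-42"), numTasks);
        System.out.println(first == second);

        // Hash codes can be negative. The % operator keeps the sign of the
        // dividend, while Math.floorMod always lands in [0, numTasks).
        System.out.println(-7 % numTasks);                // prints -3
        System.out.println(Math.floorMod(-7, numTasks));  // prints 1
    }
}
```

Using the plain % operator here would produce a negative index for keys with a negative hash, and targetTasks.get would then throw, which is why the floor modulo matters.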