Apache Storm Core API in Depth
Storm's core API provides the framework for building real-time stream processing applications. Below is an in-depth look at its key components and how to use them:
I. TopologyBuilder
Core Functionality
Builds the topology's processing graph, defining how data flows from Spouts into Bolts:
```java
TopologyBuilder builder = new TopologyBuilder();
// Source: 3 spout executors pulling from Kafka
builder.setSpout("kafka-spout", new KafkaSpout(), 3);
// Raw lines distributed randomly across 4 parser tasks
builder.setBolt("parser-bolt", new LogParserBolt(), 4)
       .shuffleGrouping("kafka-spout");
// The same "ip" value is always routed to the same counter task
builder.setBolt("counter-bolt", new CountBolt(), 4)
       .fieldsGrouping("parser-bolt", new Fields("ip"));
// globalGrouping sends everything to one task; parallelism 2 leaves a task idle
builder.setBolt("hdfs-bolt", new HdfsBolt(), 2)
       .globalGrouping("counter-bolt");
```
Key Methods
| Method | Parameters | Purpose |
|---|---|---|
| setSpout | (id, spout, parallelism) | Define a data source |
| setBolt | (id, bolt, parallelism) | Define a processing unit |
| shuffleGrouping | (componentId) | Random distribution |
| fieldsGrouping | (componentId, fields) | Group by field values |
| allGrouping | (componentId) | Broadcast to all tasks |
| customGrouping | (componentId, grouping) | User-defined grouping |
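For quick experiments, a topology like the one above can be run in-process with LocalCluster before deploying to a real cluster. A minimal sketch (topology name and sleep duration are arbitrary; Storm 2.x makes LocalCluster AutoCloseable, on 1.x call shutdown() explicitly instead):

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;

public class LocalRunner {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... wire spouts and bolts as shown above ...
        Config conf = new Config();
        conf.setDebug(true);                    // log every emitted tuple
        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("log-topology", conf, builder.createTopology());
            Thread.sleep(30_000);               // let the topology run for a while
        }                                       // close() kills the local topology
    }
}
```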
II. Spout API (Data Sources)
1. Base Interface: IRichSpout
In practice you usually extend BaseRichSpout, which supplies empty defaults for the optional methods. The hand-rolled Kafka consumer below is illustrative only; the official storm-kafka-client module ships a production-ready KafkaSpout.
```java
import java.time.Duration;
import java.util.*;

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class KafkaSpout extends BaseRichSpout {
    private static final Logger LOG = LoggerFactory.getLogger(KafkaSpout.class);
    private SpoutOutputCollector collector;
    private KafkaConsumer<String, String> consumer;

    @Override
    public void open(Map conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("group.id", "storm-log-consumer");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false"); // offsets are committed in ack()
        consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList("logs"));
    }

    @Override
    public void nextTuple() {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
        for (ConsumerRecord<String, String> record : records) {
            // The offset doubles as the message ID so ack()/fail() can locate it
            collector.emit(new Values(record.value()), record.offset());
        }
    }

    @Override
    public void ack(Object msgId) {
        // Simplified: assumes a single partition (0). A production spout must
        // track which partition each in-flight offset belongs to.
        consumer.commitSync(Collections.singletonMap(
            new TopicPartition("logs", 0),
            new OffsetAndMetadata((Long) msgId + 1)
        ));
    }

    @Override
    public void fail(Object msgId) {
        LOG.error("Message failed: {}", msgId);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("log"));
    }
}
```
2. Core Methods
| Method | When Called | Typical Use |
|---|---|---|
| open() | Spout initialization | Establish external connections |
| nextTuple() | Called in a loop | Fetch and emit data |
| ack(Object msgId) | Tuple fully processed | Commit offsets / clean up state |
| fail(Object msgId) | Tuple processing failed | Replay / log the error |
| declareOutputFields() | Topology declaration | Define output fields |
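To make the ack/fail lifecycle concrete, here is a minimal sketch of the standard replay pattern (the class and its fetchNextMessage() source are hypothetical): every in-flight tuple is kept in memory keyed by message ID, dropped on ack, and re-emitted on fail.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

// Sketch: pending tuples live in memory until Storm acks them;
// a fail() puts them back on the wire with the same message ID.
public class ReplayableSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final Map<String, String> pending = new ConcurrentHashMap<>();

    @Override
    public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        String payload = fetchNextMessage();    // hypothetical source
        if (payload == null) return;            // nothing to do this round
        String msgId = UUID.randomUUID().toString();
        pending.put(msgId, payload);            // remember until acked
        collector.emit(new Values(payload), msgId);
    }

    @Override
    public void ack(Object msgId) {
        pending.remove(msgId);                  // fully processed: forget it
    }

    @Override
    public void fail(Object msgId) {
        String payload = pending.get(msgId);    // replay with the same ID
        if (payload != null) collector.emit(new Values(payload), msgId);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("payload"));
    }

    private String fetchNextMessage() { return null; /* placeholder */ }
}
```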
III. Bolt API (Processing Units)
1. Base Interface: IRichBolt
As with spouts, you normally extend the convenience base class BaseRichBolt:
```java
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class CountBolt extends BaseRichBolt {
    private OutputCollector collector;
    private Map<String, Integer> counters;

    @Override
    public void prepare(Map conf, TopologyContext context,
                        OutputCollector collector) {
        this.collector = collector;
        this.counters = new HashMap<>();
    }

    @Override
    public void execute(Tuple input) {
        String ip = input.getStringByField("ip");
        int count = counters.getOrDefault(ip, 0) + 1;
        counters.put(ip, count);
        // Anchor the emitted tuple to the input so downstream failures
        // propagate back to the spout (see the reliability section below)
        collector.emit(input, new Values(ip, count));
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("ip", "count"));
    }

    @Override
    public void cleanup() {
        // Note: cleanup() is only guaranteed to run in local mode; don't
        // rely on it for critical work on a cluster.
        counters.forEach((ip, count) ->
            System.out.println(ip + " : " + count));
    }
}
```
2. Core Methods
| Method | When Called | Typical Use |
|---|---|---|
| prepare() | Bolt initialization | Initialize resources |
| execute(Tuple input) | On each incoming Tuple | Business logic |
| declareOutputFields() | Topology declaration | Define output fields |
| cleanup() | Bolt shutdown | Release resources |
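One pattern this table doesn't capture is periodic work driven by tick tuples: Storm can deliver a system tuple to a bolt on a fixed schedule, which is the usual way to flush accumulated state in batches rather than per tuple. A sketch (PeriodicFlushBolt and its flush()/process() hooks are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.utils.TupleUtils;

public abstract class PeriodicFlushBolt extends BaseRichBolt {

    // Ask Storm to send this bolt a system "tick" tuple every 10 seconds
    @Override
    public Map<String, Object> getComponentConfiguration() {
        Map<String, Object> conf = new HashMap<>();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 10);
        return conf;
    }

    @Override
    public void execute(Tuple input) {
        if (TupleUtils.isTick(input)) {
            flush();            // periodic work: flush buffers, emit aggregates...
        } else {
            process(input);     // normal per-tuple business logic
        }
    }

    protected abstract void flush();          // hypothetical hooks
    protected abstract void process(Tuple t);
}
```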
IV. Tuple (Data Unit)
Core Operations
```java
// Read fields by position or by name
String ip  = tuple.getString(0);
String url = tuple.getStringByField("url");

// Provenance: both IDs are plain Strings, the task is an int
String sourceComponent = tuple.getSourceComponent();
int sourceTask = tuple.getSourceTask();
String streamId = tuple.getSourceStreamId();

// Tuples are built by the framework: emit a Values list through the
// collector rather than constructing TupleImpl by hand
collector.emit(tuple, new Values("192.168.1.1", "/index.html", 200));
```
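Beyond the default stream, a component may declare several named streams and route tuples between them; downstream bolts subscribe to a stream by its ID. A sketch (RoutingParserBolt and its naive split are hypothetical):

```java
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Routes well-formed records to the default stream and bad ones to "errors"
public class RoutingParserBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext ctx, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        String raw = input.getString(0);
        String[] parts = raw.split(" ");
        if (parts.length < 2) {
            // Side stream: emit to the named "errors" stream, still anchored
            collector.emit("errors", input, new Values(raw, "unparseable"));
        } else {
            collector.emit(input, new Values(parts[0], parts[1])); // default stream
        }
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("ip", "url"));                     // default
        declarer.declareStream("errors", new Fields("raw", "reason")); // named
    }
}
```

Downstream, subscribe by stream ID, e.g. `builder.setBolt("error-sink", new ErrorSinkBolt(), 1).shuffleGrouping("parser-bolt", "errors")` (ErrorSinkBolt is likewise hypothetical).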
V. Stream Groupings
1. Built-in Groupings
```java
// Groupings are not instantiated as objects; they are declared on a bolt's
// inputs via the InputDeclarer returned by setBolt(). One alternative per line:
builder.setBolt("b1", new CountBolt(), 4).shuffleGrouping("spout");   // random
builder.setBolt("b2", new CountBolt(), 4)
       .fieldsGrouping("spout", new Fields("ip", "user_id"));         // hash by fields
builder.setBolt("b3", new CountBolt(), 1).globalGrouping("spout");    // one task gets all
builder.setBolt("b4", new CountBolt(), 4).allGrouping("spout");       // broadcast
```
2. Custom Groupings
```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

import org.apache.storm.generated.GlobalStreamId;
import org.apache.storm.grouping.CustomStreamGrouping;
import org.apache.storm.task.WorkerTopologyContext;

public class IPRangeGrouping implements CustomStreamGrouping, Serializable {
    private List<Integer> targetTasks;

    @Override
    public void prepare(WorkerTopologyContext context,
                        GlobalStreamId stream,
                        List<Integer> targetTasks) {
        this.targetTasks = targetTasks; // actual task IDs of the target bolt
    }

    @Override
    public List<Integer> chooseTasks(int taskId, List<Object> values) {
        String ip = (String) values.get(0);
        // Must return real task IDs taken from targetTasks, not raw indices
        if (ip.startsWith("192.168")) {
            return Arrays.asList(targetTasks.get(0));
        } else {
            return Arrays.asList(targetTasks.get(1));
        }
    }
}
```

```java
builder.setBolt("geo-bolt", new GeoBolt(), 2)
       .customGrouping("spout", new IPRangeGrouping());
```
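Before reaching for a custom grouping, check the built-ins: partialKeyGrouping, for instance, behaves like fieldsGrouping but splits each key across two downstream tasks to soften hot-key skew (a downstream step must then merge the partial counts):

```java
// fieldsGrouping semantics, but each key may go to either of two tasks,
// so a single hot key no longer saturates one executor
builder.setBolt("partial-count", new CountBolt(), 8)
       .partialKeyGrouping("parser-bolt", new Fields("ip"));
```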
VI. Reliability (the ACK Framework)
1. Anchoring
```java
collector.emit(input, new Values(derivedData));                     // single anchor
collector.emit(Arrays.asList(input1, input2), new Values(result)); // multiple anchors
```
2. Manual ACK/FAIL
```java
public void execute(Tuple input) {
    try {
        process(input);
        collector.ack(input);       // mark the tuple fully processed
    } catch (Exception e) {
        collector.reportError(e);   // surface the exception in the Storm UI
        collector.fail(input);      // trigger replay from the spout
    }
}
```
3. Timeout Configuration
```java
Config conf = new Config();
conf.setMessageTimeoutSecs(30); // tuples un-acked after 30s are failed and replayed
```
VII. Configuration (Config)
Common Settings
```java
Config conf = new Config();
conf.setNumWorkers(4);           // worker JVM processes for this topology
conf.setMaxSpoutPending(1000);   // cap on un-acked tuples per spout task
conf.setMaxTaskParallelism(16);  // upper bound on any component's parallelism
conf.setNumAckers(2);            // acker tasks that track tuple trees
conf.setMessageTimeoutSecs(30);  // tuple-tree completion timeout
conf.registerSerialization(LogEntry.class, LogEntrySerializer.class);
conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx2g -XX:+UseG1GC"); // worker JVM flags

StormSubmitter.submitTopology("network-monitor", conf,
        builder.createTopology());
```
Configuration Precedence (highest to lowest)
- topology.* (set in topology code)
- storm.yaml (cluster configuration file)
- defaults.yaml (Storm's built-in defaults)
VIII. Trident API (High-Level Abstraction)
Core Example
```java
TridentTopology topology = new TridentTopology();
Stream stream = topology.newStream("spout", new KafkaSpout());

// ParserFunction, RedisStateFactory, ThresholdFilter and AlertFunction are
// user-defined classes; Count is Trident's built-in counting aggregator.
stream.each(new Fields("log"), new ParserFunction(), new Fields("ip", "path"))
      .groupBy(new Fields("ip"))
      .persistentAggregate(
          new RedisStateFactory(),      // where the running counts are stored
          new Count(),                  // built-in aggregator
          new Fields("count")
      )
      .newValuesStream()                // stream of updated (ip, count) pairs
      .each(new Fields("ip", "count"), new ThresholdFilter(1000)) // keep heavy hitters
      .peek(new AlertFunction());       // side effect: fire alerts
```
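For reference, a hypothetical ParserFunction like the one used above would extend Trident's BaseFunction: it reads the input fields named in each() and emits values for the new fields appended to the tuple.

```java
import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Values;

// Hypothetical implementation of the ParserFunction used above: splits a raw
// log line into the ("ip", "path") fields declared in each()
public class ParserFunction extends BaseFunction {
    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {
        String log = tuple.getStringByField("log");
        String[] parts = log.split(" ");
        if (parts.length >= 2) {
            // Emitted values are appended to the tuple as the new fields
            collector.emit(new Values(parts[0], parts[1]));
        }
        // Emitting nothing filters the tuple out
    }
}
```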
Trident Operation Types
| Operation | API Example | Purpose |
|---|---|---|
| Filter | .each(fields, filterFn) | Drop unwanted tuples |
| Function | .each(fields, functionFn, newFields) | Transform / append fields |
| Aggregation | .aggregate(fields, aggFn, outputFields) | Aggregate computation |
| State Query | .stateQuery(state, fields, queryFn, newFields) | Query external state |
| Partitioning | .partitionBy(fields) | Repartition the stream |
IX. Serialization Extensions
Custom Serializer
```java
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;

// Kryo's Serializer is an abstract class, so we extend rather than implement it
public class LogEntrySerializer extends Serializer<LogEntry> {
    @Override
    public void write(Kryo kryo, Output output, LogEntry entry) {
        output.writeString(entry.getIp());
        output.writeLong(entry.getTimestamp());
        output.writeString(entry.getPath());
    }

    @Override
    public LogEntry read(Kryo kryo, Input input, Class<LogEntry> type) {
        // Read fields in the same order they were written
        return new LogEntry(
            input.readString(),
            input.readLong(),
            input.readString()
        );
    }
}
```

```java
conf.registerSerialization(LogEntry.class, LogEntrySerializer.class);
```
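To catch unregistered classes early rather than silently degrading to slow Java serialization, Storm can be told to fail instead:

```java
// Fail fast on any tuple field whose class lacks a registered Kryo serializer
conf.setFallBackOnJavaSerialization(false);
```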
X. Advanced Feature APIs
1. Distributed RPC
```java
// Server side: the DRPCSpout emits [args, return-info]; the last bolt must
// also pass the return-info along so ReturnResults can answer the caller.
DRPCSpout drpcSpout = new DRPCSpout("url-count");
builder.setSpout("drpc", drpcSpout);
builder.setBolt("counter", new UrlCounterBolt()).shuffleGrouping("drpc");
builder.setBolt("return", new ReturnResults()).shuffleGrouping("counter");

// Client side (Storm 1.x+ takes a config map first; older releases took host/port only)
DRPCClient client = new DRPCClient(conf, "drpc-server", 3772);
String result = client.execute("url-count", "https://example.com");
```
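Much of that plumbing can be avoided with LinearDRPCTopologyBuilder, which wires the DRPC spout and the ReturnResults bolt automatically (deprecated in newer releases in favor of Trident's DRPC support, but still illustrative). A sketch, reusing the conf and UrlCounterBolt from above:

```java
import org.apache.storm.StormSubmitter;
import org.apache.storm.drpc.LinearDRPCTopologyBuilder;

// Each bolt receives the request id as the first tuple field and must keep
// emitting it so the final result can be routed back to the caller.
LinearDRPCTopologyBuilder drpc = new LinearDRPCTopologyBuilder("url-count");
drpc.addBolt(new UrlCounterBolt(), 4);   // emits (id, count) pairs
StormSubmitter.submitTopology("url-count-drpc", conf, drpc.createRemoteTopology());
```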
2. Metrics Reporting
```java
public class MetricBolt extends BaseRichBolt {
    private transient Counter processedCounter; // com.codahale.metrics.Counter
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context,
                        OutputCollector collector) {
        this.collector = collector;
        // Metrics V2 (Storm 2.x): a Dropwizard counter in Storm's registry
        processedCounter = context.registerCounter("processed_records");
    }

    @Override
    public void execute(Tuple input) {
        processedCounter.inc();
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: no outgoing streams to declare
    }
}
```
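Metrics V2 exposes the other Dropwizard types as well. A brief sketch (metric names are hypothetical; assumes Storm 2.x, running inside prepare() with the counters map from the CountBolt example):

```java
// Meters track rates, gauges report point-in-time snapshots
com.codahale.metrics.Meter throughput = context.registerMeter("tuples_per_second");
context.registerGauge("distinct_ips", () -> counters.size());
throughput.mark(); // call once per processed tuple, e.g. in execute()
```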
Summary: The Core Value of the Storm API
1. A concise stream-processing abstraction
- The Spout/Bolt model expresses dataflow directly
- Tuple serves as the uniform data carrier
- Grouping strategies give flexible control over data distribution
2. Reliable fault tolerance
- ACK/fail gives precise control over the message lifecycle
- Anchoring extends that guarantee to derived tuples
- Timeouts keep lost tuples from stalling the system
3. Flexible extensibility
- Custom groupings, serializers, and metrics hooks
4. Rich ecosystem integration
- Kafka/JDBC/HDFS connectors
- Redis/HBase state backends
- Distributed RPC services
By combining these API components, developers can build everything from simple ETL pipelines to complex event-processing systems, serving business scenarios that demand millisecond-level latency.