Today I read an article explaining how spark-streaming's direct stream is implemented. The article is at this address (the page loads quite slowly; after opening the link you may have to wait a few minutes before the content shows up):
Exactly-once Spark Streaming from Apache Kafka
To summarize a few points:
1. spark-streaming creates a stream of RDDs, one per batch interval. When building each RDD, the direct stream first defines the RDD from the offsets it last read; only afterwards, when that defined RDD is computed, is the data actually fetched from Kafka to materialize it (a minimal setup sketch follows after the red code below).
2. Each RDD produced by the direct stream implements HasOffsetRanges: through it the RDD exposes one OffsetRange for every partition of every topic, recording exactly which offsets that partition covers in this batch (see the second sketch after the code).
3. By default, spark-streaming consumes Kafka messages with at-most-once semantics; if you want exactly-once, it costs considerable resources to save the offset of every RDD partition (or to make the writes idempotent). For example, the article uses the following code (highlighted in red in the original) to write each RDD partition's messages idempotently:
import scalikejdbc._ // provides DB and the sql"..." interpolation used below

stream.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    // make sure connection pool is set up on the executor before writing
    SetupJdbc(jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)
    iter.foreach { case (key, msg) =>
      DB.autoCommit { implicit session =>
        // the unique key for idempotency is just the text of the message itself, for example purposes
        sql"insert into idem_data(msg) values (${msg})".update.apply()
      }
    }
  }
}
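
For context on point 1, here is a minimal sketch of creating such a direct stream with the spark-streaming-kafka (0.8) API the article is based on; the app name, broker address, topic, and 5-second batch interval are made-up values for illustration:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("direct-stream-sketch")
// one RDD will be defined per 5-second interval
val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092") // assumed broker
val topics = Set("some-topic") // assumed topic name

// Defining the stream only records, per batch, the offset range each
// partition should cover; the messages themselves are fetched from Kafka
// later, when each batch's RDD is actually computed.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)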
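
For point 2, a small sketch of how the per-partition offsets can be read back out of each batch's RDD via HasOffsetRanges (reusing the stream defined above):

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // every RDD from the direct stream implements HasOffsetRanges;
  // offsetRanges has one entry per topic partition in this batch
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { osr =>
    println(s"topic=${osr.topic} partition=${osr.partition} " +
      s"from=${osr.fromOffset} until=${osr.untilOffset}")
  }
}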
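
Besides idempotent writes, the article's other route to exactly-once is to save each partition's offset in the same database transaction as the results, which is where the extra cost mentioned in point 3 comes from. A rough sketch of that idea, reusing SetupJdbc and the JDBC parameters from the red code above; the results and txn_offsets table names here are my own assumptions, not the article's exact schema:

import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}
import scalikejdbc._

stream.foreachRDD { rdd =>
  // capture the offset ranges on the driver before the tasks run
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    // the task's partition id indexes into offsetRanges
    val osr: OffsetRange = offsetRanges(TaskContext.get.partitionId)
    SetupJdbc(jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)
    DB.localTx { implicit session =>
      iter.foreach { case (_, msg) =>
        sql"insert into results(msg) values (${msg})".update.apply()
      }
      // commit the new offset in the same transaction as the results,
      // so on failure both the data and the offset roll back together
      sql"""update txn_offsets set off = ${osr.untilOffset}
            where topic = ${osr.topic} and part = ${osr.partition}""".update.apply()
    }
  }
}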