In this project we use Spark Streaming to read from Kafka through Kafka's low-level API, so offsets are stored in ZooKeeper manually: the offset in ZK is updated only after a batch has been processed successfully. If Kafka hits a network problem, or the write to ZK fails, the offset in ZK and the offset in Kafka can diverge. When that happens, we need to compare the offsets held in Kafka against those held in ZK.
PS: Spark can also persist state via checkpointing. There are two options:
- Using checkpoints
- Keeping track of the offsets that have been processed.
Note, however, that it takes time for Spark to prepare and store checkpoints; in our case each checkpoint took about 3 s on average.
Strongly recommended reading: http://aseigneurin.github.io/2016/05/07/spark-kafka-achieving-zero-data-loss.html
Logic:
If the offset in ZK is smaller than EarliestOffset or greater than LatestOffset, the ZK offset is stale (the data it points to has been deleted or never existed), so update the ZK offset to EarliestOffset. If the ZK offset lies between EarliestOffset and LatestOffset, use the ZK offset as-is.
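The reconciliation rule above can be sketched as a small pure-Java helper. The class and method names here are illustrative (not from the project), and the offsets are plain `long` values for clarity:

```java
// Sketch of the offset-reconciliation rule: a ZK offset outside Kafka's
// retained range [earliest, latest] is stale and falls back to earliest.
public class OffsetReconciler {

    /**
     * Returns the offset to start consuming from, given the offset saved
     * in ZK and the earliest/latest offsets currently available in Kafka.
     */
    public static long reconcile(long zkOffset, long earliest, long latest) {
        if (zkOffset < earliest || zkOffset > latest) {
            // ZK offset points outside Kafka's retained range -> stale;
            // restart from the earliest offset Kafka still holds.
            return earliest;
        }
        // Otherwise trust the offset recorded in ZK.
        return zkOffset;
    }

    public static void main(String[] args) {
        System.out.println(reconcile(5, 10, 100));   // too old  -> 10
        System.out.println(reconcile(150, 10, 100)); // too new  -> 10
        System.out.println(reconcile(42, 10, 100));  // in range -> 42
    }
}
```

In production this comparison would run once per partition at startup, using the earliest/latest offsets fetched from Kafka (e.g. via an OffsetRequest on the SimpleConsumer).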
KafkaUtil (reads offsets from Kafka via SimpleConsumer)
```java
import java.io.Serializable;

public class KafkaUtil implements Serializable {

    private static final long serialVersionUID = -7708717328840L;

    // volatile is required for double-checked locking to be safe in Java
    private static volatile KafkaUtil kafkaUtil = null;

    private KafkaUtil() {
    }

    public static KafkaUtil getInstance() {
        if (kafkaUtil == null) {
            synchronized (KafkaUtil.class) {
                if (kafkaUtil == null) {
                    kafkaUtil = new KafkaUtil();
                }
            }
        }
        return kafkaUtil;
    }
}
```