Spark Streaming Kafka data loss problem

This post looks at the data loss that can occur when Spark Streaming consumes from Kafka, in particular how manually committing offsets to ZooKeeper can lose data when the network connection is interrupted. A concrete log excerpt is used to analyze the cause.

For Spark Streaming, to keep data loss to a minimum, we manage the offsets ourselves, using a scheme that manually commits offsets to ZooKeeper (zk):
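Here is a minimal sketch of that scheme, assuming Spark 1.x with the spark-streaming-kafka 0.8 direct API (the same SimpleConsumer-based stack that appears in the traces below) and Apache Curator for the ZooKeeper access. The object and helper names, connection strings, topic, partition count, and ZK path layout are all illustrative, not the exact code used here:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object ZkOffsetDemo {
  val zkConnect  = "localhost:2181"                        // assumption: ZK quorum address
  val offsetRoot = "/consumers/datamining-group/offsets"   // assumption: ZK path layout

  val zk = CuratorFrameworkFactory.newClient(zkConnect, new ExponentialBackoffRetry(1000, 3))
  zk.start()

  // Read the last committed offset for each partition from ZooKeeper, if present.
  // On a first run the map comes back empty; you would seed it from the topic's
  // earliest offsets instead.
  def readOffsets(topic: String, partitions: Int): Map[TopicAndPartition, Long] =
    (0 until partitions).flatMap { p =>
      val path = s"$offsetRoot/$topic/$p"
      Option(zk.checkExists().forPath(path)).map { _ =>
        TopicAndPartition(topic, p) -> new String(zk.getData.forPath(path)).toLong
      }
    }.toMap

  // Write an offset back, creating the znode on first use.
  def commitOffset(topic: String, partition: Int, offset: Long): Unit = {
    val path = s"$offsetRoot/$topic/$partition"
    val data = offset.toString.getBytes
    if (zk.checkExists().forPath(path) == null)
      zk.create().creatingParentsIfNeeded().forPath(path, data)
    else
      zk.setData().forPath(path, data)
  }

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("zk-offsets"), Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val fromOffsets = readOffsets("datamining", partitions = 16)  // assumption: partition count

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,
      (String, String)](ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreach { case (_, v) => println(v) }  // placeholder output action
      // Commit only after the output above has finished, so that a cleanly failed
      // batch leaves the old offsets in ZooKeeper and the range gets reprocessed.
      ranges.foreach(r => commitOffset(r.topic, r.partition, r.untilOffset))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

The point of the ordering inside foreachRDD is that offsets are written only after the batch's output action completes. As the log below shows, in practice this was not enough to prevent loss when the fetch connection itself broke mid-batch.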


2017-10-26 11:46:22 Executor task launch worker-3 org.apache.spark.streaming.kafka.MyKafkaRDD INFO:Computing topic datamining, partition 8 offsets 3883 -> 3903
The error for this offset range appeared once, and by the time of the next error the offset had already advanced past it, meaning the Kafka data in between was lost. Presumably the next batch was planned from where the failed one ended and the offsets were committed anyway, so the skipped records were never re-read.

TODO: test whether Spark's built-in checkpointing exhibits the same problem; a sketch of such a test follows.
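A minimal sketch of that test, assuming the same Spark 1.x streaming API: run with a checkpoint directory and StreamingContext.getOrCreate, kill the connection mid-batch, restart, and check which offset ranges get replayed. The checkpoint path and object name are illustrative.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointTest {
  val checkpointDir = "hdfs:///checkpoints/datamining"   // assumption: checkpoint path

  def createContext(): StreamingContext = {
    val ssc = new StreamingContext(new SparkConf().setAppName("ckpt-test"), Seconds(10))
    ssc.checkpoint(checkpointDir)
    // build the same direct Kafka stream and output operations here
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On a clean start this builds a fresh context; after a crash it restores the
    // stream, including unfinished batches and their offset ranges, from the checkpoint.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}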


Here is a log excerpt. While this data was being processed, the network connection dropped, interrupting the consumer's fetch connection; the data was lost, yet the offset still advanced.

2017-10-26 11:46:22 Executor task launch worker-1 kafka.utils.VerifiableProperties INFO:Property zookeeper.connect is overridden to 
2017-10-26 11:46:22 task-result-getter-3 org.apache.spark.scheduler.TaskSetManager WARN:Lost task 2.0 in stage 494.0 (TID 1972, localhost): java.nio.channels.ClosedChannelException
at kafka.network.BlockingChannel.send(BlockingChannel.scala:110)
at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:98)
at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:83)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:132)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:132)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:132)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:131)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:131)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:131)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:130)
at org.apache.spark.streaming.kafka.MyKafkaRDD$KafkaRDDIterator.fetchBatch(MyKafkaRDD.scala:192)
at org.apache.spark.streaming.kafka.MyKafkaRDD$KafkaRDDIterator.getNext(MyKafkaRDD.scala:208)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)


2017-10-26 11:46:22 Executor task launch worker-3 org.apache.spark.executor.Executor ERROR:Exception in task 3.0 in stage 494.0 (TID 1973)
java.nio.channels.ClosedChannelException
at kafka.network.BlockingChannel.send(BlockingChannel.scala:110)
at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:98)
at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:83)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:132)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:132)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:132)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:131)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:131)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:131)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:130)
at org.apache.spark.streaming.kafka.MyKafkaRDD$KafkaRDDIterator.fetchBatch(MyKafkaRDD.scala:192)
at org.apache.spark.streaming.kafka.MyKafkaRDD$KafkaRDDIterator.getNext(MyKafkaRDD.scala:208)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
2017-10-26 11:46:22 dispatcher-event-loop-3 org.apache.spark.scheduler.TaskSetManager INFO:Starting task 2.0 in stage 496.0 (TID 1977, localhost, partition 2,ANY, 2004 bytes)
2017-10-26 11:46:22 Executor task launch worker-3 org.apache.spark.executor.Executor INFO:Running task 2.0 in stage 496.0 (TID 1977)
2017-10-26 11:46:22 Executor task launch worker-3 org.apache.spark.streaming.kafka.MyKafkaRDD INFO:Computing topic datamining, partition 8 offsets 3883 -> 3903
2017-10-26 11:46:22 Executor task launch worker-3 kafka.utils.VerifiableProperties INFO:Verifying properties
