[Spark Streaming Basics] -- How to Gracefully Stop Spark Streaming

This post explains how to shut down an Apache Spark Streaming application gracefully. The original English blog post is reproduced in full below, with a link to the source, as a reference for further study.


Original article (with thanks to the author): http://why-not-learn-something.blogspot.hk/2016/05/apache-spark-streaming-how-to-do.html

Since my English is limited and my understanding may fall short, the complete English blog post is reposted below for reference.

Apache Spark Streaming: How to do Graceful Shutdown




In my current project, I am using Spark Streaming as the processing engine, Kafka as the data source, and Mesos as the cluster/resource manager.
To be precise, I am using the Direct Kafka approach in Spark for data ingestion.
Once a streaming application is up and running, there are several things to do to make it stable, consistent, and seamless.
One of them is ensuring a graceful shutdown to avoid data loss. When restarting the streaming application, deploying changes, and so on, we have to make sure that the shutdown happens gracefully and leaves the application in a consistent state. This means that once the application receives the shutdown signal, it should not accept any more data for processing, but at the same time it should make sure that all the data/jobs for the Kafka offsets currently in memory get processed before the application is brought down. When the application restarts, it reads the Kafka offsets from the checkpoint directory and resumes fetching data from Kafka accordingly.
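To make that recovery flow concrete, here is a minimal sketch of a driver using the direct Kafka approach together with checkpoint-based recovery. It assumes the spark-streaming-kafka 0.8 integration that matches the Spark 1.x versions discussed here; the checkpoint path, broker address, and topic name are placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DataPipelineStreamDriver {
  val checkpointDir = "hdfs:///checkpoints/data-pipeline" // placeholder path

  // Builds a fresh context; only invoked when no checkpoint exists yet.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("DataPipelineStreamDriver")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder broker
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events")) // placeholder topic

    stream.foreachRDD { rdd =>
      // process the batch here
    }
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On a restart, the Kafka offsets stored in the checkpoint are used to resume consumption.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```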

In this post, I am going to share the details of how to do a graceful shutdown of a Spark Streaming application.
There are two ways:
1. Explicitly calling the shutdown hook in the driver program:

       sys.ShutdownHookThread {
         log.info("Gracefully stopping Spark Streaming Application")
         // 1st argument: also stop the SparkContext; 2nd argument: shut down gracefully
         ssc.stop(stopSparkContext = true, stopGracefully = true)
         log.info("Application stopped")
       }


The first boolean argument of the ssc.stop method stops the associated SparkContext, while the second boolean argument requests a graceful shutdown of the StreamingContext.
I tried the above approach in my Spark application with version 1.5.1, but it did not work. The streaming application shut down gracefully, but the SparkContext remained alive, or rather hung: the driver and executor processes did not exit, and I had to use the kill -9 command to forcefully terminate the SparkContext (which kills the driver and executors).
Later, I found out that this approach is old and was meant for Spark versions before 1.4. For newer Spark versions, we use the second approach.

2. The spark.streaming.stopGracefullyOnShutdown parameter:

        sparkConf.set("spark.streaming.stopGracefullyOnShutdown", "true")

Setting this parameter to true in the Spark configuration ensures a proper graceful shutdown in newer Spark versions (1.4 onwards). When using it, we should not add the explicit shutdown hook from the first approach or call the ssc.stop method in the driver. We simply set the parameter and then call ssc.start() and ssc.awaitTermination(); there is no need to call ssc.stop, and calling it anyway might cause the application to hang during shutdown.
See the Spark source code for how this parameter is used internally: https://github.com/apache/spark/blob/8ac71d62d976bbfd0159cac6816dd8fa580ae1cb/streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala#L732
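For illustration, a minimal driver sketch for this second approach might look like the following. The application name and batch interval are placeholder values; the point is simply that the flag is set on the configuration and ssc.stop is never called explicitly:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf()
  .setAppName("DataPipelineStreamDriver")
  // Ask Spark to stop the StreamingContext gracefully when the JVM receives a shutdown signal
  .set("spark.streaming.stopGracefullyOnShutdown", "true")

val ssc = new StreamingContext(sparkConf, Seconds(10))

// ... define the input DStreams and the processing logic here ...

ssc.start()
ssc.awaitTermination() // no explicit ssc.stop(); Spark's shutdown hook handles it
```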

How to pass the shutdown signal:
Now we know how to ensure a graceful shutdown in Spark Streaming, but how do we pass the shutdown signal to the application? One naive option is to press CTRL+C in the terminal where the driver program is running, but that is obviously not a good option.
One solution, which I am using, is to grep for the driver process of the Spark Streaming application and send it a SIGTERM signal. When the driver receives this signal, it initiates the graceful shutdown of the application.
We can put a command like the one below in a shell script and run the script to pass the shutdown signal:

    ps -ef | grep spark | grep <DriverProgramName> | awk '{print $2}' | xargs kill -SIGTERM

For example:

    ps -ef | grep spark | grep DataPipelineStreamDriver | awk '{print $2}' | xargs kill -SIGTERM

One limitation of this approach is that it can only be run on the machine where the driver program runs, not on any other node of the Spark cluster.

If you know of a better approach, please do share.

### How to Develop Spark Streaming Applications with Scala

#### 1. Environment setup

To develop with Scala and Apache Spark on Windows, you first need to set up the environment. This includes installing Scala and configuring its runtime environment.

```bash
# Download and extract the Scala distribution into a directory of your choice
# https://www.scala-lang.org/download/

# Configure the SCALA_HOME and PATH environment variables
export SCALA_HOME=/path/to/scala
export PATH=$PATH:$SCALA_HOME/bin
```

Next, make sure the JDK is installed correctly and its path is added to `JAVA_HOME`. Applications that run on the JVM (such as Scala and Spark) all require Java support.

---

#### 2. Creating a SparkConf and a StreamingContext

Before writing a Spark Streaming application, initialize a `SparkConf` object to define the application name and execution mode (local or cluster), then create a `StreamingContext` to manage the streaming computation.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val appName = "MySparkStreamingApp"
val master = "local[*]" // or a cluster URL when running in a distributed environment

// Initialize the SparkConf
val conf = new SparkConf()
  .setAppName(appName)
  .setMaster(master)

// Create the StreamingContext, triggering a batch every second
val ssc = new StreamingContext(conf, Seconds(1))
```

The snippet above shows the basic skeleton of a Spark Streaming program. `Seconds(1)` means a new batch job is started every second.

---

#### 3. Connecting a data source and processing the DStream

The DStream is one of the core abstractions in Spark Streaming and represents a continuous stream of data. Input streams can be obtained in several ways, for example from file systems, socket connections, or Kafka message queues.

Here is a simple example based on socket input:

```scala
// Receive a stream of data from a network port (assuming a server listening on localhost:9999)
val lines = ssc.socketTextStream("localhost", 9999)

// Split each line into words and count their frequencies
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)

// Print the results to the console
wordCounts.print()

ssc.start()            // start the StreamingContext
ssc.awaitTermination() // wait for the termination signal
```

This part reads text messages from a remote host and counts the words they contain.

---

#### 4. RDDs and their transformations

Although Spark Streaming is mainly organized around DStreams, it still relies on RDDs (resilient distributed datasets) under the hood, which means developers can also design more complex processing pipelines with the familiar RDD API.

For example, the code above uses several common methods:

- **flatMap**: map a single record to multiple sub-items;
- **map**: transform the shape of each element;
- **reduceByKey**: aggregate values by key.

For more complex requirements, window functions and stateful operations may come into play; a small windowed sketch is given after the summary below.

---

#### Summary

Scala is not only a powerful programming language, it is also particularly well suited for writing applications on big-data platforms such as Spark, because both share the same virtual machine architecture, the Java Virtual Machine (JVM). With the approach introduced here, readers should be able to get started quickly with the basics of Spark Streaming.
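As promised above, here is a minimal sketch of a windowed word count over the same socket stream; the window and slide durations are arbitrary illustrative values (both must be multiples of the batch interval):

```scala
import org.apache.spark.streaming.Seconds

// Reuse the `lines` socket stream from the example above.
// Count words over a sliding window of the last 30 seconds, recomputed every 10 seconds.
val windowedCounts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

windowedCounts.print()
```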