Spark Streaming on YARN: Optimization Strategies for Long-Running Jobs

Article link: https://blog.youkuaiyun.com/qq_31622585/article/details/97392764

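This article discusses how to optimize fault tolerance, performance, security, log management, and graceful shutdown when running Spark Streaming jobs on YARN. It recommends raising the Application Master retry count, adjusting the maximum number of executor failures, and enabling speculative execution to protect performance. On the security side, it covers Kerberos ticket expiry. For log management it recommends the ELK stack. Finally, it proposes a marker file as a way to implement graceful shutdown.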


Fault tolerance

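In YARN cluster mode, the Spark driver runs in the same container as the Application Master, which is the first YARN container allocated to the application. This process is responsible for driving the application and for requesting resources (Spark executors) from YARN. Importantly, the Application Master removes the need for any other process to stay alive for the lifetime of the application. Even if the Hadoop edge node from which the Spark Streaming job was submitted fails, the application is not affected.
To run a Spark Streaming application in cluster mode, make sure the spark-submit command is given the following parameters: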

spark-submit --master yarn --deploy-mode cluster

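Because the Spark driver and the Application Master share a single JVM, any error in the Spark driver brings down our long-running job. Fortunately, the maximum number of attempts made to re-run the application is configurable. It is reasonable to set a value higher than the default of 2 (which is derived from the YARN cluster property yarn.resourcemanager.am.max-attempts). For me, 4 works quite well; a higher value can cause unnecessary restarts even when the cause of the failure is permanent.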

spark-submit --master yarn --deploy-mode cluster \
    --conf spark.yarn.maxAppAttempts=4

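If the application runs for days or weeks on a heavily utilized cluster without being restarted or redeployed, the 4 attempts may be used up within a few hours. To avoid this, the attempt counter should be reset every hour.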

spark-submit --master yarn --deploy-mode cluster \
    --conf spark.yarn.maxAppAttempts=4 \
    --conf spark.yarn.am.attemptFailuresValidityInterval=1h
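
To make the effect of the validity interval more concrete, here is a simplified model in shell. It is only a sketch, not YARN's actual implementation: it assumes that failures older than the interval are simply no longer counted toward the limit. The function name count_recent_failures and the failure timestamps are hypothetical.

```shell
# Simplified model (an assumption, not YARN's implementation):
# count only the failures that fall inside the validity window.
# Usage: count_recent_failures NOW WINDOW FAILURE_TIME...
# All values are in minutes.
count_recent_failures() {
    now=$1
    window=$2
    shift 2
    recent=0
    for failed_at in "$@"; do
        if [ $((now - failed_at)) -lt "$window" ]; then
            recent=$((recent + 1))
        fi
    done
    echo "$recent"
}

# Hypothetical history: four failures spread over five hours.
# With a 60-minute window, only the most recent one is still counted.
recent_failures=$(count_recent_failures 300 60 10 100 200 290)
echo "failures counted at minute 300: $recent_failures"
```

Under this model, a job that fails occasionally over several days never accumulates 4 counted failures, whereas 4 failures inside the same hour still stop the application.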

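Another important setting is the maximum number of executor failures allowed before the application fails. The default is max(2 * num executors, 3), which suits batch jobs well but not long-running jobs. This property has a corresponding validity interval, which should be set as well.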

spark-submit --master yarn --deploy-mode cluster \
    --conf spark.yarn.maxAppAttempts=4 \
    --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
    --conf spark.yarn.max.executor.failures={8 * num_executors} \
    --conf spark.yarn.executor.failuresValidityInterval=1h
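
The {8 * num_executors} placeholder in the command above has to be replaced with an actual number before submitting. A minimal sketch of that arithmetic in shell follows, assuming a hypothetical job with 10 executors; the variable names are illustrative only.

```shell
num_executors=10   # hypothetical executor count for this job

# Default threshold as described above: max(2 * num_executors, 3)
default_failures=$((2 * num_executors))
if [ "$default_failures" -lt 3 ]; then
    default_failures=3
fi

# More tolerant threshold used for the long-running job
max_executor_failures=$((8 * num_executors))

echo "default: $default_failures, proposed: $max_executor_failures"
echo "--conf spark.yarn.max.executor.failures=$max_executor_failures"
```

The computed value can then be passed in place of the placeholder on the spark-submit command line.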

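For long-running jobs, you may also want to raise the maximum number of task failures allowed before the job is abandoned. By default, a task is retried 4 times, after which the job fails.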

spark-submit --master yarn --deploy-mode cluster \
    --conf spark.yarn.maxAppAttempts=4 \
    --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
    --conf spark.yarn.max.executor