The Problem of Spark UDFs Being Executed Multiple Times

License: CC 4.0 BY-SA

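When we call a UDF on a DataFrame, we usually expect it to be applied exactly once per row. Spark does not actually guarantee this: in scenarios where the UDF's return value may be accessed more than once, Spark may invoke the UDF again for each access instead of computing the value once and reusing it.
UDFs should therefore be designed as pure functions, so that calling the UDF several times on the same record does not change the expected result. If that is not possible, consider implementing the logic with map/mapPartitions instead.
The issue description from the Spark Jira is quoted below.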

Spark assumes UDF’s are pure function; we do not guarantee that a function is only executed once. This is due to the way the optimizer works, and the fact that sometimes retry stages. We could add a flag to UDF to prevent this from working, but this would be a considerable engineering effort.
The example you give is not really a pure function, as its side effects makes the thread stop (changes state).
If you are connecting to an external service, then I would suggest using Dataset.mapPartitions(…) (similar to a generator). This will allow you to setup one connection per partition, and you can call a method as much or as little as you like.
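
The mapPartitions suggestion in the quote can be sketched roughly as follows in PySpark. This is only an illustration under stated assumptions: `FakeServiceClient`, `enrich_partition`, and `run_on_spark` are hypothetical names that do not come from the Jira issue, and the "external service" is simulated by an in-memory class. PySpark DataFrames have no `mapPartitions` of their own, so the sketch goes through `df.rdd.mapPartitions`, the closest equivalent of the `Dataset.mapPartitions` mentioned above.

```python
from typing import Iterable, Iterator, Tuple


class FakeServiceClient:
    """Hypothetical stand-in for a client of an external service."""

    connections_opened = 0  # counts constructed clients, for illustration only

    def __init__(self) -> None:
        FakeServiceClient.connections_opened += 1

    def lookup(self, key: str) -> str:
        # A real client would perform a network call here.
        return key.upper()

    def close(self) -> None:
        # A real client would release its connection here.
        pass


def enrich_partition(rows: Iterable[Tuple[str]]) -> Iterator[Tuple[str, str]]:
    """Process every row of one partition with a single client instance."""
    client = FakeServiceClient()  # set up once per partition, not once per row
    try:
        for (key,) in rows:  # each row is a one-field tuple (a Row is a tuple)
            yield key, client.lookup(key)
    finally:
        client.close()


def run_on_spark() -> None:
    """Apply enrich_partition through Spark (assumes PySpark is installed)."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[2]")
        .appName("mapPartitions-sketch")
        .getOrCreate()
    )
    df = spark.createDataFrame([("a",), ("b",), ("c",)], ["key"])
    # Spark calls enrich_partition once for each partition it processes.
    enriched = df.rdd.mapPartitions(enrich_partition).toDF(["key", "value"])
    enriched.show()
    spark.stop()
```

Calling `run_on_spark()` on a machine with PySpark installed runs the sketch locally. The side-effecting setup now happens once per partition invocation instead of once per UDF call, but Spark can still re-run a partition when it retries a task or stage, so writes to an external system should remain idempotent.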

Jira link: UDFs are run too many times