java.lang.StackOverflowError when calling count()

This article discusses how to handle stack overflow errors caused by long lineages in Spark using caching and checkpointing: how to cache and checkpoint at specific iterations, and how to avoid writing to HDFS (and still prevent the error) by using replication and a new BlockRDD when in-memory data may be lost.
Just to add some more clarity to the discussion, there is a difference between caching to memory and checkpointing, when considered from the lineage point of view.

When an RDD is checkpointed, the data of the RDD is saved to HDFS (or any Hadoop-API-compatible fault-tolerant storage) and the lineage of the RDD is truncated. This is okay because in case of a worker failure, the RDD data can be read back from the fault-tolerant storage.

When an RDD is cached, the data of the RDD is cached in memory, but the lineage is not truncated. This is because if the in-memory data is lost, the lineage is required to recompute the data.

So to deal with stack overflow errors due to long lineage, caching alone is not going to help. You have to checkpoint the RDD, and I believe the correct way to do this is the following:
1. Mark the RDD of every Nth iteration for both caching and checkpointing.
2. Before generating the (N+1)th iteration's RDD, force materialization of this RDD by calling rdd.count(). This persists the RDD in memory, saves it to HDFS, and truncates the lineage. If you mark every Nth iteration's RDD for checkpointing but force materialization only after ALL the iterations (rather than after every Nth iteration, as suggested here), you will still hit stack overflow errors.

Yes, this checkpointing and materialization will definitely decrease performance, but that is a limitation of the current implementation.

If you are brave enough, instead of relying on checkpointing to HDFS to truncate the lineage, you can try the following.
1. Persist the Nth RDD with replication (see the different StorageLevels); this replicates the in-memory RDD between workers within Spark. Let's call this RDD R.
2. Force it to materialize in memory.
3. Create a modified RDD R` that has the same data as R but does not have the lineage. This is done by creating a new BlockRDD from the ids of the blocks of data representing the in-memory R (I can elaborate on that if you want).

This avoids writing to HDFS (the data is replicated in Spark memory), truncates the lineage (via the new BlockRDDs), and avoids the stack overflow error.

Hope this helps



The long lineage causes a long/deep Java object tree (the DAG of RDD objects), which needs to be serialized as part of task creation. During serialization, the whole object DAG has to be traversed, leading to the stack overflow error.
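This effect is easy to reproduce outside Spark: any recursive serializer hits the same wall on a sufficiently deep object graph. A minimal Python sketch using `pickle` as a stand-in for Java serialization (the `Node` class and the depth of 100,000 are arbitrary choices for illustration):

```python
import pickle

class Node:
    """A trivially deep object graph: each node points to the previous one."""
    def __init__(self, nxt):
        self.next = nxt

# Build a chain far deeper than the interpreter's recursion limit.
node = None
for _ in range(100_000):
    node = Node(node)

try:
    pickle.dumps(node)  # recursive traversal of the whole graph
    overflowed = False
except RecursionError:  # Python's analogue of Java's StackOverflowError
    overflowed = True

print("serialization overflowed the stack:", overflowed)
```

The serializer recurses once per object in the graph, so the depth of the graph, not its total size, is what blows the stack; this is the same mechanism as serializing a long chain of RDD objects.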

TD


A `StackOverflowError` when calling `printStackTrace` from the JNI layer is usually related to the thread stack size limit or to recursion depth. Possible causes and solutions are below.

### Cause analysis

1. **Insufficient thread stack space**
   Each thread in Android has a default stack size of 64KB; if the call stack in the JNI layer is deep (for example, with many nested calls), the stack can overflow. This is especially common when calling the Java-level `printStackTrace` method, because it walks the entire exception stack and prints it, which can exhaust the remaining stack space[^2].

2. **Stack overflow caused by unbounded recursion**
   If there is unbounded recursion in the JNI layer or the Java layer — for example, a `readMessageArray` method that mistakenly calls itself with no termination condition — a `StackOverflowError` will also result. This kind of bug tends to surface when handling complex data structures or exception stacks[^3].

3. **Overly deep JNI call chains**
   When calling Java methods from JNI, a long call chain or heavy back-and-forth between the native and Java layers can also exhaust the stack. Frequent calls to `printStackTrace` inside exception handlers make the stack consumption worse[^1].

---

### Solutions

1. **Increase the thread stack size**
   Create a new thread with a larger stack to avoid the overflow. For example, in the Java layer, the `Thread` constructor accepts a stack size:

   ```java
   new Thread(null, new Runnable() {
       @Override
       public void run() {
           // code that calls printStackTrace
       }
   }, "CustomStackSizeThread", 128 * 1024).start(); // stack size: 128KB
   ```

   This approach suits scenarios involving deep call chains or heavy exception handling[^2].

2. **Avoid unbounded recursion**
   Check the code for recursive-call mistakes. In the `readMessageArray` example, the method mistakenly called itself instead of another method, causing infinite recursion. Make sure every recursive call has a clear termination condition, and avoid triggering recursion from exception handlers[^3].

3. **Reduce JNI call depth**
   Streamline the interaction between the JNI and Java layers to avoid deeply nested calls. If stack information needs to be logged frequently, consider logging in the native layer instead of calling the Java-level `printStackTrace`.

4. **Use an alternative logging mechanism**
   In the JNI layer, `__android_log_print` can write error messages directly to Logcat, bypassing the Java exception machinery entirely:

   ```cpp
   #include <android/log.h>
   __android_log_print(ANDROID_LOG_ERROR, "JNI_LOG", "An error occurred: %s", error_message);
   ```

   This sidesteps the stack overflow and also performs better[^4].

---

### Example: native logging from JNI

```cpp
#include <jni.h>
#include <android/log.h>

extern "C" JNIEXPORT void JNICALL
Java_com_example_myapp_NativeLib_logErrorMessage(JNIEnv *env, jobject /* this */) {
    __android_log_print(ANDROID_LOG_ERROR, "NativeLog", "This is a native error message");
}
```
---