问题简介:
cuda-gdb 调试 Program terminated with signal SIGKILL, Killed. The program no longer exists.
问题背景:
编写cuda程序,需要核对计算过程是否正确,所以使用cuda-gdb来进行单步调试,并显示运算过程及结果。
通过 nvcc -g -G location.cu -o location
编译cuda程序
通过 cuda-gdb location
调试程序
键入 r
程序断在 如图示:
报出错误如下:
CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x5555c9de90 (location2.cu:45)
(cuda-gdb) n
[Thread 0x7fb7ff5010 (LWP 10479) exited]
Program terminated with signal SIGKILL, Killed.
The program no longer exists.
问题分析:
在gdb调试时出现以上报错信息,通常是 内存相关问题,查看系统日志 tail /var/log/kern.log -n 1000
发现如下:
Aug 12 09:29:11 nx kernel: [ 1197.607814] nvgpu: 17000000.gv11b gv11b_fb_print_fault_info:710 [ERR] [MMU FAULT] mmu engine id: 65, ch id: 504, fault addr: 0x7ffff4e000, fault addr aperture: 0, fault type: invalid pde, access type: virt read,
Aug 12 09:29:11 nx kernel: [ 1197.608185] nvgpu: 17000000.gv11b gv11b_fb_print_fault_info:719 [ERR] [MMU FAULT] protected mode: 0, client type: gpc, client id: t1 2, gpc id if client type is gpc: 0,
Aug 12 09:29:12 nx kernel: [ 1199.027576] nvgpu: 17000000.gv11b gk20a_fifo_deferred_reset:1604 [ERR] chid: 503 is not bound to tsg
Aug 12 09:29:12 nx kernel: [ 1199.039202] nvgpu: 17000000.gv11b gk20a_tsg_unbind_channel:169 [ERR] Channel 504 unbind failed, tearing down TSG 4
第一段日志信息说是 [MMU FAULT] fault addr: 0x7ffff4e000
此时,我却任然一头雾水,在网上搜索了很多关于
“cuda-gdb 调试 Program terminated with signal SIGKILL, Killed.
The program no longer exists.”
这样的关键词,但所得结果均无效。搜索一下午,困惑一晚上,任然不得其解。第二天早上迷迷糊糊看向核函数的形参恍然大悟,问题不就是:将host变量引用传递给了device
在找到原因后才发现:其实,最重要的报错信息是如下,而我却忽略了
CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x5555c9de90 (location2.cu:45)
这报错信息明确说明 线程束非法地址,其异常被触发在 location2.cu 的45行处 。这意思不就是说地址非法了,再看源码
__global__
void gpu_location_kernel(DeviceArray2D<float>& struct_dev_arr_input_2D,
DeviceArray2D<float>& struct_dev_arr_w1_2D, DeviceArray2D<float>& struct_dev_arr_w2_2D, DeviceArray2D<float>& struct_dev_arr_w3_2D,
DeviceArray2D<float>& struct_dev_arr_b1_2D, DeviceArray2D<float>& struct_dev_arr_b2_2D, DeviceArray2D<float>& struct_dev_arr_b3_2D,
DeviceArray2D<float>& struct_dev_median_layer1_reuslt, DeviceArray2D<float>& struct_dev_median_layer2_reuslt, DeviceArray2D<float>& struct_dev_final_reuslt)
{
...
}
哈哈,我居然在核函数里使用引用,大错特错,再看源码调用 gpu_location_kernel
处,传递的参数居然是 host 变量的引用,大错特错。
想想为什么会犯这个错误呢?
其实我只是想通过结构体引用传递多个参数值,如果一个参数一个参数写作形参,实在是太长了。遂而将 device 地址放到了 host 结构体中,然后将 host 结构体的引用作为实参传递给核函数,这不出现非法地址,才怪!
解决办法:
只需要将 host 结构体 拷贝到 device 结构体中,然后将 device 结构体指针传递给核函数就可以了,这样既满足了传递多个参数而不追加形参,而又解决了目前的问题。
代码如下:
struct GPULocationKernelParameter{
float *dev_arr_input_2D;
size_t dev_arr_input_2D_pitch;
int dev_arr_input_2D_rows;
int dev_arr_input_2D_columns;
float *dev_arr_w1_2D;
size_t dev_arr_w1_2D_pitch;
int dev_arr_w1_2D_rows;
int dev_arr_w1_2D_columns;
float *dev_arr_w2_2D;
size_t dev_arr_w2_2D_pitch;
int dev_arr_w2_2D_rows;
int dev_arr_w2_2D_columns;
float *dev_arr_w3_2D;
size_t dev_arr_w3_2D_pitch;
int dev_arr_w3_2D_rows;
int dev_arr_w3_2D_columns;
float *dev_arr_b1_2D;
size_t dev_arr_b1_2D_pitch;
int dev_arr_b1_2D_rows;
int dev_arr_b1_2D_columns;
float *dev_arr_b2_2D;
size_t dev_arr_b2_2D_pitch;
int dev_arr_b2_2D_rows;
int dev_arr_b2_2D_columns;
float *dev_arr_b3_2D;
size_t dev_arr_b3_2D_pitch;
int dev_arr_b3_2D_rows;
int dev_arr_b3_2D_columns;
float *dev_median_layer1_reuslt;
size_t dev_median_layer1_reuslt_pitch;
int dev_median_layer1_reuslt_rows;
int dev_median_layer1_reuslt_columns;
float *dev_median_layer2_reuslt;
size_t dev_median_layer2_reuslt_pitch;
int dev_median_layer2_reuslt_rows;
int dev_median_layer2_reuslt_columns;
float *dev_final_reuslt;
size_t dev_final_reuslt_pitch;
int dev_final_reuslt_rows;
int dev_final_reuslt_columns;
};
__global__
void gpu_location_kernel(GPULocationKernelParameter *dev_kernelParameter)
{
GPULocationKernelParameter dev_struct_kernelParameter = *dev_kernelParameter;
...
}
补充:
1.参考的cuda-gdb调试文章
CUDA-GDB 使用
CUDA C编程(三十八)CUDA调试
GDB调试操作
GDB调试
2.感悟
这个问题我思考了一下午近5h,外加困扰一晚上。现在回想这个问题的原因在报出时就已经给出了,然而我却浑然不知。所以我还是应该认认真真地去看执行过程报出的错误,里面每一个单词,每一句话都有深意。