gltest1测试GPU driver过程中图像卡死定位

一、现象

运行gltest1后,出现两个三角形的画面,随后画面卡住。

二、分析

2.1 初步排查

用ps指令查看gltest1的状态为S状态。未出现D状态。

过几分钟后,查看内核日志,发现hang task堆栈如下:

7781471 Jul  6 17:41:23 xdx kernel: [  242.759037] INFO: task kworker/u32:3:362 blocked for more than 120 seconds.
7781472 Jul  6 17:41:23 xdx kernel: [  242.759248]       Tainted: G           OE     5.4.0-42-generic #46~18.04.1-Ubuntu
7781473 Jul  6 17:41:23 xdx kernel: [  242.759464] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
7781474 Jul  6 17:41:23 xdx kernel: [  242.759690] kworker/u32:3   D    0   362      2 0x80004000
7781475 Jul  6 17:41:23 xdx kernel: [  242.759717] Workqueue: events_unbound commit_work [drm_kms_helper]
7781476 Jul  6 17:41:23 xdx kernel: [  242.759720] Call Trace:
7781477 Jul  6 17:41:23 xdx kernel: [  242.759730]  __schedule+0x293/0x740
7781478 Jul  6 17:41:23 xdx kernel: [  242.759738]  ? lock_timer_base+0x6b/0x90
7781479 Jul  6 17:41:23 xdx kernel: [  242.759742]  schedule+0x33/0xa0
7781480 Jul  6 17:41:23 xdx kernel: [  242.759747]  schedule_timeout+0x1d3/0x2f0
7781481 Jul  6 17:41:23 xdx kernel: [  242.759753]  ? dma_fence_add_callback+0x70/0x140
7781482 Jul  6 17:41:23 xdx kernel: [  242.759757]  dma_fence_default_wait+0x1e4/0x260
7781483 Jul  6 17:41:23 xdx kernel: [  242.759761]  ? dma_fence_release+0x130/0x130
7781484 Jul  6 17:41:23 xdx kernel: [  242.759765]  dma_fence_wait_timeout+0xfd/0x110
7781485 Jul  6 17:41:23 xdx kernel: [  242.759782]  drm_atomic_helper_wait_for_fences+0x63/0xc0 [drm_kms_helper]
7781486 Jul  6 17:41:23 xdx kernel: [  242.759798]  commit_tail+0x33/0x100 [drm_kms_helper]
7781487 Jul  6 17:41:23 xdx kernel: [  242.759811]  commit_work+0x12/0x20 [drm_kms_helper]
7781488 Jul  6 17:41:23 xdx kernel: [  242.759816]  process_one_work+0x20f/0x400
7781489 Jul  6 17:41:23 xdx kernel: [  242.759820]  worker_thread+0x34/0x410
7781490 Jul  6 17:41:23 xdx kernel: [  242.759824]  kthread+0x121/0x140
7781491 Jul  6 17:41:23 xdx kernel: [  242.759827]  ? process_one_work+0x400/0x400
7781492 Jul  6 17:41:23 xdx kernel: [  242.759831]  ? kthread_park+0x90/0x90
7781493 Jul  6 17:41:23 xdx kernel: [  242.759835]  ret_from_fork+0x1f/0x40

 函数drm_atomic_helper_wait_for_fences(kernel 5.4.191)代码如下,删掉了函数体里的注释:

int drm_atomic_helper_wait_for_fences(struct drm_device *dev,
				      struct drm_atomic_state *state,
				      bool pre_swap)
{
	struct drm_plane *plane;
	struct drm_plane_state *new_plane_state;
	int i, ret;

	for_each_new_plane_in_state(state, plane, new_plane_state, i) {
		if (!new_plane_state->fence)
			continue;

		WARN_ON(!new_plane_state->fb);

		ret = dma_fence_wait(new_plane_state->fence, pre_swap);
		if (ret)
			return ret;

		dma_fence_put(new_plane_state->fence);
		new_plane_state->fence = NULL;
	}

	return 0;
}

可知kworker线程在等待new_plane_state->fence,这个fence是哪儿来的呢?通过跟踪fence的来源,发下如下代码:

static int drm_atomic_plane_set_property(struct drm_plane *plane,
		struct drm_plane_state *state, struct drm_file *file_priv,
		struct drm_property *property, uint64_t val)
{
	struct drm_device *dev = plane->dev;
	struct drm_mode_config *config = &dev->mode_config;
	bool replaced = false;
	int ret;
......
	} else if (property == config->prop_in_fence_fd) {
		if (state->fence)
			return -EINVAL;

		if (U642I64(val) == -1)
			return 0;

		state->fence = sync_file_get_fence(val);
		if (!state->fence)
			return -EINVAL;

	} else if (property == config->prop_crtc_id) {
......
}

发现fence是由函数syn_file_get_fence获取的,参数val即fd是应用程序传入内核的,通过该fd找到相应的sync file,再将sync file的fence赋值给state->fence。

根据文档linux-5.4.191\Documentation\driver-api\sync_file.rst可知,该fence是'in-fences',来自用户态。

2.2 查找fence来源

2.2.1 初寻陷入困境

为了找到该fence在哪儿释放(signal fence),就需要找到该fence在哪儿创建,由谁释放该fence。查看sync file相关的接口,发现函数sync_file_create可以根据传入参数fence创建sync file。查找引用该函数的代码,发现有如下路径可能会调用sync_file_create:

1、kernel:prepare_signaling->setup_out_fence->sync_file_create,fence由drm_crtc_create_fence和drm_writeback_get_out_fence创建。

2、driver:pvr_sync_create_fence->sync_file_create,fence由pvr_fence_create创建,函数pvr_fence_create会创建pvr fence,该pvr fence有一个成员变量base(类型为struct dma_fence,即文章中称的fence),一个sync checkpoint,该sync checkpoint为driver实现的host与fw通信的方式,当GPU完成相应的操作后,fw向相应的地址(sync checkpoint)写入值,然后向host触发中断,host的中断处理函数再读取该地址的值,如果发现已经设置为相应的值,则对该fence调用函数dma_fence_signal。

在DC中断处理函数中加入如下调试信息,之所以在中断处理中加入调试信息,是因为drm_crtc_handle_vblank最终会调用函数drm_send_event_locked,该函数会对struct drm_pending_event的fence成员调用函数dma_fence_signal。而该fence就是函数prepare_signaling中创建的fence,并将该fence赋值给crtc_state->event->base.fence,随后在struct drm_crtc_helper_funcs的成员atomic_flush即函数xxx_crtc_atomic_flush中调用函数drm_crtc_arm_vblank_event,其中参数就是crtc_state->event,即将该event加入了链表dev->vblank_event_list,所以调试信息就是遍历该链表打印fence地址。开始最怀疑的就是该fence:

static bool xxx_crtc_handle_rq(void *pvData)
{
   int* rq_enter;
   struct drm_crtc* crtc = (struct drm_crtc*)pvData;
   struct xdx_crtc *xdx_crtc = to_xdx_crtc(crtc);
    ......
  
    //加入的调试信息
    struct drm_pending_vblank_event *e, *t;
    spin_lock_irq(&crtc->dev->event_lock);
    list_for_each_entry_safe(e, t, &crtc->dev->vblank_event_list, base.link) {
        printk("%s %d dma_fence:0x%lx e->pipe:%u sequence:%lu\n", __func__, __LINE__, e->base.fence, e->pipe, e->sequence);
    }
    spin_unlock_irq(&crtc->dev->event_lock);

   if( !drm_crtc_handle_vblank(crtc) ) {
       printk("%s : call drm_crtc_handle_vblank fault!!!", __func__ );
   }
   ......
}

在不重新编译内核的前提下,将driver的struct drm_mode_config_funcs xxx_mode_config_funcs的回调函数atomic_commit即drm_atomic_helper_commit(gltest1每次送显,kernel都会调用该回调函数进行送显)替换为自定义函数xxx_atomic_helper_commit,然后在函数xxx_atomic_helper_commit中加入调试信息,因为drm_atomic_helper_commit为kernel实现的函数,为了添加调试信息需要替换为自定义函数,添加调试信息后的代码如下:

int xxx_atomic_helper_commit(struct drm_device *dev, struct drm_atomic_state* state, bool nonblock)
{
    struct drm_plane *plane;
    struct drm_plane_state *old_plane_state, *new_plane_state;
    int i;

    for_each_oldnew_plane_in_state(state, plane, old_plane_state, new_plane_state, i) {

        if (!new_plane_state->fence) {
            printk("%s %d new_plane_state dma_fence null\n", __func__, __LINE__);
            continue;
        }
        else {
            printk("%s %d new_plane_state dma_fence:0x%lx\n", __func__, __LINE__, new_plane_state->fence);
        }

        if (!old_plane_state->fence) {
            printk("%s %d old_plane_state dma_fence null\n", __func__, __LINE__);
            continue;
        }
        else {
            printk("%s %d old_plane_state dma_fence:0x%lx\n", __func__, __LINE__, old_plane_state->fence);
        }
    }

    return drm_atomic_helper_commit(dev, state, nonblock);
}

对比plane state的fence的值与DC中断处理函数中打印处来的fence值,发现没有相等的,所以排除了该fence。

继续在xxx_atomic_helper_commit中加入如下调试信息,打印drm_writeback_get_out_fence创建的fence地址:

    for_each_oldnew_connector_in_state(state, connector, old_conn_state, new_conn_state, i) {
        if (!new_conn_state->writeback_job) {
            printk("%s %d new_conn_state writeback_job null\n", __func__, __LINE__);
            continue;
        }
        else {
            printk("%s %d new_conn_state dma_fence:0x%lx\n", __func__, __LINE__, new_conn_state->writeback_job->out_fence);
        }

        if (!old_conn_state->writeback_job) {
            printk("%s %d old_conn_state writeback_job null\n", __func__, __LINE__);
            continue;
        }
        else {
            printk("%s %d old_conn_state dma_fence:0x%lx\n", __func__, __LINE__, old_conn_state->writeback_job->out_fence);
        }
    }

同时在pvr_sync_create_fence中加入调试信息,打印fence地址。最后与plane state的fence值对比,发现都不相等。

2.2.2 柳暗花明

排除了sync_file_create,因为plane state的fence是从sync file的fence而来,那只能再追踪struct sync_file的fence从何而来。发现函数sync_file_set_fence也会设置sync file的fence,有如下调用关系:

sync_file_ioctl->sync_file_ioctl_merge->sync_file_merge->sync_file_set_fence

为了验证,用strace gltest1运行gltest1,查看gltest1发起的ioclt命令,发现如下打印:

poll([{fd=27, events=POLLIN}], 1, 0)    = 1 ([{fd=27, revents=POLLIN}])
dup(11)                                 = 31
poll([{fd=31, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=28, events=POLLIN}], 1, 0)    = 0 (Timeout)
ioctl(31, SYNC_IOC_MERGE, 0x7ffe860ad920) = 0
close(31)                               = 0
getpid()                                = 2755
getpid()                                = 2755

说明gltest1发起了ioctl SYNC_IOC_MERGE,然后调用函数sync_file_ioctl_merge。

阅读了sync_file_ioctl_merge和涉及到的dma fence array相关的代码,其原理就是将多个fence合并为一个fence array(参考struct dma_fence_array的成员fences),然后将struct dma_fence_array的base的地址赋值给sync file的fence,只有合并的fence都signal了fence,fence array才会signal fence。

所以决定把gltest1的sync file的信息都打印出来看看,将fence array的所有fence地址都打印出来,在函数xxx_crtc_handle_rq加入如下调试信息:

    for (i = 0; i < 100; ++i) {
        fence = sync_file_get_fence(i);

        if (fence) {
            printk("%s %d sync_file_get_fence dma_fence:0x%lx fd=%d\n", __func__, __LINE__, fence, i);

            struct file *file = fget(i);
            struct sync_file *sync_file = NULL;

            if (!file)
                continue;

            sync_file = file->private_data;

            printk("%s %d dma_fence:0x%lx sync_file_name:%s status:%s fd=%d\n", __func__, __LINE__, sync_file->fence, sync_file_get_name_shen(sync_file, buf, sizeof(buf)),
                sync_status_str(dma_fence_get_status(sync_file->fence)), i);

            fput(sync_file->file);

            if (dma_fence_is_array(sync_file->fence)) {
                int j;
                struct dma_fence_array *array = to_dma_fence_array(sync_file->fence);

                printk("%s %d dma_fence:0x%lx fd:%d num_fences:%u\n", __func__, __LINE__, sync_file->fence, i, array->num_fences);

                for (j = 0; j < array->num_fences; ++j) {
                    printk("%s %d dma_fence:0x%lx \n", __func__, __LINE__, array->fences[j]);
                }
            }
        }
    }

得到如下打印信息:

3453 Jul  9 00:21:45 kernel: [   39.242451] pvr_sync_create_fence 392:dma_fence:0xffff8f97f7cba300 sync_file->fence=0xffff8f97f7cba300 fd=28
3454 Jul  9 00:21:45 kernel: [   39.242452] pvr_sync_create_fence 392:dma_fence:0xffff8f97f7cbb380 sync_file->fence=0xffff8f97f7cbb380 fd=30 ----等于未signal的fence的地址
.......
3502 Jul  9 00:21:45 kernel: [   39.245532] xdx_atomic_helper_commit 281 sync_file_get_fence dma_fence:0xffff8f97f7cba300 fd=33
3503 Jul  9 00:21:45 kernel: [   39.245533] xxx_atomic_helper_commit 291 dma_fence:0xffff8f97f7cba300 sync_file_name:4-gl_DoKickTA_frame0 status:signaled fd=33
3504 Jul  9 00:21:45 kernel: [   39.245534] xxx_atomic_helper_commit 281 sync_file_get_fence dma_fence:0xffff8f97f7cbb380 fd=34
3505 Jul  9 00:21:45 kernel: [   39.245552] xxx_atomic_helper_commit 291 dma_fence:0xffff8f97f7cbb380 sync_file_name:4-gl_DoKick3D_frame0 status:active fd=34
.......
3511 Jul  9 00:21:45 kernel: [   39.245559] xxx_atomic_helper_commit 281 sync_file_get_fence dma_fence:0xffff8f97eea81e00 fd=36
3512 Jul  9 00:21:45 kernel: [   39.245562] xxx_atomic_helper_commit 291 dma_fence:0xffff8f97eea81e00 sync_file_name:dma_fence_array-unbound17-1 status:active fd=36  ----状态为active,即未signal的fence
3513 Jul  9 00:21:45 kernel: [   39.245562] xxx_atomic_helper_commit 300 dma_fence:0xffff8f97eea81e00 fd:36 num_fences:2  ------与plane state fence地址一样
3514 Jul  9 00:21:45 kernel: [   39.245562] xxx_atomic_helper_commit 303 dma_fence:0xffff8f97f7cba300   signal   -----已经signal的fence
3515 Jul  9 00:21:45 kernel: [   39.245563] xxx_atomic_helper_commit 303 dma_fence:0xffff8f97f7cbb380   active   -----未signal的fence
.......
3555 Jul  9 00:21:45 kernel: [   39.249157] xxx_atomic_helper_commit 247 new_plane_state dma_fence:0xffff8f97eea81e00  --------plane state fence地址

从中可以看出plane fence的地址为0xffff8f97eea81e00,它是一个fence array,因为地址为0xffff8f97f7cbb380的fence未signal,所以该fence array也未signal。而未signal的fence的地址等于pvr_sync_create_fence创建的fence。

所以根据上面分析可以得出,未释放fence为pvr_sync_create_fence创建的fence。

2.2.3 轻松定位问题

pvr_sync_create_fence创建pvr fence后,将pvr fence加入一个链表。GPU中断处理函数最终会调用函数pvr_fence_context_signal_fences遍历链表中的pvr fence,如果pvr fence的成员sync checkpoint被设置为相应的值,则认为pvr fence已经signal,然后对该pvr fence的成员fence(类型为 struct dma_fence)调用函数dma_fence_signal从而signal fence。

在相应的中断处理函数中加入相应的调试信息,发现GPU中断处理函数未被执行,最终定位为负责分发中断处理的中断总入口函数有bug,将按位与&错误写成逻辑与&&,造成虽然GPU中断触发了,但是中断处理程序却无法被执行。

错误代码如下,对于GPU中断,irq->tag[irq_src->int_type]等于0xFFFFFFFF,tag也等于0xFFFFFFFF,但是if语句里为false:

if ((irq->tag[irq_src->int_type] && tag) == tag)  //错误代码
    irq_src->funcs(...)

三、总结

解决该问题的难点在于不知道fence array这个东西,同时不想编译kernel。找到fence的来源后就很好解决了。其实之前有很多情况会出现图像卡死,这个问题解决后,所有图像卡死的问题都解决了。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值