Locating the Xorg segmentation fault after xgltest1 exits (with an analysis of the bo mapping flow)

1. Background

The new architecture uses a redesigned KMD, with our own bo (buffer object) management, device management, and related functionality implemented in libdrm. This problem appeared on the new architecture.

In the old architecture, the GPU, DC, and PCIE were three separate kernel modules. The GPU and DC each registered their own DRM driver, so they had different device files (/dev/dri/xx) and their mmap paths could not interfere with each other. The new architecture merges them into a single DRM driver, so the GPU and DC share one device file and there is only one mmap entry point.

In the old architecture, the GPU mapped memory by PMR handle, while the DC mapped via the gem object's mmap offset (gem handle). In the new architecture, both the GPU and the DC map through the DRM framework's gem mmap offset, so a PMR handle must first be converted to a gem object mmap offset before it can be mapped.

2. Problem description

1. Load the new-architecture KMD kernel module.
2. Start Xorg.
3. Run xgltest1; an image is displayed.
4. Exit xgltest1.
5. Xorg crashes with a segmentation fault.

The error log is as follows:

........
PVR: EGL rendertarget cache stats:
PVR:    Hits:   0
PVR:    Misses: 0
PVR:    High watermark: 0
PVRSRVGetClientEventFilter 436 psDevConnection->pui32InfoPage:0x7f6a182b0000 offset=5
PVRSRVGetClientEventFilter 442 psDevConnection->pui32InfoPage:0x7f6a182b0000 offset=5
OSMUnmapPMR 300 gInfoPage=0x7f6a182b0000
xdxgpu_bo_unmap 222 unmap_addr=0x7f6a182b0000
PVRSRVGetClientEventFilter 436 psDevConnection->pui32InfoPage:0x7f6a182b0000 offset=5
(EE)
(EE) Backtrace:
(EE) 0: /usr/local/bin/Xorg (xorg_backtrace+0xbf) [0x5636fd331098]
(EE) 1: /usr/local/bin/Xorg (0x5636fd17b000+0x1ba895) [0x5636fd335895]
(EE) 2: /lib/x86_64-linux-gnu/libc.so.6 (0x7f6a15d4d000+0x3ef10) [0x7f6a15d8bf10]
(EE) 3: /usr/lib/x86_64-linux-gnu/libsrv_um.so (PVRSRVGetClientEventFilter+0xf2) [0x7f6a110498e9]
(EE) 4: /usr/lib/x86_64-linux-gnu/libGLESv2_PVR_MESA.so (0x7f6a10623000+0x274528) [0x7f6a10897528]
(EE) 5: /usr/lib/x86_64-linux-gnu/libGLESv2_PVR_MESA.so (0x7f6a10623000+0x274f67) [0x7f6a10897f67]
(EE) 6: /usr/lib/x86_64-linux-gnu/libGLESv2_PVR_MESA.so (0x7f6a10623000+0x275674) [0x7f6a10898674]
(EE) 7: /usr/lib/x86_64-linux-gnu/libGLESv2_PVR_MESA.so (0x7f6a10623000+0x27aaff) [0x7f6a1089daff]
(EE) 8: /usr/lib/x86_64-linux-gnu/libGLESv2_PVR_MESA.so (0x7f6a10623000+0x277a52) [0x7f6a1089aa52]
(EE) 9: /usr/lib/x86_64-linux-gnu/libGLESv2_PVR_MESA.so (0x7f6a10623000+0x27ada6) [0x7f6a1089dda6]
(EE) 10: /usr/lib/x86_64-linux-gnu/libGLESv2_PVR_MESA.so (0x7f6a10623000+0x27ae09) [0x7f6a1089de09]
(EE) 11: /usr/lib/x86_64-linux-gnu/libGLESv2_PVR_MESA.so (0x7f6a10623000+0x13d47f) [0x7f6a1076047f]
(EE) 12: /usr/lib/x86_64-linux-gnu/libGLESv2_PVR_MESA.so (0x7f6a10623000+0x13a130) [0x7f6a1075d130]
(EE) 13: /usr/lib/x86_64-linux-gnu/libGLESv2_PVR_MESA.so (0x7f6a10623000+0x13a798) [0x7f6a1075d798]
(EE) 14: /usr/lib/x86_64-linux-gnu/libGLESv2_PVR_MESA.so (0x7f6a10623000+0x7ae5d) [0x7f6a1069de5d]
(EE) 15: /usr/lib/x86_64-linux-gnu/libGLESv2_PVR_MESA.so (glDeleteFramebuffers+0x41a) [0x7f6a1076a126]
(EE) 16: /usr/local/lib/xorg/modules/libglamoregl.so (0x7f6a121c8000+0x30c56) [0x7f6a121f8c56]
(EE) 17: /usr/local/lib/xorg/modules/libglamoregl.so (0x7f6a121c8000+0x31694) [0x7f6a121f9694]
(EE) 18: /usr/local/lib/xorg/modules/libglamoregl.so (glamor_close_screen+0x1be) [0x7f6a121d3f95]
(EE) 19: /usr/local/lib/xorg/modules/libglamoregl.so (0x7f6a121c8000+0x93b7) [0x7f6a121d13b7]
(EE) 20: /usr/local/bin/Xorg (0x5636fd17b000+0x1cbcab) [0x5636fd346cab]
(EE) 21: /usr/local/bin/Xorg (0x5636fd17b000+0x13de29) [0x5636fd2b8e29]
(EE) 22: /usr/local/bin/Xorg (0x5636fd17b000+0x3ff61) [0x5636fd1baf61]
(EE) 23: /usr/local/bin/Xorg (0x5636fd17b000+0xf3c2e) [0x5636fd26ec2e]
(EE) 24: /usr/local/bin/Xorg (0x5636fd17b000+0x4c813) [0x5636fd1c7813]
(EE) 25: /usr/local/bin/Xorg (0x5636fd17b000+0x526ed) [0x5636fd1cd6ed]
(EE) 26: /usr/local/bin/Xorg (0x5636fd17b000+0x24374d) [0x5636fd3be74d]
(EE) 27: /usr/local/lib/xorg/modules/drivers/modesetting_drv.so (0x7f6a12623000+0xd5c5) [0x7f6a126305c5]
(EE) 28: /usr/local/bin/Xorg (0x5636fd17b000+0xd2676) [0x5636fd24d676]
(EE) 29: /usr/local/bin/Xorg (0x5636fd17b000+0x218f48) [0x5636fd393f48]
(EE) 30: /usr/local/bin/Xorg (0x5636fd17b000+0x1faef4) [0x5636fd375ef4]
(EE) 31: /usr/local/bin/Xorg (0x5636fd17b000+0x1e98c4) [0x5636fd3648c4]
(EE) 32: /usr/local/bin/Xorg (0x5636fd17b000+0x134c45) [0x5636fd2afc45]
(EE) 33: /usr/local/bin/Xorg (0x5636fd17b000+0x2026ad) [0x5636fd37d6ad]
(EE) 34: /usr/local/bin/Xorg (0x5636fd17b000+0x10e3f6) [0x5636fd2893f6]
(EE) 35: /usr/local/bin/Xorg (0x5636fd17b000+0x20b1db) [0x5636fd3861db]
(EE) 36: /usr/local/bin/Xorg (0x5636fd17b000+0x13ebdf) [0x5636fd2b9bdf]
(EE) 37: /usr/local/bin/Xorg (0x5636fd17b000+0xf4b6d) [0x5636fd26fb6d]
(EE) 38: /usr/local/bin/Xorg (0x5636fd17b000+0xca093) [0x5636fd245093]
(EE) 39: /usr/local/lib/xorg/modules/extensions/libglx.so (0x7f6a14b7f000+0x3c32d) [0x7f6a14bbb32d]
(EE) 40: /usr/local/bin/Xorg (0x5636fd17b000+0x83bc7) [0x5636fd1febc7]
(EE) 41: /usr/local/bin/Xorg (0x5636fd17b000+0x2504b5) [0x5636fd3cb4b5]
(EE) 42: /lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main+0xe7) [0x7f6a15d6ec87]
(EE) 43: /usr/local/bin/Xorg (_start+0x2a) [0x5636fd1a951a]
(EE)
(EE) Segmentation fault at address 0x7f6a182b0014
(EE)
Fatal server error:
(EE) Caught signal 11 (Segmentation fault). Server aborting
(EE)
(EE)
Please consult the The X.Org Foundation support
         at http://wiki.x.org
 for help.
(EE) Please also check the log file at "/usr/local/var/log/Xorg.0.log" for additional information.
(EE)
ACPI: Closing device
(EE) Server terminated with error (1). Closing log file.

3. Analysis

3.1 Core dump analysis

Set the core file size with the ulimit command (e.g. `ulimit -c unlimited`), reproduce the problem to obtain a core dump, then load it in gdb. The resulting call stack:

gdb /usr/local/bin/Xorg /var/core/core_Xorg_24746
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f72fc03e7f1 in __GI_abort () at abort.c:79
#2  0x0000556446cce811 in OsAbort () at ../source/os/utils.c:1351
#3  0x0000556446cd7db6 in AbortServer () at ../source/os/log.c:872
#4  0x0000556446cd82d6 in FatalError (f=0x556446d7d2e0 "Caught signal %d (%s). Server aborting\n") at ../source/os/log.c:1010
#5  0x0000556446cca93e in OsSigHandler (signo=11, sip=0x7ffedc5476f0, unused=0x7ffedc5475c0) at ../source/os/osinit.c:156
#6  <signal handler called>
#7  0x00007f72f72fa911 in PVRSRVGetClientEventFilter (psDevConnection=0x556448e30360, eApi=1) at services/client/common/hwperf_client.c:439
#8  0x00007f72f6b48528 in PVRSRVFenceWait (psDevConnection=0x556448e30360, hFence=29, ui32TimeoutInMs=0) at include/pvrsrv_sync_um.h:349
#9  0x00007f72f6b48f67 in RM_ANF_Check (psSysContext=0x556448e063c0, hFence=29) at common/resourceman.c:427
#10 0x00007f72f6b49674 in RMTask_IsComplete (psCtx=0x556448eceee0, psTask=0x556449318fb0) at common/resourceman.c:704
#11 0x00007f72f6b4eaff in RM_GetJobState (psCtx=0x556448eceee0, psHWQ=0x556448f1e080, uiJobNumber=26) at common/resourceman.c:4411
#12 0x00007f72f6b4ba52 in RM_GetResourceState (psCtx=0x556448eceee0, psHistory=0x5564491c84d0, eUsageMask=RM_USAGE_READ_WRITE, ui32StateCheckMask=3)
    at common/resourceman.c:2026
#13 0x00007f72f6b4eda6 in RM_IsResourceNeededBy3D_NoLock (psCtx=0x556448eceee0, psResource=0x556449116508, eUsageMask=RM_USAGE_READ_WRITE)
    at common/resourceman.c:4674
#14 0x00007f72f6b4ee09 in RM_IsResourceNeededBy3D (psCtx=0x556448eceee0, psResource=0x556449116508, eUsageMask=RM_USAGE_READ_WRITE)
    at common/resourceman.c:4712
#15 0x00007f72f6a1147f in FreeFBOStaticPrograms (gc=0x556448eceee0, psFBOStaticPrograms=0x556449116508) at opengles3/volcanic/fbo.c:2997
#16 0x00007f72f6a0e130 in FreeFrameBuffer (gc=0x556448eceee0, psFrameBuffer=0x556449115420) at opengles3/volcanic/fbo.c:1157
#17 0x00007f72f6a0e798 in DisposeFrameBufferObject (gc=0x556448eceee0, psNamedItem=0x556449115420, bIsShutdown=IMG_FALSE) at opengles3/volcanic/fbo.c:1408
#18 0x00007f72f694ee5d in NamedItemDelRefByName (gc=0x556448eceee0, psNamesArray=0x556448eefce0, ui32Num=1, ui32Name=0x5564492de2f4)
    at opengles3/names.c:1302
#19 0x00007f72f6a1b126 in glDeleteFramebuffers (n=1, framebuffers=0x5564492de2f4) at opengles3/volcanic/fbo.c:7144
#20 0x00007f72f84a9c56 in glamor_destroy_fbo (glamor_priv=0x556448f29bf0, fbo=0x5564492de2f0) at ../source/glamor/glamor_fbo.c:40
#21 0x00007f72f84aa694 in glamor_pixmap_destroy_fbo (pixmap=0x55644924fbd0) at ../source/glamor/glamor_fbo.c:311
#22 0x00007f72f8484f95 in glamor_close_screen (screen=0x556448f27e60) at ../source/glamor/glamor.c:811
#23 0x00007f72f84823b7 in glamor_egl_close_screen (screen=0x556448f27e60) at ../source/glamor/glamor_egl.c:776
#24 0x0000556446cdbcab in dri3_close_screen (screen=0x556448f27e60) at ../source/dri3/dri3.c:44
#25 0x0000556446c4de29 in SyncCloseScreen (pScreen=0x556448f27e60) at ../source/miext/sync/misync.c:161
#26 0x0000556446b4ff61 in miDCCloseScreen (pScreen=0x556448f27e60) at ../source/mi/midispcur.c:155
#27 0x0000556446c03c2e in damageCloseScreen (pScreen=0x556448f27e60) at ../source/miext/damage/damage.c:1605
#28 0x0000556446b5c813 in miPointerCloseScreen (pScreen=0x556448f27e60) at ../source/mi/mipointer.c:170
#29 0x0000556446b626ed in miSpriteCloseScreen (pScreen=0x556448f27e60) at ../source/mi/misprite.c:379
#30 0x0000556446d5374d in xf86CursorCloseScreen (pScreen=0x556448f27e60) at ../source/hw/xfree86/ramdac/xf86CursorRD.c:151
#31 0x00007f72f88e15c5 in CloseScreen (pScreen=0x556448f27e60) at ../source/hw/xfree86/drivers/modesetting/driver.c:1924
#32 0x0000556446be2676 in RRCloseScreen (pScreen=0x556448f27e60) at ../source/randr/randr.c:112
#33 0x0000556446d28f48 in xf86CrtcCloseScreen (screen=0x556448f27e60) at ../source/hw/xfree86/modes/xf86Crtc.c:785
#34 0x0000556446d0aef4 in DGACloseScreen (pScreen=0x556448f27e60) at ../source/hw/xfree86/common/xf86DGA.c:288
#35 0x0000556446cf98c4 in CMapCloseScreen (pScreen=0x556448f27e60) at ../source/hw/xfree86/common/xf86cmap.c:250
#36 0x0000556446c44c45 in XvCloseScreen (pScreen=0x556448f27e60) at ../source/Xext/xvmain.c:309
#37 0x0000556446d126ad in xf86XVCloseScreen (pScreen=0x556448f27e60) at ../source/hw/xfree86/common/xf86xv.c:1168
#38 0x0000556446c1e3f6 in present_close_screen (screen=0x556448f27e60) at ../source/present/present_screen.c:70
---Type <return> to continue, or q <return> to quit---
#39 0x0000556446d1b1db in VGAarbiterCloseScreen (pScreen=0x556448f27e60) at ../source/hw/xfree86/common/xf86VGAarbiter.c:262
#40 0x0000556446c4ebdf in CursorCloseScreen (pScreen=0x556448f27e60) at ../source/xfixes/cursor.c:205
#41 0x0000556446c04b6d in AnimCurCloseScreen (pScreen=0x556448f27e60) at ../source/render/animcur.c:100
#42 0x0000556446bda093 in compCloseScreen (pScreen=0x556448f27e60) at ../source/composite/compinit.c:86
#43 0x00007f72fae6c32d in glxCloseScreen (pScreen=0x556448f27e60) at ../source/glx/glxscreens.c:171
#44 0x0000556446b93bc7 in dix_main (argc=4, argv=0x7ffedc548d68, envp=0x7ffedc548d90) at ../source/dix/main.c:325
#45 0x0000556446d604b5 in main (argc=4, argv=0x7ffedc548d68, envp=0x7ffedc548d90) at ../source/dix/stubmain.c:34
(gdb)

The stack shows that Xorg hit the segmentation fault at line 439 of PVRSRVGetClientEventFilter. The relevant code:

415 IMG_EXPORT IMG_UINT32 PVRSRVGetClientEventFilter(PVRSRV_DEV_CONNECTION *psDevConnection,
416                          RGX_HWPERF_CLIENT_API eApi)
417 {
418     HWPERF_CONTEXT *psContext;
419
420     PVR_ASSERT(psDevConnection != NULL);
421     PVR_LOG_RETURN_IF_FALSE(eApi < RGX_HWPERF_CLIENT_API_MAX &&
422                 eApi != RGX_HWPERF_CLIENT_API_INVALID,
423                 "eApi invalid", 0);
424
425     psContext = psDevConnection->pvHWPerfWriteContext;
426
427     /* If this filter is != 0 it means that stream has already been opened
428      * during the initialisation. Since AppHints have precedence over
429      * the value set by the server we can just 'return' here. */
430     if (psContext->ui32APIEventsFilter[eApi] != 0)
431     {
432         return psContext->ui32APIEventsFilter[eApi];
433     }
434
435     //xxxx modify
436     printf("%s %d psDevConnection->pui32InfoPage:0x%lx offset=%u\n", __func__, __LINE__, (long)psDevConnection->pui32InfoPage, _ApiToInfoPageIdx(eApi));
437
438     /* Prevent lazy initialisation if the filters haven't been set yet. */
439     if (psDevConnection->pui32InfoPage[_ApiToInfoPageIdx(eApi)] == 0)
440     {
441         //xxxx modify
442         printf("%s %d psDevConnection->pui32InfoPage:0x%lx offset=%u\n", __func__, __LINE__, (long)psDevConnection->pui32InfoPage, _ApiToInfoPageIdx(eApi));
443         return 0;
444     }
          .........

Line 439 is Xorg reading the information page. During KMD initialization, a PMR (physical memory resource) is created for the information page; it records configuration values (for example, the maximum number of timeout retries). Xorg maps this PMR into user space, and the resulting virtual address is psDevConnection->pui32InfoPage, through which Xorg can read the KMD's information page.

Once the KMD has created the information page PMR, it normally never modifies it again; likewise, after obtaining the user-space virtual address, Xorg only reads the configuration values and never writes to the page.
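
As a hedged illustration of what such a read looks like once the page is mapped (the index macro below is invented for the example; the real indices live in the services headers):

/* Illustrative only: fetch one 32-bit config word from the mapped page.
 * TIMEOUT_INFO_VALUE_RETRIES_IDX is a placeholder index name. */
IMG_UINT32 ui32Retries =
        psDevConnection->pui32InfoPage[TIMEOUT_INFO_VALUE_RETRIES_IDX];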

3.2 Checking whether pui32InfoPage was modified

Since pui32InfoPage is a pointer member of psDevConnection, the first thing to rule out is corruption of the pointer itself. Debug prints were added where pui32InfoPage is first set and again just before line 439 of PVRSRVGetClientEventFilter. They confirmed that the value of pui32InfoPage never changed, and that _ApiToInfoPageIdx(eApi) stays within the information page, so there is no out-of-bounds access.

3.3 Hypotheses

3.3.1 Use-after-unmap hypothesis

With pointer corruption ruled out, the most likely cause of the segmentation fault is that Xorg unmapped the information page and then accessed it again.
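
The mechanics behind this hypothesis are easy to demonstrate in isolation; the minimal standalone program below (not from the driver code) faults the same way:

#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    /* Map one anonymous page, unmap it, then touch it. */
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    munmap(p, 4096);
    /* The kernel's page-fault handler finds no vma for this address
     * and delivers SIGSEGV, the failure mode suspected here. */
    printf("%d\n", *(volatile int *)p);
    return 0;
}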

To verify this, a debug print of the unmapped address was added to xdxgpu_bo_unmap, producing the following log:

PVR: EGL rendertarget cache stats:
PVR:    Hits:   0
PVR:    Misses: 0
PVR:    High watermark: 0
PVRSRVGetClientEventFilter 436 psDevConnection->pui32InfoPage:0x7f6a182b0000 offset=5
PVRSRVGetClientEventFilter 442 psDevConnection->pui32InfoPage:0x7f6a182b0000 offset=5
OSMUnmapPMR 300 gInfoPage=0x7f6a182b0000
xdxgpu_bo_unmap 222 unmap_addr=0x7f6a182b0000
PVRSRVGetClientEventFilter 436 psDevConnection->pui32InfoPage:0x7f6a182b0000 offset=5
(EE)
(EE) Backtrace: ... (identical to the backtrace in Section 2, omitted)
(EE)
(EE) Segmentation fault at address 0x7f6a182b0014

The log shows that xdxgpu_bo_unmap unmapped address 0x7f6a182b0000 (the information page's virtual address), after which PVRSRVGetClientEventFilter accessed the information page again and faulted. The faulting address confirms it: 0x7f6a182b0014 = 0x7f6a182b0000 + 5 * sizeof(IMG_UINT32), which matches the logged offset=5.

3.3.2 Capturing the unmap call path

So who performed the unmap? Being unfamiliar with the Xorg code paths, the quickest way to capture the unmap call chain was to add code in AcquireInfoPage (the function that sets pui32InfoPage) saving pui32InfoPage into a global variable gInfoPage, and code in OSMUnmapPMR printing a message whenever the address being unmapped equals gInfoPage. The code:

154 unsigned int* gInfoPage = NULL;
155
156 IMG_INTERNAL
157 PVRSRV_ERROR AcquireInfoPage(PVRSRV_DEV_CONNECTION *psDevConnection)
158 {
159     PVRSRV_ERROR eError;
160     IMG_HANDLE hSrvHandle;
161     IMG_DEVMEM_SIZE_T uiImportSize;
162
163     PVR_ASSERT(psDevConnection != NULL);
164
165     /* Obtain the services connection handle */
166     hSrvHandle = GetSrvHandle(psDevConnection);
167
168     /* Acquire server information page (placed in process handle table) */
169     eError = BridgeAcquireInfoPage(hSrvHandle, &psDevConnection->hInfoPagePMR);
170     PVR_LOG_GOTO_IF_ERROR(eError, "BridgeAcquireInfoPage", e0);
171
172     /* Now convert this physical memory into a client memory descriptor
173      * handle */
174     eError = DevmemLocalImport(psDevConnection,
175                             psDevConnection->hInfoPagePMR,
176                             PVRSRV_MEMALLOCFLAG_CPU_READABLE,
177                             &psDevConnection->psImportMemDesc,
178                             &uiImportSize,
179                             "InfoPageBuffer");
180     PVR_LOG_GOTO_IF_ERROR(eError, "DevmemLocalImport", e1);
181
182     /* Now map the client memory handle into the virtual address space of this
183      * process */
184     eError = DevmemAcquireCpuVirtAddr(psDevConnection->psImportMemDesc,
185                                       (void**) &psDevConnection->pui32InfoPage);
186
187     gInfoPage = psDevConnection->pui32InfoPage;  // debug: save the info page address
188     printf("%s %d psDevConnection->pui32InfoPage=0x%lx\n", __func__, __LINE__, (unsigned long)psDevConnection->pui32InfoPage);
        .......



264 extern unsigned int * gInfoPage;
265
266 IMG_INTERNAL void
267 OSMUnmapPMR(IMG_HANDLE hBridge,
268             IMG_HANDLE hPMR,
269             IMG_HANDLE hOSMMapPrivData,
270             void *pvMappingAddress,
271             size_t uiMappingLength)
272 {
273 #ifndef YAJUN_MMAP
274     IMG_INT32 iStatus;
275 #endif
276     size_t iSizeAsSizeT;
277
278     PVR_UNREFERENCED_PARAMETER(hBridge);
279     PVR_UNREFERENCED_PARAMETER(hPMR);
280 #ifndef YAJUN_MMAP
281     PVR_UNREFERENCED_PARAMETER(hOSMMapPrivData);
282
283     PVR_ASSERT(hOSMMapPrivData == pvMappingAddress);
284 #else
285     PVR_UNREFERENCED_PARAMETER(pvMappingAddress);
286 #endif
287
288     /* We generically use IMG_DEVMEM_SIZE_T throughout, but, munmap
289        requires that we use a size_t.  We have to assert that the
290        length being mapped didn't exceed the max that size_t can
291        handle.  This should be checked in OSMMapPMR() */
292     iSizeAsSizeT = (size_t) uiMappingLength;
293     PVR_ASSERT(iSizeAsSizeT == uiMappingLength);
294
295     PVRSRVBridgeLog_StartTimer();
296
297 #if 1
298     if (*((unsigned int **)hOSMMapPrivData) == gInfoPage)  // debug: detect unmap of the info page
299     {
300         printf("%s %d gInfoPage=0x%lx\n", __func__, __LINE__, (unsigned long)gInfoPage);
301     }
302 #endif
303     xdxgpu_bo_unmap((xdxgpu_handle)hOSMMapPrivData);
        .......

Xorg was then started, gdb attached to it, and a breakpoint set at line 300 of OSMUnmapPMR (the debug printf above); xgltest1 was then run to reproduce the problem.

This yielded the following call stack:

(gdb) c
Continuing.
[Thread 0x7fc4184e2700 (LWP 2618) exited]

Thread 1 "Xorg" hit Breakpoint 1, OSMUnmapPMR (hBridge=0x560f5fba2ff0, hPMR=0x2, hOSMMapPrivData=0x560f5f78f770, pvMappingAddress=0x7f6a182b0000,
    uiMappingLength=4096) at services/client/env/linux/osmmap.c:300
300                     printf("%s %d gInfoPage=0x%lx\n", __func__, __LINE__, (unsigned long)gInfoPage);
(gdb) bt
#0  OSMUnmapPMR (hBridge=0x560f5fba2ff0, hPMR=0x2, hOSMMapPrivData=0x560f5f78f770, pvMappingAddress=0x7f6a182b0000, uiMappingLength=4096)
    at services/client/env/linux/osmmap.c:300
#1  0x00007fc41d87a946 in DevmemImportStructCPUUnmap (psImport=0x560f5fbf55c0) at services/shared/common/devicemem_utils.c:1207
#2  0x00007fc41d8783db in DevmemReleaseCpuVirtAddr (psMemDesc=0x560f5fb93000) at services/shared/common/devicemem.c:2631
#3  0x00007fc41d83a73b in ReleaseInfoPage (psDevConnection=0x560f5fb92f40) at services/client/common/srvcore.c:215
#4  0x00007fc41d82e949 in ConnectionDestroy (psConnection=0x560f5fb92f40) at services/client/common/connection.c:454
#5  0x00007fc41d83aae4 in PVRSRVDisconnect (psConnection=0x560f5fb92f40) at services/client/common/srvcore.c:377
#6  0x00007fc41ddf11c9 in PVRDRIDestroyScreenImpl (psScreenImpl=0x560f5faca8f0) at lws/pvr_dri_support/pvrscreen_impl.c:321
#7  0x00007fc41ddf2ec3 in DRIMODDestroyScreen (psPVRScreen=0x560f5fb68830) at lws/pvr_dri_support/pvrdri_mod.c:250
#8  0x00007fc41e421241 in DRISUPDestroyScreen (psDRISUPScreen=0x560f5fb68830) at ../src/mesa/drivers/dri/pvr/pvrcompat.c:304
#9  0x00007fc41e422192 in PVRDRIScreenRemoveReference (psPVRScreen=0x560f5fc65430) at ../src/mesa/drivers/dri/pvr/pvrdri.c:96
#10 0x00007fc41e42277c in PVRDRIDestroyScreen (psDRIScreen=0x560f5fbfcf10) at ../src/mesa/drivers/dri/pvr/pvrdri.c:265
#11 0x00007fc41e425b2c in driDestroyScreen (psp=0x560f5fbfcf10) at ../src/mesa/drivers/dri/common/dri_util.c:239
#12 0x00007fc421376887 in __glXDRIscreenDestroy (baseScreen=0x560f5fa1e770) at ../source/glx/glxdri2.c:944
#13 0x00007fc4213a2319 in glxCloseScreen (pScreen=0x560f5f8d2e60) at ../source/glx/glxscreens.c:169
#14 0x0000560f5d521bc7 in dix_main (argc=4, argv=0x7ffce56e6328, envp=0x7ffce56e6350) at ../source/dix/main.c:325
#15 0x0000560f5d6ee4b5 in main (argc=4, argv=0x7ffce56e6328, envp=0x7ffce56e6350) at ../source/dix/stubmain.c:34
(gdb) c
Continuing.

Thread 1 "Xorg" received signal SIGSEGV, Segmentation fault.
0x00007fc41d8308e9 in PVRSRVGetClientEventFilter (psDevConnection=0x560f5f7db250, eApi=1) at services/client/common/hwperf_client.c:439
439             if (psDevConnection->pui32InfoPage[_ApiToInfoPageIdx(eApi)] == 0)
(gdb) info thread
  Id   Target Id         Frame
* 1    Thread 0x7fc424a69d00 (LWP 2601) "Xorg" 0x00007fc41d8308e9 in PVRSRVGetClientEventFilter (psDevConnection=0x560f5f7db250, eApi=1)
    at services/client/common/hwperf_client.c:439
(gdb) c
Continuing.

Thread 1 "Xorg" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb)
Continuing.

Program terminated with signal SIGABRT, Aborted.
The program no longer exists.
(gdb) quit
root@FPGA-test:/home/test#

So ending xgltest1 triggers Xorg's glxCloseScreen path, which in itself looks legitimate. Why, then, does Xorg access memory that has already been unmapped? The Xorg code had not been modified, so such a logic bug there seemed unlikely, which put the focus back on the information page. Debug prints added in PVRDRICreateScreen and AcquireInfoPage (see the code at line 188 of AcquireInfoPage above) produced the following log:

......
LoaderOpen(/usr/local/lib/xorg/modules/libglamoregl.so)
(II) Loading /usr/local/lib/xorg/modules/libglamoregl.so
(II) Module glamoregl: vendor="X.Org Foundation"
        compiled for 1.20.13, module version = 1.0.1
--------------- PVRDRICreateScreen 263
--------------- PVRSRVConnectionCreate 353
--------------- ConnectionCreate 240 fd=15
-------OSMMapPMR 198 fd=15 PMR_handle=0x2 dev=0x563571c00db0 bo=0x563571be09a0 addr=0x7feffb7f8000
xxxx map success OSMMapPMR 235 pvUserVirtualAddress=0x7feffb7f8000
AcquireInfoPage 188 psDevConnection->pui32InfoPage=0x7feffb7f8000   // key line
ConnectionCreate 308 psConnection->pui32InfoPage=0x7feffb7f8000 TIMEOUT_INFO_VALUE_RETRIES=20000, TIMEOUT_INFO_VALUE_TIMEOUT_MS=1300, TIMEOUT_INFO_CONDITION_RETRIES=5, TIMEOUT_INFO_CONDITION_TIMEOUT_MS=400, TIMEOUT_INFO_TASK_QUEUE_RETRIES=10, TIMEOUT_INFO_TASK_QUEUE_FLUSH_TIMEOUT_MS=1000
fd_compare_test 47 name1:/dev/dri/card0 name2:/dev/dri/card0
fd_compare_test 47 name1:/dev/dri/card0 name2:/dev/dri/card0
......
Sync Extension 3.1
(II) AIGLX: Indirect GLX is not supported
(II) AIGLX: Indirect GLX is not supported
--------------- PVRDRICreateScreen 263
--------------- PVRSRVConnectionCreate 353
--------------- ConnectionCreate 240 fd=24
fd_compare_test 47 name1:/dev/dri/card0 name2:/dev/dri/card0
xdxgpu_device_create 96 not equal r1=1 r2=0 tmp->fd=16 fd=24
fd_compare_test 47 name1:/dev/dri/card0 name2:/dev/dri/card0
over!
fd_compare_test 47 name1:/dev/dri/card0 name2:/dev/dri/card0
-------OSMMapPMR 198 fd=24 PMR_handle=0x2 dev=0x563571c00db0 bo=0x563571be09a0 addr=0x7feffb7f8000
xxxx map success OSMMapPMR 235 pvUserVirtualAddress=0x7feffb7f8000
AcquireInfoPage 188 psDevConnection->pui32InfoPage=0x7feffb7f8000      // key line
ConnectionCreate 308 psConnection->pui32InfoPage=0x7feffb7f8000 TIMEOUT_INFO_VALUE_RETRIES=20000, TIMEOUT_INFO_VALUE_TIMEOUT_MS=1300, TIMEOUT_INFO_CONDITION_RETRIES=5, TIMEOUT_INFO_CONDITION_TIMEOUT_MS=400, TIMEOUT_INFO_TASK_QUEUE_RETRIES=10, TIMEOUT_INFO_TASK_QUEUE_FLUSH_TIMEOUT_MS=1000
fd_compare_test 47 name1:/dev/dri/card0 name2:/dev/dri/card0
xdxgpu_device_create 96 not equal r1=1 r2=0 tmp->fd=16 fd=24
fd_compare_test 47 name1:/dev/dri/card0 name2:/dev/dri/card0
......

The log shows that Xorg calls PVRDRICreateScreen twice. Each call opens /dev/dri/card0 to obtain an fd and then calls AcquireInfoPage to map the information page, and both calls get the same virtual address for the information page. Why do two user-space mappings of the information page return the same virtual address? Answering that requires walking through the information page mapping flow.

3.3.3 The information page mapping flow

AcquireInfoPage ultimately calls OSMMapPMR (in osmmap.c) to map into user space the PMR identified by the PMR handle (a user-space handle that the KMD converts back to the actual PMR). The OSMMapPMR code:

 50 IMG_INTERNAL PVRSRV_ERROR
 51 OSMMapPMR(IMG_HANDLE hBridge,
 52           IMG_HANDLE hPMR,
 53           IMG_DEVMEM_SIZE_T uiPMRLength,
 54           PVRSRV_MEMALLOCFLAGS_T uiFlags,
 55           IMG_HANDLE *phOSMMapPrivDataOut,
 56           void **ppvMappingAddressInOut,
 57           size_t *puiMappingLengthOut)
 58 {
 59     //xxxx modify
 60     PVRSRV_ERROR eError = PVRSRV_OK;
 61     size_t uiMMapOffset;
 62     size_t uiMMapLength;
 63     IMG_UINT32 uiMMapProtFlags = 0;
 64 #if defined(MAP_FIXED_NOREPLACE) || !defined(YAJUN_MMAP)
 65     IMG_UINT32 uiMMapFlags = MAP_SHARED;
 66 #endif
 67     void *pvUserVirtualAddress = NULL;
 68     void *hOSMMapPrivDataOut = NULL;
 69     //xxxx modify
 70 #ifndef YAJUN_MMAP
 71     IMG_INT iResult, iErrno;
 72 #endif
 73
 74     uiMMapOffset = (size_t)hPMR;
 75
 76     uiMMapLength = (size_t)uiPMRLength;
 77     /* To justify above cast, we assert that no significant bits were lost */
 78     PVR_ASSERT((IMG_DEVMEM_SIZE_T)uiMMapLength == uiPMRLength);
..............
171     {
172         IMG_INT ret;
173         xdxgpu_handle devHandle = NULL;
174         IMG_INT drm_fd = *(IMG_INT *)hBridge;
175         xdxgpu_handle bo = NULL;
176
177         ret = xdxgpu_device_create(drm_fd, &devHandle);
178         if (ret) {
179             eError = PVRSRV_ERROR_DEVICEMEM_MAP_FAILED;
180             goto mmap_xdxgpu_end;
181         }
182
183         ret = xdxgpu_bo_import(devHandle,
184                 XDXGPU_BO_PVR_HANDLE, (IMG_INT64)hPMR, &bo);
185         if (ret) {
186             printf("xxxx map error %s %d ret=%d \n", __func__, __LINE__, ret);
187             eError = PVRSRV_ERROR_DEVICEMEM_MAP_FAILED;
188             goto mmap_xdxgpu_end;
189         }
190
191         ret = xdxgpu_bo_map(bo, &pvUserVirtualAddress);
192         if (ret) {
193             printf("xxxx map error %s %d ret=%d \n", __func__, __LINE__, ret);
194             eError = PVRSRV_ERROR_DEVICEMEM_MAP_FAILED;
195         }
196
197         hOSMMapPrivDataOut = bo;
198 mmap_xdxgpu_end:
199         if (bo)
200             xdxgpu_bo_destroy(bo);
201         if (devHandle)
202             xdxgpu_device_destroy(devHandle);
203         if (eError != PVRSRV_OK) {
204             printf("xxxx map error %s %d ret=%d \n", __func__, __LINE__, ret);
205             goto e0;
206         }
207     }
 ..............

OSMMapPMR maps the PMR behind the PMR handle into user space via the following libdrm functions:

  • xdxgpu_device_create
  • xdxgpu_bo_import
  • xdxgpu_bo_map

xdxgpu_bo_import is a thin wrapper around xdxgpu_bo_import_from_pvr_handle, so only the latter is listed below.

The code of these functions follows (note: in this debug build, fd_compare appears in the logs as fd_compare_test):

static int fd_compare(int fd1, int fd2)
{
	char *name1 = drmGetPrimaryDeviceNameFromFd(fd1);
	char *name2 = drmGetPrimaryDeviceNameFromFd(fd2);
	int result;

	if (name1 == NULL || name2 == NULL) {
		free(name1);
		free(name2);
		return 0;
	}

	result = strcmp(name1, name2);
	free(name1);
	free(name2);

	return result;
}
 
 81 drm_public int xdxgpu_device_create(int fd, xdxgpu_handle *devHandle)
 82 {
 83     struct xdxgpu_device *dev = NULL, *tmp;
 84     drmVersionPtr version;
 85     int ret;
 86
 87     pthread_mutex_lock(&dev_mutex);
 88     LIST_FOR_EACH_ENTRY(tmp, &dev_list, node)
 89     {
 90         if (fd_compare_test(tmp->fd, fd) == 0) {
 91             dev = tmp;
 92             break;
 93         }
 94     }
 95
 96     if (dev) {
 97         xdxgpu_device_get(dev);
 98         *devHandle = (xdxgpu_handle)dev;
 99         pthread_mutex_unlock(&dev_mutex);
100         return 0;
101     }
102
103     /* Create new xdxgpu device */
104     dev = calloc(1, sizeof(struct xdxgpu_device));
105     if (!dev) {
106         fprintf(stderr, "%s: calloc failed\n", __func__);
107         pthread_mutex_unlock(&dev_mutex);
108         return -ENOMEM;
109     }
110
111     ret = drmGetDevice2(fd, 0, &dev->ddev);
112     if (ret) {
113         fprintf(stderr, "%s: get device info failed\n", __func__);
114         free(dev);
115         pthread_mutex_unlock(&dev_mutex);
116         return ret;
117     }
118
119     dev->fd = -1;
120     atomic_set(&dev->refcount, 1);
121
122     version = drmGetVersion(fd);
123     dev->major_version = version->version_major;
124     dev->minor_version = version->version_minor;
125     drmFreeVersion(version);
126
127     dev->fd = fcntl(fd, F_DUPFD_CLOEXEC, 0);
128     dev->bo_table = drmHashCreate();
129
130     list_add(&dev->node, &dev_list);
131     pthread_mutex_init(&dev->bo_tlb_lock, NULL);
132
133     *devHandle = (xdxgpu_handle)dev;
134     pthread_mutex_unlock(&dev_mutex);
135     return 0;
136 }

340 static int xdxgpu_bo_import_from_pvr_handle(struct xdxgpu_device *xdev,
341                         uint32_t handle,
342                         struct xdxgpu_bo **ppxbo)
343 {
344     union drm_xdxgpu_gem_import_from_pvr args = { 0 };
345     int ret;
346     int32_t gem_handle;
347     struct xdxgpu_bo *xbo;
348     struct xdxgpu_gem_info info = { 0 };
349
350     dev_info(xdev, "%s: pvr handle 0x%x\n", __func__, handle);
351
352     args.in.handle = handle;
353     ret = drmCommandWriteRead(xdev->fd, DRM_XDXGPU_IMPORT_FROME_PVR_HANDLE,
354                   &args, sizeof(args));
355     if (ret) {
356         dev_err(xdev, "%s: failed to import from PVR handle (%d)\n",
357             __func__, ret);
358         return ret;
359     }
360
361     gem_handle = args.out.handle;
362
363     pthread_mutex_lock(&xdev->bo_tlb_lock);
364     xbo = xdxgpu_lookup_bo(xdev->bo_table, gem_handle);
365     pthread_mutex_unlock(&xdev->bo_tlb_lock);
366     if (xbo) {
367         *ppxbo = xbo;
368         return 0;
369     }
370
371     // new buffer object in current process
372     xbo = calloc(1, sizeof(struct xdxgpu_bo));
373     if (!xbo) {
374         drmCloseBufferHandle(xdev->fd, gem_handle);
375         return -ENOMEM;
376     }
377
378     ret = xdxgpu_query_gem_info(xdev, gem_handle, &info);
379     if (ret) {
380         free(xbo);
381         drmCloseBufferHandle(xdev->fd, gem_handle);
382         return ret;
383     }
384
385     xdxgpu_device_get(xdev);
386     xbo->xdev = xdev;
387     xbo->gem_handle = gem_handle;
388     xbo->size = info.size;
389     atomic_set(&xbo->refcount, 1);
390     atomic_set(&xbo->map_count, 0);
391
392     pthread_mutex_lock(&xdev->bo_tlb_lock);
393     drmHashInsert(xdev->bo_table, xbo->gem_handle, xbo);
394     pthread_mutex_unlock(&xdev->bo_tlb_lock);
395
396     *ppxbo = xbo;
397
398     return ret;
399 }

169 drm_public int xdxgpu_bo_map(xdxgpu_handle bo, void **cpu)
170 {
171     void *ptr;
172     struct xdxgpu_bo *xbo = (struct xdxgpu_bo *)bo;
173     struct xdxgpu_device *xdev = xbo->xdev;
174     int ret;
175     struct xdxgpu_gem_info info = { 0 };
176
177     assert(xbo != NULL);
178
179     if (xbo->ptr) {
180         *cpu = xbo->ptr;
181         return 0;
182     }
183
184     ret = xdxgpu_query_gem_info(xdev, xbo->gem_handle, &info);
185     if (ret)
186         return ret;
187
188     if (info.mmap_offset == (uint64_t)-1) {
189         dev_err(xdev, "%s: no permission to mmap buffer object %p\n",
190             __func__, xbo);
191         return -errno;
192     }
193
194     /* Map the buffer. */
195     ptr = drm_mmap(NULL, xbo->size, PROT_READ | PROT_WRITE, MAP_SHARED,
196                xdev->fd, info.mmap_offset);
197     if (ptr == MAP_FAILED) {
198         dev_err(xdev, "%s: failed mmap buffer object %p\n", __func__,
199             xbo);
200         return -errno;
201     }
202
203     xdxgpu_bo_get(xbo);
204
205     xbo->ptr = ptr;
206
207     *cpu = xbo->ptr;
208
209     return 0;
210 }

The flow of xdxgpu_device_create:

  1. Walk every device node on the global list dev_list and compare the file name (including path) behind the node's fd with the file name behind the fd argument.
  2. If a matching node exists, call xdxgpu_device_get to increment its reference count and return that node.
  3. Otherwise, create a new device node with its reference count initialized to 1, then call fcntl to duplicate the fd argument and store the new fd in the node. Looking at the kernel path do_fcntl->f_dupfd, f_dupfd calls alloc_fd to pick a free fd and increments the reference count of the struct file behind the original fd; no new file is created (see the standalone demo below).
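
As a small standalone demonstration of that f_dupfd behavior (a hedged sketch; any readable file works), the duplicated fd shares one struct file with the original, which is why the two share a file offset:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd1 = open("/etc/hostname", O_RDONLY);
    int fd2 = fcntl(fd1, F_DUPFD_CLOEXEC, 0);
    char c;

    /* Both fds reference the same struct file (refcount now 2), so a
     * read through fd1 moves the offset observed through fd2. */
    read(fd1, &c, 1);
    printf("offset via fd2: %ld\n", (long)lseek(fd2, 0, SEEK_CUR)); /* prints 1 */

    close(fd1);   /* struct file refcount 2 -> 1 */
    close(fd2);   /* 1 -> 0, the kernel releases the file */
    return 0;
}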

The flow of xdxgpu_bo_import_from_pvr_handle:

  1. Call drmCommandWriteRead to convert the PMR handle into a gem object handle (handled in the kernel by xdx_gem_import_from_pvr_ioctl), referred to below as the gem handle.
  2. Call xdxgpu_lookup_bo to check whether the gem handle from step 1 already exists in the device node's bo_table; if so, increment that bo's reference count (xdxgpu_lookup_bo->xdxgpu_bo_get) and return it.
  3. If the gem handle is not found, allocate a bo, initialize its reference count to 1, call xdxgpu_device_get to increment the device node's reference count, and finally insert the bo into the device node's hash table (bo_table) keyed by the gem handle.

Note: every bo created increments the device node's reference count. When a bo's reference count drops to 0 and it is released via xdxgpu_bo_free, xdxgpu_bo_free->xdxgpu_device_put decrements the device node's reference count, and the device node is freed when that count reaches 0.

The flow of xdxgpu_bo_map:

  1. If the bo's virtual address ptr is non-NULL, i.e. the bo is already mapped, return bo->ptr directly.
  2. Otherwise call xdxgpu_query_gem_info, whose ioctl handler in the KMD returns the gem object's size and creates the gem mmap offset (kernel function drm_gem_create_mmap_offset).
  3. Call drm_mmap (a macro wrapping mmap) to map the bo into the user virtual address space.
  4. Call xdxgpu_bo_get to increment the bo's reference count.
  5. Store the virtual address in bo->ptr.

Given these flows, the xdxgpu_bo_destroy at the end of OSMMapPMR (bo refcount -1) pairs with xdxgpu_bo_import setting the bo refcount to 1 (or incrementing it). Because xdxgpu_bo_map added another reference, the bo is not actually freed at that point; it takes OSMUnmapPMR->xdxgpu_bo_unmap->xdxgpu_bo_put to bring the bo refcount to 0 and free it.

Likewise, the xdxgpu_device_destroy at the end of OSMMapPMR (device refcount -1) pairs with xdxgpu_device_create setting the device refcount to 1 (or incrementing it). Because xdxgpu_bo_import added a device reference, the device node is not freed either; only when OSMUnmapPMR->xdxgpu_bo_unmap->xdxgpu_bo_put drops the bo refcount to 0 does xdxgpu_bo_free->xdxgpu_device_put drop the device refcount and, once it reaches 0, free the device node.

In effect, the device node's reference count equals the number of live bos on it, and once every bo is released the device node is released as well.
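
To make the counting concrete, here is the ledger for a single OSMMapPMR call, derived from the code above: xdxgpu_device_create leaves the new device node at refcount 1; xdxgpu_bo_import sets the bo to 1 and raises the device to 2; xdxgpu_bo_map raises the bo to 2; the trailing xdxgpu_bo_destroy drops the bo back to 1, and xdxgpu_device_destroy drops the device back to 1. Both objects and the mapping therefore stay alive until OSMUnmapPMR->xdxgpu_bo_unmap drops the bo to 0, freeing the bo and, through xdxgpu_device_put, the device node.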

The drmCommandWriteRead call in xdxgpu_bo_import_from_pvr_handle ends up as an ioctl; the corresponding KMD handler:

int xdx_gem_import_from_pvr_ioctl(struct drm_device *ddev, void *data,
				  struct drm_file *filp)
{
	struct xdx_device *xdev = drm_to_xdev(ddev);
	union drm_xdxgpu_gem_import_from_pvr *args = data;
	struct xdx_drm_fpriv *fpriv = filp->driver_priv;
	int ret;
	void *psPMR;
	struct drm_gem_object *gobj;
	struct xdx_bo *bo;
	uint32_t handle;
	struct xdx_bo_property pro;
	uint32_t size;
	struct drm_file *file;
	bool bFound = false;

	ret = pvr_pmr_lookup(fpriv->pvr_fpriv, args->in.handle, &psPMR);
	if (ret) {
		dev_err(xdev->dev, "failed to lookup PMR: handle(0x%x)\n", args->in.handle);
		return ret;
	}

	mutex_lock(&ddev->filelist_mutex);
	list_for_each_entry(file, &ddev->filelist, lhead) {
		spin_lock(&file->table_lock);
		idr_for_each_entry(&file->object_idr, gobj, handle) {
			bo = to_xdx_gem(gobj);
			if(bo->psPMR == psPMR) {
				bFound = true;
				drm_gem_object_get(gobj); /* avoid released in other process */
				break;
			}
		}
		spin_unlock(&file->table_lock);

		if(bFound)
			break;
	}
	mutex_unlock(&ddev->filelist_mutex);

	if (bFound) {
		if (file != filp) {
			ret = drm_gem_handle_create(filp, gobj, &handle);
			if (ret) {
				dev_err(xdev->dev, "failed to create gem handle\n");
				PMRUnrefPMR((PMR *)psPMR);
				drm_gem_object_put_unlocked(gobj);
				return ret;
			}
		}

		args->out.handle = handle;
		PMRUnrefPMR((PMR *)psPMR);
		drm_gem_object_put_unlocked(gobj);
		return 0;
	}

	pvr_query_pmr_info(psPMR, &pro, &size);

	ret = xdx_gem_object_create(xdev, NULL, (uint64_t)size, PAGE_SIZE,
				    &pro, 0, xdx_bo_type_pvr, &gobj);
	if (ret) {
		PMRUnrefPMR((PMR *)psPMR);
		dev_err(xdev->dev, "failed to import from pvr handle: size %d\n", size);
		return ret;
	}

	bo = to_xdx_gem(gobj);
	bo->psPMR = psPMR;

	ret = drm_gem_handle_create(filp, gobj, &handle);
	drm_gem_object_put_unlocked(gobj);
	if (ret)
		return ret;

	args->out.handle = handle;
	return ret;
}

xdx_gem_import_from_pvr_ioctl converts the application's PMR handle into a gem handle. Its flow:

  1. Call pvr_pmr_lookup to find the PMR for the PMR handle.
  2. Walk the drm_device's file list and, for each drm file node, walk its object_idr; if a bo's PMR (bo->psPMR) equals the PMR from step 1, set bFound to true.
  3. If bFound is true, i.e. the PMR was found on the drm_device, check whether the drm file the gem object is attached to is the same drm file on which this xdx_gem_import_from_pvr_ioctl call arrived. If not, call drm_gem_handle_create -> ... -> idr_alloc(&file_priv->object_idr, obj, ...) to allocate a gem handle and attach the gem object to the caller's drm file; if it is the same file, simply return the gem object's existing gem handle.
  4. If bFound is false, i.e. the PMR was not found on the drm_device, create a gem object, store the PMR in it, then call drm_gem_handle_create to allocate a gem handle and attach the gem object to the caller's drm file.

Note: every time an application opens /dev/dri/card0 or /dev/dri/renderD128, the kernel allocates a new drm file, so a single process may hold several distinct drm files.

3.3.4 Why do the two mappings return the same virtual address?

From the previous section it follows that if a user-space bo is already mapped (bo->ptr non-NULL), a further xdxgpu_bo_map call on it simply returns the existing virtual address.

To verify this, the bo address was printed during mapping (visible in the OSMMapPMR lines below), which showed that both AcquireInfoPage calls end up using the same bo. The log:

......
-------OSMMapPMR 198 fd=15 PMR_handle=0x2 dev=0x563571c00db0 bo=0x563571be09a0 addr=0x7feffb7f8000  // key line: same bo
......
-------OSMMapPMR 198 fd=24 PMR_handle=0x2 dev=0x563571c00db0 bo=0x563571be09a0 addr=0x7feffb7f8000  // key line: same bo
......

(The remainder of the log is identical to the one in Section 3.3.2.)

It follows that both mappings received the same gem handle from the kernel via xdxgpu_bo_import_from_pvr_handle->drmCommandWriteRead; otherwise the second mapping's xdxgpu_lookup_bo would not have found the bo.

It also follows that on the second mapping, bFound was true inside the KMD's xdx_gem_import_from_pvr_ioctl and file == filp. bFound is true because the KMD creates only one information page PMR at initialization: the first mapping created the gem object and attached it to a drm file on the drm device, so the second mapping finds it there.

Why does file == filp hold? Both PVRDRICreateScreen calls open /dev/dri/card0, so on the second mapping fd_compare returns 0 and xdxgpu_bo_import_from_pvr_handle reuses the device node that fd_compare matched. The fd passed to drmCommandWriteRead is that device node's fd, so both mappings issue the ioctl through the same fd, and therefore file == filp.

4. Root cause

As the analysis shows, the two mappings of the information page share one virtual address. When xgltest1 exits, Xorg tears one of the mappings down; that unmaps the shared virtual address and the kernel frees the corresponding vma. When Xorg then dereferences the now-unmapped address, the page fault handler finds no vma for it and the kernel sends the process SIGSEGV.

The unmap code:

266 IMG_INTERNAL void
267 OSMUnmapPMR(IMG_HANDLE hBridge,
268             IMG_HANDLE hPMR,
269             IMG_HANDLE hOSMMapPrivData,
270             void *pvMappingAddress,
271             size_t uiMappingLength)
272 {
........
303     xdxgpu_bo_unmap((xdxgpu_handle)hOSMMapPrivData);
........

216 drm_public void xdxgpu_bo_unmap(xdxgpu_handle bo)
217 {
218     struct xdxgpu_bo *xbo = (struct xdxgpu_bo *)bo;
219
220     assert(xbo != NULL);
221
222     if (xbo->ptr) {
223         drm_munmap(xbo->ptr, xbo->size);
224         xbo->ptr = NULL;
225     }
226
227     xdxgpu_bo_put(xbo);
228 }

As the code shows, the unmap path never checks whether the bo is still mapped by anyone else; it simply tears the mapping down.
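
To make the failure mode concrete, here is a hedged sketch of the sequence using the libdrm functions shown above. hPMR stands for the information page's PMR handle; error handling and the intermediate xdxgpu_bo_destroy/xdxgpu_device_destroy calls from OSMMapPMR are omitted (they do not drop the counts to zero):

int fd1 = open("/dev/dri/card0", O_RDWR);   /* 1st PVRDRICreateScreen */
int fd2 = open("/dev/dri/card0", O_RDWR);   /* 2nd PVRDRICreateScreen */
xdxgpu_handle dev1, dev2, bo1, bo2;
void *p1, *p2;

xdxgpu_device_create(fd1, &dev1);
xdxgpu_device_create(fd2, &dev2);           /* fd_compare() matches: dev2 == dev1 */

xdxgpu_bo_import(dev1, XDXGPU_BO_PVR_HANDLE, (IMG_INT64)hPMR, &bo1);
xdxgpu_bo_import(dev2, XDXGPU_BO_PVR_HANDLE, (IMG_INT64)hPMR, &bo2);
                                            /* kernel returns the same gem handle,
                                             * xdxgpu_lookup_bo hits: bo2 == bo1 */
xdxgpu_bo_map(bo1, &p1);
xdxgpu_bo_map(bo2, &p2);                    /* bo->ptr already set: p2 == p1 */

xdxgpu_bo_unmap(bo1);                       /* munmap()s the shared mapping */
*(volatile unsigned int *)p2;               /* SIGSEGV: the vma is gone */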

5. Fixes

5.1 Option 1: change fd_compare

5.1.1 The change

As analyzed above, fd_compare matches by path within the process: it returns 0 as soon as the two fds resolve to the same file name (path included). If fd_compare instead checked whether the two fds refer to the same kernel file, the problem would be solved: PVRDRICreateScreen opens /dev/dri/card0 twice, each open creates its own struct file in the kernel, so the two fds represent different files.

The modified code:

 20 static int kcmp(pid_t pid1, pid_t pid2, int type, unsigned long idx1,
 21         unsigned long idx2)
 22 {
 23     return syscall(SYS_kcmp, pid1, pid2, type, idx1, idx2);
 24 }
 25
 26 static int fd_compare(int fd1, int fd2)
 27 {
 28     pid_t pid = getpid();
 29
 30     return  kcmp(pid, pid, KCMP_FILE, (long)fd1, (long)fd2);
 31 }

Because SYS_kcmp compares the kernel's struct file pointers (see kernel/kcmp.c), fd_compare no longer returns 0 on the second mapping. A second device node is therefore created; xdxgpu_bo_import_from_pvr_handle then creates a new bo (xdxgpu_lookup_bo cannot find the gem handle on the fresh node), xdxgpu_bo_map performs a fresh mmap and returns a different virtual address, and the two mappings no longer interfere.
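
A quick hedged demonstration of the new semantics (assumes a Linux kernel with kcmp available, plus the fd_compare/kcmp definitions above; the build needs <sys/syscall.h>, <linux/kcmp.h>, <unistd.h> and <fcntl.h>):

#include <fcntl.h>
#include <stdio.h>
/* ... fd_compare()/kcmp() as defined above ... */

int main(void)
{
    int fd1 = open("/dev/dri/card0", O_RDWR);
    int fd2 = open("/dev/dri/card0", O_RDWR);   /* second open: new struct file */
    int fd3 = fcntl(fd1, F_DUPFD_CLOEXEC, 0);   /* dup: same struct file as fd1 */

    printf("fd1 vs fd2: %d\n", fd_compare(fd1, fd2));  /* != 0: different files */
    printf("fd1 vs fd3: %d\n", fd_compare(fd1, fd3));  /* == 0: same file */
    return 0;
}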

5.1.2 Drawbacks

  1. This narrows the scope of a device node from one per process to one per open file: a process that opens the same device file several times now creates that many device nodes. This wastes some memory, and the dev_list walk plus node allocation in xdxgpu_device_create costs a little performance (probably negligible).
  2. If a process maps the same PMR handle twice through fds that refer to the same open file (the same struct file in the kernel), the original problem remains.

5.2 Option 2: add a map_count field

Add an atomic member map_count (atomic_t) to the bo, set to 0 when the bo is created. xdxgpu_bo_map increments map_count. xdxgpu_bo_unmap decrements it and checks the result: if map_count reached 0, the mapping is torn down; otherwise the function returns without unmapping. The patch:

@@ -174,6 +176,8 @@ drm_public int xdxgpu_bo_map(xdxgpu_handle bo, void **cpu)

        assert(xbo != NULL);

+       atomic_inc(&xbo->map_count);
+
        if (xbo->ptr) {
                *cpu = xbo->ptr;
                return 0;
@@ -207,14 +211,30 @@ drm_public int xdxgpu_bo_map(xdxgpu_handle bo, void **cpu)
        return 0;
 }

 drm_public void xdxgpu_bo_unmap(xdxgpu_handle bo)
 {
        struct xdxgpu_bo *xbo = (struct xdxgpu_bo *)bo;

        assert(xbo != NULL);

+       if (!atomic_dec_and_test(&xbo->map_count))
+               return;
+
        if (xbo->ptr) {
@@ -367,6 +387,7 @@ static int xdxgpu_bo_import_from_pvr_handle(struct xdxgpu_device *xdev,
        xbo->gem_handle = gem_handle;
        xbo->size = info.size;
        atomic_set(&xbo->refcount, 1);
+       atomic_set(&xbo->map_count, 0);
The reference counting then works as follows: creating a bo increments its device node's refcount; the first map of a bo increments the bo's refcount, and later maps only increment map_count. Each unmap decrements map_count; when map_count hits 0 the bo's refcount is decremented; when the bo's refcount hits 0 the bo's resources are freed and the device node's refcount is decremented; when that hits 0 the device node's resources are freed.
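
Traced through the two-connection sequence from Section 3, the patch behaves as follows: the first map sets map_count to 1 and performs the mmap; the second map raises map_count to 2 and returns the same address; the first unmap drops map_count to 1 and returns early, leaving the mapping intact for the other connection; only the second unmap drops map_count to 0, calls munmap, and releases the bo reference.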

This way a process keeps a single device node on the global dev_list even if it opens the same device file several times (see xdxgpu_device_create).

A closer look at the device node's reference count and the fds:

  1. Suppose a process opens the same device file twice, getting fd1 and fd2, and calls xdxgpu_device_create once with each.
  2. The first call (with fd1) creates the device node, initializes its reference count to 1, and fcntl-duplicates fd1 into dup_fd1, which is stored in the node. The struct file behind fd1 now has a reference count of 2.
  3. The second call (with fd2) finds the existing node, and xdxgpu_device_get raises its reference count to 2.
  4. After close(fd1), the struct file behind fd1 has a reference count of 1.
  5. The first xdxgpu_device_destroy on the node drops its reference count to 1 via xdxgpu_device_put.
  6. The second xdxgpu_device_destroy drops it to 0, which ends in xdxgpu_device_free->close(dev->fd, i.e. dup_fd1) and free(dev). The close drops the struct file's count to 0, so the kernel releases the file; free(dev) releases the device node.

This shows that closing fd1 disturbs nothing else, and it explains why xdxgpu_device_create stores an fcntl duplicate instead of fd1 itself. fd2 can simply be closed whenever it is no longer needed.

6. Summary

The key to solving this problem was familiarity with the information page's purpose and mapping flow. After ruling out modification of the virtual address, we inferred that the address being accessed had already been unmapped, and then verified that inference step by step.
