1. Background
The new-architecture KMD is in use, with our own bo (buffer object) management, device management, and related functionality implemented in libdrm. The problem described here appeared on the new architecture.
In the old architecture, the GPU, DC, and PCIE were three separate kernel modules. The GPU and DC each registered a DRM driver of their own, so they had different device files (/dev/dri/xx) and their mmap paths could not affect each other. The new architecture merges them into one module that registers a single DRM driver, so the GPU and DC share one device file and there is only one mmap entry point.
In the old architecture the GPU mapped memory by PMR handle while the DC mapped by the mmap offset of a gem object (gem handle). In the new architecture both the GPU and the DC map through the DRM framework's gem object mmap offset (gem handle), so a PMR handle must first be converted into a gem object mmap offset before the mapping can be performed.
2. Problem Description
1. Load the new-architecture KMD kernel module.
2. Start Xorg.
3. Run xgltest1; an image is displayed.
4. Exit xgltest1.
5. Xorg crashes with a segmentation fault.
The error log is as follows:
........
PVR: EGL rendertarget cache stats:
PVR: Hits: 0
PVR: Misses: 0
PVR: High watermark: 0
PVRSRVGetClientEventFilter 436 psDevConnection->pui32InfoPage:0x7f6a182b0000 offset=5
PVRSRVGetClientEventFilter 442 psDevConnection->pui32InfoPage:0x7f6a182b0000 offset=5
OSMUnmapPMR 300 gInfoPage=0x7f6a182b0000
xdxgpu_bo_unmap 222 unmap_addr=0x7f6a182b0000
PVRSRVGetClientEventFilter 436 psDevConnection->pui32InfoPage:0x7f6a182b0000 offset=5
(EE)
(EE) Backtrace:
(EE) 0: /usr/local/bin/Xorg (xorg_backtrace+0xbf) [0x5636fd331098]
(EE) 1: /usr/local/bin/Xorg (0x5636fd17b000+0x1ba895) [0x5636fd335895]
(EE) 2: /lib/x86_64-linux-gnu/libc.so.6 (0x7f6a15d4d000+0x3ef10) [0x7f6a15d8bf10]
(EE) 3: /usr/lib/x86_64-linux-gnu/libsrv_um.so (PVRSRVGetClientEventFilter+0xf2) [0x7f6a110498e9]
(EE) 4: /usr/lib/x86_64-linux-gnu/libGLESv2_PVR_MESA.so (0x7f6a10623000+0x274528) [0x7f6a10897528]
(EE) 5: /usr/lib/x86_64-linux-gnu/libGLESv2_PVR_MESA.so (0x7f6a10623000+0x274f67) [0x7f6a10897f67]
(EE) 6: /usr/lib/x86_64-linux-gnu/libGLESv2_PVR_MESA.so (0x7f6a10623000+0x275674) [0x7f6a10898674]
(EE) 7: /usr/lib/x86_64-linux-gnu/libGLESv2_PVR_MESA.so (0x7f6a10623000+0x27aaff) [0x7f6a1089daff]
(EE) 8: /usr/lib/x86_64-linux-gnu/libGLESv2_PVR_MESA.so (0x7f6a10623000+0x277a52) [0x7f6a1089aa52]
(EE) 9: /usr/lib/x86_64-linux-gnu/libGLESv2_PVR_MESA.so (0x7f6a10623000+0x27ada6) [0x7f6a1089dda6]
(EE) 10: /usr/lib/x86_64-linux-gnu/libGLESv2_PVR_MESA.so (0x7f6a10623000+0x27ae09) [0x7f6a1089de09]
(EE) 11: /usr/lib/x86_64-linux-gnu/libGLESv2_PVR_MESA.so (0x7f6a10623000+0x13d47f) [0x7f6a1076047f]
(EE) 12: /usr/lib/x86_64-linux-gnu/libGLESv2_PVR_MESA.so (0x7f6a10623000+0x13a130) [0x7f6a1075d130]
(EE) 13: /usr/lib/x86_64-linux-gnu/libGLESv2_PVR_MESA.so (0x7f6a10623000+0x13a798) [0x7f6a1075d798]
(EE) 14: /usr/lib/x86_64-linux-gnu/libGLESv2_PVR_MESA.so (0x7f6a10623000+0x7ae5d) [0x7f6a1069de5d]
(EE) 15: /usr/lib/x86_64-linux-gnu/libGLESv2_PVR_MESA.so (glDeleteFramebuffers+0x41a) [0x7f6a1076a126]
(EE) 16: /usr/local/lib/xorg/modules/libglamoregl.so (0x7f6a121c8000+0x30c56) [0x7f6a121f8c56]
(EE) 17: /usr/local/lib/xorg/modules/libglamoregl.so (0x7f6a121c8000+0x31694) [0x7f6a121f9694]
(EE) 18: /usr/local/lib/xorg/modules/libglamoregl.so (glamor_close_screen+0x1be) [0x7f6a121d3f95]
(EE) 19: /usr/local/lib/xorg/modules/libglamoregl.so (0x7f6a121c8000+0x93b7) [0x7f6a121d13b7]
(EE) 20: /usr/local/bin/Xorg (0x5636fd17b000+0x1cbcab) [0x5636fd346cab]
(EE) 21: /usr/local/bin/Xorg (0x5636fd17b000+0x13de29) [0x5636fd2b8e29]
(EE) 22: /usr/local/bin/Xorg (0x5636fd17b000+0x3ff61) [0x5636fd1baf61]
(EE) 23: /usr/local/bin/Xorg (0x5636fd17b000+0xf3c2e) [0x5636fd26ec2e]
(EE) 24: /usr/local/bin/Xorg (0x5636fd17b000+0x4c813) [0x5636fd1c7813]
(EE) 25: /usr/local/bin/Xorg (0x5636fd17b000+0x526ed) [0x5636fd1cd6ed]
(EE) 26: /usr/local/bin/Xorg (0x5636fd17b000+0x24374d) [0x5636fd3be74d]
(EE) 27: /usr/local/lib/xorg/modules/drivers/modesetting_drv.so (0x7f6a12623000+0xd5c5) [0x7f6a126305c5]
(EE) 28: /usr/local/bin/Xorg (0x5636fd17b000+0xd2676) [0x5636fd24d676]
(EE) 29: /usr/local/bin/Xorg (0x5636fd17b000+0x218f48) [0x5636fd393f48]
(EE) 30: /usr/local/bin/Xorg (0x5636fd17b000+0x1faef4) [0x5636fd375ef4]
(EE) 31: /usr/local/bin/Xorg (0x5636fd17b000+0x1e98c4) [0x5636fd3648c4]
(EE) 32: /usr/local/bin/Xorg (0x5636fd17b000+0x134c45) [0x5636fd2afc45]
(EE) 33: /usr/local/bin/Xorg (0x5636fd17b000+0x2026ad) [0x5636fd37d6ad]
(EE) 34: /usr/local/bin/Xorg (0x5636fd17b000+0x10e3f6) [0x5636fd2893f6]
(EE) 35: /usr/local/bin/Xorg (0x5636fd17b000+0x20b1db) [0x5636fd3861db]
(EE) 36: /usr/local/bin/Xorg (0x5636fd17b000+0x13ebdf) [0x5636fd2b9bdf]
(EE) 37: /usr/local/bin/Xorg (0x5636fd17b000+0xf4b6d) [0x5636fd26fb6d]
(EE) 38: /usr/local/bin/Xorg (0x5636fd17b000+0xca093) [0x5636fd245093]
(EE) 39: /usr/local/lib/xorg/modules/extensions/libglx.so (0x7f6a14b7f000+0x3c32d) [0x7f6a14bbb32d]
(EE) 40: /usr/local/bin/Xorg (0x5636fd17b000+0x83bc7) [0x5636fd1febc7]
(EE) 41: /usr/local/bin/Xorg (0x5636fd17b000+0x2504b5) [0x5636fd3cb4b5]
(EE) 42: /lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main+0xe7) [0x7f6a15d6ec87]
(EE) 43: /usr/local/bin/Xorg (_start+0x2a) [0x5636fd1a951a]
(EE)
(EE) Segmentation fault at address 0x7f6a182b0014
(EE)
Fatal server error:
(EE) Caught signal 11 (Segmentation fault). Server aborting
(EE)
(EE)
Please consult the The X.Org Foundation support
at http://wiki.x.org
for help.
(EE) Please also check the log file at "/usr/local/var/log/Xorg.0.log" for additional information.
(EE)
ACPI: Closing device
(EE) Server terminated with error (1). Closing log file.
3. Problem Analysis
3.1 Core dump analysis
Raise the core file size limit with ulimit, reproduce the problem to obtain a core dump, and debug it with gdb, which gives the following call stack:
gdb /usr/local/bin/Xorg /var/core/core_Xorg_24746
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007f72fc03e7f1 in __GI_abort () at abort.c:79
#2 0x0000556446cce811 in OsAbort () at ../source/os/utils.c:1351
#3 0x0000556446cd7db6 in AbortServer () at ../source/os/log.c:872
#4 0x0000556446cd82d6 in FatalError (f=0x556446d7d2e0 "Caught signal %d (%s). Server aborting\n") at ../source/os/log.c:1010
#5 0x0000556446cca93e in OsSigHandler (signo=11, sip=0x7ffedc5476f0, unused=0x7ffedc5475c0) at ../source/os/osinit.c:156
#6 <signal handler called>
#7 0x00007f72f72fa911 in PVRSRVGetClientEventFilter (psDevConnection=0x556448e30360, eApi=1) at services/client/common/hwperf_client.c:439
#8 0x00007f72f6b48528 in PVRSRVFenceWait (psDevConnection=0x556448e30360, hFence=29, ui32TimeoutInMs=0) at include/pvrsrv_sync_um.h:349
#9 0x00007f72f6b48f67 in RM_ANF_Check (psSysContext=0x556448e063c0, hFence=29) at common/resourceman.c:427
#10 0x00007f72f6b49674 in RMTask_IsComplete (psCtx=0x556448eceee0, psTask=0x556449318fb0) at common/resourceman.c:704
#11 0x00007f72f6b4eaff in RM_GetJobState (psCtx=0x556448eceee0, psHWQ=0x556448f1e080, uiJobNumber=26) at common/resourceman.c:4411
#12 0x00007f72f6b4ba52 in RM_GetResourceState (psCtx=0x556448eceee0, psHistory=0x5564491c84d0, eUsageMask=RM_USAGE_READ_WRITE, ui32StateCheckMask=3)
at common/resourceman.c:2026
#13 0x00007f72f6b4eda6 in RM_IsResourceNeededBy3D_NoLock (psCtx=0x556448eceee0, psResource=0x556449116508, eUsageMask=RM_USAGE_READ_WRITE)
at common/resourceman.c:4674
#14 0x00007f72f6b4ee09 in RM_IsResourceNeededBy3D (psCtx=0x556448eceee0, psResource=0x556449116508, eUsageMask=RM_USAGE_READ_WRITE)
at common/resourceman.c:4712
#15 0x00007f72f6a1147f in FreeFBOStaticPrograms (gc=0x556448eceee0, psFBOStaticPrograms=0x556449116508) at opengles3/volcanic/fbo.c:2997
#16 0x00007f72f6a0e130 in FreeFrameBuffer (gc=0x556448eceee0, psFrameBuffer=0x556449115420) at opengles3/volcanic/fbo.c:1157
#17 0x00007f72f6a0e798 in DisposeFrameBufferObject (gc=0x556448eceee0, psNamedItem=0x556449115420, bIsShutdown=IMG_FALSE) at opengles3/volcanic/fbo.c:1408
#18 0x00007f72f694ee5d in NamedItemDelRefByName (gc=0x556448eceee0, psNamesArray=0x556448eefce0, ui32Num=1, ui32Name=0x5564492de2f4)
at opengles3/names.c:1302
#19 0x00007f72f6a1b126 in glDeleteFramebuffers (n=1, framebuffers=0x5564492de2f4) at opengles3/volcanic/fbo.c:7144
#20 0x00007f72f84a9c56 in glamor_destroy_fbo (glamor_priv=0x556448f29bf0, fbo=0x5564492de2f0) at ../source/glamor/glamor_fbo.c:40
#21 0x00007f72f84aa694 in glamor_pixmap_destroy_fbo (pixmap=0x55644924fbd0) at ../source/glamor/glamor_fbo.c:311
#22 0x00007f72f8484f95 in glamor_close_screen (screen=0x556448f27e60) at ../source/glamor/glamor.c:811
#23 0x00007f72f84823b7 in glamor_egl_close_screen (screen=0x556448f27e60) at ../source/glamor/glamor_egl.c:776
#24 0x0000556446cdbcab in dri3_close_screen (screen=0x556448f27e60) at ../source/dri3/dri3.c:44
#25 0x0000556446c4de29 in SyncCloseScreen (pScreen=0x556448f27e60) at ../source/miext/sync/misync.c:161
#26 0x0000556446b4ff61 in miDCCloseScreen (pScreen=0x556448f27e60) at ../source/mi/midispcur.c:155
#27 0x0000556446c03c2e in damageCloseScreen (pScreen=0x556448f27e60) at ../source/miext/damage/damage.c:1605
#28 0x0000556446b5c813 in miPointerCloseScreen (pScreen=0x556448f27e60) at ../source/mi/mipointer.c:170
#29 0x0000556446b626ed in miSpriteCloseScreen (pScreen=0x556448f27e60) at ../source/mi/misprite.c:379
#30 0x0000556446d5374d in xf86CursorCloseScreen (pScreen=0x556448f27e60) at ../source/hw/xfree86/ramdac/xf86CursorRD.c:151
#31 0x00007f72f88e15c5 in CloseScreen (pScreen=0x556448f27e60) at ../source/hw/xfree86/drivers/modesetting/driver.c:1924
#32 0x0000556446be2676 in RRCloseScreen (pScreen=0x556448f27e60) at ../source/randr/randr.c:112
#33 0x0000556446d28f48 in xf86CrtcCloseScreen (screen=0x556448f27e60) at ../source/hw/xfree86/modes/xf86Crtc.c:785
#34 0x0000556446d0aef4 in DGACloseScreen (pScreen=0x556448f27e60) at ../source/hw/xfree86/common/xf86DGA.c:288
#35 0x0000556446cf98c4 in CMapCloseScreen (pScreen=0x556448f27e60) at ../source/hw/xfree86/common/xf86cmap.c:250
#36 0x0000556446c44c45 in XvCloseScreen (pScreen=0x556448f27e60) at ../source/Xext/xvmain.c:309
#37 0x0000556446d126ad in xf86XVCloseScreen (pScreen=0x556448f27e60) at ../source/hw/xfree86/common/xf86xv.c:1168
#38 0x0000556446c1e3f6 in present_close_screen (screen=0x556448f27e60) at ../source/present/present_screen.c:70
---Type <return> to continue, or q <return> to quit---
#39 0x0000556446d1b1db in VGAarbiterCloseScreen (pScreen=0x556448f27e60) at ../source/hw/xfree86/common/xf86VGAarbiter.c:262
#40 0x0000556446c4ebdf in CursorCloseScreen (pScreen=0x556448f27e60) at ../source/xfixes/cursor.c:205
#41 0x0000556446c04b6d in AnimCurCloseScreen (pScreen=0x556448f27e60) at ../source/render/animcur.c:100
#42 0x0000556446bda093 in compCloseScreen (pScreen=0x556448f27e60) at ../source/composite/compinit.c:86
#43 0x00007f72fae6c32d in glxCloseScreen (pScreen=0x556448f27e60) at ../source/glx/glxscreens.c:171
#44 0x0000556446b93bc7 in dix_main (argc=4, argv=0x7ffedc548d68, envp=0x7ffedc548d90) at ../source/dix/main.c:325
#45 0x0000556446d604b5 in main (argc=4, argv=0x7ffedc548d68, envp=0x7ffedc548d90) at ../source/dix/stubmain.c:34
(gdb)
The stack shows that Xorg hit the segmentation fault while executing line 439 of PVRSRVGetClientEventFilter. The relevant code is:
415 IMG_EXPORT IMG_UINT32 PVRSRVGetClientEventFilter(PVRSRV_DEV_CONNECTION *psDevConnection,
416 RGX_HWPERF_CLIENT_API eApi)
417 {
418 HWPERF_CONTEXT *psContext;
419
420 PVR_ASSERT(psDevConnection != NULL);
421 PVR_LOG_RETURN_IF_FALSE(eApi < RGX_HWPERF_CLIENT_API_MAX &&
422 eApi != RGX_HWPERF_CLIENT_API_INVALID,
423 "eApi invalid", 0);
424
425 psContext = psDevConnection->pvHWPerfWriteContext;
426
427 /* If this filter is != 0 it means that stream has already been opened
428 * during the initialisation. Since AppHints have precedence over
429 * the value set by the server we can just 'return' here. */
430 if (psContext->ui32APIEventsFilter[eApi] != 0)
431 {
432 return psContext->ui32APIEventsFilter[eApi];
433 }
434
435 //xxxx modify
436 printf("%s %d psDevConnection->pui32InfoPage:0x%lx offset=%u\n", __func__, __LINE__, (long)psDevConnection->pui32InfoPage, _ApiToInfoPageIdx(eApi));
437
438 /* Prevent lazy initialisation if the filters haven't been set yet. */
439 if (psDevConnection->pui32InfoPage[_ApiToInfoPageIdx(eApi)] == 0)
440 {
441 //xxxx modify
442 printf("%s %d psDevConnection->pui32InfoPage:0x%lx offset=%u\n", __func__, __LINE__, (long)psDevConnection->pui32InfoPage, _ApiToInfoPageIdx(eApi ));
443 return 0;
444 }
.........
Line 439 is where Xorg accesses the information page. During KMD initialisation, a PMR (physical memory resource) is created for the information page; it records configuration such as the maximum number of timeout retries. Xorg maps this PMR to a user-space virtual address, which becomes psDevConnection->pui32InfoPage, so that Xorg can read the KMD's information page.
Once the KMD has created the information page PMR it normally never modifies it again, and Xorg, after obtaining the user-space virtual address, only reads the configuration and never writes to it.
3.2 Checking whether pui32InfoPage was modified
Since pui32InfoPage is a pointer member of the psDevConnection structure, first check whether the pointer itself was overwritten.
Debug output was added where pui32InfoPage is first set and again just before line 439 of PVRSRVGetClientEventFilter, printing its value each time. This confirmed that the value of pui32InfoPage never changes, and that _ApiToInfoPageIdx(eApi) stays within the bounds of the information page, so there is no out-of-bounds access.
3.3 Hypotheses
3.3.1 Use-after-unmap hypothesis
With modification of pui32InfoPage ruled out, the most likely cause of the segmentation fault is that Xorg unmapped the information page and then accessed it again.
To verify this hypothesis, debug output was added to xdxgpu_bo_unmap to print the address being unmapped, which produced the following log:
PVR: EGL rendertarget cache stats:
PVR: Hits: 0
PVR: Misses: 0
PVR: High watermark: 0
PVRSRVGetClientEventFilter 436 psDevConnection->pui32InfoPage:0x7f6a182b0000 offset=5
PVRSRVGetClientEventFilter 442 psDevConnection->pui32InfoPage:0x7f6a182b0000 offset=5
OSMUnmapPMR 300 gInfoPage=0x7f6a182b0000
xdxgpu_bo_unmap 222 unmap_addr=0x7f6a182b0000
PVRSRVGetClientEventFilter 436 psDevConnection->pui32InfoPage:0x7f6a182b0000 offset=5
(EE)
(EE) Backtrace: (identical to the backtrace in Section 2; omitted here)
(EE)
(EE) Segmentation fault at address 0x7f6a182b0014
The log shows that after xdxgpu_bo_unmap unmapped address 0x7f6a182b0000 (the virtual address of the information page), PVRSRVGetClientEventFilter accessed the information page again, causing the segmentation fault.
3.3.2 Finding the unmap call path
So who did the unmap? Being unfamiliar with the Xorg code paths, to obtain the unmap call chain quickly, code was added to AcquireInfoPage (the function that sets pui32InfoPage) to store the value of pui32InfoPage in a newly defined global variable gInfoPage, and to OSMUnmapPMR to print a message whenever the address being unmapped equals gInfoPage. The code is as follows:
154 unsigned int* gInfoPage = NULL;
155
156 IMG_INTERNAL
157 PVRSRV_ERROR AcquireInfoPage(PVRSRV_DEV_CONNECTION *psDevConnection)
158 {
159 PVRSRV_ERROR eError;
160 IMG_HANDLE hSrvHandle;
161 IMG_DEVMEM_SIZE_T uiImportSize;
162
163 PVR_ASSERT(psDevConnection != NULL);
164
165 /* Obtain the services connection handle */
166 hSrvHandle = GetSrvHandle(psDevConnection);
167
168 /* Acquire server information page (placed in process handle table) */
169 eError = BridgeAcquireInfoPage(hSrvHandle, &psDevConnection->hInfoPagePMR);
170 PVR_LOG_GOTO_IF_ERROR(eError, "BridgeAcquireInfoPage", e0);
171
172 /* Now convert this physical memory into a client memory descriptor
173 * handle */
174 eError = DevmemLocalImport(psDevConnection,
175 psDevConnection->hInfoPagePMR,
176 PVRSRV_MEMALLOCFLAG_CPU_READABLE,
177 &psDevConnection->psImportMemDesc,
178 &uiImportSize,
179 "InfoPageBuffer");
180 PVR_LOG_GOTO_IF_ERROR(eError, "DevmemLocalImport", e1);
181
182 /* Now map the client memory handle into the virtual address space of this
183 * process */
184 eError = DevmemAcquireCpuVirtAddr(psDevConnection->psImportMemDesc,
185 (void**) &psDevConnection->pui32InfoPage);
186
187 gInfoPage = psDevConnection->pui32InfoPage; // debug: record the info page address
188 printf("%s %d psDevConnection->pui32InfoPage=0x%lx\n", __func__, __LINE__, (unsigned long)psDevConnection->pui32InfoPage);
.......
264 extern unsigned int * gInfoPage;
265
266 IMG_INTERNAL void
267 OSMUnmapPMR(IMG_HANDLE hBridge,
268 IMG_HANDLE hPMR,
269 IMG_HANDLE hOSMMapPrivData,
270 void *pvMappingAddress,
271 size_t uiMappingLength)
272 {
273 #ifndef YAJUN_MMAP
274 IMG_INT32 iStatus;
275 #endif
276 size_t iSizeAsSizeT;
277
278 PVR_UNREFERENCED_PARAMETER(hBridge);
279 PVR_UNREFERENCED_PARAMETER(hPMR);
280 #ifndef YAJUN_MMAP
281 PVR_UNREFERENCED_PARAMETER(hOSMMapPrivData);
282
283 PVR_ASSERT(hOSMMapPrivData == pvMappingAddress);
284 #else
285 PVR_UNREFERENCED_PARAMETER(pvMappingAddress);
286 #endif
287
288 /* We generically use IMG_DEVMEM_SIZE_T throughout, but, munmap
289 requires that we use a size_t. We have to assert that the
290 length being mapped didn't exceed the max that size_t can
291 handle. This should be checked in OSMMapPMR() */
292 iSizeAsSizeT = (size_t) uiMappingLength;
293 PVR_ASSERT(iSizeAsSizeT == uiMappingLength);
294
295 PVRSRVBridgeLog_StartTimer();
296
297 #if 1
298 if (*((unsigned int **)hOSMMapPrivData) == gInfoPage) // debug: is this the info page?
299 {
300 printf("%s %d gInfoPage=0x%lx\n", __func__, __LINE__, (unsigned long)gInfoPage);
301 }
302 #endif
303 xdxgpu_bo_unmap((xdxgpu_handle)hOSMMapPrivData);
.......
Then start Xorg, attach gdb to it, set a breakpoint at line 300 of OSMUnmapPMR (the debug printf above), and run xgltest1 to reproduce the problem.
This finally yields the following call stack:
(gdb) c
Continuing.
[Thread 0x7fc4184e2700 (LWP 2618) exited]
Thread 1 "Xorg" hit Breakpoint 1, OSMUnmapPMR (hBridge=0x560f5fba2ff0, hPMR=0x2, hOSMMapPrivData=0x560f5f78f770, pvMappingAddress=0x7f6a182b0000,
uiMappingLength=4096) at services/client/env/linux/osmmap.c:300
300 printf("%s %d gInfoPage=0x%lx\n", __func__, __LINE__, (unsigned long)gInfoPage);
(gdb) bt
#0 OSMUnmapPMR (hBridge=0x560f5fba2ff0, hPMR=0x2, hOSMMapPrivData=0x560f5f78f770, pvMappingAddress=0x7f6a182b0000, uiMappingLength=4096)
at services/client/env/linux/osmmap.c:300
#1 0x00007fc41d87a946 in DevmemImportStructCPUUnmap (psImport=0x560f5fbf55c0) at services/shared/common/devicemem_utils.c:1207
#2 0x00007fc41d8783db in DevmemReleaseCpuVirtAddr (psMemDesc=0x560f5fb93000) at services/shared/common/devicemem.c:2631
#3 0x00007fc41d83a73b in ReleaseInfoPage (psDevConnection=0x560f5fb92f40) at services/client/common/srvcore.c:215
#4 0x00007fc41d82e949 in ConnectionDestroy (psConnection=0x560f5fb92f40) at services/client/common/connection.c:454
#5 0x00007fc41d83aae4 in PVRSRVDisconnect (psConnection=0x560f5fb92f40) at services/client/common/srvcore.c:377
#6 0x00007fc41ddf11c9 in PVRDRIDestroyScreenImpl (psScreenImpl=0x560f5faca8f0) at lws/pvr_dri_support/pvrscreen_impl.c:321
#7 0x00007fc41ddf2ec3 in DRIMODDestroyScreen (psPVRScreen=0x560f5fb68830) at lws/pvr_dri_support/pvrdri_mod.c:250
#8 0x00007fc41e421241 in DRISUPDestroyScreen (psDRISUPScreen=0x560f5fb68830) at ../src/mesa/drivers/dri/pvr/pvrcompat.c:304
#9 0x00007fc41e422192 in PVRDRIScreenRemoveReference (psPVRScreen=0x560f5fc65430) at ../src/mesa/drivers/dri/pvr/pvrdri.c:96
#10 0x00007fc41e42277c in PVRDRIDestroyScreen (psDRIScreen=0x560f5fbfcf10) at ../src/mesa/drivers/dri/pvr/pvrdri.c:265
#11 0x00007fc41e425b2c in driDestroyScreen (psp=0x560f5fbfcf10) at ../src/mesa/drivers/dri/common/dri_util.c:239
#12 0x00007fc421376887 in __glXDRIscreenDestroy (baseScreen=0x560f5fa1e770) at ../source/glx/glxdri2.c:944
#13 0x00007fc4213a2319 in glxCloseScreen (pScreen=0x560f5f8d2e60) at ../source/glx/glxscreens.c:169
#14 0x0000560f5d521bc7 in dix_main (argc=4, argv=0x7ffce56e6328, envp=0x7ffce56e6350) at ../source/dix/main.c:325
#15 0x0000560f5d6ee4b5 in main (argc=4, argv=0x7ffce56e6328, envp=0x7ffce56e6350) at ../source/dix/stubmain.c:34
(gdb) c
Continuing.
Thread 1 "Xorg" received signal SIGSEGV, Segmentation fault.
0x00007fc41d8308e9 in PVRSRVGetClientEventFilter (psDevConnection=0x560f5f7db250, eApi=1) at services/client/common/hwperf_client.c:439
439 if (psDevConnection->pui32InfoPage[_ApiToInfoPageIdx(eApi)] == 0)
(gdb) info thread
Id Target Id Frame
* 1 Thread 0x7fc424a69d00 (LWP 2601) "Xorg" 0x00007fc41d8308e9 in PVRSRVGetClientEventFilter (psDevConnection=0x560f5f7db250, eApi=1)
at services/client/common/hwperf_client.c:439
(gdb) c
Continuing.
Thread 1 "Xorg" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb)
Continuing.
Program terminated with signal SIGABRT, Aborted.
The program no longer exists.
(gdb) quit
root@FPGA-test:/home/test#
So exiting xgltest1 triggers Xorg's call to glxCloseScreen, which in itself looks fine. Then why does Xorg access memory that has already been unmapped? The Xorg code has not been modified, so such a logic error is unlikely to come from there; the focus therefore stays on the information page. Adding debug output to PVRDRICreateScreen and AcquireInfoPage (see the code at line 188 of AcquireInfoPage above) produces the following log:
......
LoaderOpen(/usr/local/lib/xorg/modules/libglamoregl.so)
(II) Loading /usr/local/lib/xorg/modules/libglamoregl.so
(II) Module glamoregl: vendor="X.Org Foundation"
compiled for 1.20.13, module version = 1.0.1
--------------- PVRDRICreateScreen 263
--------------- PVRSRVConnectionCreate 353
--------------- ConnectionCreate 240 fd=15
-------OSMMapPMR 198 fd=15 PMR_handle=0x2 dev=0x563571c00db0 bo=0x563571be09a0 addr=0x7feffb7f8000
xxxx map success OSMMapPMR 235 pvUserVirtualAddress=0x7feffb7f8000
AcquireInfoPage 188 psDevConnection->pui32InfoPage=0x7feffb7f8000 // key line
ConnectionCreate 308 psConnection->pui32InfoPage=0x7feffb7f8000 TIMEOUT_INFO_VALUE_RETRIES=20000, TIMEOUT_INFO_VALUE_TIMEOUT_MS=1300, TIMEOUT_INFO_CONDITION_RETRIES=5, TIMEOUT_INFO_CONDITION_TIMEOUT_MS=400, TIMEOUT_INFO_TASK_QUEUE_RETRIES=10, TIMEOUT_INFO_TASK_QUEUE_FLUSH_TIMEOUT_MS=1000
fd_compare_test 47 name1:/dev/dri/card0 name2:/dev/dri/card0
fd_compare_test 47 name1:/dev/dri/card0 name2:/dev/dri/card0
......
Sync Extension 3.1
(II) AIGLX: Indirect GLX is not supported
(II) AIGLX: Indirect GLX is not supported
--------------- PVRDRICreateScreen 263
--------------- PVRSRVConnectionCreate 353
--------------- ConnectionCreate 240 fd=24
fd_compare_test 47 name1:/dev/dri/card0 name2:/dev/dri/card0
xdxgpu_device_create 96 not equal r1=1 r2=0 tmp->fd=16 fd=24
fd_compare_test 47 name1:/dev/dri/card0 name2:/dev/dri/card0
over!
fd_compare_test 47 name1:/dev/dri/card0 name2:/dev/dri/card0
-------OSMMapPMR 198 fd=24 PMR_handle=0x2 dev=0x563571c00db0 bo=0x563571be09a0 addr=0x7feffb7f8000
xxxx map success OSMMapPMR 235 pvUserVirtualAddress=0x7feffb7f8000
AcquireInfoPage 188 psDevConnection->pui32InfoPage=0x7feffb7f8000 // key line
ConnectionCreate 308 psConnection->pui32InfoPage=0x7feffb7f8000 TIMEOUT_INFO_VALUE_RETRIES=20000, TIMEOUT_INFO_VALUE_TIMEOUT_MS=1300, TIMEOUT_INFO_CONDITION_RETRIES=5, TIMEOUT_INFO_CONDITION_TIMEOUT_MS=400, TIMEOUT_INFO_TASK_QUEUE_RETRIES=10, TIMEOUT_INFO_TASK_QUEUE_FLUSH_TIMEOUT_MS=1000
fd_compare_test 47 name1:/dev/dri/card0 name2:/dev/dri/card0
xdxgpu_device_create 96 not equal r1=1 r2=0 tmp->fd=16 fd=24
fd_compare_test 47 name1:/dev/dri/card0 name2:/dev/dri/card0
......
The log shows that Xorg calls PVRDRICreateScreen twice. Each call opens /dev/dri/card0 to obtain an fd and then calls AcquireInfoPage to map the information page, and both calls get the same virtual address for the information page. Why do two separate user-space mappings of the information page return the same virtual address? To answer that, the information page mapping flow must be analysed.
3.3.3 Information page mapping flow
AcquireInfoPage ultimately calls OSMMapPMR to map the PMR identified by the PMR handle (a user-space handle that the KMD converts back into the corresponding PMR) into user space. The OSMMapPMR code is as follows:
50 IMG_INTERNAL PVRSRV_ERROR
51 OSMMapPMR(IMG_HANDLE hBridge,
52 IMG_HANDLE hPMR,
53 IMG_DEVMEM_SIZE_T uiPMRLength,
54 PVRSRV_MEMALLOCFLAGS_T uiFlags,
55 IMG_HANDLE *phOSMMapPrivDataOut,
56 void **ppvMappingAddressInOut,
57 size_t *puiMappingLengthOut)
58 {
59 //xxxx modify
60 PVRSRV_ERROR eError = PVRSRV_OK;
61 size_t uiMMapOffset;
62 size_t uiMMapLength;
63 IMG_UINT32 uiMMapProtFlags = 0;
64 #if defined(MAP_FIXED_NOREPLACE) || !defined(YAJUN_MMAP)
65 IMG_UINT32 uiMMapFlags = MAP_SHARED;
66 #endif
67 void *pvUserVirtualAddress = NULL;
68 void *hOSMMapPrivDataOut = NULL;
69 //xxxx modify
70 #ifndef YAJUN_MMAP
71 IMG_INT iResult, iErrno;
72 #endif
73
74 uiMMapOffset = (size_t)hPMR;
75
76 uiMMapLength = (size_t)uiPMRLength;
77 /* To justify above cast, we assert that no significant bits were lost */
78 PVR_ASSERT((IMG_DEVMEM_SIZE_T)uiMMapLength == uiPMRLength);
..............
171 {
172 IMG_INT ret;
173 xdxgpu_handle devHandle = NULL;
174 IMG_INT drm_fd = *(IMG_INT *)hBridge;
175 xdxgpu_handle bo = NULL;
176
177 ret = xdxgpu_device_create(drm_fd, &devHandle);
178 if (ret) {
179 eError = PVRSRV_ERROR_DEVICEMEM_MAP_FAILED;
180 goto mmap_xdxgpu_end;
181 }
182
183 ret = xdxgpu_bo_import(devHandle,
184 XDXGPU_BO_PVR_HANDLE, (IMG_INT64)hPMR, &bo);
185 if (ret) {
186 printf("xxxx map error %s %d ret=%d \n", __func__, __LINE__, ret);
187 eError = PVRSRV_ERROR_DEVICEMEM_MAP_FAILED;
188 goto mmap_xdxgpu_end;
189 }
190
191 ret = xdxgpu_bo_map(bo, &pvUserVirtualAddress);
192 if (ret) {
193 printf("xxxx map error %s %d ret=%d \n", __func__, __LINE__, ret);
194 eError = PVRSRV_ERROR_DEVICEMEM_MAP_FAILED;
195 }
196
197 hOSMMapPrivDataOut = bo;
198 mmap_xdxgpu_end:
199 if (bo)
200 xdxgpu_bo_destroy(bo);
201 if (devHandle)
202 xdxgpu_device_destroy(devHandle);
203 if (eError != PVRSRV_OK) {
204 printf("xxxx map error %s %d ret=%d \n", __func__, __LINE__, ret);
205 goto e0;
206 }
207 }
..............
OSMMapPMR maps the PMR identified by the PMR handle into user space through the following libdrm functions:
- xdxgpu_device_create
- xdxgpu_bo_import
- xdxgpu_bo_map
xdxgpu_bo_import is a thin wrapper around xdxgpu_bo_import_from_pvr_handle, so only the code of xdxgpu_bo_import_from_pvr_handle is listed.
The code of these functions is as follows:
static int fd_compare(int fd1, int fd2)
{
char *name1 = drmGetPrimaryDeviceNameFromFd(fd1);
char *name2 = drmGetPrimaryDeviceNameFromFd(fd2);
int result;
if (name1 == NULL || name2 == NULL) {
free(name1);
free(name2);
return 0; /* name lookup failed: fds treated as matching */
}
result = strcmp(name1, name2);
free(name1);
free(name2);
return result;
}
81 drm_public int xdxgpu_device_create(int fd, xdxgpu_handle *devHandle)
82 {
83 struct xdxgpu_device *dev = NULL, *tmp;
84 drmVersionPtr version;
85 int ret;
86
87 pthread_mutex_lock(&dev_mutex);
88 LIST_FOR_EACH_ENTRY(tmp, &dev_list, node)
89 {
90 if (fd_compare_test(tmp->fd, fd) == 0) {
91 dev = tmp;
92 break;
93 }
94 }
95
96 if (dev) {
97 xdxgpu_device_get(dev);
98 *devHandle = (xdxgpu_handle)dev;
99 pthread_mutex_unlock(&dev_mutex);
100 return 0;
101 }
102
103 /* Create new xdxgpu device */
104 dev = calloc(1, sizeof(struct xdxgpu_device));
105 if (!dev) {
106 fprintf(stderr, "%s: calloc failed\n", __func__);
107 pthread_mutex_unlock(&dev_mutex);
108 return -ENOMEM;
109 }
110
111 ret = drmGetDevice2(fd, 0, &dev->ddev);
112 if (ret) {
113 fprintf(stderr, "%s: get device info failed\n", __func__);
114 free(dev);
115 pthread_mutex_unlock(&dev_mutex);
116 return ret;
117 }
118
119 dev->fd = -1;
120 atomic_set(&dev->refcount, 1);
121
122 version = drmGetVersion(fd);
123 dev->major_version = version->version_major;
124 dev->minor_version = version->version_minor;
125 drmFreeVersion(version);
126
127 dev->fd = fcntl(fd, F_DUPFD_CLOEXEC, 0);
128 dev->bo_table = drmHashCreate();
129
130 list_add(&dev->node, &dev_list);
131 pthread_mutex_init(&dev->bo_tlb_lock, NULL);
132
133 *devHandle = (xdxgpu_handle)dev;
134 pthread_mutex_unlock(&dev_mutex);
135 return 0;
136 }
340 static int xdxgpu_bo_import_from_pvr_handle(struct xdxgpu_device *xdev,
341 uint32_t handle,
342 struct xdxgpu_bo **ppxbo)
343 {
344 union drm_xdxgpu_gem_import_from_pvr args = { 0 };
345 int ret;
346 int32_t gem_handle;
347 struct xdxgpu_bo *xbo;
348 struct xdxgpu_gem_info info = { 0 };
349
350 dev_info(xdev, "%s: pvr handle 0x%x\n", __func__, handle);
351
352 args.in.handle = handle;
353 ret = drmCommandWriteRead(xdev->fd, DRM_XDXGPU_IMPORT_FROME_PVR_HANDLE,
354 &args, sizeof(args));
355 if (ret) {
356 dev_err(xdev, "%s: failed to import from PVR handle (%d)\n",
357 __func__, ret);
358 return ret;
359 }
360
361 gem_handle = args.out.handle;
362
363 pthread_mutex_lock(&xdev->bo_tlb_lock);
364 xbo = xdxgpu_lookup_bo(xdev->bo_table, gem_handle);
365 pthread_mutex_unlock(&xdev->bo_tlb_lock);
366 if (xbo) {
367 *ppxbo = xbo;
368 return 0;
369 }
370
371 // new buffer object in current process
372 xbo = calloc(1, sizeof(struct xdxgpu_bo));
373 if (!xbo) {
374 drmCloseBufferHandle(xdev->fd, gem_handle);
375 return -ENOMEM;
376 }
377
378 ret = xdxgpu_query_gem_info(xdev, gem_handle, &info);
379 if (ret) {
380 free(xbo);
381 drmCloseBufferHandle(xdev->fd, gem_handle);
382 return ret;
383 }
384
385 xdxgpu_device_get(xdev);
386 xbo->xdev = xdev;
387 xbo->gem_handle = gem_handle;
388 xbo->size = info.size;
389 atomic_set(&xbo->refcount, 1);
390 atomic_set(&xbo->map_count, 0);
391
392 pthread_mutex_lock(&xdev->bo_tlb_lock);
393 drmHashInsert(xdev->bo_table, xbo->gem_handle, xbo);
394 pthread_mutex_unlock(&xdev->bo_tlb_lock);
395
396 *ppxbo = xbo;
397
398 return ret;
399 }
169 drm_public int xdxgpu_bo_map(xdxgpu_handle bo, void **cpu)
170 {
171 void *ptr;
172 struct xdxgpu_bo *xbo = (struct xdxgpu_bo *)bo;
173 struct xdxgpu_device *xdev = xbo->xdev;
174 int ret;
175 struct xdxgpu_gem_info info = { 0 };
176
177 assert(xbo != NULL);
178
179 if (xbo->ptr) {
180 *cpu = xbo->ptr;
181 return 0;
182 }
183
184 ret = xdxgpu_query_gem_info(xdev, xbo->gem_handle, &info);
185 if (ret)
186 return ret;
187
188 if (info.mmap_offset == (uint64_t)-1) {
189 dev_err(xdev, "%s: no permission to mmap buffer object %p\n",
190 __func__, xbo);
191 return -errno;
192 }
193
194 /* Map the buffer. */
195 ptr = drm_mmap(NULL, xbo->size, PROT_READ | PROT_WRITE, MAP_SHARED,
196 xdev->fd, info.mmap_offset);
197 if (ptr == MAP_FAILED) {
198 dev_err(xdev, "%s: failed mmap buffer object %p\n", __func__,
199 xbo);
200 return -errno;
201 }
202
203 xdxgpu_bo_get(xbo);
204
205 xbo->ptr = ptr;
206
207 *cpu = xbo->ptr;
208
209 return 0;
210 }
The processing flow of xdxgpu_device_create is as follows:
- Walk every device node on the global list dev_list and compare the file name (including path) behind the node's fd with the file name (including path) behind the fd passed in.
- If a matching device node is found, call xdxgpu_device_get to increment that node's reference count and return the node.
- If no match is found, create a device node with its reference count initialized to 1, then call fcntl to duplicate the incoming fd and store the new fd in the node. Looking at the kernel path do_fcntl->f_dupfd: f_dupfd calls alloc_fd to allocate a free fd slot and increments the reference count of the struct file the original fd points to; it does not create a new struct file.
The processing flow of xdxgpu_bo_import_from_pvr_handle is as follows:
- Call drmCommandWriteRead to convert the PMR handle into a gem object handle (the kernel handler is xdx_gem_import_from_pvr_ioctl), referred to below as the gem handle.
- Call xdxgpu_lookup_bo to check whether the gem handle obtained in step 1 already exists in the device node's bo_table; if it does, increment the found bo's reference count (xdxgpu_lookup_bo->xdxgpu_bo_get) and return it.
- If the gem handle does not exist, allocate a bo with its reference count initialized to 1, call xdxgpu_device_get to increment the device node's reference count, and finally insert the bo into the device node's hash table (bo_table) keyed by the gem handle.
Note: every bo created increments the device node's reference count. When a bo's reference count drops to 0 and xdxgpu_bo_free releases it, xdxgpu_bo_free->xdxgpu_device_put decrements the device node's reference count, and the device node is freed when that count reaches 0.
The processing flow of xdxgpu_bo_map is as follows:
- If the bo's virtual address ptr is not NULL, i.e. the bo is already mapped, return the bo's ptr directly.
- Otherwise call xdxgpu_query_gem_info, which issues an ioctl; the corresponding KMD handler returns the gem object's size and creates the gem mmap offset (kernel function drm_gem_create_mmap_offset).
- Call drm_mmap (a macro wrapping mmap) to map the bo into the user virtual address space and obtain a virtual address.
- Call xdxgpu_bo_get to increment the bo's reference count.
- Store the virtual address in the bo's ptr.
From the flow above: the xdxgpu_bo_destroy called after OSMMapPMR (bo reference count minus 1) pairs with the set-to-1 or increment done in xdxgpu_bo_import, but because xdxgpu_bo_map increments the bo's reference count again, the bo is not freed at that point; it takes OSMUnmapPMR->xdxgpu_bo_unmap->xdxgpu_bo_put to drop the count to 0 and free the bo.
Likewise, the xdxgpu_device_destroy called after OSMMapPMR (device node reference count minus 1) pairs with the set-to-1 or increment done in xdxgpu_device_create, but because xdxgpu_bo_import also takes a device reference, the device node is not freed at that point either; it takes OSMUnmapPMR->xdxgpu_bo_unmap->xdxgpu_bo_put to decrement the bo's reference count, and when that reaches 0, xdxgpu_bo_free->xdxgpu_device_put decrements the device node's reference count so the node can be freed.
It follows that the device node's reference count tracks the number of bos on the node; once all bos are released, the device node is released as well.
The drmCommandWriteRead call inside xdxgpu_bo_import_from_pvr_handle ends up in ioctl; the corresponding kernel KMD handler is:
int xdx_gem_import_from_pvr_ioctl(struct drm_device *ddev, void *data,
struct drm_file *filp)
{
struct xdx_device *xdev = drm_to_xdev(ddev);
union drm_xdxgpu_gem_import_from_pvr *args = data;
struct xdx_drm_fpriv *fpriv = filp->driver_priv;
int ret;
void *psPMR;
struct drm_gem_object *gobj;
struct xdx_bo *bo;
uint32_t handle;
struct xdx_bo_property pro;
uint32_t size;
struct drm_file *file;
bool bFound = false;
ret = pvr_pmr_lookup(fpriv->pvr_fpriv, args->in.handle, &psPMR);
if (ret) {
dev_err(xdev->dev, "failed to lookup PMR: handle(0x%x)\n", args->in.handle);
return ret;
}
mutex_lock(&ddev->filelist_mutex);
list_for_each_entry(file, &ddev->filelist, lhead) {
spin_lock(&file->table_lock);
idr_for_each_entry(&file->object_idr, gobj, handle) {
bo = to_xdx_gem(gobj);
if (bo->psPMR == psPMR) {
bFound = true;
drm_gem_object_get(gobj); /* avoid released in other process */
break;
}
}
spin_unlock(&file->table_lock);
if (bFound)
break;
}
mutex_unlock(&ddev->filelist_mutex);
if (bFound) {
if (file != filp) {
ret = drm_gem_handle_create(filp, gobj, &handle);
if (ret) {
dev_err(xdev->dev, "failed to create gem handle\n");
PMRUnrefPMR((PMR *)psPMR);
drm_gem_object_put_unlocked(gobj);
return ret;
}
}
args->out.handle = handle;
PMRUnrefPMR((PMR *)psPMR);
drm_gem_object_put_unlocked(gobj);
return 0;
}
pvr_query_pmr_info(psPMR, &pro, &size);
ret = xdx_gem_object_create(xdev, NULL, (uint64_t)size, PAGE_SIZE,
&pro, 0, xdx_bo_type_pvr, &gobj);
if (ret) {
PMRUnrefPMR((PMR *)psPMR);
dev_err(xdev->dev, "failed to import from pvr handle: size %d\n", size);
return ret;
}
bo = to_xdx_gem(gobj);
bo->psPMR = psPMR;
ret = drm_gem_handle_create(filp, gobj, &handle);
drm_gem_object_put_unlocked(gobj);
if (ret)
return ret;
args->out.handle = handle;
return ret;
}
xdx_gem_import_from_pvr_ioctl converts a userspace PMR handle into a gem handle. Its flow:
- Call pvr_pmr_lookup to look up the PMR from the PMR handle.
- Walk the drm_device's file list and, for each drm_file node, walk its object_idr; if a bo's PMR (bo->psPMR) equals the PMR found in step 1, set bFound to true.
- If bFound is true, i.e. the PMR was found on the drm_device, check whether the drm_file the PMR (really the gem object) is attached to is the same drm_file behind this call into xdx_gem_import_from_pvr_ioctl. If not, call drm_gem_handle_create->...->idr_alloc(&file_priv->object_idr, obj, ...) to allocate a gem handle and attach the gem object to the caller's drm_file; if they are the same, reuse the gem object's existing gem handle.
- If bFound is false, i.e. the PMR was not found on the drm_device, create a gem object, store the PMR in the corresponding member, then call drm_gem_handle_create to allocate a gem handle and attach the gem object to the caller's drm_file.
Note: every time an application opens /dev/dri/card0 or /dev/dri/renderD128, the kernel allocates a new drm_file, so within one process different fds may map to different drm_files.
3.3.4 Why do the two mappings get the same virtual address?
From the analysis above it follows directly: if a userspace bo is already mapped (bo->ptr is non-NULL), any further xdxgpu_bo_map call on that bo returns the previously mapped virtual address.
To verify this, debug prints of the bo address were added to xdxgpu_bo_map, which showed that both AcquireInfoPage calls end up using the same bo. Log:
......
LoaderOpen(/usr/local/lib/xorg/modules/libglamoregl.so)
(II) Loading /usr/local/lib/xorg/modules/libglamoregl.so
(II) Module glamoregl: vendor="X.Org Foundation"
compiled for 1.20.13, module version = 1.0.1
--------------- PVRDRICreateScreen 263
--------------- PVRSRVConnectionCreate 353
--------------- ConnectionCreate 240 fd=15
-------OSMMapPMR 198 fd=15 PMR_handle=0x2 dev=0x563571c00db0 bo=0x563571be09a0 addr=0x7feffb7f8000 // key log line
xxxx map success OSMMapPMR 235 pvUserVirtualAddress=0x7feffb7f8000
AcquireInfoPage 188 psDevConnection->pui32InfoPage=0x7feffb7f8000
ConnectionCreate 308 psConnection->pui32InfoPage=0x7feffb7f8000 TIMEOUT_INFO_VALUE_RETRIES=20000, TIMEOUT_INFO_VALUE_TIMEOUT_MS=1300, TIMEOUT_INFO_CONDITION_RETRIES=5, TIMEOUT_INFO_CONDITION_TIMEOUT_MS=400, TIMEOUT_INFO_TASK_QUEUE_RETRIES=10, TIMEOUT_INFO_TASK_QUEUE_FLUSH_TIMEOUT_MS=1000
fd_compare_test 47 name1:/dev/dri/card0 name2:/dev/dri/card0
fd_compare_test 47 name1:/dev/dri/card0 name2:/dev/dri/card0
......
Sync Extension 3.1
(II) AIGLX: Indirect GLX is not supported
(II) AIGLX: Indirect GLX is not supported
--------------- PVRDRICreateScreen 263
--------------- PVRSRVConnectionCreate 353
--------------- ConnectionCreate 240 fd=24
fd_compare_test 47 name1:/dev/dri/card0 name2:/dev/dri/card0
xdxgpu_device_create 96 not equal r1=1 r2=0 tmp->fd=16 fd=24
fd_compare_test 47 name1:/dev/dri/card0 name2:/dev/dri/card0
over!
fd_compare_test 47 name1:/dev/dri/card0 name2:/dev/dri/card0
-------OSMMapPMR 198 fd=24 PMR_handle=0x2 dev=0x563571c00db0 bo=0x563571be09a0 addr=0x7feffb7f8000 // key log line
xxxx map success OSMMapPMR 235 pvUserVirtualAddress=0x7feffb7f8000
AcquireInfoPage 188 psDevConnection->pui32InfoPage=0x7feffb7f8000
ConnectionCreate 308 psConnection->pui32InfoPage=0x7feffb7f8000 TIMEOUT_INFO_VALUE_RETRIES=20000, TIMEOUT_INFO_VALUE_TIMEOUT_MS=1300, TIMEOUT_INFO_CONDITION_RETRIES=5, TIMEOUT_INFO_CONDITION_TIMEOUT_MS=400, TIMEOUT_INFO_TASK_QUEUE_RETRIES=10, TIMEOUT_INFO_TASK_QUEUE_FLUSH_TIMEOUT_MS=1000
fd_compare_test 47 name1:/dev/dri/card0 name2:/dev/dri/card0
xdxgpu_device_create 96 not equal r1=1 r2=0 tmp->fd=16 fd=24
fd_compare_test 47 name1:/dev/dri/card0 name2:/dev/dri/card0
......
It follows that the gem handle returned by xdxgpu_bo_import_from_pvr_handle->drmCommandWriteRead is the same for both mappings; otherwise the second mapping's xdxgpu_bo_import_from_pvr_handle->xdxgpu_lookup_bo would not have found the bo.
It also follows that during the second mapping, bFound is true and file == filp inside the kernel's xdx_gem_import_from_pvr_ioctl. bFound is true because KMD initialization creates only one information-page PMR; the first mapping created the gem object and attached it to a drm_file on the drm_device, so the second mapping finds it.
Why is file == filp? PVRDRICreateScreen opens /dev/dri/card0 both times, so fd_compare returns 0 during the second mapping; xdxgpu_bo_import_from_pvr_handle then uses the device node found via fd_compare, and the fd passed to drmCommandWriteRead is that device node's fd. Both mappings therefore call drmCommandWriteRead with the same fd, hence file == filp.
四、Root Cause
From the analysis above, the two mappings of the information page share one virtual address. When xgltest1 exits, it triggers Xorg to tear down one of the mappings, which also unmaps the virtual address Xorg itself is using, and the kernel frees the corresponding vma. When Xorg later accesses data through the now-unmapped address, a page fault occurs; the kernel's page-fault handler finds no matching vma and sends the process SIGSEGV.
The unmap code:
266 IMG_INTERNAL void
267 OSMUnmapPMR(IMG_HANDLE hBridge,
268 IMG_HANDLE hPMR,
269 IMG_HANDLE hOSMMapPrivData,
270 void *pvMappingAddress,
271 size_t uiMappingLength)
272 {
........
303 xdxgpu_bo_unmap((xdxgpu_handle)hOSMMapPrivData);
........
216 drm_public void xdxgpu_bo_unmap(xdxgpu_handle bo)
217 {
218 struct xdxgpu_bo *xbo = (struct xdxgpu_bo *)bo;
219
220 assert(xbo != NULL);
221
222 if (xbo->ptr) {
223 drm_munmap(xbo->ptr, xbo->size);
224 xbo->ptr = NULL;
225 }
226
227 xdxgpu_bo_put(xbo);
228 }
As the code shows, the unmap path does not check whether the bo is still mapped by any other flow; it tears the mapping down unconditionally.
五、How to Fix
5.1 Option 1: Modify fd_compare
5.1.1 The change
From the analysis above, fd_compare matches at process scope: as long as two fds resolve to the same file name (path included), it returns 0. If fd_compare instead checked whether the two fds refer to the same kernel struct file, the problem would be solved: PVRDRICreateScreen opens /dev/dri/card0 twice, the kernel creates one struct file per open, so the two fds represent different struct files in the kernel.
The modified code:
20 static int kcmp(pid_t pid1, pid_t pid2, int type, unsigned long idx1,
21 unsigned long idx2)
22 {
23 return syscall(SYS_kcmp, pid1, pid2, type, idx1, idx2);
24 }
25
26 static int fd_compare(int fd1, int fd2)
27 {
28 pid_t pid = getpid();
29
30 return kcmp(pid, pid, KCMP_FILE, (long)fd1, (long)fd2);
31 }
Because SYS_kcmp compares kernel struct file pointers (see kernel/kcmp.c), fd_compare no longer returns 0 during the second mapping, so a second device node is created, xdxgpu_bo_import_from_pvr_handle creates a new bo (xdxgpu_lookup_bo cannot find the gem handle on the fresh device node), and xdxgpu_bo_map maps again. The two mappings then get different virtual addresses and no longer interfere.
5.1.2 Drawbacks
- This change narrows the device node's scope from the process to the opened file: if a process opens the same device file several times, userspace now creates one device node per open. That wastes memory, and walking dev_list in xdxgpu_device_create plus the extra allocations costs some performance (probably negligible).
- If a process maps the same PMR handle through fds that refer to the same opened device file (the same struct file in the kernel), the original problem remains.
5.2 Option 2: Add a map_count field
Add an atomic member map_count (atomic_t) to the bo, initialized to 0 when the bo is created. xdxgpu_bo_map increments map_count on every map. xdxgpu_bo_unmap decrements map_count and checks whether it reached 0: if so, the mapping is torn down; otherwise it returns without unmapping. The modified code:
@@ -174,6 +176,8 @@ drm_public int xdxgpu_bo_map(xdxgpu_handle bo, void **cpu)
assert(xbo != NULL);
+ atomic_inc(&xbo->map_count);
+
if (xbo->ptr) {
*cpu = xbo->ptr;
return 0;
@@ -207,14 +211,30 @@ drm_public int xdxgpu_bo_map(xdxgpu_handle bo, void **cpu)
return 0;
}
drm_public void xdxgpu_bo_unmap(xdxgpu_handle bo)
{
struct xdxgpu_bo *xbo = (struct xdxgpu_bo *)bo;
assert(xbo != NULL);
+ if (!atomic_dec_and_test(&xbo->map_count))
+ return;
@@ -367,6 +387,7 @@ static int xdxgpu_bo_import_from_pvr_handle(struct xdxgpu_device *xdev,
xbo->gem_handle = gem_handle;
xbo->size = info.size;
atomic_set(&xbo->refcount, 1);
+ atomic_set(&xbo->map_count, 0);
The reference-count flow then becomes: creating a bo increments the device node's reference count; the first map of a bo increments the bo's reference count, and later maps only increment map_count; each unmap decrements map_count, and when map_count reaches 0 the bo's reference count is decremented; when the bo's reference count reaches 0 the device node's reference count is decremented. A bo whose reference count hits 0 releases the bo's resources, and a device node whose reference count hits 0 releases the node's resources.
With this scheme, even if a process opens the same device file several times, only one device node exists on the global dev_list (see xdxgpu_device_create).
A closer look at the device node's reference count and the fds:
- Suppose a process opens the same device file twice, getting fd1 and fd2, and calls xdxgpu_device_create once for each.
- The first call (with fd1) creates the device node with reference count 1 and calls fcntl to duplicate fd1 into dup_fd1, which is stored in the device node's fd member. The kernel's struct file behind fd1 now has a reference count of 2.
- The second call (with fd2) finds the existing device node and calls xdxgpu_device_get, raising the node's reference count to 2.
- After close(fd1), the struct file behind fd1 drops to a reference count of 1.
- The first xdxgpu_device_destroy on the node ends in xdxgpu_device_put, dropping the node's reference count to 1.
- The second xdxgpu_device_destroy drops it to 0 and ends in xdxgpu_device_free->close(dev->fd, i.e. dup_fd1) and free(dev). close(dup_fd1) drops the struct file behind fd1/dup_fd1 to 0, so the kernel releases it; free(dev) releases the device node.
This shows that closing fd1 does not disturb any other flow, and explains why xdxgpu_device_create calls fcntl instead of storing fd1 directly in the device node's fd member. As for fd2, it can simply be closed whenever it is no longer needed, with no side effects.
六、Summary
The key to solving this problem was familiarity with the information page's role and its mapping flow: after ruling out corruption of the virtual address value itself, we inferred that the address being accessed had already been unmapped, then verified that inference step by step.