Forcibly removing a Linux kernel module: ending rmmod-style processes stuck in disk sleep

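I. Problem:
A while ago at work I ran into a problem: when I ran rmmod on a module, it blocked inside the module's exit function. The rmmod process could not be killed and just sat there forever; it turned out to be in the D (disk sleep) state. Was there really nothing to be done? I refused to accept that, so today, with a little spare time to clear my head, I finally found a solution.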
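To confirm from user space that a process really is in the D state, one option is to read the State: line of /proc/<pid>/status. The following is only a minimal sketch; the function names parse_state_line and task_state are my own and purely illustrative:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Extract the one-letter state code from a "State:" line of
 * /proc/<pid>/status, for example "State:\tD (disk sleep)".
 * Returns the state character, or 0 if this is not a State line.
 * (Hypothetical helper, not part of any library.)
 */
char parse_state_line(const char *line)
{
    char state = 0;

    assert(line != NULL);
    if (strncmp(line, "State:", 6) != 0)
        return 0;
    /* The leading space in the format skips the tab after the colon. */
    if (sscanf(line + 6, " %c", &state) != 1)
        return 0;
    return state;
}

/*
 * Read /proc/<pid>/status and return the state character, or 0 if
 * the process does not exist or the file cannot be read.
 */
char task_state(int pid)
{
    char path[64], line[256];
    char state = 0;
    FILE *fp;

    snprintf(path, sizeof(path), "/proc/%d/status", pid);
    fp = fopen(path, "r");
    if (!fp)
        return 0;
    while (fgets(line, sizeof(line), fp)) {
        state = parse_state_line(line);
        if (state)
            break;
    }
    fclose(fp);
    return state;
}
```

If task_state(pid) returns 'D' for the rmmod process, it is in uninterruptible sleep and signals (including SIGKILL) will not take effect until it wakes up.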
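II. Analysis:
Whoever tied the bell must untie it: since the problem arose inside the kernel, the fix has to be found inside the kernel as well. The prerequisite for solving this kind of problem is a precise understanding of how the kernel unloads a module. Once the flow is thoroughly understood, why worry about not finding the cause? And once the cause is found, there is bound to be a fix! (This is also the most important thing I learned at my company!) Following this principle, I looked up the code that rmmod ultimately calls: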
    asmlinkage long sys_delete_module(const char __user *name_user, unsigned int flags)
    {
        ...
        if (!list_empty(&mod->modules_which_use_me)) {
            /* Case 0: other modules depend on this one, so it is not removed. */
            /* Other modules depend on us: get rid of them first. */
            ret = -EWOULDBLOCK;
            goto out;
        }
        ...
        if (mod->state != MODULE_STATE_LIVE) {
            /* Case 1: the module is not LIVE, so we cannot go on; the result is "busy". */
            ...
            ret = -EBUSY;
            goto out;
        }
        ...
        /* Case 2: if the reference count is not 0, wait until it is. */
        if (!forced && module_refcount(mod) != 0)
            wait_for_zero_refcount(mod);
        up(&module_mutex);
        mod->exit();    /* Case 3: if this call blocks, we never return. */
        down(&module_mutex);
        free_module(mod);
        ...
    }
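The four commented places above are explained as follows:
Case 0: Other modules depend on the module being unloaded. Whether module a depends on module b is decided by resolve_symbol when module a is loaded: if a symbol used by module a lives in module b, then a depends on b.
Case 1: Only a module in the LIVE state can be unloaded.
Case 2: The reference count is non-zero while other modules or the kernel itself hold references; unloading has to wait until they drop them.
Case 3: This case is fairly common. If the module hit an OOM while in use, or a module that depends on it did, or the module itself is poorly written and has a bug that corrupted some data structure, the exit function may block and rmmod never returns!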

III. Trying it out:
To simulate case 3, here is an example:
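    #include <linux/module.h>
    #include <linux/completion.h>

    static DECLARE_COMPLETION(test_completion);

    int init_module(void)
    {
        return 0;
    }

    void cleanup_module(void)
    {
        /* Nobody ever completes test_completion, so rmmod blocks here forever. */
        wait_for_completion(&test_completion);
    }

    MODULE_LICENSE("GPL");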
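Build it as test.ko. Running rmmod test will then block and never return! Clearly cleanup_module is the culprit, so let's write another module to unload it. Before building that module, first find the following line in /proc/kallsyms:
f88de380 d __this_module [name of the module that cannot be unloaded]
This is needed because the modules list is not exported; if it were, the proper way would be to walk that list and compare module names.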
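Looking the address up by eye gets tedious, so the lookup can be scripted from user space. The following is a minimal sketch that assumes the "address type symbol [module]" line layout shown above; this_module_addr and find_this_module are hypothetical helper names of my own. Note that on newer kernels unprivileged readers may see all-zero addresses in /proc/kallsyms, so run it as root.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one /proc/kallsyms line such as
 *   "f88de380 d __this_module\t[test]"
 * Returns the address if the line is the __this_module symbol of the
 * module called modname, otherwise 0. (Hypothetical helper.)
 */
unsigned long this_module_addr(const char *line, const char *modname)
{
    unsigned long addr;
    char type;
    char sym[128], mod[64];

    assert(line != NULL && modname != NULL);
    /* %63[^]] reads the module name found between the square brackets. */
    if (sscanf(line, "%lx %c %127s [%63[^]]]", &addr, &type, sym, mod) != 4)
        return 0;   /* core-kernel symbols carry no [module] suffix */
    if (strcmp(sym, "__this_module") != 0 || strcmp(mod, modname) != 0)
        return 0;
    return addr;
}

/* Scan /proc/kallsyms; returns 0 if the module or the file is not found. */
unsigned long find_this_module(const char *modname)
{
    char line[512];
    unsigned long addr = 0;
    FILE *fp = fopen("/proc/kallsyms", "r");

    if (!fp)
        return 0;
    while (addr == 0 && fgets(line, sizeof(line), fp))
        addr = this_module_addr(line, modname);
    fclose(fp);
    return addr;
}
```

The value returned by find_this_module("test") is what gets cast to a struct module pointer in the helper module.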
Once the following module is loaded, the module above can be removed with rmmod:
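    #include <linux/module.h>
    #include <linux/init.h>

    /* Replacement exit function: it does nothing, so it cannot block. */
    static void force(void)
    {
    }

    static int __init rm_init(void)
    {
        /* Address of the stuck module's __this_module, from /proc/kallsyms. */
        struct module *mod = (struct module *)0xf88de380;
        int i;

        /* Avoid case 1: set the state back to LIVE so that unloading can proceed. */
        mod->state = MODULE_STATE_LIVE;
        /*
         * The module's own exit is what never returns, so replace it.
         * The next rmmod enters sys_delete_module and ends up calling the
         * exit callback; the new one does not block and returns at once,
         * so the module can then be freed.
         */
        mod->exit = force;
        /* Reset the per-CPU reference counts to 0. */
        for (i = 0; i < NR_CPUS; i++)
            local_set(&mod->ref[i].count, 0);
        return 0;
    }

    static void __exit rm_exit(void)
    {
    }

    module_init(rm_init);
    module_exit(rm_exit);
    MODULE_LICENSE("GPL");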
With this module loaded, running rmmod on the earlier module again succeeds.
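IV. Some deeper explanation:
For a module that cannot be unloaded for this reason there is another fix: complete the completion from a different module. The completion is of course not directly accessible, so again take its address from /proc/kallsyms, cast it to a struct completion pointer, and complete it: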
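    #include <linux/module.h>
    #include <linux/init.h>
    #include <linux/completion.h>

    static int __init rm_init(void)
    {
        /* Address of the completion the stuck module waits on, from /proc/kallsyms. */
        complete((struct completion *)0xf88de36c);
        return 0;
    }

    static void __exit rm_exit(void)
    {
    }

    module_init(rm_init);
    module_exit(rm_exit);
    MODULE_LICENSE("GPL");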
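This is of course the best approach; if you replace exit instead, the first, blocked rmmod can never exit. You might think of calling try_to_wake_up from the new module, but that is not safe either: the blocked process may be inside wait_for_completion, and wait_for_completion makes heavy use of data belonging to the module whose exit callback was replaced and which has since been unloaded. For example: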
spin_unlock_irq(&x->wait.lock);
schedule();
spin_lock_irq(&x->wait.lock);

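Here x belongs to that module. Since x no longer exists (replacing exit let the module be unloaded successfully, the module was freed, and x went with it), the freshly woken rmmod will oops. Still, an oops in a single process is usually not a problem, so the process can be killed off this way. Such an oops normally does not corrupt other kernel data; it is usually just caused by dereferencing a freed pointer (though dangerous cases really can happen...). Since we know these rmmod processes are all blocked in a sleep, we only need to wake them up forcibly. What if it oopses after waking up? Let the kernel deal with it, or leave it to fate! So consider the following code: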
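    #include <linux/module.h>
    #include <linux/init.h>
    #include <linux/sched.h>

    static int (*try)(task_t *p, unsigned int state, int sync);

    static int __init rm_init(void)
    {
        /*
         * 28792 is the PID of a blocked rmmod. Its module has already been
         * unloaded by a second rmmod using the replace-exit approach above,
         * but the first rmmod is still blocked there and nobody is left to
         * wake it up.
         */
        struct task_struct *tsk = find_task_by_pid(28792);

        if (!tsk)
            return -ESRCH;
        /* Hard-coded address of the wake-up function. */
        try = (int (*)(task_t *, unsigned int, int))0xc011a460;
        /* Wake it up; whatever it does after waking is up to it! */
        (*try)(tsk, TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, 0);
        return 0;
    }

    static void __exit rm_exit(void)
    {
    }

    module_init(rm_init);
    module_exit(rm_exit);
    MODULE_LICENSE("GPL");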
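Run ps -e again and that rmmod process is most likely gone. A process in [State: D (disk sleep)] is finished off just like that.
The code above is mostly hard-coded addresses and a hard-coded PID; real code should take them as parameters, which is far more convenient!
Since the module structure is within reach, any of its fields can be set to anything: wherever something went wrong, reassign it! Once you are in kernel space, whether a symbol is exported is no longer a fundamental issue. Even without /proc/kallsyms it can still be done, because you control all of memory!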
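V. Preventing removal:
In our own module's init function we can set its reference count to a fairly large number, or tweak other fields of the module structure, to prevent rmmod. But someone else can write another module that sets those fields back; the approach above easily kills your removal-proof module. It is a spear-versus-shield problem. The key is: don't let anyone have root.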
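VI. Summary:
With the code in hand, how could you fail to understand the flow? With the flow understood, why fear not locating the problem? With the problem located, how could it not be solved? As long as no human factors are involved, any technical problem can in fact be solved (this is the most important thing I learned at my company). What is called impossible is merely a rule laid down by a specification. Put differently: if you made an operating mistake or your code has a bug, then rather than patching things up the way described above, it is better to leave it alone and correct your own mistake instead!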

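When a module cannot be removed because it blocks, the procedure is:
1. Write a module that replaces the exit function, sets the reference count to 0 and the state to LIVE, then run rmmod;
2. Forcibly try_to_wake_up the stuck rmmod process. Note that wake_up must not be used, because the wait queue may no longer exist; wake the task_struct directly instead;
3. Leave it to fate!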

Appendix: kernel page faults
In do_page_fault, if the fault happens in kernel space and ends in an OOPS, die is called:
die("Oops", regs, error_code);
In die, if we are not in interrupt context and panic-on-oops is not set, the current process is finally made to exit with SIGSEGV:
    if (in_interrupt())
        panic("Fatal exception in interrupt");
    if (panic_on_oops) {
        printk(KERN_EMERG "Fatal exception: panic in 5 seconds\n");
        set_current_state(TASK_UNINTERRUPTIBLE);
        schedule_timeout(5 * HZ);
        panic("Fatal exception");
    }
    do_exit(SIGSEGV);
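So if we wake an rmmod that is sleeping in a module's exit function, then once woken it checks a variable that has already been freed, which causes a page fault. It therefore enters die("Oops"...) and the rmmod process finally exits, which is entirely reasonable! So the method above for cleaning up D-state processes is still usable.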
