内核互斥量-优快云博客

本文链接：https://blog.youkuaiyun.com/congchp/article/details/146861895

本文主要基于linux 4.12介绍内核的互斥量机制，会和信号量进行比较。最后简单介绍用户空间使用的posix互斥量。

互斥量基本原理

互斥量（mutex）只允许一个进程进入临界区，适合保护比较长的临界区，因为竞争互斥量时进程可能睡眠和再次唤醒，代价很高。

互斥量是在原子操作和spinlock 基础上实现的。互斥量不能进行递归锁或多次解锁，能用于交互上下文但不能用中断上下文，同一时刻只有一个任务持有该锁，而且只有这个任务可以对互斥量进行解锁。linux内核中mutex的特性，可以参考如下说明：

/*
 * Simple, straightforward mutexes with strict semantics:
 *
 * - only one task can hold the mutex at a time
 * - only the owner can unlock the mutex
 * - multiple unlocks are not permitted
 * - recursive locking is not permitted
 * - a mutex object must be initialized via the API
 * - a mutex object must not be initialized via memset or copying
 * - task may not exit with mutex held
 * - memory areas where held locks reside must not be freed
 * - held mutexes must not be reinitialized
 * - mutexes may not be used in hardware or software interrupt
 *   contexts such as tasklets and timers
 *
 * These semantics are fully enforced when DEBUG_MUTEXES is
 * enabled. Furthermore, besides enforcing the above rules, the mutex
 * debugging code also implements a number of additional features
 * that make lock debugging easier and faster:
 *
 * - uses symbolic names of mutexes, whenever they are printed in debug output
 * - point-of-acquire tracking, symbolic lookup of function names
 * - list of all locks held in the system, printout of them
 * - owner tracking
 * - detects self-recursing locks and prints out all relevant info
 * - detects multi-task circular deadlocks and prints out all affected
 *   locks and tasks (and only those tasks)
 */

数据结构及API

struct mutex

struct mutex {
	atomic_long_t		owner;
	spinlock_t		wait_lock;
#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
	struct optimistic_spin_queue osq; /* Spinner MCS lock */
#endif
	struct list_head	wait_list;
#ifdef CONFIG_DEBUG_MUTEXES
	void			*magic;
#endif
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	struct lockdep_map	dep_map;
#endif
};

**owner**：
- 原子变量，表示当前持有锁的task（task_struct*）。
- 如果锁未被持有，owner 为 0。
- task_struct* 是按 8 字节对齐的（64 位系统），指针的最低3位始终为 0，可以用于存储标志位。
- 各个标志位如下：

/*
 * @owner: contains: 'struct task_struct *' to the current lock owner,
 * NULL means not owned. Since task_struct pointers are aligned at
 * at least L1_CACHE_BYTES, we have low bits to store extra state.
 *
 * Bit0 indicates a non-empty waiter list; unlock must issue a wakeup.
 * Bit1 indicates unlock needs to hand the lock to the top-waiter
 * Bit2 indicates handoff has been done and we're waiting for pickup.
 */
#define MUTEX_FLAG_WAITERS	0x01
#define MUTEX_FLAG_HANDOFF	0x02
#define MUTEX_FLAG_PICKUP	0x04

#define MUTEX_FLAGS		0x07

**wait_lock**：
- 自旋锁，用于保护 wait_list 的并发访问。
**osq**：
- 乐观自旋队列（Optimistic Spinning Queue），用于实现乐观自旋（Optimistic Spinning）。
**wait_list**：
- 等待队列，存放等待获取锁的线程。
**magic**：
- 调试字段，用于检测锁的滥用。
**dep_map**：
- 锁的依赖映射，用于锁调试（Lockdep）。

API

初始化mutex

初始化静态mutex的方法如下：

DEFINE_MUTEX(mutexname);

运行时动态初始化mutex的方法如下：

mutex_init(mutex)

申请mutex

申请mutex方法如下：

（1）void mutex_lock(struct mutex *lock);

申请mutex，如果锁已经被占有，进程深度睡眠。

（2）int mutex_lock_interruptible(struct mutex *lock);

申请mutex，如果锁已经被占有，进程轻度睡眠。

（3）int mutex_lock_killable(struct mutex *lock);

申请mutex，如果锁已经被占有，进程中度睡眠。

（4）int mutex_trylock(struct mutex *lock);

非阻塞方式申请mutex，如果申请成功，返回1；如果锁被其它进程占有，那么进程不等待，返回0。

释放mutex

void mutex_unlock(struct mutex *lock);

函数实现机制

`mutex_lock`

/**
 * mutex_lock - acquire the mutex
 * @lock: the mutex to be acquired
 *
 * Lock the mutex exclusively for this task. If the mutex is not
 * available right now, it will sleep until it can get it.
 *
 * The mutex must later on be released by the same task that
 * acquired it. Recursive locking is not allowed. The task
 * may not exit without first unlocking the mutex. Also, kernel
 * memory where the mutex resides must not be freed with
 * the mutex still locked. The mutex must first be initialized
 * (or statically defined) before it can be locked. memset()-ing
 * the mutex to 0 is not allowed.
 *
 * ( The CONFIG_DEBUG_MUTEXES .config option turns on debugging
 *   checks that will enforce the restrictions and will also do
 *   deadlock debugging. )
 *
 * This function is similar to (but not equivalent to) down().
 */
void __sched mutex_lock(struct mutex *lock)
{
	might_sleep();

	if (!__mutex_trylock_fast(lock))
		__mutex_lock_slowpath(lock);
}
EXPORT_SYMBOL(mutex_lock);

先尝试快速获取锁，如果失败则进入慢速路径。

**might_sleep**：
- 提示内核当前函数可能会睡眠，用于调试和检测潜在的睡眠问题。
**__mutex_trylock_fast**：
- 尝试快速获取锁，如果成功则返回 true，否则返回 false。
**__mutex_lock_slowpath**：
- 处理锁获取失败的情况，将当前线程加入等待队列并进入睡眠状态。

快速路径

/*
 * Optimistic trylock that only works in the uncontended case. Make sure to
 * follow with a __mutex_trylock() before failing.
 */
static __always_inline bool __mutex_trylock_fast(struct mutex *lock)
{
	unsigned long curr = (unsigned long)current;

	if (!atomic_long_cmpxchg_acquire(&lock->owner, 0UL, curr))
		return true;

	return false;
}

尝试快速获取锁。如果锁未被持有（**owner** 为 **0**），则将当前task设置为锁的持有者。

**atomic_long_cmpxchg_acquire**：
- 原子地比较并交换 lock->owner 的值。
- 如果 lock->owner 的当前值等于 0（表示锁没有被占用），则将其设置为当前task的指针（task_struct*）。
**acquire** ****语义：
- 确保该操作之后的内存访问不会被重排序到该操作之前。

慢速路径

static noinline void __sched
__mutex_lock_slowpath(struct mutex *lock)
{
	__mutex_lock(lock, TASK_UNINTERRUPTIBLE, 0, NULL, _RET_IP_);
}

static int __sched
__mutex_lock(struct mutex *lock, long state, unsigned int subclass,
	     struct lockdep_map *nest_lock, unsigned long ip)
{
	return __mutex_lock_common(lock, state, subclass, nest_lock, ip, NULL, false);
}

处理快速路径锁获取失败的情况，将当前task加入等待队列并进入睡眠状态。

**__mutex_lock**：
- 实际执行锁获取的慢速路径。
- 先利用owner和乐观自旋队列ops，用于快速检查写锁的持有者（task）是否正在运行（即写者是否在 CPU 上运行）。如果写锁的持有者正在运行，则当前task自旋忙等待（内核互斥量并不是完全的公平锁，支持新申请锁的任务先获取锁）；否则进行睡眠。提升性能。
- 将当前task加入到wait_list时，同时需要设置owner中的标志位（**MUTEX_FLAG_WAITERS**），表示wait_list不为空。
**TASK_UNINTERRUPTIBLE**：
- 将当前task设置为不可中断的状态。

`mutex_unlock`

/**
 * mutex_unlock - release the mutex
 * @lock: the mutex to be released
 *
 * Unlock a mutex that has been locked by this task previously.
 *
 * This function must not be used in interrupt context. Unlocking
 * of a not locked mutex is not allowed.
 *
 * This function is similar to (but not equivalent to) up().
 */
void __sched mutex_unlock(struct mutex *lock)
{
#ifndef CONFIG_DEBUG_LOCK_ALLOC
	if (__mutex_unlock_fast(lock))
		return;
#endif
	__mutex_unlock_slowpath(lock, _RET_IP_);
}
EXPORT_SYMBOL(mutex_unlock);

先尝试通过快速路径释放锁，快死路径失败则通过慢速路径。
**unlock**为什么需要快速路径和慢速路径？
- 快速路径用于wait_list为空的场景，不涉及spinlock以及wakeup的操作，快速释放锁，提升性能。

快速路径

static __always_inline bool __mutex_unlock_fast(struct mutex *lock)
{
	unsigned long curr = (unsigned long)current;

	if (atomic_long_cmpxchg_release(&lock->owner, curr, 0UL) == curr)
		return true;

	return false;
}

尝试通过原子操作快速释放锁。

获取当前任务的指针 current(task_struct*)，并将其转换为 unsigned long 类型。
使用 atomic_long_cmpxchg_release 原子操作，将锁的所有者（lock->owner）从当前任务（curr）设置为 0（表示锁未被持有）。
如果操作成功（即锁的当前所有者确实是当前任务，并且**owner**中没有保存标志位，即**wait_list**为空），返回 true；否则返回 false。
如果检测到MUTEX_FLAG_WAITERS 标志，会立即退出快速路径，进入慢速路径。

慢速路径

/*
 * Release the lock, slowpath:
 */
static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigned long ip)
{
	struct task_struct *next = NULL;
	DEFINE_WAKE_Q(wake_q);
	unsigned long owner;

	mutex_release(&lock->dep_map, 1, ip);

	/*
	 * Release the lock before (potentially) taking the spinlock such that
	 * other contenders can get on with things ASAP.
	 *
	 * Except when HANDOFF, in that case we must not clear the owner field,
	 * but instead set it to the top waiter.
	 */
	owner = atomic_long_read(&lock->owner);
	for (;;) {
		unsigned long old;

#ifdef CONFIG_DEBUG_MUTEXES
		DEBUG_LOCKS_WARN_ON(__owner_task(owner) != current);
		DEBUG_LOCKS_WARN_ON(owner & MUTEX_FLAG_PICKUP);
#endif

		if (owner & MUTEX_FLAG_HANDOFF)
			break;

		old = atomic_long_cmpxchg_release(&lock->owner, owner,
						  __owner_flags(owner));
		if (old == owner) {
			if (owner & MUTEX_FLAG_WAITERS)
				break;

			return;
		}

		owner = old;
	}

	spin_lock(&lock->wait_lock);
	debug_mutex_unlock(lock);
	if (!list_empty(&lock->wait_list)) {
		/* get the first entry from the wait-list: */
		struct mutex_waiter *waiter =
			list_first_entry(&lock->wait_list,
					 struct mutex_waiter, list);

		next = waiter->task;

		debug_mutex_wake_waiter(lock, waiter);
		wake_q_add(&wake_q, next);
	}

	if (owner & MUTEX_FLAG_HANDOFF)
		__mutex_handoff(lock, next);

	spin_unlock(&lock->wait_lock);

	wake_up_q(&wake_q);
}

慢速路径的代码比较复杂，简单来说就是**唤醒等待任务**。
linux内核mutex并不是严格的公平锁，因为锁释放的时候并不是严格按照FIFO的顺序唤醒wait_list中的任务。
- 如果设置了 MUTEX_FLAG_HANDOFF 标志，锁会直接交给等待队列中的下一个任务（next）。
- 如果没有设置 MUTEX_FLAG_HANDOFF，锁会被释放，新来的任务可以竞争锁。
- 这种设计也是为了提升系统整体的性能，新来的任务可以快速获取锁，不需要加入wait_list并且睡眠，避免了频繁任务切换带来的开销。
- 但也可能会导致wait_list中的task长时间获取不到锁。

内核mutex存在的问题

优先级反转
- 高优先级任务被低优先级任务阻塞。当一个高优先级任务等待一个低优先级任务持有的锁时，会导致高优先级任务被阻塞。调度器可能会调度其它任务，导致低优先级任务执行时间被延长，无法及时释放锁，等待锁的高优先级任务等待时间就会更长。
不完全公平锁
- 可能存在wait_list中的task长时间得不到执行。
- 大部分时候是公平锁。

内核互斥量和信号量比较

特性	Semaphore	Mutex
计数值	`>=0`可多个进程/线程同时获取	只能是 `0`或 `1`
是否允许多个进程持有	是	否
释放机制	任何进程都能 `up()`释放，即使它没有获取过信号量	只能由持有者释放
递归锁	不支持，递归执行P操作会死锁	不支持(内核mutex不支持，posix可以设置mutex属性支持递归锁)
适用场景	资源计数（如生产者-消费者）	互斥访问（如临界区保护）

内核mutex不支持递归锁

内核互斥量（mutex）递归锁测试程序

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/mutex.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("congchp");
MODULE_DESCRIPTION("Test mutex recursive lock");

static DEFINE_MUTEX(my_mutex);

static int __init mutex_test_init(void) {
    mutex_lock(&my_mutex);
    printk(KERN_INFO "First lock acquired\n");

    mutex_lock(&my_mutex);

    printk(KERN_INFO "Second lock acquired (recursive?)\n");

    mutex_unlock(&my_mutex);
    mutex_unlock(&my_mutex); // 如果第二次加锁成功，需要两次解锁

    return 0;
}

static void __exit mutex_test_exit(void) {
    printk(KERN_INFO "Mutex test module unloaded\n");
}

module_init(mutex_test_init);
module_exit(mutex_test_exit);

编译和运行：

保存代码：
- 将代码保存为 kernel_mtx_rec_test.c。
编写 Makefile：
- 在代码统计目录创建一个名为 Makefile 的文件，内容如下：

obj-m:=kernel_mtx_rec_test.o	

CURRENT_PAHT:=$(shell pwd) 
LINUX_KERNEL:=$(shell uname -r)   

LINUX_KERNEL_PATH:=/usr/src/linux-headers-$(LINUX_KERNEL)
all:

	make -C $(LINUX_KERNEL_PATH) M=$(CURRENT_PAHT) modules

clean:

	make -C $(LINUX_KERNEL_PATH) M=$(CURRENT_PAHT) cleals

编译内核模块：
- 在终端中，进入代码所在的目录，运行 make 命令。
加载内核模块：
- 使用 sudo insmod kernel_mtx_rec_test.ko 命令加载模块。
查看内核日志：
- 使用 dmesg | tail 命令查看内核日志。
卸载内核模块：
- 使用 sudo rmmod mutex_test 命令卸载模块。

预期输出及现象：

内核日志只输出First lock acquired，第二次执行mutex_lock会死锁。
执行insmod的终端会阻塞。
会有一个insmod的用户进行，无法kill。
执行rmmod也会失败，因为Module xxx is in use。
在linux内核中，mutex是不支持递归锁的，所以第二次加锁，会直接造成死锁。

内核semaphore不支持递归锁

内核信号量（semaphore）递归锁测试程序

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/semaphore.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("congchp");
MODULE_DESCRIPTION("Test semaphore recursive lock");

static DEFINE_SEMAPHORE(my_semaphore); // 初始化信号量为1

static int __init semaphore_test_init(void) {

    down(&my_semaphore);
    printk(KERN_INFO "First semaphore acquired\n");

    down(&my_semaphore);
    printk(KERN_INFO "Second semaphore acquired (recursive?)\n");


    up(&my_semaphore);
    up(&my_semaphore);

    return 0;
}

static void __exit semaphore_test_exit(void) {
    printk(KERN_INFO "Semaphore test module unloaded\n");
}

module_init(semaphore_test_init);
module_exit(semaphore_test_exit);

编译和运行：

与互斥量测试程序类似，使用相同的 Makefile 进行编译和运行。

预期输出：

内核日志只有First semaphore acquired，第二次执行down操作会导致死锁。
执行insmod的终端会阻塞。
会有一个insmod的用户进行，无法kill。
执行rmmod也会失败，因为Module xxx is in use。
linux内核的信号量与用户空间的信号量类似，它不支持递归锁，但是可以多次获取。

POSIX互斥量

posix mutex支持递归锁

POSIX 互斥量（mutex）递归锁测试程序

#include <stdio.h>
#include <pthread.h>
#include <errno.h>
#include <string.h>

pthread_mutex_t mutex;

void recursive_lock_test() {
    int ret1, ret2;

    ret1 = pthread_mutex_lock(&mutex);
    if (ret1 != 0) {
        printf("第一次加锁失败: %s\n", strerror(ret1));
        return;
    }
    printf("第一次加锁成功\n");

    ret2 = pthread_mutex_lock(&mutex);
    if (ret2 != 0) {
        printf("第二次加锁失败: %s\n", strerror(ret2));
    } else {
        printf("第二次加锁成功（递归锁）\n");
    }

    pthread_mutex_unlock(&mutex);
    pthread_mutex_unlock(&mutex); //需要两次解锁
}

int main() {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE); // 设置互斥量为递归锁
    pthread_mutex_init(&mutex, &attr);
    pthread_mutexattr_destroy(&attr);

    recursive_lock_test();

    pthread_mutex_destroy(&mutex);
    return 0;
}

编译和运行：

gcc -pthread mutex_recursive_test.c -o mutex_recursive_test
./mutex_recursive_test

预期输出：

第一次加锁成功
第二次加锁成功（递归锁）

解释：

pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE)：将互斥量属性设置为递归锁。
pthread_mutex_lock(&mutex)：尝试两次获取同一个互斥量。
如果第二次加锁成功，则说明互斥量支持递归锁。
需要两次pthread_mutex_unlock解锁，才能释放锁。

posix semaphore不支持递归锁

POSIX 信号量（semaphore）递归锁测试程序

#include <stdio.h>
#include <semaphore.h>
#include <errno.h>
#include <string.h>

sem_t semaphore;

void recursive_semaphore_test() {
    int ret1, ret2;

    ret1 = sem_wait(&semaphore);
    if (ret1 != 0) {
        printf("第一次获取信号量失败: %s\n", strerror(errno));
        return;
    }
    printf("第一次获取信号量成功\n");

    ret2 = sem_wait(&semaphore);
    if (ret2 != 0) {
        printf("第二次获取信号量失败: %s\n", strerror(errno));
    } else {
        printf("第二次获取信号量成功（支持递归锁）\n");
    }

    sem_post(&semaphore);
    sem_post(&semaphore);
}

int main() {
    sem_init(&semaphore, 0, 1); // 初始化信号量为1

    recursive_semaphore_test();

    sem_destroy(&semaphore);
    return 0;
}

编译和运行：

gcc semaphore_recursive_test.c -o semaphore_recursive_test
./semaphore_recursive_test

预期输出：

第一次获取信号量成功

解释：

sem_init(&semaphore, 0, 1)：初始化信号量为 1，表示可用资源数为 1。
sem_wait(&semaphore)：尝试两次获取同一个信号量。
如果第二次获取信号量成功，则说明互斥信号量支持递归锁。结果是没有此输出，说明互斥信号量不支持递归锁。
sem_post(&semaphore):释放信号量。

结论：

POSIX 互斥量可以通过设置属性 PTHREAD_MUTEX_RECURSIVE 来支持递归锁。
POSIX 信号量本身不支持递归锁，但可以重复获取，直到信号量的值变为 0。

参考资料

Professional Linux Kernel Architecture，Wolfgang Mauerer
Linux内核深度解析，余华兵
Linux设备驱动开发详解，宋宝华
linux kernel 4.12