Linux open() source code walkthrough

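I have wanted to study the Linux file system for a while, so I will start from here.
Linux kernel version: 6.10, 64-bit
Architecture: x86
Source browser: https://elixir.bootlin.com/linux/v6.10/source
Scenario: open("dirTest/ppshuoTest", O_WRONLY);
Source: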

SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
{
	if (force_o_largefile())
		flags |= O_LARGEFILE;
	return do_sys_open(AT_FDCWD, filename, flags, mode);
}

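This is a macro tied to the system call machinery. I will not explain that part, because what I am studying is the file system. For our purposes, treat these few lines as the whole body of open().

The flow: check the value returned by force_o_largefile(); if it is true, add O_LARGEFILE to flags; then call do_sys_open.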

force_o_largefile()

include/linux/fcntl.h
#ifndef force_o_largefile
#define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
#endif

So it is a macro as well.

CONFIG_ARCH_32BIT_OFF_T

arch/Kconfig

config ARCH_32BIT_OFF_T
	bool
	depends on !64BIT
	help
	  All new 32-bit architectures should have 64-bit off_t type on
	  userspace side which corresponds to the loff_t kernel type. This
	  is the requirement for modern ABIs. Some existing architectures
	  still support 32-bit off_t. This option is enabled for all such
	  architectures explicitly.

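In short, this option says the architecture still uses a 32-bit off_t. Kconfig turns an option set to 'y' (built-in) into CONFIG_FOO defined as 1, and an option set to 'm' (module) into CONFIG_FOO_MODULE defined as 1. ARCH_32BIT_OFF_T is a bool, so in practice only the first form can occur for it. For example:

#define CONFIG_ARCH_32BIT_OFF_T 1
or, following the naming rule for an option built as a module:
#define CONFIG_ARCH_32BIT_OFF_T_MODULE 1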

IS_ENABLED()

include/linux/kconfig.h

/*
 * IS_ENABLED(CONFIG_FOO) evaluates to 1 if CONFIG_FOO is set to 'y' or 'm',
 * 0 otherwise.  Note that CONFIG_FOO=y results in "#define CONFIG_FOO 1" in
 * autoconf.h, while CONFIG_FOO=m results in "#define CONFIG_FOO_MODULE 1".
 */
#define IS_ENABLED(option) __or(IS_BUILTIN(option), IS_MODULE(option))

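The comment says that IS_ENABLED checks whether CONFIG_FOO (here CONFIG_ARCH_32BIT_OFF_T) is set to built-in or module.

force_o_largefile() takes the logical NOT of that, so it is true only when the option is not set. So this has nothing to do with the file system; it is decided by the build configuration. I still want to explain it fully because there is not much to it. You can skip ahead to do_sys_open, where the real source analysis starts.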

IS_BUILTIN

include/linux/kconfig.h
/*
 * IS_BUILTIN(CONFIG_FOO) evaluates to 1 if CONFIG_FOO is set to 'y', 0
 * otherwise. For boolean options, this is equivalent to
 * IS_ENABLED(CONFIG_FOO).
 */
#define IS_BUILTIN(option) __is_defined(option)

/*
 * IS_MODULE(CONFIG_FOO) evaluates to 1 if CONFIG_FOO is set to 'm', 0
 * otherwise.  CONFIG_FOO=m results in "#define CONFIG_FOO_MODULE 1" in
 * autoconf.h.
 */
#define IS_MODULE(option) __is_defined(option##_MODULE)

Both call the macro __is_defined, so let's see what that macro does.

include/linux/kconfig.h

#define __ARG_PLACEHOLDER_1 0,
#define __take_second_arg(__ignored, val, ...) val

#define __is_defined(x)			___is_defined(x)
#define ___is_defined(val)		____is_defined(__ARG_PLACEHOLDER_##val)
#define ____is_defined(arg1_or_junk)	__take_second_arg(arg1_or_junk 1, 0)

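The key piece is "#define __ARG_PLACEHOLDER_1 0,". Note that __ARG_PLACEHOLDER_1 expands to a 0 followed by a comma, and that comma is what makes the trick work.
(1) If CONFIG_ARCH_32BIT_OFF_T is defined as 1:
val is 1, so __ARG_PLACEHOLDER_##val
becomes __ARG_PLACEHOLDER_1, which expands to "0,".
____is_defined(arg1_or_junk) is therefore ____is_defined(0,),
and __take_second_arg(arg1_or_junk 1, 0) becomes __take_second_arg(0, 1, 0), whose second argument is 1.
(2) If CONFIG_ARCH_32BIT_OFF_T is not defined:
the name is not expanded, so __ARG_PLACEHOLDER_##val
becomes __ARG_PLACEHOLDER_CONFIG_ARCH_32BIT_OFF_T, which is not a macro and stays as a single junk token.
____is_defined(arg1_or_junk) receives that junk token,
and __take_second_arg(arg1_or_junk 1, 0) becomes __take_second_arg(__ARG_PLACEHOLDER_CONFIG_ARCH_32BIT_OFF_T 1, 0), whose second argument is 0.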
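The two cases above can be tried outside the kernel. The following is a minimal userspace sketch, assuming a GCC-compatible compiler; the macros are copies of the kernel ones renamed with a DEMO_ prefix, and CONFIG_DEMO_SET / CONFIG_DEMO_UNSET are hypothetical option names invented for the illustration.

```c
#include <assert.h>

/* Userspace copies of the kernel macros, renamed with a DEMO_ prefix. */
#define DEMO_ARG_PLACEHOLDER_1 0,
#define DEMO_TAKE_SECOND_ARG(ignored, val, ...) val

#define DEMO_IS_DEFINED(x)              DEMO_IS_DEFINED_2(x)
#define DEMO_IS_DEFINED_2(val)          DEMO_IS_DEFINED_3(DEMO_ARG_PLACEHOLDER_##val)
#define DEMO_IS_DEFINED_3(arg1_or_junk) DEMO_TAKE_SECOND_ARG(arg1_or_junk 1, 0)

/* Hypothetical options: one set to 'y', one left unset. */
#define CONFIG_DEMO_SET 1
/* CONFIG_DEMO_UNSET is deliberately not defined. */

/* The result is an integer constant, so it can be checked at compile time. */
static_assert(DEMO_IS_DEFINED(CONFIG_DEMO_SET) == 1, "defined option yields 1");
static_assert(DEMO_IS_DEFINED(CONFIG_DEMO_UNSET) == 0, "undefined option yields 0");

int demo_set_enabled(void)   { return DEMO_IS_DEFINED(CONFIG_DEMO_SET); }
int demo_unset_enabled(void) { return DEMO_IS_DEFINED(CONFIG_DEMO_UNSET); }

/* Same shape as force_o_largefile(): true only when the option is NOT set. */
int demo_force_flag(void)    { return !DEMO_IS_DEFINED(CONFIG_DEMO_UNSET); }
```

Because everything is resolved by the preprocessor, the compiler only ever sees the literal 1 or 0.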

do_sys_open

fs/open.c

long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
	struct open_how how = build_open_how(flags, mode);
	return do_sys_openat2(dfd, filename, &how);
}

This function is only two lines.
The first builds an open_how object from flags and mode,
and the second calls do_sys_openat2.
The dfd argument is AT_FDCWD.

#define AT_FDCWD		-100    /* Special value used to indicate
                                           openat should use the current
                                           working directory. */

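It means "use the current working directory". What does that mean?
It only makes sense next to the openat call, even though open is what we are analysing. First compare the two prototypes.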

int openat(int dirfd, const char *pathname, int flags);
int open(const char *pathname, int flags);

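The difference is the first parameter, dirfd: literally the file descriptor (fd) of a directory (dir).
If pathname is relative, openat resolves it relative to the directory that dirfd refers to instead of the current working directory. Passing AT_FDCWD asks for the current working directory, which is how open reuses the same code path.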

build_open_how


#define WILL_CREATE(flags)	(flags & (O_CREAT | __O_TMPFILE))
#define O_PATH_FLAGS		(O_DIRECTORY | O_NOFOLLOW | O_PATH | O_CLOEXEC)

inline struct open_how build_open_how(int flags, umode_t mode)
{
	struct open_how how = {
		.flags = flags & VALID_OPEN_FLAGS,
		.mode = mode & S_IALLUGO,
	};

	/* O_PATH beats everything else. */
	if (how.flags & O_PATH)
		how.flags &= O_PATH_FLAGS;
	/* Modes should only be set for create-like flags. */
	if (!WILL_CREATE(how.flags))
		how.mode = 0;
	return how;
}

First, flags is filtered with a bitwise AND. VALID_OPEN_FLAGS is defined as follows.

#define VALID_OPEN_FLAGS \
	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
	 O_APPEND | O_NDELAY | O_NONBLOCK | __O_SYNC | O_DSYNC | \
	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)

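Look familiar? These are exactly the flags that open accepts. mode is filtered the same way, against S_IALLUGO.
Then come two special cases.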

#define O_PATH		010000000

if (how.flags & O_PATH)
		how.flags &= O_PATH_FLAGS;

If O_PATH is set, only the flags listed in O_PATH_FLAGS are kept.

#define WILL_CREATE(flags)	(flags & (O_CREAT | __O_TMPFILE))

/* Modes should only be set for create-like flags. */
if (!WILL_CREATE(how.flags))
		how.mode = 0;

The comment says it all: next it checks whether flags contains O_CREAT or __O_TMPFILE. If either is set, mode is left alone; otherwise mode is forced to 0.
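To see the two special cases in action, here is a simplified userspace sketch of the same logic. It is an illustration under assumptions, not the kernel function: the flag values are written out as they are on x86 Linux, the VALID_OPEN_FLAGS masking is left out, and every DEMO_ name is invented here.

```c
#include <assert.h>

/* Flag values as on x86 Linux, written out here as assumptions. */
#define DEMO_O_WRONLY    00000001
#define DEMO_O_CREAT     00000100
#define DEMO_O_DIRECTORY 00200000
#define DEMO_O_NOFOLLOW  00400000
#define DEMO_O_CLOEXEC   02000000
#define DEMO_O_PATH      010000000
#define DEMO_O_TMPFILE   020000000   /* stands in for __O_TMPFILE */
#define DEMO_S_IALLUGO   07777

#define DEMO_WILL_CREATE(f) ((f) & (DEMO_O_CREAT | DEMO_O_TMPFILE))
#define DEMO_O_PATH_FLAGS \
    (DEMO_O_DIRECTORY | DEMO_O_NOFOLLOW | DEMO_O_PATH | DEMO_O_CLOEXEC)

struct demo_open_how {
    unsigned long long flags;
    unsigned long long mode;
};

struct demo_open_how demo_build_open_how(int flags, unsigned int mode)
{
    struct demo_open_how how = {
        .flags = (unsigned long long)flags,
        .mode = mode & DEMO_S_IALLUGO,
    };

    /* O_PATH beats everything else. */
    if (how.flags & DEMO_O_PATH)
        how.flags &= DEMO_O_PATH_FLAGS;
    /* Modes only make sense for create-like flags. */
    if (!DEMO_WILL_CREATE(how.flags))
        how.mode = 0;
    return how;
}

void demo_open_how_check(void)
{
    /* Our scenario, O_WRONLY only: no create flag, so mode is dropped. */
    assert(demo_build_open_how(DEMO_O_WRONLY, 0644).mode == 0);
    /* With O_CREAT the permission bits survive. */
    assert(demo_build_open_how(DEMO_O_WRONLY | DEMO_O_CREAT, 0644).mode == 0644);
    /* O_PATH drops everything outside O_PATH_FLAGS, here O_WRONLY. */
    assert(demo_build_open_how(DEMO_O_PATH | DEMO_O_WRONLY, 0).flags == DEMO_O_PATH);
}
```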

do_sys_openat2

static long do_sys_openat2(int dfd, const char __user *filename,
			   struct open_how *how)
{
	struct open_flags op;
	int fd = build_open_flags(how, &op);
	struct filename *tmp;

	if (fd)
		return fd;

	tmp = getname(filename);
	if (IS_ERR(tmp))
		return PTR_ERR(tmp);

	fd = get_unused_fd_flags(how->flags);
	if (fd >= 0) {
		struct file *f = do_filp_open(dfd, tmp, &op);
		if (IS_ERR(f)) {
			put_unused_fd(fd);
			fd = PTR_ERR(f);
		} else {
			fd_install(fd, f);
		}
	}
	putname(tmp);
	return fd;
}

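Judging by the parameters, flags and mode have already been tidied up and stored in struct open_how. The overall flow of do_sys_openat2 is:
1 Call build_open_flags to fill in op.
2 Call getname to build the filename (it returns a struct pointer, so the memory is allocated inside and the members are filled in there).
3 Call get_unused_fd_flags to reserve an unused file descriptor number.
4 Call do_filp_open; from its arguments you can guess that this is the step that really opens the file.
5 Call fd_install, which shows that fd was already given a valid value by get_unused_fd_flags and is now bound to the returned struct file.
6 Call putname to release the filename.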

build_open_flags

The source is as follows.

inline int build_open_flags(const struct open_how *how, struct open_flags *op)
{
	u64 flags = how->flags;
	u64 strip = __FMODE_NONOTIFY | O_CLOEXEC;
	int lookup_flags = 0;
	int acc_mode = ACC_MODE(flags);

	BUILD_BUG_ON_MSG(upper_32_bits(VALID_OPEN_FLAGS),
			 "struct open_flags doesn't yet handle flags > 32 bits");

	/*
	 * Strip flags that either shouldn't be set by userspace like
	 * FMODE_NONOTIFY or that aren't relevant in determining struct
	 * open_flags like O_CLOEXEC.
	 */
	flags &= ~strip;

	/*
	 * Older syscalls implicitly clear all of the invalid flags or argument
	 * values before calling build_open_flags(), but openat2(2) checks all
	 * of its arguments.
	 */
	if (flags & ~VALID_OPEN_FLAGS)
		return -EINVAL;
	if (how->resolve & ~VALID_RESOLVE_FLAGS)
		return -EINVAL;

	/* Scoping flags are mutually exclusive. */
	if ((how->resolve & RESOLVE_BENEATH) && (how->resolve & RESOLVE_IN_ROOT))
		return -EINVAL;

	/* Deal with the mode. */
	if (WILL_CREATE(flags)) {
		if (how->mode & ~S_IALLUGO)
			return -EINVAL;
		op->mode = how->mode | S_IFREG;
	} else {
		if (how->mode != 0)
			return -EINVAL;
		op->mode = 0;
	}

	/*
	 * Block bugs where O_DIRECTORY | O_CREAT created regular files.
	 * Note, that blocking O_DIRECTORY | O_CREAT here also protects
	 * O_TMPFILE below which requires O_DIRECTORY being raised.
	 */
	if ((flags & (O_DIRECTORY | O_CREAT)) == (O_DIRECTORY | O_CREAT))
		return -EINVAL;

	/* Now handle the creative implementation of O_TMPFILE. */
	if (flags & __O_TMPFILE) {
		/*
		 * In order to ensure programs get explicit errors when trying
		 * to use O_TMPFILE on old kernels we enforce that O_DIRECTORY
		 * is raised alongside __O_TMPFILE.
		 */
		if (!(flags & O_DIRECTORY))
			return -EINVAL;
		if (!(acc_mode & MAY_WRITE))
			return -EINVAL;
	}
	if (flags & O_PATH) {
		/* O_PATH only permits certain other flags to be set. */
		if (flags & ~O_PATH_FLAGS)
			return -EINVAL;
		acc_mode = 0;
	}

	/*
	 * O_SYNC is implemented as __O_SYNC|O_DSYNC.  As many places only
	 * check for O_DSYNC if the need any syncing at all we enforce it's
	 * always set instead of having to deal with possibly weird behaviour
	 * for malicious applications setting only __O_SYNC.
	 */
	if (flags & __O_SYNC)
		flags |= O_DSYNC;

	op->open_flag = flags;

	/* O_TRUNC implies we need access checks for write permissions */
	if (flags & O_TRUNC)
		acc_mode |= MAY_WRITE;

	/* Allow the LSM permission hook to distinguish append
	   access from general write access. */
	if (flags & O_APPEND)
		acc_mode |= MAY_APPEND;

	op->acc_mode = acc_mode;

	op->intent = flags & O_PATH ? 0 : LOOKUP_OPEN;

	if (flags & O_CREAT) {
		op->intent |= LOOKUP_CREATE;
		if (flags & O_EXCL) {
			op->intent |= LOOKUP_EXCL;
			flags |= O_NOFOLLOW;
		}
	}

	if (flags & O_DIRECTORY)
		lookup_flags |= LOOKUP_DIRECTORY;
	if (!(flags & O_NOFOLLOW))
		lookup_flags |= LOOKUP_FOLLOW;

	if (how->resolve & RESOLVE_NO_XDEV)
		lookup_flags |= LOOKUP_NO_XDEV;
	if (how->resolve & RESOLVE_NO_MAGICLINKS)
		lookup_flags |= LOOKUP_NO_MAGICLINKS;
	if (how->resolve & RESOLVE_NO_SYMLINKS)
		lookup_flags |= LOOKUP_NO_SYMLINKS;
	if (how->resolve & RESOLVE_BENEATH)
		lookup_flags |= LOOKUP_BENEATH;
	if (how->resolve & RESOLVE_IN_ROOT)
		lookup_flags |= LOOKUP_IN_ROOT;
	if (how->resolve & RESOLVE_CACHED) {
		/* Don't bother even trying for create/truncate/tmpfile open */
		if (flags & (O_TRUNC | O_CREAT | __O_TMPFILE))
			return -EAGAIN;
		lookup_flags |= LOOKUP_CACHED;
	}

	op->lookup_flags = lookup_flags;
	return 0;
}
u64 flags = how->flags;
u64 strip = __FMODE_NONOTIFY | O_CLOEXEC;

The first line is a plain assignment. The source related to __FMODE_NONOTIFY is:

#define __FMODE_NONOTIFY	((__force int) FMODE_NONOTIFY)
#define FMODE_NONOTIFY		((__force fmode_t)(1 << 26))
typedef unsigned int __bitwise fmode_t;

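For __bitwise and __force, see: https://blog.youkuaiyun.com/RNG_uzi1111111/article/details/140937157?spm=1001.2014.3001.5501
For now you can simply pretend they are not there.
In plain terms, strip is the OR of two flags. I have not yet worked out what FMODE_NONOTIFY is for.
Example:

The observed value of strip is 67633152, which is 0402000000 in octal: FMODE_NONOTIFY (1 << 26, octal 0400000000) OR O_CLOEXEC (02000000). It matches.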

#define O_ACCMODE	00000003
#define O_RDONLY	00000000
#define O_WRONLY	00000001
#define O_RDWR		00000002

#define ACC_MODE(x) ("\004\002\006\006"[(x)&O_ACCMODE])

int lookup_flags = 0;
int acc_mode = ACC_MODE(flags);

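ACC_MODE looks at the low two bits of flags (the access mode) and maps them to a permission value. The notation looks a little odd, so here is an explanation.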

"\004\002\006\006"[(x)&O_ACCMODE]

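For octal escape sequences see https://xiaoxiami.gitbook.io/c/zhuan-yi-zi-fu-he-kong-bai-fu
In short, "\004\002\006\006" is a string literal, that is, an array of char whose elements are 4, 2, 6 and 6, and it is indexed by flags & O_ACCMODE.
In other words:

flags & O_ACCMODE    acc_mode
0 (O_RDONLY)         4
1 (O_WRONLY)         2
2 (O_RDWR)           6
3 (O_ACCMODE)        6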

Example:
flags is 557056, which is 02100000 in octal: O_LARGEFILE (00100000) OR O_CLOEXEC (02000000). ANDing it with O_ACCMODE gives 0, so acc_mode is 4.
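The string-literal indexing can be checked directly in userspace. This is a minimal sketch assuming Linux, where O_ACCMODE is 3; demo_acc_mode is a made-up name that reuses the kernel's expression.

```c
#include <assert.h>
#include <fcntl.h>

/* Same trick as the kernel's ACC_MODE: index a string literal. */
int demo_acc_mode(int flags)
{
    return "\004\002\006\006"[flags & O_ACCMODE];
}

void demo_acc_mode_check(void)
{
    assert(demo_acc_mode(O_RDONLY) == 4);  /* read */
    assert(demo_acc_mode(O_WRONLY) == 2);  /* write */
    assert(demo_acc_mode(O_RDWR) == 6);    /* read and write */
    assert(demo_acc_mode(557056) == 4);    /* the flags value from the example */
}
```

Indexing a literal avoids a switch statement: the two access-mode bits select one of four precomputed bytes.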

BUILD_BUG_ON_MSG(upper_32_bits(VALID_OPEN_FLAGS),
			 "struct open_flags doesn't yet handle flags > 32 bits");
			 
#define upper_32_bits(n) ((u32)(((n) >> 16) >> 16))
/**
 * BUILD_BUG_ON_MSG - break compile if a condition is true & emit supplied
 *		      error message.
 * @condition: the condition which the compiler should know is false.
 *
 * See BUILD_BUG_ON for description.
 */
#define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)

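This checks whether VALID_OPEN_FLAGS has any bit set above the low 32 bits and stops the build if it does, so compiletime_assert presumably expands to a compiler-level check.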

/*
	 * Strip flags that either shouldn't be set by userspace like
	 * FMODE_NONOTIFY or that aren't relevant in determining struct
	 * open_flags like O_CLOEXEC.
	 */
	flags &= ~strip;

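The comment is explicit: FMODE_NONOTIFY must not be set by userspace, and O_CLOEXEC is irrelevant to struct open_flags, so both bits are cleared in this local copy.
Example:
32768 is 0100000 in octal, which is O_LARGEFILE. flags used to be O_LARGEFILE OR O_CLOEXEC and now only O_LARGEFILE is left. O_CLOEXEC is not lost, though: how->flags still carries it, and do_sys_openat2 passes how->flags to get_unused_fd_flags.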
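The arithmetic in the two examples can be reproduced with a few lines of C. A small sketch, assuming the x86 values quoted above; the DEMO_ names are invented for the illustration.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_FMODE_NONOTIFY (UINT64_C(1) << 26)
#define DEMO_O_CLOEXEC      UINT64_C(02000000)
#define DEMO_O_LARGEFILE    UINT64_C(00100000)

/* Clear the bits that build_open_flags strips from its local copy. */
uint64_t demo_strip_flags(uint64_t flags)
{
    uint64_t strip = DEMO_FMODE_NONOTIFY | DEMO_O_CLOEXEC;
    return flags & ~strip;
}

void demo_strip_check(void)
{
    /* strip itself: 0402000000 octal. */
    assert((DEMO_FMODE_NONOTIFY | DEMO_O_CLOEXEC) == 67633152);
    /* O_LARGEFILE | O_CLOEXEC goes in, only O_LARGEFILE comes out. */
    assert(demo_strip_flags(557056) == 32768);
    assert(demo_strip_flags(557056) == DEMO_O_LARGEFILE);
}
```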

/*
	 * Older syscalls implicitly clear all of the invalid flags or argument
	 * values before calling build_open_flags(), but openat2(2) checks all
	 * of its arguments.
	 */
	if (flags & ~VALID_OPEN_FLAGS)
		return -EINVAL;
	if (how->resolve & ~VALID_RESOLVE_FLAGS)
		return -EINVAL;

	/* Scoping flags are mutually exclusive. */
	if ((how->resolve & RESOLVE_BENEATH) && (how->resolve & RESOLVE_IN_ROOT))
		return -EINVAL;

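The first if is a re-check: this function is also called from other places (io_openat2, for example), so it validates again to be safe. The next two ifs do not matter for us, because resolve is always 0 on this path and both pass.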

/* Deal with the mode. */
	if (WILL_CREATE(flags)) {
		if (how->mode & ~S_IALLUGO)
			return -EINVAL;
		op->mode = how->mode | S_IFREG;
	} else {
		if (how->mode != 0)
			return -EINVAL;
		op->mode = 0;
	}

Check whether flags has O_CREAT or __O_TMPFILE. If so, mode matters: it is validated and op->mode is set from how->mode (with S_IFREG added). If not, how->mode must be 0, and op->mode is set to 0.

/*
	 * Block bugs where O_DIRECTORY | O_CREAT created regular files.
	 * Note, that blocking O_DIRECTORY | O_CREAT here also protects
	 * O_TMPFILE below which requires O_DIRECTORY being raised.
	 */
	if ((flags & (O_DIRECTORY | O_CREAT)) == (O_DIRECTORY | O_CREAT))
		return -EINVAL;

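O_DIRECTORY exists to make sure the thing being opened is a directory. Combining it with O_CREAT does not mean "create a directory", so if flags has both O_DIRECTORY and O_CREAT the function returns -EINVAL.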

#define MAY_WRITE		0x00000002

/* Now handle the creative implementation of O_TMPFILE. */
	if (flags & __O_TMPFILE) {
		/*
		 * In order to ensure programs get explicit errors when trying
		 * to use O_TMPFILE on old kernels we enforce that O_DIRECTORY
		 * is raised alongside __O_TMPFILE.
		 */
		if (!(flags & O_DIRECTORY))
			return -EINVAL;
		if (!(acc_mode & MAY_WRITE))
			return -EINVAL;
	}

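This checks whether open is creating a temporary file. So that programs get an explicit error on old kernels, O_DIRECTORY must be set together with __O_TMPFILE.
The access mode must also include writing (O_RDWR or O_WRONLY), that is, acc_mode must be 2 or 6; otherwise the call fails with -EINVAL.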

if (flags & O_PATH) {
		/* O_PATH only permits certain other flags to be set. */
		if (flags & ~O_PATH_FLAGS)
			return -EINVAL;
		acc_mode = 0;
	}

Another check: if O_PATH is set, every bit in flags must be within O_PATH_FLAGS or the call fails, and acc_mode is set to 0.

#define O_DSYNC		00010000	

#ifndef O_SYNC
#define __O_SYNC	04000000
#define O_SYNC		(__O_SYNC|O_DSYNC)
#endif
/*
	 * O_SYNC is implemented as __O_SYNC|O_DSYNC.  As many places only
	 * check for O_DSYNC if the need any syncing at all we enforce it's
	 * always set instead of having to deal with possibly weird behaviour
	 * for malicious applications setting only __O_SYNC.
	 */
	if (flags & __O_SYNC)
		flags |= O_DSYNC;

The comment is explicit: a malicious program might set only __O_SYNC, so O_DSYNC is added to be safe.

op->open_flag = flags;

At this point flags is final.

/* O_TRUNC implies we need access checks for write permissions */
	if (flags & O_TRUNC)
		acc_mode |= MAY_WRITE;

	/* Allow the LSM permission hook to distinguish append
	   access from general write access. */
	if (flags & O_APPEND)
		acc_mode |= MAY_APPEND;

	op->acc_mode = acc_mode;

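This sets acc_mode. As the name suggests, acc_mode stores the access mode: read, write, or both. But flags already holds that, so why keep it separately? My guess is that this keeps it apart from the other flags, which would otherwise blur the logic. Later on, acc_mode is used to check whether the process has permission to open the file.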

#define LOOKUP_OPEN		0x0100	


op->intent = flags & O_PATH ? 0 : LOOKUP_OPEN;

If O_PATH is set, intent is 0; otherwise it is LOOKUP_OPEN. (One open question: what is LOOKUP_OPEN used for?)

if (flags & O_CREAT) {
		op->intent |= LOOKUP_CREATE;
		if (flags & O_EXCL) {
			op->intent |= LOOKUP_EXCL;
			flags |= O_NOFOLLOW;
		}
	}

If flags has O_CREAT, LOOKUP_CREATE is added to op->intent, and likewise LOOKUP_EXCL for O_EXCL, which additionally sets O_NOFOLLOW in flags. The man page excerpt below shows why.

O_EXCL 
		Ensure that this call creates the file: if this flag is
          specified in conjunction with O_CREAT, and pathname
          already exists, then open() fails with the error EEXIST.

          When these two flags are specified, symbolic links are not
          followed: if pathname is a symbolic link, then open()
          fails regardless of where the symbolic link points.
 O_NOFOLLOW
          If the trailing component (i.e., basename) of pathname is
          a symbolic link, then the open fails, with the error
          ELOOP.

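In other words, with O_CREAT and O_EXCL set, the final component of the path must not be a symbolic link. Inside the kernel that rule belongs to O_NOFOLLOW, so it is added here. (Hmm, if I commented this line out, would the restriction go away?)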
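The user-visible side of this rule can be observed from userspace. The sketch below is an illustration that assumes a writable /tmp; the helper name and the path template are made up. It creates a dangling symbolic link and tries to open it with O_CREAT | O_EXCL, which the man page text above says must fail.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Returns the errno of open(O_CREAT | O_EXCL) on a dangling symlink,
 * 0 if the open unexpectedly succeeded, or -1 if the setup failed.
 */
int demo_excl_on_symlink(void)
{
    char dir[] = "/tmp/demo_excl_XXXXXX";   /* hypothetical scratch dir */
    char link_path[64];
    int result = -1;

    if (!mkdtemp(dir))
        return -1;
    snprintf(link_path, sizeof(link_path), "%s/link", dir);

    if (symlink("no-such-target", link_path) == 0) {
        int fd = open(link_path, O_WRONLY | O_CREAT | O_EXCL, 0600);
        if (fd >= 0) {
            result = 0;
            close(fd);
        } else {
            result = errno;
        }
        unlink(link_path);
    }
    rmdir(dir);
    return result;
}

void demo_excl_check(void)
{
    /* The link itself already exists, so the exclusive create fails. */
    assert(demo_excl_on_symlink() == EEXIST);
}
```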


	if (flags & O_DIRECTORY)
		lookup_flags |= LOOKUP_DIRECTORY;
	if (!(flags & O_NOFOLLOW))
		lookup_flags |= LOOKUP_FOLLOW;

	if (how->resolve & RESOLVE_NO_XDEV)
		lookup_flags |= LOOKUP_NO_XDEV;
	if (how->resolve & RESOLVE_NO_MAGICLINKS)
		lookup_flags |= LOOKUP_NO_MAGICLINKS;
	if (how->resolve & RESOLVE_NO_SYMLINKS)
		lookup_flags |= LOOKUP_NO_SYMLINKS;
	if (how->resolve & RESOLVE_BENEATH)
		lookup_flags |= LOOKUP_BENEATH;
	if (how->resolve & RESOLVE_IN_ROOT)
		lookup_flags |= LOOKUP_IN_ROOT;
	if (how->resolve & RESOLVE_CACHED) {
		/* Don't bother even trying for create/truncate/tmpfile open */
		if (flags & (O_TRUNC | O_CREAT | __O_TMPFILE))
			return -EAGAIN;
		lookup_flags |= LOOKUP_CACHED;
	}

	op->lookup_flags = lookup_flags;
	return 0;

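lookup_flags is built from flags and how->resolve and then stored in op->lookup_flags. Conveniently, how->resolve is 0 for us.
Example:
As noted above, flags is down to O_LARGEFILE by this point, so intent is 256, which is LOOKUP_OPEN. I have not yet said how the final lookup_flags value comes about.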

#define LOOKUP_FOLLOW		0x0001	/* follow links at the end */

if (!(flags & O_NOFOLLOW))
		lookup_flags |= LOOKUP_FOLLOW;

See: if O_NOFOLLOW is not set, LOOKUP_FOLLOW is set in lookup_flags. Everything matches.

getname


struct filename *
getname(const char __user * filename)
{
	return getname_flags(filename, 0, NULL);
}

OK, a thin wrapper. For the __user macro see: https://blog.youkuaiyun.com/RNG_uzi1111111/article/details/141070859?spm=1001.2014.3001.5501
You can also skip that and simply ignore the macro.
getname_flags is as follows.

struct filename *
getname_flags(const char __user *filename, int flags, int *empty)
{
	struct filename *result;
	char *kname;
	int len;

	result = audit_reusename(filename);
	if (result)
		return result;

	result = __getname();
	if (unlikely(!result))
		return ERR_PTR(-ENOMEM);

	/*
	 * First, try to embed the struct filename inside the names_cache
	 * allocation
	 */
	kname = (char *)result->iname;
	result->name = kname;

	len = strncpy_from_user(kname, filename, EMBEDDED_NAME_MAX);
	if (unlikely(len < 0)) {
		__putname(result);
		return ERR_PTR(len);
	}

	/*
	 * Uh-oh. We have a name that's approaching PATH_MAX. Allocate a
	 * separate struct filename so we can dedicate the entire
	 * names_cache allocation for the pathname, and re-do the copy from
	 * userland.
	 */
	if (unlikely(len == EMBEDDED_NAME_MAX)) {
		const size_t size = offsetof(struct filename, iname[1]);
		kname = (char *)result;

		/*
		 * size is chosen that way we to guarantee that
		 * result->iname[0] is within the same object and that
		 * kname can't be equal to result->iname, no matter what.
		 */
		result = kzalloc(size, GFP_KERNEL);
		if (unlikely(!result)) {
			__putname(kname);
			return ERR_PTR(-ENOMEM);
		}
		result->name = kname;
		len = strncpy_from_user(kname, filename, PATH_MAX);
		if (unlikely(len < 0)) {
			__putname(kname);
			kfree(result);
			return ERR_PTR(len);
		}
		if (unlikely(len == PATH_MAX)) {
			__putname(kname);
			kfree(result);
			return ERR_PTR(-ENAMETOOLONG);
		}
	}

	atomic_set(&result->refcnt, 1);
	/* The empty path is special. */
	if (unlikely(!len)) {
		if (empty)
			*empty = 1;
		if (!(flags & LOOKUP_EMPTY)) {
			putname(result);
			return ERR_PTR(-ENOENT);
		}
	}

	result->uptr = filename;
	result->aname = NULL;
	audit_getname(result);
	return result;
}

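Overall flow:
1 Call audit_reusename to see whether the audit subsystem already holds this name; if it does, reuse it.
2 Call __getname to allocate memory.
3 Point result->name at the struct's own iname array (for flexible array members see: https://www.cnblogs.com/gexin/p/9116292.html).
4 Copy the string from user space into result.
5 If the copy succeeds, fill in the remaining members.
6 Otherwise handle the special cases (copy errors, over-long names, the empty path).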

audit_reusename
static inline struct filename *audit_reusename(const __user char *name)
{
	if (unlikely(!audit_dummy_context()))
		return __audit_reusename(name);
	return NULL;
}

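It first calls audit_dummy_context to check whether the current process has a live audit context. If it does, __audit_reusename checks whether this name is already recorded there; if it is not, it will be added later once it has been created.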

audit_dummy_context

audit_dummy_context is as follows:


static inline bool audit_dummy_context(void)
{
	void *p = audit_context();
	return !p || *(int *)p;
}

audit_context is as follows

static inline struct audit_context *audit_context(void)
{
	return current->audit_context;
}

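current is the pointer to the current process, and audit_context returns that process's audit context.
The logic of audit_dummy_context is a little hard to read. It returns false only when the audit context pointer is non-NULL and its dummy member is 0 (dummy is the first member of struct audit_context, which is why the *(int *)p cast works). What the value 0 signals exactly is beyond the scope here, so I will not dig into it for now.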

__audit_reusename

The code is as follows

struct filename *
__audit_reusename(const __user char *uptr)
{
	struct audit_context *context = audit_context();
	struct audit_names *n;

	list_for_each_entry(n, &context->names_list, list) {
		if (!n->name)
			continue;
		if (n->name->uptr == uptr) {
			atomic_inc(&n->name->refcnt);
			return n->name;
		}
	}
	return NULL;
}

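It takes the current process's audit context and compares uptr against each entry in its list of names. On a match it bumps the reference count and returns that name; otherwise it returns NULL.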

__getname

extern struct kmem_cache *names_cachep;

#define __getname()		kmem_cache_alloc(names_cachep, GFP_KERNEL)
#define __putname(name)		kmem_cache_free(names_cachep, (void *)(name))

This leads into memory management. All we need to know is that it hands back a block of memory.

kname = (char *)result->iname;
	result->name = kname;

	len = strncpy_from_user(kname, filename, EMBEDDED_NAME_MAX);
	if (unlikely(len < 0)) {
		__putname(result);
		return ERR_PTR(len);
	}

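This part points the struct's name pointer at its own embedded character array and then copies in the path argument passed to open. I had planned to cover strncpy_from_user if it turned out to be simple, but... never mind, the goal here is the file system. For everything else, knowing what it does is enough.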

if (unlikely(len == EMBEDDED_NAME_MAX)) {
		const size_t size = offsetof(struct filename, iname[1]);
		kname = (char *)result;

		/*
		 * size is chosen that way we to guarantee that
		 * result->iname[0] is within the same object and that
		 * kname can't be equal to result->iname, no matter what.
		 */
		result = kzalloc(size, GFP_KERNEL);
		if (unlikely(!result)) {
			__putname(kname);
			return ERR_PTR(-ENOMEM);
		}
		result->name = kname;
		len = strncpy_from_user(kname, filename, PATH_MAX);
		if (unlikely(len < 0)) {
			__putname(kname);
			kfree(result);
			return ERR_PTR(len);
		}
		if (unlikely(len == PATH_MAX)) {
			__putname(kname);
			kfree(result);
			return ERR_PTR(-ENAMETOOLONG);
		}
	}

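If the path fills the embedded buffer (len == EMBEDDED_NAME_MAX), it gets special handling:
1 Compute size, the size of struct filename up to and including the first byte of iname.
2 Allocate a separate struct filename of that size with kzalloc.
3 Use the whole of the originally allocated buffer to store the path, and copy from user space again with a PATH_MAX limit.
4 If the path fits, fine; if len reaches PATH_MAX, fail with -ENAMETOOLONG.
This rarely happens in practice.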

atomic_set(&result->refcnt, 1);
	/* The empty path is special. */
	if (unlikely(!len)) {
		if (empty)
			*empty = 1;
		if (!(flags & LOOKUP_EMPTY)) {
			putname(result);
			return ERR_PTR(-ENOENT);
		}
	}

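This marks result as referenced once; atomic_set is an atomic operation that simply stores the value. The empty path is then handled specially:
1 empty is always NULL for us, because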

struct filename *
getname(const char __user * filename)
{
	return getname_flags(filename, 0, NULL);
}

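NULL is what getname passes when it calls this function.
2 The LOOKUP_EMPTY bit of flags is checked. It is not set here (flags is 0), so an empty path fails with -ENOENT.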

	result->uptr = filename;
	result->aname = NULL;
	audit_getname(result);
	return result;

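Record the original user-space address of the path, set aname (the audit entry) to NULL, and then hand the name to the audit subsystem. I will skip audit_getname and look into it when I have time.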

Summary: getname copies the path from user space into kernel space and wraps it in a struct filename.
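The "pointer into its own embedded array" layout can be imitated in ordinary C. The following is a simplified userspace stand-in, not the kernel structure: struct demo_filename keeps only three members, malloc replaces the names_cache allocation, and memcpy replaces strncpy_from_user.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for struct filename: a header followed by the name. */
struct demo_filename {
    const char *name;   /* points at iname in the common case */
    int refcnt;
    char iname[];       /* flexible array member */
};

struct demo_filename *demo_getname(const char *path)
{
    size_t len = strlen(path);
    struct demo_filename *result = malloc(sizeof(*result) + len + 1);

    if (!result)
        return NULL;
    memcpy(result->iname, path, len + 1);   /* copy including the NUL */
    result->name = result->iname;           /* name lives inside the object */
    result->refcnt = 1;
    return result;
}

void demo_putname(struct demo_filename *name)
{
    free(name);
}

/* Returns 1 if the copy is embedded in the object and matches `path`. */
int demo_name_is_embedded(const char *path)
{
    struct demo_filename *f = demo_getname(path);
    int ok;

    assert(f != NULL);
    ok = f->name == f->iname && strcmp(f->name, path) == 0 && f->refcnt == 1;
    demo_putname(f);
    return ok;
}
```

One allocation holds both the bookkeeping and the string, which is the point of the embedded layout for short names.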

get_unused_fd_flags

The code is as follows

#ifndef RLIMIT_NOFILE
# define RLIMIT_NOFILE		7	/* max number of open files */
#endif
int get_unused_fd_flags(unsigned flags)
{
	return __get_unused_fd_flags(flags, rlimit(RLIMIT_NOFILE));
}
rlimit

rlimit is defined as follows

static inline unsigned long rlimit(unsigned int limit)
{
	return task_rlimit(current, limit);
}

Just a wrapper; current is the pointer to the current process.
task_rlimit is defined as follows

static inline unsigned long task_rlimit(const struct task_struct *task,
		unsigned int limit)
{
	return READ_ONCE(task->signal->rlim[limit].rlim_cur);
}

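READ_ONCE can be read as a plain load here. In short, this reads the rlim_cur member (the soft limit) of the rlim entry at index 7. How rlim_cur is used is shown next.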
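The same value can be read from userspace with getrlimit. A minimal sketch, assuming a POSIX system; demo_nofile_soft_limit is a made-up helper name.

```c
#include <assert.h>
#include <sys/resource.h>

/* Soft limit on open file descriptors, or 0 if the query fails. */
unsigned long demo_nofile_soft_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return 0;
    /* The soft limit never exceeds the hard limit. */
    assert(rl.rlim_cur <= rl.rlim_max);
    return (unsigned long)rl.rlim_cur;
}
```

On a typical shell this matches what `ulimit -n` prints.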

__get_unused_fd_flags

int __get_unused_fd_flags(unsigned flags, unsigned long nofile)
{
	return alloc_fd(0, nofile, flags);
}

This just calls alloc_fd, where nofile is the rlim_cur value read above.
alloc_fd is defined as follows

/*
 * allocate a file descriptor, mark it busy.
 */
static int alloc_fd(unsigned start, unsigned end, unsigned flags)
{
	struct files_struct *files = current->files;
	unsigned int fd;
	int error;
	struct fdtable *fdt;

	spin_lock(&files->file_lock);
repeat:
	fdt = files_fdtable(files);
	fd = start;
	if (fd < files->next_fd)
		fd = files->next_fd;

	if (fd < fdt->max_fds)
		fd = find_next_fd(fdt, fd);

	/*
	 * N.B. For clone tasks sharing a files structure, this test
	 * will limit the total number of files that can be opened.
	 */
	error = -EMFILE;
	if (fd >= end)
		goto out;

	error = expand_files(files, fd);
	if (error < 0)
		goto out;

	/*
	 * If we needed to expand the fs array we
	 * might have blocked - try again.
	 */
	if (error)
		goto repeat;

	if (start <= files->next_fd)
		files->next_fd = fd + 1;

	__set_open_fd(fd, fdt);
	if (flags & O_CLOEXEC)
		__set_close_on_exec(fd, fdt);
	else
		__clear_close_on_exec(fd, fdt);
	error = fd;
#if 1
	/* Sanity check */
	if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
		printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
		rcu_assign_pointer(fdt->fd[fd], NULL);
	}
#endif

out:
	spin_unlock(&files->file_lock);
	return error;
}

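Now it is clear. start is 0, end is the rlim_cur value, and flags is how->flags (set in build_open_how, so fairly close to what the user passed). The overall goal of this function is to find a position whose bit is 0 in the current process's fd table. The table keeps a block of data used as a bitmap, in which each bit stands for one file descriptor slot: 0 means free, 1 means in use.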
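The effect of this search is easy to observe from userspace: a freed descriptor number is the next one handed out. A small sketch, assuming a single-threaded process and a readable current directory; the helper name is invented.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Returns 1 if a freed descriptor number is handed out again,
 * 0 if not, or -1 if the setup failed.
 */
int demo_lowest_fd_is_reused(void)
{
    int a = open(".", O_RDONLY);
    int b = open(".", O_RDONLY);
    int c, reused;

    if (a < 0 || b < 0) {
        if (a >= 0)
            close(a);
        if (b >= 0)
            close(b);
        return -1;
    }
    assert(a < b);          /* each open takes the lowest free number */

    close(a);               /* clears a's bit in the bitmap */
    c = open(".", O_RDONLY);
    reused = (c == a);

    if (c >= 0)
        close(c);
    close(b);
    return reused;
}
```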

struct files_struct *files = current->files;
	unsigned int fd;
	int error;
	struct fdtable *fdt;

Nothing special: files is assigned and some variables are declared.

spin_lock(&files->file_lock);

A lock, to guarantee mutual exclusion.

fdt = files_fdtable(files);
	fd = start;
	if (fd < files->next_fd)
		fd = files->next_fd;

	if (fd < fdt->max_fds)
		fd = find_next_fd(fdt, fd);
#define files_fdtable(files) \
	rcu_dereference_check_fdtable((files), (files)->fdt)

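This involves RCU; for now it is enough to know that the fdt member of files is assigned to fdt.
The search then starts from start. If start is less than next_fd, it starts from next_fd instead. So next_fd most likely records the lowest position that may still be free: if it is 100, then everything in 0-99 is taken and only 100 and above can be free, which saves some searching.
Next, fd is checked against max_fds. Normally it is smaller, and find_next_fd(fdt, fd) is called to find the index.
find_next_fd is as follows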

static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
{
	unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
	unsigned int maxbit = maxfd / BITS_PER_LONG;
	unsigned int bitbit = start / BITS_PER_LONG;

	bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
	if (bitbit >= maxfd)
		return maxfd;
	if (bitbit > start)
		start = bitbit;
	return find_next_zero_bit(fdt->open_fds, maxfd, start);
}
#ifdef CONFIG_64BIT
#define BITS_PER_LONG 64
#else
#define BITS_PER_LONG 32
#endif 

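See, this depends on the word size, so maxfd is 64, 128, 192 and so on. maxfd and start are both divided by BITS_PER_LONG and stored in maxbit and bitbit. This splits the bitmap into blocks: first find a block that still has a 0 bit, then search inside that block. That is faster than checking bit by bit.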

find_next_zero_bit is defined as follows

/**
 * find_next_zero_bit - find the next cleared bit in a memory region
 * @addr: The address to base the search on
 * @size: The bitmap size in bits
 * @offset: The bitnumber to start searching at
 *
 * Returns the bit number of the next zero bit
 * If no bits are zero, returns @size.
 */
static inline
unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
				 unsigned long offset)
{
	if (small_const_nbits(size)) {
		unsigned long val;

		if (unlikely(offset >= size))
			return size;

		val = *addr | ~GENMASK(size - 1, offset);
		return val == ~0UL ? size : ffz(val);
	}

	return _find_next_zero_bit(addr, size, offset);
}

small_const_nbits is defined as follows

#define small_const_nbits(nbits) \
	(__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG && (nbits) > 0)

__builtin_constant_p is a GCC builtin that tells whether nbits is a compile-time constant. Ours certainly is not, so _find_next_zero_bit is called.

_find_next_zero_bit is as follows

unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
					 unsigned long start)
{
	return FIND_NEXT_BIT(~addr[idx], /* nop */, nbits, start);
}

Note that the first argument passed in is ~addr[idx], a bitwise NOT.
The FIND_NEXT_BIT macro is defined as follows

#define FIND_NEXT_BIT(FETCH, MUNGE, size, start)				\
({										\
	unsigned long mask, idx, tmp, sz = (size), __start = (start);		\
										\
	if (unlikely(__start >= sz))						\
		goto out;							\
										\
	mask = MUNGE(BITMAP_FIRST_WORD_MASK(__start));				\
	idx = __start / BITS_PER_LONG;						\
										\
	for (tmp = (FETCH) & mask; !tmp; tmp = (FETCH)) {			\
		if ((idx + 1) * BITS_PER_LONG >= sz)				\
			goto out;						\
		idx++;								\
	}									\
										\
	sz = min(idx * BITS_PER_LONG + __ffs(MUNGE(tmp)), sz);			\
out:										\
	sz;									\
})

The macro expands to

({ 
		unsigned long mask, idx, tmp, sz = (nbits), __start = (start); 
		if (unlikely(__start >= sz)) goto out; 
		mask = (BITMAP_FIRST_WORD_MASK(__start)); 
		idx = __start / BITS_PER_LONG; 
		for (tmp = (~addr[idx]) & mask; !tmp; tmp = (~addr[idx])) {
			 if ((idx + 1) * BITS_PER_LONG >= sz) goto out;
		   idx++; 
		   } 
  		sz = min(idx * BITS_PER_LONG + __ffs((tmp)), sz); 
  out: 
  		sz; })

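That is much clearer. FIND_NEXT_BIT looks for the next set bit in whatever FETCH yields; because FETCH here is the inverted word ~addr[idx], it ends up finding the next 0 bit of the bitmap. FETCH is the expression that reads word idx of the unsigned long array, MUNGE is not used here, size is the total length in bits, and start is the bit to start from.
First start is compared with size, which normally passes.
Then a mask is computed.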

#define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))

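How it is computed:
1 ~0UL sets every bit to 1.
2 The result is shifted left by start modulo BITS_PER_LONG, which clears the bits below start within its word.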

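Then __start / BITS_PER_LONG is stored in idx.
Now the loop. tmp starts as (FETCH) & mask so that the bits below start in the first word are ignored and the search begins at start. (On this path those bits are already 1, so it feels a little redundant to me, but a general-purpose helper has to account for it.) The loop runs until some word has a 0 bit; idx is then the index of that word and tmp its inverted value, and the bit position follows from a small calculation.
__ffs is as follows: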

static unsigned int generic___ffs(unsigned long word)
{
	unsigned int num = 0;

#if BITS_PER_LONG == 64
	if ((word & 0xffffffff) == 0) {
		num += 32;
		word >>= 32;
	}
#endif
	if ((word & 0xffff) == 0) {
		num += 16;
		word >>= 16;
	}
	if ((word & 0xff) == 0) {
		num += 8;
		word >>= 8;
	}
	if ((word & 0xf) == 0) {
		num += 4;
		word >>= 4;
	}
	if ((word & 0x3) == 0) {
		num += 2;
		word >>= 2;
	}
	if ((word & 0x1) == 0)
		num += 1;
	return num;
}

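This function finds the position of the lowest 1 bit in word. It first checks whether the low 32 bits contain one and, if not, moves to the upper half, then narrows down by 16, 8, 4, 2 and 1. It is clear enough and needs no more comment.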
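The mask and the lowest-set-bit search combine into a "next zero bit within one word" helper. The sketch below is a simplified userspace illustration: the DEMO_ names are invented, and demo_ffs uses a plain loop instead of the binary search shown above, which gives the same result for non-zero input.

```c
#include <assert.h>
#include <limits.h>

#define DEMO_BITS_PER_LONG ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))

/* All bits at or above (start mod word size) set, lower bits clear. */
unsigned long demo_first_word_mask(unsigned int start)
{
    return ~0UL << (start & (DEMO_BITS_PER_LONG - 1));
}

/* Index of the lowest set bit; word must be non-zero. */
unsigned int demo_ffs(unsigned long word)
{
    unsigned int num = 0;

    assert(word != 0);
    while (!(word & 1UL)) {
        word >>= 1;
        num++;
    }
    return num;
}

/* First zero bit of `word` at or after `start`, or DEMO_BITS_PER_LONG. */
unsigned int demo_next_zero_in_word(unsigned long word, unsigned int start)
{
    unsigned long tmp;

    assert(start < DEMO_BITS_PER_LONG);
    /* Invert so that zero bits become ones, then drop bits below start. */
    tmp = ~word & demo_first_word_mask(start);
    return tmp ? demo_ffs(tmp) : DEMO_BITS_PER_LONG;
}
```

For example, with word 0x2F (bits 0 to 3 and bit 5 set) the first free bit from 0 is 4, and from 5 it is 6.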
With the macro explained, find_next_fd is easy to follow on a second look:

static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
{
	unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
	unsigned int maxbit = maxfd / BITS_PER_LONG;
	unsigned int bitbit = start / BITS_PER_LONG;

	bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
	if (bitbit >= maxfd)
		return maxfd;
	if (bitbit > start)
		start = bitbit;
	return find_next_zero_bit(fdt->open_fds, maxfd, start);
}

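First find which block has room, then search that block properly. Note that the first search is over the full_fds_bits member and the second over open_fds. I had assumed both searched the same bitmap and stared at it for a long time without understanding...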
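The two-level search can be modelled in a few dozen lines. This is a simplified userspace sketch under assumptions: a fixed table of four words (a hypothetical size), a naive bit-by-bit scan in place of find_next_zero_bit, and DEMO_ names throughout. demo_set_open_fd anticipates the summary-bit update that alloc_fd performs through __set_open_fd.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define DEMO_BPL     ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))
#define DEMO_WORDS   4u                      /* hypothetical table size */
#define DEMO_MAX_FDS (DEMO_WORDS * DEMO_BPL)

struct demo_fdtable {
    unsigned long open_fds[DEMO_WORDS];  /* one bit per descriptor */
    unsigned long full_fds_bits[1];      /* one bit per word of open_fds */
};

/* Naive scan: first zero bit in [start, size), or size if there is none. */
static unsigned int demo_next_zero(const unsigned long *map,
                                   unsigned int size, unsigned int start)
{
    for (unsigned int i = start; i < size; i++)
        if (!(map[i / DEMO_BPL] & (1UL << (i % DEMO_BPL))))
            return i;
    return size;
}

void demo_set_open_fd(struct demo_fdtable *t, unsigned int fd)
{
    assert(fd < DEMO_MAX_FDS);
    t->open_fds[fd / DEMO_BPL] |= 1UL << (fd % DEMO_BPL);
    if (!~t->open_fds[fd / DEMO_BPL])        /* word completely full? */
        t->full_fds_bits[0] |= 1UL << (fd / DEMO_BPL);
}

unsigned int demo_find_next_fd(const struct demo_fdtable *t, unsigned int start)
{
    unsigned int maxbit = DEMO_MAX_FDS / DEMO_BPL;
    unsigned int bitbit = start / DEMO_BPL;

    /* Level 1: skip words that the summary bitmap marks as full. */
    bitbit = demo_next_zero(t->full_fds_bits, maxbit, bitbit) * DEMO_BPL;
    if (bitbit >= DEMO_MAX_FDS)
        return DEMO_MAX_FDS;
    if (bitbit > start)
        start = bitbit;
    /* Level 2: search the per-descriptor bitmap from there. */
    return demo_next_zero(t->open_fds, DEMO_MAX_FDS, start);
}

/* Mark descriptors 0..n-1 as open, then ask for the next free one from 0. */
unsigned int demo_first_free_after(unsigned int n)
{
    struct demo_fdtable t;

    memset(&t, 0, sizeof(t));
    for (unsigned int fd = 0; fd < n; fd++)
        demo_set_open_fd(&t, fd);
    return demo_find_next_fd(&t, 0);
}
```

When the first word is full, the level-1 search jumps straight to the second word without touching any of the first word's bits.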
With that done, back to alloc_fd.

if (fd < fdt->max_fds)
		fd = find_next_fd(fdt, fd);

	/*
	 * N.B. For clone tasks sharing a files structure, this test
	 * will limit the total number of files that can be opened.
	 */
	error = -EMFILE;
	if (fd >= end)
		goto out;

	error = expand_files(files, fd);
	if (error < 0)
		goto out;

	/*
	 * If we needed to expand the fs array we
	 * might have blocked - try again.
	 */
	if (error)
		goto repeat;

	if (start <= files->next_fd)
		files->next_fd = fd + 1;

Once a position is found, it is checked against the end limit, and then expand_files is called.
expand_files is defined as follows

static int expand_files(struct files_struct *files, unsigned int nr)
	__releases(files->file_lock)
	__acquires(files->file_lock)
{
	struct fdtable *fdt;
	int expanded = 0;

repeat:
	fdt = files_fdtable(files);

	/* Do we need to expand? */
	if (nr < fdt->max_fds)
		return expanded;

	/* Can we expand? */
	if (nr >= sysctl_nr_open)
		return -EMFILE;

	if (unlikely(files->resize_in_progress)) {
		spin_unlock(&files->file_lock);
		expanded = 1;
		wait_event(files->resize_wait, !files->resize_in_progress);
		spin_lock(&files->file_lock);
		goto repeat;
	}

	/* All good, so we try */
	files->resize_in_progress = true;
	expanded = expand_fdtable(files, nr);
	files->resize_in_progress = false;

	wake_up_all(&files->resize_wait);
	return expanded;
}

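__releases and __acquires are once again attribute-style annotations. They do nothing at run time and are only used for static checking (by tools such as sparse).

# define __releases(x)	__attribute__((context(x,1,0)))
# define __acquires(x)	__attribute__((context(x,0,1)))

For details see: http://blog.chinaunix.net/uid-14528823-id-4284946.html
files_fdtable was covered above. The function checks whether the table needs to grow; normally it does not, and it returns immediately.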

Back to alloc_fd.

/*
	 * If we needed to expand the fs array we
	 * might have blocked - try again.
	 */
	if (error)
		goto repeat;

	if (start <= files->next_fd)
		files->next_fd = fd + 1;

	__set_open_fd(fd, fdt);
	if (flags & O_CLOEXEC)
		__set_close_on_exec(fd, fdt);
	else
		__clear_close_on_exec(fd, fdt);
	error = fd;
#if 1
	/* Sanity check */
	if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
		printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
		rcu_assign_pointer(fdt->fd[fd], NULL);
	}
#endif

out:
	spin_unlock(&files->file_lock);
	return error;

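The flow:
1 files->next_fd records the position right after the current one.
2 Call __set_open_fd, then look at O_CLOEXEC: call __set_close_on_exec if it is set and __clear_close_on_exec if it is not.
3 error = fd;
4 Check fdt->fd[fd], the slot where the file will finally be placed. I will not cover rcu_access_pointer and rcu_assign_pointer; read the first as if it were not there and the second as a plain assignment.
5 Unlock and return the value.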
Next, __set_open_fd. __clear_close_on_exec follows much the same idea, so I will not go through it.
__set_open_fd is defined as follows:

static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
{
	__set_bit(fd, fdt->open_fds);
	fd /= BITS_PER_LONG;
	if (!~fdt->open_fds[fd])
		__set_bit(fd, fdt->full_fds_bits);
}

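Overall, this function sets bit fd of open_fds to 1, divides fd by BITS_PER_LONG to find which block that bit belongs to, and checks whether the block is completely full (a bitwise NOT followed by a logical NOT is true only when every bit is 1). If it is, the bit for that block in full_fds_bits is set to 1.
__set_bit is defined as follows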

#define __set_bit(nr, addr)		bitop(___set_bit, nr, addr)

bitop is defined as follows

#define bitop(op, nr, addr)						\
	((__builtin_constant_p(nr) &&					\
	  __builtin_constant_p((uintptr_t)(addr) != (uintptr_t)NULL) &&	\
	  (uintptr_t)(addr) != (uintptr_t)NULL &&			\
	  __builtin_constant_p(*(const unsigned long *)(addr))) ?	\
	 const##op(nr, addr) : op(nr, addr))

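__builtin_constant_p is a GCC builtin. Like sizeof, it is evaluated by the compiler rather than at runtime, and it reports whether its argument is a compile-time constant. Here the macro checks that nr is constant, that addr is a constant non-NULL pointer, and that *addr is constant. If all of that holds it uses the constant variant of the operation (const##op), otherwise the ordinary one.
___set_bit and the related definitions are as follows: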

#define ___set_bit		arch___set_bit
#define arch___set_bit generic___set_bit
static __always_inline void
generic___set_bit(unsigned long nr, volatile unsigned long *addr)
{
	unsigned long mask = BIT_MASK(nr);
	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
	*p  |= mask;
}
#define BIT_MASK(nr)		(UL(1) << ((nr) % BITS_PER_LONG))
#define BIT_WORD(nr)		((nr) / BITS_PER_LONG)

#define UL(x)		(_UL(x))
#define _UL(x)		(_AC(x, UL))
#define __AC(X,Y)	(X##Y)

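BIT_WORD is easy: it picks the word. BIT_MASK is slightly more involved. The UL macro is there to fix the type, e.g. UL(1) becomes 1UL so that the constant is an unsigned long, which is then shifted left by nr % BITS_PER_LONG. *p |= mask; is just a plain bit operation.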

do_filp_open

The code:

struct file *do_filp_open(int dfd, struct filename *pathname,
		const struct open_flags *op)
{
	struct nameidata nd;
	int flags = op->lookup_flags;
	struct file *filp;

	set_nameidata(&nd, dfd, pathname, NULL);
	filp = path_openat(&nd, op, flags | LOOKUP_RCU);
	if (unlikely(filp == ERR_PTR(-ECHILD)))
		filp = path_openat(&nd, op, flags);
	if (unlikely(filp == ERR_PTR(-ESTALE)))
		filp = path_openat(&nd, op, flags | LOOKUP_REVAL);
	restore_nameidata();
	return filp;
}

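dfd really is persistent: it is still AT_FDCWD and still has not been stored in any structure. The main flow:
1 Call set_nameidata to fill in the members of nd
2 Call path_openat to open the file
3 If the returned filp is -ECHILD or -ESTALE, retry with different flags (first without LOOKUP_RCU, then with LOOKUP_REVAL). But does nd change between attempts?
4 Call restore_nameidata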
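
The retry ladder in step 3 can be sketched on its own. Everything here is hypothetical scaffolding: the SK_ flag values are invented (the kernel's differ), sk_walk_fn stands in for path_openat, and errors are plain negative errno values rather than ERR_PTR.

```c
#include <errno.h>

/* Hypothetical flag values for the sketch; the kernel's differ. */
#define SK_LOOKUP_RCU   0x1u
#define SK_LOOKUP_REVAL 0x2u

/* Hypothetical walker: returns 0 on success or a negative errno. */
typedef int (*sk_walk_fn)(unsigned flags, void *ctx);

/* Same shape as do_filp_open: fast attempt first, then two fallbacks. */
static int sk_open_with_fallback(sk_walk_fn walk, unsigned flags, void *ctx)
{
	int err = walk(flags | SK_LOOKUP_RCU, ctx);       /* lockless attempt */
	if (err == -ECHILD)
		err = walk(flags, ctx);                   /* retry without RCU */
	if (err == -ESTALE)
		err = walk(flags | SK_LOOKUP_REVAL, ctx); /* force revalidation */
	return err;
}

/* Example walker that refuses RCU mode and counts how often it is called. */
static int sk_rcu_averse_walk(unsigned flags, void *ctx)
{
	int *calls = ctx;
	(*calls)++;
	return (flags & SK_LOOKUP_RCU) ? -ECHILD : 0;
}
```

With sk_rcu_averse_walk the first attempt fails with -ECHILD and the second succeeds, so the walker runs exactly twice.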

set_nameidata

The source:

static inline void set_nameidata(struct nameidata *p, int dfd, struct filename *name,
			  const struct path *root)
{
	__set_nameidata(p, dfd, name);
	p->state = 0;
	if (unlikely(root)) {
		p->state = ND_ROOT_PRESET;
		p->root = *root;
	}
}

The flow:
1 Call __set_nameidata
2 Set the state member of the object p points to to 0
3 The root argument is NULL here, so the if branch is not taken

__set_nameidata

The source of __set_nameidata:

static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name)
{
	struct nameidata *old = current->nameidata;
	p->stack = p->internal;
	p->depth = 0;
	p->dfd = dfd;
	p->name = name;
	p->path.mnt = NULL;
	p->path.dentry = NULL;
	p->total_link_count = old ? old->total_link_count : 0;
	p->saved = old;
	current->nameidata = p;
}

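The basic flow:
1 Save the current process's nameidata pointer in old
2 Point the stack member at the internal member. The nameidata structure is shown below: internal is an array of struct saved, and stack points at it for convenient access later. It is used when handling symbolic links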

#define EMBEDDED_LEVELS 2
struct nameidata {
	struct path	path;
	struct qstr	last;
	struct path	root;
	struct inode	*inode; /* path.dentry.d_inode */
	unsigned int	flags, state;
	unsigned	seq, next_seq, m_seq, r_seq;
	int		last_type;
	unsigned	depth;
	int		total_link_count;
	struct saved {
		struct path link;
		struct delayed_call done;
		const char *name;
		unsigned seq;
	} *stack, internal[EMBEDDED_LEVELS];
	struct filename	*name;
	struct nameidata *saved;
	unsigned	root_seq;
	int		dfd;
	vfsuid_t	dir_vfsuid;
	umode_t		dir_mode;
} __randomize_layout;

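3 Set depth to 0
4 Store dfd in the dfd member. At last dfd has a data structure to live in, so from here on it no longer needs its own parameter. Then set the name member; the name pointer is the one obtained from getname(filename);
5 Initialize the path member, which is going to be used a lot
6 Set total_link_count: if the current process already had a nameidata, carry its count over, otherwise start at 0
7 Store the current process's previous nameidata in saved, which forms a linked list
8 Replace the current process's nameidata with p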
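
Steps 6 to 8 amount to pushing onto a per-task stack. Here is a minimal sketch of that push together with the matching pop. sk_nameidata and sk_current are made-up stand-ins for struct nameidata and current->nameidata, and sk_restore reflects my understanding of restore_nameidata, which is not quoted in this article, so treat it as an assumption.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for struct nameidata. */
struct sk_nameidata {
	int total_link_count;
	struct sk_nameidata *saved;
};

/* Stand-in for current->nameidata. */
static struct sk_nameidata *sk_current;

/* Push p: inherit the link count and chain to the previous entry. */
static void sk_set(struct sk_nameidata *p)
{
	struct sk_nameidata *old = sk_current;

	p->total_link_count = old ? old->total_link_count : 0;
	p->saved = old;
	sk_current = p;
}

/* Pop: make the previous entry current again and hand the count back. */
static void sk_restore(void)
{
	struct sk_nameidata *now = sk_current, *old = now->saved;

	sk_current = old;
	if (old)
		old->total_link_count = now->total_link_count;
}
```

A nested lookup therefore sees the link count accumulated so far, and whatever it adds is carried back to the outer lookup when it finishes.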

path_openat

The code:

static struct file *path_openat(struct nameidata *nd,
			const struct open_flags *op, unsigned flags)
{
	struct file *file;
	int error;

	file = alloc_empty_file(op->open_flag, current_cred());
	if (IS_ERR(file))
		return file;

	if (unlikely(file->f_flags & __O_TMPFILE)) {
		error = do_tmpfile(nd, flags, op, file);
	} else if (unlikely(file->f_flags & O_PATH)) {
		error = do_o_path(nd, flags, file);
	} else {
		const char *s = path_init(nd, flags);
		while (!(error = link_path_walk(s, nd)) &&
		       (s = open_last_lookups(nd, file, op)) != NULL)
			;
		if (!error)
			error = do_open(nd, file, op);
		terminate_walk(nd);
	}
	if (likely(!error)) {
		if (likely(file->f_mode & FMODE_OPENED))
			return file;
		WARN_ON(1);
		error = -EINVAL;
	}
	fput(file);
	if (error == -EOPENSTALE) {
		if (flags & LOOKUP_RCU)
			error = -ECHILD;
		else
			error = -ESTALE;
	}
	return ERR_PTR(error);
}

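The overall flow:
1 Call alloc_empty_file to allocate a file structure
2 Pick how to open the file based on the flags. We only follow the last branch, the one that ends in do_open
3 If nothing went wrong, return the file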

current_cred()

current_cred() is defined as follows:

#define current_cred() \
	rcu_dereference_protected(current->cred, 1)

RCU again. Just treat it as returning current->cred.

alloc_empty_file

alloc_empty_file is defined as follows:

struct file *alloc_empty_file(int flags, const struct cred *cred)
{
	static long old_max;
	struct file *f;
	int error;

	/*
	 * Privileged users can go above max_files
	 */
	if (get_nr_files() >= files_stat.max_files && !capable(CAP_SYS_ADMIN)) {
		/*
		 * percpu_counters are inaccurate.  Do an expensive check before
		 * we go and fail.
		 */
		if (percpu_counter_sum_positive(&nr_files) >= files_stat.max_files)
			goto over;
	}

	f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
	if (unlikely(!f))
		return ERR_PTR(-ENOMEM);

	error = init_file(f, flags, cred);
	if (unlikely(error)) {
		kmem_cache_free(filp_cachep, f);
		return ERR_PTR(error);
	}

	percpu_counter_inc(&nr_files);

	return f;

over:
	/* Ran out of filps - report that */
	if (get_nr_files() > old_max) {
		pr_info("VFS: file-max limit %lu reached\n", get_max_files());
		old_max = get_nr_files();
	}
	return ERR_PTR(-ENFILE);
}

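The overall flow:
1 Check whether the number of open files has reached the maximum while the caller lacks CAP_SYS_ADMIN; if so, and the more expensive exact count confirms it, return an error
2 Allocate memory for a file structure
3 Initialize the file structure
4 Increment the count of open files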

get_nr_files
static long get_nr_files(void)
{
	return percpu_counter_read_positive(&nr_files);
}

percpu_counter_read_positive is defined as follows:

static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
{
	/* Prevent reloads of fbc->count */
	s64 ret = READ_ONCE(fbc->count);

	if (ret >= 0)
		return ret;
	return 0;
}

READ_ONCE has a reference link earlier in this article.
This simply reads the count member.
nr_files is a global variable:

static struct percpu_counter nr_files __cacheline_aligned_in_smp;

__cacheline_aligned_in_smp is defined as follows:

#define __cacheline_aligned_in_smp					\
	__attribute__((__aligned__(INTERNODE_CACHE_BYTES)))		\
	__page_aligned_data

#define INTERNODE_CACHE_BYTES (1 << INTERNODE_CACHE_SHIFT)
#define INTERNODE_CACHE_SHIFT L1_CACHE_SHIFT
#define L1_CACHE_SHIFT		5

#define __page_aligned_data	__section(".data..page_aligned") __aligned(PAGE_SIZE)
# define PAGE_SIZE 4096
#define __section(section)              __attribute__((__section__(section)))

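__attribute__((__aligned__(INTERNODE_CACHE_BYTES))) aligns the data in memory. With the values shown above (L1_CACHE_SHIFT of 5) the address has to be a multiple of 32. On x86, L1_CACHE_SHIFT normally comes from the kernel configuration, so the actual alignment depends on the build.

__page_aligned_data places nr_files in the data section named ".data..page_aligned" (see the reference link).

My guess is that nr_files records how many files are currently open.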

capable

This gets into permissions (capabilities), so I won't go into it.

kmem_cache_zalloc

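This gets into memory management. All we need to know is that it allocates a zeroed block of memory for the file structure and returns a pointer to it.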

init_file

The code:

static int init_file(struct file *f, int flags, const struct cred *cred)
{
	int error;

	f->f_cred = get_cred(cred);
	error = security_file_alloc(f);
	if (unlikely(error)) {
		put_cred(f->f_cred);
		return error;
	}

	rwlock_init(&f->f_owner.lock);
	spin_lock_init(&f->f_lock);
	mutex_init(&f->f_pos_lock);
	f->f_flags = flags;
	f->f_mode = OPEN_FMODE(flags);
	/* f->f_version: 0 */

	/*
	 * We're SLAB_TYPESAFE_BY_RCU so initialize f_count last. While
	 * fget-rcu pattern users need to be able to handle spurious
	 * refcount bumps we should reinitialize the reused file first.
	 */
	atomic_long_set(&f->f_count, 1);
	return 0;
}

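The flow:
1 Set the f_cred member
2 Call security_file_alloc, the security-module hook for a newly allocated file
3 Initialize the read-write lock, the spinlock and the mutex
4 Set the f_flags member
5 Set the f_mode member
6 Set the f_count member to 1
Basically it just initializes the members.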

get_cred and the related definitions:

static inline const struct cred *get_cred(const struct cred *cred)
{
	return get_cred_many(cred, 1);
}
static inline const struct cred *get_cred_many(const struct cred *cred, int nr)
{
	struct cred *nonconst_cred = (struct cred *) cred;
	if (!cred)
		return cred;
	nonconst_cred->non_rcu = 0;
	return get_new_cred_many(nonconst_cred, nr);
}
static inline struct cred *get_new_cred_many(struct cred *cred, int nr)
{
	atomic_long_add(nr, &cred->usage);
	return cred;
}

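atomic_long_add is an atomic operation; all we need to know is that it adds nr, here 1, to usage.
So after this long detour, all it does is add 1 to usage.
security_file_alloc belongs to the security subsystem and is not explored here.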

OPEN_FMODE is defined as follows:

#define OPEN_FMODE(flag) ((__force fmode_t)(((flag + 1) & O_ACCMODE) | \
					    (flag & __FMODE_NONOTIFY)))
#define O_ACCMODE	00000003
#define O_WRONLY	00000001
#define __FMODE_NONOTIFY	((__force int) FMODE_NONOTIFY)
#define FMODE_NONOTIFY		((__force fmode_t)(1 << 26))

We passed O_WRONLY, which is 1, so the result is (1 + 1) & 3 = 2.
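
The "+ 1" trick is easy to check for all three access modes. The sketch below is a user-space copy of just the access-mode part of OPEN_FMODE; the SK_ names are mine, the O_ values are the ones quoted above plus O_RDONLY = 0 and O_RDWR = 2, and SK_FMODE_READ / SK_FMODE_WRITE are assumed to be 1 and 2 as in the kernel's FMODE_READ / FMODE_WRITE. The __FMODE_NONOTIFY part is left out.

```c
#define SK_O_ACCMODE 00000003
#define SK_O_RDONLY  00000000
#define SK_O_WRONLY  00000001
#define SK_O_RDWR    00000002

/* Assumed to match the kernel's FMODE_READ / FMODE_WRITE bits. */
#define SK_FMODE_READ  0x1u
#define SK_FMODE_WRITE 0x2u

/* Access-mode half of OPEN_FMODE: 0/1/2 becomes READ/WRITE/READ|WRITE. */
static unsigned sk_open_fmode(int flag)
{
	return (unsigned)((flag + 1) & SK_O_ACCMODE);
}
```

Adding 1 maps the 0, 1, 2 encoding of the open flags onto a two-bit mask in which the read bit and the write bit can be tested independently.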

path_init

The code:

/* must be paired with terminate_walk() */
static const char *path_init(struct nameidata *nd, unsigned flags)
{
	int error;
	const char *s = nd->name->name;

	/* LOOKUP_CACHED requires RCU, ask caller to retry */
	if ((flags & (LOOKUP_RCU | LOOKUP_CACHED)) == LOOKUP_CACHED)
		return ERR_PTR(-EAGAIN);

	if (!*s)
		flags &= ~LOOKUP_RCU;
	if (flags & LOOKUP_RCU)
		rcu_read_lock();
	else
		nd->seq = nd->next_seq = 0;

	nd->flags = flags;
	nd->state |= ND_JUMPED;

	nd->m_seq = __read_seqcount_begin(&mount_lock.seqcount);
	nd->r_seq = __read_seqcount_begin(&rename_lock.seqcount);
	smp_rmb();

	if (nd->state & ND_ROOT_PRESET) {
		struct dentry *root = nd->root.dentry;
		struct inode *inode = root->d_inode;
		if (*s && unlikely(!d_can_lookup(root)))
			return ERR_PTR(-ENOTDIR);
		nd->path = nd->root;
		nd->inode = inode;
		if (flags & LOOKUP_RCU) {
			nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
			nd->root_seq = nd->seq;
		} else {
			path_get(&nd->path);
		}
		return s;
	}

	nd->root.mnt = NULL;

	/* Absolute pathname -- fetch the root (LOOKUP_IN_ROOT uses nd->dfd). */
	if (*s == '/' && !(flags & LOOKUP_IN_ROOT)) {
		error = nd_jump_root(nd);
		if (unlikely(error))
			return ERR_PTR(error);
		return s;
	}

	/* Relative pathname -- get the starting-point it is relative to. */
	if (nd->dfd == AT_FDCWD) {
		if (flags & LOOKUP_RCU) {
			struct fs_struct *fs = current->fs;
			unsigned seq;

			do {
				seq = read_seqcount_begin(&fs->seq);
				nd->path = fs->pwd;
				nd->inode = nd->path.dentry->d_inode;
				nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
			} while (read_seqcount_retry(&fs->seq, seq));
		} else {
			get_fs_pwd(current->fs, &nd->path);
			nd->inode = nd->path.dentry->d_inode;
		}
	} else {
		/* Caller must check execute permissions on the starting path component */
		struct fd f = fdget_raw(nd->dfd);
		struct dentry *dentry;

		if (!f.file)
			return ERR_PTR(-EBADF);

		if (flags & LOOKUP_LINKAT_EMPTY) {
			if (f.file->f_cred != current_cred() &&
			    !ns_capable(f.file->f_cred->user_ns, CAP_DAC_READ_SEARCH)) {
				fdput(f);
				return ERR_PTR(-ENOENT);
			}
		}

		dentry = f.file->f_path.dentry;

		if (*s && unlikely(!d_can_lookup(dentry))) {
			fdput(f);
			return ERR_PTR(-ENOTDIR);
		}

		nd->path = f.file->f_path;
		if (flags & LOOKUP_RCU) {
			nd->inode = nd->path.dentry->d_inode;
			nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
		} else {
			path_get(&nd->path);
			nd->inode = nd->path.dentry->d_inode;
		}
		fdput(f);
	}

	/* For scoped-lookups we need to set the root to the dirfd as well. */
	if (flags & LOOKUP_IS_SCOPED) {
		nd->root = nd->path;
		if (flags & LOOKUP_RCU) {
			nd->root_seq = nd->seq;
		} else {
			path_get(&nd->root);
			nd->state |= ND_ROOT_GRABBED;
		}
	}
	return s;
}

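The overall flow:
1 Check whether flags has LOOKUP_CACHED set without LOOKUP_RCU; if so return -EAGAIN
2 Check whether the path name is empty; if it is, clear the LOOKUP_RCU flag
3 Check whether flags has LOOKUP_RCU set (that is the case we follow); if so call rcu_read_lock()
4 Set the flags and state members
5 Begin read sections on the two sequence counters, mount_lock and rename_lock
6 Call smp_rmb, a read memory barrier that keeps the reads ordered on multiprocessor systems
7 Check whether state has ND_ROOT_PRESET set. It certainly does not in our case, because root was passed as NULL in the call: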

set_nameidata(&nd, dfd, pathname, NULL);

static inline void set_nameidata(struct nameidata *p, int dfd, struct filename *name,
			  const struct path *root)
{
	__set_nameidata(p, dfd, name);
	p->state = 0;
	if (unlikely(root)) {
		p->state = ND_ROOT_PRESET;
		p->root = *root;
	}
}

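8 Set root.mnt to NULL
9 Check whether the path is absolute. Our case is a relative path, so this is skipped
10 Check whether dfd is AT_FDCWD. It is, so we enter this branch
11 Check whether flags has LOOKUP_RCU set. It does
12 Get the fs member of the current process
13 Call read_seqcount_begin to start a read of the data in fs
14 Set the path, inode and seq members. Oddly, the name found in path is only that of the parent level. Why? My blind guess is that it has to do with mounts. The evidence:

The first screenshot showed the working directory of the process in the kernel being debugged.

The second showed the value of the name member inside the path member of nameidata. It is not yet clear which inode this is; I'll leave that for later observation.

15 Call read_seqcount_retry. I think it checks whether the data was written after the read began and, if so, the loop reads it again; otherwise why put it in a while?
16 Check whether flags has LOOKUP_IS_SCOPED set. We don't discuss that case
In short, this function sets the path and inode members of nd. The functions it calls belong to RCU and SMP and are not discussed here.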

link_path_walk

The code:

/*
 * Name resolution.
 * This is the basic name resolution function, turning a pathname into
 * the final dentry. We expect 'base' to be positive and a directory.
 *
 * Returns 0 and nd will have valid dentry and mnt on success.
 * Returns error and drops reference to input namei data on failure.
 */
static int link_path_walk(const char *name, struct nameidata *nd)
{
	int depth = 0; // depth <= nd->depth
	int err;

	nd->last_type = LAST_ROOT;
	nd->flags |= LOOKUP_PARENT;
	if (IS_ERR(name))
		return PTR_ERR(name);
	while (*name=='/')
		name++;
	if (!*name) {
		nd->dir_mode = 0; // short-circuit the 'hardening' idiocy
		return 0;
	}

	/* At this point we know we have a real path component. */
	for(;;) {
		struct mnt_idmap *idmap;
		const char *link;
		u64 hash_len;
		int type;

		idmap = mnt_idmap(nd->path.mnt);
		err = may_lookup(idmap, nd);
		if (err)
			return err;

		hash_len = hash_name(nd->path.dentry, name);

		type = LAST_NORM;
		if (name[0] == '.') switch (hashlen_len(hash_len)) {
			case 2:
				if (name[1] == '.') {
					type = LAST_DOTDOT;
					nd->state |= ND_JUMPED;
				}
				break;
			case 1:
				type = LAST_DOT;
		}
		if (likely(type == LAST_NORM)) {
			struct dentry *parent = nd->path.dentry;
			nd->state &= ~ND_JUMPED;
			if (unlikely(parent->d_flags & DCACHE_OP_HASH)) {
				struct qstr this = { { .hash_len = hash_len }, .name = name };
				err = parent->d_op->d_hash(parent, &this);
				if (err < 0)
					return err;
				hash_len = this.hash_len;
				name = this.name;
			}
		}

		nd->last.hash_len = hash_len;
		nd->last.name = name;
		nd->last_type = type;

		name += hashlen_len(hash_len);
		if (!*name)
			goto OK;
		/*
		 * If it wasn't NUL, we know it was '/'. Skip that
		 * slash, and continue until no more slashes.
		 */
		do {
			name++;
		} while (unlikely(*name == '/'));
		if (unlikely(!*name)) {
OK:
			/* pathname or trailing symlink, done */
			if (!depth) {
				nd->dir_vfsuid = i_uid_into_vfsuid(idmap, nd->inode);
				nd->dir_mode = nd->inode->i_mode;
				nd->flags &= ~LOOKUP_PARENT;
				return 0;
			}
			/* last component of nested symlink */
			name = nd->stack[--depth].name;
			link = walk_component(nd, 0);
		} else {
			/* not the last component */
			link = walk_component(nd, WALK_MORE);
		}
		if (unlikely(link)) {
			if (IS_ERR(link))
				return PTR_ERR(link);
			/* a symlink to follow */
			nd->stack[depth++].name = name;
			name = link;
			continue;
		}
		if (unlikely(!d_can_lookup(nd->path.dentry))) {
			if (nd->flags & LOOKUP_RCU) {
				if (!try_to_unlazy(nd))
					return -ECHILD;
			}
			return -ENOTDIR;
		}
	}
}

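From the comment above, this function's job is to resolve the path name down to the final dentry.
The overall flow:
1 Initialize last_type to LAST_ROOT. last_type records what kind of name the current component is; for the path a/b/c/d, the current component in successive iterations is a, b, c, d
LAST_ROOT is defined as follows: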

enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};

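The names say it all: the enumerators mean normal, root, dot and dot-dot. Those really are the only four kinds.
2 Add LOOKUP_PARENT to flags. What is this for?
3 Skip any leading '/' characters in the path
4 Enter the loop
5 Call may_lookup to check permissions
6 Call hash_name to get a value made up of the hash and the length of the current path component (for dir1/dir2 the components are dir1 and dir2)
7 Set the type of the current component to LAST_NORM, the ordinary case by default, which makes sense
8 Check whether it is . or .. We skip those and only discuss the ordinary case
9 If the type is LAST_NORM, clear ND_JUMPED in state
10 Set the last member
11 Skip past this component (dir1/dir2 becomes /dir2)
12 Skip the '/' characters
13 Check whether the path name has ended
14 If it has, set dir_vfsuid and dir_mode, clear LOOKUP_PARENT in flags, and return
15 If not, call walk_component to look up the dentry based on nd and store it in the path member. If the component is a symlink, the link's path is returned; we ignore that case for now
16 Check link
17 Check that the result is a directory we can look things up in (d_can_lookup)

Overall: compute a hash from the dentry and the component name, use the hash and name to find the next dentry, and repeat until we reach the dentry of the parent of the last component.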
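
The string-handling skeleton of that loop (skip slashes, measure a component, classify it) can be sketched without any dentries. sk_split and its types are invented for illustration. Unlike link_path_walk, which stops with the final component left in nd->last, this sketch simply records every component, and it uses strcspn where the kernel uses hash_name to find the length.

```c
#include <stddef.h>
#include <string.h>

enum sk_last { SK_NORM, SK_DOT, SK_DOTDOT };

/* One path component: pointer into the original string plus its length. */
struct sk_component {
	const char *name;
	size_t len;
	enum sk_last type;
};

/*
 * Split path into components. Returns the number of components found
 * and stores at most max of them in out.
 */
static size_t sk_split(const char *path, struct sk_component *out, size_t max)
{
	size_t n = 0;

	while (*path == '/')            /* skip leading slashes */
		path++;
	while (*path) {
		size_t len = strcspn(path, "/");
		enum sk_last type = SK_NORM;

		if (path[0] == '.') {   /* "." and ".." are special */
			if (len == 1)
				type = SK_DOT;
			else if (len == 2 && path[1] == '.')
				type = SK_DOTDOT;
		}
		if (n < max) {
			out[n].name = path;
			out[n].len = len;
			out[n].type = type;
		}
		n++;
		path += len;            /* step over the component */
		while (*path == '/')    /* and over any run of slashes */
			path++;
	}
	return n;
}
```

For the scenario in this article, "dirTest/ppshuoTest" splits into dirTest (length 7) and ppshuoTest (length 10), both of the normal kind.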

hash_name
static inline u64 hash_name(const void *salt, const char *name)
{
	unsigned long a = 0, b, x = 0, y = (unsigned long)salt;
	unsigned long adata, bdata, mask, len;
	const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;

	len = 0;
	goto inside;

	do {
		HASH_MIX(x, y, a);
		len += sizeof(unsigned long);
inside:
		a = load_unaligned_zeropad(name+len);
		b = a ^ REPEAT_BYTE('/');
	} while (!(has_zero(a, &adata, &constants) | has_zero(b, &bdata, &constants)));

	adata = prep_zero_mask(a, adata, &constants);
	bdata = prep_zero_mask(b, bdata, &constants);
	mask = create_zero_mask(adata | bdata);
	x ^= a & zero_bytemask(mask);

	return hashlen_create(fold_hash(x, y), len + find_zero(mask));
}

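The salt parameter is nd->path.dentry, and name is the path name.
The overall flow is fairly simple:
1 Read the name 8 characters (one unsigned long) at a time
2 Stop looping once those 8 characters contain a zero byte or a '/'
3 Compute the length of this path component, and compute the hash from the values accumulated along the way
4 Pack the hash and the length into a single u64

I won't cover the hash computation because I don't understand it. The focus is on how the length is computed.
The hard part is the following helpers.
WORD_AT_A_TIME_CONSTANTS is defined as follows: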

#define WORD_AT_A_TIME_CONSTANTS { REPEAT_BYTE(0x01), REPEAT_BYTE(0x80) }

REPEAT_BYTE is defined as follows:

#define REPEAT_BYTE(x)	((~0ul / 0xff) * (x))

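REPEAT_BYTE(x) evaluates to x times 0x0101010101010101, i.e. the byte x repeated in all eight bytes
So one_bits in constants is 0x0101010101010101 and high_bits is 0x0101010101010101*0x80 = 0x8080808080808080
They prepare the ground for the filtering below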

HASH_MIX is defined as follows:

#define HASH_MIX(x, y, a)	\
	(	x ^= (a),	\
	y ^= x,	x = rol64(x,12),\
	x += y,	y = rol64(y,45),\
	y *= 9			)

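The flow: x is XORed with a and the result stored in x; y is XORed with x and stored in y; x is rotated left by 12 via rol64;
y is added to x and stored in x; y is rotated left by 45 via rol64; finally y is multiplied by 9.
I did not really follow this sequence. Perhaps it is a mixing step that spreads the hash values more evenly?
rol64 is defined as follows: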

static inline __u64 rol64(__u64 word, unsigned int shift)
{
	return (word << (shift & 63)) | (word >> ((-shift) & 63));
}

This one is easy to understand: it rotates word left by shift bits. For example, rol64(0x1234567890ABCDEF, 12) gives 0x4567890ABCDEF123.
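
The rotation is simple enough to check in user space. sk_rol64 below is a copy of the kernel function with uint64_t in place of __u64; the name is mine.

```c
#include <stdint.h>

/* Rotate word left by shift bits; bits shifted out on the left re-enter on the right. */
static uint64_t sk_rol64(uint64_t word, unsigned int shift)
{
	return (word << (shift & 63)) | (word >> ((-shift) & 63));
}
```

Rotating by 12 bits moves the top three hex digits to the bottom. The & 63 masks keep both shift counts in range, so a shift of 0 returns the word unchanged instead of shifting by 64.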

load_unaligned_zeropad is defined as follows:

/*
 * Load an unaligned word from kernel space.
 *
 * In the (very unlikely) case of the word being a page-crosser
 * and the next page not being mapped, take the exception and
 * return zeroes in the non-existing part.
 */
static inline unsigned long load_unaligned_zeropad(const void *addr)
{
	unsigned long ret;

	asm volatile(
		"1:	mov %[mem], %[ret]\n"
		"2:\n"
		_ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_ZEROPAD)
		: [ret] "=r" (ret)
		: [mem] "m" (*(unsigned long *)addr));

	return ret;
}

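Hmm... as far as I can tell, it loads the 8 bytes starting at the address addr points to. I am not sure about the details of _ASM_EXTABLE_TYPE. Judging by the comment above the function, it registers an exception-table entry so that, if the load faults because the word crosses into an unmapped page, the fault is fixed up and the missing part reads as zeroes. If you know more, please tell me in the comments; I would be grateful.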

has_zero is defined as follows:

/* Return nonzero if it has a zero */
static inline unsigned long has_zero(unsigned long a, unsigned long *bits, const struct word_at_a_time *c)
{
	unsigned long mask = ((a - c->one_bits) & ~a) & c->high_bits;
	*bits = mask;
	return mask;
}

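This checks whether a contains a zero byte. To detect '/', the caller first XORs the word with REPEAT_BYTE('/'), which turns every '/' byte into zero (that is b in hash_name). Each matching byte is marked with 0x80 and the other bytes are cleared, so every byte of the result is either 0x80 or 0. For a detailed analysis see the reference link, which is very well written.
Example:
In the debugger, the three low bytes of a are 'q', 'w', 'e' and the next byte is 0, which means the path has been fully consumed. The bytes after that carry no meaning.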
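
The "qwe" case can be reproduced in user space. This is a sketch under a few assumptions: the SK_ names are mine, uint64_t replaces unsigned long, and sk_load_le builds the word byte by byte (first character in the lowest byte, as on little-endian x86) instead of loading raw memory the way load_unaligned_zeropad does.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_REPEAT_BYTE(x) ((~UINT64_C(0) / 0xff) * (uint64_t)(x))
#define SK_ONE_BITS  SK_REPEAT_BYTE(0x01)
#define SK_HIGH_BITS SK_REPEAT_BYTE(0x80)

/* Nonzero if a contains a zero byte; the lowest zero byte is marked 0x80. */
static uint64_t sk_has_zero(uint64_t a)
{
	return ((a - SK_ONE_BITS) & ~a) & SK_HIGH_BITS;
}

/* Build a word from up to 8 characters of s, first character lowest. */
static uint64_t sk_load_le(const char *s)
{
	uint64_t w = 0;
	size_t n = strlen(s);

	for (size_t i = 0; i < n && i < 8; i++)
		w |= (uint64_t)(unsigned char)s[i] << (8 * i);
	return w;
}
```

For "qwe" the result is 0x8080808080000000: the three name bytes stay clear and every byte from the terminator upward is marked. XORing with SK_REPEAT_BYTE('/') first makes the same test find a slash instead.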

prep_zero_mask is defined as follows:

static inline unsigned long prep_zero_mask(unsigned long a, unsigned long bits, const struct word_at_a_time *c)
{
	return bits;
}

It simply returns bits.

create_zero_mask is defined as follows:

static inline unsigned long create_zero_mask(unsigned long bits)
{
	bits = (bits - 1) & ~bits;
	return bits >> 7;
}

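First, note that every byte of bits is either 0x80 or 0. This function turns all the bytes below the lowest 0x80 byte into 0xFF and clears everything else. For example, 0x8080808080000000 becomes 0xFFFFFF. Subtracting 1 turns all the low zero bytes into 0xFF up to the first 0x80, which becomes 0x7F, i.e. that byte has all bits set except the top one. ANDing with the inverse of the original value clears everything above (those bits were not affected by the subtraction) and keeps everything below. The final right shift by 7 removes the seven extra bits contributed by the 0x7F.
At this point the number of 0xFF bytes equals the number of remaining characters (not counting the 0 or '/').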
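
The worked example can be checked directly. sk_create_zero_mask is a user-space copy of the function above with uint64_t in place of unsigned long; the name is mine.

```c
#include <stdint.h>

/* Turn "0x80 marks" into a mask of 0xFF bytes below the lowest mark. */
static uint64_t sk_create_zero_mask(uint64_t bits)
{
	bits = (bits - 1) & ~bits; /* all bits below the lowest set bit */
	return bits >> 7;          /* drop the 7 extra bits from the 0x7F byte */
}
```

A mark in byte 3 gives three 0xFF bytes, a mark in byte 2 gives two, and a mark in byte 0 (an empty component) gives an all-zero mask.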

zero_bytemask is defined as follows:

#define zero_bytemask(mask) (mask)

Nothing to it.

fold_hash is defined as follows:

#define GOLDEN_RATIO_64 0x61C8864680B583EBull
static inline unsigned int fold_hash(unsigned long x, unsigned long y)
{
	y ^= x * GOLDEN_RATIO_64;
	y *= GOLDEN_RATIO_64;
	return y >> 32;
}

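A round of arithmetic. What area of knowledge is this? The name GOLDEN_RATIO_64 suggests multiplicative hashing with a golden-ratio constant, but I have not dug into the theory.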

find_zero is defined as follows:

static inline long count_masked_bytes(unsigned long mask)
{
	return mask*0x0001020304050608ul >> 56;
}
static inline unsigned long find_zero(unsigned long mask)
{
	return count_masked_bytes(mask);
}

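Note that mask consists of some number of consecutive 0xFF bytes starting from the lowest byte. This function returns how many 0xFF bytes mask has.
With a single 0xFF, 0x0001020304050608ul times 0xFF equals 0x01010101010101F8, whose top byte is 1, so after the right shift by 56 only 1 is left.
The same holds for two, and so on. The constant has an obvious pattern, 1 through 6 and then 8. Why is the last byte not 7? Because 0xffffffffffffff×0x0001020304050607ul is 0x06fefdfcfbfaf9f9 (truncated to 64 bits), which leaves 6 after the shift, whereas 0xffffffffffffff×0x0001020304050608ul is 0x07fefdfcfbfaf9f8. Neat.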
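
The multiplication trick can be verified for every possible length from 0 to 7. sk_count_masked_bytes is a user-space copy with uint64_t in place of unsigned long; the name is mine.

```c
#include <stdint.h>

/* Number of 0xFF bytes in a mask of the form 0x00..00FF..FF (0 to 7 bytes). */
static unsigned sk_count_masked_bytes(uint64_t mask)
{
	return (unsigned)(mask * UINT64_C(0x0001020304050608) >> 56);
}
```

The product wraps around modulo 2^64 and the count ends up in the top byte, which the shift by 56 then extracts.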
hashlen_create is defined as follows:

#define hashlen_create(hash, len) ((u64)(len)<<32 | (u32)(hash))

The length takes the upper 32 bits and the hash the lower 32; nothing more to say.

hashlen_len

hashlen_len is defined as follows:

#define hashlen_len(hashlen)  ((u32)((hashlen) >> 32))

Simple: it extracts len.
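
A quick round trip shows the packing. The sk_ macros are user-space copies with stdint types. sk_hashlen_hash mirrors hashlen_hash, which is used in __d_lookup_rcu but whose definition is not quoted in this article, so its exact form here is my assumption.

```c
#include <stdint.h>

/* Length in the upper 32 bits, hash in the lower 32 bits. */
#define sk_hashlen_create(hash, len) ((uint64_t)(len) << 32 | (uint32_t)(hash))
#define sk_hashlen_len(hashlen)      ((uint32_t)((hashlen) >> 32))
/* Assumed counterpart of hashlen_len: keep only the low 32 bits. */
#define sk_hashlen_hash(hashlen)     ((uint32_t)(hashlen))

/* Helper so the round trip can be exercised through a function call. */
static uint64_t sk_pack(uint32_t hash, uint32_t len)
{
	return sk_hashlen_create(hash, len);
}
```

Packing both values into one u64 is what lets __d_lookup_rcu compare hash and length with a single integer comparison (dentry->d_name.hash_len != hashlen).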

walk_component

walk_component is defined as follows:

static const char *walk_component(struct nameidata *nd, int flags)
{
	struct dentry *dentry;
	/*
	 * "." and ".." are special - ".." especially so because it has
	 * to be able to know about the current root directory and
	 * parent relationships.
	 */
	if (unlikely(nd->last_type != LAST_NORM)) {
		if (!(flags & WALK_MORE) && nd->depth)
			put_link(nd);
		return handle_dots(nd, nd->last_type);
	}
	dentry = lookup_fast(nd);
	if (IS_ERR(dentry))
		return ERR_CAST(dentry);
	if (unlikely(!dentry)) {
		dentry = lookup_slow(&nd->last, nd->path.dentry, nd->flags);
		if (IS_ERR(dentry))
			return ERR_CAST(dentry);
	}
	if (!(flags & WALK_MORE) && nd->depth)
		put_link(nd);
	return step_into(nd, flags, dentry);
}

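The overall flow:
1 Check whether the current component is . or ..; if so handle it separately
2 Call lookup_fast to look up the dentry based on nd
3 If that finds nothing, call lookup_slow (here we only cover the case where lookup_fast succeeds)
4 If WALK_MORE is not set in flags and depth is nonzero, call put_link
5 Call step_into: if the result is a symlink it returns the link target, otherwise NULL

lookup_fast is defined as follows: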

static struct dentry *lookup_fast(struct nameidata *nd)
{
	struct dentry *dentry, *parent = nd->path.dentry;
	int status = 1;

	/*
	 * Rename seqlock is not required here because in the off chance
	 * of a false negative due to a concurrent rename, the caller is
	 * going to fall back to non-racy lookup.
	 */
	if (nd->flags & LOOKUP_RCU) {
		dentry = __d_lookup_rcu(parent, &nd->last, &nd->next_seq);
		if (unlikely(!dentry)) {
			if (!try_to_unlazy(nd))
				return ERR_PTR(-ECHILD);
			return NULL;
		}

		/*
		 * This sequence count validates that the parent had no
		 * changes while we did the lookup of the dentry above.
		 */
		if (read_seqcount_retry(&parent->d_seq, nd->seq))
			return ERR_PTR(-ECHILD);

		status = d_revalidate(dentry, nd->flags);
		if (likely(status > 0))
			return dentry;
		if (!try_to_unlazy_next(nd, dentry))
			return ERR_PTR(-ECHILD);
		if (status == -ECHILD)
			/* we'd been told to redo it in non-rcu mode */
			status = d_revalidate(dentry, nd->flags);
	} else {
		dentry = __d_lookup(parent, &nd->last);
		if (unlikely(!dentry))
			return NULL;
		status = d_revalidate(dentry, nd->flags);
	}
	if (unlikely(status <= 0)) {
		if (!status)
			d_invalidate(dentry);
		dput(dentry);
		return ERR_PTR(status);
	}
	return dentry;
}

The overall flow:
1 parent is set to nd->path.dentry (the path member of nd was set in path_init)
2 Check whether nd's flags has LOOKUP_RCU set. It does, because of this line:

filp = path_openat(&nd, op, flags | LOOKUP_RCU);

3 Call __d_lookup_rcu to look up the dentry
4 Call d_revalidate to revalidate it
5 Check the returned status; if it is fine, return the dentry

__d_lookup_rcu is defined as follows:

struct dentry *__d_lookup_rcu(const struct dentry *parent,
				const struct qstr *name,
				unsigned *seqp)
{
	u64 hashlen = name->hash_len;
	const unsigned char *str = name->name;
	struct hlist_bl_head *b = d_hash(hashlen_hash(hashlen));
	struct hlist_bl_node *node;
	struct dentry *dentry;
	if (unlikely(parent->d_flags & DCACHE_OP_COMPARE))
		return __d_lookup_rcu_op_compare(parent, name, seqp);
		
	hlist_bl_for_each_entry_rcu(dentry, node, b, d_hash) {
		unsigned seq;
		seq = raw_seqcount_begin(&dentry->d_seq);
		if (dentry->d_parent != parent)
			continue;
		if (d_unhashed(dentry))
			continue;
		if (dentry->d_name.hash_len != hashlen)
			continue;
		if (dentry_cmp(dentry, str, hashlen_len(hashlen)) != 0)
			continue;
		*seqp = seq;
		return dentry;
	}
	return NULL;
}

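The overall flow:
1 Extract the hash from hashlen
2 Call d_hash to get the hlist_bl_head list head b
3 Walk the list and find the dentry by comparing parent, hash_len and name

d_hash is defined as follows: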

static unsigned int d_hash_shift __ro_after_init;
static struct hlist_bl_head *dentry_hashtable __ro_after_init;
static inline struct hlist_bl_head *d_hash(unsigned int hash)
{
	return dentry_hashtable + (hash >> d_hash_shift);
}

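The flow:
1 Shift the hash right by d_hash_shift bits
2 Use the shifted value as an index into dentry_hashtable; each slot of the hash table is itself the head of a list (a bucket)

__ro_after_init means the variable cannot be modified once initialization is done.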
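
Shifting right means the bucket is chosen by the top bits of the hash. Here is a tiny sketch under an assumption of mine: the table has 2^SK_HASH_BITS buckets and the shift is 32 minus that number of bits. How the kernel actually computes d_hash_shift is not shown in this article.

```c
#include <stdint.h>

#define SK_HASH_BITS 4 /* hypothetical: a table with 16 buckets */

/* Pick a bucket from the top SK_HASH_BITS bits of a 32-bit hash. */
static uint32_t sk_bucket_index(uint32_t hash)
{
	return hash >> (32 - SK_HASH_BITS);
}
```

Two hashes that differ only in their low bits land in the same bucket, which is why the loop in __d_lookup_rcu still has to compare hash_len and the name.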

hlist_bl_for_each_entry_rcu is defined as follows:

#define hlist_bl_for_each_entry_rcu(tpos, pos, head, member)		\
	for (pos = hlist_bl_first_rcu(head);				\
		pos &&							\
		({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1; }); \
		pos = rcu_dereference_raw(pos->next))

The macro expands to:

for (node = hlist_bl_first_rcu(b); node && ({ dentry = hlist_bl_entry(node, typeof(*dentry), d_hash); 1; }); node = rcu_dereference_raw(node->next));

The relevant declarations:

struct hlist_bl_head *b = d_hash(hashlen_hash(hashlen));
struct hlist_bl_node *node;
struct dentry *dentry;
struct dentry {
...
	struct hlist_bl_node d_hash;	/* lookup hash list */
...
}

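The overall flow:
1 Get the first node of the list
2 Check whether the node is NULL
3 If it is not, compute from the node's address the dentry that contains it
4 After each iteration, move to the next node

hlist_bl_first_rcu and rcu_dereference_raw both involve RCU and are not covered here.

hlist_bl_entry and the related macros are defined as follows: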

#define hlist_bl_entry(ptr, type, member) container_of(ptr,type,member)
#define container_of(ptr, type, member) ({				\
	void *__mptr = (void *)(ptr);					\
	static_assert(__same_type(*(ptr), ((type *)0)->member) ||	\
		      __same_type(*(ptr), void),			\
		      "pointer type mismatch in container_of()");	\
	((type *)(__mptr - offsetof(type, member))); })

#define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
#define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
#define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))

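hlist_bl_entry is just a thin wrapper around container_of.
container_of works as follows:
1 Convert ptr to a void pointer, __mptr
2 Check at compile time that the type ptr points to is compatible with the type of member, or else that ptr points to void; if neither holds, compilation fails
3 Compute the type pointer by subtracting the member's offset
static_assert here is a kernel macro that wraps C11's _Static_assert (C++11 has a static_assert of the same flavor); see the reference link
In short, the condition is evaluated at compile time and, if it is 0, an error message is reported
__builtin_types_compatible_p is a GCC builtin that tells whether two types are compatible, which can be read loosely as "the same"; see the reference link for details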
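
Step 3 is the heart of it and works fine in user space. The sketch below is a simplified container_of without the compile-time type check. sk_node and sk_dentry are made-up miniature versions of hlist_bl_node and dentry.

```c
#include <stddef.h>

/* Simplified container_of: no type check, just the pointer arithmetic. */
#define sk_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct sk_node {
	struct sk_node *next;
};

/* Hypothetical miniature dentry with an embedded list node. */
struct sk_dentry {
	int id;
	struct sk_node d_hash;
};

/* Recover the enclosing sk_dentry from a pointer to its d_hash member. */
static struct sk_dentry *sk_dentry_of(struct sk_node *n)
{
	return sk_container_of(n, struct sk_dentry, d_hash);
}
```

This is why the hash list can be made of bare nodes: each node is embedded in a dentry, and the dentry is recovered by subtracting the member's offset.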

The dentry_cmp call in __d_lookup_rcu above does its comparison through dentry_string_cmp, which is defined as follows:

/*
 * NOTE! 'cs' and 'scount' come from a dentry, so it has a
 * aligned allocation for this particular component. We don't
 * strictly need the load_unaligned_zeropad() safety, but it
 * doesn't hurt either.
 *
 * In contrast, 'ct' and 'tcount' can be from a pathname, and do
 * need the careful unaligned handling.
 */
static inline int dentry_string_cmp(const unsigned char *cs, const unsigned char *ct, unsigned tcount)
{
	unsigned long a,b,mask;

	for (;;) {
		a = read_word_at_a_time(cs);
		b = load_unaligned_zeropad(ct);
		if (tcount < sizeof(unsigned long))
			break;
		if (unlikely(a != b))
			return 1;
		cs += sizeof(unsigned long);
		ct += sizeof(unsigned long);
		tcount -= sizeof(unsigned long);
		if (!tcount)
			return 0;
	}
	mask = bytemask_from_count(tcount);
	return unlikely(!!((a ^ b) & mask));
}

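The comment explains why load_unaligned_zeropad is used to load b.
The flow is simple: inside the loop the strings are compared 8 bytes at a time. The remaining tail of fewer than 8 bytes is then compared using a bit mask. The function returns 0 if the strings match and 1 if they differ.

read_word_at_a_time is defined as follows: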

static inline bool kasan_check_read(const volatile void *p, unsigned int size)
{
	return true;
}

static __no_kasan_or_inline unsigned long read_word_at_a_time(const void *addr)
{
	kasan_check_read(addr, 1);
	return *(unsigned long *)addr;
}

It just casts the pointer and reads one word.

bytemask_from_count is defined as follows:

#define bytemask_from_count(cnt)	(~(~0ul << (cnt)*8))

It produces a mask whose lowest (cnt)*8 bits are all 1.
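
The mask and the tail comparison at the end of dentry_string_cmp can be tried in isolation. The sk_ names are mine and uint64_t replaces unsigned long; the sketch is only meaningful for cnt from 0 to 7, which is the range the tail comparison needs.

```c
#include <stdint.h>

/* Mask covering the lowest cnt bytes (valid for cnt in 0..7). */
#define sk_bytemask_from_count(cnt) (~(~UINT64_C(0) << (cnt) * 8))

/* 1 if a and b differ within their lowest cnt bytes, 0 otherwise. */
static int sk_tail_differs(uint64_t a, uint64_t b, unsigned cnt)
{
	return ((a ^ b) & sk_bytemask_from_count(cnt)) != 0;
}
```

Bytes above the mask are ignored, so whatever follows the name in memory (the next component, padding, or garbage) cannot affect the result.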

d_revalidate is defined as follows:

static inline int d_revalidate(struct dentry *dentry, unsigned int flags)
{
	if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE))
		return dentry->d_op->d_revalidate(dentry, flags);
	else
		return 1;
}

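The flow is simple: it checks whether dentry->d_flags has DCACHE_OP_REVALIDATE set and, if so, calls the function that dentry->d_op->d_revalidate points to; otherwise it returns 1.

step_into is defined as follows: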

static const char *step_into(struct nameidata *nd, int flags,
		     struct dentry *dentry)
{
	struct path path;
	struct inode *inode;
	int err = handle_mounts(nd, dentry, &path);

	if (err < 0)
		return ERR_PTR(err);
	inode = path.dentry->d_inode;
	if (likely(!d_is_symlink(path.dentry)) ||
	   ((flags & WALK_TRAILING) && !(nd->flags & LOOKUP_FOLLOW)) ||
	   (flags & WALK_NOFOLLOW)) {
		/* not a symlink or should not follow */
		if (nd->flags & LOOKUP_RCU) {
			if (read_seqcount_retry(&path.dentry->d_seq, nd->next_seq))
				return ERR_PTR(-ECHILD);
			if (unlikely(!inode))
				return ERR_PTR(-ENOENT);
		} else {
			dput(nd->path.dentry);
			if (nd->path.mnt != path.mnt)
				mntput(nd->path.mnt);
		}
		nd->path = path;
		nd->inode = inode;
		nd->seq = nd->next_seq;
		return NULL;
	}
	if (nd->flags & LOOKUP_RCU) {
		/* make sure that d_is_symlink above matches inode */
		if (read_seqcount_retry(&path.dentry->d_seq, nd->next_seq))
			return ERR_PTR(-ECHILD);
	} else {
		if (path.mnt == nd->path.mnt)
			mntget(path.mnt);
	}
	return pick_link(nd, &path, inode, flags);
}

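The overall flow:
1 handle_mounts fills in path
2 Set inode
3 Check whether the dentry is a symlink (we discuss the case where it is not)
4 Set nd's path and inode
5 Return NULL
The basis for this is what I saw while stepping through in the debugger.
handle_mounts is defined as follows: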


static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
			  struct path *path)
{
	bool jumped;
	int ret;

	path->mnt = nd->path.mnt;
	path->dentry = dentry;
	if (nd->flags & LOOKUP_RCU) {
		unsigned int seq = nd->next_seq;
		if (likely(__follow_mount_rcu(nd, path)))
			return 0;
		// *path and nd->next_seq might've been clobbered
		path->mnt = nd->path.mnt;
		path->dentry = dentry;
		nd->next_seq = seq;
		if (!try_to_unlazy_next(nd, dentry))
			return -ECHILD;
	}
	ret = traverse_mounts(path, &jumped, &nd->total_link_count, nd->flags);
	if (jumped) {
		if (unlikely(nd->flags & LOOKUP_NO_XDEV))
			ret = -EXDEV;
		else
			nd->state |= ND_JUMPED;
	}
	if (unlikely(ret)) {
		dput(path->dentry);
		if (path->mnt != nd->path.mnt)
			mntput(path->mnt);
	}
	return ret;
}

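This function presumably deals with mount points, but our case does not get that far; we only touch the first half.
The overall flow:
1 Fill in path
2 Check nd->flags
3 Call __follow_mount_rcu
4 Return 0

The evidence: in the debugger, handle_mounts returned straight to step_into once the if had been evaluated, so traverse_mounts further down was never reached.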

open_last_lookups
static const char *open_last_lookups(struct nameidata *nd,
		   struct file *file, const struct open_flags *op)
{
	struct dentry *dir = nd->path.dentry;
	int open_flag = op->open_flag;
	bool got_write = false;
	struct dentry *dentry;
	const char *res;

	nd->flags |= op->intent;

	if (nd->last_type != LAST_NORM) {
		if (nd->depth)
			put_link(nd);
		return handle_dots(nd, nd->last_type);
	}

	if (!(open_flag & O_CREAT)) {
		if (nd->last.name[nd->last.len])
			nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
		/* we _can_ be in RCU mode here */
		dentry = lookup_fast(nd);
		if (IS_ERR(dentry))
			return ERR_CAST(dentry);
		if (likely(dentry))
			goto finish_lookup;

		if (WARN_ON_ONCE(nd->flags & LOOKUP_RCU))
			return ERR_PTR(-ECHILD);
	} else {
		/* create side of things */
		if (nd->flags & LOOKUP_RCU) {
			if (!try_to_unlazy(nd))
				return ERR_PTR(-ECHILD);
		}
		audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
		/* trailing slashes? */
		if (unlikely(nd->last.name[nd->last.len]))
			return ERR_PTR(-EISDIR);
	}

	if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
		got_write = !mnt_want_write(nd->path.mnt);
		/*
		 * do _not_ fail yet - we might not need that or fail with
		 * a different error; let lookup_open() decide; we'll be
		 * dropping this one anyway.
		 */
	}
	if (open_flag & O_CREAT)
		inode_lock(dir->d_inode);
	else
		inode_lock_shared(dir->d_inode);
	dentry = lookup_open(nd, file, op, got_write);
	if (!IS_ERR(dentry)) {
		if (file->f_mode & FMODE_CREATED)
			fsnotify_create(dir->d_inode, dentry);
		if (file->f_mode & FMODE_OPENED)
			fsnotify_open(file);
	}
	if (open_flag & O_CREAT)
		inode_unlock(dir->d_inode);
	else
		inode_unlock_shared(dir->d_inode);

	if (got_write)
		mnt_drop_write(nd->path.mnt);

	if (IS_ERR(dentry))
		return ERR_CAST(dentry);

	if (file->f_mode & (FMODE_OPENED | FMODE_CREATED)) {
		dput(nd->path.dentry);
		nd->path.dentry = dentry;
		return NULL;
	}

finish_lookup:
	if (nd->depth)
		put_link(nd);
	res = step_into(nd, WALK_TRAILING, dentry);
	if (unlikely(res))
		nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
	return res;
}

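It looks long, but not much of it actually runs.
The overall flow:
1 OR op->intent into nd->flags. For a plain open, op->intent is LOOKUP_OPEN (it is 0 only when O_PATH is set, and O_CREAT adds further bits), so nd->flags gains LOOKUP_OPEN
2 Check that nd->last_type is a normal name rather than . or ..
3 Check whether the last path component is followed by a trailing '/'; if so it must be a directory, so LOOKUP_FOLLOW | LOOKUP_DIRECTORY is added
4 Look up the last path component to get its dentry
5 Check whether it was found
6 Jump to finish_lookup
7 Call step_into, which returns the link target if the last component is a symlink; normally that is NULL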

do_open

The code:

/*
 * Handle the last step of open()
 */
static int do_open(struct nameidata *nd,
		   struct file *file, const struct open_flags *op)
{
	struct mnt_idmap *idmap;
	int open_flag = op->open_flag;
	bool do_truncate;
	int acc_mode;
	int error;

	if (!(file->f_mode & (FMODE_OPENED | FMODE_CREATED))) {
		error = complete_walk(nd);
		if (error)
			return error;
	}
	if (!(file->f_mode & FMODE_CREATED))
		audit_inode(nd->name, nd->path.dentry, 0);
	idmap = mnt_idmap(nd->path.mnt);
	if (open_flag & O_CREAT) {
		if ((open_flag & O_EXCL) && !(file->f_mode & FMODE_CREATED))
			return -EEXIST;
		if (d_is_dir(nd->path.dentry))
			return -EISDIR;
		error = may_create_in_sticky(idmap, nd,
					     d_backing_inode(nd->path.dentry));
		if (unlikely(error))
			return error;
	}
	if ((nd->flags & LOOKUP_DIRECTORY) && !d_can_lookup(nd->path.dentry))
		return -ENOTDIR;

	do_truncate = false;
	acc_mode = op->acc_mode;
	if (file->f_mode & FMODE_CREATED) {
		/* Don't check for write permission, don't truncate */
		open_flag &= ~O_TRUNC;
		acc_mode = 0;
	} else if (d_is_reg(nd->path.dentry) && open_flag & O_TRUNC) {
		error = mnt_want_write(nd->path.mnt);
		if (error)
			return error;
		do_truncate = true;
	}
	error = may_open(idmap, &nd->path, acc_mode, open_flag);
	if (!error && !(file->f_mode & FMODE_OPENED))
		error = vfs_open(&nd->path, file);
	if (!error)
		error = security_file_post_open(file, op->acc_mode);
	if (!error && do_truncate)
		error = handle_truncate(idmap, file);
	if (unlikely(error > 0)) {
		WARN_ON(1);
		error = -EINVAL;
	}
	if (do_truncate)
		mnt_drop_write(nd->path.mnt);
	return error;
}

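As the comment says, this is the last step. Keep going!
The overall flow is:
1 First check whether file->f_mode has FMODE_OPENED or FMODE_CREATED set. In our scenario it only has FMODE_WRITE, so neither is set
2 Call complete_walk to finish the path walk and tidy up some of nd's state
3 Call audit_inode for the audit-related bookkeeping
4 Call mnt_idmap(nd->path.mnt); to get the idmap
5 Call may_open for the permission check (the O_CREAT and O_TRUNC branches before it are skipped, since neither flag is set)
6 Call vfs_open to open the file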

complete_walk

It is defined as follows

static int complete_walk(struct nameidata *nd)
{
	struct dentry *dentry = nd->path.dentry;
	int status;

	if (nd->flags & LOOKUP_RCU) {
		/*
		 * We don't want to zero nd->root for scoped-lookups or
		 * externally-managed nd->root.
		 */
		if (!(nd->state & ND_ROOT_PRESET))
			if (!(nd->flags & LOOKUP_IS_SCOPED))
				nd->root.mnt = NULL;
		nd->flags &= ~LOOKUP_CACHED;
		if (!try_to_unlazy(nd))
			return -ECHILD;
	}

	if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) {
		/*
		 * While the guarantee of LOOKUP_IS_SCOPED is (roughly) "don't
		 * ever step outside the root during lookup" and should already
		 * be guaranteed by the rest of namei, we want to avoid a namei
		 * BUG resulting in userspace being given a path that was not
		 * scoped within the root at some point during the lookup.
		 *
		 * So, do a final sanity-check to make sure that in the
		 * worst-case scenario (a complete bypass of LOOKUP_IS_SCOPED)
		 * we won't silently return an fd completely outside of the
		 * requested root to userspace.
		 *
		 * Userspace could move the path outside the root after this
		 * check, but as discussed elsewhere this is not a concern (the
		 * resolved file was inside the root at some point).
		 */
		if (!path_is_under(&nd->path, &nd->root))
			return -EXDEV;
	}

	if (likely(!(nd->state & ND_JUMPED)))
		return 0;

	if (likely(!(dentry->d_flags & DCACHE_OP_WEAK_REVALIDATE)))
		return 0;

	status = dentry->d_op->d_weak_revalidate(dentry, nd->flags);
	if (status > 0)
		return 0;

	if (!status)
		status = -ESTALE;

	return status;
}

The overall flow is:
1 Check whether nd->flags has LOOKUP_RCU set
2 If nd->state does not have ND_ROOT_PRESET set and nd->flags does not have LOOKUP_IS_SCOPED set, set nd->root.mnt to NULL
nd->state is set in set_nameidata and path_init

Debugging shows that nd->flags here is LOOKUP_OPEN | LOOKUP_FOLLOW | LOOKUP_RCU.
Roughly, the value of nd->flags is built up like this:

1 do_sys_open: open_how how = build_open_how(flags, mode);
2 do_sys_openat2: int fd = build_open_flags(how, &op); which contains
if (!(flags & O_NOFOLLOW))
		lookup_flags |= LOOKUP_FOLLOW;   Note: this is where LOOKUP_FOLLOW is set
3 do_filp_open: int flags = op->lookup_flags; filp = path_openat(&nd, op, flags | LOOKUP_RCU);   Note: this is where LOOKUP_RCU is set
4 path_init: nd->flags = flags;
5 open_last_lookups: nd->flags |= op->intent;   Note: this is where LOOKUP_OPEN is set

3 Clear the LOOKUP_CACHED bit (it was not set to begin with)
4 Call try_to_unlazy to switch nd from RCU-walk to holding real references
5 Check whether nd->flags has LOOKUP_IS_SCOPED set (not in our scenario)
6 Check whether nd->state has ND_JUMPED set; it does not, so return 0

try_to_unlazy is defined as follows

static bool try_to_unlazy(struct nameidata *nd)
{
	struct dentry *parent = nd->path.dentry;

	BUG_ON(!(nd->flags & LOOKUP_RCU));

	if (unlikely(!legitimize_links(nd)))
		goto out1;
	if (unlikely(!legitimize_path(nd, &nd->path, nd->seq)))
		goto out;
	if (unlikely(!legitimize_root(nd)))
		goto out;
	leave_rcu(nd);
	BUG_ON(nd->inode != parent->d_inode);
	return true;

out1:
	nd->path.mnt = NULL;
	nd->path.dentry = NULL;
out:
	leave_rcu(nd);
	return false;
}

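The overall flow is:
1 Call legitimize_links to take real references on any symlinks saved on nd's stack; not covered in detail
2 Call legitimize_path to take real references on nd->path and validate it against nd->seq; not covered in detail
3 Call legitimize_root to do the same for nd->root; not covered in detail
4 Call leave_rcu to leave RCU-walk mode. From here on nd holds real references, so it no longer depends on RCU protection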

leave_rcu is defined as follows

static void leave_rcu(struct nameidata *nd)
{
	nd->flags &= ~LOOKUP_RCU;
	nd->seq = nd->next_seq = 0;
	rcu_read_unlock();
}

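The flow is fairly simple:
1 Clear LOOKUP_RCU in nd->flags
2 Reset the sampled sequence counts nd->seq and nd->next_seq to 0
3 Call rcu_read_unlock to end the RCU read-side critical section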

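mnt_idmap

This touches on multi-core (SMP) issues, so I won't go into it here.

may_open

This is about permission checks, so I won't go into it here either.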

vfs_open

It is defined as follows

int vfs_open(const struct path *path, struct file *file)
{
	int ret;

	file->f_path = *path;
	ret = do_dentry_open(file, NULL);
	if (!ret) {
		/*
		 * Once we return a file with FMODE_OPENED, __fput() will call
		 * fsnotify_close(), so we need fsnotify_open() here for
		 * symmetry.
		 */
		fsnotify_open(file);
	}
	return ret;
}

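The flow is fairly simple:
1 Copy path into file->f_path. Note: the dentry found earlier by lookup_fast was stored into path->dentry in handle_mounts, and here that path is handed over to file->f_path
2 Call do_dentry_open to open the file
3 On success, call fsnotify_open to send a notification. That involves the event mechanism, which I won't go into here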

do_dentry_open
static int do_dentry_open(struct file *f,
			  int (*open)(struct inode *, struct file *))
{
	static const struct file_operations empty_fops = {};
	struct inode *inode = f->f_path.dentry->d_inode;
	int error;

	path_get(&f->f_path);
	f->f_inode = inode;
	f->f_mapping = inode->i_mapping;
	f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
	f->f_sb_err = file_sample_sb_err(f);

	if (unlikely(f->f_flags & O_PATH)) {
		f->f_mode = FMODE_PATH | FMODE_OPENED;
		f->f_op = &empty_fops;
		return 0;
	}

	if ((f->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ) {
		i_readcount_inc(inode);
	} else if (f->f_mode & FMODE_WRITE && !special_file(inode->i_mode)) {
		error = file_get_write_access(f);
		if (unlikely(error))
			goto cleanup_file;
		f->f_mode |= FMODE_WRITER;
	}

	/* POSIX.1-2008/SUSv4 Section XSI 2.9.7 */
	if (S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode))
		f->f_mode |= FMODE_ATOMIC_POS;

	f->f_op = fops_get(inode->i_fop);
	if (WARN_ON(!f->f_op)) {
		error = -ENODEV;
		goto cleanup_all;
	}

	error = security_file_open(f);
	if (error)
		goto cleanup_all;

	error = break_lease(file_inode(f), f->f_flags);
	if (error)
		goto cleanup_all;

	/* normally all 3 are set; ->open() can clear them if needed */
	f->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
	if (!open)
		open = f->f_op->open;
	if (open) {
		error = open(inode, f);
		if (error)
			goto cleanup_all;
	}
	f->f_mode |= FMODE_OPENED;
	if ((f->f_mode & FMODE_READ) &&
	     likely(f->f_op->read || f->f_op->read_iter))
		f->f_mode |= FMODE_CAN_READ;
	if ((f->f_mode & FMODE_WRITE) &&
	     likely(f->f_op->write || f->f_op->write_iter))
		f->f_mode |= FMODE_CAN_WRITE;
	if ((f->f_mode & FMODE_LSEEK) && !f->f_op->llseek)
		f->f_mode &= ~FMODE_LSEEK;
	if (f->f_mapping->a_ops && f->f_mapping->a_ops->direct_IO)
		f->f_mode |= FMODE_CAN_ODIRECT;

	f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
	f->f_iocb_flags = iocb_flags(f);

	file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);

	if ((f->f_flags & O_DIRECT) && !(f->f_mode & FMODE_CAN_ODIRECT))
		return -EINVAL;

	/*
	 * XXX: Huge page cache doesn't support writing yet. Drop all page
	 * cache for this file before processing writes.
	 */
	if (f->f_mode & FMODE_WRITE) {
		/*
		 * Paired with smp_mb() in collapse_file() to ensure nr_thps
		 * is up to date and the update to i_writecount by
		 * get_write_access() is visible. Ensures subsequent insertion
		 * of THPs into the page cache will fail.
		 */
		smp_mb();
		if (filemap_nr_thps(inode->i_mapping)) {
			struct address_space *mapping = inode->i_mapping;

			filemap_invalidate_lock(inode->i_mapping);
			/*
			 * unmap_mapping_range just need to be called once
			 * here, because the private pages is not need to be
			 * unmapped mapping (e.g. data segment of dynamic
			 * shared libraries here).
			 */
			unmap_mapping_range(mapping, 0, 0, 0);
			truncate_inode_pages(mapping, 0);
			filemap_invalidate_unlock(inode->i_mapping);
		}
	}

	return 0;

cleanup_all:
	if (WARN_ON_ONCE(error > 0))
		error = -EINVAL;
	fops_put(f->f_op);
	put_file_access(f);
cleanup_file:
	path_put(&f->f_path);
	f->f_path.mnt = NULL;
	f->f_path.dentry = NULL;
	f->f_inode = NULL;
	return error;
}

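This function is the core of open.
The overall flow is:
1 First fill in some members of file, mostly from the d_inode of the dentry found earlier (in the VFS the inode represents the actual file, so most of the file's information lives there)
2 Check whether O_PATH is set; if it is, mark the file FMODE_PATH | FMODE_OPENED, give it empty file operations and return (not our case)
3 Check whether the file is opened read-only or for writing. Read-only increments i_readcount, the member that counts readers; for writing (on a non-special file), file_get_write_access is called to check whether writing is allowed right now
4 Call fops_get to take the inode's i_fop and store it in f->f_op
5 Do the final security check (security_file_open), followed by break_lease
6 Get the open function from f->f_op
7 Call that open function. This line is the heart of the whole open path: everything before it is the VFS preparing, while this call is the part done by the concrete filesystem that holds the file. It connects the two. After this line, what remains is mostly bookkeeping.
8 Add the "opened" flag (FMODE_OPENED) to f->f_mode
9 If opened for reading and the file operations provide read or read_iter, add FMODE_CAN_READ, meaning reads are now possible; the same goes for writing
10 Call file_ra_state_init to initialize the file's readahead state
11 If opened for writing, call smp_mb();, which is a multi-core memory barrier, and then run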

if (filemap_nr_thps(inode->i_mapping)) {
			struct address_space *mapping = inode->i_mapping;

			filemap_invalidate_lock(inode->i_mapping);
			/*
			 * unmap_mapping_range just need to be called once
			 * here, because the private pages is not need to be
			 * unmapped mapping (e.g. data segment of dynamic
			 * shared libraries here).
			 */
			unmap_mapping_range(mapping, 0, 0, 0);
			truncate_inode_pages(mapping, 0);
			filemap_invalidate_unlock(inode->i_mapping);
		}

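12 I have not read this part closely. Going by the comment in the code, the page cache does not support writing to huge pages yet, so if the file's mapping holds any (filemap_nr_thps), its cached pages are unmapped and dropped before writes go ahead. Pointers from anyone who knows this better are welcome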

terminate_walk
static void terminate_walk(struct nameidata *nd)
{
	drop_links(nd);
	if (!(nd->flags & LOOKUP_RCU)) {
		int i;
		path_put(&nd->path);
		for (i = 0; i < nd->depth; i++)
			path_put(&nd->stack[i].link);
		if (nd->state & ND_ROOT_GRABBED) {
			path_put(&nd->root);
			nd->state &= ~ND_ROOT_GRABBED;
		}
	} else {
		leave_rcu(nd);
	}
	nd->depth = 0;
	nd->path.mnt = NULL;
	nd->path.dentry = NULL;
}

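The overall flow is:
1 Call drop_links to clear the link information held in nd
2 Check whether nd->flags has LOOKUP_RCU set (try_to_unlazy already called leave_rcu once, which cleared LOOKUP_RCU)
3 Call path_put to drop one reference on nd->path
4 If depth is greater than 0, drop one reference on each path saved in the stack
5 Check whether nd->state has ND_ROOT_GRABBED set; if so, drop the reference on nd->root and clear the flag
6 Set depth to 0
7 Set nd->path.mnt and nd->path.dentry to NULL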

The drop_links function

static inline void do_delayed_call(struct delayed_call *call)
{
	if (call->fn)
		call->fn(call->arg);
}

static inline void clear_delayed_call(struct delayed_call *call)
{
	call->fn = NULL;
}
static void drop_links(struct nameidata *nd)
{
	int i = nd->depth;
	while (i--) {
		struct saved *last = nd->stack + i;
		do_delayed_call(&last->done);
		clear_delayed_call(&last->done);
	}
}

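The overall flow is:
1 Walk the stack from the top entry down
2 Call do_delayed_call, which invokes the function fn points to (if there is one)
3 Call clear_delayed_call, which sets fn to NULL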

restore_nameidata
static void restore_nameidata(void)
{
	struct nameidata *now = current->nameidata, *old = now->saved;

	current->nameidata = old;
	if (old)
		old->total_link_count = now->total_link_count;
	if (now->stack != now->internal)
		kfree(now->stack);
}

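The overall flow is:
1 Give the previously saved nameidata back to the current process
2 Copy now->total_link_count into old->total_link_count
3 Free the current nd's stack if it was allocated separately, that is, if it is not the embedded internal array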

fd_install


void fd_install(unsigned int fd, struct file *file)
{
	struct files_struct *files = current->files;
	struct fdtable *fdt;

	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
		return;

	rcu_read_lock_sched();

	if (unlikely(files->resize_in_progress)) {
		rcu_read_unlock_sched();
		spin_lock(&files->file_lock);
		fdt = files_fdtable(files);
		BUG_ON(fdt->fd[fd] != NULL);
		rcu_assign_pointer(fdt->fd[fd], file);
		spin_unlock(&files->file_lock);
		return;
	}
	/* coupled with smp_wmb() in expand_fdtable() */
	smp_rmb();
	fdt = rcu_dereference_sched(files->fdt);
	BUG_ON(fdt->fd[fd] != NULL);
	rcu_assign_pointer(fdt->fd[fd], file);
	rcu_read_unlock_sched();
}

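Most of this function is RCU machinery for keeping the data consistent. The one line that does the real work is rcu_assign_pointer(fdt->fd[fd], file);, which stores file in slot fd of the fd table.
rcu_assign_pointer is defined as follows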

#define rcu_assign_pointer(p, v)	do { (p) = (v); } while (0)

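A plain assignment. As for why it is wrapped in do while(0) and how that differs from bare braces: the wrapper makes the macro expand to a single statement that still takes its trailing semicolon, so it stays safe inside an unbraced if/else. Note that this one-liner looks like a simplified stub; as far as I can tell, the definition in include/linux/rcupdate.h that the kernel build uses also orders the store (smp_store_release), so that readers never see a half-initialized object.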

putname

void putname(struct filename *name)
{
	if (IS_ERR(name))
		return;

	if (WARN_ON_ONCE(!atomic_read(&name->refcnt)))
		return;

	if (!atomic_dec_and_test(&name->refcnt))
		return;

	if (name->name != name->iname) {
		__putname(name->name);
		kfree(name);
	} else
		__putname(name);
}

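Basically this frees the memory occupied by name.
The overall flow is:
1 Check whether the pointer actually encodes an error value
2 Check whether name's reference count is already 0 (that would be a bug, hence the warning)
3 Decrement the reference count, and continue only if it has dropped to 0
4 Check whether name->name equals name->iname; in our scenario they are equal
5 Call __putname to free the memory allocated earlier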

__putname is defined as follows

#define __putname(name)		kmem_cache_free(names_cachep, (void *)(name))

That completes the walk through the open code. It covers a lot of ground, so omissions and mistakes are inevitable; corrections are welcome.

Open questions

These are the things I did not understand while writing this. I am recording them here to work through later.

build_open_flags

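1 I have not figured out what FMODE_NONOTIFY is for
2 compiletime_assert deserves a separate blog post
3 Why does __O_TMPFILE have to be combined with O_DIRECTORY to work? Compatibility with some older version?
4 What LOOKUP_OPEN does
5 (Hmm, if I comment out flags |= O_NOFOLLOW;, would that restriction go away?)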

getname

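1 Look into the audit mechanism later
2 atomic_set(&result->refcnt, 1); look into this atomic operation later
3 Step 6 of the overall flow: the failure case is hard to follow from this level alone
4 __getname() is also worth studying; I have read about slab in a book before
5 Atomic operations need more thought later: how do they relate to preemption?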

get_unused_fd_flags

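1 What is the rlim_cur member of the rlim entry at index 7 (RLIMIT_NOFILE) read for? (The end value of the bitmap)
2 spin_lock(&files->file_lock);
3 Look into the RCU mechanism later: rcu_access_pointer and rcu_assign_pointer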

do_filp_open

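1 Need to read restore_nameidata: what does it do? Add it to the overall flow of do_filp_open afterwards
2 Is p->stack = p->internal; in __set_nameidata there for symlinks?
3 The path member of nameidata is very useful, right?
4 What is the total_link_count member of nameidata for?
5 And saved in nameidata? What is that for? Once that is clear, go through the overall flow of __set_nameidata properly
6 Then, if the returned filp is a problem, the flags are changed and it is tried again, but does nd change?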

path_init

1 It is clear what value nameidata's path is given, but not yet which value inode gets; left to observe later

link_path_walk

1 An extra attribute is added to flags; what is that for?

load_unaligned_zeropad

1 What _ASM_EXTABLE_TYPE does
