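I have wanted to dig into the Linux file system for a while, so I will start from open().
Linux kernel version: 6.10, 64-bit
Architecture: x86
Source browsed at: https://elixir.bootlin.com/linux/v6.10/source
Scenario: open("dirTest/ppshuoTest", O_WRONLY);
Source: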
SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
{
if (force_o_largefile())
flags |= O_LARGEFILE;
return do_sys_open(AT_FDCWD, filename, flags, mode);
}
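So it is a macro. It belongs to the system call machinery, which I will not explain here because my target is the file system. For our purposes, the body of open() is just these few lines.
The flow: check the value returned by force_o_largefile(); if it is true, add O_LARGEFILE to flags; then call do_sys_open().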
force_o_largefile()
/ include / linux / fcntl.h
#ifndef force_o_largefile
#define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
#endif
Ah, another macro.
CONFIG_ARCH_32BIT_OFF_T
/ arch / Kconfig
config ARCH_32BIT_OFF_T
bool
depends on !64BIT
help
All new 32-bit architectures should have 64-bit off_t type on
userspace side which corresponds to the loff_t kernel type. This
is the requirement for modern ABIs. Some existing architectures
still support 32-bit off_t. This option is enabled for all such
architectures explicitly.
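Put simply, this option says the architecture still uses a 32-bit off_t. In general, when a Kconfig option is set to y (built in), autoconf.h defines CONFIG_FOO as 1; when a tristate option is set to m (module), CONFIG_FOO_MODULE is defined as 1 instead. ARCH_32BIT_OFF_T is a bool, so only the first form can occur for it; the second is shown only because IS_ENABLED() checks for it. For example:
#define CONFIG_ARCH_32BIT_OFF_T 1
or
#define CONFIG_ARCH_32BIT_OFF_T_MODULE 1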
IS_ENABLED()
/ include / linux / kconfig.h
/*
* IS_ENABLED(CONFIG_FOO) evaluates to 1 if CONFIG_FOO is set to 'y' or 'm',
* 0 otherwise. Note that CONFIG_FOO=y results in "#define CONFIG_FOO 1" in
* autoconf.h, while CONFIG_FOO=m results in "#define CONFIG_FOO_MODULE 1".
*/
#define IS_ENABLED(option) __or(IS_BUILTIN(option), IS_MODULE(option))
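The comment says IS_ENABLED() checks whether CONFIG_FOO (here CONFIG_ARCH_32BIT_OFF_T) is configured as built-in or as a module.
force_o_largefile() is the negation of that, so it is true only when the option is not set. On our 64-bit build the option cannot be set (it depends on !64BIT), so force_o_largefile() is true and O_LARGEFILE is always added. So this has little to do with the file system; it is decided by the build configuration. I still want to explain it fully because there is not much to it. You can skip the following and jump straight to do_sys_open, where the real source walkthrough begins.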
IS_BUILTIN
/ include / linux / kconfig.h
/*
* IS_BUILTIN(CONFIG_FOO) evaluates to 1 if CONFIG_FOO is set to 'y', 0
* otherwise. For boolean options, this is equivalent to
* IS_ENABLED(CONFIG_FOO).
*/
#define IS_BUILTIN(option) __is_defined(option)
/*
* IS_MODULE(CONFIG_FOO) evaluates to 1 if CONFIG_FOO is set to 'm', 0
* otherwise. CONFIG_FOO=m results in "#define CONFIG_FOO_MODULE 1" in
* autoconf.h.
*/
#define IS_MODULE(option) __is_defined(option##_MODULE)
Both expand to the __is_defined macro, so let's see what it does.
/ include / linux / kconfig.h
#define __ARG_PLACEHOLDER_1 0,
#define __take_second_arg(__ignored, val, ...) val
#define __is_defined(x) ___is_defined(x)
#define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val)
#define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0)
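The key piece is #define __ARG_PLACEHOLDER_1 0, and note the trailing comma after the 0: that comma is what makes the whole trick work.
(1) If CONFIG_ARCH_32BIT_OFF_T is defined as 1,
__ARG_PLACEHOLDER_##val becomes __ARG_PLACEHOLDER_1, which expands to 0,
so ____is_defined(arg1_or_junk) is ____is_defined(0,)
and __take_second_arg(arg1_or_junk 1, 0) becomes __take_second_arg(0, 1, 0), whose second argument is 1.
(2) If CONFIG_ARCH_32BIT_OFF_T is not defined,
val stays the bare token CONFIG_ARCH_32BIT_OFF_T, so __ARG_PLACEHOLDER_##val becomes __ARG_PLACEHOLDER_CONFIG_ARCH_32BIT_OFF_T, which is not a macro and contributes no comma.
____is_defined(arg1_or_junk) is then ____is_defined(__ARG_PLACEHOLDER_CONFIG_ARCH_32BIT_OFF_T)
and __take_second_arg(arg1_or_junk 1, 0) becomes __take_second_arg(__ARG_PLACEHOLDER_CONFIG_ARCH_32BIT_OFF_T 1, 0), whose second argument is 0.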
do_sys_open
/ fs / open.c
long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
struct open_how how = build_open_how(flags, mode);
return do_sys_openat2(dfd, filename, &how);
}
The function is only two lines.
The first builds an open_how object from flags and mode.
The second calls do_sys_openat2().
The dfd parameter is AT_FDCWD:
#define AT_FDCWD -100 /* Special value used to indicate
openat should use the current
working directory. */
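It means "use the current working directory". What does that mean in practice?
It is easiest to explain together with openat(), even though we are analyzing open(). First compare the two prototypes: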
int openat(int dirfd, const char *pathname, int flags);
int open(const char *pathname, int flags);
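The difference is the first parameter, dirfd: literally, the file descriptor (fd) of a directory (dir).
If pathname is relative, it is resolved relative to the directory that dirfd refers to instead of the current working directory. Passing AT_FDCWD as dirfd selects the current working directory, which is exactly what open() does.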
build_open_how
#define WILL_CREATE(flags) (flags & (O_CREAT | __O_TMPFILE))
#define O_PATH_FLAGS (O_DIRECTORY | O_NOFOLLOW | O_PATH | O_CLOEXEC)
inline struct open_how build_open_how(int flags, umode_t mode)
{
struct open_how how = {
.flags = flags & VALID_OPEN_FLAGS,
.mode = mode & S_IALLUGO,
};
/* O_PATH beats everything else. */
if (how.flags & O_PATH)
how.flags &= O_PATH_FLAGS;
/* Modes should only be set for create-like flags. */
if (!WILL_CREATE(how.flags))
how.mode = 0;
return how;
}
First, flags is filtered with a bitwise AND. VALID_OPEN_FLAGS is defined as follows:
#define VALID_OPEN_FLAGS \
(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
O_APPEND | O_NDELAY | O_NONBLOCK | __O_SYNC | O_DSYNC | \
FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
These should look familiar: they are the flags you can pass to open(). mode is filtered the same way.
Then come two special cases:
#define O_PATH 010000000
if (how.flags & O_PATH)
how.flags &= O_PATH_FLAGS;
If O_PATH is set, only the flags that make sense together with O_PATH are kept.
#define WILL_CREATE(flags) (flags & (O_CREAT | __O_TMPFILE))
/* Modes should only be set for create-like flags. */
if (!WILL_CREATE(how.flags))
how.mode = 0;
The comment says it all: next it checks whether flags contains O_CREAT or __O_TMPFILE. If either is set, mode is left alone; otherwise mode is forced to 0.
do_sys_openat2
static long do_sys_openat2(int dfd, const char __user *filename,
struct open_how *how)
{
struct open_flags op;
int fd = build_open_flags(how, &op);
struct filename *tmp;
if (fd)
return fd;
tmp = getname(filename);
if (IS_ERR(tmp))
return PTR_ERR(tmp);
fd = get_unused_fd_flags(how->flags);
if (fd >= 0) {
struct file *f = do_filp_open(dfd, tmp, &op);
if (IS_ERR(f)) {
put_unused_fd(fd);
fd = PTR_ERR(f);
} else {
fd_install(fd, f);
}
}
putname(tmp);
return fd;
}
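Judging by the parameters, flags and mode have already been tidied up and stored in a struct open_how. The overall flow of do_sys_openat2 is:
1 Call build_open_flags() to fill in op.
2 Call getname() to build the filename (it returns a struct pointer, so the memory is allocated inside and the members are filled in there).
3 Call get_unused_fd_flags() to reserve an unused file descriptor number.
4 Call do_filp_open(); from its arguments you can guess this is where the file is actually opened.
5 Call fd_install() to bind the struct file to the fd returned by get_unused_fd_flags(); on failure put_unused_fd() gives the number back.
6 Call putname() to release the filename.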
build_open_flags
Source:
inline int build_open_flags(const struct open_how *how, struct open_flags *op)
{
u64 flags = how->flags;
u64 strip = __FMODE_NONOTIFY | O_CLOEXEC;
int lookup_flags = 0;
int acc_mode = ACC_MODE(flags);
BUILD_BUG_ON_MSG(upper_32_bits(VALID_OPEN_FLAGS),
"struct open_flags doesn't yet handle flags > 32 bits");
/*
* Strip flags that either shouldn't be set by userspace like
* FMODE_NONOTIFY or that aren't relevant in determining struct
* open_flags like O_CLOEXEC.
*/
flags &= ~strip;
/*
* Older syscalls implicitly clear all of the invalid flags or argument
* values before calling build_open_flags(), but openat2(2) checks all
* of its arguments.
*/
if (flags & ~VALID_OPEN_FLAGS)
return -EINVAL;
if (how->resolve & ~VALID_RESOLVE_FLAGS)
return -EINVAL;
/* Scoping flags are mutually exclusive. */
if ((how->resolve & RESOLVE_BENEATH) && (how->resolve & RESOLVE_IN_ROOT))
return -EINVAL;
/* Deal with the mode. */
if (WILL_CREATE(flags)) {
if (how->mode & ~S_IALLUGO)
return -EINVAL;
op->mode = how->mode | S_IFREG;
} else {
if (how->mode != 0)
return -EINVAL;
op->mode = 0;
}
/*
* Block bugs where O_DIRECTORY | O_CREAT created regular files.
* Note, that blocking O_DIRECTORY | O_CREAT here also protects
* O_TMPFILE below which requires O_DIRECTORY being raised.
*/
if ((flags & (O_DIRECTORY | O_CREAT)) == (O_DIRECTORY | O_CREAT))
return -EINVAL;
/* Now handle the creative implementation of O_TMPFILE. */
if (flags & __O_TMPFILE) {
/*
* In order to ensure programs get explicit errors when trying
* to use O_TMPFILE on old kernels we enforce that O_DIRECTORY
* is raised alongside __O_TMPFILE.
*/
if (!(flags & O_DIRECTORY))
return -EINVAL;
if (!(acc_mode & MAY_WRITE))
return -EINVAL;
}
if (flags & O_PATH) {
/* O_PATH only permits certain other flags to be set. */
if (flags & ~O_PATH_FLAGS)
return -EINVAL;
acc_mode = 0;
}
/*
* O_SYNC is implemented as __O_SYNC|O_DSYNC. As many places only
* check for O_DSYNC if the need any syncing at all we enforce it's
* always set instead of having to deal with possibly weird behaviour
* for malicious applications setting only __O_SYNC.
*/
if (flags & __O_SYNC)
flags |= O_DSYNC;
op->open_flag = flags;
/* O_TRUNC implies we need access checks for write permissions */
if (flags & O_TRUNC)
acc_mode |= MAY_WRITE;
/* Allow the LSM permission hook to distinguish append
access from general write access. */
if (flags & O_APPEND)
acc_mode |= MAY_APPEND;
op->acc_mode = acc_mode;
op->intent = flags & O_PATH ? 0 : LOOKUP_OPEN;
if (flags & O_CREAT) {
op->intent |= LOOKUP_CREATE;
if (flags & O_EXCL) {
op->intent |= LOOKUP_EXCL;
flags |= O_NOFOLLOW;
}
}
if (flags & O_DIRECTORY)
lookup_flags |= LOOKUP_DIRECTORY;
if (!(flags & O_NOFOLLOW))
lookup_flags |= LOOKUP_FOLLOW;
if (how->resolve & RESOLVE_NO_XDEV)
lookup_flags |= LOOKUP_NO_XDEV;
if (how->resolve & RESOLVE_NO_MAGICLINKS)
lookup_flags |= LOOKUP_NO_MAGICLINKS;
if (how->resolve & RESOLVE_NO_SYMLINKS)
lookup_flags |= LOOKUP_NO_SYMLINKS;
if (how->resolve & RESOLVE_BENEATH)
lookup_flags |= LOOKUP_BENEATH;
if (how->resolve & RESOLVE_IN_ROOT)
lookup_flags |= LOOKUP_IN_ROOT;
if (how->resolve & RESOLVE_CACHED) {
/* Don't bother even trying for create/truncate/tmpfile open */
if (flags & (O_TRUNC | O_CREAT | __O_TMPFILE))
return -EAGAIN;
lookup_flags |= LOOKUP_CACHED;
}
op->lookup_flags = lookup_flags;
return 0;
}
u64 flags = how->flags;
u64 strip = __FMODE_NONOTIFY | O_CLOEXEC;
The first line is a plain assignment. The definitions behind __FMODE_NONOTIFY are:
#define __FMODE_NONOTIFY ((__force int) FMODE_NONOTIFY)
#define FMODE_NONOTIFY ((__force fmode_t)(1 << 26))
typedef unsigned int __bitwise fmode_t;
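For __bitwise and __force, see: https://blog.youkuaiyun.com/RNG_uzi1111111/article/details/140937157?spm=1001.2014.3001.5501
For now you can read the code as if they were not there.
In short, strip is the OR of two flags. I have not yet worked out what FMODE_NONOTIFY is for.
Example:
strip is 67633152, which is 0402000000 in octal, i.e. FMODE_NONOTIFY (1 << 26 = 0400000000) | O_CLOEXEC (02000000). It matches.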
#define O_ACCMODE 00000003
#define O_RDONLY 00000000
#define O_WRONLY 00000001
#define O_RDWR 00000002
#define ACC_MODE(x) ("\004\002\006\006"[(x)&O_ACCMODE])
int lookup_flags = 0;
int acc_mode = ACC_MODE(flags);
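This maps the access-mode bits of flags (the low two bits selected by O_ACCMODE) to a permission mask. The notation looks odd, so here is an explanation.
"\004\002\006\006"[(x)&O_ACCMODE]
For octal escapes see https://xiaoxiami.gitbook.io/c/zhuan-yi-zi-fu-he-kong-bai-fu
In short, "\004\002\006\006" is a string literal, i.e. a char array whose elements are 4, 2, 6 and 6, and it is indexed with (x) & O_ACCMODE.
In other words: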
flag | (x) & O_ACCMODE | acc_mode |
---|---|---|
O_RDONLY | 0 | 4 |
O_WRONLY | 1 | 2 |
O_RDWR | 2 | 6 |
O_ACCMODE | 3 | 6 |
Example:
flags is 557056, which is 02100000 in octal, i.e. O_LARGEFILE (00100000) | O_CLOEXEC (02000000). ANDing it with O_ACCMODE gives 0, so acc_mode is 4.
BUILD_BUG_ON_MSG(upper_32_bits(VALID_OPEN_FLAGS),
"struct open_flags doesn't yet handle flags > 32 bits");
#define upper_32_bits(n) ((u32)(((n) >> 16) >> 16))
/**
* BUILD_BUG_ON_MSG - break compile if a condition is true & emit supplied
* error message.
* @condition: the condition which the compiler should know is false.
*
* See BUILD_BUG_ON for description.
*/
#define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
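This checks whether VALID_OPEN_FLAGS has any bit set above bit 31; if so, compilation stops. So compiletime_assert presumably expands to something the compiler evaluates at build time.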
/*
* Strip flags that either shouldn't be set by userspace like
* FMODE_NONOTIFY or that aren't relevant in determining struct
* open_flags like O_CLOEXEC.
*/
flags &= ~strip;
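The comment is explicit, though the two flags are stripped for different reasons: FMODE_NONOTIFY must not be set by userspace, while O_CLOEXEC simply plays no part in building struct open_flags. Both bits are cleared here.
Example:
flags is now 32768, which is 0100000 in octal, i.e. O_LARGEFILE. It started as O_LARGEFILE | O_CLOEXEC and only one flag is left. O_CLOEXEC is not lost, though: this function works on a local copy, and do_sys_openat2 later passes how->flags, which still contains O_CLOEXEC, to get_unused_fd_flags().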
/*
* Older syscalls implicitly clear all of the invalid flags or argument
* values before calling build_open_flags(), but openat2(2) checks all
* of its arguments.
*/
if (flags & ~VALID_OPEN_FLAGS)
return -EINVAL;
if (how->resolve & ~VALID_RESOLVE_FLAGS)
return -EINVAL;
/* Scoping flags are mutually exclusive. */
if ((how->resolve & RESOLVE_BENEATH) && (how->resolve & RESOLVE_IN_ROOT))
return -EINVAL;
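The first if checks the flags again, because this function is also called from other places (such as io_openat2), so it validates them to be safe. The next two ifs do not matter for us: resolve is always 0 on this path, so both pass.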
/* Deal with the mode. */
if (WILL_CREATE(flags)) {
if (how->mode & ~S_IALLUGO)
return -EINVAL;
op->mode = how->mode | S_IFREG;
} else {
if (how->mode != 0)
return -EINVAL;
op->mode = 0;
}
If flags contains O_CREAT or __O_TMPFILE, the mode is validated and op->mode is set from how->mode (with S_IFREG added). Otherwise how->mode must be 0, and op->mode is set to 0.
/*
* Block bugs where O_DIRECTORY | O_CREAT created regular files.
* Note, that blocking O_DIRECTORY | O_CREAT here also protects
* O_TMPFILE below which requires O_DIRECTORY being raised.
*/
if ((flags & (O_DIRECTORY | O_CREAT)) == (O_DIRECTORY | O_CREAT))
return -EINVAL;
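O_DIRECTORY only ensures that the thing being opened is a directory; combining O_DIRECTORY with O_CREAT does not mean "create a directory". So if flags contains both O_DIRECTORY and O_CREAT, the function returns -EINVAL.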
#define MAY_WRITE 0x00000002
/* Now handle the creative implementation of O_TMPFILE. */
if (flags & __O_TMPFILE) {
/*
* In order to ensure programs get explicit errors when trying
* to use O_TMPFILE on old kernels we enforce that O_DIRECTORY
* is raised alongside __O_TMPFILE.
*/
if (!(flags & O_DIRECTORY))
return -EINVAL;
if (!(acc_mode & MAY_WRITE))
return -EINVAL;
}
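This checks whether open() is being asked to create a temporary file. So that programs get an explicit error on old kernels, O_DIRECTORY must be set as well.
acc_mode must also include write access, i.e. the open must use O_RDWR or O_WRONLY so that acc_mode is 6 or 2; otherwise the call is rejected with -EINVAL.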
if (flags & O_PATH) {
/* O_PATH only permits certain other flags to be set. */
if (flags & ~O_PATH_FLAGS)
return -EINVAL;
acc_mode = 0;
}
One more check: if O_PATH is set, every flag in flags must be within O_PATH_FLAGS, otherwise the call is rejected. acc_mode is then set to 0.
#define O_DSYNC 00010000
#ifndef O_SYNC
#define __O_SYNC 04000000
#define O_SYNC (__O_SYNC|O_DSYNC)
#endif
/*
* O_SYNC is implemented as __O_SYNC|O_DSYNC. As many places only
* check for O_DSYNC if the need any syncing at all we enforce it's
* always set instead of having to deal with possibly weird behaviour
* for malicious applications setting only __O_SYNC.
*/
if (flags & __O_SYNC)
flags |= O_DSYNC;
The comment is explicit: a malicious program might set only __O_SYNC, so O_DSYNC is forced on to be safe.
op->open_flag = flags;
At this point flags is final.
/* O_TRUNC implies we need access checks for write permissions */
if (flags & O_TRUNC)
acc_mode |= MAY_WRITE;
/* Allow the LSM permission hook to distinguish append
access from general write access. */
if (flags & O_APPEND)
acc_mode |= MAY_APPEND;
op->acc_mode = acc_mode;
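This sets acc_mode. As the name says, acc_mode stores the access mode: read, write, or read/write. But flags holds that too, so why keep it separately? My guess is that this keeps it apart from the other flags; lumping everything together would be less clear. Later acc_mode is used to check whether the process has permission to open the file.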
#define LOOKUP_OPEN 0x0100
op->intent = flags & O_PATH ? 0 : LOOKUP_OPEN;
This checks whether O_PATH is set. If so, intent is 0; otherwise it is LOOKUP_OPEN. (One open question for me: what is LOOKUP_OPEN used for?)
if (flags & O_CREAT) {
op->intent |= LOOKUP_CREATE;
if (flags & O_EXCL) {
op->intent |= LOOKUP_EXCL;
flags |= O_NOFOLLOW;
}
}
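If flags contains O_CREAT, LOOKUP_CREATE is added to op->intent. If O_EXCL is also set, LOOKUP_EXCL is added too, and O_NOFOLLOW is additionally set in flags. The reason is in the man page: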
O_EXCL
Ensure that this call creates the file: if this flag is
specified in conjunction with O_CREAT, and pathname
already exists, then open() fails with the error EEXIST.
When these two flags are specified, symbolic links are not
followed: if pathname is a symbolic link, then open()
fails regardless of where the symbolic link points.
O_NOFOLLOW
If the trailing component (i.e., basename) of pathname is
a symbolic link, then the open fails, with the error
ELOOP.
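In other words, with O_CREAT | O_EXCL the final path component must not be a symbolic link. Inside the kernel that rule is handled through O_NOFOLLOW, so the flag is added here. (Hmm, if I commented this line out, would the restriction go away?)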
if (flags & O_DIRECTORY)
lookup_flags |= LOOKUP_DIRECTORY;
if (!(flags & O_NOFOLLOW))
lookup_flags |= LOOKUP_FOLLOW;
if (how->resolve & RESOLVE_NO_XDEV)
lookup_flags |= LOOKUP_NO_XDEV;
if (how->resolve & RESOLVE_NO_MAGICLINKS)
lookup_flags |= LOOKUP_NO_MAGICLINKS;
if (how->resolve & RESOLVE_NO_SYMLINKS)
lookup_flags |= LOOKUP_NO_SYMLINKS;
if (how->resolve & RESOLVE_BENEATH)
lookup_flags |= LOOKUP_BENEATH;
if (how->resolve & RESOLVE_IN_ROOT)
lookup_flags |= LOOKUP_IN_ROOT;
if (how->resolve & RESOLVE_CACHED) {
/* Don't bother even trying for create/truncate/tmpfile open */
if (flags & (O_TRUNC | O_CREAT | __O_TMPFILE))
return -EAGAIN;
lookup_flags |= LOOKUP_CACHED;
}
op->lookup_flags = lookup_flags;
return 0;
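lookup_flags is built from flags and how->resolve and then stored in op->lookup_flags. Conveniently, how->resolve is 0 for us.
Example:
As said above, only O_LARGEFILE is left in flags by that point, so intent is 256, i.e. LOOKUP_OPEN. I forgot to say how the final lookup_flags is set: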
#define LOOKUP_FOLLOW 0x0001 /* follow links at the end */
if (!(flags & O_NOFOLLOW))
lookup_flags |= LOOKUP_FOLLOW;
See: if O_NOFOLLOW is not set, LOOKUP_FOLLOW is added to lookup_flags. Everything matches.
getname
struct filename *
getname(const char __user * filename)
{
return getname_flags(filename, 0, NULL);
}
OK, a thin wrapper. For the __user macro see: https://blog.youkuaiyun.com/RNG_uzi1111111/article/details/141070859?spm=1001.2014.3001.5501
You can also skip that and simply ignore the annotation.
The code of getname_flags:
struct filename *
getname_flags(const char __user *filename, int flags, int *empty)
{
struct filename *result;
char *kname;
int len;
result = audit_reusename(filename);
if (result)
return result;
result = __getname();
if (unlikely(!result))
return ERR_PTR(-ENOMEM);
/*
* First, try to embed the struct filename inside the names_cache
* allocation
*/
kname = (char *)result->iname;
result->name = kname;
len = strncpy_from_user(kname, filename, EMBEDDED_NAME_MAX);
if (unlikely(len < 0)) {
__putname(result);
return ERR_PTR(len);
}
/*
* Uh-oh. We have a name that's approaching PATH_MAX. Allocate a
* separate struct filename so we can dedicate the entire
* names_cache allocation for the pathname, and re-do the copy from
* userland.
*/
if (unlikely(len == EMBEDDED_NAME_MAX)) {
const size_t size = offsetof(struct filename, iname[1]);
kname = (char *)result;
/*
* size is chosen that way we to guarantee that
* result->iname[0] is within the same object and that
* kname can't be equal to result->iname, no matter what.
*/
result = kzalloc(size, GFP_KERNEL);
if (unlikely(!result)) {
__putname(kname);
return ERR_PTR(-ENOMEM);
}
result->name = kname;
len = strncpy_from_user(kname, filename, PATH_MAX);
if (unlikely(len < 0)) {
__putname(kname);
kfree(result);
return ERR_PTR(len);
}
if (unlikely(len == PATH_MAX)) {
__putname(kname);
kfree(result);
return ERR_PTR(-ENAMETOOLONG);
}
}
atomic_set(&result->refcnt, 1);
/* The empty path is special. */
if (unlikely(!len)) {
if (empty)
*empty = 1;
if (!(flags & LOOKUP_EMPTY)) {
putname(result);
return ERR_PTR(-ENOENT);
}
}
result->uptr = filename;
result->aname = NULL;
audit_getname(result);
return result;
}
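Overall flow:
1 Call audit_reusename() to see whether the audit subsystem already holds this name; if so, reuse it.
2 Call __getname() to allocate memory.
3 Point result->name at the struct's own iname array (for flexible array members see: https://www.cnblogs.com/gexin/p/9116292.html).
4 Copy the path string into result.
5 If the copy succeeds, fill in the remaining members.
6 Otherwise, handle the special cases.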
audit_reusename
static inline struct filename *audit_reusename(const __user char *name)
{
if (unlikely(!audit_dummy_context()))
return __audit_reusename(name);
return NULL;
}
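It first calls audit_dummy_context() to check whether the current process has an audit context. If it does, __audit_reusename() checks whether this name is already recorded there; if it is not, it will be added after the filename has been created.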
audit_dummy_context
The code of audit_dummy_context:
static inline bool audit_dummy_context(void)
{
void *p = audit_context();
return !p || *(int *)p;
}
The code of audit_context:
static inline struct audit_context *audit_context(void)
{
return current->audit_context;
}
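current is the pointer to the current process, and audit_context() returns that process's audit context.
The logic of audit_dummy_context is a little hard to read: it returns false only when the audit context pointer is non-NULL and its dummy member is 0 (dummy is the first member of struct audit_context, which is why the *(int *)p cast works). Presumably 0 has a special meaning; that is beyond the scope here and I will not look into it for now.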
__audit_reusename
Code:
struct filename *
__audit_reusename(const __user char *uptr)
{
struct audit_context *context = audit_context();
struct audit_names *n;
list_for_each_entry(n, &context->names_list, list) {
if (!n->name)
continue;
if (n->name->uptr == uptr) {
atomic_inc(&n->name->refcnt);
return n->name;
}
}
return NULL;
}
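It fetches the current process's audit context and compares uptr against the names in its list. On a match it takes a reference and returns that name; otherwise it returns NULL.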
__getname
extern struct kmem_cache *names_cachep;
#define __getname() kmem_cache_alloc(names_cachep, GFP_KERNEL)
#define __putname(name) kmem_cache_free(names_cachep, (void *)(name))
This touches memory management; all we need to know is that it hands us a block of memory.
kname = (char *)result->iname;
result->name = kname;
len = strncpy_from_user(kname, filename, EMBEDDED_NAME_MAX);
if (unlikely(len < 0)) {
__putname(result);
return ERR_PTR(len);
}
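This part points the pointer inside the struct at its own embedded character array and then copies the path argument of open() into it. I had hoped to cover strncpy_from_user if it were simple, but... never mind, the plan was to study the file system. For everything else, knowing what it does is enough.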
if (unlikely(len == EMBEDDED_NAME_MAX)) {
const size_t size = offsetof(struct filename, iname[1]);
kname = (char *)result;
/*
* size is chosen that way we to guarantee that
* result->iname[0] is within the same object and that
* kname can't be equal to result->iname, no matter what.
*/
result = kzalloc(size, GFP_KERNEL);
if (unlikely(!result)) {
__putname(kname);
return ERR_PTR(-ENOMEM);
}
result->name = kname;
len = strncpy_from_user(kname, filename, PATH_MAX);
if (unlikely(len < 0)) {
__putname(kname);
kfree(result);
return ERR_PTR(len);
}
if (unlikely(len == PATH_MAX)) {
__putname(kname);
kfree(result);
return ERR_PTR(-ENAMETOOLONG);
}
}
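If the path fills the embedded buffer (len == EMBEDDED_NAME_MAX), it is handled specially:
1 Compute size, the size of struct filename up to and including the first byte of its iname array.
2 Allocate a new block of that size for the struct.
3 Use the whole of the originally allocated block to store the path.
4 If the path fits now, fine; if it still reaches PATH_MAX, return -ENAMETOOLONG.
This rarely happens in practice.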
atomic_set(&result->refcnt, 1);
/* The empty path is special. */
if (unlikely(!len)) {
if (empty)
*empty = 1;
if (!(flags & LOOKUP_EMPTY)) {
putname(result);
return ERR_PTR(-ENOENT);
}
}
Mark result as referenced: atomic_set is an atomic operation that simply sets the value. Then the empty path is handled specially:
1 empty is certainly NULL here, because
struct filename *
getname(const char __user * filename)
{
return getname_flags(filename, 0, NULL);
}
we passed NULL when calling the function.
2 Check the LOOKUP_EMPTY bit in flags; if it is not set, an empty path is rejected with -ENOENT.
result->uptr = filename;
result->aname = NULL;
audit_getname(result);
return result;
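Record the original user-space address, set aname (the audit names entry) to NULL, and register the name with audit. I will skip audit_getname and study it when I have time.
Summary: getname copies the path from user space into kernel space and represents it with a struct filename.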
get_unused_fd_flags
Code:
#ifndef RLIMIT_NOFILE
# define RLIMIT_NOFILE 7 /* max number of open files */
#endif
int get_unused_fd_flags(unsigned flags)
{
return __get_unused_fd_flags(flags, rlimit(RLIMIT_NOFILE));
}
rlimit
rlimit is defined as:
static inline unsigned long rlimit(unsigned int limit)
{
return task_rlimit(current, limit);
}
Just a wrapper; current is the pointer to the current process.
task_rlimit is defined as:
static inline unsigned long task_rlimit(const struct task_struct *task,
unsigned int limit)
{
return READ_ONCE(task->signal->rlim[limit].rlim_cur);
}
READ_ONCE can be treated as a plain read here. In short, it reads the rlim_cur member (the soft limit) of the rlim entry at index 7, i.e. RLIMIT_NOFILE.
__get_unused_fd_flags
int __get_unused_fd_flags(unsigned flags, unsigned long nofile)
{
return alloc_fd(0, nofile, flags);
}
This just calls alloc_fd(), where nofile is the rlim_cur value from above.
alloc_fd is defined as:
/*
* allocate a file descriptor, mark it busy.
*/
static int alloc_fd(unsigned start, unsigned end, unsigned flags)
{
struct files_struct *files = current->files;
unsigned int fd;
int error;
struct fdtable *fdt;
spin_lock(&files->file_lock);
repeat:
fdt = files_fdtable(files);
fd = start;
if (fd < files->next_fd)
fd = files->next_fd;
if (fd < fdt->max_fds)
fd = find_next_fd(fdt, fd);
/*
* N.B. For clone tasks sharing a files structure, this test
* will limit the total number of files that can be opened.
*/
error = -EMFILE;
if (fd >= end)
goto out;
error = expand_files(files, fd);
if (error < 0)
goto out;
/*
* If we needed to expand the fs array we
* might have blocked - try again.
*/
if (error)
goto repeat;
if (start <= files->next_fd)
files->next_fd = fd + 1;
__set_open_fd(fd, fdt);
if (flags & O_CLOEXEC)
__set_close_on_exec(fd, fdt);
else
__clear_close_on_exec(fd, fdt);
error = fd;
#if 1
/* Sanity check */
if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
rcu_assign_pointer(fdt->fd[fd], NULL);
}
#endif
out:
spin_unlock(&files->file_lock);
return error;
}
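That makes it clear: start is 0, end is the rlim_cur value, and flags is how->flags (set in build_open_how, so fairly close to what the user passed). The goal of the function is to find a position whose bit is 0 in the current process's fd table. The table carries a bitmap in which each bit stands for one file descriptor slot: 0 means free, 1 means in use.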
struct files_struct *files = current->files;
unsigned int fd;
int error;
struct fdtable *fdt;
Nothing special: initialize files and declare some variables.
spin_lock(&files->file_lock);
A spinlock, for mutual exclusion.
fdt = files_fdtable(files);
fd = start;
if (fd < files->next_fd)
fd = files->next_fd;
if (fd < fdt->max_fds)
fd = find_next_fd(fdt, fd);
#define files_fdtable(files) \
rcu_dereference_check_fdtable((files), (files)->fdt)
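This involves RCU; here it is enough to know that it assigns the fdt member of files to fdt.
The search then starts at start; if start is less than next_fd, it starts at next_fd instead. So next_fd looks like a hint for the lowest descriptor that might be free: if next_fd is 100, then 0 to 99 are known to be in use and a free slot can only be at 100 or later, which saves some searching.
Then it checks that fd is less than max_fds, which is normally the case, and calls find_next_fd(fdt, fd) to find the index.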
find_next_fd is:
static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
{
unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
unsigned int maxbit = maxfd / BITS_PER_LONG;
unsigned int bitbit = start / BITS_PER_LONG;
bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
if (bitbit >= maxfd)
return maxfd;
if (bitbit > start)
start = bitbit;
return find_next_zero_bit(fdt->open_fds, maxfd, start);
}
#ifdef CONFIG_64BIT
#define BITS_PER_LONG 64
#else
#define BITS_PER_LONG 32
#endif
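See, this depends on the word size, so maxfd is a multiple of it: 64, 128, 192 and so on. maxfd and start are each divided by BITS_PER_LONG and stored in maxbit and bitbit. This effectively splits the bitmap into blocks: first find a block that still has a 0 bit, then search inside it. That is faster than checking bit by bit.
find_next_zero_bit is defined as: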
/**
* find_next_zero_bit - find the next cleared bit in a memory region
* @addr: The address to base the search on
* @size: The bitmap size in bits
* @offset: The bitnumber to start searching at
*
* Returns the bit number of the next zero bit
* If no bits are zero, returns @size.
*/
static inline
unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
unsigned long offset)
{
if (small_const_nbits(size)) {
unsigned long val;
if (unlikely(offset >= size))
return size;
val = *addr | ~GENMASK(size - 1, offset);
return val == ~0UL ? size : ffz(val);
}
return _find_next_zero_bit(addr, size, offset);
}
small_const_nbits is defined as:
#define small_const_nbits(nbits) \
(__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG && (nbits) > 0)
__builtin_constant_p is a GCC builtin that tells whether nbits is a compile-time constant. Ours certainly is not, so _find_next_zero_bit is called.
_find_next_zero_bit is:
unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
unsigned long start)
{
return FIND_NEXT_BIT(~addr[idx], /* nop */, nbits, start);
}
Note that the first argument passed in is ~addr[idx], i.e. each word is bitwise inverted.
The FIND_NEXT_BIT macro is defined as:
#define FIND_NEXT_BIT(FETCH, MUNGE, size, start) \
({ \
unsigned long mask, idx, tmp, sz = (size), __start = (start); \
\
if (unlikely(__start >= sz)) \
goto out; \
\
mask = MUNGE(BITMAP_FIRST_WORD_MASK(__start)); \
idx = __start / BITS_PER_LONG; \
\
for (tmp = (FETCH) & mask; !tmp; tmp = (FETCH)) { \
if ((idx + 1) * BITS_PER_LONG >= sz) \
goto out; \
idx++; \
} \
\
sz = min(idx * BITS_PER_LONG + __ffs(MUNGE(tmp)), sz); \
out: \
sz; \
})
Expanded:
({
unsigned long mask, idx, tmp, sz = (nbits), __start = (start);
if (unlikely(__start >= sz)) goto out;
mask = (BITMAP_FIRST_WORD_MASK(__start));
idx = __start / BITS_PER_LONG;
for (tmp = (~addr[idx]) & mask; !tmp; tmp = (~addr[idx])) {
if ((idx + 1) * BITS_PER_LONG >= sz) goto out;
idx++;
}
sz = min(idx * BITS_PER_LONG + __ffs((tmp)), sz);
out:
sz; })
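That is much clearer. FIND_NEXT_BIT simply finds the position of a set bit in whatever FETCH yields; because FETCH is ~addr[idx] here, that is a 0 bit in the original bitmap. FETCH is the expression that reads one word of the unsigned long array, MUNGE is unused here, size is the total length in bits, and start is the starting bit.
First start is compared with size; normally that is fine.
Then a mask is computed: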
#define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
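How it is computed:
1 ~0UL sets every bit to 1.
2 Shift left by start % BITS_PER_LONG bits, which clears the bits below start within its word.
Then __start / BITS_PER_LONG is stored in idx.
Then the loop runs. tmp is initialized to (FETCH) & mask because the bits before start must be ignored, so the search effectively begins at start (I find it slightly redundant, but logically it is the right thing to do). The loop continues until some word of the bitmap has a 0 bit; idx is then the index of that word and tmp its (inverted) value, and the bit number follows from the two.
__ffs is: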
static unsigned int generic___ffs(unsigned long word)
{
unsigned int num = 0;
#if BITS_PER_LONG == 64
if ((word & 0xffffffff) == 0) {
num += 32;
word >>= 32;
}
#endif
if ((word & 0xffff) == 0) {
num += 16;
word >>= 16;
}
if ((word & 0xff) == 0) {
num += 8;
word >>= 8;
}
if ((word & 0xf) == 0) {
num += 4;
word >>= 4;
}
if ((word & 0x3) == 0) {
num += 2;
word >>= 2;
}
if ((word & 0x1) == 0)
num += 1;
return num;
}
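This function finds the position of the lowest set bit in word. It first checks whether the low 32 bits contain one and, if not, moves on to the upper half, then narrows down by 16, 8, 4, 2 and 1 bits. It is clear enough and needs no further comment.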
With the macro explained, find_next_fd is easy to read now:
static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
{
unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
unsigned int maxbit = maxfd / BITS_PER_LONG;
unsigned int bitbit = start / BITS_PER_LONG;
bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
if (bitbit >= maxfd)
return maxfd;
if (bitbit > start)
start = bitbit;
return find_next_zero_bit(fdt->open_fds, maxfd, start);
}
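First find a block that has room, then search within that block. Note: the first search is over the full_fds_bits member and the second over open_fds. I first assumed both searched the same bitmap and spent a long time confused.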
With that done, back to alloc_fd:
if (fd < fdt->max_fds)
fd = find_next_fd(fdt, fd);
/*
* N.B. For clone tasks sharing a files structure, this test
* will limit the total number of files that can be opened.
*/
error = -EMFILE;
if (fd >= end)
goto out;
error = expand_files(files, fd);
if (error < 0)
goto out;
/*
* If we needed to expand the fs array we
* might have blocked - try again.
*/
if (error)
goto repeat;
if (start <= files->next_fd)
files->next_fd = fd + 1;
After finding it, check that it does not reach the end, then call expand_files().
expand_files is defined as:
static int expand_files(struct files_struct *files, unsigned int nr)
__releases(files->file_lock)
__acquires(files->file_lock)
{
struct fdtable *fdt;
int expanded = 0;
repeat:
fdt = files_fdtable(files);
/* Do we need to expand? */
if (nr < fdt->max_fds)
return expanded;
/* Can we expand? */
if (nr >= sysctl_nr_open)
return -EMFILE;
if (unlikely(files->resize_in_progress)) {
spin_unlock(&files->file_lock);
expanded = 1;
wait_event(files->resize_wait, !files->resize_in_progress);
spin_lock(&files->file_lock);
goto repeat;
}
/* All good, so we try */
files->resize_in_progress = true;
expanded = expand_fdtable(files, nr);
files->resize_in_progress = false;
wake_up_all(&files->resize_wait);
return expanded;
}
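The __releases and __acquires annotations are not ordinary GCC attributes: they are meant for the sparse static checker and are only active when it defines __CHECKER__. They have no effect at run time and exist only for checking.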
# define __releases(x) __attribute__((context(x,1,0)))
# define __acquire(x) __context__(x,1)
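For details see this link: http://blog.chinaunix.net/uid-14528823-id-4284946.html
files_fdtable was covered above. The function checks whether the table needs to grow; normally it does not, so it returns immediately.
Back to alloc_fd: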
/*
* If we needed to expand the fs array we
* might have blocked - try again.
*/
if (error)
goto repeat;
if (start <= files->next_fd)
files->next_fd = fd + 1;
__set_open_fd(fd, fdt);
if (flags & O_CLOEXEC)
__set_close_on_exec(fd, fdt);
else
__clear_close_on_exec(fd, fdt);
error = fd;
#if 1
/* Sanity check */
if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
rcu_assign_pointer(fdt->fd[fd], NULL);
}
#endif
out:
spin_unlock(&files->file_lock);
return error;
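The flow:
1 files->next_fd records the position right after the current one.
2 Call __set_open_fd(), then check O_CLOEXEC: if it is set call __set_close_on_exec(), otherwise call __clear_close_on_exec().
3 error = fd;
4 Sanity-check fdt->fd[fd], the slot where the file pointer will finally be stored. I will not cover rcu_access_pointer and rcu_assign_pointer: read rcu_access_pointer as a plain read and rcu_assign_pointer as a plain assignment.
5 Unlock and return the result.
Next, __set_open_fd. __clear_close_on_exec follows much the same idea, so I will not go through it.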
__set_open_fd is defined as:
static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
{
__set_bit(fd, fdt->open_fds);
fd /= BITS_PER_LONG;
if (!~fdt->open_fds[fd])
__set_bit(fd, fdt->full_fds_bits);
}
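Overall, this function first sets bit fd in open_fds, then divides by BITS_PER_LONG to find which block that bit belongs to, and checks whether the block is completely full (bitwise NOT followed by logical NOT is true only if every bit is 1). If so, it sets the bit for that block in full_fds_bits.
__set_bit is defined as: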
#define __set_bit(nr, addr) bitop(___set_bit, nr, addr)
bitop is defined as:
#define bitop(op, nr, addr) \
((__builtin_constant_p(nr) && \
__builtin_constant_p((uintptr_t)(addr) != (uintptr_t)NULL) && \
(uintptr_t)(addr) != (uintptr_t)NULL && \
__builtin_constant_p(*(const unsigned long *)(addr))) ? \
const##op(nr, addr) : op(nr, addr))
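__builtin_constant_p is a GCC builtin; as I understand it, it is evaluated at compile time much like sizeof and tells whether an expression is a compile-time constant. Here it checks whether nr (and the word at addr) is constant: if so the const variant of the operation is used, otherwise the regular one.
The definitions behind ___set_bit: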
#define ___set_bit arch___set_bit
#define arch___set_bit generic___set_bit
static __always_inline void
generic___set_bit(unsigned long nr, volatile unsigned long *addr)
{
unsigned long mask = BIT_MASK(nr);
unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
*p |= mask;
}
#define BIT_MASK(nr) (UL(1) << ((nr) % BITS_PER_LONG))
#define BIT_WORD(nr) ((nr) / BITS_PER_LONG)
#define UL(x) (_UL(x))
#define _UL(x) (_AC(x, UL))
#define __AC(X,Y) (X##Y)
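BIT_WORD is easy: it tells which word the bit is in. BIT_MASK is slightly more involved. The UL macro is there to type the constant: UL(1) becomes 1UL, guaranteeing an unsigned long, which is then shifted left by nr % BITS_PER_LONG bits. *p |= mask; is then a simple bit operation.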
do_filp_open
Code:
struct file *do_filp_open(int dfd, struct filename *pathname,
const struct open_flags *op)
{
struct nameidata nd;
int flags = op->lookup_flags;
struct file *filp;
set_nameidata(&nd, dfd, pathname, NULL);
filp = path_openat(&nd, op, flags | LOOKUP_RCU);
if (unlikely(filp == ERR_PTR(-ECHILD)))
filp = path_openat(&nd, op, flags);
if (unlikely(filp == ERR_PTR(-ESTALE)))
filp = path_openat(&nd, op, flags | LOOKUP_REVAL);
restore_nameidata();
return filp;
}
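This dfd is remarkably persistent: it has been AT_FDCWD all along and still has not been stored in any struct. The main flow:
1 Call set_nameidata() to fill in the members of nd.
2 Call path_openat() to open the file.
3 Check the returned filp; if it failed with -ECHILD or -ESTALE, retry with different flags. But does nd change in between?
4 Call restore_nameidata().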
set_nameidata
Source:
static inline void set_nameidata(struct nameidata *p, int dfd, struct filename *name,
const struct path *root)
{
__set_nameidata(p, dfd, name);
p->state = 0;
if (unlikely(root)) {
p->state = ND_ROOT_PRESET;
p->root = *root;
}
}
The flow is:
1 Call __set_nameidata
2 Set the state member of the object p points to to 0
3 The argument passed for root is NULL, so the body of the if is skipped
__set_nameidata
The source of __set_nameidata is as follows
static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name)
{
struct nameidata *old = current->nameidata;
p->stack = p->internal;
p->depth = 0;
p->dfd = dfd;
p->name = name;
p->path.mnt = NULL;
p->path.dentry = NULL;
p->total_link_count = old ? old->total_link_count : 0;
p->saved = old;
current->nameidata = p;
}
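The basic flow is:
1 Save the current process's nameidata pointer
2 Point the stack member at the internal member. The nameidata structure is shown below: internal is an array of struct saved, and stack points at it for convenient access later. This is what is used to keep track of symbolic links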
#define EMBEDDED_LEVELS 2
struct nameidata {
struct path path;
struct qstr last;
struct path root;
struct inode *inode; /* path.dentry.d_inode */
unsigned int flags, state;
unsigned seq, next_seq, m_seq, r_seq;
int last_type;
unsigned depth;
int total_link_count;
struct saved {
struct path link;
struct delayed_call done;
const char *name;
unsigned seq;
} *stack, internal[EMBEDDED_LEVELS];
struct filename *name;
struct nameidata *saved;
unsigned root_seq;
int dfd;
vfsuid_t dir_vfsuid;
umode_t dir_mode;
} __randomize_layout;
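3 Set depth to 0
4 Store dfd in the dfd member. dfd finally has a data structure to live in, so from here on it no longer needs a parameter of its own. The name member is set as well; the name pointer is the one obtained earlier from getname(filename);
5 Initialize the path member. This one gets used a lot
6 Set total_link_count: if the current process already had a nameidata, carry its value over, otherwise use 0
7 Save the current process's previous nameidata in saved, which forms a linked list
8 Replace the current process's nameidata with p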
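Steps 1 and 6 to 8 amount to pushing onto a per-task stack of in-flight lookups. Below is a minimal sketch under that reading. struct demo_nd, demo_current_nd and both functions are invented names; the restore half is my assumption of the matching pop and is not taken from the source quoted above.

```c
#include <assert.h>
#include <stddef.h>

struct demo_nd {
	int total_link_count;
	struct demo_nd *saved;   /* previous in-flight lookup, if any */
};

/* Stand-in for current->nameidata (per task in the kernel). */
static struct demo_nd *demo_current_nd;

/* Push p on top of the chain, inheriting the link count seen so far. */
static void demo_set_nd(struct demo_nd *p)
{
	struct demo_nd *old = demo_current_nd;

	p->total_link_count = old ? old->total_link_count : 0;
	p->saved = old;
	demo_current_nd = p;
}

/* Assumed counterpart: pop the top entry and make the previous one current. */
static void demo_restore_nd(void)
{
	struct demo_nd *now = demo_current_nd;

	assert(now != NULL);
	demo_current_nd = now->saved;
}
```

Nesting matters because a lookup can start another lookup before it finishes, and the inner one must not lose track of the outer one.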
path_openat
The code is as follows
static struct file *path_openat(struct nameidata *nd,
const struct open_flags *op, unsigned flags)
{
struct file *file;
int error;
file = alloc_empty_file(op->open_flag, current_cred());
if (IS_ERR(file))
return file;
if (unlikely(file->f_flags & __O_TMPFILE)) {
error = do_tmpfile(nd, flags, op, file);
} else if (unlikely(file->f_flags & O_PATH)) {
error = do_o_path(nd, flags, file);
} else {
const char *s = path_init(nd, flags);
while (!(error = link_path_walk(s, nd)) &&
(s = open_last_lookups(nd, file, op)) != NULL)
;
if (!error)
error = do_open(nd, file, op);
terminate_walk(nd);
}
if (likely(!error)) {
if (likely(file->f_mode & FMODE_OPENED))
return file;
WARN_ON(1);
error = -EINVAL;
}
fput(file);
if (error == -EOPENSTALE) {
if (flags & LOOKUP_RCU)
error = -ECHILD;
else
error = -ESTALE;
}
return ERR_PTR(error);
}
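The overall flow is:
1 Call alloc_empty_file to allocate a file structure
2 Choose how to open the file based on the flags. We only look at the last branch, the one that ends in do_open
3 If nothing went wrong, return the file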
current_cred()
current_cred() is defined as follows
#define current_cred() \
rcu_dereference_protected(current->cred, 1)
RCU again. For our purposes, just treat it as returning current->cred.
alloc_empty_file
alloc_empty_file is defined as follows
struct file *alloc_empty_file(int flags, const struct cred *cred)
{
static long old_max;
struct file *f;
int error;
/*
* Privileged users can go above max_files
*/
if (get_nr_files() >= files_stat.max_files && !capable(CAP_SYS_ADMIN)) {
/*
* percpu_counters are inaccurate. Do an expensive check before
* we go and fail.
*/
if (percpu_counter_sum_positive(&nr_files) >= files_stat.max_files)
goto over;
}
f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
if (unlikely(!f))
return ERR_PTR(-ENOMEM);
error = init_file(f, flags, cred);
if (unlikely(error)) {
kmem_cache_free(filp_cachep, f);
return ERR_PTR(error);
}
percpu_counter_inc(&nr_files);
return f;
over:
/* Ran out of filps - report that */
if (get_nr_files() > old_max) {
pr_info("VFS: file-max limit %lu reached\n", get_max_files());
old_max = get_nr_files();
}
return ERR_PTR(-ENFILE);
}
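The overall flow is:
1 Check whether the number of open files has reached the maximum. If it has and the caller lacks CAP_SYS_ADMIN, do the more expensive exact count, and fail with -ENFILE if the limit really is reached
2 Allocate memory for the file structure
3 Initialize the file structure
4 Increment the count of open files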
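The two-stage check in step 1 (cheap approximate read first, expensive exact sum only when needed) can be sketched as follows. struct demo_counter and demo_check_file_limit are invented for this illustration; a real percpu_counter keeps per-CPU deltas rather than two plain fields.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct demo_counter {
	long approx;  /* cheap, possibly stale global value */
	long exact;   /* what summing every per-CPU delta would give */
};

/* Returns 0 if a new file may be allocated, -ENFILE if the limit is hit. */
static int demo_check_file_limit(const struct demo_counter *c, long max_files,
				 bool privileged)
{
	assert(max_files >= 0);
	if (c->approx >= max_files && !privileged) {
		/* The approximate read can overshoot, so confirm with the exact sum. */
		if (c->exact >= max_files)
			return -ENFILE;
	}
	return 0;
}
```

The design keeps the common path cheap: the exact sum is only paid for when the approximate value already looks like it is at the limit.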
get_nr_files
static long get_nr_files(void)
{
return percpu_counter_read_positive(&nr_files);
}
percpu_counter_read_positive is defined as follows
static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
{
/* Prevent reloads of fbc->count */
s64 ret = READ_ONCE(fbc->count);
if (ret >= 0)
return ret;
return 0;
}
The READ_ONCE macro was covered earlier, with a reference link there. This simply reads the count member.
nr_files is a global variable
static struct percpu_counter nr_files __cacheline_aligned_in_smp;
__cacheline_aligned_in_smp is defined as follows
#define __cacheline_aligned_in_smp \
__attribute__((__aligned__(INTERNODE_CACHE_BYTES))) \
__page_aligned_data
#define INTERNODE_CACHE_BYTES (1 << INTERNODE_CACHE_SHIFT)
#define INTERNODE_CACHE_SHIFT L1_CACHE_SHIFT
#define L1_CACHE_SHIFT 5
#define __page_aligned_data __section(".data..page_aligned") __aligned(PAGE_SIZE)
# define PAGE_SIZE 4096
#define __section(section) __attribute__((__section__(section)))
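__attribute__((__aligned__(INTERNODE_CACHE_BYTES))) aligns the variable in memory. With the values quoted above that means a multiple of 32 bytes; note that on x86 L1_CACHE_SHIFT actually comes from CONFIG_X86_L1_CACHE_SHIFT, which is usually 6, giving 64 bytes.
__page_aligned_data places nr_files in the data section named ".data..page_aligned".
My guess is that nr_files records how many files are currently open.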
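capable
This touches the permission (capability) code, so I won't go into it
kmem_cache_zalloc
This belongs to memory management. All we need to know is that it allocates a zeroed block of memory for the file structure and returns a pointer to it
init_file
The code is as follows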
static int init_file(struct file *f, int flags, const struct cred *cred)
{
int error;
f->f_cred = get_cred(cred);
error = security_file_alloc(f);
if (unlikely(error)) {
put_cred(f->f_cred);
return error;
}
rwlock_init(&f->f_owner.lock);
spin_lock_init(&f->f_lock);
mutex_init(&f->f_pos_lock);
f->f_flags = flags;
f->f_mode = OPEN_FMODE(flags);
/* f->f_version: 0 */
/*
* We're SLAB_TYPESAFE_BY_RCU so initialize f_count last. While
* fget-rcu pattern users need to be able to handle spurious
* refcount bumps we should reinitialize the reused file first.
*/
atomic_long_set(&f->f_count, 1);
return 0;
}
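The flow is:
1 Set the f_cred member (taking a reference on the cred)
2 Call security_file_alloc so the security module can set up its per-file data; on failure, drop the cred reference and return the error
3 Initialize the rwlock, the spinlock and the mutex
4 Set the f_flags member
5 Set the f_mode member
6 Set the f_count member to 1
Basically it just initializes the individual members
get_cred and its related definitions are as follows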
static inline const struct cred *get_cred(const struct cred *cred)
{
return get_cred_many(cred, 1);
}
static inline const struct cred *get_cred_many(const struct cred *cred, int nr)
{
struct cred *nonconst_cred = (struct cred *) cred;
if (!cred)
return cred;
nonconst_cred->non_rcu = 0;
return get_new_cred_many(nonconst_cred, nr);
}
static inline struct cred *get_new_cred_many(struct cred *cred, int nr)
{
atomic_long_add(nr, &cred->usage);
return cred;
}
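atomic_long_add is an atomic operation; all we need to know is that here it adds 1 to usage.
So after a long detour, all this does is increment usage by 1
security_file_alloc
This is security-related, so I won't explore it here
OPEN_FMODE is defined as follows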
#define OPEN_FMODE(flag) ((__force fmode_t)(((flag + 1) & O_ACCMODE) | \
(flag & __FMODE_NONOTIFY)))
#define O_ACCMODE 00000003
#define O_WRONLY 00000001
#define __FMODE_NONOTIFY ((__force int) FMODE_NONOTIFY)
#define FMODE_NONOTIFY ((__force fmode_t)(1 << 26))
We passed O_WRONLY, which is 1, so (1 + 1) & O_ACCMODE gives 2, which is FMODE_WRITE
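The "+1" trick is easy to check in user space. The sketch below only reproduces the access-mode half of OPEN_FMODE and leaves out the FMODE_NONOTIFY part. demo_open_fmode and the DEMO_ constants are names made up here, and it assumes the Linux values O_RDONLY = 0, O_WRONLY = 1, O_RDWR = 2.

```c
#include <assert.h>
#include <fcntl.h>

#define DEMO_FMODE_READ  0x1u   /* same bit values as the kernel's FMODE_READ */
#define DEMO_FMODE_WRITE 0x2u   /* and FMODE_WRITE */

/* The access-mode half of OPEN_FMODE: (flag + 1) & O_ACCMODE. */
static unsigned demo_open_fmode(int flags)
{
	/* An access mode of 3 is a special case that this sketch does not cover. */
	assert((flags & O_ACCMODE) != O_ACCMODE);
	return (unsigned)((flags + 1) & O_ACCMODE);
}
```

Adding 1 maps the 0/1/2 encoding of the open flags onto a pair of independent read and write bits: 0 becomes 01, 1 becomes 10, and 2 becomes 11.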
path_init
The code is as follows
/* must be paired with terminate_walk() */
static const char *path_init(struct nameidata *nd, unsigned flags)
{
int error;
const char *s = nd->name->name;
/* LOOKUP_CACHED requires RCU, ask caller to retry */
if ((flags & (LOOKUP_RCU | LOOKUP_CACHED)) == LOOKUP_CACHED)
return ERR_PTR(-EAGAIN);
if (!*s)
flags &= ~LOOKUP_RCU;
if (flags & LOOKUP_RCU)
rcu_read_lock();
else
nd->seq = nd->next_seq = 0;
nd->flags = flags;
nd->state |= ND_JUMPED;
nd->m_seq = __read_seqcount_begin(&mount_lock.seqcount);
nd->r_seq = __read_seqcount_begin(&rename_lock.seqcount);
smp_rmb();
if (nd->state & ND_ROOT_PRESET) {
struct dentry *root = nd->root.dentry;
struct inode *inode = root->d_inode;
if (*s && unlikely(!d_can_lookup(root)))
return ERR_PTR(-ENOTDIR);
nd->path = nd->root;
nd->inode = inode;
if (flags & LOOKUP_RCU) {
nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
nd->root_seq = nd->seq;
} else {
path_get(&nd->path);
}
return s;
}
nd->root.mnt = NULL;
/* Absolute pathname -- fetch the root (LOOKUP_IN_ROOT uses nd->dfd). */
if (*s == '/' && !(flags & LOOKUP_IN_ROOT)) {
error = nd_jump_root(nd);
if (unlikely(error))
return ERR_PTR(error);
return s;
}
/* Relative pathname -- get the starting-point it is relative to. */
if (nd->dfd == AT_FDCWD) {
if (flags & LOOKUP_RCU) {
struct fs_struct *fs = current->fs;
unsigned seq;
do {
seq = read_seqcount_begin(&fs->seq);
nd->path = fs->pwd;
nd->inode = nd->path.dentry->d_inode;
nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
} while (read_seqcount_retry(&fs->seq, seq));
} else {
get_fs_pwd(current->fs, &nd->path);
nd->inode = nd->path.dentry->d_inode;
}
} else {
/* Caller must check execute permissions on the starting path component */
struct fd f = fdget_raw(nd->dfd);
struct dentry *dentry;
if (!f.file)
return ERR_PTR(-EBADF);
if (flags & LOOKUP_LINKAT_EMPTY) {
if (f.file->f_cred != current_cred() &&
!ns_capable(f.file->f_cred->user_ns, CAP_DAC_READ_SEARCH)) {
fdput(f);
return ERR_PTR(-ENOENT);
}
}
dentry = f.file->f_path.dentry;
if (*s && unlikely(!d_can_lookup(dentry))) {
fdput(f);
return ERR_PTR(-ENOTDIR);
}
nd->path = f.file->f_path;
if (flags & LOOKUP_RCU) {
nd->inode = nd->path.dentry->d_inode;
nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
} else {
path_get(&nd->path);
nd->inode = nd->path.dentry->d_inode;
}
fdput(f);
}
/* For scoped-lookups we need to set the root to the dirfd as well. */
if (flags & LOOKUP_IS_SCOPED) {
nd->root = nd->path;
if (flags & LOOKUP_RCU) {
nd->root_seq = nd->seq;
} else {
path_get(&nd->root);
nd->state |= ND_ROOT_GRABBED;
}
}
return s;
}
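The overall flow is:
1 If flags has LOOKUP_CACHED set without LOOKUP_RCU, return -EAGAIN right away
2 Check whether the path name is empty; if it is, clear the LOOKUP_RCU flag
3 Check whether flags has LOOKUP_RCU set (we study the case where it is); if so, call rcu_read_lock()
4 Set the flags and state members
5 Sample two sequence counters, mount_lock and rename_lock, into m_seq and r_seq
6 Call smp_rmb, a read memory barrier that keeps the reads ordered on multiprocessor systems
7 Check whether state has ND_ROOT_PRESET set. Ours definitely doesn't, because root was passed as NULL at the call site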
set_nameidata(&nd, dfd, pathname, NULL);
static inline void set_nameidata(struct nameidata *p, int dfd, struct filename *name,
const struct path *root)
{
__set_nameidata(p, dfd, name);
p->state = 0;
if (unlikely(root)) {
p->state = ND_ROOT_PRESET;
p->root = *root;
}
}
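8 Set nd->root.mnt to NULL
9 Check whether this is an absolute path. Our case is a relative path, so this is skipped
10 Check whether dfd is AT_FDCWD. Ours is, so we enter this branch
11 Check whether flags has LOOKUP_RCU set. It certainly does
12 Get the fs member of the current process
13 Call read_seqcount_begin to announce that we are about to read the data in fs
14 Set the path, inode and seq members. In the debugger, the name in path's dentry turned out to be only the last component of the working directory rather than the full path. I first guessed this had something to do with mounts, but a dentry's d_name only ever holds its own component name; the rest of the path comes from following d_parent
[Screenshot: the debugged kernel, showing the process's working directory]
[Screenshot: the name member of the dentry in nameidata's path] Which inode this is isn't clear yet; I'll keep an eye on it later
15 Call read_seqcount_retry. It checks whether the data was written to since read_seqcount_begin; if so, the data is read again, which is why it sits in the while condition
16 Check whether flags has LOOKUP_IS_SCOPED set; that case isn't discussed here
In short, this function sets the path and inode members of nd. The helper functions involved all belong to RCU and SMP, which are not covered here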
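The read/retry pattern from steps 13 to 15 can be imitated with a single-threaded mock. This is only a sketch of the idea: the demo_ names are invented, there are no memory barriers, the reader does not wait for an in-progress write, and the "concurrent" writer is simulated by a callback that runs once in the middle of the read.

```c
#include <assert.h>

struct demo_seqcount { unsigned sequence; };  /* odd while a write is in progress */

struct demo_fs {
	struct demo_seqcount seq;
	int pwd;   /* stand-in for fs->pwd */
};

static unsigned demo_read_begin(const struct demo_seqcount *s)
{
	return s->sequence;
}

/* Nonzero if a writer was active at, or ran after, demo_read_begin(). */
static int demo_read_retry(const struct demo_seqcount *s, unsigned start)
{
	return (start & 1) || s->sequence != start;
}

static void demo_write(struct demo_fs *fs, int new_pwd)
{
	assert((fs->seq.sequence & 1) == 0);
	fs->seq.sequence++;   /* becomes odd: write in progress */
	fs->pwd = new_pwd;
	fs->seq.sequence++;   /* even again: write finished */
}

/* Reader: copy pwd, and redo the copy if a write raced with it. */
static int demo_read_pwd(struct demo_fs *fs, void (*interfere)(struct demo_fs *))
{
	unsigned seq;
	int pwd;

	do {
		seq = demo_read_begin(&fs->seq);
		pwd = fs->pwd;
		if (interfere) {          /* simulate one concurrent writer */
			interfere(fs);
			interfere = 0;
		}
	} while (demo_read_retry(&fs->seq, seq));
	return pwd;
}

static void demo_writer_once(struct demo_fs *fs) { demo_write(fs, 42); }
```

The reader never blocks the writer; it just notices afterwards that its snapshot may be inconsistent and takes another one.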
link_path_walk
The code is as follows
/*
* Name resolution.
* This is the basic name resolution function, turning a pathname into
* the final dentry. We expect 'base' to be positive and a directory.
*
* Returns 0 and nd will have valid dentry and mnt on success.
* Returns error and drops reference to input namei data on failure.
*/
static int link_path_walk(const char *name, struct nameidata *nd)
{
int depth = 0; // depth <= nd->depth
int err;
nd->last_type = LAST_ROOT;
nd->flags |= LOOKUP_PARENT;
if (IS_ERR(name))
return PTR_ERR(name);
while (*name=='/')
name++;
if (!*name) {
nd->dir_mode = 0; // short-circuit the 'hardening' idiocy
return 0;
}
/* At this point we know we have a real path component. */
for(;;) {
struct mnt_idmap *idmap;
const char *link;
u64 hash_len;
int type;
idmap = mnt_idmap(nd->path.mnt);
err = may_lookup(idmap, nd);
if (err)
return err;
hash_len = hash_name(nd->path.dentry, name);
type = LAST_NORM;
if (name[0] == '.') switch (hashlen_len(hash_len)) {
case 2:
if (name[1] == '.') {
type = LAST_DOTDOT;
nd->state |= ND_JUMPED;
}
break;
case 1:
type = LAST_DOT;
}
if (likely(type == LAST_NORM)) {
struct dentry *parent = nd->path.dentry;
nd->state &= ~ND_JUMPED;
if (unlikely(parent->d_flags & DCACHE_OP_HASH)) {
struct qstr this = { { .hash_len = hash_len }, .name = name };
err = parent->d_op->d_hash(parent, &this);
if (err < 0)
return err;
hash_len = this.hash_len;
name = this.name;
}
}
nd->last.hash_len = hash_len;
nd->last.name = name;
nd->last_type = type;
name += hashlen_len(hash_len);
if (!*name)
goto OK;
/*
* If it wasn't NUL, we know it was '/'. Skip that
* slash, and continue until no more slashes.
*/
do {
name++;
} while (unlikely(*name == '/'));
if (unlikely(!*name)) {
OK:
/* pathname or trailing symlink, done */
if (!depth) {
nd->dir_vfsuid = i_uid_into_vfsuid(idmap, nd->inode);
nd->dir_mode = nd->inode->i_mode;
nd->flags &= ~LOOKUP_PARENT;
return 0;
}
/* last component of nested symlink */
name = nd->stack[--depth].name;
link = walk_component(nd, 0);
} else {
/* not the last component */
link = walk_component(nd, WALK_MORE);
}
if (unlikely(link)) {
if (IS_ERR(link))
return PTR_ERR(link);
/* a symlink to follow */
nd->stack[depth++].name = name;
name = link;
continue;
}
if (unlikely(!d_can_lookup(nd->path.dentry))) {
if (nd->flags & LOOKUP_RCU) {
if (!try_to_unlazy(nd))
return -ECHILD;
}
return -ENOTDIR;
}
}
}
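From the comment above, this function's job is to go from a path name to the final dentry
The overall flow is:
1 Give last_type an initial value of LAST_ROOT. last_type records what kind of name the current component is; for the path a/b/c/d, the current component is a, b, c, d in successive iterations
LAST_ROOT is defined as follows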
enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
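As the names suggest, the enumerators stand for normal, root, dot and dot-dot. Those really are the only four cases
2 Add LOOKUP_PARENT to flags. Judging from the code, it marks that the walk is still resolving parent directories; it is cleared again once the last component is reached
3 Skip leading "/" characters in the path
4 Enter the loop
5 Call may_lookup to check permissions
6 Call hash_name to get a value made up of the hash and the length of the current path component (for dir1/dir2 the components are dir1 and dir2)
7 Set the type of the current component to LAST_NORM, the ordinary case by default, which is reasonable
8 Check whether it is . or .. (not discussed here; we only look at the ordinary case)
9 If the type is LAST_NORM, clear ND_JUMPED in state
10 Set the last member
11 Skip past this component (for example, dir1/dir2 becomes /dir2)
12 Skip the '/' characters
13 Check whether the path name has ended
14 If it has, set dir_vfsuid and dir_mode, clear LOOKUP_PARENT in flags, and return
15 If not, call walk_component, which looks up the dentry based on nd and stores it in the path member. If the current component is a symlink, it returns the link's target path; that case is not considered for now
16 Check link
17 Check that the result is a directory that can be searched
Overall, it computes a hash from the parent dentry and the component name, then uses that hash and the name to find the dentry, over and over, until it reaches the dentry of the parent of the last component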
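Stripped of hashing, permissions and symlinks, the loop structure looks roughly like the sketch below. struct demo_last and demo_link_path_walk are invented names, strcspn stands in for hash_name's length computation, and the visit callback stands in for walk_component.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct demo_last { const char *name; size_t len; };

/*
 * Walk every component except the last one, calling visit() for each,
 * and leave the final component in *last (like nd->last).
 * Returns the number of intermediate components visited.
 */
static int demo_link_path_walk(const char *name, struct demo_last *last,
			       void (*visit)(const char *, size_t))
{
	int walked = 0;

	assert(name != NULL && last != NULL);
	last->name = NULL;
	last->len = 0;
	while (*name == '/')
		name++;
	if (!*name)
		return 0;
	for (;;) {
		size_t len = strcspn(name, "/");   /* length of this component */

		last->name = name;
		last->len = len;
		name += len;
		while (*name == '/')               /* skip the run of slashes */
			name++;
		if (!*name)
			return walked;             /* last component: stop without walking it */
		if (visit)
			visit(last->name, last->len);
		walked++;
	}
}
```

For the scenario in this post, "dirTest/ppshuoTest", only dirTest is walked here; ppshuoTest is left in last for the final open step.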
hash_name
static inline u64 hash_name(const void *salt, const char *name)
{
unsigned long a = 0, b, x = 0, y = (unsigned long)salt;
unsigned long adata, bdata, mask, len;
const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
len = 0;
goto inside;
do {
HASH_MIX(x, y, a);
len += sizeof(unsigned long);
inside:
a = load_unaligned_zeropad(name+len);
b = a ^ REPEAT_BYTE('/');
} while (!(has_zero(a, &adata, &constants) | has_zero(b, &bdata, &constants)));
adata = prep_zero_mask(a, adata, &constants);
bdata = prep_zero_mask(b, bdata, &constants);
mask = create_zero_mask(adata | bdata);
x ^= a & zero_bytemask(mask);
return hashlen_create(fold_hash(x, y), len + find_zero(mask));
}
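The salt parameter is nd->path.dentry, and name is the path name.
The overall flow is fairly simple:
1 Read the name 8 bytes at a time
2 Stop the loop when those 8 bytes contain a 0 or a '/'
3 Compute the length of this path component, and compute the hash from the values seen along the way
4 Pack the hash and the length into a single u64
I won't go through the hash computation since I don't fully understand it. The focus is on how the length is computed
The tricky part is the following helpers.
WORD_AT_A_TIME_CONSTANTS is defined as follows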
#define WORD_AT_A_TIME_CONSTANTS { REPEAT_BYTE(0x01), REPEAT_BYTE(0x80) }
REPEAT_BYTE is defined as follows
#define REPEAT_BYTE(x) ((~0ul / 0xff) * (x))
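REPEAT_BYTE(x) returns x times 0x0101010101010101, i.e. the byte x repeated in every byte of the word
So in the constants variable, one_bits is 0x0101010101010101 and high_bits is 0x8080808080808080
They set things up for the filtering that follows
HASH_MIX is defined as follows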
#define HASH_MIX(x, y, a) \
( x ^= (a), \
y ^= x, x = rol64(x,12),\
x += y, y = rol64(y,45),\
y *= 9 )
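The flow: x is XORed with a and the result stored in x; y is XORed with x and stored in y; x is rotated left by 12 bits with rol64; y is added to x; y is rotated left by 45 bits with rol64; and finally y is multiplied by 9.
I didn't fully follow this sequence. Presumably it is a mixing step meant to spread the hash values more evenly?
rol64 is defined as follows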
static inline __u64 rol64(__u64 word, unsigned int shift)
{
return (word << (shift & 63)) | (word >> ((-shift) & 63));
}
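This one is easy to understand: it rotates word left by shift bits, with the bits that fall off the top coming back in at the bottom. For example, rol64(0x1234567890ABCDEF, 12) is 0x4567890ABCDEF123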
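The rotation is easy to try outside the kernel. The sketch below copies the same expression into a user-space function; demo_rol64 and demo_rol64_check are names made up here.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 64-bit word left by shift bits (same expression as rol64). */
static uint64_t demo_rol64(uint64_t word, unsigned int shift)
{
	return (word << (shift & 63)) | (word >> ((-shift) & 63));
}

/* Rotating by 12 bits moves the top three hex digits to the bottom. */
void demo_rol64_check(void)
{
	assert(demo_rol64(0x1234567890ABCDEFULL, 12) == 0x4567890ABCDEF123ULL);
}
```

Masking both shift counts with 63 keeps the expression well defined even when shift is 0 or 64, where a plain shift by 64 would be undefined in C.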
load_unaligned_zeropad is defined as follows
/*
* Load an unaligned word from kernel space.
*
* In the (very unlikely) case of the word being a page-crosser
* and the next page not being mapped, take the exception and
* return zeroes in the non-existing part.
*/
static inline unsigned long load_unaligned_zeropad(const void *addr)
{
unsigned long ret;
asm volatile(
"1: mov %[mem], %[ret]\n"
"2:\n"
_ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_ZEROPAD)
: [ret] "=r" (ret)
: [mem] "m" (*(unsigned long *)addr));
return ret;
}
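Well... as I read it, this loads the 8 bytes starting at addr. As for _ASM_EXTABLE_TYPE, going by the comment above it registers an exception-table entry for the mov at label 1: if the word crosses into an unmapped page and the load faults, the fixup returns zeros for the missing part and execution resumes at label 2. I haven't looked into how that mechanism works in detail; if you know more, please share in the comments.
has_zero is defined as follows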
/* Return nonzero if it has a zero */
static inline unsigned long has_zero(unsigned long a, unsigned long *bits, const struct word_at_a_time *c)
{
unsigned long mask = ((a - c->one_bits) & ~a) & c->high_bits;
*bits = mask;
return mask;
}
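This checks whether a contains a zero byte. Since b is a XORed with REPEAT_BYTE('/'), every '/' byte of a becomes a zero byte of b, so calling it on both a and b detects either a NUL or a '/'. In the result the first (lowest) zero byte is marked with 0x80 and every byte below it is 0; bytes above the first zero byte can also come out as 0x80, but only the lowest mark is used afterwards. For a detailed analysis see the linked reference, which is very well written
Example
[Screenshot: debugger showing the value of a] The low three bytes of a are 'q', 'w', 'e' and the next byte is 0, which means the path has been fully consumed. The bytes after that no longer matter.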
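The "qwe" case can be reproduced in user space. This is a sketch under two assumptions: a little-endian machine (as on x86), and a zero-padded 8-byte copy in place of load_unaligned_zeropad. All demo_ and DEMO_ names are invented for the illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_REPEAT_BYTE(x) ((~0ULL / 0xff) * (x))
#define DEMO_ONE_BITS  DEMO_REPEAT_BYTE(0x01)
#define DEMO_HIGH_BITS DEMO_REPEAT_BYTE(0x80)

/* Nonzero if any byte of a is zero (same expression as has_zero). */
static uint64_t demo_has_zero(uint64_t a)
{
	return ((a - DEMO_ONE_BITS) & ~a) & DEMO_HIGH_BITS;
}

/* Copy up to 8 bytes of s into a word, zero-padding the rest. */
static uint64_t demo_load_word(const char *s)
{
	unsigned char buf[8] = {0};
	uint64_t a;
	size_t n;

	assert(s != NULL);
	n = strlen(s);
	memcpy(buf, s, n < 8 ? n : 8);
	memcpy(&a, buf, sizeof(a));
	return a;
}

/* Nonzero if the first 8 bytes of s contain a NUL or a '/'. */
static uint64_t demo_has_end(const char *s)
{
	uint64_t a = demo_load_word(s);
	uint64_t b = a ^ DEMO_REPEAT_BYTE('/');   /* '/' bytes become zero bytes */

	return demo_has_zero(a) | demo_has_zero(b);
}
```

On a little-endian machine, demo_has_zero(demo_load_word("qwe")) comes out as 0x8080808080000000: the three name bytes give 0 and the zero bytes above them give 0x80.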
prep_zero_mask is defined as follows
static inline unsigned long prep_zero_mask(unsigned long a, unsigned long bits, const struct word_at_a_time *c)
{
return bits;
}
It simply returns bits
create_zero_mask is defined as follows
static inline unsigned long create_zero_mask(unsigned long bits)
{
bits = (bits - 1) & ~bits;
return bits >> 7;
}
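First, to be clear: every byte of bits is either 0x80 or 0. This function sets every byte below the lowest 0x80 byte to 0xFF and clears everything else. For example, 0x8080808080000000 gives 0xFFFFFF. Subtracting 1 turns all the low zero bytes into 0xFF up to the first 0x80, which itself becomes 0x7F, i.e. all bits set except the top one. ANDing with the complement of the original value clears everything above that point, which the subtraction did not touch, and keeps everything below it. The right shift by 7 then drops the seven extra bits left over from the 0x7F.
At this point the number of 0xFF bytes equals the number of characters before the 0 or '/'
zero_bytemask is defined as follows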
#define zero_bytemask(mask) (mask)
Quite simple
fold_hash is defined as follows
#define GOLDEN_RATIO_64 0x61C8864680B583EBull
static inline unsigned int fold_hash(unsigned long x, unsigned long y)
{
y ^= x * GOLDEN_RATIO_64;
y *= GOLDEN_RATIO_64;
return y >> 32;
}
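A flurry of arithmetic. As far as I can tell this is multiplicative hashing with a golden-ratio constant: the two words are mixed by multiplication and the top 32 bits are kept.
find_zero is defined as follows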
static inline long count_masked_bytes(unsigned long mask)
{
return mask*0x0001020304050608ul >> 56;
}
static inline unsigned long find_zero(unsigned long mask)
{
return count_masked_bytes(mask);
}
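Note that mask consists of some number of consecutive 0xFF bytes starting from the low end. This function returns how many 0xFF bytes mask contains.
With a single 0xFF, 0x0001020304050608ul times 0xFF is 0x01010101010101F8: the 01 byte has effectively moved up into the top 8 bits, so after the right shift by 56 only 1 is left.
The same reasoning works for two bytes and so on. The constant is nicely regular, running from 1 to 6 and then 8. As for why the last byte isn't 7: 0xffffffffffffff × 0x0001020304050607ul is 0x6fefdfcfbfaf9f9 (truncated to 64 bits), which leaves 6 after the shift, whereas 0xffffffffffffff × 0x0001020304050608ul is 0x7fefdfcfbfaf9f8, which leaves 7. Interesting.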
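Both steps are pure arithmetic, so they can be checked directly. The sketch below copies the two expressions into user-space functions with invented demo_ names and uses uint64_t in place of unsigned long.

```c
#include <assert.h>
#include <stdint.h>

/* Turn a has_zero()-style mask into 0xFF bytes below the first flagged byte. */
static uint64_t demo_create_zero_mask(uint64_t bits)
{
	assert(bits != 0);
	bits = (bits - 1) & ~bits;
	return bits >> 7;
}

/* Count the 0xFF bytes of such a mask with one multiply and one shift. */
static unsigned demo_count_masked_bytes(uint64_t mask)
{
	return (unsigned)(mask * 0x0001020304050608ULL >> 56);
}
```

Chaining the two on 0x8080808080000000 gives the mask 0xFFFFFF and then the count 3, which is the length of "qwe".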
hashlen_create is defined as follows
#define hashlen_create(hash, len) ((u64)(len)<<32 | (u32)(hash))
The length goes in the upper half and the hash in the lower half; nothing more to say
hashlen_len
hashlen_len is defined as follows
#define hashlen_len(hashlen) ((u32)((hashlen) >> 32))
Very simple: it extracts len.
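A quick round trip shows the layout. The macros below mirror hashlen_create, hashlen_hash and hashlen_len with user-space types; the demo_ prefix and the demo_pack helper are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define demo_hashlen_create(hash, len) ((uint64_t)(len) << 32 | (uint32_t)(hash))
#define demo_hashlen_hash(hashlen)     ((uint32_t)(hashlen))
#define demo_hashlen_len(hashlen)      ((uint32_t)((hashlen) >> 32))

/* Pack a hash and a length, checking that both survive the round trip. */
static uint64_t demo_pack(uint32_t hash, uint32_t len)
{
	uint64_t hl = demo_hashlen_create(hash, len);

	assert(demo_hashlen_hash(hl) == hash && demo_hashlen_len(hl) == len);
	return hl;
}
```

Keeping both in one 64-bit value means a single integer comparison can reject a candidate whose hash or length differs.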
walk_component
walk_component is defined as follows
static const char *walk_component(struct nameidata *nd, int flags)
{
struct dentry *dentry;
/*
* "." and ".." are special - ".." especially so because it has
* to be able to know about the current root directory and
* parent relationships.
*/
if (unlikely(nd->last_type != LAST_NORM)) {
if (!(flags & WALK_MORE) && nd->depth)
put_link(nd);
return handle_dots(nd, nd->last_type);
}
dentry = lookup_fast(nd);
if (IS_ERR(dentry))
return ERR_CAST(dentry);
if (unlikely(!dentry)) {
dentry = lookup_slow(&nd->last, nd->path.dentry, nd->flags);
if (IS_ERR(dentry))
return ERR_CAST(dentry);
}
if (!(flags & WALK_MORE) && nd->depth)
put_link(nd);
return step_into(nd, flags, dentry);
}
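The overall flow is:
1 First check whether the current component is . or ..; if so it is handled separately.
2 Call lookup_fast to look up the dentry based on nd.
3 If that finds nothing, call lookup_slow to look again. (Here we only cover the case where lookup_fast succeeds)
4 If WALK_MORE is not set in flags and depth is nonzero, call put_link
5 Call step_into; if the component is a symlink it returns the link's target, otherwise it returns NULL.
lookup_fast is defined as follows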
static struct dentry *lookup_fast(struct nameidata *nd)
{
struct dentry *dentry, *parent = nd->path.dentry;
int status = 1;
/*
* Rename seqlock is not required here because in the off chance
* of a false negative due to a concurrent rename, the caller is
* going to fall back to non-racy lookup.
*/
if (nd->flags & LOOKUP_RCU) {
dentry = __d_lookup_rcu(parent, &nd->last, &nd->next_seq);
if (unlikely(!dentry)) {
if (!try_to_unlazy(nd))
return ERR_PTR(-ECHILD);
return NULL;
}
/*
* This sequence count validates that the parent had no
* changes while we did the lookup of the dentry above.
*/
if (read_seqcount_retry(&parent->d_seq, nd->seq))
return ERR_PTR(-ECHILD);
status = d_revalidate(dentry, nd->flags);
if (likely(status > 0))
return dentry;
if (!try_to_unlazy_next(nd, dentry))
return ERR_PTR(-ECHILD);
if (status == -ECHILD)
/* we'd been told to redo it in non-rcu mode */
status = d_revalidate(dentry, nd->flags);
} else {
dentry = __d_lookup(parent, &nd->last);
if (unlikely(!dentry))
return NULL;
status = d_revalidate(dentry, nd->flags);
}
if (unlikely(status <= 0)) {
if (!status)
d_invalidate(dentry);
dput(dentry);
return ERR_PTR(status);
}
return dentry;
}
The overall flow is:
1 Set parent to nd->path.dentry (nd's path member was set in path_init)
2 Check whether nd's flags has LOOKUP_RCU set. It does, thanks to this line
filp = path_openat(&nd, op, flags | LOOKUP_RCU);
3 Call __d_lookup_rcu to find the dentry
4 Call d_revalidate to validate it again
5 Check the returned status; if it is fine, return the dentry
__d_lookup_rcu is defined as follows
struct dentry *__d_lookup_rcu(const struct dentry *parent,
const struct qstr *name,
unsigned *seqp)
{
u64 hashlen = name->hash_len;
const unsigned char *str = name->name;
struct hlist_bl_head *b = d_hash(hashlen_hash(hashlen));
struct hlist_bl_node *node;
struct dentry *dentry;
if (unlikely(parent->d_flags & DCACHE_OP_COMPARE))
return __d_lookup_rcu_op_compare(parent, name, seqp);
hlist_bl_for_each_entry_rcu(dentry, node, b, d_hash) {
unsigned seq;
seq = raw_seqcount_begin(&dentry->d_seq);
if (dentry->d_parent != parent)
continue;
if (d_unhashed(dentry))
continue;
if (dentry->d_name.hash_len != hashlen)
continue;
if (dentry_cmp(dentry, str, hashlen_len(hashlen)) != 0)
continue;
*seqp = seq;
return dentry;
}
return NULL;
}
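The overall flow is:
1 Extract the hash value from hashlen
2 Call d_hash to get the hlist_bl_head bucket b
3 Walk that list and find the dentry by comparing parent, hash_len and name
d_hash is defined as follows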
static unsigned int d_hash_shift __ro_after_init;
static struct hlist_bl_head *dentry_hashtable __ro_after_init;
static inline struct hlist_bl_head *d_hash(unsigned int hash)
{
return dentry_hashtable + (hash >> d_hash_shift);
}
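Flow
1 Shift the hash right by d_hash_shift bits
2 Use the shifted value as an index into dentry_hashtable; each bucket of the hash table is itself the head of a list
__ro_after_init means the variable cannot be changed after initialization.
hlist_bl_for_each_entry_rcu is defined as follows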
#define hlist_bl_for_each_entry_rcu(tpos, pos, head, member) \
for (pos = hlist_bl_first_rcu(head); \
pos && \
({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1; }); \
pos = rcu_dereference_raw(pos->next))
The macro expands to
for (node = hlist_bl_first_rcu(b); node && ({ dentry = hlist_bl_entry(node, typeof(*dentry), d_hash); 1; }); node = rcu_dereference_raw(node->next))
The relevant declarations are as follows
struct hlist_bl_head *b = d_hash(hashlen_hash(hashlen));
struct hlist_bl_node *node;
struct dentry *dentry;
struct dentry {
...
struct hlist_bl_node d_hash; /* lookup hash list */
...
}
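The overall flow is
1 Get the first node in the list
2 Check whether the node is NULL
3 If not, compute the dentry that contains this node from the node's address
4 After each iteration, move on to the next node
hlist_bl_first_rcu and rcu_dereference_raw both involve RCU and are not covered here
hlist_bl_entry and its related macros are defined as follows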
#define hlist_bl_entry(ptr, type, member) container_of(ptr,type,member)
#define container_of(ptr, type, member) ({ \
void *__mptr = (void *)(ptr); \
static_assert(__same_type(*(ptr), ((type *)0)->member) || \
__same_type(*(ptr), void), \
"pointer type mismatch in container_of()"); \
((type *)(__mptr - offsetof(type, member))); })
#define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
#define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
#define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
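It is just a thin wrapper around container_of
container_of works as follows
1 Convert ptr to a void pointer, __mptr
2 Check at compile time that the type ptr points to matches the type of member, or that ptr points to void; if neither holds, compilation fails
3 Compute the type pointer by subtracting the member's offset
static_assert here wraps C11's _Static_assert (C++11 has a feature with the same name)
Put simply, the check happens at compile time: if the expression is 0, the error message is reported
__builtin_types_compatible_p is a GCC builtin that tests whether two types are compatible, which you can roughly read as being the same type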
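The pointer arithmetic at the heart of container_of can be shown with a simplified version that drops the compile-time type check. The demo_ macro and structs below are invented for this sketch; struct demo_dentry only borrows the d_hash member name from the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified container_of without the kernel's type check. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_node { struct demo_node *next; };

struct demo_dentry {
	int id;
	struct demo_node d_hash;   /* list node embedded in the object */
};

/* Recover the dentry that embeds a given list node. */
static struct demo_dentry *demo_dentry_of(struct demo_node *node)
{
	assert(node != NULL);
	return demo_container_of(node, struct demo_dentry, d_hash);
}
```

This is why the hash list can link bare nodes together: given any node, subtracting the member's offset leads back to the object that contains it.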
dentry_cmp, called above, ends up in dentry_string_cmp, which is defined as follows
/*
* NOTE! 'cs' and 'scount' come from a dentry, so it has a
* aligned allocation for this particular component. We don't
* strictly need the load_unaligned_zeropad() safety, but it
* doesn't hurt either.
*
* In contrast, 'ct' and 'tcount' can be from a pathname, and do
* need the careful unaligned handling.
*/
static inline int dentry_string_cmp(const unsigned char *cs, const unsigned char *ct, unsigned tcount)
{
unsigned long a,b,mask;
for (;;) {
a = read_word_at_a_time(cs);
b = load_unaligned_zeropad(ct);
if (tcount < sizeof(unsigned long))
break;
if (unlikely(a != b))
return 1;
cs += sizeof(unsigned long);
ct += sizeof(unsigned long);
tcount -= sizeof(unsigned long);
if (!tcount)
return 0;
}
mask = bytemask_from_count(tcount);
return unlikely(!!((a ^ b) & mask));
}
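The comment explains why load_unaligned_zeropad is used to load b.
The flow is fairly simple: inside the loop it compares 8 bytes at a time. The remaining tail of fewer than 8 bytes is then compared with a bit mask. It returns 0 if the strings are equal and 1 if they differ.
read_word_at_a_time is defined as follows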
static inline bool kasan_check_read(const volatile void *p, unsigned int size)
{
return true;
}
static __no_kasan_or_inline unsigned long read_word_at_a_time(const void *addr)
{
kasan_check_read(addr, 1);
return *(unsigned long *)addr;
}
A simple cast and read.
bytemask_from_count is defined as follows
#define bytemask_from_count(cnt) (~(~0ul << (cnt)*8))
It produces a mask with the low (cnt)*8 bits set to 1
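Here is a user-space sketch of the word-at-a-time comparison with the masked tail. It assumes a little-endian machine, and, instead of load_unaligned_zeropad, it requires the caller to pass buffers that are readable up to the next multiple of sizeof(unsigned long). The demo_ names are invented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define demo_bytemask_from_count(cnt) (~(~0UL << (cnt) * 8))

/* Load one full word; the caller guarantees sizeof(long) readable bytes at p. */
static unsigned long demo_load(const unsigned char *p)
{
	unsigned long w;

	memcpy(&w, p, sizeof(w));
	return w;
}

/* Compare the first count bytes of cs and ct; returns 0 if equal, 1 otherwise. */
static int demo_string_cmp(const unsigned char *cs, const unsigned char *ct,
			   size_t count)
{
	unsigned long a, b, mask;

	assert(cs != NULL && ct != NULL);
	for (;;) {
		a = demo_load(cs);
		b = demo_load(ct);
		if (count < sizeof(unsigned long))
			break;
		if (a != b)
			return 1;
		cs += sizeof(unsigned long);
		ct += sizeof(unsigned long);
		count -= sizeof(unsigned long);
		if (!count)
			return 0;
	}
	mask = demo_bytemask_from_count(count);   /* keep only the low count bytes */
	return !!((a ^ b) & mask);
}
```

The mask is what lets the last load read past the end of the name safely: whatever bytes follow the first count bytes are simply ignored in the comparison.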
d_revalidate is defined as follows
static inline int d_revalidate(struct dentry *dentry, unsigned int flags)
{
if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE))
return dentry->d_op->d_revalidate(dentry, flags);
else
return 1;
}
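The flow is simple: it checks whether dentry->d_flags has DCACHE_OP_REVALIDATE set. If so it calls the function that dentry->d_op->d_revalidate points to; otherwise it returns 1.
step_into is defined as follows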
static const char *step_into(struct nameidata *nd, int flags,
struct dentry *dentry)
{
struct path path;
struct inode *inode;
int err = handle_mounts(nd, dentry, &path);
if (err < 0)
return ERR_PTR(err);
inode = path.dentry->d_inode;
if (likely(!d_is_symlink(path.dentry)) ||
((flags & WALK_TRAILING) && !(nd->flags & LOOKUP_FOLLOW)) ||
(flags & WALK_NOFOLLOW)) {
/* not a symlink or should not follow */
if (nd->flags & LOOKUP_RCU) {
if (read_seqcount_retry(&path.dentry->d_seq, nd->next_seq))
return ERR_PTR(-ECHILD);
if (unlikely(!inode))
return ERR_PTR(-ENOENT);
} else {
dput(nd->path.dentry);
if (nd->path.mnt != path.mnt)
mntput(nd->path.mnt);
}
nd->path = path;
nd->inode = inode;
nd->seq = nd->next_seq;
return NULL;
}
if (nd->flags & LOOKUP_RCU) {
/* make sure that d_is_symlink above matches inode */
if (read_seqcount_retry(&path.dentry->d_seq, nd->next_seq))
return ERR_PTR(-ECHILD);
} else {
if (path.mnt == nd->path.mnt)
mntget(path.mnt);
}
return pick_link(nd, &path, inode, flags);
}
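The overall flow is:
1 Set path inside handle_mounts
2 Set inode
3 Check whether the dentry is a symlink (we discuss the case where it is not)
4 Set nd's path and inode
5 Return NULL
The basis for this is as follows:
handle_mounts is defined as follows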
static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
struct path *path)
{
bool jumped;
int ret;
path->mnt = nd->path.mnt;
path->dentry = dentry;
if (nd->flags & LOOKUP_RCU) {
unsigned int seq = nd->next_seq;
if (likely(__follow_mount_rcu(nd, path)))
return 0;
// *path and nd->next_seq might've been clobbered
path->mnt = nd->path.mnt;
path->dentry = dentry;
nd->next_seq = seq;
if (!try_to_unlazy_next(nd, dentry))
return -ECHILD;
}
ret = traverse_mounts(path, &jumped, &nd->total_link_count, nd->flags);
if (jumped) {
if (unlikely(nd->flags & LOOKUP_NO_XDEV))
ret = -EXDEV;
else
nd->state |= ND_JUMPED;
}
if (unlikely(ret)) {
dput(path->dentry);
if (path->mnt != nd->path.mnt)
mntput(path->mnt);
}
return ret;
}
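This function is presumably for handling mount points, but our case doesn't get into that; we only go through the first half
The overall flow is:
1 Set path
2 Check nd->flags
3 Call __follow_mount_rcu
4 Return 0
The basis for this is as follows
When handle_mounts was called, it returned straight to step_into after the if check, so the traverse_mounts call further down was never reached.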
open_last_lookups
static const char *open_last_lookups(struct nameidata *nd,
struct file *file, const struct open_flags *op)
{
struct dentry *dir = nd->path.dentry;
int open_flag = op->open_flag;
bool got_write = false;
struct dentry *dentry;
const char *res;
nd->flags |= op->intent;
if (nd->last_type != LAST_NORM) {
if (nd->depth)
put_link(nd);
return handle_dots(nd, nd->last_type);
}
if (!(open_flag & O_CREAT)) {
if (nd->last.name[nd->last.len])
nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
/* we _can_ be in RCU mode here */
dentry = lookup_fast(nd);
if (IS_ERR(dentry))
return ERR_CAST(dentry);
if (likely(dentry))
goto finish_lookup;
if (WARN_ON_ONCE(nd->flags & LOOKUP_RCU))
return ERR_PTR(-ECHILD);
} else {
/* create side of things */
if (nd->flags & LOOKUP_RCU) {
if (!try_to_unlazy(nd))
return ERR_PTR(-ECHILD);
}
audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
/* trailing slashes? */
if (unlikely(nd->last.name[nd->last.len]))
return ERR_PTR(-EISDIR);
}
if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
got_write = !mnt_want_write(nd->path.mnt);
/*
* do _not_ fail yet - we might not need that or fail with
* a different error; let lookup_open() decide; we'll be
* dropping this one anyway.
*/
}
if (open_flag & O_CREAT)
inode_lock(dir->d_inode);
else
inode_lock_shared(dir->d_inode);
dentry = lookup_open(nd, file, op, got_write);
if (!IS_ERR(dentry)) {
if (file->f_mode & FMODE_CREATED)
fsnotify_create(dir->d_inode, dentry);
if (file->f_mode & FMODE_OPENED)
fsnotify_open(file);
}
if (open_flag & O_CREAT)
inode_unlock(dir->d_inode);
else
inode_unlock_shared(dir->d_inode);
if (got_write)
mnt_drop_write(nd->path.mnt);
if (IS_ERR(dentry))
return ERR_CAST(dentry);
if (file->f_mode & (FMODE_OPENED | FMODE_CREATED)) {
dput(nd->path.dentry);
nd->path.dentry = dentry;
return NULL;
}
finish_lookup:
if (nd->depth)
put_link(nd);
res = step_into(nd, WALK_TRAILING, dentry);
if (unlikely(res))
nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
return res;
}
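It looks long, but not much of it actually runs.
The overall flow is:
1 OR op->intent into nd->flags. op->intent is 0 when O_PATH is set and LOOKUP_OPEN otherwise (with LOOKUP_CREATE and LOOKUP_EXCL possibly added for O_CREAT and O_EXCL), so in our scenario nd->flags gains LOOKUP_OPEN
2 Check that nd->last_type is LAST_NORM, i.e. an ordinary name rather than . or ..
3 Check whether the last path component is followed by a trailing slash; if so, add LOOKUP_FOLLOW | LOOKUP_DIRECTORY
4 Look up the last path component to get its dentry
5 Check whether it was found
6 Jump to finish_lookup
7 Call step_into, which returns the link target if the last component is a symlink; normally it returns NULL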
do_open
The code is as follows
/*
* Handle the last step of open()
*/
static int do_open(struct nameidata *nd,
struct file *file, const struct open_flags *op)
{
struct mnt_idmap *idmap;
int open_flag = op->open_flag;
bool do_truncate;
int acc_mode;
int error;
if (!(file->f_mode & (FMODE_OPENED | FMODE_CREATED))) {
error = complete_walk(nd);
if (error)
return error;
}
if (!(file->f_mode & FMODE_CREATED))
audit_inode(nd->name, nd->path.dentry, 0);
idmap = mnt_idmap(nd->path.mnt);
if (open_flag & O_CREAT) {
if ((open_flag & O_EXCL) && !(file->f_mode & FMODE_CREATED))
return -EEXIST;
if (d_is_dir(nd->path.dentry))
return -EISDIR;
error = may_create_in_sticky(idmap, nd,
d_backing_inode(nd->path.dentry));
if (unlikely(error))
return error;
}
if ((nd->flags & LOOKUP_DIRECTORY) && !d_can_lookup(nd->path.dentry))
return -ENOTDIR;
do_truncate = false;
acc_mode = op->acc_mode;
if (file->f_mode & FMODE_CREATED) {
/* Don't check for write permission, don't truncate */
open_flag &= ~O_TRUNC;
acc_mode = 0;
} else if (d_is_reg(nd->path.dentry) && open_flag & O_TRUNC) {
error = mnt_want_write(nd->path.mnt);
if (error)
return error;
do_truncate = true;
}
error = may_open(idmap, &nd->path, acc_mode, open_flag);
if (!error && !(file->f_mode & FMODE_OPENED))
error = vfs_open(&nd->path, file);
if (!error)
error = security_file_post_open(file, op->acc_mode);
if (!error && do_truncate)
error = handle_truncate(idmap, file);
if (unlikely(error > 0)) {
WARN_ON(1);
error = -EINVAL;
}
if (do_truncate)
mnt_drop_write(nd->path.mnt);
return error;
}
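As the comment says, this is the last step. Keep going!
The overall flow is:
1 First check whether file->f_mode has FMODE_OPENED or FMODE_CREATED set. In our scenario it only has FMODE_WRITE
2 Call complete_walk to finish the walk: leave RCU mode and tidy up some nd fields
3 Call audit_inode for auditing
4 Call mnt_idmap(nd->path.mnt); to get idmap
5 Call may_open to check permissions
6 Call vfs_open to open the file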
complete_walk
It is defined as follows
static int complete_walk(struct nameidata *nd)
{
struct dentry *dentry = nd->path.dentry;
int status;
if (nd->flags & LOOKUP_RCU) {
/*
* We don't want to zero nd->root for scoped-lookups or
* externally-managed nd->root.
*/
if (!(nd->state & ND_ROOT_PRESET))
if (!(nd->flags & LOOKUP_IS_SCOPED))
nd->root.mnt = NULL;
nd->flags &= ~LOOKUP_CACHED;
if (!try_to_unlazy(nd))
return -ECHILD;
}
if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) {
/*
* While the guarantee of LOOKUP_IS_SCOPED is (roughly) "don't
* ever step outside the root during lookup" and should already
* be guaranteed by the rest of namei, we want to avoid a namei
* BUG resulting in userspace being given a path that was not
* scoped within the root at some point during the lookup.
*
* So, do a final sanity-check to make sure that in the
* worst-case scenario (a complete bypass of LOOKUP_IS_SCOPED)
* we won't silently return an fd completely outside of the
* requested root to userspace.
*
* Userspace could move the path outside the root after this
* check, but as discussed elsewhere this is not a concern (the
* resolved file was inside the root at some point).
*/
if (!path_is_under(&nd->path, &nd->root))
return -EXDEV;
}
if (likely(!(nd->state & ND_JUMPED)))
return 0;
if (likely(!(dentry->d_flags & DCACHE_OP_WEAK_REVALIDATE)))
return 0;
status = dentry->d_op->d_weak_revalidate(dentry, nd->flags);
if (status > 0)
return 0;
if (!status)
status = -ESTALE;
return status;
}
The overall flow is:
1 Check whether nd->flags has LOOKUP_RCU set
2 If nd->state does not have ND_ROOT_PRESET and nd->flags does not have LOOKUP_IS_SCOPED, set nd->root.mnt to NULL
nd->state is set in set_nameidata and path_init
From debugging, the value of nd->flags is:
[Screenshot: debugger showing the value of nd->flags]
that is, LOOKUP_OPEN | LOOKUP_FOLLOW | LOOKUP_RCU
Roughly, nd->flags gets its value along this path:
  1 do_sys_open: open_how how = build_open_how(flags, mode);
  2 do_sys_openat2: int fd = build_open_flags(how, &op); which contains
  if (!(flags & O_NOFOLLOW))
  lookup_flags |= LOOKUP_FOLLOW; Note: this is where LOOKUP_FOLLOW is set
  3 do_filp_open: int flags = op->lookup_flags; filp = path_openat(&nd, op, flags | LOOKUP_RCU); Note: this is where LOOKUP_RCU comes from
  4 path_init: nd->flags = flags;
  5 open_last_lookups: nd->flags |= op->intent; Note: this is where LOOKUP_OPEN comes from
3 Clear the LOOKUP_CACHED bit, which was never set anyway
4 Call try_to_unlazy to leave RCU mode
5 Check whether nd->flags has LOOKUP_IS_SCOPED set (not in our scenario)
6 Check whether nd->state has ND_JUMPED set; if not, return
try_to_unlazy is defined as follows
static bool try_to_unlazy(struct nameidata *nd)
{
struct dentry *parent = nd->path.dentry;
BUG_ON(!(nd->flags & LOOKUP_RCU));
if (unlikely(!legitimize_links(nd)))
goto out1;
if (unlikely(!legitimize_path(nd, &nd->path, nd->seq)))
goto out;
if (unlikely(!legitimize_root(nd)))
goto out;
leave_rcu(nd);
BUG_ON(nd->inode != parent->d_inode);
return true;
out1:
nd->path.mnt = NULL;
nd->path.dentry = NULL;
out:
leave_rcu(nd);
return false;
}
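The overall flow is as follows:
1 Call legitimize_links, which takes references on the symlinks saved in nd->stack; not covered in detail
2 Call legitimize_path, which takes references on nd->path (mount and dentry) and checks that the sequence number sampled during the RCU-walk is still valid; not covered in detail
3 Call legitimize_root, which does the same for nd->root; not covered in detail
4 Call leave_rcu to exit RCU-walk mode. From here on the walk continues in ref-walk mode, relying on the references taken above instead of the RCU read lock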
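The idea behind the legitimize_* helpers (pin the object, then confirm nothing changed since the lockless read) can be sketched as below. This is a single-threaded user-space model with hypothetical names, not the kernel implementation; real code uses lockref and seqcount primitives with proper memory ordering.

```c
#include <stdbool.h>

/* Hypothetical object protected by a sequence counter and a refcount. */
struct demo_obj {
    unsigned int seq;   /* bumped by writers on every change */
    int refcount;       /* 0 means the object is being torn down */
};

/* Take a reference only if the object is still alive. */
static bool demo_get_unless_zero(struct demo_obj *o)
{
    if (o->refcount == 0)
        return false;
    o->refcount++;
    return true;
}

/*
 * Pin the object, then check that its sequence number still equals the
 * value sampled during the lockless walk. In this sketch a reference taken
 * before a mismatch is kept; the caller is expected to drop it in its own
 * cleanup path.
 */
static bool demo_legitimize(struct demo_obj *o, unsigned int sampled_seq)
{
    if (!demo_get_unless_zero(o))
        return false;
    return o->seq == sampled_seq;
}
```

If either check fails, the caller has to abandon the lockless result and retry the lookup in a slower mode, which is what the -ECHILD return from complete_walk signals.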
leave_rcu is defined as follows
static void leave_rcu(struct nameidata *nd)
{
nd->flags &= ~LOOKUP_RCU;
nd->seq = nd->next_seq = 0;
rcu_read_unlock();
}
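The overall flow is fairly simple:
1 Clear LOOKUP_RCU in nd->flags
2 Reset the sampled sequence numbers nd->seq and nd->next_seq to 0
3 Call rcu_read_unlock to leave the RCU read-side critical section
mnt_idmap and may_open
mnt_idmap returns the ID mapping attached to the mount (used by idmapped mounts); not covered here
may_open
Performs the permission checks; not covered here
vfs_open
Defined as follows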
int vfs_open(const struct path *path, struct file *file)
{
int ret;
file->f_path = *path;
ret = do_dentry_open(file, NULL);
if (!ret) {
/*
* Once we return a file with FMODE_OPENED, __fput() will call
* fsnotify_close(), so we need fsnotify_open() here for
* symmetry.
*/
fsnotify_open(file);
}
return ret;
}
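The overall flow is fairly simple:
1 Copy *path into file->f_path. Note: the dentry found earlier by lookup_fast was stored in path->dentry inside handle_mounts, and here that path is copied into file->f_path
2 Call do_dentry_open to open the file
3 On success, call fsnotify_open to send a notification; this belongs to the event mechanism and is not covered here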
do_dentry_open
static int do_dentry_open(struct file *f,
int (*open)(struct inode *, struct file *))
{
static const struct file_operations empty_fops = {};
struct inode *inode = f->f_path.dentry->d_inode;
int error;
path_get(&f->f_path);
f->f_inode = inode;
f->f_mapping = inode->i_mapping;
f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
f->f_sb_err = file_sample_sb_err(f);
if (unlikely(f->f_flags & O_PATH)) {
f->f_mode = FMODE_PATH | FMODE_OPENED;
f->f_op = &empty_fops;
return 0;
}
if ((f->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ) {
i_readcount_inc(inode);
} else if (f->f_mode & FMODE_WRITE && !special_file(inode->i_mode)) {
error = file_get_write_access(f);
if (unlikely(error))
goto cleanup_file;
f->f_mode |= FMODE_WRITER;
}
/* POSIX.1-2008/SUSv4 Section XSI 2.9.7 */
if (S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode))
f->f_mode |= FMODE_ATOMIC_POS;
f->f_op = fops_get(inode->i_fop);
if (WARN_ON(!f->f_op)) {
error = -ENODEV;
goto cleanup_all;
}
error = security_file_open(f);
if (error)
goto cleanup_all;
error = break_lease(file_inode(f), f->f_flags);
if (error)
goto cleanup_all;
/* normally all 3 are set; ->open() can clear them if needed */
f->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
if (!open)
open = f->f_op->open;
if (open) {
error = open(inode, f);
if (error)
goto cleanup_all;
}
f->f_mode |= FMODE_OPENED;
if ((f->f_mode & FMODE_READ) &&
likely(f->f_op->read || f->f_op->read_iter))
f->f_mode |= FMODE_CAN_READ;
if ((f->f_mode & FMODE_WRITE) &&
likely(f->f_op->write || f->f_op->write_iter))
f->f_mode |= FMODE_CAN_WRITE;
if ((f->f_mode & FMODE_LSEEK) && !f->f_op->llseek)
f->f_mode &= ~FMODE_LSEEK;
if (f->f_mapping->a_ops && f->f_mapping->a_ops->direct_IO)
f->f_mode |= FMODE_CAN_ODIRECT;
f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
f->f_iocb_flags = iocb_flags(f);
file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);
if ((f->f_flags & O_DIRECT) && !(f->f_mode & FMODE_CAN_ODIRECT))
return -EINVAL;
/*
* XXX: Huge page cache doesn't support writing yet. Drop all page
* cache for this file before processing writes.
*/
if (f->f_mode & FMODE_WRITE) {
/*
* Paired with smp_mb() in collapse_file() to ensure nr_thps
* is up to date and the update to i_writecount by
* get_write_access() is visible. Ensures subsequent insertion
* of THPs into the page cache will fail.
*/
smp_mb();
if (filemap_nr_thps(inode->i_mapping)) {
struct address_space *mapping = inode->i_mapping;
filemap_invalidate_lock(inode->i_mapping);
/*
* unmap_mapping_range just need to be called once
* here, because the private pages is not need to be
* unmapped mapping (e.g. data segment of dynamic
* shared libraries here).
*/
unmap_mapping_range(mapping, 0, 0, 0);
truncate_inode_pages(mapping, 0);
filemap_invalidate_unlock(inode->i_mapping);
}
}
return 0;
cleanup_all:
if (WARN_ON_ONCE(error > 0))
error = -EINVAL;
fops_put(f->f_op);
put_file_access(f);
cleanup_file:
path_put(&f->f_path);
f->f_path.mnt = NULL;
f->f_path.dentry = NULL;
f->f_inode = NULL;
return error;
}
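This function is the core of the open function
The overall flow is as follows:
1 First fill in some members of file, mostly from the d_inode member of the dentry found earlier (in the virtual file system the inode represents the actual file, so most of the file information lives there)
2 Check whether O_PATH is set; if so, mark the file as FMODE_PATH | FMODE_OPENED with empty file operations and return
3 Check whether the file is opened for reading or for writing: a read-only open increments i_readcount, which counts readers, while a write open must call file_get_write_access to check whether writing is currently allowed
4 Call fops_get to take the inode's i_fop member and store it in f->f_op
5 Run the final security checks (security_file_open and break_lease)
6 Fetch the open function from f->f_op
7 Call that open function. This line is the heart of the whole open path, and everything before it is preparation. All the earlier work is done by the virtual file system, whereas this call is the work of the concrete file system that holds the file; it connects the two. What follows this line is mostly bookkeeping.
8 Add the "opened" flag (FMODE_OPENED) to f->f_mode
9 For a read open, add FMODE_CAN_READ if the file operations provide read or read_iter, meaning the file can now be read; the same applies to writes with FMODE_CAN_WRITE
10 Call file_ra_state_init to initialize the file's readahead state
11 For a write open, call smp_mb(), a memory barrier that matters on multi-core systems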
if (filemap_nr_thps(inode->i_mapping)) {
struct address_space *mapping = inode->i_mapping;
filemap_invalidate_lock(inode->i_mapping);
/*
* unmap_mapping_range just need to be called once
* here, because the private pages is not need to be
* unmapped mapping (e.g. data segment of dynamic
* shared libraries here).
*/
unmap_mapping_range(mapping, 0, 0, 0);
truncate_inode_pages(mapping, 0);
filemap_invalidate_unlock(inode->i_mapping);
}
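12 I have not studied this part closely. Going by the comment in the source, the huge page cache does not support writing yet, so if filemap_nr_thps reports transparent huge pages in this file's page cache, the file's whole page cache is unmapped and truncated before any write is processed. Corrections from anyone who knows this code better are welcome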
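Step 7 is easier to see in a stripped-down model. The sketch below is a user-space illustration with hypothetical names (demo_file, demo_file_ops, toyfs_*): a generic layer calls through a function-pointer table supplied by a "file system" and then records what the file can do, in the spirit of steps 7 to 9.

```c
#include <stddef.h>

#define DEMO_FMODE_READ     0x1u
#define DEMO_FMODE_OPENED   0x2u
#define DEMO_FMODE_CAN_READ 0x4u

struct demo_file;

/* Hypothetical, trimmed-down analogue of struct file_operations. */
struct demo_file_ops {
    int (*open)(struct demo_file *f);
    long (*read)(struct demo_file *f, char *buf, size_t len);
};

struct demo_file {
    unsigned int mode;
    const struct demo_file_ops *ops;
    int fs_private;   /* set by the "file system" in its open hook */
};

/* Hooks of a toy file system. */
static int toyfs_open(struct demo_file *f)
{
    f->fs_private = 42;
    return 0;
}

static long toyfs_read(struct demo_file *f, char *buf, size_t len)
{
    (void)f;
    (void)buf;
    return (long)len;
}

static const struct demo_file_ops toyfs_ops = { toyfs_open, toyfs_read };

/* Generic layer: dispatch to the file system, then record what is allowed. */
static int demo_dentry_open(struct demo_file *f, const struct demo_file_ops *ops)
{
    f->ops = ops;
    if (f->ops->open) {
        int error = f->ops->open(f);   /* the file-system specific part */
        if (error)
            return error;
    }
    f->mode |= DEMO_FMODE_OPENED;
    if ((f->mode & DEMO_FMODE_READ) && f->ops->read)
        f->mode |= DEMO_FMODE_CAN_READ;
    return 0;
}
```

The generic code never needs to know which file system it is talking to; swapping toyfs_ops for another table changes the behavior of the open call without touching demo_dentry_open.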
terminate_walk
static void terminate_walk(struct nameidata *nd)
{
drop_links(nd);
if (!(nd->flags & LOOKUP_RCU)) {
int i;
path_put(&nd->path);
for (i = 0; i < nd->depth; i++)
path_put(&nd->stack[i].link);
if (nd->state & ND_ROOT_GRABBED) {
path_put(&nd->root);
nd->state &= ~ND_ROOT_GRABBED;
}
} else {
leave_rcu(nd);
}
nd->depth = 0;
nd->path.mnt = NULL;
nd->path.dentry = NULL;
}
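The overall flow is as follows:
1 Call drop_links to clear the link information held in nd
2 Check whether LOOKUP_RCU is set in nd->flags (try_to_unlazy has already called leave_rcu once and cleared it); if it is still set, just call leave_rcu and skip steps 3 to 5
3 Call path_put to drop one reference on nd->path
4 If depth is greater than 0, drop one reference on every link path saved in the stack
5 Check whether ND_ROOT_GRABBED is set in nd->state; if so, drop the reference on nd->root and clear the flag
6 Set depth to 0
7 Set nd->path.mnt and nd->path.dentry to NULL
The drop_links function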
static inline void do_delayed_call(struct delayed_call *call)
{
if (call->fn)
call->fn(call->arg);
}
static inline void clear_delayed_call(struct delayed_call *call)
{
call->fn = NULL;
}
static void drop_links(struct nameidata *nd)
{
int i = nd->depth;
while (i--) {
struct saved *last = nd->stack + i;
do_delayed_call(&last->done);
clear_delayed_call(&last->done);
}
}
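The overall flow is as follows:
1 Walk the stack from the top entry down to the bottom
2 Call do_delayed_call, which invokes the function fn points to (if any)
3 Call clear_delayed_call, which sets fn to NULL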
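The same run-then-clear pattern can be tried outside the kernel. A minimal sketch, with hypothetical demo_ names mirroring the three functions above and a counter callback added purely for illustration:

```c
#include <stddef.h>

/* Hypothetical analogue of struct delayed_call: a callback plus its argument. */
struct demo_delayed_call {
    void (*fn)(void *);
    void *arg;
};

static void demo_do_delayed_call(struct demo_delayed_call *call)
{
    if (call->fn)
        call->fn(call->arg);
}

static void demo_clear_delayed_call(struct demo_delayed_call *call)
{
    call->fn = NULL;
}

/* Run and clear every pending callback, newest entry first. */
static void demo_drop_links(struct demo_delayed_call *stack, int depth)
{
    int i = depth;

    while (i--) {
        demo_do_delayed_call(&stack[i]);
        demo_clear_delayed_call(&stack[i]);
    }
}

/* Example callback: counts how often it ran. */
static void demo_count(void *arg)
{
    ++*(int *)arg;
}
```

Because fn is cleared after the call, running demo_drop_links twice over the same stack invokes each callback only once.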
restore_nameidata
static void restore_nameidata(void)
{
struct nameidata *now = current->nameidata, *old = now->saved;
current->nameidata = old;
if (old)
old->total_link_count = now->total_link_count;
if (now->stack != now->internal)
kfree(now->stack);
}
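The overall flow is as follows:
1 Give the previously saved nameidata back to the current process
2 Copy now->total_link_count into old->total_link_count (if there is an old one)
3 Free the stack of the nameidata being retired, but only if it was allocated separately, that is, if it no longer points at the embedded now->internal array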
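Step 3 relies on a common pattern: a small array embedded in the struct, with a heap array taking over only when more room is needed. A minimal user-space sketch with hypothetical names (demo_walk and friends); the sizes and the grow policy are assumptions for illustration only.

```c
#include <stdbool.h>
#include <stdlib.h>

#define DEMO_EMBEDDED 2

/* Hypothetical walker state with a small embedded stack, like nd->internal. */
struct demo_walk {
    int *stack;                    /* points at `internal` or at heap memory */
    int internal[DEMO_EMBEDDED];
    int capacity;
};

static void demo_walk_init(struct demo_walk *w)
{
    int i;

    for (i = 0; i < DEMO_EMBEDDED; i++)
        w->internal[i] = 0;
    w->stack = w->internal;
    w->capacity = DEMO_EMBEDDED;
}

/* Switch to a heap array when more slots are needed than are embedded. */
static bool demo_walk_grow(struct demo_walk *w, int capacity)
{
    int *p;
    int i;

    if (capacity <= w->capacity)
        return true;
    p = malloc(sizeof(*p) * (size_t)capacity);
    if (!p)
        return false;
    for (i = 0; i < w->capacity; i++)
        p[i] = w->stack[i];
    if (w->stack != w->internal)
        free(w->stack);
    w->stack = p;
    w->capacity = capacity;
    return true;
}

/* Same test as in restore_nameidata(): free only what was heap-allocated. */
static void demo_walk_release(struct demo_walk *w)
{
    if (w->stack != w->internal)
        free(w->stack);
    w->stack = w->internal;
    w->capacity = DEMO_EMBEDDED;
}
```

The pointer comparison is what makes the release safe: passing the embedded array to free would be a bug, so the code frees only when the pointer has moved away from it.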
fd_install
void fd_install(unsigned int fd, struct file *file)
{
struct files_struct *files = current->files;
struct fdtable *fdt;
if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
return;
rcu_read_lock_sched();
if (unlikely(files->resize_in_progress)) {
rcu_read_unlock_sched();
spin_lock(&files->file_lock);
fdt = files_fdtable(files);
BUG_ON(fdt->fd[fd] != NULL);
rcu_assign_pointer(fdt->fd[fd], file);
spin_unlock(&files->file_lock);
return;
}
/* coupled with smp_wmb() in expand_fdtable() */
smp_rmb();
fdt = rcu_dereference_sched(files->fdt);
BUG_ON(fdt->fd[fd] != NULL);
rcu_assign_pointer(fdt->fd[fd], file);
rcu_read_unlock_sched();
}
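Most of this function is RCU code that keeps the data consistent; the one line that does the real work is rcu_assign_pointer(fdt->fd[fd], file);
which stores file in slot fd of the fd table.
rcu_assign_pointer is defined as follows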
#define rcu_assign_pointer(p, v) do { (p) = (v); } while (0)
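A simple assignment. Note that this one-line form appears to be a simplified variant; the definition in include/linux/rcupdate.h additionally publishes the pointer with smp_store_release when the value is not a constant NULL. As for why it is written as do while(0) and how that differs from plain braces, see the reference link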
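The do while(0) question can also be answered with a short experiment. A minimal sketch, using two hypothetical macros that differ only in their wrapper:

```c
/* Brace-only version: the ';' written after the macro call ends the if statement. */
#define DEMO_ASSIGN_BRACES(p, v) { (p) = (v); }

/* do/while(0) version: expands to one statement that needs the trailing ';'. */
#define DEMO_ASSIGN(p, v) do { (p) = (v); } while (0)

static int demo_pick(int cond)
{
    int x = 0;

    /*
     * With DEMO_ASSIGN_BRACES this would not compile: the expansion
     * "{ ... };" leaves an empty statement between the block and "else",
     * so the "else" has no matching "if".
     */
    if (cond)
        DEMO_ASSIGN(x, 1);
    else
        DEMO_ASSIGN(x, 2);
    return x;
}
```

So the wrapper lets a multi-statement macro be used anywhere a single statement followed by a semicolon is expected.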
putname
void putname(struct filename *name)
{
if (IS_ERR(name))
return;
if (WARN_ON_ONCE(!atomic_read(&name->refcnt)))
return;
if (!atomic_dec_and_test(&name->refcnt))
return;
if (name->name != name->iname) {
__putname(name->name);
kfree(name);
} else
__putname(name);
}
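Basically it releases the memory occupied by name
The overall flow is as follows:
1 Check whether the pointer encodes an error value; if so, return
2 Check whether the reference count of name is already 0; if so, warn and return
3 Decrement the reference count; return if it has not reached 0
4 Compare name->name with name->iname; they are equal in our scenario. If they differ, the string was allocated separately, so the string is freed with __putname and the struct with kfree
5 Call __putname to free the memory allocated earlier
__putname is defined as follows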
#define __putname(name) kmem_cache_free(names_cachep, (void *)(name))
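The reference-count part of putname (steps 2 and 3) follows the usual "last one out frees the object" rule. A minimal user-space sketch with C11 atomics; demo_name, demo_getname and demo_putname are hypothetical names, and calloc/free stand in for the kernel's slab cache.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical refcounted name object, loosely modeled on struct filename. */
struct demo_name {
    atomic_int refcnt;
    char text[32];
};

/* Allocate a name with one reference held by the caller. */
static struct demo_name *demo_getname(void)
{
    struct demo_name *n = calloc(1, sizeof(*n));

    if (n)
        atomic_init(&n->refcnt, 1);
    return n;
}

/* Drop one reference; returns true if this call freed the object. */
static bool demo_putname(struct demo_name *n)
{
    if (!n)
        return false;
    /* atomic_fetch_sub returns the old value, so 1 means we were the last user. */
    if (atomic_fetch_sub(&n->refcnt, 1) != 1)
        return false;
    free(n);
    return true;
}
```

Only the caller that brings the count to zero frees the memory, so two users of the same name can each call demo_putname without coordinating.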
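That completes the walkthrough of the open function's code. Since it touches on a lot of material, omissions and mistakes are inevitable, and corrections are welcome
Open questions
These are the points I did not understand while writing; I am recording them here to work through later
build_open_flags
1 I have not figured out what FMODE_NONOTIFY is for
2 Write a separate blog post about compiletime_assert
3 Why must O_DIRECTORY be set together with __O_TMPFILE, or else it fails? Is it for compatibility with some old version?
4 What LOOKUP_OPEN does
5 (Hmm, if I comment out the line flags |= O_NOFOLLOW; would that restriction go away?)
getname
1 Look into the audit mechanism later
2 atomic_set(&result->refcnt, 1); look into this atomic operation later
3 Step 6 of the overall flow: the failure case is hard to understand from this layer alone
4 __getname() is also worth studying; I have read about slab in a book before
5 Atomic operations need more thought later; how do they relate to preemption?
get_unused_fd_flags
1 What is the rlim_cur member of the rlim object at index 7 read for? (the end value of the bitmap)
2 spin_lock(&files->file_lock);
3 Look into the RCU mechanism later: rcu_access_pointer and rcu_assign_pointer
do_filp_open
1 Need to look at restore_nameidata: what does it do? Add it to the overall flow of do_filp_open afterwards
2 Is p->stack = p->internal; in __set_nameidata there for handling symlinks?
3 The path member of nameidata is very useful, isn't it?
4 What is the total_link_count member of nameidata for?
5 And the saved member of nameidata? What is it for? Once that is clear, go through the overall flow of __set_nameidata properly
6 Then look at the returned filp
If something goes wrong, try again with different flags, but will nd change?
path_init
1 It is clear what value is assigned to the path of nameidata, but not yet which value inode gets; leave that for later observation
link_path_walk
1 An extra attribute is added to flags; what is it for?
load_unaligned_zeropad
1 What _ASM_EXTABLE_TYPE does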