A Hard-Core Look at Linux File System Internals (the VFS)

Article link: https://blog.youkuaiyun.com/include_it_dog/article/details/125150827

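This article describes the Linux Virtual File System (VFS) and its core components in detail, including the super_block, dentry, inode, and file structures. As the file system interface layer, VFS unifies how the various file systems are accessed. The article also explains how a file system is laid out on disk, such as Ext2's boot block, superblock, and GDT, and walks through mounting, opening, reading, and writing files. Understanding VFS gives readers a firmer grasp of the low-level principles behind file operations in Linux.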

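The Linux VFS

Linux has always stressed that everything is a file: besides regular files and directories, block devices, pipes, sockets, and so on are all treated as files. Understanding the file system is therefore essential to understanding Linux in depth, so let's look at the Linux file system from both the design and the source-code angle.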

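1. Basic concepts of VFS

The Linux kernel manages file systems through the Virtual File System (VFS). VFS is a critical piece of kernel infrastructure: it provides a uniform interface for all file systems. Every access to a concrete file system goes through the interfaces VFS defines, and every Linux file system must be implemented the way VFS prescribes.

VFS itself exists only in memory, so the on-disk file system has to be abstracted into memory. This is done through a few key structures: super_block, dentry, inode, and file. With these structures in place, operations on the file system are carried out by managing dentries and inodes.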

1.1 super_block

struct super_block {
    struct list_head	s_list;
    unsigned long        s_blocksize;
    unsigned char        s_blocksize_bits;
    /* ... superblock list linkage, device number, and other fields omitted ... */
    unsigned long long        s_maxbytes;        /* Max file size */
    struct file_system_type        *s_type;
    struct super_operations        *s_op;
    unsigned long        s_magic;
    struct dentry        *s_root;
    struct list_head        s_inodes;        /* all inodes */
    struct list_head         s_dirty;        /* dirty inodes */
    struct block_device        *s_bdev;
    void         *s_fs_info;        /* Filesystem private info */
};

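The superblock (super_block) represents the file system as a whole. It usually corresponds to the superblock structure that the file system keeps on disk.

s_list links the superblock into the list of superblocks
s_blocksize specifies the file system block size
s_maxbytes specifies the maximum size of a file in the file system
s_type points to a file_system_type structure, which describes a file system type; each file system type keeps a list of the superblocks of all file systems of that type
s_magic is the magic number; each file system has its own magic number
s_root points to the root dentry of the file system
s_inodes links all inodes of the file system and can be used to iterate over them
s_dirty links all dirty inodes
s_bdev points to the block device on which the file system resides

Function pointers stored in s_op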

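Function	Description
alloc_inode(sb)	Allocates space for an inode object
destroy_inode(inode)	Destroys an inode object
read_inode(inode)	Reads inode data from disk and fills in the inode object passed as the argument
write_inode(inode, flag)	Updates the on-disk inode with the information from the in-memory inode
delete_inode(inode)	Deletes the in-memory inode object and also deletes the inode on disk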
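
To make the role of s_op concrete, here is a minimal user-space sketch of the same pattern: generic code calls through a table of function pointers, and each file system plugs in its own implementations. Everything prefixed with toy_ or demo_ is a hypothetical name made up for this illustration; it is not kernel code.

```c
#include <stdlib.h>

/* Toy stand-ins for the kernel types; all names here are hypothetical. */
struct toy_inode { unsigned long ino; };
struct toy_sb;

struct toy_super_ops {
    struct toy_inode *(*alloc_inode)(struct toy_sb *sb);
    void (*destroy_inode)(struct toy_inode *inode);
};

struct toy_sb {
    const struct toy_super_ops *s_op;   /* per-filesystem method table */
    unsigned long next_ino;             /* next inode number to hand out */
};

/* One "file system" supplies its own implementations. */
static struct toy_inode *demo_alloc_inode(struct toy_sb *sb)
{
    struct toy_inode *inode = malloc(sizeof(*inode));

    if (inode)
        inode->ino = sb->next_ino++;
    return inode;
}

static void demo_destroy_inode(struct toy_inode *inode)
{
    free(inode);
}

static const struct toy_super_ops demo_ops = {
    .alloc_inode   = demo_alloc_inode,
    .destroy_inode = demo_destroy_inode,
};

/* Generic code never names the file system; it only goes through s_op. */
unsigned long toy_new_inode_number(struct toy_sb *sb)
{
    struct toy_inode *inode = sb->s_op->alloc_inode(sb);
    unsigned long ino;

    if (!inode)
        return 0;
    ino = inode->ino;
    sb->s_op->destroy_inode(inode);
    return ino;
}
```

A caller would set up a struct toy_sb with s_op pointing at demo_ops and then call toy_new_inode_number(); swapping in a different ops table changes the behavior without touching the generic function.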

1.2 dentry

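A dentry records the file name, the parent directory, and similar information. The tree structure of the file system is expressed mainly through the links between dentries. Note that a dentry does not hold the actual file data.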

struct dentry {
    /* ... dentry lock, flags, and other fields omitted ... */
    struct inode *d_inode;        /* Where the name belongs to - NULL is negative */
    /*
    * The next three fields are touched by __d_lookup.  Place them here
    * so they all fit in a cache line.
    */
    struct hlist_node d_hash;        /* lookup hash list */
    struct dentry *d_parent;                /* parent directory */
    struct qstr d_name;
    /*
    * d_child and d_rcu can share memory
    */
    union {
        struct list_head d_child;        /* child of parent list */
        struct rcu_head d_rcu;
    } d_u;
    struct list_head d_alias;    /* inode alias list */
    struct list_head d_subdirs;        /* our children */
    struct dentry_operations *d_op;
    struct super_block *d_sb;        /* The root of the dentry tree */
    int d_mounted;
};

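d_inode points to an inode; together, this inode and the dentry describe a regular file or a directory
d_subdirs is the head of the list of child entries
d_child is this dentry's own list node, linked into the parent dentry's d_subdirs list
d_parent points to the parent dentry
d_hash links the dentry into the hash list of the dentry cache
d_name holds the name of the file or directory
d_mounted indicates whether the dentry is a mount point; it is nonzero if so
d_alias: a valid dentry is always associated with an inode, but one inode can correspond to several dentries, because a file can have several hard links. d_alias is the field through which the dentry is linked into the i_dentry list of its inode.

Function pointers stored in d_op

Function	Description
d_revalidate(dentry, nameidata)	Checks whether the dentry is still valid; presumably used by the dentry cache
d_hash(dentry, name)	Computes a hash value; presumably also used by the dentry cache
d_delete(dentry)	Called when d_count drops to 0 to delete the dentry; the default VFS function does nothing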
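
The way d_parent, d_subdirs, and d_name form a tree can be sketched in user space as follows. This is only an illustration under simplifying assumptions: the child list is a small fixed array instead of a list_head, d_ino stands in for d_inode, and all toy_ names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define TOY_MAX_CHILDREN 8

struct toy_dentry {
    const char *d_name;
    struct toy_dentry *d_parent;                    /* root points to itself */
    struct toy_dentry *d_subdirs[TOY_MAX_CHILDREN]; /* children */
    int nr_children;
    unsigned long d_ino;                            /* stands in for d_inode */
};

/* Attach child under parent; returns 0, or -1 if the parent is full. */
int toy_d_add(struct toy_dentry *parent, struct toy_dentry *child)
{
    if (parent->nr_children >= TOY_MAX_CHILDREN)
        return -1;
    child->d_parent = parent;
    parent->d_subdirs[parent->nr_children++] = child;
    return 0;
}

/* Find the child of parent whose name matches, or NULL. */
struct toy_dentry *toy_d_lookup(struct toy_dentry *parent, const char *name)
{
    for (int i = 0; i < parent->nr_children; i++)
        if (strcmp(parent->d_subdirs[i]->d_name, name) == 0)
            return parent->d_subdirs[i];
    return NULL;
}

/* Number of d_parent hops from d up to the root. */
int toy_d_depth(const struct toy_dentry *d)
{
    int depth = 0;

    while (d->d_parent != d) {
        d = d->d_parent;
        depth++;
    }
    return depth;
}
```

Resolving a path such as /etc/passwd then amounts to calling toy_d_lookup() once per component, starting from the root dentry. The real kernel uses a hash table (d_hash) for this lookup rather than scanning the children.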

1.3 inode

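An inode represents a file. It stores attributes such as the file's size, timestamps, and block size, as well as the file's read/write functions and its read/write cache.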

struct inode {
    struct list_head        i_list;
    struct list_head        i_sb_list;
    struct list_head        i_dentry;
    unsigned long        i_ino;
    atomic_t        i_count;
    loff_t        i_size;
    unsigned int        i_blkbits;
    struct inode_operations        *i_op;
    const struct file_operations        *i_fop;        /* former ->i_op->default_file_ops */
    struct address_space        *i_mapping;
    struct block_device        *i_bdev;
    /* ... locks and other fields omitted ... */
};

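The members i_list, i_sb_list, and i_dentry are three list heads. i_list links the inode into a list that reflects its current state. i_sb_list links it into the superblock's inode list. When a new inode is created, i_list is linked into the inode_in_use list to mark the inode as in use, and i_sb_list is linked into the s_inodes list head of the file system's superblock. Because one file can correspond to several dentries, all of those dentries are linked into the i_dentry list head.
i_ino is the inode number, and i_count is the inode's reference count. i_size is the file length in bytes.
i_blkbits is the block size expressed as a number of bits (the base-2 logarithm of the block size in bytes).
i_fop is a pointer to a struct file_operations. The file's read/write functions and asynchronous I/O functions are provided through this structure. Nearly every concrete file system supplies its own file operations.
i_mapping is an important member. Its purpose is to cache the file's contents: reads and writes first look for the data in the cache that i_mapping describes. On a cache hit, a read is served straight from the cache without going to the physical disk, which speeds up reads considerably. Writes also go to the cache first and are written back to disk later at a suitable time. The read and write paths are covered later; for now it is enough to understand the basic role.
i_bdev points to a block device: the block device bound to the file system that holds the file.

Function pointers stored in i_op

Function	Description
create(dir, dentry, mode, nameidata)	Creates a new inode for dentry in directory dir
lookup(dir, dentry, nameidata)	Searches directory dir for the inode matching the file name held in dentry
link(old_dentry, dir, new_dentry)	Creates a hard link: the new name new_dentry in directory dir refers to the same inode, i.e. the same file, as old_dentry
symlink(dir, dentry, symname)	Creates a new inode for a symbolic link named by dentry in directory dir; the link's target path is symname
mkdir(dir, dentry, mode)	Creates the inode of a directory for dentry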
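
The "one inode, several dentries" relationship is visible from user space. The sketch below uses the standard POSIX calls open(), link(), stat(), and unlink() to create a file, add a second name for it, and check that both names report the same inode number. The /tmp file names and the function name are assumptions chosen for the example; it also assumes /tmp is writable and supports hard links.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 1 if both names resolve to the same inode with a link count of 2,
 * 0 if they do not, and -1 if a system call failed.
 */
int check_hard_link_shares_inode(void)
{
    char a[64], b[64];
    struct stat sa, sb;
    int fd, ok = -1;

    snprintf(a, sizeof(a), "/tmp/vfs_link_a_%ld", (long)getpid());
    snprintf(b, sizeof(b), "/tmp/vfs_link_b_%ld", (long)getpid());

    fd = open(a, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0)
        return -1;
    close(fd);

    if (link(a, b) == 0) {               /* second name, same inode */
        if (stat(a, &sa) == 0 && stat(b, &sb) == 0)
            ok = (sa.st_dev == sb.st_dev &&
                  sa.st_ino == sb.st_ino &&
                  sa.st_nlink == 2);
        unlink(b);
    }
    unlink(a);
    return ok;
}
```

Inside the kernel, the two names would be two dentries linked through d_alias into the same inode's i_dentry list.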

1.4 file

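The file structure stores information about the interaction between a process and a file it has opened. This information exists only in kernel space; nothing on disk corresponds to it.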

struct file {
    union {
		struct llist_node	fu_llist;
		struct rcu_head 	fu_rcuhead;
	} f_u;
	struct path		f_path;
	struct inode		*f_inode;	/* cached value */
	const struct file_operations	*f_op;
    unsigned int              f_flags;  /* file flags, e.g. O_RDONLY, O_NONBLOCK, O_SYNC */
    fmode_t                 f_mode;     /* file read/write mode: FMODE_READ, FMODE_WRITE */
    /* ... some fields omitted ... */
    loff_t        f_pos;
    struct fown_struct        f_owner;
    unsigned int        f_uid, f_gid;
    struct file_ra_state        f_ra;
    struct address_space        *f_mapping;
};

struct path {
	struct vfsmount *mnt;
	struct dentry *dentry;
};

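f_inode points to the inode of this file (a cached copy of f_path.dentry->d_inode)
f_path.dentry points to the dentry of this file
f_path.mnt points to the mounted file system that holds the file (the mount data structure)
f_uid and f_gid are the user ID and group ID recorded for the open file
f_op stores the function pointers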

Function	Description
open(inode, file)	Opens the file
llseek(file, offset, origin)	Moves the file position
read(file, buf, count, offset)	Reads from the file
write(file, buf, count, offset)	Writes to the file
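
Because f_pos lives in struct file rather than in the inode, every open() gets its own file position, while descriptors created with dup() share one struct file and therefore one position. The following user-space sketch checks this with standard POSIX calls; the function name and the /tmp path are assumptions made for the example, and it assumes /tmp is writable.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Returns 1 if dup() shares the file position and a second open() does not,
 * 0 if the observation differs, and -1 if a system call failed.
 */
int check_offset_sharing(void)
{
    char path[64];
    int fd1, fd2, fd3, ok = -1;

    snprintf(path, sizeof(path), "/tmp/vfs_pos_demo_%ld", (long)getpid());
    fd1 = open(path, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd1 < 0)
        return -1;

    fd2 = open(path, O_RDONLY);   /* a second, independent open file */
    fd3 = dup(fd1);               /* another descriptor for fd1's open file */

    if (fd2 >= 0 && fd3 >= 0 && write(fd1, "abcdef", 6) == 6) {
        off_t shared   = lseek(fd3, 0, SEEK_CUR);  /* follows fd1 */
        off_t separate = lseek(fd2, 0, SEEK_CUR);  /* untouched */

        ok = (shared == 6 && separate == 0);
    }

    if (fd2 >= 0)
        close(fd2);
    if (fd3 >= 0)
        close(fd3);
    close(fd1);
    unlink(path);
    return ok;
}
```

All three descriptors refer to the same inode; only the number of struct file objects differs.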

1.5 file_system_type

struct file_system_type {
        const char *name;
        int fs_flags;
        struct super_block *(*get_sb)(struct file_system_type *,
                int, char *, void *, struct vfsmount *);
        void (*kill_sb) (struct super_block *);
        struct module *owner;
        struct file_system_type *next;
        struct list_head fs_supers;
        struct lock_class_key s_lock_key;
        struct lock_class_key s_umount_key;
};

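name holds the name of the file system type
get_sb: when a file system is mounted, the kernel calls get_sb() to initialize state and set up the superblock
kill_sb cleans up when the file system is unmounted
next: fs/filesystems.c defines a global variable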
```
static struct file_system_type *file_systems;
```
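This variable is the head of the list of known file_system_type structures. register_filesystem() adds a file_system_type to the list and unregister_filesystem() removes one; next is the link that chains the file_system_type entries together.
fs_supers is the head of the list of superblocks belonging to this file_system_type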
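
The registration list can be sketched in user space as a singly linked list walked through a pointer-to-pointer. This is modeled loosely on the description above, not copied from fs/filesystems.c: the toy_ names are hypothetical, there is no locking, and the error codes (-EBUSY for a duplicate name, -EINVAL for an unknown entry) are assumptions of this sketch.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct toy_fs_type {
    const char *name;
    struct toy_fs_type *next;   /* link to the next registered type */
};

static struct toy_fs_type *toy_file_systems;   /* head of the list */

/* Return the link pointing at the entry called name, or the tail link. */
static struct toy_fs_type **toy_find_filesystem(const char *name)
{
    struct toy_fs_type **p;

    for (p = &toy_file_systems; *p; p = &(*p)->next)
        if (strcmp((*p)->name, name) == 0)
            break;
    return p;
}

/* Append fs to the list unless a type with the same name exists. */
int toy_register_filesystem(struct toy_fs_type *fs)
{
    struct toy_fs_type **p = toy_find_filesystem(fs->name);

    if (*p)
        return -EBUSY;
    fs->next = NULL;
    *p = fs;
    return 0;
}

/* Unlink fs from the list. */
int toy_unregister_filesystem(struct toy_fs_type *fs)
{
    struct toy_fs_type **p;

    for (p = &toy_file_systems; *p; p = &(*p)->next) {
        if (*p == fs) {
            *p = fs->next;
            fs->next = NULL;
            return 0;
        }
    }
    return -EINVAL;
}
```

Walking via a pointer to the link, rather than a pointer to the node, means insertion at the tail and removal from the middle need no special case for the list head.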

1.6 Overview

[Figure: overview of the VFS structures described above]

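2. How files are stored on disk

The VFS structures discussed so far all live in memory. Next, let's see how a file system is organized on the disk itself.

In Linux, a hard disk appears as a device file such as /dev/sda; a block device like this represents the whole disk. Partitions subdivide the disk and are named by the disk name plus a number, such as /dev/sda1 and /dev/sda2. The kernel presents each partition as a block device, as if each partition were a whole disk. Partition information is stored in the partition table on the disk.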
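There are several kinds of partition table. A typical one is the Master Boot Record (MBR). Another, which is steadily becoming more common, is the GUID Partition Table (Globally Unique Identifier Partition Table, GPT). MBR allows at most 4 primary partitions; if more are needed, one partition is made an extended partition, which is then divided into several logical partitions.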
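
The limit of 4 primary partitions follows directly from the classic MBR layout: the 512-byte sector reserves room for exactly four 16-byte entries at offset 446, followed by the signature bytes 0x55 0xAA. Below is a small sketch of a parser for that layout. The struct and function names are made up for this example, and it ignores CHS fields and extended-partition chains.

```c
#include <stdint.h>

#define MBR_SIZE          512
#define MBR_TABLE_OFFSET  446   /* partition table starts here */
#define MBR_ENTRY_SIZE    16
#define MBR_MAX_PRIMARY   4

struct mbr_part {
    uint8_t  bootable;    /* 1 if the status byte is 0x80 */
    uint8_t  type;        /* partition type, e.g. 0x83 for Linux */
    uint32_t lba_start;   /* first sector of the partition */
    uint32_t sectors;     /* partition length in sectors */
};

/* Decode a little-endian 32-bit value. */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Parse the primary partition table of one MBR sector.
 * Returns the number of used entries, or -1 if the 0x55AA signature
 * is missing.
 */
int mbr_parse(const uint8_t sector[MBR_SIZE],
              struct mbr_part out[MBR_MAX_PRIMARY])
{
    int used = 0;

    if (sector[510] != 0x55 || sector[511] != 0xAA)
        return -1;

    for (int i = 0; i < MBR_MAX_PRIMARY; i++) {
        const uint8_t *e = sector + MBR_TABLE_OFFSET + i * MBR_ENTRY_SIZE;

        if (e[4] == 0x00)        /* type 0 marks an unused entry */
            continue;
        out[used].bootable  = (e[0] == 0x80);
        out[used].type      = e[4];
        out[used].lba_start = le32(e + 8);
        out[used].sectors   = le32(e + 12);
        used++;
    }
    return used;
}
```

To try it on a real disk image, read the first 512 bytes of the image into a buffer and pass it to mbr_parse().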

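Taking Ext2 as an example, the first part of every Ext2 partition is the boot block. This area (the first 1024 bytes of the partition) is reserved for boot code and is not managed by the Ext2 file system itself; if a boot loader is installed there, it takes part in booting the system. The boot block is followed by a sequence of consecutive block groups.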

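Super Block: the Super Block in each block group corresponds to the superblock (super_block) in the VFS architecture. It holds important information about the whole file system (not just this block group), such as the total number of inodes, the total number of blocks, and the number of inodes and blocks per block group.
Group Descriptors (GDT): describe the state of each block group in the file system, such as the number of free blocks and free inodes in the group. Every block group contains the group descriptors for all block groups in the file system.
Block Bitmap: records which blocks in Data Blocks are in use and which are free.
inode Bitmap: records whether the inode number mapped to each position is in use; the bit is 1 if the inode is in use and 0 otherwise. Each bit corresponds to one slot in the inode table.
inode Table: contains all inodes of the block group. An inode holds all the metadata for a file or directory in the file system.
Data Blocks: the data blocks, which hold the actual file contents.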
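
Given this layout, finding an inode on disk is plain arithmetic. Ext2 inode numbers start at 1, so inode ino lives in block group (ino - 1) / inodes_per_group, at slot (ino - 1) % inodes_per_group of that group's inode table. The sketch below computes this; the struct and function names are invented for the example, and the parameters would normally come from the superblock.

```c
#include <stdint.h>

struct ext2_inode_loc {
    uint32_t group;         /* block group holding the inode */
    uint32_t index;         /* slot inside that group's inode table */
    uint32_t table_block;   /* block offset within the inode table */
    uint32_t byte_offset;   /* byte offset within that block */
};

/* Locate inode number ino (numbering starts at 1). */
struct ext2_inode_loc ext2_locate_inode(uint32_t ino,
                                        uint32_t inodes_per_group,
                                        uint32_t inode_size,
                                        uint32_t block_size)
{
    struct ext2_inode_loc loc;
    uint32_t byte_in_table;

    loc.group = (ino - 1) / inodes_per_group;
    loc.index = (ino - 1) % inodes_per_group;

    byte_in_table   = loc.index * inode_size;
    loc.table_block = byte_in_table / block_size;
    loc.byte_offset = byte_in_table % block_size;
    return loc;
}
```

For example, with 8192 inodes per group, 128-byte inodes, and 1024-byte blocks, the root directory (inode 2) is slot 1 of group 0, at byte 128 of the first inode table block.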

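Each block group carries a lot of repeated information, such as the superblock and the group descriptor table. Both are global and very important. This layout is chosen for two reasons:

If a system crash corrupted the superblock or the group descriptors, all information about the structure and contents of the file system would be lost. With redundant copies, that information can possibly be recovered.
Keeping files and their management data close together reduces head seeks and rotational delay, which improves file system performance.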

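3. Mounting file systems

Every file system has to be mounted before it can be used.

What is mounting for? The Linux system already has its own directory tree. Mounting attaches the dentry tree of the new file system to the existing dentry tree and links the two, so the new file system's dentry tree can be reached by walking the existing tree. That is what mount does.

The mount command takes a source file system and a destination file system, plus a directory in the destination file system. The source file system is mounted on that directory, which is called the mount point. The source file system is usually specified by a device name: the device on which the source file system resides.

First, a look at vfsmount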

struct vfsmount {
	struct dentry *mnt_root;	/* root of the mounted tree */
	struct super_block *mnt_sb;	/* pointer to superblock */
	int mnt_flags;
	struct user_namespace *mnt_userns;
} __randomize_layout;

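A vfsmount structure describes the mount information of one independent file system. Each mount point has its own vfsmount structure, and all directories and files of the same file system belong to the same vfsmount, which corresponds to the top-level directory of that file system, i.e. the mounted directory.

Mounting is performed through the sys_mount system call, which in turn calls do_mount. The overall call chain is: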

do_mount->path_mount->do_new_mount->do_new_mount_fc->…

long do_mount(const char *dev_name, const char __user *dir_name,
		const char *type_page, unsigned long flags, void *data_page)
{
	struct path path;
	int ret;

	ret = user_path_at(AT_FDCWD, dir_name, LOOKUP_FOLLOW, &path);
	if (ret)
		return ret;
	ret = path_mount(dev_name, &path, type_page, flags, data_page);
	path_put(&path);
	return ret;
}
...

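The code is defined in fs/namespace.c in the kernel tree.

4. Opening a file

In the earliest Linux kernel, version 0.01, open files were tracked very simply: they were stored directly in a fixed-size array of struct file * inside struct task_struct.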

struct task_struct {
        //...
        struct file * filp[NR_OPEN];
        fd_set close_on_exec;
        // ...
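};

This is simple, but it means a process cannot have more than NR_OPEN files open.

Starting with Linux 1.1.11, fs_struct, files_struct, and mm_struct were split out of task_struct, and in 1.3.22 they became pointers. That is already very close to today's task_struct, but fd in files_struct was still a fixed-size array.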

// include/linux/sched.h of linux-1.3.21

/* Open file table structure */
struct files_struct {
        int count;
        fd_set close_on_exec;
        struct file * fd[NR_OPEN];
};

struct task_struct {
        // ...

/* filesystem information */
        struct fs_struct *fs;
/* open file information */
        struct files_struct *files;
/* memory management info */
        struct mm_struct *mm;

        // ...
};

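Version 2.1.90 changed the fd member of files_struct from a fixed-size array to a dynamic array, so a process could have many files open at once. This removed a major obstacle to writing highly concurrent network servers.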

/*
 * Open file table structure
 */
struct files_struct {
        atomic_t count;
+       int max_fds;
+       struct file ** fd;      /* current fd array */
        fd_set close_on_exec;   // changed to fd_set* in 2.2.12
        fd_set open_fds;
-       struct file * fd[NR_OPEN];
};
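
The idea behind the dynamic fd array can be sketched in user space: keep a pointer to a heap-allocated array plus its capacity (max_fds), hand out the lowest free slot, and grow the array when it is full. This is a simplified illustration, not the kernel's implementation; all toy_ names are hypothetical, there is no locking or RCU, and the doubling growth policy is an assumption of the sketch.

```c
#include <stdlib.h>
#include <string.h>

struct toy_file { int unused; };

struct toy_files {
    int max_fds;             /* current capacity of the fd array */
    struct toy_file **fd;    /* dynamically sized fd array */
};

/* Allocate the initial array; initial must be greater than 0. */
int toy_files_init(struct toy_files *files, int initial)
{
    files->fd = calloc((size_t)initial, sizeof(*files->fd));
    if (!files->fd)
        return -1;
    files->max_fds = initial;
    return 0;
}

/* Double the capacity, copying the existing entries. */
static int toy_expand(struct toy_files *files)
{
    int new_max = files->max_fds * 2;
    struct toy_file **new_fd = calloc((size_t)new_max, sizeof(*new_fd));

    if (!new_fd)
        return -1;
    memcpy(new_fd, files->fd, (size_t)files->max_fds * sizeof(*new_fd));
    free(files->fd);
    files->fd = new_fd;
    files->max_fds = new_max;
    return 0;
}

/* Store f in the lowest free slot and return its index (the "fd"). */
int toy_fd_install(struct toy_files *files, struct toy_file *f)
{
    int fd;

    for (fd = 0; fd < files->max_fds; fd++)
        if (!files->fd[fd])
            break;
    if (fd == files->max_fds && toy_expand(files) < 0)
        return -1;
    files->fd[fd] = f;
    return fd;
}

/* Free a slot so it can be reused. */
void toy_fd_close(struct toy_files *files, int fd)
{
    if (fd >= 0 && fd < files->max_fds)
        files->fd[fd] = NULL;
}
```

With a fixed array, the third install into a two-slot table would simply fail; here the table grows to four slots instead, and a closed descriptor number is reused by the next install.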

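Version 2.6.14 introduced struct fdtable as an indirect member of files_struct and moved members such as fd, max_fds, and close_on_exec into fdtable. This makes it easier to use RCU, since the whole fdtable can be swapped out at once. The same design is still in use today.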

// include/linux/fdtable.h of linux-2.6.37

struct fdtable {
        unsigned int max_fds;
        struct file __rcu **fd;      /* current fd array */
        fd_set *close_on_exec;
        fd_set *open_fds;
        struct rcu_head rcu;
        struct fdtable *next;
};

/*
 * Open file table structure
 */
struct files_struct {
  /*
   * read mostly part
   */
        atomic_t count;
        struct fdtable __rcu *fdt;
        struct fdtable fdtab;
  /*
   * written part on a separate cache line in SMP
   */
        spinlock_t file_lock ____cacheline_aligned_in_smp;
        int next_fd;
        struct embedded_fd_set close_on_exec_init;
        struct embedded_fd_set open_fds_init;
        struct file __rcu * fd_array[NR_OPEN_DEFAULT];
};

Opening a file goes through the sys_open system call. The call chain is:

sys_open->do_sys_open->do_filp_open->path_openat->link_path_walk->walk_component->…

SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
{
	if (force_o_largefile())
		flags |= O_LARGEFILE;

	return do_sys_open(AT_FDCWD, filename, flags, mode);
}

long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
	struct open_flags op;
	// validate the flags argument and normalize the flags/mode combinations
	int fd = build_open_flags(flags, mode, &op);
	struct filename *tmp;

	if (fd)
		return fd;
	// copy the file path from user space into kernel space
	tmp = getname(filename);
	if (IS_ERR(tmp))
		return PTR_ERR(tmp);
	// find a free slot in the current process's open file table (files_struct); its index is the file descriptor
	fd = get_unused_fd_flags(flags);
	if (fd >= 0) {
		// resolve the path and create the struct file that represents this open-file context
		struct file *f = do_filp_open(dfd, tmp, &op);
		if (IS_ERR(f)) {
			// on error, release the fd that was reserved above
			put_unused_fd(fd);
			// return the error code from do_filp_open to the caller
			fd = PTR_ERR(f);
		} else {
			fsnotify_open(f);
			// store the struct file pointer in the current process's open file table
			fd_install(fd, f);
		}
	}
	putname(tmp);
	return fd;
}

// holds the state of the current directory during path traversal
struct nameidata {
    struct path path;
    struct qstr last;
    struct path root;
    struct inode    *inode; /* path.dentry.d_inode */
    unsigned int    flags;
    unsigned    seq;
    int     last_type;
    unsigned    depth;
    char *saved_names[MAX_NESTED_LINKS + 1];
};

struct file *do_filp_open(int dfd, struct filename *pathname,
		const struct open_flags *op)
{
	struct nameidata nd;
	int flags = op->lookup_flags;
	struct file *filp;
	
	// initialize nd
	set_nameidata(&nd, dfd, pathname);
	// first try the RCU lookup (rcu-walk)
	filp = path_openat(&nd, op, flags | LOOKUP_RCU);
	if (unlikely(filp == ERR_PTR(-ECHILD)))
		// fall back to the reference-counted lookup (ref-walk)
		filp = path_openat(&nd, op, flags);
	if (unlikely(filp == ERR_PTR(-ESTALE)))
		filp = path_openat(&nd, op, flags | LOOKUP_REVAL);
	restore_nameidata();
	return filp;
}

static struct file *path_openat(struct nameidata *nd,
			const struct open_flags *op, unsigned flags)
{
	struct file *file;
	int error;
	// allocate an empty struct file and record the open flags in it
	file = alloc_empty_file(op->open_flag, current_cred());
	if (IS_ERR(file))
		return file;
	// the caller asked for a temporary file
	if (unlikely(file->f_flags & __O_TMPFILE)) {
		error = do_tmpfile(nd, flags, op, file);
	} else if (unlikely(file->f_flags & O_PATH)) {
		error = do_o_path(nd, flags, file);
	} else {
		// initialize the walk: pick the starting directory and set up the path member of nameidata
		const char *s = path_init(nd, flags);
		// link_path_walk resolves every component of the path except the last one;
		// do_last resolves the last component and opens the file
		while (!(error = link_path_walk(s, nd)) &&
			(error = do_last(nd, file, op)) > 0) {
			// if the last component is a symbolic link, trailing_symlink reads the link's
			// contents as the new path and the while loop resolves that path next
			nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
			s = trailing_symlink(nd);
		}
		terminate_walk(nd);
	}
	if (likely(!error)) {
		if (likely(file->f_mode & FMODE_OPENED))
			return file;
		WARN_ON(1);
		error = -EINVAL;
	}
	fput(file);
	if (error == -EOPENSTALE) {
		if (flags & LOOKUP_RCU)
			error = -ECHILD;
		else
			error = -ESTALE;
	}
	return ERR_PTR(error);
}

/*
 * link_path_walk() resolves the path one level at a time until it reaches the
 * last component, and stores that final name in the last member of nameidata
 * so that do_last() can perform the actual open
 */
static int link_path_walk(const char *name, struct nameidata *nd)
{
	int depth = 0; // depth <= nd->depth
	int err;

	nd->last_type = LAST_ROOT;
	nd->flags |= LOOKUP_PARENT;
	if (IS_ERR(name))
		return PTR_ERR(name);
	// skip any leading '/' characters
	while (*name=='/')
		name++;
	// the path consisted only of '/', so the walk is done
	if (!*name) {
		nd->dir_mode = 0; // short-circuit the 'hardening' idiocy
		return 0;
	}

	/* At this point we know we have a real path component. */
	for(;;) {
		struct user_namespace *mnt_userns;
		const char *link;
		u64 hash_len;
		int type;
		
		mnt_userns = mnt_user_ns(nd->path.mnt);
		err = may_lookup(mnt_userns, nd);
		if (err)
			return err;
		
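		// compute hash_len for the current path component: the upper 4 bytes hold the length
		// of the component name and the lower 4 bytes hold its hash. The hash is computed
		// relative to the parent dentry (nd->path.dentry), so it is tied to that directory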
		hash_len = hash_name(nd->path.dentry, name);

		type = LAST_NORM;
		if (name[0] == '.') switch (hashlen_len(hash_len)) {
			case 2:
				// the component is ".."
				if (name[1] == '.') {
					type = LAST_DOTDOT;
					nd->state |= ND_JUMPED;
				}
				break;
			case 1:
				// the component is "."
				type = LAST_DOT;
		}
		if (likely(type == LAST_NORM)) {
			struct dentry *parent = nd->path.dentry;
			nd->state &= ~ND_JUMPED;
			if (unlikely(parent->d_flags & DCACHE_OP_HASH)) {
				struct qstr this = { { .hash_len = hash_len }, .name = name };
				err = parent->d_op->d_hash(parent, &this);
				if (err < 0)
					return err;
				hash_len = this.hash_len;
				name = this.name;
			}
		}
		// record the current component in nd->last
		nd->last.hash_len = hash_len;
		nd->last.name = name;
		nd->last_type = type;

		name += hashlen_len(hash_len);
		if (!*name)
			goto OK;
		/*
		 * If it wasn't NUL, we know it was '/'. Skip that
		 * slash, and continue until no more slashes.
		 */
		do {
			name++;
		} while (unlikely(*name == '/'));
		if (unlikely(!*name)) {
OK:
			/* pathname or trailing symlink, done */
			if (!depth) {
				nd->dir_uid = i_uid_into_mnt(mnt_userns, nd->inode);
				nd->dir_mode = nd->inode->i_mode;
				nd->flags &= ~LOOKUP_PARENT;
				return 0;
			}
			/* last component of nested symlink */
			name = nd->stack[--depth].name;
			link = walk_component(nd, 0);
		} else {
			/* not the last component */
			link = walk_component(nd, WALK_MORE);
		}
		if (unlikely(link)) {
			if (IS_ERR(link))
				return PTR_ERR(link);
			/* a symlink to follow */
			nd->stack[depth++].name = name;
			name = link;
			continue;
		}
		if (unlikely(!d_can_lookup(nd->path.dentry))) {
			if (nd->flags & LOOKUP_RCU) {
				if (!try_to_unlazy(nd))
					return -ECHILD;
			}
			return -ENOTDIR;
		}
	}
}

See fs/namei.c for the full code.
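
Stripped of symlinks, permissions, and the dentry cache, the string handling in link_path_walk() boils down to: skip leading slashes, cut out one component, classify it as a normal name, "." or "..", then skip the following slashes and repeat. The user-space sketch below reproduces only that part. The toy_ names are hypothetical, and the sketch merely lists the components instead of looking them up.

```c
#include <stddef.h>
#include <string.h>

enum toy_last_type { TOY_LAST_NORM, TOY_LAST_DOT, TOY_LAST_DOTDOT };

struct toy_component {
    const char *name;         /* points into the original path string */
    size_t len;               /* component length, without any '/' */
    enum toy_last_type type;
};

/*
 * Split path into components.
 * Returns the number of components, or -1 if there are more than max.
 */
int toy_split_path(const char *path, struct toy_component *out, int max)
{
    int n = 0;

    while (*path == '/')              /* skip leading slashes */
        path++;

    while (*path) {
        size_t len = strcspn(path, "/");

        if (n == max)
            return -1;
        out[n].name = path;
        out[n].len  = len;
        out[n].type = TOY_LAST_NORM;
        if (path[0] == '.') {
            if (len == 1)
                out[n].type = TOY_LAST_DOT;
            else if (len == 2 && path[1] == '.')
                out[n].type = TOY_LAST_DOTDOT;
        }
        n++;

        path += len;
        while (*path == '/')          /* collapse repeated slashes */
            path++;
    }
    return n;
}
```

For the path //usr/./lib/../bin/ this yields the five components usr, ., lib, .., and bin, while a path of just / yields none, matching the early return in the kernel function.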

5. Reading a file

Reading a file goes through the sys_read system call. The call chain is:

sys_read->ksys_read->vfs_read->new_sync_read->call_read_iter->…

SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
	return ksys_read(fd, buf, count);
}

ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
{
	struct fd f = fdget_pos(fd);
	ssize_t ret = -EBADF;

	if (f.file) {
		loff_t pos, *ppos = file_ppos(f.file);
		if (ppos) {
			pos = *ppos;
			ppos = &pos;
		}
		ret = vfs_read(f.file, buf, count, ppos);
		if (ret >= 0 && ppos)
			f.file->f_pos = pos;
		fdput_pos(f);
	}
	return ret;
}

ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
{
	ssize_t ret;

	if (!(file->f_mode & FMODE_READ))
		return -EBADF;
	if (!(file->f_mode & FMODE_CAN_READ))
		return -EINVAL;
	if (unlikely(!access_ok(buf, count)))
		return -EFAULT;

	ret = rw_verify_area(READ, file, pos, count);
	if (ret)
		return ret;
	if (count > MAX_RW_COUNT)
		count =  MAX_RW_COUNT;

	if (file->f_op->read)
		ret = file->f_op->read(file, buf, count, pos);
	else if (file->f_op->read_iter)
		ret = new_sync_read(file, buf, count, pos);
	else
		ret = -EINVAL;
	if (ret > 0) {
		fsnotify_access(file);
		add_rchar(current, ret);
	}
	inc_syscr(current);
	return ret;
}

static ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct kiocb kiocb;
	struct iov_iter iter;
	ssize_t ret;

	init_sync_kiocb(&kiocb, filp);
	kiocb.ki_pos = (ppos ? *ppos : 0);
	iov_iter_init(&iter, READ, &iov, 1, len);

	ret = call_read_iter(filp, &kiocb, &iter);
	BUG_ON(ret == -EIOCBQUEUED);
	if (ppos)
		*ppos = kiocb.ki_pos;
	return ret;
}

See fs/read_write.c for the actual code.

6. Writing a file

Writing a file goes through sys_write. The call chain is:
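
Two details of the read path are easy to miss in the listing: vfs_read() prefers f_op->read and falls back to f_op->read_iter, and ksys_read() works on a local copy of the file position that is written back to f_pos only when the read did not fail. The user-space sketch below mimics both with a memory-backed file. All toy_ names are hypothetical, and the two callbacks share one simplified signature, which the real kernel interfaces do not.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_FMODE_READ 0x1

struct toy_file;

struct toy_file_ops {
    long (*read)(struct toy_file *f, char *buf, size_t count, long *pos);
    long (*read_iter)(struct toy_file *f, char *buf, size_t count, long *pos);
};

struct toy_file {
    unsigned int f_mode;
    const struct toy_file_ops *f_op;
    const char *data;     /* backing store for this toy file */
    size_t size;
    long f_pos;
};

/* Memory-backed read: copies up to count bytes starting at *pos. */
static long toy_mem_read(struct toy_file *f, char *buf, size_t count, long *pos)
{
    size_t left;

    if (*pos < 0 || (size_t)*pos >= f->size)
        return 0;                       /* end of file */
    left = f->size - (size_t)*pos;
    if (count > left)
        count = left;
    memcpy(buf, f->data + *pos, count);
    *pos += (long)count;
    return (long)count;
}

/* Like vfs_read: check the mode, then dispatch through f_op. */
long toy_vfs_read(struct toy_file *f, char *buf, size_t count, long *pos)
{
    if (!(f->f_mode & TOY_FMODE_READ))
        return -EBADF;
    if (f->f_op->read)
        return f->f_op->read(f, buf, count, pos);
    if (f->f_op->read_iter)
        return f->f_op->read_iter(f, buf, count, pos);
    return -EINVAL;
}

/* Like ksys_read: use a local position and commit it only on success. */
long toy_sys_read(struct toy_file *f, char *buf, size_t count)
{
    long pos = f->f_pos;
    long ret = toy_vfs_read(f, buf, count, &pos);

    if (ret >= 0)
        f->f_pos = pos;
    return ret;
}
```

A file whose ops table sets only read_iter is still readable through toy_sys_read(), and a failed read leaves f_pos where it was.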

sys_write->ksys_write->vfs_write->new_sync_write->call_write_iter->…

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
		size_t, count)
{
	return ksys_write(fd, buf, count);
}

ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
{
	struct fd f = fdget_pos(fd);
	ssize_t ret = -EBADF;

	if (f.file) {
		loff_t pos, *ppos = file_ppos(f.file);
		if (ppos) {
			pos = *ppos;
			ppos = &pos;
		}
		ret = vfs_write(f.file, buf, count, ppos);
		if (ret >= 0 && ppos)
			f.file->f_pos = pos;
		fdput_pos(f);
	}

	return ret;
}

ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
{
	ssize_t ret;

	if (!(file->f_mode & FMODE_WRITE))
		return -EBADF;
	if (!(file->f_mode & FMODE_CAN_WRITE))
		return -EINVAL;
	if (unlikely(!access_ok(buf, count)))
		return -EFAULT;

	ret = rw_verify_area(WRITE, file, pos, count);
	if (ret)
		return ret;
	if (count > MAX_RW_COUNT)
		count =  MAX_RW_COUNT;
	file_start_write(file);
	if (file->f_op->write)
		ret = file->f_op->write(file, buf, count, pos);
	else if (file->f_op->write_iter)
		ret = new_sync_write(file, buf, count, pos);
	else
		ret = -EINVAL;
	if (ret > 0) {
		fsnotify_modify(file);
		add_wchar(current, ret);
	}
	inc_syscw(current);
	file_end_write(file);
	return ret;
}

static ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos)
{
	struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = len };
	struct kiocb kiocb;
	struct iov_iter iter;
	ssize_t ret;

	init_sync_kiocb(&kiocb, filp);
	kiocb.ki_pos = (ppos ? *ppos : 0);
	iov_iter_init(&iter, WRITE, &iov, 1, len);

	ret = call_write_iter(filp, &kiocb, &iter);
	BUG_ON(ret == -EIOCBQUEUED);
	if (ret > 0 && ppos)
		*ppos = kiocb.ki_pos;
	return ret;
}