Implementation of the VFS System Calls

License: CC 4.0 BY-SA

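At this point we have covered the main data structures of the Virtual Filesystem and its most important function, path_lookup(), so we can now discuss how the VFS system calls are implemented.

One blog post cannot cover every system call, so we look only at the four big ones: open, read, write, and close. Even for those we trace only the rough flow; pay attention mainly to how the VFS data structures refer to one another. The details involve the page cache, page frame reclaiming, mmap, block device drivers, and so on, and will be worked through in later posts.

Let us reconsider the example from the beginning of this unit: a user issues a shell command that copies the MS-DOS file /floppy/TEST to the Ext2 file /tmp/test. The shell invokes an external program (such as cp), which we assume executes the following code fragment: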

    inf = open("/floppy/TEST", O_RDONLY, 0);
    outf = open("/tmp/test", O_WRONLY | O_CREAT | O_TRUNC, 0600);
    do {
        len = read(inf, buf, 4096);
        write(outf, buf, len);
    } while (len);
    close(outf);
    close(inf);

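The code of the real cp program is more complicated, because it must also check the error codes that each system call may return. In our example we focus only on the "normal" behavior of the copy operation.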
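
For comparison, here is a minimal sketch of what the missing error checking might look like. The helper names write_all() and copy_file() are hypothetical and are not taken from cp; the sketch only uses the standard open(), read(), write() and close() calls, and retries after short writes and EINTR.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Write exactly len bytes, retrying after short writes and EINTR.
 * Returns 0 on success, -1 on error (errno is left as set by write()). */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            return -1;
        }
        buf += n;               /* skip what was already written */
        len -= (size_t)n;
    }
    return 0;
}

/* Copy the file src to dst. Returns 0 on success, -1 on failure. */
static int copy_file(const char *src, const char *dst)
{
    char buf[4096];
    int ret = -1;
    int inf, outf;

    inf = open(src, O_RDONLY);
    if (inf < 0)
        return -1;
    outf = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (outf < 0) {
        close(inf);
        return -1;
    }
    for (;;) {
        ssize_t len = read(inf, buf, sizeof(buf));
        if (len == 0) {         /* end of file: everything was copied */
            ret = 0;
            break;
        }
        if (len < 0) {
            if (errno == EINTR)
                continue;
            break;
        }
        if (write_all(outf, buf, (size_t)len) < 0)
            break;
    }
    if (close(outf) < 0)        /* a failing close() can report a write error */
        ret = -1;
    close(inf);
    return ret;
}
```

Calling copy_file("/floppy/TEST", "/tmp/test") would perform the copy from the example and report any failure through its return value.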

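1 The open() System Call

The open() system call is serviced by the sys_open() function, which receives as parameters the pathname filename of the file to open, some access mode flags flags, and a permission bit mask mode that is used if the file has to be created. If the system call succeeds, it returns a file descriptor, that is, the index assigned to the new file in current->files->fd, the array of pointers to file objects; otherwise, it returns -1 (sys_open() itself returns a negative error code, which the C library turns into -1 and errno).

In our example, open() is invoked twice: the first time to open /floppy/TEST for reading (O_RDONLY flag), the second time to open /tmp/test for writing (O_WRONLY flag). If /tmp/test does not exist, it is created (O_CREAT flag) with exclusive read and write access for the owner (octal number 0600 in the third parameter).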

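Conversely, if the file already exists, it is rewritten from scratch (O_TRUNC flag). The flags of the open() system call are listed below:
O_RDONLY: Open for reading
O_WRONLY: Open for writing
O_RDWR: Open for both reading and writing
O_CREAT: Create the file if it does not exist
O_EXCL: With O_CREAT, fail if the file already exists
O_NOCTTY: Never treat the file as a controlling terminal
O_TRUNC: Truncate the file (remove all existing contents)
O_APPEND: Always write at the end of the file
O_NONBLOCK: No system call will block on the file
O_NDELAY: Same as O_NONBLOCK
O_SYNC: Synchronous write (block until the physical write terminates)
FASYNC: I/O event notification via signals
O_DIRECT: Direct I/O transfer (no kernel buffering)
O_LARGEFILE: Large file (size greater than 2 GB)
O_DIRECTORY: Fail if the file is not a directory
O_NOFOLLOW: Do not follow a trailing symbolic link in the pathname
O_NOATIME: Do not update the inode's last access time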
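
As an illustration of how the flags combine, the following sketch shows O_CREAT together with O_EXCL from user space. The helper name create_exclusive() is hypothetical; the behavior it relies on (open() failing with EEXIST when the file is already present) is the one described in the list above.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Create path, but only if it does not exist yet.
 * Returns the new file descriptor, or -1 with errno set to EEXIST
 * when the file is already there.
 */
static int create_exclusive(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
}
```

A first call on a fresh pathname yields a descriptor; a second call on the same pathname fails, which makes this combination a common way to create lock files or temporary files without racing against another process.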

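The sys_open() function performs the following operations:

1. Invokes getname() to read the file pathname from the process address space; for the details, see the post "Prerequisites for Mounting Filesystems".

2. Invokes get_unused_fd() to find an empty slot in current->files->fd. The corresponding index (the new file descriptor) is stored in the fd local variable.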
int get_unused_fd(void)
{
     struct files_struct * files = current->files;
     int fd, error;
     struct fdtable *fdt;

     error = -EMFILE;
     spin_lock(&files->file_lock);

repeat:
     fdt = files_fdtable(files);
     fd = find_next_zero_bit(fdt->open_fds->fds_bits,
                    fdt->max_fdset,
                    files->next_fd);

     /*
      * N.B. For clone tasks sharing a files structure, this test
      * will limit the total number of files that can be opened.
      */
     if (fd >= current->signal->rlim[RLIMIT_NOFILE].rlim_cur)
          goto out;

     /* Do we need to expand the fd array or fd set? */
     error = expand_files(files, fd);
     if (error < 0)
          goto out;

     if (error) {
          /*
           * If we needed to expand the fs array we
           * might have blocked - try again.
           */
          error = -EMFILE;
          goto repeat;
     }

     FD_SET(fd, fdt->open_fds);
     FD_CLR(fd, fdt->close_on_exec);
     files->next_fd = fd + 1;
#if 1
     /* Sanity check */
     if (fdt->fd[fd] != NULL) {
          printk(KERN_WARNING "get_unused_fd: slot %d not NULL!\n", fd);
          fdt->fd[fd] = NULL;
     }
#endif
     error = fd;

out:
     spin_unlock(&files->file_lock);
     return error;
}
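
The core of get_unused_fd() is a search for the first clear bit in the open_fds bitmap, starting at the next_fd hint. The toy model below sketches that idea in user space. The struct fd_model type, the 64-entry limit and both function names are assumptions made for the illustration, and locking, resource limits and table expansion are left out.

```c
#define MODEL_MAX_FDS 64   /* hypothetical table size for this model */

struct fd_model {
    unsigned long long open_fds; /* bit i set => descriptor i is in use */
    unsigned int next_fd;        /* search hint, like files->next_fd */
};

/* Allocate the first free descriptor at or after the hint, or return -1. */
static int model_get_unused_fd(struct fd_model *m)
{
    unsigned int fd;

    for (fd = m->next_fd; fd < MODEL_MAX_FDS; fd++) {
        if (!(m->open_fds & (1ULL << fd))) {
            m->open_fds |= 1ULL << fd;  /* counterpart of FD_SET() */
            m->next_fd = fd + 1;
            return (int)fd;
        }
    }
    return -1;                          /* table is full */
}

/* Release a descriptor and pull the hint back so the slot is reused. */
static void model_put_unused_fd(struct fd_model *m, unsigned int fd)
{
    m->open_fds &= ~(1ULL << fd);
    if (fd < m->next_fd)
        m->next_fd = fd;
}
```

Starting with descriptors 0, 1 and 2 in use and the hint at 3, two allocations return 3 and 4; after descriptor 3 is released, the next allocation returns 3 again, so the lowest free descriptor is always handed out.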

3. Invokes do_filp_open(), passing it the pathname, the access mode flags, and the permission bit mask.
static struct file *do_filp_open(int dfd, const char *filename, int flags,
                     int mode)
{
     int namei_flags, error;
     struct nameidata nd;

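     /*
      * Copy the access mode flags into namei_flags, but encode the
      * O_RDONLY, O_WRONLY and O_RDWR access mode in a special format:
      * bit 0 (the least significant bit) of namei_flags is set only if
      * the file access requires read permission; similarly, bit 1 is
      * set only if the file access requires write permission.
      * Note that open() cannot specify that a file access requires
      * neither read nor write permission; that case does make sense,
      * however, in a pathname lookup involving symbolic links.
      */
     namei_flags = flags;
     if ((namei_flags+1) & O_ACCMODE)
          namei_flags++;
     /*
      * Invoke open_namei(), passing it dfd (AT_FDCWD), the pathname,
      * the modified access mode flags, and the address of the local
      * nameidata structure.
      */
     error = open_namei(dfd, filename, namei_flags, mode, &nd);
     if (!error)
          return nameidata_to_filp(&nd, flags);

     return ERR_PTR(error);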
}
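
The effect of the (namei_flags+1) & O_ACCMODE test is easiest to see on concrete values. The sketch below repeats the arithmetic in user space; the M_-prefixed constants are copies of the usual Linux values (O_RDONLY = 0, O_WRONLY = 1, O_RDWR = 2, O_ACCMODE = 3), and encode_acc_mode() is a hypothetical name for this illustration.

```c
/* Usual Linux values, repeated so the sketch is independent of <fcntl.h>. */
#define M_O_RDONLY  00
#define M_O_WRONLY  01
#define M_O_RDWR    02
#define M_O_ACCMODE 03

#define M_READ  1   /* bit 0: read permission is needed */
#define M_WRITE 2   /* bit 1: write permission is needed */

/* Return the two low bits of namei_flags as do_filp_open() computes them. */
static int encode_acc_mode(int flags)
{
    int namei_flags = flags;

    if ((namei_flags + 1) & M_O_ACCMODE)
        namei_flags++;
    return namei_flags & M_O_ACCMODE;
}
```

With these values, O_RDONLY maps to M_READ, O_WRONLY to M_WRITE, and O_RDWR to M_READ | M_WRITE; higher flag bits such as O_CREAT do not affect the result. The same (flags+1) & O_ACCMODE expression is used by __dentry_open() when it fills in f_mode.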

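do_filp_open() calls open_namei(), which proceeds as follows:

If O_CREAT is not set in the access mode flags, the lookup operation starts with the LOOKUP_PARENT flag cleared and the LOOKUP_OPEN flag set. Moreover, the LOOKUP_FOLLOW flag is set only if O_NOFOLLOW is clear, and the LOOKUP_DIRECTORY flag is set only if O_DIRECTORY is set.

If O_CREAT is set in the access mode flags, the lookup operation starts with the LOOKUP_PARENT, LOOKUP_OPEN, and LOOKUP_CREATE flags set. Once path_lookup() returns successfully, open_namei() checks whether the requested file already exists. If it does not, it invokes the create method of the parent inode to allocate a new disk inode.

open_namei() also performs several security checks on the file located by the lookup operation. For instance, it checks whether the inode associated with the dentry object that was found really exists, whether it is a regular file, and whether the current process is allowed to access it according to the access mode flags. If the file is also opened for writing, the function checks that it is not locked by another process.

To perform the lookup, open_namei() calls path_lookup_open() or path_lookup_create(), which store the dentry corresponding to filename in the local nameidata structure nd. Both functions end up in __path_lookup_intent_open():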
static int __path_lookup_intent_open(int dfd, const char *name,
          unsigned int lookup_flags, struct nameidata *nd,
          int open_flags, int create_mode)
{
     /* Allocate a file object and initialize some of its fields, e.g. the owner's uid and gid */
     struct file *filp = get_empty_filp();
     int err;

     if (filp == NULL)
          return -ENFILE;
     nd->intent.open.file = filp;
     nd->intent.open.flags = open_flags;
     nd->intent.open.create_mode = create_mode;
     err = do_path_lookup(dfd, name, lookup_flags|LOOKUP_OPEN, nd);
     if (IS_ERR(nd->intent.open.file)) {
          if (err == 0) {
               err = PTR_ERR(nd->intent.open.file);
               path_release(nd);
          }
     } else if (err != 0)
          release_open_intent(nd);
     return err;
}

struct file *get_empty_filp(void)
{
     struct task_struct *tsk;
     static int old_max;
     struct file * f;

     /*
      * Privileged users can go above max_files
      */
     if (get_nr_files() >= files_stat.max_files && !capable(CAP_SYS_ADMIN)) {
          /*
           * percpu_counters are inaccurate. Do an expensive check before
           * we go and fail.
           */
          if (percpu_counter_sum(&nr_files) >= files_stat.max_files)
               goto over;
     }

     f = kmem_cache_alloc(filp_cachep, GFP_KERNEL);
     if (f == NULL)
          goto fail;

     percpu_counter_inc(&nr_files);
     memset(f, 0, sizeof(*f));
     if (security_file_alloc(f))
          goto fail_sec;

     tsk = current;
     INIT_LIST_HEAD(&f->f_u.fu_list);
     atomic_set(&f->f_count, 1);
     rwlock_init(&f->f_owner.lock);
     f->f_uid = tsk->fsuid;
     f->f_gid = tsk->fsgid;
     eventpoll_init_file(f);
     /* f->f_version: 0 */
     return f;

over:
     /* Ran out of filps - report that */
     if (get_nr_files() > old_max) {
          printk(KERN_INFO "VFS: file-max limit %d reached\n",
                         get_max_files());
          old_max = get_nr_files();
     }
     goto fail;

fail_sec:
     file_free(f);
fail:
     return NULL;
}

After open_namei() returns, do_filp_open() calls nameidata_to_filp(&nd, flags) to turn nd into a file object:
struct file *nameidata_to_filp(struct nameidata *nd, int flags)
{
     struct file *filp;

     /* Pick up the filp from the open intent */
     filp = nd->intent.open.file;
     /* Has the filesystem initialised the file for us? */
     if (filp->f_dentry == NULL)
          filp = __dentry_open(nd->dentry, nd->mnt, flags, filp, NULL);
     else
          path_release(nd);
     return filp;
}

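Note that nameidata_to_filp() does not need to allocate the file object: it was already allocated inside open_namei(), when __path_lookup_intent_open() ran, and nd->intent.open.file points to it.

If the filesystem has not already initialized the file (filp->f_dentry is NULL), nameidata_to_filp() calls __dentry_open(), passing it the access mode flags, the address of the dentry object, and the mounted filesystem object located by the lookup operation: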
static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
                         int flags, struct file *f,
                         int (*open)(struct inode *, struct file *))
{
     struct inode *inode;
     int error;

     f->f_flags = flags;
     f->f_mode = ((flags+1) & O_ACCMODE) | FMODE_LSEEK |
                    FMODE_PREAD | FMODE_PWRITE;
     inode = dentry->d_inode;
     if (f->f_mode & FMODE_WRITE) {
          error = get_write_access(inode);
          if (error)
               goto cleanup_file;
     }

     f->f_mapping = inode->i_mapping;
     f->f_dentry = dentry;
     f->f_vfsmnt = mnt;
     f->f_pos = 0;
     f->f_op = fops_get(inode->i_fop);
     file_move(f, &inode->i_sb->s_files);

     if (!open && f->f_op)
          open = f->f_op->open;
     if (open) {
          error = open(inode, f);
          if (error)
               goto cleanup_all;
     }

     f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);

     file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);

     /* NB: we're sure to have correct a_ops only after f_op->open */
     if (f->f_flags & O_DIRECT) {
          if (!f->f_mapping->a_ops ||
              ((!f->f_mapping->a_ops->direct_IO) &&
              (!f->f_mapping->a_ops->get_xip_page))) {
               fput(f);
               f = ERR_PTR(-EINVAL);
          }
     }

     return f;

cleanup_all:
     fops_put(f->f_op);
     if (f->f_mode & FMODE_WRITE)
          put_write_access(inode);
     file_kill(f);
     f->f_dentry = NULL;
     f->f_vfsmnt = NULL;
cleanup_file:
     put_filp(f);
     dput(dentry);
     mntput(mnt);
     return ERR_PTR(error);
}

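The function performs the following operations in order:
i. Initializes the f_flags and f_mode fields of the file object according to the access mode flags passed to the open() system call.
ii. Initializes the f_dentry and f_vfsmnt fields of the file object according to the address of the dentry object and the address of the mounted filesystem object passed as parameters.
iii. The key step: sets the f_op field to the contents of the i_fop field of the corresponding inode object. This sets up all the methods for later file operations.
iv. Inserts the file object into the list of opened files pointed to by the s_files field of the filesystem's superblock.
v. If the open method of the file operations is defined, invokes it.
vi. Invokes file_ra_state_init() to initialize the read-ahead data structures (see Chapter 16).
vii. If the O_DIRECT flag is set, checks whether direct I/O operations can be performed on the file (see Chapter 16).
viii. Returns the address of the file object.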
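
Step iii is what makes the VFS filesystem-independent: once f_op has been copied from the inode, every later operation is an indirect call through that table. The following user-space model is only a sketch of that dispatch. All m_-prefixed types, the two fs_a/fs_b "filesystems" and the function names are hypothetical; the real structures carry many more fields and methods.

```c
#include <stddef.h>

struct m_file;

/* Reduced method table: only a read method. */
struct m_file_operations {
    long (*read)(struct m_file *f, char *buf, size_t count);
};

struct m_inode {
    const struct m_file_operations *i_fop; /* set by the filesystem */
};

struct m_file {
    const struct m_file_operations *f_op;  /* copied from the inode at open */
    long f_pos;
};

/* Two toy filesystems whose read methods fill the buffer differently. */
static long fs_a_read(struct m_file *f, char *buf, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++)
        buf[i] = 'A';
    f->f_pos += (long)count;
    return (long)count;
}

static long fs_b_read(struct m_file *f, char *buf, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++)
        buf[i] = 'B';
    f->f_pos += (long)count;
    return (long)count;
}

static const struct m_file_operations fs_a_fops = { fs_a_read };
static const struct m_file_operations fs_b_fops = { fs_b_read };

/* Model of steps i-iii: reset the position and install the methods. */
static void m_dentry_open(struct m_file *f, const struct m_inode *inode)
{
    f->f_pos = 0;
    f->f_op = inode->i_fop;
}

/* Generic read: the caller never needs to know which filesystem it is. */
static long m_vfs_read(struct m_file *f, char *buf, size_t count)
{
    if (!f->f_op || !f->f_op->read)
        return -1;              /* no method installed */
    return f->f_op->read(f, buf, count);
}
```

Opening one inode that points to fs_a_fops and another that points to fs_b_fops, then calling m_vfs_read() on both files, runs two different read methods through the same generic entry point.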

Finally, do_filp_open() returns the address of the file object.

4. Back in do_sys_open(), sets current->files->fd[fd] to the address of the file object returned by do_filp_open():
fsnotify_open(f->f_dentry);
fd_install(fd, f);

void fastcall fd_install(unsigned int fd, struct file * file)
{
     struct files_struct *files = current->files;
     struct fdtable *fdt;
     spin_lock(&files->file_lock);
     fdt = files_fdtable(files);
     BUG_ON(fdt->fd[fd] != NULL);
     rcu_assign_pointer(fdt->fd[fd], file);
     spin_unlock(&files->file_lock);
}

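5. Returns fd.

2 The read() and write() System Calls

Let us return to the code of our cp example. The open() system calls return two file descriptors, which are stored in the inf and outf variables. The program then starts a loop. In each iteration, a portion of the /floppy/TEST file is copied into a local buffer (the read() system call), and the data in that buffer is then copied into the /tmp/test file (the write() system call).

The read() and write() system calls are quite similar. Both require three parameters: a file descriptor fd, the address buf of a memory area (the buffer holding the data to be transferred), and a number count that specifies how many bytes should be transferred. Of course, read() transfers data from the file into the buffer, while write() does the opposite. Both system calls return the number of bytes that were successfully transferred, or signal an error condition by returning -1.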

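A return value smaller than count does not mean that an error occurred. The kernel is always allowed to terminate the system call even if not all requested bytes were transferred, so the user application must check the return value and reissue the system call if necessary.

Typically, a value smaller than count is returned when reading from a pipe or a terminal device, when reading past the end of the file, or when the system call is interrupted by a signal. The end-of-file (EOF) condition is easily recognized by a zero return value from read(). It cannot be confused with an abnormal termination caused by a signal, because read() fails with an error if a signal interrupts it before any data has been read.

The read or write operation always takes place at the file offset given by the current file pointer (the f_pos field of the file object). Both system calls update the file pointer by adding the number of transferred bytes to it.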
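
The file pointer update can be observed from user space. The sketch below assumes a regular file and uses lseek() with SEEK_CUR and a zero offset, which reports the current position without moving it; current_offset() is a hypothetical helper name.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Report the current file offset of fd without changing it. */
static off_t current_offset(int fd)
{
    return lseek(fd, 0, SEEK_CUR);
}
```

After writing 10 bytes to a newly created file, current_offset() reports 10; after rewinding with lseek(fd, 0, SEEK_SET) and reading 4 bytes, it reports 4. Once the offset reaches the end of the file, read() returns 0, which is the EOF condition described above.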

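In short, sys_read() (the service routine of read()) and sys_write() (the service routine of write()) perform almost the same steps:

1. Invokes fget_light() to derive from fd the address file of the corresponding file object of the current process: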
struct file fastcall *fget_light(unsigned int fd, int *fput_needed)
{
     struct file *file;
     struct files_struct *files = current->files;
……
     file = fcheck_files(files, fd);
……
     return file;
}
static inline struct file * fcheck_files(struct files_struct *files, unsigned int fd)
{
     struct file * file = NULL;
     struct fdtable *fdt = files_fdtable(files);

     if (fd < fdt->max_fds)
          file = rcu_dereference(fdt->fd[fd]);
     return file;
}

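2. If the flags in file->f_mode do not allow the requested access (a read or a write operation), returns the error code -EBADF.

3. If the file object has no read() or aio_read() (write() or aio_write()) file operation, returns the error code -EINVAL.

4. Invokes access_ok() to perform a coarse check on the buf and count parameters (see the post "Prerequisites for Mounting Filesystems").

5. Invokes rw_verify_area() to check whether there are conflicting mandatory locks on the file portion to be accessed. If so, it returns an error code or, if the lock was requested with the F_SETLKW command, suspends the current process.

6. Invokes the file->f_op->read or file->f_op->write method, if defined, to transfer the data; otherwise, invokes the file->f_op->aio_read or file->f_op->aio_write method. All these methods return the number of bytes that were actually transferred. As a side effect, the file pointer is properly updated.

7. Invokes fput_light() to release the file object.

8. Returns the number of bytes actually transferred.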
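
The check in step 2 is visible from user space: a read() on a descriptor that was opened write-only fails, and errno is set to EBADF. The sketch below demonstrates this; the helper name and the choice of pathname are assumptions made for the illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Open path write-only and attempt a one-byte read() on it.
 * Returns the errno set by read(), 0 if read() unexpectedly succeeds,
 * or -1 if the file could not be opened at all.
 */
static int errno_of_read_on_write_only(const char *path)
{
    char c;
    int err = 0;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    if (read(fd, &c, 1) < 0)
        err = errno;    /* the access mode does not permit reading */
    close(fd);
    unlink(path);       /* remove the scratch file again */
    return err;
}
```

On Linux, calling this helper with the pathname of a scratch file in a writable directory returns EBADF, the user-space view of the -EBADF error code from step 2.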