xv6 RISC-V file system notes

License: CC 4.0 BY-SA

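This post looks closely at inodes in the xv6 RISC-V file system: the on-disk and in-memory inode structures, the locking around inodes, and how inodes are obtained, allocated and released. It covers the role of the icache, in particular how it synchronizes access to an inode from multiple processes, and also how the file descriptor layer works.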

The disk layer is not covered here.

The buffer cache and logging layers each have their own post already,

so this post starts from the inode layer:

inode

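First, the layout of the file system on disk (the original figure is missing; in xv6 the regions are, in order: boot block, superblock, log, inodes, bitmap, data blocks):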

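"inode" has two meanings: the on-disk inode and the in-memory inode. The on-disk inode structure is shown below; the array of dinodes lives in the inodes region of that layout.

All dinodes have the same size, so given a number n (which is simply the inum) it is easy to locate the n-th inode.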

// On-disk inode structure
struct dinode {
  short type;           // File type
  short major;          // Major device number (T_DEVICE only)
  short minor;          // Minor device number (T_DEVICE only)
  short nlink;          // Number of links to inode in file system
  uint size;            // Size of file (bytes)
  uint addrs[NDIRECT+1];   // Data block addresses
                        
};

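The comments explain the purpose of the first few fields well. nlink counts the hard links that point to this inode (including the file's own name).

addrs holds the addresses of the data blocks that store the inode's content.

The kernel keeps active inodes, that is, inodes that some pointer refers to, in memory. Compared with the on-disk inode, the in-memory inode has a few extra fields: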

// in-memory copy of an inode
struct inode {
  uint dev;           // Device number
  uint inum;          // Inode number
  int ref;            // Reference count
  struct sleeplock lock; // protects everything below here
  int valid;          // inode has been read from disk?

  short type;         // copy of disk inode
  short major;
  short minor;
  short nlink;
  uint size;
  uint addrs[NDIRECT+1];
};

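dev and inum identify which on-disk inode (dinode) this in-memory inode corresponds to.

ref is the number of pointers that refer to this inode.

Operations on an inode can take a long time, so the lock here is a sleep-lock.

valid indicates whether this inode currently holds the data of its dinode (the dinode's fields).

The inode code uses four locks or lock-like mechanisms (the latter being counts) to protect the following invariants:

icache.lock: guarantees that (1) an on-disk inode has at most one entry in the cache, and (2) the ref field of an in-memory inode counts the pointers that refer to it.
inode.lock: guarantees exclusive access to the inode.
ref: while an inode's ref is greater than 0, its cache entry is not evicted, i.e. not recycled for another inode.
nlink: while an inode's nlink is greater than 0, the inode is not freed.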

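Update, 2020-11-09:

The comment at the top of the inode code in fs.c has a passage describing icache.lock and ip->lock that I had overlooked:

ip->ref (the number of references to this inode, including open files and current working directories) tells whether an inode entry should stay in the icache.

ip->dev and ip->inum identify which inode an icache entry holds.

Putting the two together: icache.lock must be held while using any of these three fields.

ip->lock protects every field other than those three.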

// Find the inode with number inum on device dev
// and return the in-memory copy. Does not lock
// the inode and does not read it from disk.
static struct inode*
iget(uint dev, uint inum)
{
  struct inode *ip, *empty;

  acquire(&icache.lock);

  // Is the inode already cached?
  empty = 0;
  for(ip = &icache.inode[0]; ip < &icache.inode[NINODE]; ip++){
    if(ip->ref > 0 && ip->dev == dev && ip->inum == inum){
      ip->ref++;
      release(&icache.lock);
      return ip;
    }
    if(empty == 0 && ip->ref == 0)    // Remember empty slot.
      empty = ip;
  }

  // Recycle an inode cache entry.
  if(empty == 0)
    panic("iget: no inodes");

  ip = empty;
  ip->dev = dev;
  ip->inum = inum;
  ip->ref = 1;
  ip->valid = 0;
  release(&icache.lock);

  return ip;
}

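Given dev (the device, i.e. the disk number) and inum, iget looks the inode up in the icache. If it is already cached, iget increments ref and returns a pointer to that inode.

If it is not cached, iget looks for an entry whose ref is 0, which amounts to finding an empty slot to recycle, and uses that cache entry for the inode being looked up. At this point the cache entry (the in-memory inode) does not yet hold the data of the on-disk inode identified by dev and inum, so valid is 0.

Before using the inode, the caller must call ilock. This both acquires the inode's lock for exclusive use and reads the contents of the corresponding dinode from disk into memory.

One point worth noting: iget and locking (ilock) are deliberately separate, and iget returns an unlocked inode. This matters in dirlookup, which calls iget to get the inode for a matching dirent. A directory's entries include . and .., i.e. the directory itself and its parent, and the caller of dirlookup already holds the directory's lock, so if iget also took the lock, the lookup could deadlock.

TODO: this part may need more detail.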
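To make the split between iget and ilock concrete, here is a user-space toy model of the two steps. It is only a sketch: there are no real locks and no disk, and every `toy_` name is made up for illustration rather than taken from xv6.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_NINODE = 4 };            // hypothetical cache size for the demo

struct toy_inode {
  unsigned dev, inum;
  int ref;                          // number of pointers handed out
  int valid;                        // has the "disk" copy been loaded?
  short type;
};

static struct toy_inode toy_cache[TOY_NINODE];
static int toy_disk_reads;          // counts simulated disk reads

// Same search as iget, minus icache.lock: reuse a cached entry or claim
// the first slot whose ref is 0. Never reads the disk.
static struct toy_inode *toy_iget(unsigned dev, unsigned inum) {
  struct toy_inode *empty = NULL;
  for (struct toy_inode *ip = toy_cache; ip < toy_cache + TOY_NINODE; ip++) {
    if (ip->ref > 0 && ip->dev == dev && ip->inum == inum) {
      ip->ref++;
      return ip;
    }
    if (empty == NULL && ip->ref == 0)
      empty = ip;
  }
  if (empty == NULL)
    return NULL;                    // the real iget panics here
  empty->dev = dev;
  empty->inum = inum;
  empty->ref = 1;
  empty->valid = 0;                 // contents are loaded later
  return empty;
}

// Stand-in for ilock: the first caller loads the contents, later ones do not.
static void toy_ilock(struct toy_inode *ip) {
  if (!ip->valid) {
    toy_disk_reads++;
    ip->type = 2;                   // pretend the disk said "regular file"
    ip->valid = 1;
  }
}

// Stand-in for the common path of iput: just drop one reference.
static void toy_iput(struct toy_inode *ip) {
  assert(ip->ref > 0);
  ip->ref--;
}
```

Two toy_iget calls for the same (dev, inum) return the same entry with ref 2, and only the first toy_ilock touches the simulated disk. Once both references are dropped, the slot can be handed to a different inum and starts again with valid set to 0.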

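The main job of the icache is not caching: if caching is needed, the buffer cache already provides it.

(This again shows the role of the buffer cache as the layer underneath. Inodes are themselves stored in disk blocks, and the buffer cache caches disk blocks of every kind.)

The main job of the icache is to synchronize access to inodes by multiple processes.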

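Update, 2020-11-10:

I had written about the role of the icache here before, but I was probably just copying the book's explanation and did not really understand it:

First, why is caching not the main job of the icache?

Simply because "caching" here would mean reading the on-disk dinode into memory, but disk reads always happen one whole block at a time, through the buffer cache layer with bread. Every such read already caches in memory the block that contains the dinode we are looking for.

Second, what does synchronizing access to inodes mean?

xv6 is a concurrent, multi-core environment, so several processes may access the same inode at once, and those accesses are not atomic. Locks are therefore needed to make access to an inode exclusive. The buffer cache layer plays the same role for blocks.

Since inodes live in blocks and the buffer cache already synchronizes access to blocks, why is a separate inode layer needed to synchronize access to inodes?

Is it for finer-grained control, because one block may hold many inodes? TODO

In addition, the icache is write-through: after modifying an inode, the code must immediately write it back to disk with iupdate (and this write, of course, goes through the log first).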

// Allocate an inode on device dev.
// Mark it as allocated by  giving it type type.
// Returns an unlocked but allocated and referenced inode.
struct inode*
ialloc(uint dev,short type)
{
  int inum;
  struct buf *bp;
  struct dinode *dip;

  for(inum = 1; inum < sb.ninodes; inum++){
    bp = bread(dev, IBLOCK(inum, sb));
    dip = (struct dinode*)bp->data + inum%IPB;
    if(dip->type == 0){  // a free inode
      memset(dip, 0, sizeof(*dip));
      dip->type = type;
      log_write(bp);   // mark it allocated on the disk
      brelse(bp);
      return iget(dev, inum);
    }
    brelse(bp);
  }
  panic("ialloc: no inodes");
}

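Update, 2020-11-09:

Some more explanation of how the on-disk inodes are scanned:

bread always reads a whole block from disk, never a single inode. So the code first uses IBLOCK to compute the number of the block that contains the current inode and then reads that block.

From the disk layout we know this block belongs to the inode region, so bp->data holds inodes. The code casts it to dinode* and adds inum%IPB (IPB is inodes per block), which is the position of inode inum within that block. The pointer then points at the inode we are looking for.

To allocate an inode, ialloc scans the inode region on disk. A dinode whose type is 0 is free and can be used. ialloc first sets type to mark the inode as no longer free and writes that change back to disk. It then calls iget to return the in-memory inode for the dinode it just found. iget usually recycles an icache entry and records dev and inum in it; the entry does not hold the dinode's contents yet (valid is 0) until ilock reads them.

In ialloc, the buffer holding the dinode that bread returns is locked, so only one process at a time can examine and claim a given dinode, and the steps above are safe.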
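The arithmetic behind IBLOCK and inum%IPB can be tried outside the kernel. The sketch below assumes the stock xv6 values (1024-byte blocks and a 64-byte dinode, so 16 inodes per block); `locate_inode` and `struct inode_loc` are names invented for this example, and the start of the inode region is passed in as a parameter instead of being read from the superblock.

```c
#include <assert.h>
#include <stdint.h>

enum {
  BSIZE = 1024,                 // block size in stock xv6
  DINODE_SIZE = 64,             // sizeof(struct dinode) in stock xv6
  IPB = BSIZE / DINODE_SIZE     // inodes per block: 16
};

struct inode_loc {
  uint32_t block;               // disk block that holds the dinode
  uint32_t slot;                // index of the dinode inside that block
  uint32_t byte_off;            // byte offset of the dinode inside the block
};

// Same computation as IBLOCK(inum, sb) followed by "+ inum % IPB".
// inodestart is the first block of the inode region (sb.inodestart in xv6).
static struct inode_loc locate_inode(uint32_t inum, uint32_t inodestart) {
  struct inode_loc loc;
  assert(inum > 0);             // ialloc starts scanning at inum 1
  loc.block = inum / IPB + inodestart;
  loc.slot = inum % IPB;
  loc.byte_off = loc.slot * DINODE_SIZE;
  return loc;
}
```

For example, with the inode region starting at block 32 (an arbitrary value for the demo), inode 17 lives in block 33 at slot 1, i.e. 64 bytes into that block.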

// Drop a reference to an in-memory inode.
// If that was the last reference, the inode cache entry can
// be recycled.
// If that was the last reference and the inode has no links
// to it, free the inode (and its content) on disk.
// All calls to iput() must be inside a transaction in
// case it has to free the inode.
void
iput(struct inode *ip)
{
  acquire(&icache.lock);

  if(ip->ref == 1 && ip->valid && ip->nlink == 0){
    // inode has no links and no other references: truncate and free.

    // ip->ref == 1 means no other process can have ip locked,
    // so this acquiresleep() won't block (or deadlock).
    acquiresleep(&ip->lock);

    release(&icache.lock);

    itrunc(ip);
    ip->type = 0;
    iupdate(ip);
    ip->valid = 0;

    releasesleep(&ip->lock);

    acquire(&icache.lock);
  }

  ip->ref--;
  release(&icache.lock);
}

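iput drops one reference to an inode (it decrements ref). If ref is 1 (so it becomes 0 after this iput), the inode is valid and nlink is 0, then the inode must be freed as part of this call: itrunc frees its data blocks (clearing their bits in the bitmap) and zeroes addrs, type is set to 0, and the result is written back to disk.

If ref merely drops to 0 while nlink is still positive, only the icache entry becomes available for recycling.

Because iput may have to free the inode (which writes to disk), calls to iput must be inside a transaction (begin_op ... end_op).

Locking in iput:

One might worry that iput frees the inode after acquiring its lock while another process is waiting for that same lock, unaware that the inode has been freed, and that this would go wrong.

This cannot happen. iput has already checked that ref is 1 (the reference held by the process running iput), so no other process holds a pointer to this inode.

A crash is a different matter. If the system crashes while such an inode is still referenced (nlink is 0, ref is 1), the in-memory pointer is of course lost after the reboot, yet the dinode is still marked allocated (type is not 0) with nlink 0. That disk space is wasted: with nlink 0 nothing can reach the inode, and because it is marked allocated it will not be reused either.

There are two ways to solve this:

On reboot after a crash, scan all inodes and free every one whose nlink is 0 but whose type is not 0.
Avoid scanning all inodes by recording on disk the inodes whose nlink is 0 but whose ref is not 0 (exactly the ones that would end up wasted after a crash); after a reboot only the inodes on that list need to be freed.

(xv6 implements neither strategy, so it risks eventually running out of space in the on-disk inode region.)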
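The three possible outcomes of an iput call can be written down as a small pure function. This is only a restatement of the condition in the code above, not part of xv6; `classify_iput` and the enum names are made up for this sketch.

```c
#include <assert.h>

enum iput_action {
  IPUT_JUST_DECREMENT,   // other references remain; nothing else happens
  IPUT_RECYCLABLE,       // last reference dropped; the cache slot can be reused
  IPUT_FREE_ON_DISK      // last reference and no links: truncate and free
};

// Mirrors the test "ip->ref == 1 && ip->valid && ip->nlink == 0" in iput.
// The arguments are the field values as seen before ref is decremented.
static enum iput_action classify_iput(int ref, int valid, int nlink) {
  assert(ref >= 1);      // the caller must itself hold a reference
  if (ref == 1 && valid && nlink == 0)
    return IPUT_FREE_ON_DISK;
  if (ref == 1)
    return IPUT_RECYCLABLE;
  return IPUT_JUST_DECREMENT;
}
```

Note the role of valid: an entry that was never loaded by ilock has no trustworthy nlink, so it is never freed on disk, only made recyclable.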
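The first strategy is easy to sketch on an in-memory array that stands in for the inode region. xv6 does not contain such code; `toy_dinode` and `reclaim_orphans` are hypothetical, and a real implementation would also have to free each orphan's data blocks (what itrunc does) inside a transaction.

```c
#include <assert.h>
#include <stddef.h>

// Only the two fields that matter for the scan.
struct toy_dinode {
  short type;    // 0 means free
  short nlink;   // number of directory entries pointing here
};

// Scan every inode and free those that are allocated but unreachable.
// Returns how many inodes were reclaimed.
static int reclaim_orphans(struct toy_dinode *inodes, int n) {
  int freed = 0;
  assert(inodes != NULL && n >= 0);
  for (int inum = 1; inum < n; inum++) {   // inum 0 is never used
    if (inodes[inum].type != 0 && inodes[inum].nlink == 0) {
      inodes[inum].type = 0;               // mark free again
      freed++;
    }
  }
  return freed;
}
```

The cost is a full pass over the inode region at every boot, which is what the second strategy avoids by keeping an explicit on-disk list.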

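Inode content

inode.addrs is the main topic of this section, and also the part that has to change in the lab to support big files.

An inode's content is stored in these blocks: 12 direct blocks, plus one indirect block that lists up to 256 further block numbers. Given an inode and a logical block number within its content (offset / block size), bmap returns the disk address (block number) of the corresponding block.

With bmap, readi and writei can easily reach the disk block that holds a given offset in the inode's content.

(I thought I had covered bmap in detail in the lab fs post, but it turns out I had not. It is not very complicated, so here is the code.)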
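As a rough illustration of the mapping, the sketch below classifies a byte offset the way bmap does for the stock layout (12 direct entries plus one indirect block of 256 entries), without allocating anything. `locate_block`, `struct blk_path` and `max_file_bytes` are names invented here, and the doubly-indirect level from the lab is deliberately left out.

```c
#include <assert.h>

enum {
  BSIZE = 1024,            // block size in stock xv6
  NDIRECT = 12,            // direct entries in addrs[]
  NINDIRECT = BSIZE / 4    // 4-byte block numbers per indirect block: 256
};

struct blk_path {
  int level;               // 0 = direct, 1 = via indirect block, -1 = out of range
  unsigned index;          // index into addrs[] (level 0) or the indirect block (level 1)
};

// Map a byte offset in the file to the table entry that bmap would consult.
static struct blk_path locate_block(unsigned off) {
  struct blk_path p = { -1, 0 };
  unsigned bn = off / BSIZE;            // logical block number
  if (bn < (unsigned)NDIRECT) {
    p.level = 0;
    p.index = bn;
    return p;
  }
  bn -= NDIRECT;
  if (bn < (unsigned)NINDIRECT) {
    p.level = 1;
    p.index = bn;
  }
  return p;
}

// Largest file the stock layout can describe, in bytes.
static unsigned max_file_bytes(void) {
  return (unsigned)(NDIRECT + NINDIRECT) * BSIZE;
}
```

With these numbers a file tops out at 268 blocks (274432 bytes), which is the limit the big-file lab lifts by adding a doubly-indirect block.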

// The content (data) associated with each inode is stored
// in blocks on the disk. The first NDIRECT block numbers
// are listed in ip->addrs[].  The next NINDIRECT blocks are
// listed in block ip->addrs[NDIRECT].

// Return the disk block address of the nth block in inode ip.
// If there is no such block, bmap allocates one.
static uint
bmap(struct inode *ip, uint bn)
{
  uint addr, *a;
  struct buf *bp;

  if(bn < NDIRECT){
    if((addr = ip->addrs[bn]) == 0)
      ip->addrs[bn] = addr = balloc(ip->dev);
    return addr;
  }
  bn -= NDIRECT;

  if(bn < NINDIRECT){
    // Load indirect block, allocating if necessary.
    if((addr = ip->addrs[NDIRECT]) == 0)
      ip->addrs[NDIRECT] = addr = balloc(ip->dev);
    // addr is the indirect block
    bp = bread(ip->dev, addr);
    a = (uint*)bp->data;
    if((addr = a[bn]) == 0){
      a[bn] = addr = balloc(ip->dev);
      log_write(bp);
    }
    brelse(bp);
    return addr;
  }
  bn -= NINDIRECT;

  // Doubly-indirect block (big-file lab extension). This assumes addrs[]
  // has been enlarged so that index NDIRECT+1 is valid.
  if(bn < NINDIRECT * NINDIRECT){
    // Load the doubly-indirect block, allocating if necessary.
    // This changes ip->addrs in the in-memory inode, not a buffer,
    // so no log_write is needed here.
    if((addr = ip->addrs[NDIRECT+1]) == 0)
      ip->addrs[NDIRECT+1] = addr = balloc(ip->dev);
    bp = bread(ip->dev, addr);
    a = (uint*)bp->data;
    // Find the singly-indirect block that covers bn, allocating if necessary.
    if((addr = a[bn / NINDIRECT]) == 0){
      a[bn / NINDIRECT] = addr = balloc(ip->dev);
      log_write(bp);
    }
    brelse(bp);
    // Use addr here, not a[...]: bp has been released, so a may be stale.
    bp = bread(ip->dev, addr);
    a = (uint*)bp->data;
    if((addr = a[bn % NINDIRECT]) == 0){
      a[bn % NINDIRECT] = addr = balloc(ip->dev);
      log_write(bp);
    }
    brelse(bp);
    return addr;
  }

  panic("bmap: out of range");
}

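Directory layer

A directory is a kind of inode, with type T_DIR. Its content is a sequence of dirent structures, which xv6 reads one at a time with readi.

dirlookup searches a given directory for the dirent whose name matches.

dirlink creates a dirent in a given directory: it first looks for a free dirent and, if there is none, writes a new dirent at the end.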
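A toy version of the two functions over a fixed in-memory array shows the idea. It is a sketch under simplifying assumptions: the real functions read and write the directory's content with readi and writei and can grow the directory, whereas `toy_dirlookup` and `toy_dirlink` (both invented names) work on an array that cannot grow. As in xv6, an entry with inum 0 is free and names are compared over DIRSIZ bytes.

```c
#include <assert.h>
#include <string.h>

#define DIRSIZ 14

struct toy_dirent {
  unsigned short inum;      // 0 marks a free entry
  char name[DIRSIZ];        // not NUL-terminated when exactly DIRSIZ long
};

// Return the index of the entry called name, or -1 if there is none.
static int toy_dirlookup(const struct toy_dirent *ents, int n, const char *name) {
  assert(n >= 0);
  for (int i = 0; i < n; i++) {
    if (ents[i].inum == 0)
      continue;                              // skip free entries
    if (strncmp(name, ents[i].name, DIRSIZ) == 0)
      return i;
  }
  return -1;
}

// Add (name, inum) in the first free entry. Returns the index used,
// or -1 if the name already exists or the toy array is full.
static int toy_dirlink(struct toy_dirent *ents, int n, const char *name,
                       unsigned short inum) {
  if (toy_dirlookup(ents, n, name) >= 0)
    return -1;                               // name is already present
  for (int i = 0; i < n; i++) {
    if (ents[i].inum == 0) {
      strncpy(ents[i].name, name, DIRSIZ);   // pads with zeros up to DIRSIZ
      ents[i].inum = inum;
      return i;
    }
  }
  return -1;                                 // the real dirlink appends instead
}
```

Starting from a directory that holds only . and .. plus one free entry, linking a new name fills the free entry, and linking the same name a second time is rejected.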

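Path name layer

The core function here is namex. Following the given path, it obtains and locks the inode of the directory at each level in turn. Note that as long as the loop has not exited, the walk has not reached the last level yet,

so the current inode must be a directory. If nameiparent is 1 (the caller wants the inode of the directory one level above the target file) and the walk has reached the last path element, namex returns the inode of the directory that contains the target right away. Otherwise it opens the next level, and so on until it has the inode of the target file.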

// Look up and return the inode for a path name.
// If parent != 0, return the inode for the parent and copy the final
// path element into name, which must have room for DIRSIZ bytes.
// Must be called inside a transaction since it calls iput().
static struct inode*
namex(char *path, int nameiparent, char *name)
{
  struct inode *ip, *next;
  // Choose the starting inode depending on whether the path is absolute or relative
  if(*path == '/')
    ip = iget(ROOTDEV, ROOTINO);
  else
    ip = idup(myproc()->cwd);
  // skipelem takes a hierarchical path such as /a/b/c, copies the next
  // element into name (here a) and returns the rest of the path (here b/c)
  while((path = skipelem(path, name)) != 0){
    ilock(ip);
    if(ip->type != T_DIR){
      iunlockput(ip);
      return 0;
    }
    // path now points just past name; if nameiparent is set and nothing
    // follows, name is the last element, so return its parent ip
    if(nameiparent && *path == '\0'){
      // Stop one level early.
      iunlock(ip);
      return ip;
    }
    // Otherwise look up name (a dirent in the current directory) to get its inode
    if((next = dirlookup(ip, name, 0)) == 0){
      iunlockput(ip);
      return 0;
    }
    iunlockput(ip);
    // Descend to the next level
    ip = next;
  }
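  // When is this reached? With nameiparent set, only when the path has no
  // elements (for example "/"), so there is no parent to return.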
  if(nameiparent){
    iput(ip);
    return 0;
  }
  return ip;
}

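namex is designed to hold the lock of only the current inode at any time. dirlookup may spend a long time on disk operations, so locking one directory at a time improves parallelism.

Meanwhile, because the current inode's lock is held, no other process can change the structure (the dirents) of the current directory.

Looking up . is still a challenge: . is the same inode as the current one, whose lock is already held, so acquiring it again would deadlock. The solution is to release the current inode's lock before moving on to the next level. And although the next-level inode has already been obtained, iget and ilock are separate, so no attempt is made to lock the next level at that point, and no deadlock occurs.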
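The path splitting that drives the loop in namex can be tried in user space. The function below follows the behaviour described for skipelem, but `toy_skipelem` and `toy_count_elems` are names made up for this sketch, and it takes a const pointer so it can be called on string literals.

```c
#include <assert.h>
#include <string.h>

#define DIRSIZ 14

// Copy the next path element into name and return the rest of the path
// with leading slashes removed. Returns NULL when no element is left.
// Like xv6, name is not NUL-terminated if the element is DIRSIZ or longer.
static const char *toy_skipelem(const char *path, char *name) {
  const char *s;
  size_t len;

  assert(path != NULL && name != NULL);
  while (*path == '/')
    path++;
  if (*path == '\0')
    return NULL;
  s = path;
  while (*path != '/' && *path != '\0')
    path++;
  len = (size_t)(path - s);
  if (len >= DIRSIZ) {
    memmove(name, s, DIRSIZ);        // truncate over-long elements
  } else {
    memmove(name, s, len);
    name[len] = '\0';
  }
  while (*path == '/')
    path++;
  return path;
}

// Count how many times the namex loop body would run for a path.
static int toy_count_elems(const char *path) {
  char name[DIRSIZ];
  int n = 0;
  while ((path = toy_skipelem(path, name)) != NULL)
    n++;
  return n;
}
```

For /a/bb/c the first call yields name a and the remainder bb/c, and the loop body runs three times. For / it does not run at all, which is the case in which namex falls through to the code after the loop.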

File descriptor layer

This is the structure behind a file descriptor in xv6:

struct file {
  enum { FD_NONE, FD_PIPE, FD_INODE, FD_DEVICE } type;
  int ref; // reference count
  char readable;
  char writable;
  struct pipe *pipe; // FD_PIPE
  struct inode *ip;  // FD_INODE and FD_DEVICE
  uint off;          // FD_INODE and FD_DEVICE
  short major;       // FD_DEVICE
  short minor;       // FD_DEVICE
};

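Each process has its own open file table (an array of file pointers). A file is a wrapper around an inode or a pipe, plus an offset. Every open allocates a new file, so when several processes open the same file, each has its own offset.

File descriptors are per process. This is clear from the fact that 0, 1 and 2 are stdin, stdout and stderr in every process. It can also be checked with a small program: have two processes each open a file and print the descriptor; usually both print 3.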

// Per-process state
struct proc {
....
  struct file *ofile[NOFILE];  // Open files
.....
};

// Allocate a file descriptor for the given file.
// Takes over file reference from caller on success.
static int
fdalloc(struct file *f)
{
  int fd;
  struct proc *p = myproc();

  for(fd = 0; fd < NOFILE; fd++){
    // Find an fd in the current process's ofile that is not in use
    // and associate it with the file passed in as the argument
    if(p->ofile[fd] == 0){
      p->ofile[fd] = f;
      return fd;
    }
  }
  return -1;
}

This is the open file table mentioned above. It also shows how an fd relates to its file: the fd is the file's index in the open file table.

uint64
sys_dup(void)
{
  struct file *f;
  int fd;
  // First get the file associated with the old fd
  if(argfd(0, 0, &f) < 0)
    return -1;
  // Allocate another fd for the same file: p->ofile[x] and p->ofile[y] are both f
  if((fd=fdalloc(f)) < 0)
    return -1;
  // Increment f's ref
  filedup(f);
  return fd;
}

// Create a new process, copying the parent.
// Sets up child kernel stack to return as if from fork() system call.
int
fork(void)
{
 ....

  // increment reference counts on open file descriptors.
  for(i = 0; i < NOFILE; i++)
    if(p->ofile[i])
      np->ofile[i] = filedup(p->ofile[i]);
  np->cwd = idup(p->cwd);
....
}

As the implementations of dup and fork show, a new file descriptor obtained through fork or dup shares its offset with the old one.
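The difference between "two opens" and "one open plus a dup" can be shown with a toy table. This is a sketch, not xv6 code: all `toy_` names are invented, there is no locking, and toy_advance stands in for a read or write moving the offset.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_NOFILE = 16 };             // slots per process in this demo

struct toy_file {
  int ref;                            // how many fds point at this file
  unsigned off;                       // shared by every fd that points here
};

struct toy_proc {
  struct toy_file *ofile[TOY_NOFILE]; // fd -> file, NULL when unused
};

// Same idea as fdalloc: the lowest unused slot becomes the fd.
static int toy_fdalloc(struct toy_proc *p, struct toy_file *f) {
  for (int fd = 0; fd < TOY_NOFILE; fd++) {
    if (p->ofile[fd] == NULL) {
      p->ofile[fd] = f;
      return fd;
    }
  }
  return -1;
}

// Each open starts a fresh file object with its own offset.
static int toy_open(struct toy_proc *p, struct toy_file *f) {
  f->ref = 1;
  f->off = 0;
  return toy_fdalloc(p, f);
}

// Like sys_dup: a second fd for the same file object.
static int toy_dup(struct toy_proc *p, int oldfd) {
  struct toy_file *f = p->ofile[oldfd];
  int fd;
  assert(f != NULL);
  if ((fd = toy_fdalloc(p, f)) < 0)
    return -1;
  f->ref++;
  return fd;
}

// Stand-in for a read or write that moves the offset by n bytes.
static void toy_advance(struct toy_proc *p, int fd, unsigned n) {
  p->ofile[fd]->off += n;
}
```

After advancing through the original fd, the dup'ed fd sees the new offset because both slots point at the same toy_file, while a file opened separately still sits at offset 0. fork behaves like dup in this respect, since the child's table is filled with filedup of the parent's entries.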