Chapter 12. The Virtual Filesystem
The five standard Unix file types:
1:Regular files
2:Directories
3:Symbolic links
4:Device files
5:Pipes
12.1. The Role of the Virtual Filesystem (VFS)
Filesystems supported by the VFS may be grouped into three main classes:
1:Disk-based filesystems
2:Network filesystems
3:Special filesystems
12.1.1. The Common File Model
Figure 12-2. Interaction between processes and VFS objects
The superblock object
The inode object
The file object
12.1.2. System Calls Handled by the VFS
Table 12-1. Some system calls handled by the VFS
System call name Description
mount( ) umount( ) umount2( ) Mount/unmount filesystems
sysfs( ) Get filesystem information
statfs( ) fstatfs( ) statfs64( ) fstatfs64( ) ustat( ) Get filesystem statistics
chroot( ) pivot_root( ) Change root directory
chdir( ) fchdir( ) getcwd( ) Manipulate current directory
mkdir( ) rmdir( ) Create and destroy directories
getdents( ) getdents64( ) readdir( ) link( ) unlink( ) rename( ) lookup_dcookie( ) Manipulate directory entries
readlink( ) symlink( ) Manipulate soft links
chown( ) fchown( ) lchown( ) chown16( ) fchown16( ) lchown16( ) Modify file owner
chmod( ) fchmod( ) utime( ) Modify file attributes
stat( ) fstat( ) lstat( ) access( ) oldstat( ) oldfstat( ) oldlstat( ) stat64( ) lstat64( ) fstat64( ) Read file status
open( ) close( ) creat( ) umask( ) Open, close, and create files
dup( ) dup2( ) fcntl( ) fcntl64( ) Manipulate file descriptors
select( ) poll( ) Wait for events on a set of file descriptors
truncate( ) ftruncate( ) truncate64( ) ftruncate64( ) Change file size
lseek( ) _llseek( ) Change file pointer
read( ) write( ) readv( ) writev( ) sendfile( ) sendfile64( ) readahead( ) Carry out file I/O operations
io_setup( ) io_submit( ) io_getevents( ) io_cancel( ) io_destroy( ) Asynchronous I/O (allows multiple outstanding read and write requests)
pread64( ) pwrite64( ) Seek file and access it
mmap( ) mmap2( ) munmap( ) madvise( ) mincore( ) remap_file_pages( ) Handle file memory mapping
fdatasync( ) fsync( ) sync( ) msync( ) Synchronize file data
flock( ) Manipulate file lock
setxattr( ) lsetxattr( ) fsetxattr( ) getxattr( ) lgetxattr( ) fgetxattr( ) listxattr( ) llistxattr( ) flistxattr( ) removexattr( ) lremovexattr( ) fremovexattr( ) Manipulate file extended attributes
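A minimal user-space sketch of how a few of the calls in Table 12-1 fit together. It assumes a POSIX system; the helper name vfs_roundtrip and the scratch path passed to it are hypothetical and not part of the chapter. The same code works whatever filesystem holds the path, which is the point of the VFS layer.

```c
#define _XOPEN_SOURCE 700   /* ask the C library for the POSIX declarations */
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write `len` bytes to `path`, rewind, read them back
 * into `out`, and return the size reported by fstat(), or -1 on error.
 * The scratch file is removed before returning. */
long vfs_roundtrip(const char *path, const char *data, size_t len, char *out)
{
    struct stat st;
    int fd;
    int ok;

    assert(path != NULL && data != NULL && out != NULL);

    fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);
    if (fd < 0)
        return -1;

    ok = write(fd, data, len) == (ssize_t)len   /* file I/O */
      && lseek(fd, 0, SEEK_SET) == 0            /* change file pointer */
      && read(fd, out, len) == (ssize_t)len     /* file I/O */
      && fstat(fd, &st) == 0;                   /* read file status */

    close(fd);
    unlink(path);                               /* remove directory entry */
    return ok ? (long)st.st_size : -1;
}
```

Calling vfs_roundtrip("/tmp/vfs_demo.tmp", "hello", 5, buf) should return 5 and leave the five bytes in buf, provided /tmp is writable.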
12.2. VFS Data Structures
12.2.1. Superblock Objects
Table 12-2. The fields of the superblock object
Type Field Description
struct list_head s_list Pointers for superblock list
dev_t s_dev Device identifier
unsigned long s_blocksize Block size in bytes
unsigned long s_old_blocksize Block size in bytes as reported by the underlying block device driver
unsigned char s_blocksize_bits Block size in number of bits
unsigned char s_dirt Modified (dirty) flag
unsigned long long s_maxbytes Maximum size of the files
struct file_system_type * s_type Filesystem type
struct super_operations * s_op Superblock methods
struct dquot_operations * dq_op Disk quota handling methods
struct quotactl_ops * s_qcop Disk quota administration methods
struct export_operations * s_export_op Export operations used by network filesystems
unsigned long s_flags Mount flags
unsigned long s_magic Filesystem magic number
struct dentry * s_root Dentry object of the filesystem's root directory
struct rw_semaphore s_umount Semaphore used for unmounting
struct semaphore s_lock Superblock semaphore
int s_count Reference counter
int s_syncing Flag indicating that inodes of the superblock are being synchronized
int s_need_sync_fs Flag used when synchronizing the superblock's mounted filesystem
atomic_t s_active Secondary reference counter
void * s_security Pointer to superblock security structure
struct xattr_handler ** s_xattr Pointer to superblock extended attribute structure
struct list_head s_inodes List of all inodes
struct list_head s_dirty List of modified inodes
struct list_head s_io List of inodes waiting to be written to disk
struct hlist_head s_anon List of anonymous dentries for handling remote network filesystems
struct list_head s_files List of file objects
struct block_device* s_bdev Pointer to the block device driver descriptor
struct list_head s_instances Pointers for a list of superblock objects of a given filesystem type (see the later section "Filesystem Type Registration")
struct quota_info s_dquot Descriptor for disk quota
int s_frozen Flag used when freezing the filesystem (forcing it to a consistent state)
wait_queue_head_t s_wait_unfrozen Wait queue where processes sleep until the filesystem is unfrozen
char[] s_id Name of the block device containing the superblock
void * s_fs_info Pointer to superblock information of a specific filesystem
struct semaphore s_vfs_rename_sem Semaphore used by VFS when renaming files across directories
u32 s_time_gran Timestamp's granularity (in nanoseconds)
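The s_blocksize and s_blocksize_bits fields carry the same information in two forms: the second is the base-2 logarithm of the first, so offsets can be converted to block numbers with a shift. A small sketch of that relation follows; the function names are made up for illustration and are not kernel functions.

```c
#include <assert.h>

/* Hypothetical helper: number of bits needed to express a power-of-two
 * block size, i.e. the value s_blocksize_bits would hold for it. */
unsigned char toy_blocksize_bits(unsigned long blocksize)
{
    unsigned char bits = 0;

    /* block sizes are assumed to be non-zero powers of two */
    assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);

    while ((1UL << bits) < blocksize)
        bits++;
    return bits;
}

/* Index of the block that contains the given byte offset. */
unsigned long long toy_block_of(unsigned long long offset, unsigned char bits)
{
    return offset >> bits;
}
```

For a 4096-byte block the bit count is 12, and byte offset 8191 falls in block 1.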
The superblock operations are described by the super_operations structure, whose address is stored in the s_op field:
alloc_inode(sb)
Allocates space for an inode object, including the space required for filesystem-specific data.
destroy_inode(inode)
Destroys an inode object, including the filesystem-specific data
read_inode(inode)
Fills the fields of the inode object passed as the parameter with the data on disk; the i_ino
field of the inode object identifies the specific filesystem inode on the disk to be read.
dirty_inode(inode)
Invoked when the inode is marked as modified (dirty). Used by filesystems such as ReiserFS
and Ext3 to update the filesystem journal on disk.
write_inode(inode, flag)
Updates a filesystem inode with the contents of the inode object passed as the parameter; the
i_ino field of the inode object identifies the filesystem inode on disk that is concerned. The
flag parameter indicates whether the I/O operation should be synchronous.
put_inode(inode)
Invoked when the inode is released (its reference counter is decreased) to perform filesystem-specific operations.
drop_inode(inode)
Invoked when the inode is about to be destroyed, that is, when the last user releases the inode;
filesystems that implement this method usually make use of generic_drop_inode( ). This
function removes every reference to the inode from the VFS data structures and, if the inode
no longer appears in any directory, invokes the delete_inode superblock method to delete the
inode from the filesystem.
delete_inode(inode)
Invoked when the inode must be destroyed. Deletes the VFS inode in memory and the file data
and metadata on disk.
put_super(super)
Releases the superblock object passed as the parameter (because the corresponding
filesystem is unmounted).
write_super(super)
Updates a filesystem superblock with the contents of the object indicated.
sync_fs(sb, wait)
Invoked when flushing the filesystem to update filesystem-specific data structures on disk
(used by journaling filesystems).
write_super_lockfs(super)
Blocks changes to the filesystem and updates the superblock with the contents of the object
indicated. This method is invoked when the filesystem is frozen, for instance by the Logical
Volume Manager (LVM) driver.
unlockfs(super)
Undoes the block of filesystem updates achieved by the write_super_lockfs superblock
method.
statfs(super, buf)
Returns statistics on a filesystem by filling the buf buffer.
remount_fs(super, flags, data)
Remounts the filesystem with new options (invoked when a mount option must be changed).
clear_inode(inode)
Invoked when a disk inode is being destroyed to perform filesystem-specific operations.
umount_begin(super)
Aborts a mount operation because the corresponding unmount operation has been started
(used only by network filesystems).
show_options(seq_file, vfsmount)
Used to display the filesystem-specific options.
quota_read(super, type, data, size, offset)
Used by the quota system to read data from the file that specifies the limits for this filesystem.
quota_write(super, type, data, size, offset)
Used by the quota system to write data into the file that specifies the limits for this filesystem.
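The alloc_inode and destroy_inode entries above say that the filesystem allocates the inode object together with its own private data. One common way to get that effect is to embed the generic object inside a larger filesystem-specific structure and recover the outer structure with offsetof. The following is a user-space sketch under that assumption; every toy_ and myfs_ name is hypothetical, and the structures are heavily reduced stand-ins for the kernel ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_super;

struct toy_inode {                      /* stand-in for the inode object */
    unsigned long i_ino;
    struct toy_super *i_sb;
};

struct toy_super_ops {                  /* stand-in for super_operations */
    struct toy_inode *(*alloc_inode)(struct toy_super *sb);
    void (*destroy_inode)(struct toy_inode *inode);
};

struct toy_super {                      /* stand-in for the superblock */
    const struct toy_super_ops *s_op;
};

/* Filesystem-specific inode: private data plus the embedded generic part. */
struct myfs_inode {
    unsigned long disk_block;
    struct toy_inode vfs_inode;
};

/* Recover the filesystem-specific structure from the generic inode. */
static struct myfs_inode *MYFS_I(struct toy_inode *inode)
{
    return (struct myfs_inode *)
        ((char *)inode - offsetof(struct myfs_inode, vfs_inode));
}

static struct toy_inode *myfs_alloc_inode(struct toy_super *sb)
{
    struct myfs_inode *mi = calloc(1, sizeof(*mi));

    if (mi == NULL)
        return NULL;
    mi->vfs_inode.i_sb = sb;
    return &mi->vfs_inode;              /* generic code sees only this part */
}

static void myfs_destroy_inode(struct toy_inode *inode)
{
    free(MYFS_I(inode));                /* frees generic and private data */
}

const struct toy_super_ops myfs_super_ops = {
    .alloc_inode   = myfs_alloc_inode,
    .destroy_inode = myfs_destroy_inode,
};

/* Generic layer: knows nothing about myfs, only calls through s_op. */
struct toy_inode *toy_new_inode(struct toy_super *sb, unsigned long ino)
{
    struct toy_inode *inode;

    assert(sb != NULL && sb->s_op != NULL && sb->s_op->alloc_inode != NULL);
    inode = sb->s_op->alloc_inode(sb);
    if (inode != NULL)
        inode->i_ino = ino;
    return inode;
}
```

The generic function toy_new_inode never mentions myfs_inode; swapping in a different operations table changes the allocation strategy without touching it.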
12.2.2. Inode Objects
Table 12-3. The fields of the inode object
Type Field Description
struct hlist_node i_hash Pointers for the hash list
struct list_head i_list Pointers for the list that describes the inode's current state
struct list_head i_sb_list Pointers for the list of inodes of the superblock
struct list_head i_dentry The head of the list of dentry objects referencing this inode
unsigned long i_ino inode number
atomic_t i_count Usage counter
umode_t i_mode File type and access rights
unsigned int i_nlink Number of hard links
uid_t i_uid Owner identifier
gid_t i_gid Group identifier
dev_t i_rdev Real device identifier
loff_t i_size File length in bytes
struct timespec i_atime Time of last file access
struct timespec i_mtime Time of last file write
struct timespec i_ctime Time of last inode change
unsigned int i_blkbits Block size in number of bits
unsigned long i_blksize Block size in bytes
unsigned long i_version Version number, automatically increased after each use
unsigned long i_blocks Number of blocks of the file
unsigned short i_bytes Number of bytes in the last block of the file
unsigned char i_sock Nonzero if file is a socket
spinlock_t i_lock Spin lock protecting some fields of the inode
struct semaphore i_sem inode semaphore
struct rw_semaphore i_alloc_sem Read/write semaphore protecting against race conditions in direct I/O file operations
struct inode_operations * i_op inode operations
struct file_operations * i_fop Default file operations
struct super_block * i_sb Pointer to superblock object
struct file_lock * i_flock Pointer to file lock list
struct address_space* i_mapping Pointer to an address_space object (see Chapter 15)
struct address_space i_data address_space object of the file
struct dquot * [] i_dquot inode disk quotas
struct list_head i_devices Pointers for a list of inodes relative to a specific character or block device (see Chapter 13)
struct pipe_inode_info * i_pipe Used if the file is a pipe (see Chapter 19)
struct block_device * i_bdev Pointer to the block device driver
struct cdev * i_cdev Pointer to the character device driver
int i_cindex Index of the device file within a group of minor numbers
__u32 i_generation inode version number (used by some filesystems)
unsigned long i_dnotify_mask Bit mask of directory notify events
struct dnotify_struct * i_dnotify Used for directory notifications
unsigned long i_state inode state flags
unsigned long dirtied_when Dirtying time (in ticks) of the inode
unsigned int i_flags Filesystem mount flags
atomic_t i_writecount Usage counter for writing processes
void * i_security Pointer to inode's security structure
void * u.generic_ip Pointer to private data
seqcount_t i_size_seqcount Sequence counter used in SMP systems to get consistent values for i_size
The methods associated with an inode object are also called inode operations.
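The i_mode field packs the file type and the access rights into one value. User space sees the same encoding in the st_mode field returned by stat( ), so the split can be sketched with the standard POSIX macros. The function names below are hypothetical; the one-letter codes follow the convention of ls -l.

```c
#define _XOPEN_SOURCE 700   /* ask the C library for the POSIX declarations */
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/* One-letter file type, decoded from the type bits of a mode value. */
char toy_file_type_char(mode_t mode)
{
    if (S_ISREG(mode))  return '-';   /* regular file */
    if (S_ISDIR(mode))  return 'd';   /* directory */
    if (S_ISLNK(mode))  return 'l';   /* symbolic link */
    if (S_ISCHR(mode))  return 'c';   /* character device file */
    if (S_ISBLK(mode))  return 'b';   /* block device file */
    if (S_ISFIFO(mode)) return 'p';   /* pipe */
    if (S_ISSOCK(mode)) return 's';   /* socket */
    return '?';
}

/* Access rights (permission, set-ID and sticky bits) of a mode value. */
unsigned int toy_access_bits(mode_t mode)
{
    return (unsigned int)mode & 07777u;
}

/* Type of the object named by `path`; symbolic links are not followed. */
char toy_path_type_char(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (lstat(path, &st) != 0)
        return '?';
    return toy_file_type_char(st.st_mode);
}
```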
12.2.3. File Objects
A file object describes how a process interacts with a file it has opened.
The object is created when the file is opened and consists of a file structure.
Table 12-4. The fields of the file object
Type Field Description
struct list_head f_list Pointers for generic file object list
struct dentry * f_dentry dentry object associated with the file
struct vfsmount * f_vfsmnt Mounted filesystem containing the file
struct file_operations * f_op Pointer to file operation table
atomic_t f_count File object's reference counter
unsigned int f_flags Flags specified when opening the file
mode_t f_mode Process access mode
int f_error Error code for network write operation
loff_t f_pos Current file offset (file pointer)
struct fown_struct f_owner Data for I/O event notification via signals
unsigned int f_uid User's UID
unsigned int f_gid User group ID
struct file_ra_state f_ra File read-ahead state (see Chapter 16)
size_t f_maxcount Maximum number of bytes that can be read or written with a single operation (currently set to 2^31-1)
unsigned long f_version Version number, automatically increased after each use
void * f_security Pointer to file object's security structure
void * private_data Pointer to data specific for a filesystem or a device driver
struct list_head f_ep_links Head of the list of event poll waiters for this file
spinlock_t f_ep_lock Spin lock protecting the f_ep_links list
struct address_space* f_mapping Pointer to file's address space object (see Chapter 15)
file operations:
llseek(file, offset, origin)
Updates the file pointer.
read(file, buf, count, offset)
Reads count bytes from a file starting at position *offset; the value *offset (which usually
corresponds to the file pointer) is then increased.
aio_read(req, buf, len, pos)
Starts an asynchronous I/O operation to read len bytes into buf from file position pos
(introduced to support the io_submit( ) system call).
write(file, buf, count, offset)
Writes count bytes into a file starting at position *offset; the value *offset (which usually
corresponds to the file pointer) is then increased.
aio_write(req, buf, len, pos)
Starts an asynchronous I/O operation to write len bytes from buf to file position pos.
readdir(dir, dirent, filldir)
Returns the next directory entry of a directory in dirent; the filldir parameter contains the
address of an auxiliary function that extracts the fields in a directory entry.
poll(file, poll_table)
Checks whether there is activity on a file and goes to sleep until something happens on it.
ioctl(inode, file, cmd, arg)
Sends a command to an underlying hardware device. This method applies only to device files.
unlocked_ioctl(file, cmd, arg)
Similar to the ioctl method, but it does not take the big kernel lock (see the section "The Big
Kernel Lock" in Chapter 5). It is expected that all device drivers and all filesystems will
implement this new method instead of the ioctl method.
compat_ioctl(file, cmd, arg)
Method used to implement the ioctl() 32-bit system call by 64-bit kernels.
mmap(file, vma)
Performs a memory mapping of the file into a process address space (see the section "Memory
Mapping" in Chapter 16).
open(inode, file)
Opens a file by creating a new file object and linking it to the corresponding inode object (see
the section "The open( ) System Call" later in this chapter).
flush(file)
Called when a reference to an open file is closed. The actual purpose of this method is
filesystem-dependent.
release(inode, file)
Releases the file object. Called when the last reference to an open file is closed, that is, when
the f_count field of the file object becomes 0.
fsync(file, dentry, flag)
Flushes the file by writing all cached data to disk.
aio_fsync(req, flag)
Starts an asynchronous I/O flush operation.
fasync(fd, file, on)
Enables or disables I/O event notification by means of signals.
lock(file, cmd, file_lock)
Applies a lock to the file (see the section "File Locking" later in this chapter).
readv(file, vector, count, offset)
Reads bytes from a file and puts the results in the buffers described by vector; the number of
buffers is specified by count.
writev(file, vector, count, offset)
Writes bytes into a file from the buffers described by vector; the number of buffers is specified by count.
sendfile(in_file, offset, count, file_send_actor, out_file)
Transfers data from in_file to out_file (introduced to support the sendfile( ) system call).
sendpage(file, page, offset, size, pointer, fill)
Transfers data from file to the page cache's page; this is a low-level method used by
sendfile( ) and by the networking code for sockets.
get_unmapped_area(file, addr, len, offset, flags)
Gets an unused address range to map the file.
check_flags(flags)
Method invoked by the service routine of the fcntl( ) system call to perform additional checks
when setting the status flags of a file (F_SETFL command). Currently used only by the NFS
network filesystem.
dir_notify(file, arg)
Method invoked by the service routine of the fcntl( ) system call when establishing a
directory change notification (F_NOTIFY command). Currently used only by the Common
Internet File System (CIFS) network filesystem.
flock(file, flag, lock)
Used to customize the behavior of the flock( ) system call. No official Linux filesystem makes
use of this method.
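The read method described above receives a pointer to the offset and advances it, and the caller normally passes the address of the file pointer. A user-space sketch of that dispatch follows, with an in-memory buffer standing in for the file contents. All toy_ and mem_ names are hypothetical and the structures keep only the fields needed for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_file;

struct toy_file_ops {                   /* stand-in for file_operations */
    long (*read)(struct toy_file *file, char *buf, size_t count,
                 long long *offset);
};

struct toy_file {                       /* stand-in for the file object */
    const struct toy_file_ops *f_op;
    long long f_pos;                    /* current file offset */
    const char *data;                   /* plays the role of private_data */
    size_t size;
};

/* "Filesystem" read method: copy from the in-memory buffer at *offset,
 * then advance *offset by the number of bytes actually copied. */
static long mem_read(struct toy_file *file, char *buf, size_t count,
                     long long *offset)
{
    size_t pos;

    if (*offset < 0)
        return -1;
    pos = (size_t)*offset;
    if (pos >= file->size)
        return 0;                       /* end of file */
    if (count > file->size - pos)
        count = file->size - pos;       /* short read near the end */
    memcpy(buf, file->data + pos, count);
    *offset += (long long)count;
    return (long)count;
}

const struct toy_file_ops mem_file_ops = { .read = mem_read };

/* Generic layer: dispatch through f_op, handing over the file pointer. */
long toy_vfs_read(struct toy_file *file, char *buf, size_t count)
{
    assert(file != NULL && buf != NULL);
    if (file->f_op == NULL || file->f_op->read == NULL)
        return -1;
    return file->f_op->read(file, buf, count, &file->f_pos);
}
```

Passing the offset by pointer is what lets the same method serve both ordinary reads, which use the file pointer, and positional reads, which would pass a private copy.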
12.2.4. dentry Objects (directory entry objects)
Table 12-5. The fields of the dentry object
Type Field Description
atomic_t d_count Dentry object usage counter
unsigned int d_flags Dentry cache flags
spinlock_t d_lock Spin lock protecting the dentry object
struct inode * d_inode Inode associated with filename
struct dentry * d_parent Dentry object of parent directory
struct qstr d_name Filename
struct list_head d_lru Pointers for the list of unused dentries
struct list_head d_child For directories, pointers for the list of directory dentries in the same parent directory
struct list_head d_subdirs For directories, head of the list of subdirectory dentries
struct list_head d_alias Pointers for the list of dentries associated with the same inode (alias)
unsigned long d_time Used by d_revalidate method
struct dentry_operations* d_op Dentry methods
struct super_block * d_sb Superblock object of the file
void * d_fsdata Filesystem-dependent data
struct rcu_head d_rcu The RCU descriptor used when reclaiming the dentry object (see the section "Read-Copy Update (RCU)" in Chapter 5)
struct dcookie_struct * d_cookie Pointer to structure used by kernel profilers
struct hlist_node d_hash Pointer for list in hash table entry
int d_mounted For directories, counter for the number of filesystems mounted on this dentry
unsigned char[] d_iname Space for short filename
The methods associated with a dentry object are called dentry operations; they are described by the dentry_operations structure, whose address is stored in the d_op field.
d_revalidate(dentry, nameidata)
Determines whether the dentry object is still valid before using it for translating a file
pathname. The default VFS function does nothing, although network filesystems may specify
their own functions.
d_hash(dentry, name)
Creates a hash value; this function is a filesystem-specific hash function for the dentry hash
table. The dentry parameter identifies the directory containing the component. The name
parameter points to a structure containing both the pathname component to be looked up and
the value produced by the hash function.
d_compare(dir, name1, name2)
Compares two filenames; name1 should belong to the directory referenced by dir. The default
VFS function is a normal string match. However, each filesystem can implement this method in
its own way. For instance, MS-DOS does not distinguish capital from lowercase letters.
d_delete(dentry)
Called when the last reference to a dentry object is deleted (d_count becomes 0). The default
VFS function does nothing.
d_release(dentry)
Called when a dentry object is going to be freed (released to the slab allocator). The default
VFS function does nothing.
d_iput(dentry, ino)
Called when a dentry object becomes "negative", that is, it loses its inode. The default VFS
function invokes iput( ) to release the inode object.
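The d_compare entry above contrasts a plain string match with a filesystem that ignores case. A sketch of both variants follows, written in user space with a reduced name structure. The toy_ names are hypothetical, the dir argument is left out for brevity, and the return convention (0 for a match, nonzero otherwise) is an assumption of this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

struct toy_qstr {               /* reduced stand-in for a name structure */
    const char *name;
    size_t len;
};

/* Default behavior: exact byte-for-byte match. Returns 0 when equal. */
int toy_d_compare_exact(const struct toy_qstr *a, const struct toy_qstr *b)
{
    assert(a != NULL && b != NULL);
    if (a->len != b->len)
        return 1;
    return memcmp(a->name, b->name, a->len) != 0;
}

/* Case-insensitive variant, in the spirit of an MS-DOS style filesystem.
 * Returns 0 when the names are equal ignoring ASCII case. */
int toy_d_compare_nocase(const struct toy_qstr *a, const struct toy_qstr *b)
{
    size_t i;

    assert(a != NULL && b != NULL);
    if (a->len != b->len)
        return 1;
    for (i = 0; i < a->len; i++) {
        if (tolower((unsigned char)a->name[i]) !=
            tolower((unsigned char)b->name[i]))
            return 1;
    }
    return 0;
}
```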
12.2.5. The dentry Cache
1:The addresses of the first and last elements of the LRU list are stored in the next and
prev fields of the dentry_unused variable of type list_head. The d_lru field of the dentry object
contains pointers to the adjacent dentries in the list.
2:Each "in use" dentry object is inserted into a doubly linked list specified by the i_dentry field of the
corresponding inode object (because each inode could be associated with several hard links, a list is
required). The d_alias field of the dentry object stores the addresses of the adjacent elements in the list.
3:The hash table is implemented by means of a dentry_hashtable array.
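The idea behind the hash table in point 3 can be sketched as an array of buckets indexed by a hash of the parent directory and the component name, with collisions chained through a link field. The sketch below is user-space code under simplifying assumptions: a fixed 16-bucket table, a singly linked chain, no locking, and a made-up hash function that is not the kernel's. All toy_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_HASH_BITS 4
#define TOY_HASH_SIZE (1u << TOY_HASH_BITS)

struct toy_dentry {
    struct toy_dentry *d_parent;        /* parent directory */
    const char *d_name;                 /* pathname component */
    struct toy_dentry *d_hash_next;     /* next entry in the same bucket */
};

static struct toy_dentry *toy_hashtable[TOY_HASH_SIZE];

/* Illustrative hash: mixes the parent's address with the name bytes. */
static unsigned int toy_hash(const struct toy_dentry *parent, const char *name)
{
    uintptr_t h = (uintptr_t)parent >> 4;

    while (*name != '\0')
        h = h * 31 + (unsigned char)*name++;
    return (unsigned int)(h & (TOY_HASH_SIZE - 1));
}

/* Insert a dentry at the head of its bucket. */
void toy_d_add(struct toy_dentry *dentry)
{
    unsigned int slot;

    assert(dentry != NULL && dentry->d_name != NULL);
    slot = toy_hash(dentry->d_parent, dentry->d_name);
    dentry->d_hash_next = toy_hashtable[slot];
    toy_hashtable[slot] = dentry;
}

/* Find the dentry for `name` inside `parent`, or NULL if not cached. */
struct toy_dentry *toy_d_lookup(const struct toy_dentry *parent,
                                const char *name)
{
    struct toy_dentry *d = toy_hashtable[toy_hash(parent, name)];

    for (; d != NULL; d = d->d_hash_next) {
        if (d->d_parent == parent && strcmp(d->d_name, name) == 0)
            return d;
    }
    return NULL;
}
```

Hashing on the parent as well as the name keeps entries such as /etc/passwd and /tmp/passwd from being confused when they share a bucket.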
12.2.6. Files Associated with a Process
fs_struct
Table 12-6. The fields of the fs_struct structure
Type Field Description
atomic_t count Number of processes sharing this table
rwlock_t lock Read/write spin lock for the table fields
int umask Bit mask used when opening the file to set the file permissions
struct dentry * root Dentry of the root directory
struct dentry* pwd Dentry of the current working directory
struct dentry* altroot Dentry of the emulated root directory (always NULL for the 80x86 architecture)
struct vfsmount * rootmnt Mounted filesystem object of the root directory
struct vfsmount * pwdmnt Mounted filesystem object of the current working directory
struct vfsmount * altrootmnt Mounted filesystem object of the emulated root directory (always NULL for the 80x86 architecture)
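The umask field in Table 12-6 is a bit mask of permissions to withhold: each bit set in the mask is cleared from the mode requested when a file is created. A one-line sketch of that arithmetic follows; the function name is hypothetical, and the precondition reflects the assumption that only the nine permission bits are kept in the mask.

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Permission bits a newly created file ends up with: the requested
 * mode with every bit of the mask cleared. */
mode_t toy_apply_umask(mode_t requested, mode_t mask)
{
    /* assumption: the mask holds permission bits only */
    assert((mask & ~(mode_t)0777) == 0);
    return (mode_t)(requested & ~mask);
}
```

With the common mask 022, a request for 0666 yields 0644: group and others lose write permission.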
files_struct
Table 12-7. The fields of the files_struct structure
Type Field Description
atomic_t count Number of processes sharing this table
rwlock_t file_lock Read/write spin lock for the table fields
int max_fds Current maximum number of file objects
int max_fdset Current maximum number of file descriptors
int next_fd Maximum file descriptors ever allocated plus 1
struct file ** fd Pointer to array of file object pointers
fd_set * close_on_exec Pointer to file descriptors to be closed on exec( )
fd_set * open_fds Pointer to open file descriptors
fd_set close_on_exec_init Initial set of file descriptors to be closed on exec( )
fd_set open_fds_init Initial set of file descriptors
struct file *[] fd_array Initial array of file object pointers
Figure 12-3. The fd array
fget( )/fget_light( )
fput( )/fput_light( )
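The pairing of fget( ) and fput( ) can be pictured as a reference-counted lookup in the fd array: translating a descriptor into a file object takes a reference, and the matching put drops it. The sketch below is single-threaded user-space code; the toy_ names are hypothetical, a plain int stands in for the atomic f_count field, and the locking that file_lock provides in Table 12-7 is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_OPEN 32

struct toy_file_obj {
    int f_count;                            /* reference counter */
};

struct toy_files {                          /* stand-in for files_struct */
    int max_fds;                            /* size of the fd array in use */
    struct toy_file_obj *fd[TOY_NR_OPEN];   /* indexed by file descriptor */
};

/* Translate a descriptor into a file object and take a reference.
 * Returns NULL for an out-of-range or unused descriptor. */
struct toy_file_obj *toy_fget(struct toy_files *files, int fd)
{
    struct toy_file_obj *file;

    assert(files != NULL);
    if (fd < 0 || fd >= files->max_fds)
        return NULL;
    file = files->fd[fd];
    if (file != NULL)
        file->f_count++;
    return file;
}

/* Drop a reference. Returns 1 when this was the last one, which is
 * the point at which the release method would be invoked. */
int toy_fput(struct toy_file_obj *file)
{
    assert(file != NULL && file->f_count > 0);
    return --file->f_count == 0;
}
```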
12.3. Filesystem Types
12.3.1. Special Filesystems
Table 12-8. Most common special filesystems
Name Mount point Description
bdev none Block devices (see Chapter 13)
binfmt_misc any Miscellaneous executable formats (see Chapter 20)
devpts /dev/pts Pseudoterminal support (Open Group's Unix98 standard)