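Linux VFS File System Analysis 3 (Based on Linux 6.6): VFS and the Process Descriptor
I. Overview
In Linux, the process descriptor and a process's file-system state are closely linked: whenever a process opens a file, reads or writes it, or changes its working directory, the kernel reaches that state through the descriptor. This overview covers the main links.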
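1. The process descriptor (task_struct) and the file system
Every process (more precisely, every task) in the Linux kernel has a task_struct, which serves as its descriptor. It holds everything the kernel needs to run the task: state, scheduling information, memory mappings, and so on. File-system state is attached to it in the following ways.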
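Current working directory (pwd) and root directory (root)
Each process has a current working directory and a root directory. Both are reached through the fs pointer in task_struct, which points to a struct fs_struct. Each of the two directories is stored there as a struct path, that is, a (vfsmount, dentry) pair:
- pwd (current working directory): current->fs->pwd is the starting point for resolving relative paths. The chdir and fchdir system calls change it; they update the pwd member of fs_struct, not a member of task_struct itself.
- root (root directory): current->fs->root is the starting point for resolving absolute paths. After chroot, or inside a separate mount namespace, a process's root can differ from the system's real root.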
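struct fs_struct {
struct path root; // root directory (mount + dentry)
struct path pwd; // current working directory (mount + dentry)
// other members omitted; the full definition appears later in this article
};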
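Each struct path carries a dentry together with the vfsmount it belongs to. Through them the VFS (virtual file system) layer ties pwd and root to a concrete file-system implementation.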
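One way to observe from user space that pwd lives in the process's own fs_struct: a plain fork() (without CLONE_FS) gives the child a copy, so a chdir() in the child should leave the parent's working directory unchanged. The sketch below checks exactly that; the helper name child_chdir_is_private is made up for illustration.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a chdir() performed in a forked child leaves the parent's
 * working directory unchanged, 0 if the parent's cwd changed, -1 on error.
 */
int child_chdir_is_private(const char *child_dir)
{
    char before[4096], after[4096];
    int status;

    if (!getcwd(before, sizeof(before)))
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: it works on its own copy of the parent's fs_struct. */
        _exit(chdir(child_dir) == 0 ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    if (!getcwd(after, sizeof(after)))
        return -1;
    return strcmp(before, after) == 0;
}
```

Calling child_chdir_is_private("/") from any directory is expected to return 1. Threads created with CLONE_FS share one fs_struct, so the same experiment between such threads would behave differently.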
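File descriptor table (files_struct)
Each process also maintains a file descriptor table, struct files_struct, which tracks its open files. A file descriptor is a small integer that indexes this table; each occupied slot points to a struct file, the kernel's representation of one open file.
- files_struct: the per-process structure (shareable between threads through CLONE_FILES) that holds the descriptor table and its bookkeeping.
- file: each open descriptor refers to a struct file, which records the file's inode, the current offset, and the file operations table (read, write, open, and so on).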
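struct files_struct {
struct fdtable __rcu *fdt; // current descriptor table; fdt->fd is the struct file * array, fdt->max_fds its capacity
unsigned int next_fd; // hint for where to start searching for a free descriptor
// other members omitted; the full definition appears later in this article
};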
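2. How file-system operations interact with the process
Opening a file (open)
When a process calls open, the kernel resolves the path, allocates a struct file for the result, and installs it in a free slot of the process's descriptor table. The slot index is the file descriptor returned to user space. The struct file holds a reference to the file (its path and inode), the file offset, and a pointer to the file operations (read, write, and so on).
- During open, the path walk looks each component up in the dentry cache (d_lookup and its lockless variants), falls back to the file system's own lookup method on a miss, and reads the file's metadata from the inode.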
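Reading and writing a file (read, write)
For a read or write, the kernel uses the file descriptor to find the struct file, and from it the inode. The VFS layer then dispatches to the implementation supplied by the file's type and file system (regular file, directory, character device, and so on).
- For example, the read system call invokes the read method in the file's file_operations (or read_iter when read is not provided), and that file-system function performs the actual data transfer.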
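The dispatch described above can be modelled in a few lines of user-space C. This is only a sketch of the idea, not kernel code: all toy_ names, memfs_read, and zero_read are invented for illustration, and the real struct file_operations has many more methods and different signatures.

```c
#include <stddef.h>
#include <string.h>

struct toy_file;

/* Per-file-system method table, modelled on struct file_operations. */
struct toy_file_operations {
    long (*read)(struct toy_file *f, char *buf, size_t count);
};

/* One open file: an offset plus a pointer to the methods to use. */
struct toy_file {
    const struct toy_file_operations *f_op;
    const char *data;   /* backing data for the in-memory "file system" */
    size_t size;
    size_t f_pos;
};

/* read method of a toy in-memory file system. */
static long memfs_read(struct toy_file *f, char *buf, size_t count)
{
    size_t left = f->size - f->f_pos;

    if (count > left)
        count = left;
    memcpy(buf, f->data + f->f_pos, count);
    f->f_pos += count;
    return (long)count;
}

/* read method of a toy "zero" device: always yields zero bytes. */
static long zero_read(struct toy_file *f, char *buf, size_t count)
{
    (void)f;
    memset(buf, 0, count);
    return (long)count;
}

const struct toy_file_operations memfs_fops = { .read = memfs_read };
const struct toy_file_operations zero_fops  = { .read = zero_read };

/* Generic layer: knows nothing about the file system behind f. */
long toy_vfs_read(struct toy_file *f, char *buf, size_t count)
{
    if (!f->f_op || !f->f_op->read)
        return -1;  /* no read method available */
    return f->f_op->read(f, buf, count);
}
```

The caller of toy_vfs_read() never names memfs_read or zero_read directly; which one runs depends only on the f_op pointer stored in the file when it was "opened". That is the property that lets one read() system call serve every file system.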
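Path resolution and file lookup
Whenever a process opens a file, the kernel must resolve the path to locate it. Path resolution works on dentry objects: the dentry cache, which caches the mapping from names to inodes, is the core of Linux path lookup and is what makes it fast.
- A process obtains its current working directory with getcwd().
- When a process calls open() or another path-based operation, the kernel walks the path component by component, looking up each dentry (for example with d_lookup()) and reading metadata from the inode.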
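The component-by-component walk can be sketched with a toy directory tree in user space. Everything here (struct toy_dentry, toy_lookup_one, toy_path_walk) is hypothetical and heavily simplified: there is no hashing, no locking, no mount crossing, and no handling of "." or "..", all of which the real path walk in fs/namei.c must deal with.

```c
#include <stddef.h>
#include <string.h>

/* Toy dentry: a name, a fake inode number, and links to children/siblings. */
struct toy_dentry {
    const char *name;
    unsigned long ino;
    struct toy_dentry *child;    /* first entry inside this directory */
    struct toy_dentry *sibling;  /* next entry in the same directory */
};

/* Look up one component inside dir, keyed by (parent, name). */
static struct toy_dentry *toy_lookup_one(struct toy_dentry *dir,
                                         const char *name, size_t len)
{
    for (struct toy_dentry *d = dir->child; d; d = d->sibling)
        if (strlen(d->name) == len && memcmp(d->name, name, len) == 0)
            return d;
    return NULL;
}

/*
 * Walk path starting from root (absolute path) or pwd (relative path).
 * Returns the final dentry, or NULL if any component is missing.
 */
struct toy_dentry *toy_path_walk(struct toy_dentry *root,
                                 struct toy_dentry *pwd, const char *path)
{
    struct toy_dentry *cur = (*path == '/') ? root : pwd;

    while (cur && *path) {
        while (*path == '/')
            path++;                        /* skip separators */
        size_t len = strcspn(path, "/");   /* length of next component */
        if (len == 0)
            break;                         /* trailing slash or end */
        cur = toy_lookup_one(cur, path, len);
        path += len;
    }
    return cur;
}
```

Note how the starting point is the only place where root and pwd matter: an absolute path starts at the process's root, a relative one at its pwd, which mirrors the role of current->fs->root and current->fs->pwd.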
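3. The process's view of the virtual file system (VFS)
The VFS is an abstraction layer that gives all file systems a uniform interface, so user programs can work with different file-system types transparently. It connects the rest of the kernel to each concrete file system through data structures such as the superblock, inode, and dentry.
- Process and VFS: a process reaches its current working directory and root directory through the fs_struct referenced from task_struct. Both are stored as struct path values (mount plus dentry), which the VFS uses as the starting points for path resolution.
- Process and file system: a process accesses files through the descriptors in its descriptor table. Each descriptor refers to a struct file, which in turn references the file's dentry and inode.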
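4. Mounts and process isolation
Different processes can run with different views of the file system. After chroot, or in a separate mount namespace, a process's view can differ from the system-wide one. Each process has its own root directory (root) and current working directory (pwd), stored as struct path values in its fs_struct. By mounting different file systems and changing the root, a process can be confined to an isolated file-system environment.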
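II. Process-related structures
In the Linux kernel, a process is described by the process descriptor, struct task_struct. This section looks only at the members related to the file system and does not expand on the others. As the listing below shows, those members are fs, a pointer to struct fs_struct (root and current directory), and files, a pointer to struct files_struct (the descriptor table).
include/linux/sched.h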
struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
/*
* For reasons of header soup (see current_thread_info()), this
* must be the first element of task_struct.
*/
struct thread_info thread_info;
#endif
unsigned int __state;
#ifdef CONFIG_PREEMPT_RT
/* saved state for "spinlock sleepers" */
unsigned int saved_state;
#endif
/*
* This begins the randomizable portion of task_struct. Only
* scheduling-critical items should be added above here.
*/
randomized_struct_fields_start
void *stack;
refcount_t usage;
/* Per task flags (PF_*), defined further below: */
unsigned int flags;
unsigned int ptrace;
#ifdef CONFIG_SMP
int on_cpu;
struct __call_single_node wake_entry;
unsigned int wakee_flips;
unsigned long wakee_flip_decay_ts;
struct task_struct *last_wakee;
/*
* recent_used_cpu is initially set as the last CPU used by a task
* that wakes affine another task. Waker/wakee relationships can
* push tasks around a CPU where each wakeup moves to the next one.
* Tracking a recently used CPU allows a quick search for a recently
* used CPU that may be idle.
*/
int recent_used_cpu;
int wake_cpu;
#endif
int on_rq;
int prio;
int static_prio;
int normal_prio;
unsigned int rt_priority;
struct sched_entity se;
struct sched_rt_entity rt;
struct sched_dl_entity dl;
const struct sched_class *sched_class;
#ifdef CONFIG_SCHED_CORE
struct rb_node core_node;
unsigned long core_cookie;
unsigned int core_occupation;
#endif
#ifdef CONFIG_CGROUP_SCHED
struct task_group *sched_task_group;
#endif
#ifdef CONFIG_UCLAMP_TASK
/*
* Clamp values requested for a scheduling entity.
* Must be updated with task_rq_lock() held.
*/
struct uclamp_se uclamp_req[UCLAMP_CNT];
/*
* Effective clamp values used for a scheduling entity.
* Must be updated with task_rq_lock() held.
*/
struct uclamp_se uclamp[UCLAMP_CNT];
#endif
struct sched_statistics stats;
#ifdef CONFIG_PREEMPT_NOTIFIERS
/* List of struct preempt_notifier: */
struct hlist_head preempt_notifiers;
#endif
#ifdef CONFIG_BLK_DEV_IO_TRACE
unsigned int btrace_seq;
#endif
unsigned int policy;
int nr_cpus_allowed;
const cpumask_t *cpus_ptr;
cpumask_t *user_cpus_ptr;
cpumask_t cpus_mask;
void *migration_pending;
#ifdef CONFIG_SMP
unsigned short migration_disabled;
#endif
unsigned short migration_flags;
#ifdef CONFIG_PREEMPT_RCU
int rcu_read_lock_nesting;
union rcu_special rcu_read_unlock_special;
struct list_head rcu_node_entry;
struct rcu_node *rcu_blocked_node;
#endif /* #ifdef CONFIG_PREEMPT_RCU */
#ifdef CONFIG_TASKS_RCU
unsigned long rcu_tasks_nvcsw;
u8 rcu_tasks_holdout;
u8 rcu_tasks_idx;
int rcu_tasks_idle_cpu;
struct list_head rcu_tasks_holdout_list;
#endif /* #ifdef CONFIG_TASKS_RCU */
#ifdef CONFIG_TASKS_TRACE_RCU
int trc_reader_nesting;
int trc_ipi_to_cpu;
union rcu_special trc_reader_special;
struct list_head trc_holdout_list;
struct list_head trc_blkd_node;
int trc_blkd_cpu;
#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
struct sched_info sched_info;
struct list_head tasks;
#ifdef CONFIG_SMP
struct plist_node pushable_tasks;
struct rb_node pushable_dl_tasks;
#endif
struct mm_struct *mm;
struct mm_struct *active_mm;
int exit_state;
int exit_code;
int exit_signal;
/* The signal sent when the parent dies: */
int pdeath_signal;
/* JOBCTL_*, siglock protected: */
unsigned long jobctl;
/* Used for emulating ABI behavior of previous Linux versions: */
unsigned int personality;
/* Scheduler bits, serialized by scheduler locks: */
unsigned sched_reset_on_fork:1;
unsigned sched_contributes_to_load:1;
unsigned sched_migrated:1;
/* Force alignment to the next boundary: */
unsigned :0;
/* Unserialized, strictly 'current' */
/*
* This field must not be in the scheduler word above due to wakelist
* queueing no longer being serialized by p->on_cpu. However:
*
* p->XXX = X; ttwu()
* schedule() if (p->on_rq && ..) // false
* smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true
* deactivate_task() ttwu_queue_wakelist())
* p->on_rq = 0; p->sched_remote_wakeup = Y;
*
* guarantees all stores of 'current' are visible before
* ->sched_remote_wakeup gets used, so it can be in this word.
*/
unsigned sched_remote_wakeup:1;
#ifdef CONFIG_RT_MUTEXES
unsigned sched_rt_mutex:1;
#endif
/* Bit to tell LSMs we're in execve(): */
unsigned in_execve:1;
unsigned in_iowait:1;
#ifndef TIF_RESTORE_SIGMASK
unsigned restore_sigmask:1;
#endif
#ifdef CONFIG_MEMCG
unsigned in_user_fault:1;
#endif
#ifdef CONFIG_LRU_GEN
/* whether the LRU algorithm may apply to this access */
unsigned in_lru_fault:1;
#endif
#ifdef CONFIG_COMPAT_BRK
unsigned brk_randomized:1;
#endif
#ifdef CONFIG_CGROUPS
/* disallow userland-initiated cgroup migration */
unsigned no_cgroup_migration:1;
/* task is frozen/stopped (used by the cgroup freezer) */
unsigned frozen:1;
#endif
#ifdef CONFIG_BLK_CGROUP
unsigned use_memdelay:1;
#endif
#ifdef CONFIG_PSI
/* Stalled due to lack of memory */
unsigned in_memstall:1;
#endif
#ifdef CONFIG_PAGE_OWNER
/* Used by page_owner=on to detect recursion in page tracking. */
unsigned in_page_owner:1;
#endif
#ifdef CONFIG_EVENTFD
/* Recursion prevention for eventfd_signal() */
unsigned in_eventfd:1;
#endif
#ifdef CONFIG_IOMMU_SVA
unsigned pasid_activated:1;
#endif
#ifdef CONFIG_CPU_SUP_INTEL
unsigned reported_split_lock:1;
#endif
#ifdef CONFIG_TASK_DELAY_ACCT
/* delay due to memory thrashing */
unsigned in_thrashing:1;
#endif
unsigned long atomic_flags; /* Flags requiring atomic access. */
struct restart_block restart_block;
pid_t pid;
pid_t tgid;
#ifdef CONFIG_STACKPROTECTOR
/* Canary value for the -fstack-protector GCC feature: */
unsigned long stack_canary;
#endif
/*
* Pointers to the (original) parent process, youngest child, younger sibling,
* older sibling, respectively. (p->father can be replaced with
* p->real_parent->pid)
*/
/* Real parent process: */
struct task_struct __rcu *real_parent;
/* Recipient of SIGCHLD, wait4() reports: */
struct task_struct __rcu *parent;
/*
* Children/sibling form the list of natural children:
*/
struct list_head children;
struct list_head sibling;
struct task_struct *group_leader;
/*
* 'ptraced' is the list of tasks this task is using ptrace() on.
*
* This includes both natural children and PTRACE_ATTACH targets.
* 'ptrace_entry' is this task's link on the p->parent->ptraced list.
*/
struct list_head ptraced;
struct list_head ptrace_entry;
/* PID/PID hash table linkage. */
struct pid *thread_pid;
struct hlist_node pid_links[PIDTYPE_MAX];
struct list_head thread_group;
struct list_head thread_node;
struct completion *vfork_done;
/* CLONE_CHILD_SETTID: */
int __user *set_child_tid;
/* CLONE_CHILD_CLEARTID: */
int __user *clear_child_tid;
/* PF_KTHREAD | PF_IO_WORKER */
void *worker_private;
u64 utime;
u64 stime;
#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
u64 utimescaled;
u64 stimescaled;
#endif
u64 gtime;
struct prev_cputime prev_cputime;
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
struct vtime vtime;
#endif
#ifdef CONFIG_NO_HZ_FULL
atomic_t tick_dep_mask;
#endif
/* Context switch counts: */
unsigned long nvcsw;
unsigned long nivcsw;
/* Monotonic time in nsecs: */
u64 start_time;
/* Boot based time in nsecs: */
u64 start_boottime;
/* MM fault and swap info: this can arguably be seen as either mm-specific or thread-specific: */
unsigned long min_flt;
unsigned long maj_flt;
/* Empty if CONFIG_POSIX_CPUTIMERS=n */
struct posix_cputimers posix_cputimers;
#ifdef CONFIG_POSIX_CPU_TIMERS_TASK_WORK
struct posix_cputimers_work posix_cputimers_work;
#endif
/* Process credentials: */
/* Tracer's credentials at attach: */
const struct cred __rcu *ptracer_cred;
/* Objective and real subjective task credentials (COW): */
const struct cred __rcu *real_cred;
/* Effective (overridable) subjective task credentials (COW): */
const struct cred __rcu *cred;
#ifdef CONFIG_KEYS
/* Cached requested key. */
struct key *cached_requested_key;
#endif
/*
* executable name, excluding path.
*
* - normally initialized setup_new_exec()
* - access it with [gs]et_task_comm()
* - lock it with task_lock()
*/
char comm[TASK_COMM_LEN];
struct nameidata *nameidata;
#ifdef CONFIG_SYSVIPC
struct sysv_sem sysvsem;
struct sysv_shm sysvshm;
#endif
#ifdef CONFIG_DETECT_HUNG_TASK
unsigned long last_switch_count;
unsigned long last_switch_time;
#endif
/* Filesystem information: */
struct fs_struct *fs;
/* Open file information: */
struct files_struct *files;
#ifdef CONFIG_IO_URING
struct io_uring_task *io_uring;
#endif
/* Namespaces: */
struct nsproxy *nsproxy;
/* Signal handlers: */
struct signal_struct *signal;
struct sighand_struct __rcu *sighand;
sigset_t blocked;
sigset_t real_blocked;
/* Restored if set_restore_sigmask() was used: */
sigset_t saved_sigmask;
struct sigpending pending;
unsigned long sas_ss_sp;
size_t sas_ss_size;
unsigned int sas_ss_flags;
struct callback_head *task_works;
#ifdef CONFIG_AUDIT
#ifdef CONFIG_AUDITSYSCALL
struct audit_context *audit_context;
#endif
kuid_t loginuid;
unsigned int sessionid;
#endif
struct seccomp seccomp;
struct syscall_user_dispatch syscall_dispatch;
/* Thread group tracking: */
u64 parent_exec_id;
u64 self_exec_id;
/* Protection against (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed, mempolicy: */
spinlock_t alloc_lock;
/* Protection of the PI data structures: */
raw_spinlock_t pi_lock;
struct wake_q_node wake_q;
#ifdef CONFIG_RT_MUTEXES
/* PI waiters blocked on a rt_mutex held by this task: */
struct rb_root_cached pi_waiters;
/* Updated under owner's pi_lock and rq lock */
struct task_struct *pi_top_task;
/* Deadlock detection and priority inheritance handling: */
struct rt_mutex_waiter *pi_blocked_on;
#endif
#ifdef CONFIG_DEBUG_MUTEXES
/* Mutex deadlock detection: */
struct mutex_waiter *blocked_on;
#endif
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
int non_block_count;
#endif
#ifdef CONFIG_TRACE_IRQFLAGS
struct irqtrace_events irqtrace;
unsigned int hardirq_threaded;
u64 hardirq_chain_key;
int softirqs_enabled;
int softirq_context;
int irq_config;
#endif
#ifdef CONFIG_PREEMPT_RT
int softirq_disable_cnt;
#endif
#ifdef CONFIG_LOCKDEP
# define MAX_LOCK_DEPTH 48UL
u64 curr_chain_key;
int lockdep_depth;
unsigned int lockdep_recursion;
struct held_lock held_locks[MAX_LOCK_DEPTH];
#endif
#if defined(CONFIG_UBSAN) && !defined(CONFIG_UBSAN_TRAP)
unsigned int in_ubsan;
#endif
/* Journalling filesystem info: */
void *journal_info;
/* Stacked block device info: */
struct bio_list *bio_list;
/* Stack plugging: */
struct blk_plug *plug;
/* VM state: */
struct reclaim_state *reclaim_state;
struct io_context *io_context;
#ifdef CONFIG_COMPACTION
struct capture_control *capture_control;
#endif
/* Ptrace state: */
unsigned long ptrace_message;
kernel_siginfo_t *last_siginfo;
struct task_io_accounting ioac;
#ifdef CONFIG_PSI
/* Pressure stall state */
unsigned int psi_flags;
#endif
#ifdef CONFIG_TASK_XACCT
/* Accumulated RSS usage: */
u64 acct_rss_mem1;
/* Accumulated virtual memory usage: */
u64 acct_vm_mem1;
/* stime + utime since last update: */
u64 acct_timexpd;
#endif
#ifdef CONFIG_CPUSETS
/* Protected by ->alloc_lock: */
nodemask_t mems_allowed;
/* Sequence number to catch updates: */
seqcount_spinlock_t mems_allowed_seq;
int cpuset_mem_spread_rotor;
int cpuset_slab_spread_rotor;
#endif
#ifdef CONFIG_CGROUPS
/* Control Group info protected by css_set_lock: */
struct css_set __rcu *cgroups;
/* cg_list protected by css_set_lock and tsk->alloc_lock: */
struct list_head cg_list;
#endif
#ifdef CONFIG_X86_CPU_RESCTRL
u32 closid;
u32 rmid;
#endif
#ifdef CONFIG_FUTEX
struct robust_list_head __user *robust_list;
#ifdef CONFIG_COMPAT
struct compat_robust_list_head __user *compat_robust_list;
#endif
struct list_head pi_state_list;
struct futex_pi_state *pi_state_cache;
struct mutex futex_exit_mutex;
unsigned int futex_state;
#endif
#ifdef CONFIG_PERF_EVENTS
struct perf_event_context *perf_event_ctxp;
struct mutex perf_event_mutex;
struct list_head perf_event_list;
#endif
#ifdef CONFIG_DEBUG_PREEMPT
unsigned long preempt_disable_ip;
#endif
#ifdef CONFIG_NUMA
/* Protected by alloc_lock: */
struct mempolicy *mempolicy;
short il_prev;
short pref_node_fork;
#endif
#ifdef CONFIG_NUMA_BALANCING
int numa_scan_seq;
unsigned int numa_scan_period;
unsigned int numa_scan_period_max;
int numa_preferred_nid;
unsigned long numa_migrate_retry;
/* Migration stamp: */
u64 node_stamp;
u64 last_task_numa_placement;
u64 last_sum_exec_runtime;
struct callback_head numa_work;
/*
* This pointer is only modified for current in syscall and
* pagefault context (and for tasks being destroyed), so it can be read
* from any of the following contexts:
* - RCU read-side critical section
* - current->numa_group from everywhere
* - task's runqueue locked, task not running
*/
struct numa_group __rcu *numa_group;
/*
* numa_faults is an array split into four regions:
* faults_memory, faults_cpu, faults_memory_buffer, faults_cpu_buffer
* in this precise order.
*
* faults_memory: Exponential decaying average of faults on a per-node
* basis. Scheduling placement decisions are made based on these
* counts. The values remain static for the duration of a PTE scan.
* faults_cpu: Track the nodes the process was running on when a NUMA
* hinting fault was incurred.
* faults_memory_buffer and faults_cpu_buffer: Record faults per node
* during the current scan window. When the scan completes, the counts
* in faults_memory and faults_cpu decay and these values are copied.
*/
unsigned long *numa_faults;
unsigned long total_numa_faults;
/*
* numa_faults_locality tracks if faults recorded during the last
* scan window were remote/local or failed to migrate. The task scan
* period is adapted based on the locality of the faults with different
* weights depending on whether they were shared or private faults
*/
unsigned long numa_faults_locality[3];
unsigned long numa_pages_migrated;
#endif /* CONFIG_NUMA_BALANCING */
#ifdef CONFIG_RSEQ
struct rseq __user *rseq;
u32 rseq_len;
u32 rseq_sig;
/*
* RmW on rseq_event_mask must be performed atomically
* with respect to preemption.
*/
unsigned long rseq_event_mask;
#endif
#ifdef CONFIG_SCHED_MM_CID
int mm_cid; /* Current cid in mm */
int last_mm_cid; /* Most recent cid in mm */
int migrate_from_cpu;
int mm_cid_active; /* Whether cid bitmap is active */
struct callback_head cid_work;
#endif
struct tlbflush_unmap_batch tlb_ubc;
/* Cache last used pipe for splice(): */
struct pipe_inode_info *splice_pipe;
struct page_frag task_frag;
#ifdef CONFIG_TASK_DELAY_ACCT
struct task_delay_info *delays;
#endif
#ifdef CONFIG_FAULT_INJECTION
int make_it_fail;
unsigned int fail_nth;
#endif
/*
* When (nr_dirtied >= nr_dirtied_pause), it's time to call
* balance_dirty_pages() for a dirty throttling pause:
*/
int nr_dirtied;
int nr_dirtied_pause;
/* Start of a write-and-pause period: */
unsigned long dirty_paused_when;
#ifdef CONFIG_LATENCYTOP
int latency_record_count;
struct latency_record latency_record[LT_SAVECOUNT];
#endif
/*
* Time slack values; these are used to round up poll() and
* select() etc timeout values. These are in nanoseconds.
*/
u64 timer_slack_ns;
u64 default_timer_slack_ns;
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
unsigned int kasan_depth;
#endif
#ifdef CONFIG_KCSAN
struct kcsan_ctx kcsan_ctx;
#ifdef CONFIG_TRACE_IRQFLAGS
struct irqtrace_events kcsan_save_irqtrace;
#endif
#ifdef CONFIG_KCSAN_WEAK_MEMORY
int kcsan_stack_depth;
#endif
#endif
#ifdef CONFIG_KMSAN
struct kmsan_ctx kmsan_ctx;
#endif
#if IS_ENABLED(CONFIG_KUNIT)
struct kunit *kunit_test;
#endif
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
/* Index of current stored address in ret_stack: */
int curr_ret_stack;
int curr_ret_depth;
/* Stack of return addresses for return function tracing: */
struct ftrace_ret_stack *ret_stack;
/* Timestamp for last schedule: */
unsigned long long ftrace_timestamp;
/*
* Number of functions that haven't been traced
* because of depth overrun:
*/
atomic_t trace_overrun;
/* Pause tracing: */
atomic_t tracing_graph_pause;
#endif
#ifdef CONFIG_TRACING
/* Bitmask and counter of trace recursion: */
unsigned long trace_recursion;
#endif /* CONFIG_TRACING */
#ifdef CONFIG_KCOV
/* See kernel/kcov.c for more details. */
/* Coverage collection mode enabled for this task (0 if disabled): */
unsigned int kcov_mode;
/* Size of the kcov_area: */
unsigned int kcov_size;
/* Buffer for coverage collection: */
void *kcov_area;
/* KCOV descriptor wired with this task or NULL: */
struct kcov *kcov;
/* KCOV common handle for remote coverage collection: */
u64 kcov_handle;
/* KCOV sequence number: */
int kcov_sequence;
/* Collect coverage from softirq context: */
unsigned int kcov_softirq;
#endif
#ifdef CONFIG_MEMCG
struct mem_cgroup *memcg_in_oom;
gfp_t memcg_oom_gfp_mask;
int memcg_oom_order;
/* Number of pages to reclaim on returning to userland: */
unsigned int memcg_nr_pages_over_high;
/* Used by memcontrol for targeted memcg charge: */
struct mem_cgroup *active_memcg;
#endif
#ifdef CONFIG_BLK_CGROUP
struct gendisk *throttle_disk;
#endif
#ifdef CONFIG_UPROBES
struct uprobe_task *utask;
#endif
#if defined(CONFIG_BCACHE) || defined(CONFIG_BCACHE_MODULE)
unsigned int sequential_io;
unsigned int sequential_io_avg;
#endif
struct kmap_ctrl kmap_ctrl;
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
unsigned long task_state_change;
# ifdef CONFIG_PREEMPT_RT
unsigned long saved_state_change;
# endif
#endif
struct rcu_head rcu;
refcount_t rcu_users;
int pagefault_disabled;
#ifdef CONFIG_MMU
struct task_struct *oom_reaper_list;
struct timer_list oom_reaper_timer;
#endif
#ifdef CONFIG_VMAP_STACK
struct vm_struct *stack_vm_area;
#endif
#ifdef CONFIG_THREAD_INFO_IN_TASK
/* A live task holds one reference: */
refcount_t stack_refcount;
#endif
#ifdef CONFIG_LIVEPATCH
int patch_state;
#endif
#ifdef CONFIG_SECURITY
/* Used by LSM modules for access restriction: */
void *security;
#endif
#ifdef CONFIG_BPF_SYSCALL
/* Used by BPF task local storage */
struct bpf_local_storage __rcu *bpf_storage;
/* Used for BPF run context */
struct bpf_run_ctx *bpf_ctx;
#endif
#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
unsigned long lowest_stack;
unsigned long prev_lowest_stack;
#endif
#ifdef CONFIG_X86_MCE
void __user *mce_vaddr;
__u64 mce_kflags;
u64 mce_addr;
__u64 mce_ripv : 1,
mce_whole_page : 1,
__mce_reserved : 62;
struct callback_head mce_kill_me;
int mce_count;
#endif
#ifdef CONFIG_KRETPROBES
struct llist_head kretprobe_instances;
#endif
#ifdef CONFIG_RETHOOK
struct llist_head rethooks;
#endif
#ifdef CONFIG_ARCH_HAS_PARANOID_L1D_FLUSH
/*
* If L1D flush is supported on mm context switch
* then we use this callback head to queue kill work
* to kill tasks that are not running on SMT disabled
* cores
*/
struct callback_head l1d_flush_kill;
#endif
#ifdef CONFIG_RV
/*
* Per-task RV monitor. Nowadays fixed in RV_PER_TASK_MONITORS.
* If we find justification for more monitors, we can think
* about adding more or developing a dynamic method. So far,
* none of these are justified.
*/
union rv_task_monitor rv[RV_PER_TASK_MONITORS];
#endif
#ifdef CONFIG_USER_EVENTS
struct user_event_mm *user_event_mm;
#endif
int wait_res_type;
union {
struct folio *wait_folio;
struct bio *wait_bio;
};
unsigned long wait_moment;
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
*/
randomized_struct_fields_end
/* CPU-specific state of this task: */
struct thread_struct thread;
/*
* WARNING: on x86, 'thread_struct' contains a variable-sized
* structure. It *MUST* be at the end of 'task_struct'.
*
* Do not put anything below here!
*/
};
struct fs_struct
This structure holds the root directory and the current working directory of the process, each represented as a struct path.
include/linux/fs_struct.h
struct fs_struct {
int users;
spinlock_t lock;
seqcount_spinlock_t seq;
int umask;
int in_exec;
struct path root, pwd;
} __randomize_layout;
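A struct path pairs a dentry with the struct vfsmount of the mount it belongs to. The vfsmount in turn points to the superblock of the mounted file system and to the root dentry of the mounted tree.
include/linux/path.h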
struct path {
struct vfsmount *mnt;
struct dentry *dentry;
} __randomize_layout;
include/linux/mount.h
struct vfsmount {
struct dentry *mnt_root; /* root of the mounted tree */
struct super_block *mnt_sb; /* pointer to superblock */
int mnt_flags;
struct mnt_idmap *mnt_idmap;
} __randomize_layout;
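The root and pwd members are changed through the chroot and chdir system calls. After the permission checks, these calls end up in set_fs_root and set_fs_pwd respectively, which are defined as follows.
fs/fs_struct.c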
/*
* Replace the fs->{rootmnt,root} with {mnt,dentry}. Put the old values.
* It can block.
*/
void set_fs_root(struct fs_struct *fs, const struct path *path)
{
struct path old_root;
path_get(path);
spin_lock(&fs->lock);
write_seqcount_begin(&fs->seq);
old_root = fs->root;
fs->root = *path;
write_seqcount_end(&fs->seq);
spin_unlock(&fs->lock);
if (old_root.dentry)
path_put(&old_root);
}
fs/fs_struct.c
/*
* Replace the fs->{pwdmnt,pwd} with {mnt,dentry}. Put the old values.
* It can block.
*/
void set_fs_pwd(struct fs_struct *fs, const struct path *path)
{
struct path old_pwd;
path_get(path);
spin_lock(&fs->lock);
write_seqcount_begin(&fs->seq);
old_pwd = fs->pwd;
fs->pwd = *path;
write_seqcount_end(&fs->seq);
spin_unlock(&fs->lock);
if (old_pwd.dentry)
path_put(&old_pwd);
}
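Both functions follow the same ordering: take a reference on the new path, swap it in under the lock, and only then drop the reference on the old one. A minimal user-space sketch of that ordering is shown below. The toy_ types and functions are invented for illustration, a plain integer stands in for the kernel's reference counting, and the spinlock and seqcount are left out.

```c
#include <stddef.h>

/* Toy stand-in for a refcounted directory (what struct path pins). */
struct toy_dir {
    const char *name;
    int refcount;
};

struct toy_fs {
    struct toy_dir *pwd;
};

static void toy_get(struct toy_dir *d) { d->refcount++; }
static void toy_put(struct toy_dir *d) { d->refcount--; }

/* Same ordering as set_fs_pwd(): pin the new value, swap, release the old. */
void toy_set_pwd(struct toy_fs *fs, struct toy_dir *new_pwd)
{
    struct toy_dir *old;

    toy_get(new_pwd);   /* take the new reference first (path_get) */

    /* The kernel holds fs->lock and bumps fs->seq around this swap. */
    old = fs->pwd;
    fs->pwd = new_pwd;

    if (old)
        toy_put(old);   /* drop the old reference last (path_put) */
}
```

Taking the new reference before dropping the old one matters when both are the same directory: the count goes 1, 2, 1 and never touches zero, so the object is never freed while still in use. Dropping the old reference outside the lock also matches the "It can block" remark in the kernel comments, since releasing a path may sleep.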
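struct files_struct
Its main members are fd_array and fdtab. fdtab is the embedded initial struct fdtable: it records the table capacity (max_fds), the array of struct file pointers, and the bitmaps of open and close-on-exec descriptors. fdt points to the table currently in use, which is initially fdtab.
include/linux/fdtable.h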
/*
* Open file table structure
*/
struct files_struct {
/*
* read mostly part
*/
atomic_t count;
bool resize_in_progress;
wait_queue_head_t resize_wait;
struct fdtable __rcu *fdt;
struct fdtable fdtab;
/*
* written part on a separate cache line in SMP
*/
spinlock_t file_lock ____cacheline_aligned_in_smp;
unsigned int next_fd;
unsigned long close_on_exec_init[1];
unsigned long open_fds_init[1];
unsigned long full_fds_bits_init[1];
struct file __rcu * fd_array[NR_OPEN_DEFAULT];
};
struct fdtable is defined as follows. Its key member is the pointer array fd: each non-NULL entry points to the struct file of an open file. That structure is described next.
include/linux/fdtable.h
struct fdtable {
unsigned int max_fds;
struct file __rcu **fd; /* current fd array */
unsigned long *close_on_exec;
unsigned long *open_fds;
unsigned long *full_fds_bits;
struct rcu_head rcu;
};
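To make the roles of fd, max_fds, and open_fds concrete, here is a small user-space model of a descriptor table. It is a sketch under simplifying assumptions: the toy_ names are hypothetical, the table has a fixed size of 64 slots so one 64-bit word can serve as the bitmap, and there is no RCU, locking, or table growth.

```c
#include <stddef.h>

#define TOY_MAX_FDS 64

/* Toy fdtable: a pointer array plus a bitmap of slots in use. */
struct toy_fdtable {
    unsigned int max_fds;
    void *fd[TOY_MAX_FDS];        /* stands in for struct file * */
    unsigned long long open_fds;  /* bit n set => descriptor n is in use */
};

/* Find the lowest free descriptor, mark it used, and store file there. */
int toy_fd_install(struct toy_fdtable *fdt, void *file)
{
    for (unsigned int fd = 0; fd < fdt->max_fds; fd++) {
        if (!(fdt->open_fds & (1ULL << fd))) {
            fdt->open_fds |= 1ULL << fd;
            fdt->fd[fd] = file;
            return (int)fd;
        }
    }
    return -1;  /* table full; the kernel would try to grow it instead */
}

/* Descriptor -> file lookup with a bounds check. */
void *toy_fd_lookup(const struct toy_fdtable *fdt, unsigned int fd)
{
    if (fd >= fdt->max_fds)
        return NULL;
    return fdt->fd[fd];
}

/* Clear the slot and make the number available again. */
void *toy_fd_close(struct toy_fdtable *fdt, unsigned int fd)
{
    void *file = toy_fd_lookup(fdt, fd);

    if (file) {
        fdt->fd[fd] = NULL;
        fdt->open_fds &= ~(1ULL << fd);
    }
    return file;
}
```

With a table initialised as `struct toy_fdtable t = { .max_fds = TOY_MAX_FDS };`, installing two files yields descriptors 0 and 1; closing 0 and installing again hands 0 back, which is the lowest-free-number behaviour user programs see from open().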
struct file describes one open file: the path (dentry and mount) and inode it refers to, and its operations table (open, read, write, release, and so on).
include/linux/fs.h
/*
* f_{lock,count,pos_lock} members can be highly contended and share
* the same cacheline. f_{lock,mode} are very frequently used together
* and so share the same cacheline as well. The read-mostly
* f_{path,inode,op} are kept on a separate cacheline.
*/
struct file {
union {
struct llist_node f_llist;
struct rcu_head f_rcuhead;
unsigned int f_iocb_flags;
};
/*
* Protects f_ep, f_flags.
* Must not be taken from IRQ context.
*/
spinlock_t f_lock;
fmode_t f_mode;
atomic_long_t f_count;
struct mutex f_pos_lock;
loff_t f_pos;
unsigned int f_flags;
struct fown_struct f_owner;
const struct cred *f_cred;
struct file_ra_state f_ra;
struct path f_path;
struct inode *f_inode; /* cached value */
const struct file_operations *f_op;
u64 f_version;
#ifdef CONFIG_SECURITY
void *f_security;
#endif
/* needed for tty driver, and maybe others */
void *private_data;
#ifdef CONFIG_EPOLL
/* Used by fs/eventpoll.c to link all the hooks to this file */
struct hlist_head *f_ep;
#endif /* #ifdef CONFIG_EPOLL */
struct address_space *f_mapping;
errseq_t f_wb_err;
errseq_t f_sb_err; /* for syncfs */
} __randomize_layout
__attribute__((aligned(4))); /* lest something weird decides that 2 is OK */
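Together with the relationships covered in the earlier articles (file-system type, inode, superblock, dentry, root dentry, and root inode), the structures above describe how the file-system state and descriptor table of a process tie into the file system.
2.1 File-system type and superblock
2.2 Superblock and inode
2.3 dentry and inode
2.4 struct fs_struct and struct dentry
The root and pwd members of struct fs_struct refer to the root directory and the current directory: each is a struct path whose dentry member points to the corresponding directory's dentry, as the figure below shows.
2.5 struct file, struct dentry, struct inode, and struct file_operations
As described earlier, a file or directory is described by a struct dentry and a struct inode. The process descriptor reaches them through struct file. The simple file system created in the earlier article could not be read or written because it provided no file operations: those are implemented by struct file_operations, which struct file references through f_op. The relationships between these structures are as follows.
2.6 The process descriptor (struct task_struct), struct fs_struct, and struct files_struct
The previous subsection covered the links between struct file, struct dentry, struct inode, and struct file_operations. This one shows how the process descriptor connects to them.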
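III. Worked examples
A few concrete system calls show how the process descriptor and the file system interact.
1. Opening a file (the open() system call)
Every process has a file descriptor table (files_struct) that tracks its open files. Each open() call triggers a sequence of kernel operations involving the process descriptor, the descriptor table, and file-system structures.
Suppose process A runs: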
int fd = open("/home/user/file.txt", O_RDONLY);
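The call proceeds roughly as follows:
- Reserving a descriptor: the kernel reserves a free descriptor number in the process's table (get_unused_fd_flags()).
- Path resolution: the kernel resolves /home/user/file.txt one component at a time. In Linux 6.6 this is done by the path-walk code in fs/namei.c (path_openat() and link_path_walk()).
- dentry and inode: the walk uses the dentry cache to speed up lookup; every directory level has a dentry. The walk ends at the file's dentry and its inode, which holds the metadata (permissions, size, block mapping, and so on).
- Setting up the struct file: the kernel fills in a struct file representing this open instance. It records the path and inode, the file offset, and the file operations (read, write, and so on).
- Updating the descriptor table: the kernel stores the struct file pointer in the reserved slot (fd_install()), so the process can reach the file through fd.
- Returning the descriptor: on success open() returns fd, which later calls such as read() and write() use.
The structures involved:
- task_struct: its files member points to the process's descriptor table.
- files_struct: the descriptor table; it reaches the array of struct file pointers through its fdtable.
- file: one open file, holding the offset, a pointer to the inode, and a pointer to the operations table.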
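struct files_struct {
struct fdtable __rcu *fdt; // current descriptor table; fdt->fd is the struct file * array
unsigned int next_fd; // hint for the next free descriptor
// other members omitted
};
struct file {
struct inode *f_inode; // the file's inode (cached value)
const struct file_operations *f_op; // file operations table
loff_t f_pos; // current file offset
// other members omitted
};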
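Because the offset lives in struct file and not in the descriptor slot, two descriptors that refer to the same struct file share one offset, while a second open() of the same path gets its own. The sketch below probes this from user space. The helper name offset_seen_by_second_fd is made up, and the code assumes /tmp is writable.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Reads 4 bytes through one descriptor, then reports the offset seen through
 * a second descriptor obtained either with dup() (share_file != 0) or with a
 * separate open() of the same path (share_file == 0). Returns -1 on error.
 */
long offset_seen_by_second_fd(int share_file)
{
    char path[] = "/tmp/vfs-demo-XXXXXX";   /* assumed writable location */
    char buf[4];
    long result = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    /* dup(): same struct file. open(): a new struct file, same inode. */
    int fd2 = share_file ? dup(fd) : open(path, O_RDONLY);

    if (fd2 >= 0 &&
        write(fd, "0123456789", 10) == 10 &&
        lseek(fd, 0, SEEK_SET) == 0 &&
        read(fd, buf, sizeof(buf)) == (ssize_t)sizeof(buf))
        result = (long)lseek(fd2, 0, SEEK_CUR);

    if (fd2 >= 0)
        close(fd2);
    close(fd);
    unlink(path);
    return result;
}
```

With share_file set, the second descriptor should report offset 4, because the read through fd moved the shared f_pos. With a separate open(), it should still report 0.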
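2. Changing the current working directory (the chdir() system call)
Every process has a current working directory (pwd), which it can change with chdir(). The call updates the fs_struct referenced from the process descriptor.
Suppose process A runs: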
chdir("/home/user/documents");
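- Path resolution: the kernel resolves /home/user/documents to a struct path (mount plus dentry) and checks that it is a directory the process may search.
- Updating the working directory: the kernel calls set_fs_pwd(current->fs, &path), which makes fs->pwd refer to the new directory and drops the reference to the old one.
- Effect: from then on, relative paths used by the process (and by any task sharing the same fs_struct) resolve from the new directory.
The structures involved:
- fs_struct: holds the current working directory (pwd) and the root directory (root), each as a struct path.
- dentry: a directory entry representing one name (file or directory) in the file system. It caches the mapping from that name to its inode.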
struct fs_struct {
struct path root; // root directory
struct path pwd; // current working directory
// other members omitted
};
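The effect of chdir() on relative path resolution can be checked from user space by comparing what "." resolves to with the directory just entered. This is a minimal sketch; dot_follows_chdir is a hypothetical helper, it expects an absolute path, and it changes the calling process's working directory as a side effect.

```c
#include <sys/stat.h>
#include <unistd.h>

/*
 * After chdir(dir), does the relative path "." name the same inode as dir?
 * Returns 1 if yes, 0 if no, -1 on error. dir should be an absolute path.
 */
int dot_follows_chdir(const char *dir)
{
    struct stat by_name, by_dot;

    if (chdir(dir) != 0)
        return -1;
    if (stat(dir, &by_name) != 0 || stat(".", &by_dot) != 0)
        return -1;

    /* Same device and inode number means the same directory. */
    return by_name.st_dev == by_dot.st_dev &&
           by_name.st_ino == by_dot.st_ino;
}
```

For example, dot_follows_chdir("/") is expected to return 1: once pwd points at the root directory, "." and "/" resolve to the same inode.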
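3. Reading a file (the read() system call)
A process reads a file through a file descriptor. The descriptor's struct file references the inode, and the file system uses the inode to locate the data. The operation involves the process descriptor, the descriptor table, and the file system.
Suppose process A runs: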
char buf[100];
int fd = open("/home/user/file.txt", O_RDONLY);
ssize_t bytes_read = read(fd, buf, sizeof(buf));
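- Finding the struct file: the kernel looks up fd in the process's descriptor table.
- Reaching the inode: the struct file points to the inode, from which the file system determines where the data is stored.
- Reading the data: the kernel calls the file's read method (f_op->read, or f_op->read_iter where only that is provided). For regular files the data is typically served from the page cache and copied to the user-space buffer.
- Updating the offset: after each read, the offset in the struct file (f_pos) advances to the next read position.
The structures involved:
- file_operations: the table of file methods; the read() system call is implemented through its read or read_iter entry.
- inode: the file's metadata, such as size, permissions, and the information needed to find its data blocks.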
struct file_operations {
ssize_t (*read) (struct file *file, char __user *buf, size_t count, loff_t *pos);
ssize_t (*write) (struct file *file, const char __user *buf, size_t count, loff_t *pos);
// other operations omitted
};
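The offset update can be observed from user space by comparing read(), which uses and advances the offset stored in the open file, with pread(), which takes an explicit position and leaves the stored offset alone. The helper below is a sketch with a made-up name (offset_after_transfer) and assumes /tmp is writable.

```c
#include <stdlib.h>
#include <unistd.h>

/*
 * Writes 10 bytes to a temporary file, rewinds, then transfers 4 bytes with
 * read() (use_pread == 0) or pread() (use_pread != 0) and returns the
 * descriptor's offset afterwards. Returns -1 on error.
 */
long offset_after_transfer(int use_pread)
{
    char path[] = "/tmp/vfs-read-XXXXXX";   /* assumed writable location */
    char buf[4];
    long result = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);   /* the open descriptor keeps the file alive */

    if (write(fd, "0123456789", 10) == 10 && lseek(fd, 0, SEEK_SET) == 0) {
        ssize_t n = use_pread ? pread(fd, buf, sizeof(buf), 0)
                              : read(fd, buf, sizeof(buf));
        if (n == (ssize_t)sizeof(buf))
            result = (long)lseek(fd, 0, SEEK_CUR);
    }

    close(fd);
    return result;
}
```

After a 4-byte read() the offset should be 4; after a 4-byte pread() at position 0 it should still be 0.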
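4. Closing a file (the close() system call)
When a process no longer needs an open file it calls close(). This releases the descriptor and updates the descriptor table; once the last reference to the struct file is gone, the file system is notified as well.
Suppose process A runs: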
close(fd);
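- Finding the struct file: the kernel looks up fd in the process's descriptor table.
- Releasing the descriptor: the kernel clears the slot, which makes the number available again, and drops the table's reference to the struct file.
- Notifying the file system: close() calls the file's flush method if one is defined, and when the last reference to the struct file is dropped the kernel calls its release method. close() does not call fsync(): dirty data is written back later by the kernel unless the program syncs explicitly. A mapping created with mmap() holds its own reference to the struct file and stays valid after close().
The structures involved:
- file: holds the inode pointer, the offset, and other per-open state.
- files_struct: the per-process descriptor table that manages all open files.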
struct file_operations {
int (*release) (struct inode *inode, struct file *file);
};
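That release runs only when the last reference goes away can be seen indirectly from user space: closing one descriptor frees its slot, but a dup()ed descriptor that refers to the same open file keeps working. The sketch below uses a pipe so that it does not depend on any file on disk; dup_survives_close is a hypothetical helper name.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Returns 1 if, after closing the original write end of a pipe, a dup()ed
 * descriptor still delivers data and the closed number fails with EBADF.
 * Returns 0 if not, -1 on setup error.
 */
int dup_survives_close(void)
{
    int p[2];
    char c = 0;

    if (pipe(p) != 0)
        return -1;

    int copy = dup(p[1]);
    if (copy < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    /* Frees the slot; the open file is still referenced through copy. */
    close(p[1]);

    int still_open = write(copy, "x", 1) == 1 &&
                     read(p[0], &c, 1) == 1 && c == 'x';

    errno = 0;
    int closed_fails = write(p[1], "y", 1) == -1 && errno == EBADF;

    close(copy);
    close(p[0]);
    return still_open && closed_fails;
}
```

The function is expected to return 1 in a single-threaded program. In a multi-threaded one, another thread could reuse the freed descriptor number between the close() and the second write(), so the EBADF check would no longer be reliable.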