Linux VFS Filesystem Analysis 3 (Based on Linux 6.6): VFS and the Process Descriptor

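I. Overview

In Linux, the process descriptor and the filesystem-related state of a process are closely linked, most visibly when the process interacts with the filesystem (opening files, reading and writing them, changing the working directory, and so on). This section summarizes how the two relate.

1. The process descriptor (task_struct) and the filesystem

Every process (more precisely, every task) in the Linux kernel has a task_struct that serves as its descriptor. It holds what the kernel needs to run the process: state, scheduling information, memory mappings, and more. Filesystem-related state hangs off the descriptor in the following ways.

Current working directory (pwd) and root directory (root)

Each process has its own current working directory and root directory. Both are stored in the fs_struct that task_struct points to through its fs field, and each is recorded as a struct path, that is, a (vfsmount, dentry) pair:

  • pwd (current working directory): reachable as current->fs->pwd, this is the starting point for resolving relative paths. When the process calls chdir or fchdir, the kernel updates the pwd field of its fs_struct.

  • root (root directory): reachable as current->fs->root, this is the starting point for resolving absolute paths. After chroot, or inside a separate mount namespace, a process's root can differ from the system's real root.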

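/* Simplified; the full Linux 6.6 definition appears in section II. */
struct fs_struct {
    struct path root; // root directory: (vfsmount, dentry) pair
    struct path pwd;  // current working directory
};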

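The dentry behind each of these two entries (pwd and root) is what ties them, through the kernel's VFS (Virtual File System) layer, to a concrete filesystem implementation.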

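File descriptor table (files_struct)

Each process also maintains a file descriptor table (files_struct), the main structure it uses to track open files. Each used slot in the table points to a struct file, which in turn refers to an actual file in a filesystem.

  • files_struct: a per-process structure (shared between threads created with CLONE_FILES) that holds the process's open files and their descriptors.
  • file: every open descriptor maps to a file structure. It carries the filesystem-facing state of the open file: its inode, the current offset, and the file operations table (read, write, open, and so on).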
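/* Simplified; in Linux 6.6 fd and max_fds live in struct fdtable (see section II). */
struct fdtable {
    unsigned int max_fds;  // capacity of the fd array
    struct file **fd;      // array indexed by file descriptor
};
struct files_struct {
    struct fdtable *fdt;   // descriptor table currently in use
    unsigned int next_fd;  // hint: where to start searching for a free descriptor
};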
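One consequence of this layout is that two descriptors can point to the same file structure and therefore share one offset. The sketch below tries to show this with dup(); the scratch file template under /tmp and the function name are assumptions made for the example.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns the offset seen through a dup()'ed descriptor after reading
 * 3 bytes through the original one, or -1 on error. */
long offset_seen_through_dup(void)
{
    char path[] = "/tmp/vfs_dup_XXXXXX";   /* hypothetical scratch file */
    char buf[3];
    long off = -1;
    int fd = mkstemp(path);
    int fd2;

    if (fd < 0)
        return -1;
    unlink(path);                          /* file lives until the last close */

    fd2 = dup(fd);                         /* second slot, same struct file */
    if (fd2 >= 0 &&
        write(fd, "abcdef", 6) == 6 &&
        lseek(fd, 0, SEEK_SET) == 0 &&
        read(fd, buf, sizeof(buf)) == 3)
        off = (long)lseek(fd2, 0, SEEK_CUR);

    if (fd2 >= 0)
        close(fd2);
    close(fd);
    return off;
}
```

Opening the same path twice with open() would instead create two file structures with independent offsets.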

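2. How filesystem operations interact with the process

Opening a file (open)

When a process calls open, the kernel looks the file up in the filesystem, creates a file structure for it, and installs a pointer to that structure in a free slot of the process's descriptor table. The slot index is the file descriptor returned to user space. The file structure holds a reference to the file (its path and inode), the file offset, and the operations table for the file (read, write, and so on).

  • During open, the kernel walks the path one component at a time, looking each component up in the dentry cache first (d_lookup and related helpers) and asking the filesystem only on a cache miss. The file's metadata is then read from the inode attached to the final dentry.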

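Reading and writing a file (read, write)

When a process reads or writes a file, the kernel uses the file descriptor to find the file structure, reaches the inode through it, and performs the requested operation. The VFS layer drives this: the file's f_op table, which is taken from the inode when the file is opened and therefore depends on the kind of file (regular file, directory, character device, and so on), supplies the read and write functions of the filesystem or driver.

  • For example, the read system call checks the file's file_operations for a read method (or for read_iter, which many filesystems implement instead) and calls it to fetch the data.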

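Path resolution and locating files

Whenever a process opens a file, the kernel has to resolve the path to find it. Path resolution works on dentry structures. The dentry cache is central to path lookup in Linux: it caches the mapping from (parent directory, name) to inode so that repeated lookups do not have to go back to the filesystem.

  • A process obtains its current working directory with getcwd().
  • When a process calls open() or another path-based operation, the kernel resolves the path, finds the file's dentry in the dentry cache (d_lookup() and related helpers), and reads the file's metadata through the inode.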
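Path resolution always starts from some directory: root for absolute paths, pwd for relative ones, or an explicitly supplied directory with openat(). As a rough illustration, assuming /dev/null exists, the sketch below checks that an absolute lookup and a lookup relative to an open /dev directory end at the same inode. The function name is invented for the example.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if "/dev/null" and "null" relative to an open "/dev"
 * directory resolve to the same inode, 0 if not, -1 on error. */
int absolute_and_relative_agree(void)
{
    struct stat a, b;
    int result = -1;
    int dirfd = open("/dev", O_RDONLY);          /* starting directory */
    int fd_abs = open("/dev/null", O_RDONLY);    /* walk starts at root */
    int fd_rel = dirfd >= 0 ? openat(dirfd, "null", O_RDONLY) : -1;

    if (fd_abs >= 0 && fd_rel >= 0 &&
        fstat(fd_abs, &a) == 0 && fstat(fd_rel, &b) == 0)
        result = (a.st_dev == b.st_dev && a.st_ino == b.st_ino);

    if (fd_rel >= 0)
        close(fd_rel);
    if (fd_abs >= 0)
        close(fd_abs);
    if (dirfd >= 0)
        close(dirfd);
    return result;
}
```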

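3. The process's view of the virtual filesystem (VFS)

VFS is an abstraction layer that gives different filesystems a uniform interface, so that user programs can work with any of them transparently. It connects the rest of the kernel to concrete filesystem implementations through the superblock, inode, and dentry data structures.

  • Process and VFS: each process reaches its current working directory and root directory through the fs_struct referenced from task_struct. Both refer to dentry structures, and VFS path resolution starts from them: absolute paths from root, relative paths from pwd.

  • Process and filesystem: a process accesses files through the descriptors in its descriptor table. Each descriptor refers to a file structure, which is in turn tied to the file's inode and dentry.

4. Mounts and process isolation

Different processes can run with different views of the filesystem. After chroot, or in a separate mount namespace, a process's view may differ from the system-wide one. Each process has its own root directory (root) and current working directory (pwd), both ultimately represented by dentry structures. By changing the root or mounting different filesystems, a process can be confined to an isolated filesystem environment.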
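On systems with procfs mounted at /proc (an assumption of this sketch), the kernel exposes a task's fs->root and fs->pwd as the symbolic links /proc/<pid>/root and /proc/<pid>/cwd. The helper below, whose name is made up for illustration, reads either link for the calling process.

```c
#include <stdio.h>
#include <unistd.h>

/* Copies the target of /proc/self/<which> ("root" or "cwd") into out.
 * Returns the length of the target, or -1 on error. */
long fs_view(const char *which, char *out, size_t size)
{
    char link[64];
    ssize_t n;

    if (size == 0)
        return -1;
    snprintf(link, sizeof(link), "/proc/self/%s", which);
    n = readlink(link, out, size - 1);     /* readlink() does not terminate */
    if (n < 0)
        return -1;
    out[n] = '\0';
    return (long)n;
}
```

For a process that has not changed its root, the "root" link normally reads "/".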

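II. Process-related structures

In the Linux kernel the central process structure is the process descriptor (struct task_struct). This section covers only the members related to the filesystem and does not expand on the others. As the definition below shows, two members matter here: fs, a pointer to struct fs_struct, which describes the process's root and current directories, and files, a pointer to struct files_struct, the file descriptor table.

include/linux/sched.h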

struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
	/*
	 * For reasons of header soup (see current_thread_info()), this
	 * must be the first element of task_struct.
	 */
	struct thread_info		thread_info;
#endif
	unsigned int			__state;

#ifdef CONFIG_PREEMPT_RT
	/* saved state for "spinlock sleepers" */
	unsigned int			saved_state;
#endif

	/*
	 * This begins the randomizable portion of task_struct. Only
	 * scheduling-critical items should be added above here.
	 */
	randomized_struct_fields_start

	void				*stack;
	refcount_t			usage;
	/* Per task flags (PF_*), defined further below: */
	unsigned int			flags;
	unsigned int			ptrace;

#ifdef CONFIG_SMP
	int				on_cpu;
	struct __call_single_node	wake_entry;
	unsigned int			wakee_flips;
	unsigned long			wakee_flip_decay_ts;
	struct task_struct		*last_wakee;

	/*
	 * recent_used_cpu is initially set as the last CPU used by a task
	 * that wakes affine another task. Waker/wakee relationships can
	 * push tasks around a CPU where each wakeup moves to the next one.
	 * Tracking a recently used CPU allows a quick search for a recently
	 * used CPU that may be idle.
	 */
	int				recent_used_cpu;
	int				wake_cpu;
#endif
	int				on_rq;

	int				prio;
	int				static_prio;
	int				normal_prio;
	unsigned int			rt_priority;

	struct sched_entity		se;
	struct sched_rt_entity		rt;
	struct sched_dl_entity		dl;
	const struct sched_class	*sched_class;

#ifdef CONFIG_SCHED_CORE
	struct rb_node			core_node;
	unsigned long			core_cookie;
	unsigned int			core_occupation;
#endif

#ifdef CONFIG_CGROUP_SCHED
	struct task_group		*sched_task_group;
#endif

#ifdef CONFIG_UCLAMP_TASK
	/*
	 * Clamp values requested for a scheduling entity.
	 * Must be updated with task_rq_lock() held.
	 */
	struct uclamp_se		uclamp_req[UCLAMP_CNT];
	/*
	 * Effective clamp values used for a scheduling entity.
	 * Must be updated with task_rq_lock() held.
	 */
	struct uclamp_se		uclamp[UCLAMP_CNT];
#endif

	struct sched_statistics         stats;

#ifdef CONFIG_PREEMPT_NOTIFIERS
	/* List of struct preempt_notifier: */
	struct hlist_head		preempt_notifiers;
#endif

#ifdef CONFIG_BLK_DEV_IO_TRACE
	unsigned int			btrace_seq;
#endif

	unsigned int			policy;
	int				nr_cpus_allowed;
	const cpumask_t			*cpus_ptr;
	cpumask_t			*user_cpus_ptr;
	cpumask_t			cpus_mask;
	void				*migration_pending;
#ifdef CONFIG_SMP
	unsigned short			migration_disabled;
#endif
	unsigned short			migration_flags;

#ifdef CONFIG_PREEMPT_RCU
	int				rcu_read_lock_nesting;
	union rcu_special		rcu_read_unlock_special;
	struct list_head		rcu_node_entry;
	struct rcu_node			*rcu_blocked_node;
#endif /* #ifdef CONFIG_PREEMPT_RCU */

#ifdef CONFIG_TASKS_RCU
	unsigned long			rcu_tasks_nvcsw;
	u8				rcu_tasks_holdout;
	u8				rcu_tasks_idx;
	int				rcu_tasks_idle_cpu;
	struct list_head		rcu_tasks_holdout_list;
#endif /* #ifdef CONFIG_TASKS_RCU */

#ifdef CONFIG_TASKS_TRACE_RCU
	int				trc_reader_nesting;
	int				trc_ipi_to_cpu;
	union rcu_special		trc_reader_special;
	struct list_head		trc_holdout_list;
	struct list_head		trc_blkd_node;
	int				trc_blkd_cpu;
#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */

	struct sched_info		sched_info;

	struct list_head		tasks;
#ifdef CONFIG_SMP
	struct plist_node		pushable_tasks;
	struct rb_node			pushable_dl_tasks;
#endif

	struct mm_struct		*mm;
	struct mm_struct		*active_mm;

	int				exit_state;
	int				exit_code;
	int				exit_signal;
	/* The signal sent when the parent dies: */
	int				pdeath_signal;
	/* JOBCTL_*, siglock protected: */
	unsigned long			jobctl;

	/* Used for emulating ABI behavior of previous Linux versions: */
	unsigned int			personality;

	/* Scheduler bits, serialized by scheduler locks: */
	unsigned			sched_reset_on_fork:1;
	unsigned			sched_contributes_to_load:1;
	unsigned			sched_migrated:1;

	/* Force alignment to the next boundary: */
	unsigned			:0;

	/* Unserialized, strictly 'current' */

	/*
	 * This field must not be in the scheduler word above due to wakelist
	 * queueing no longer being serialized by p->on_cpu. However:
	 *
	 * p->XXX = X;			ttwu()
	 * schedule()			  if (p->on_rq && ..) // false
	 *   smp_mb__after_spinlock();	  if (smp_load_acquire(&p->on_cpu) && //true
	 *   deactivate_task()		      ttwu_queue_wakelist())
	 *     p->on_rq = 0;			p->sched_remote_wakeup = Y;
	 *
	 * guarantees all stores of 'current' are visible before
	 * ->sched_remote_wakeup gets used, so it can be in this word.
	 */
	unsigned			sched_remote_wakeup:1;
#ifdef CONFIG_RT_MUTEXES
	unsigned			sched_rt_mutex:1;
#endif

	/* Bit to tell LSMs we're in execve(): */
	unsigned			in_execve:1;
	unsigned			in_iowait:1;
#ifndef TIF_RESTORE_SIGMASK
	unsigned			restore_sigmask:1;
#endif
#ifdef CONFIG_MEMCG
	unsigned			in_user_fault:1;
#endif
#ifdef CONFIG_LRU_GEN
	/* whether the LRU algorithm may apply to this access */
	unsigned			in_lru_fault:1;
#endif
#ifdef CONFIG_COMPAT_BRK
	unsigned			brk_randomized:1;
#endif
#ifdef CONFIG_CGROUPS
	/* disallow userland-initiated cgroup migration */
	unsigned			no_cgroup_migration:1;
	/* task is frozen/stopped (used by the cgroup freezer) */
	unsigned			frozen:1;
#endif
#ifdef CONFIG_BLK_CGROUP
	unsigned			use_memdelay:1;
#endif
#ifdef CONFIG_PSI
	/* Stalled due to lack of memory */
	unsigned			in_memstall:1;
#endif
#ifdef CONFIG_PAGE_OWNER
	/* Used by page_owner=on to detect recursion in page tracking. */
	unsigned			in_page_owner:1;
#endif
#ifdef CONFIG_EVENTFD
	/* Recursion prevention for eventfd_signal() */
	unsigned			in_eventfd:1;
#endif
#ifdef CONFIG_IOMMU_SVA
	unsigned			pasid_activated:1;
#endif
#ifdef	CONFIG_CPU_SUP_INTEL
	unsigned			reported_split_lock:1;
#endif
#ifdef CONFIG_TASK_DELAY_ACCT
	/* delay due to memory thrashing */
	unsigned                        in_thrashing:1;
#endif

	unsigned long			atomic_flags; /* Flags requiring atomic access. */

	struct restart_block		restart_block;

	pid_t				pid;
	pid_t				tgid;

#ifdef CONFIG_STACKPROTECTOR
	/* Canary value for the -fstack-protector GCC feature: */
	unsigned long			stack_canary;
#endif
	/*
	 * Pointers to the (original) parent process, youngest child, younger sibling,
	 * older sibling, respectively.  (p->father can be replaced with
	 * p->real_parent->pid)
	 */

	/* Real parent process: */
	struct task_struct __rcu	*real_parent;

	/* Recipient of SIGCHLD, wait4() reports: */
	struct task_struct __rcu	*parent;

	/*
	 * Children/sibling form the list of natural children:
	 */
	struct list_head		children;
	struct list_head		sibling;
	struct task_struct		*group_leader;

	/*
	 * 'ptraced' is the list of tasks this task is using ptrace() on.
	 *
	 * This includes both natural children and PTRACE_ATTACH targets.
	 * 'ptrace_entry' is this task's link on the p->parent->ptraced list.
	 */
	struct list_head		ptraced;
	struct list_head		ptrace_entry;

	/* PID/PID hash table linkage. */
	struct pid			*thread_pid;
	struct hlist_node		pid_links[PIDTYPE_MAX];
	struct list_head		thread_group;
	struct list_head		thread_node;

	struct completion		*vfork_done;

	/* CLONE_CHILD_SETTID: */
	int __user			*set_child_tid;

	/* CLONE_CHILD_CLEARTID: */
	int __user			*clear_child_tid;

	/* PF_KTHREAD | PF_IO_WORKER */
	void				*worker_private;

	u64				utime;
	u64				stime;
#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
	u64				utimescaled;
	u64				stimescaled;
#endif
	u64				gtime;
	struct prev_cputime		prev_cputime;
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
	struct vtime			vtime;
#endif

#ifdef CONFIG_NO_HZ_FULL
	atomic_t			tick_dep_mask;
#endif
	/* Context switch counts: */
	unsigned long			nvcsw;
	unsigned long			nivcsw;

	/* Monotonic time in nsecs: */
	u64				start_time;

	/* Boot based time in nsecs: */
	u64				start_boottime;

	/* MM fault and swap info: this can arguably be seen as either mm-specific or thread-specific: */
	unsigned long			min_flt;
	unsigned long			maj_flt;

	/* Empty if CONFIG_POSIX_CPUTIMERS=n */
	struct posix_cputimers		posix_cputimers;

#ifdef CONFIG_POSIX_CPU_TIMERS_TASK_WORK
	struct posix_cputimers_work	posix_cputimers_work;
#endif

	/* Process credentials: */

	/* Tracer's credentials at attach: */
	const struct cred __rcu		*ptracer_cred;

	/* Objective and real subjective task credentials (COW): */
	const struct cred __rcu		*real_cred;

	/* Effective (overridable) subjective task credentials (COW): */
	const struct cred __rcu		*cred;

#ifdef CONFIG_KEYS
	/* Cached requested key. */
	struct key			*cached_requested_key;
#endif

	/*
	 * executable name, excluding path.
	 *
	 * - normally initialized setup_new_exec()
	 * - access it with [gs]et_task_comm()
	 * - lock it with task_lock()
	 */
	char				comm[TASK_COMM_LEN];

	struct nameidata		*nameidata;

#ifdef CONFIG_SYSVIPC
	struct sysv_sem			sysvsem;
	struct sysv_shm			sysvshm;
#endif
#ifdef CONFIG_DETECT_HUNG_TASK
	unsigned long			last_switch_count;
	unsigned long			last_switch_time;
#endif
	/* Filesystem information: */
	struct fs_struct		*fs;

	/* Open file information: */
	struct files_struct		*files;

#ifdef CONFIG_IO_URING
	struct io_uring_task		*io_uring;
#endif

	/* Namespaces: */
	struct nsproxy			*nsproxy;

	/* Signal handlers: */
	struct signal_struct		*signal;
	struct sighand_struct __rcu		*sighand;
	sigset_t			blocked;
	sigset_t			real_blocked;
	/* Restored if set_restore_sigmask() was used: */
	sigset_t			saved_sigmask;
	struct sigpending		pending;
	unsigned long			sas_ss_sp;
	size_t				sas_ss_size;
	unsigned int			sas_ss_flags;

	struct callback_head		*task_works;

#ifdef CONFIG_AUDIT
#ifdef CONFIG_AUDITSYSCALL
	struct audit_context		*audit_context;
#endif
	kuid_t				loginuid;
	unsigned int			sessionid;
#endif
	struct seccomp			seccomp;
	struct syscall_user_dispatch	syscall_dispatch;

	/* Thread group tracking: */
	u64				parent_exec_id;
	u64				self_exec_id;

	/* Protection against (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed, mempolicy: */
	spinlock_t			alloc_lock;

	/* Protection of the PI data structures: */
	raw_spinlock_t			pi_lock;

	struct wake_q_node		wake_q;

#ifdef CONFIG_RT_MUTEXES
	/* PI waiters blocked on a rt_mutex held by this task: */
	struct rb_root_cached		pi_waiters;
	/* Updated under owner's pi_lock and rq lock */
	struct task_struct		*pi_top_task;
	/* Deadlock detection and priority inheritance handling: */
	struct rt_mutex_waiter		*pi_blocked_on;
#endif

#ifdef CONFIG_DEBUG_MUTEXES
	/* Mutex deadlock detection: */
	struct mutex_waiter		*blocked_on;
#endif

#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
	int				non_block_count;
#endif

#ifdef CONFIG_TRACE_IRQFLAGS
	struct irqtrace_events		irqtrace;
	unsigned int			hardirq_threaded;
	u64				hardirq_chain_key;
	int				softirqs_enabled;
	int				softirq_context;
	int				irq_config;
#endif
#ifdef CONFIG_PREEMPT_RT
	int				softirq_disable_cnt;
#endif

#ifdef CONFIG_LOCKDEP
# define MAX_LOCK_DEPTH			48UL
	u64				curr_chain_key;
	int				lockdep_depth;
	unsigned int			lockdep_recursion;
	struct held_lock		held_locks[MAX_LOCK_DEPTH];
#endif

#if defined(CONFIG_UBSAN) && !defined(CONFIG_UBSAN_TRAP)
	unsigned int			in_ubsan;
#endif

	/* Journalling filesystem info: */
	void				*journal_info;

	/* Stacked block device info: */
	struct bio_list			*bio_list;

	/* Stack plugging: */
	struct blk_plug			*plug;

	/* VM state: */
	struct reclaim_state		*reclaim_state;

	struct io_context		*io_context;

#ifdef CONFIG_COMPACTION
	struct capture_control		*capture_control;
#endif
	/* Ptrace state: */
	unsigned long			ptrace_message;
	kernel_siginfo_t		*last_siginfo;

	struct task_io_accounting	ioac;
#ifdef CONFIG_PSI
	/* Pressure stall state */
	unsigned int			psi_flags;
#endif
#ifdef CONFIG_TASK_XACCT
	/* Accumulated RSS usage: */
	u64				acct_rss_mem1;
	/* Accumulated virtual memory usage: */
	u64				acct_vm_mem1;
	/* stime + utime since last update: */
	u64				acct_timexpd;
#endif
#ifdef CONFIG_CPUSETS
	/* Protected by ->alloc_lock: */
	nodemask_t			mems_allowed;
	/* Sequence number to catch updates: */
	seqcount_spinlock_t		mems_allowed_seq;
	int				cpuset_mem_spread_rotor;
	int				cpuset_slab_spread_rotor;
#endif
#ifdef CONFIG_CGROUPS
	/* Control Group info protected by css_set_lock: */
	struct css_set __rcu		*cgroups;
	/* cg_list protected by css_set_lock and tsk->alloc_lock: */
	struct list_head		cg_list;
#endif
#ifdef CONFIG_X86_CPU_RESCTRL
	u32				closid;
	u32				rmid;
#endif
#ifdef CONFIG_FUTEX
	struct robust_list_head __user	*robust_list;
#ifdef CONFIG_COMPAT
	struct compat_robust_list_head __user *compat_robust_list;
#endif
	struct list_head		pi_state_list;
	struct futex_pi_state		*pi_state_cache;
	struct mutex			futex_exit_mutex;
	unsigned int			futex_state;
#endif
#ifdef CONFIG_PERF_EVENTS
	struct perf_event_context	*perf_event_ctxp;
	struct mutex			perf_event_mutex;
	struct list_head		perf_event_list;
#endif
#ifdef CONFIG_DEBUG_PREEMPT
	unsigned long			preempt_disable_ip;
#endif
#ifdef CONFIG_NUMA
	/* Protected by alloc_lock: */
	struct mempolicy		*mempolicy;
	short				il_prev;
	short				pref_node_fork;
#endif
#ifdef CONFIG_NUMA_BALANCING
	int				numa_scan_seq;
	unsigned int			numa_scan_period;
	unsigned int			numa_scan_period_max;
	int				numa_preferred_nid;
	unsigned long			numa_migrate_retry;
	/* Migration stamp: */
	u64				node_stamp;
	u64				last_task_numa_placement;
	u64				last_sum_exec_runtime;
	struct callback_head		numa_work;

	/*
	 * This pointer is only modified for current in syscall and
	 * pagefault context (and for tasks being destroyed), so it can be read
	 * from any of the following contexts:
	 *  - RCU read-side critical section
	 *  - current->numa_group from everywhere
	 *  - task's runqueue locked, task not running
	 */
	struct numa_group __rcu		*numa_group;

	/*
	 * numa_faults is an array split into four regions:
	 * faults_memory, faults_cpu, faults_memory_buffer, faults_cpu_buffer
	 * in this precise order.
	 *
	 * faults_memory: Exponential decaying average of faults on a per-node
	 * basis. Scheduling placement decisions are made based on these
	 * counts. The values remain static for the duration of a PTE scan.
	 * faults_cpu: Track the nodes the process was running on when a NUMA
	 * hinting fault was incurred.
	 * faults_memory_buffer and faults_cpu_buffer: Record faults per node
	 * during the current scan window. When the scan completes, the counts
	 * in faults_memory and faults_cpu decay and these values are copied.
	 */
	unsigned long			*numa_faults;
	unsigned long			total_numa_faults;

	/*
	 * numa_faults_locality tracks if faults recorded during the last
	 * scan window were remote/local or failed to migrate. The task scan
	 * period is adapted based on the locality of the faults with different
	 * weights depending on whether they were shared or private faults
	 */
	unsigned long			numa_faults_locality[3];

	unsigned long			numa_pages_migrated;
#endif /* CONFIG_NUMA_BALANCING */

#ifdef CONFIG_RSEQ
	struct rseq __user *rseq;
	u32 rseq_len;
	u32 rseq_sig;
	/*
	 * RmW on rseq_event_mask must be performed atomically
	 * with respect to preemption.
	 */
	unsigned long rseq_event_mask;
#endif

#ifdef CONFIG_SCHED_MM_CID
	int				mm_cid;		/* Current cid in mm */
	int				last_mm_cid;	/* Most recent cid in mm */
	int				migrate_from_cpu;
	int				mm_cid_active;	/* Whether cid bitmap is active */
	struct callback_head		cid_work;
#endif

	struct tlbflush_unmap_batch	tlb_ubc;

	/* Cache last used pipe for splice(): */
	struct pipe_inode_info		*splice_pipe;

	struct page_frag		task_frag;

#ifdef CONFIG_TASK_DELAY_ACCT
	struct task_delay_info		*delays;
#endif

#ifdef CONFIG_FAULT_INJECTION
	int				make_it_fail;
	unsigned int			fail_nth;
#endif
	/*
	 * When (nr_dirtied >= nr_dirtied_pause), it's time to call
	 * balance_dirty_pages() for a dirty throttling pause:
	 */
	int				nr_dirtied;
	int				nr_dirtied_pause;
	/* Start of a write-and-pause period: */
	unsigned long			dirty_paused_when;

#ifdef CONFIG_LATENCYTOP
	int				latency_record_count;
	struct latency_record		latency_record[LT_SAVECOUNT];
#endif
	/*
	 * Time slack values; these are used to round up poll() and
	 * select() etc timeout values. These are in nanoseconds.
	 */
	u64				timer_slack_ns;
	u64				default_timer_slack_ns;

#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
	unsigned int			kasan_depth;
#endif

#ifdef CONFIG_KCSAN
	struct kcsan_ctx		kcsan_ctx;
#ifdef CONFIG_TRACE_IRQFLAGS
	struct irqtrace_events		kcsan_save_irqtrace;
#endif
#ifdef CONFIG_KCSAN_WEAK_MEMORY
	int				kcsan_stack_depth;
#endif
#endif

#ifdef CONFIG_KMSAN
	struct kmsan_ctx		kmsan_ctx;
#endif

#if IS_ENABLED(CONFIG_KUNIT)
	struct kunit			*kunit_test;
#endif

#ifdef CONFIG_FUNCTION_GRAPH_TRACER
	/* Index of current stored address in ret_stack: */
	int				curr_ret_stack;
	int				curr_ret_depth;

	/* Stack of return addresses for return function tracing: */
	struct ftrace_ret_stack		*ret_stack;

	/* Timestamp for last schedule: */
	unsigned long long		ftrace_timestamp;

	/*
	 * Number of functions that haven't been traced
	 * because of depth overrun:
	 */
	atomic_t			trace_overrun;

	/* Pause tracing: */
	atomic_t			tracing_graph_pause;
#endif

#ifdef CONFIG_TRACING
	/* Bitmask and counter of trace recursion: */
	unsigned long			trace_recursion;
#endif /* CONFIG_TRACING */

#ifdef CONFIG_KCOV
	/* See kernel/kcov.c for more details. */

	/* Coverage collection mode enabled for this task (0 if disabled): */
	unsigned int			kcov_mode;

	/* Size of the kcov_area: */
	unsigned int			kcov_size;

	/* Buffer for coverage collection: */
	void				*kcov_area;

	/* KCOV descriptor wired with this task or NULL: */
	struct kcov			*kcov;

	/* KCOV common handle for remote coverage collection: */
	u64				kcov_handle;

	/* KCOV sequence number: */
	int				kcov_sequence;

	/* Collect coverage from softirq context: */
	unsigned int			kcov_softirq;
#endif

#ifdef CONFIG_MEMCG
	struct mem_cgroup		*memcg_in_oom;
	gfp_t				memcg_oom_gfp_mask;
	int				memcg_oom_order;

	/* Number of pages to reclaim on returning to userland: */
	unsigned int			memcg_nr_pages_over_high;

	/* Used by memcontrol for targeted memcg charge: */
	struct mem_cgroup		*active_memcg;
#endif

#ifdef CONFIG_BLK_CGROUP
	struct gendisk			*throttle_disk;
#endif

#ifdef CONFIG_UPROBES
	struct uprobe_task		*utask;
#endif
#if defined(CONFIG_BCACHE) || defined(CONFIG_BCACHE_MODULE)
	unsigned int			sequential_io;
	unsigned int			sequential_io_avg;
#endif
	struct kmap_ctrl		kmap_ctrl;
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
	unsigned long			task_state_change;
# ifdef CONFIG_PREEMPT_RT
	unsigned long			saved_state_change;
# endif
#endif
	struct rcu_head			rcu;
	refcount_t			rcu_users;
	int				pagefault_disabled;
#ifdef CONFIG_MMU
	struct task_struct		*oom_reaper_list;
	struct timer_list		oom_reaper_timer;
#endif
#ifdef CONFIG_VMAP_STACK
	struct vm_struct		*stack_vm_area;
#endif
#ifdef CONFIG_THREAD_INFO_IN_TASK
	/* A live task holds one reference: */
	refcount_t			stack_refcount;
#endif
#ifdef CONFIG_LIVEPATCH
	int patch_state;
#endif
#ifdef CONFIG_SECURITY
	/* Used by LSM modules for access restriction: */
	void				*security;
#endif
#ifdef CONFIG_BPF_SYSCALL
	/* Used by BPF task local storage */
	struct bpf_local_storage __rcu	*bpf_storage;
	/* Used for BPF run context */
	struct bpf_run_ctx		*bpf_ctx;
#endif

#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
	unsigned long			lowest_stack;
	unsigned long			prev_lowest_stack;
#endif

#ifdef CONFIG_X86_MCE
	void __user			*mce_vaddr;
	__u64				mce_kflags;
	u64				mce_addr;
	__u64				mce_ripv : 1,
					mce_whole_page : 1,
					__mce_reserved : 62;
	struct callback_head		mce_kill_me;
	int				mce_count;
#endif

#ifdef CONFIG_KRETPROBES
	struct llist_head               kretprobe_instances;
#endif
#ifdef CONFIG_RETHOOK
	struct llist_head               rethooks;
#endif

#ifdef CONFIG_ARCH_HAS_PARANOID_L1D_FLUSH
	/*
	 * If L1D flush is supported on mm context switch
	 * then we use this callback head to queue kill work
	 * to kill tasks that are not running on SMT disabled
	 * cores
	 */
	struct callback_head		l1d_flush_kill;
#endif

#ifdef CONFIG_RV
	/*
	 * Per-task RV monitor. Nowadays fixed in RV_PER_TASK_MONITORS.
	 * If we find justification for more monitors, we can think
	 * about adding more or developing a dynamic method. So far,
	 * none of these are justified.
	 */
	union rv_task_monitor		rv[RV_PER_TASK_MONITORS];
#endif

#ifdef CONFIG_USER_EVENTS
	struct user_event_mm		*user_event_mm;
#endif

	int wait_res_type;
	union {
		struct folio		*wait_folio;
		struct bio		*wait_bio;
	};
	unsigned long wait_moment;

	/*
	 * New fields for task_struct should be added above here, so that
	 * they are included in the randomized portion of task_struct.
	 */
	randomized_struct_fields_end

	/* CPU-specific state of this task: */
	struct thread_struct		thread;

	/*
	 * WARNING: on x86, 'thread_struct' contains a variable-sized
	 * structure.  It *MUST* be at the end of 'task_struct'.
	 *
	 * Do not put anything below here!
	 */
};

struct fs_struct

This structure mainly records the root directory and the current directory that the process runs with, each stored as a struct path.

include/linux/fs_struct.h

struct fs_struct {
	int users;
	spinlock_t lock;
	seqcount_spinlock_t seq;
	int umask;
	int in_exec;
	struct path root, pwd;
} __randomize_layout;

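A struct path pairs a dentry with a struct vfsmount. The vfsmount identifies the mounted filesystem the dentry belongs to: its superblock and the root dentry of the mounted tree.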

include/linux/path.h

struct path {
	struct vfsmount *mnt;
	struct dentry *dentry;
} __randomize_layout;

include/linux/mount.h

struct vfsmount {
	struct dentry *mnt_root;	/* root of the mounted tree */
	struct super_block *mnt_sb;	/* pointer to superblock */
	int mnt_flags;
	struct mnt_idmap *mnt_idmap;
} __randomize_layout;

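The root and pwd members can be changed through the chroot and chdir system calls respectively. After the permission checks, these calls end up in set_fs_root and set_fs_pwd, which are defined as follows.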

fs/fs_struct.c 

/*
 * Replace the fs->{rootmnt,root} with {mnt,dentry}. Put the old values.
 * It can block.
 */
void set_fs_root(struct fs_struct *fs, const struct path *path)
{
	struct path old_root;

	path_get(path);
	spin_lock(&fs->lock);
	write_seqcount_begin(&fs->seq);
	old_root = fs->root;
	fs->root = *path;
	write_seqcount_end(&fs->seq);
	spin_unlock(&fs->lock);
	if (old_root.dentry)
		path_put(&old_root);
}

fs/fs_struct.c

/*
 * Replace the fs->{pwdmnt,pwd} with {mnt,dentry}. Put the old values.
 * It can block.
 */
void set_fs_pwd(struct fs_struct *fs, const struct path *path)
{
	struct path old_pwd;

	path_get(path);
	spin_lock(&fs->lock);
	write_seqcount_begin(&fs->seq);
	old_pwd = fs->pwd;
	fs->pwd = *path;
	write_seqcount_end(&fs->seq);
	spin_unlock(&fs->lock);

	if (old_pwd.dentry)
		path_put(&old_pwd);
}

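struct files_struct

Its key members are fd_array, fdt, and fdtab. fdtab is the embedded initial descriptor table: it records the capacity of the table (max_fds), the array of struct file pointers, and the bitmaps of open and close-on-exec descriptors. fdt points to the table currently in use, which is initially fdtab.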

include/linux/fdtable.h

/*
 * Open file table structure
 */
struct files_struct {
  /*
   * read mostly part
   */
	atomic_t count;
	bool resize_in_progress;
	wait_queue_head_t resize_wait;

	struct fdtable __rcu *fdt;
	struct fdtable fdtab;
  /*
   * written part on a separate cache line in SMP
   */
	spinlock_t file_lock ____cacheline_aligned_in_smp;
	unsigned int next_fd;
	unsigned long close_on_exec_init[1];
	unsigned long open_fds_init[1];
	unsigned long full_fds_bits_init[1];
	struct file __rcu * fd_array[NR_OPEN_DEFAULT];
};

struct fdtable is defined as follows. Its core is the pointer array fd: each non-NULL entry points to the struct file of an open file. struct file itself is described next.

include/linux/fdtable.h 

struct fdtable {
	unsigned int max_fds;
	struct file __rcu **fd;      /* current fd array */
	unsigned long *close_on_exec;
	unsigned long *open_fds;
	unsigned long *full_fds_bits;
	struct rcu_head rcu;
};
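To make the roles of fd and open_fds concrete, here is a deliberately small user-space model. It is not kernel code: the toy_ names, the fixed 64-entry capacity, and the single-word bitmap are all assumptions made for the example. It illustrates lowest-free-slot allocation, lookup, and release.

```c
#include <stddef.h>

#define TOY_MAX_FDS 64   /* hypothetical fixed capacity */

struct toy_file { int id; };          /* stand-in for struct file */

struct toy_fdtable {
    unsigned long long open_fds;      /* bit n set: descriptor n in use */
    struct toy_file *fd[TOY_MAX_FDS]; /* descriptor -> open file */
};

/* Install f in the lowest free slot; returns the descriptor or -1. */
int toy_fd_install(struct toy_fdtable *t, struct toy_file *f)
{
    for (int n = 0; n < TOY_MAX_FDS; n++) {
        if (!(t->open_fds & (1ULL << n))) {
            t->open_fds |= 1ULL << n;
            t->fd[n] = f;
            return n;
        }
    }
    return -1;  /* table full; the real kernel would try to grow it */
}

/* Look up a descriptor; NULL if out of range or not open. */
struct toy_file *toy_fd_lookup(const struct toy_fdtable *t, int n)
{
    if (n < 0 || n >= TOY_MAX_FDS || !(t->open_fds & (1ULL << n)))
        return NULL;
    return t->fd[n];
}

/* Release a descriptor; returns 0 on success, -1 if it was not open. */
int toy_fd_close(struct toy_fdtable *t, int n)
{
    if (n < 0 || n >= TOY_MAX_FDS || !(t->open_fds & (1ULL << n)))
        return -1;
    t->open_fds &= ~(1ULL << n);
    t->fd[n] = NULL;
    return 0;
}
```

The real table additionally uses RCU for lockless lookup and the next_fd and full_fds_bits hints to avoid scanning from zero, none of which this model attempts.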

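struct file

This structure describes one file opened by a process: the file's dentry and inode (through f_path and f_inode) and its operations table f_op (open, read, write, release, and so on).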

include/linux/fs.h 

/*
 * f_{lock,count,pos_lock} members can be highly contended and share
 * the same cacheline. f_{lock,mode} are very frequently used together
 * and so share the same cacheline as well. The read-mostly
 * f_{path,inode,op} are kept on a separate cacheline.
 */
struct file {
	union {
		struct llist_node	f_llist;
		struct rcu_head 	f_rcuhead;
		unsigned int 		f_iocb_flags;
	};

	/*
	 * Protects f_ep, f_flags.
	 * Must not be taken from IRQ context.
	 */
	spinlock_t		f_lock;
	fmode_t			f_mode;
	atomic_long_t		f_count;
	struct mutex		f_pos_lock;
	loff_t			f_pos;
	unsigned int		f_flags;
	struct fown_struct	f_owner;
	const struct cred	*f_cred;
	struct file_ra_state	f_ra;
	struct path		f_path;
	struct inode		*f_inode;	/* cached value */
	const struct file_operations	*f_op;

	u64			f_version;
#ifdef CONFIG_SECURITY
	void			*f_security;
#endif
	/* needed for tty driver, and maybe others */
	void			*private_data;

#ifdef CONFIG_EPOLL
	/* Used by fs/eventpoll.c to link all the hooks to this file */
	struct hlist_head	*f_ep;
#endif /* #ifdef CONFIG_EPOLL */
	struct address_space	*f_mapping;
	errseq_t		f_wb_err;
	errseq_t		f_sb_err; /* for syncfs */
} __randomize_layout
  __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */

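Together with the relationships covered earlier in this series (filesystem type, inode, super block, dentry, root dentry, root inode), the structures above describe how a process's filesystem state and file descriptors tie into the filesystem.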

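2.1. Filesystem type and superblock

[Figure: relationship between the filesystem type and the superblock]

2.2. Superblock and inode

[Figure: relationship between the superblock and inodes]

2.3. dentry and inode

[Figure: relationship between dentry and inode]

2.4. struct fs_struct and struct dentry

The root and pwd members of struct fs_struct hold the root directory and the current directory. Each is linked to the dentry of the corresponding directory.

[Figure: relationship between struct fs_struct and struct dentry]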

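2.5. struct file, struct dentry, struct inode, and struct file_operations

As described earlier, a struct dentry and a struct inode together are enough to describe a file or directory. In the process descriptor, struct file is what links an open file to its struct dentry and struct inode. The simple filesystem created earlier in this series could not be read or written because it provided no file operations. Those are supplied by struct file_operations, which struct file references through f_op.

[Figure: relationship between struct file, struct dentry, struct inode, and struct file_operations]

2.6. The process descriptor (struct task_struct), struct fs_struct, and struct files_struct

The previous subsection covered how struct file, struct dentry, struct inode, and struct file_operations relate. This one relates the process descriptor to those structures.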

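III. Worked examples

A few concrete system calls show how the process descriptor interacts with the filesystem. The examples below are meant to make the relationship easier to follow.

1. Opening a file (the open() system call)

In Linux every process has a file descriptor table (files_struct) that tracks its open files. When a process calls open(), the kernel performs a series of steps involving the process descriptor, the descriptor table, and filesystem structures.

Suppose process A runs: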

int fd = open("/home/user/file.txt", O_RDONLY);

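The call proceeds as follows:

  1. Path resolution: the kernel resolves /home/user/file.txt component by component. In Linux 6.6 this is done by path_openat() and link_path_walk() in fs/namei.c, which walk each directory level until they reach the file.

  2. dentry and inode: during the walk, the kernel uses the dentry (directory entry) cache to speed up lookup. Every path component has a dentry object. The walk ends at the file's inode (index node), which holds the file's metadata (permissions, size, location of the data, and so on).

  3. Creating the file structure: once the inode is found, the kernel allocates a file structure that represents this open instance. It records the inode, the file offset, and the file operations (read, write, and so on).

  4. Updating the descriptor table: the kernel stores the pointer to the file structure in a free slot of the process's descriptor table. The slot index is fd, and the process uses it to refer to the file.

  5. Returning the descriptor: on success open() returns the file descriptor (fd), which can then be passed to read(), write(), and other calls.

The structures involved:

  • task_struct: every process has one; its files member points to the descriptor table.
  • files_struct: the per-process descriptor table, which reaches the array of file pointers through fdt->fd.
  • file: represents an open file; it holds the offset, a pointer to the inode, and a pointer to the file operations.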
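/* Simplified view; see section II for the full Linux 6.6 definitions. */
struct fdtable {
    unsigned int max_fds;   // capacity of the fd array
    struct file **fd;       // array indexed by file descriptor
};

struct files_struct {
    struct fdtable *fdt;    // descriptor table currently in use
    unsigned int next_fd;   // hint for the next free descriptor
};

struct file {
    struct inode *f_inode;               // the file's inode
    const struct file_operations *f_op;  // file operations table
    loff_t f_pos;                        // current file offset
};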
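POSIX requires open() to return the lowest-numbered descriptor not currently open in the process, which is what the slot search in the descriptor table implements. A small single-threaded check, assuming /dev/null exists (the function name is hypothetical):

```c
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if a descriptor number freed by close() is handed out again
 * by the next open(), 0 if not, -1 on setup error. Single-threaded use
 * only: another thread could grab the slot in between. */
int lowest_fd_is_reused(void)
{
    int a = open("/dev/null", O_RDONLY);
    int b = open("/dev/null", O_RDONLY);
    int c, reused;

    if (a < 0 || b < 0) {
        if (a >= 0)
            close(a);
        if (b >= 0)
            close(b);
        return -1;
    }
    close(a);                              /* slot a becomes the lowest free one */
    c = open("/dev/null", O_RDONLY);
    reused = (c == a);

    if (c >= 0)
        close(c);
    close(b);
    return reused;
}
```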

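2. Changing the current working directory (the chdir() system call)

Every process has a current working directory (pwd) and can change it with the chdir() system call. Doing so updates the corresponding field reached through the process descriptor.

Suppose process A runs:

chdir("/home/user/documents");

  1. Path resolution: the kernel resolves /home/user/documents and finds the directory's dentry (directory entry).

  2. Updating the working directory: once the path is resolved and the permission check passes, the kernel calls set_fs_pwd(), which makes the pwd field of the process's fs_struct refer to the new directory.

  3. Effect on later lookups: because the change is made in the fs_struct reached through task_struct, every later relative-path operation by this process starts from the new directory.

The structures involved:

  • fs_struct: every process has one; it holds the current working directory (pwd) and the root directory (root), each as a struct path that includes the dentry.
  • dentry: a directory entry, representing one directory or file name in the filesystem. It caches the mapping from a name within a directory to an inode.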
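/* Simplified; the full Linux 6.6 definition appears in section II. */
struct fs_struct {
    struct path root; // root directory: (vfsmount, dentry) pair
    struct path pwd;  // current working directory
};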
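From user space the same sequence looks like the sketch below. It changes directory, reads the result back with getcwd(), and then restores the original directory with fchdir() so the caller is not affected. The function name is invented, and the sketch assumes the starting directory is readable.

```c
#include <fcntl.h>
#include <unistd.h>

/* chdir() to dir, store the resulting getcwd() in out, then return to
 * the original directory with fchdir(). Returns 0 on success, -1 on error. */
int cwd_after_chdir(const char *dir, char *out, size_t size)
{
    int saved = open(".", O_RDONLY);   /* keeps the old pwd reachable */
    int rc = -1;

    if (saved < 0)
        return -1;
    if (chdir(dir) == 0 && getcwd(out, size) != NULL)
        rc = 0;
    if (fchdir(saved) != 0)            /* restore the previous pwd */
        rc = -1;
    close(saved);
    return rc;
}
```

fchdir() takes an already opened directory, so no path walk is needed to get back.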

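3. Reading a file (the read() system call)

When a process reads a file, it goes through a file descriptor. The file structure behind the descriptor points to the file's inode, through which the filesystem locates the data. The operation involves the process descriptor, the descriptor table, and the filesystem.

Suppose process A runs:

char buf[100];
int fd = open("/home/user/file.txt", O_RDONLY);
ssize_t bytes_read = read(fd, buf, sizeof(buf));

  1. Looking up the descriptor: the kernel first uses the process's descriptor table to find the file structure that fd refers to.

  2. Reaching the inode: the file structure holds a pointer to the file's inode. The filesystem uses the inode and its address space to map file offsets to cached pages or on-disk blocks.

  3. Reading the data: the kernel calls the read method from the file's operations table (f_op->read, or f_op->read_iter when the filesystem provides that instead). For regular files the data is usually copied from the page cache into the process's user-space buffer.

  4. Updating the offset: after each read, the offset stored in the file structure (f_pos) advances to the next position to read.

The structures involved:

  • file_operations: the file's operations table. The read() system call is carried out by the read (or read_iter) function found there.
  • inode: holds the file's metadata, such as size, permissions, and the information needed to locate its data blocks.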
struct file_operations {
    ssize_t (*read) (struct file *file, char __user *buf, size_t count, loff_t *pos);
    ssize_t (*write) (struct file *file, const char __user *buf, size_t count, loff_t *pos);
    ssize_t (*read_iter) (struct kiocb *iocb, struct iov_iter *to);
    // other operations omitted
};
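The dispatch through f_op can be modelled in user space with an ordinary table of function pointers. The following is a toy model, not kernel code: every toy_ and mem_ name is invented, and the "file" is just an in-memory string.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

struct toy_file;

/* Stand-in for struct file_operations: one method only. */
struct toy_file_ops {
    ssize_t (*read)(struct toy_file *f, char *buf, size_t len, size_t *pos);
};

/* Stand-in for struct file. */
struct toy_file {
    const struct toy_file_ops *f_op;  /* chosen when the file is "opened" */
    size_t f_pos;                     /* current offset */
    const char *data;                 /* backing bytes of this toy file */
    size_t size;
};

/* One concrete "filesystem": reads from the in-memory buffer. */
static ssize_t mem_read(struct toy_file *f, char *buf, size_t len, size_t *pos)
{
    size_t left = *pos < f->size ? f->size - *pos : 0;

    if (len > left)
        len = left;
    memcpy(buf, f->data + *pos, len);
    *pos += len;                      /* the method advances the offset */
    return (ssize_t)len;
}

const struct toy_file_ops mem_ops = { .read = mem_read };

/* Generic layer: knows nothing about the backing store. */
ssize_t toy_vfs_read(struct toy_file *f, char *buf, size_t len)
{
    if (!f->f_op || !f->f_op->read)
        return -1;                    /* no read method: not readable */
    return f->f_op->read(f, buf, len, &f->f_pos);
}
```

The generic function plays the part of the VFS layer, and adding another backing store only requires another operations table.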

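4. Closing a file (the close() system call)

When a process no longer needs an open file, it calls close(). This frees the descriptor, updates the process's descriptor table, and may trigger filesystem work.

Suppose process A runs:

close(fd);

  1. Looking up the descriptor: the kernel uses the process's descriptor table to find the file structure that fd refers to.

  2. Freeing the descriptor: the kernel removes the entry from the descriptor table, so the number fd can be reused.

  3. Releasing the file: close() does not by itself force data to disk. The kernel calls f_op->flush if the filesystem provides it and then drops the descriptor's reference to the file structure. When the last reference goes away, f_op->release is called and the structure is freed. Dirty data is written back later by the kernel, or earlier if the process called fsync() before closing. A mapping created with mmap() holds its own reference to the file, so it remains valid after close().

The structures involved:

  • file: holds the inode, offset, and other state of the open file.
  • files_struct: the per-process descriptor table, which tracks every file the process has open.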
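struct file_operations {
    int (*flush) (struct file *file, fl_owner_t id);          // called on each close()
    int (*release) (struct inode *inode, struct file *file);  // called when the last reference is dropped
};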
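The difference between freeing a descriptor and releasing the file can be seen with dup(). A minimal sketch, assuming /dev/zero exists and the program is single-threaded; the function name is hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if, after close(fd), reads through fd fail with EBADF while
 * reads through a dup()'ed descriptor still succeed; 0 if not; -1 on
 * setup error. */
int dup_outlives_close(void)
{
    char c;
    int ok;
    int fd = open("/dev/zero", O_RDONLY);
    int fd2 = fd >= 0 ? dup(fd) : -1;

    if (fd < 0 || fd2 < 0) {
        if (fd >= 0)
            close(fd);
        return -1;
    }
    close(fd);                                      /* drops one reference only */
    ok = (read(fd, &c, 1) == -1 && errno == EBADF)  /* the slot is gone */
      && (read(fd2, &c, 1) == 1);                   /* the file is still open */
    close(fd2);                                     /* last reference dropped */
    return ok;
}
```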