The Linux read/write system calls accept an O_DIRECT flag, which makes the kernel try to bypass the page cache and read/write the disk directly. (Why only "try"? Because when writing straight to disk fails, the kernel still falls back to writing through the cache.) To make direct disk access possible, direct IO imposes quite a few restrictions: the file offset must be aligned to the disk block size, the memory address must be aligned to the disk block size, and the read/write size must be a multiple of the disk block size as well. Even so, the direct IO implementation has a small flaw. I have already explained this flaw in my FUSE analysis; if the principle behind it is unclear to you, please refer to that FUSE flaw analysis. Here the focus is on how direct IO is implemented, with a note on where the flaw is introduced.
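To make those alignment rules concrete, here is a minimal userspace sketch. The 512-byte alignment and the file name are assumptions for illustration; the real requirement depends on the device and filesystem (the logical block size can be queried with ioctl(fd, BLKSSZGET, ...) on the block device).

#define _GNU_SOURCE             /* needed for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGN 512               /* assumed block size, illustration only */

int main(void)
{
	void *buf;
	int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* the buffer address must be block aligned, so a plain malloc()
	 * will not do; posix_memalign() gives an aligned allocation */
	if (posix_memalign(&buf, ALIGN, ALIGN)) {
		perror("posix_memalign");
		return 1;
	}
	memset(buf, 'x', ALIGN);
	/* the file offset (0 here) and the size (ALIGN) are block aligned too */
	if (write(fd, buf, ALIGN) < 0)
		perror("write");
	free(buf);
	close(fd);
	return 0;
}

With a plain malloc()'ed buffer, an unaligned offset, or an unaligned size, the same write() would typically fail with EINVAL.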
The call path from the write system call down to direct IO is:
write --> vfs_write --> do_sync_write --> generic_file_aio_write -->__generic_file_aio_write
Direct IO enters the picture starting at __generic_file_aio_write, so that is where the analysis begins.
/*@iocb    kernel io control block declared in do_sync_write, mainly used
 *         to wait for io completion, i.e. to make the call synchronous
 *@iov     the user buffer passed to write; the array has only 1 element
 *@nr_segs 1
 *@ppos    file offset to write at
 */
ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
				 unsigned long nr_segs, loff_t *ppos)
{
	// some irrelevant code omitted
	...
	if (io_is_direct(file)) {
		loff_t endbyte;
		ssize_t written_buffered;

		// first try to write straight to disk
		written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
							ppos, count, ocount);
		/*
		 * If the write stopped short of completing, fall back to
		 * buffered writes. Some filesystems do this for writes to
		 * holes, for example. For DAX files, a buffered write will
		 * not succeed (even if it did, DAX does not handle dirty
		 * page-cache pages correctly).
		 */
		if (written < 0 || written == count || IS_DAX(inode))
			goto out;

		pos += written;
		count -= written;
		// whatever could not be written directly goes through the
		// page cache instead
		written_buffered = generic_file_buffered_write(iocb, iov,
						nr_segs, pos, ppos, count,
						written);
		/*
		 * If generic_file_buffered_write() returned a synchronous error
		 * then we want to return the number of bytes which were
		 * direct-written, or the error code if that was zero. Note
		 * that this differs from normal direct-io semantics, which
		 * will return -EFOO even if some bytes were written.
		 */
		if (written_buffered < 0) {
			err = written_buffered;
			goto out;
		}

		/*
		 * We need to ensure that the page cache pages are written to
		 * disk and invalidated to preserve the expected O_DIRECT
		 * semantics.
		 */
		// after the buffered write, flush those pages to disk so the
		// single system call still leaves the data on disk
		endbyte = pos + written_buffered - written - 1;
		err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte);
		...
	}
	...
}
If the file was opened with the O_DIRECT flag, the kernel first tries to write straight to disk. If the direct write falls short without returning an error, it falls back to a buffered write: the data is first written into the page cache, and the cache is then flushed to disk. This is why the man page describes O_DIRECT as "Try to minimize cache effects of the I/O to and from this file." rather than as not using the cache at all.
ssize_t
generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
		unsigned long *nr_segs, loff_t pos, loff_t *ppos,
		size_t count, size_t ocount)
{
	...
	// first flush any cached pages covering the target range to disk
	written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1);
	if (written)
		goto out;

	/*
	 * After a write we want buffered reads to be sure to go to disk to get
	 * the new data. We invalidate clean cached page from the region we're
	 * about to write. We do this *before* the write so that we can return
	 * without clobbering -EIOCBQUEUED from ->direct_IO().
	 */
	if (mapping->nrpages) {
		// once the disk is written, the cached copies are stale, so
		// the clean cached pages must be invalidated. Why before the
		// write rather than after? Per the native comment above: if
		// invalidation ran afterwards and failed, its error would
		// clobber the -EIOCBQUEUED return value from ->direct_IO()
		written = invalidate_inode_pages2_range(mapping,
					pos >> PAGE_CACHE_SHIFT, end);
		/*
		 * If a page can not be invalidated, return 0 to fall back
		 * to buffered write.
		 */
		if (written) {
			if (written == -EBUSY)
				return 0;
			goto out;
		}
	}

	written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs);

	/*
	 * Finally, try again to invalidate clean pages which might have been
	 * cached by non-direct readahead, or faulted in by get_user_pages()
	 * if the source of the write was an mmap'ed region of the file
	 * we're writing. Either one is a pretty crazy thing to do,
	 * so we don't support it 100%. If this invalidation
	 * fails, tough, the write still worked...
	 */
	// why invalidate again when the pages were already invalidated above?
	// there is a nasty scenario, described below
	if (mapping->nrpages) {
		invalidate_inode_pages2_range(mapping,
					pos >> PAGE_CACHE_SHIFT, end);
	}
	...
}
generic_file_direct_write first flushes any cached pages covering the write range to disk, then invalidates the corresponding cache pages, and after the direct write completes it invalidates the pages once more. Why? Consider the following scenario:
a process mmaps a page over a region of the file, then passes that mapped memory to write for a direct IO write, and the write target happens to be exactly the region covered by the mmap.
Should the page cache for that range be invalidated after the direct write or not? This is the one reason given in the English comment above; there may be other reasons I am not yet clear on. If you are unfamiliar with the mmap implementation, see my analysis of mmap.
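To make the scenario concrete, here is a minimal sketch of that self-overlapping case (the file name and the 4096-byte page/block size are assumptions; the file is assumed to already exist and hold at least one page; this illustrates the shape of the problem, not a recommended pattern):

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SZ 4096                 /* assumed page size == block size */

int main(void)
{
	/* "testfile" is assumed to exist and hold at least SZ bytes */
	int fd = open("testfile", O_RDWR | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* map the first page of the file */
	char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	p[0] = 'A';             /* fault the page into the page cache */
	/* direct-write the mapped page back over the very same file
	 * range: the write pins the pages it is about to make stale */
	if (pwrite(fd, p, SZ, 0) < 0)
		perror("pwrite");
	munmap(p, SZ);
	close(fd);
	return 0;
}

Here get_user_pages() on the write path faults the mmap'ed page into the page cache, which is exactly the case the second invalidation after ->direct_IO() is there for.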
generic_file_direct_write calls the direct IO function provided by the filesystem, but as with the filesystem's other methods, most implementations simply wrap the kernel's generic direct IO code. Taking ext2 as the example, what ultimately runs is do_blockdev_direct_IO. This function is shared by the read and write paths, which is worth keeping in mind during the analysis.
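For reference, this is roughly what the chain from ext2 down to the generic helper looks like in a 3.x-era kernel (quoted from memory, so details may differ between versions); it also shows where the NULL end_io/submit_io arguments and the DIO_LOCKING | DIO_SKIP_HOLES flags in the parameter list below come from:

/* fs/ext2/inode.c: ext2 just delegates to the generic block layer helper */
static ssize_t
ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
		loff_t offset, unsigned long nr_segs)
{
	struct file *file = iocb->ki_filp;
	struct address_space *mapping = file->f_mapping;
	struct inode *inode = mapping->host;
	ssize_t ret;

	ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
				 ext2_get_block);
	if (ret < 0 && (rw & WRITE))
		ext2_write_failed(mapping, offset + iov_length(iov, nr_segs));
	return ret;
}

/* include/linux/fs.h: fills in the NULL end_io/submit_io callbacks and
 * the default flags, then ends up in do_blockdev_direct_IO */
static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb,
		struct inode *inode, const struct iovec *iov, loff_t offset,
		unsigned long nr_segs, get_block_t get_block)
{
	return __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
				    offset, nr_segs, get_block, NULL, NULL,
				    DIO_LOCKING | DIO_SKIP_HOLES);
}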
/*
@rw        read/write direction
@iocb      kernel io control block, carrying the file offset and the
           length of the io
@inode     target inode
@bdev      block device the target lives on
@iov       user buffer iov, usually only 1 element
@offset    file offset
@nr_segs   usually 1
@get_block maps a logical block number in the file to a block number
           on disk
@end_io    callback invoked once the io completes, NULL here
@submit_io not sure what this one does, NULL here
@flags     flags, here DIO_LOCKING | DIO_SKIP_HOLES
*/
static inline ssize_t
do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
	struct block_device *bdev, const struct iovec *iov, loff_t offset,
	unsigned long nr_segs,