The Linux read/write system calls accept an O_DIRECT flag, which makes the kernel try to bypass the page cache and read/write the disk directly. (Why only "try"? Because when writing straight to disk fails, the kernel still falls back to writing through the cache.) To make direct disk access possible, direct IO imposes quite a few restrictions: the file offset must be aligned to the disk block size, the memory address must be aligned to the disk block size, and the read/write size must be a multiple of the disk block size as well. Even so, the direct IO implementation has a small flaw. I have already explained this flaw in my FUSE analysis; if the principle behind it is unclear to you, please refer to that FUSE flaw analysis. Here the focus is on how direct IO is implemented, with a note on where the flaw is introduced.
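To make those alignment rules concrete, here is a minimal userspace sketch. The 512-byte alignment and the file name are assumptions for illustration; the real requirement depends on the device and filesystem (the logical block size can be queried with ioctl(fd, BLKSSZGET, ...) on the block device).

#define _GNU_SOURCE             /* needed for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGN 512               /* assumed block size, illustration only */

int main(void)
{
	void *buf;
	int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* the buffer address must be block aligned, so a plain malloc()
	 * will not do; posix_memalign() gives an aligned allocation */
	if (posix_memalign(&buf, ALIGN, ALIGN)) {
		perror("posix_memalign");
		return 1;
	}
	memset(buf, 'x', ALIGN);
	/* the file offset (0 here) and the size (ALIGN) are block aligned too */
	if (write(fd, buf, ALIGN) < 0)
		perror("write");
	free(buf);
	close(fd);
	return 0;
}

With a plain malloc()'ed buffer, an unaligned offset, or an unaligned size, the same write() would typically fail with EINVAL.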
The call path from the write system call down to direct IO is:
write --> vfs_write --> do_sync_write --> generic_file_aio_write -->__generic_file_aio_write
Direct IO enters the picture starting at __generic_file_aio_write, so that is where the analysis begins.
/*@iocb    kernel io control block declared in do_sync_write, mainly used
 *         to wait for io completion, i.e. to make the call synchronous
 *@iov     the user buffer passed to write; the array has only 1 element
 *@nr_segs 1
 *@ppos    file offset to write at
 */
ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
				 unsigned long nr_segs, loff_t *ppos)
{
	// some irrelevant code omitted
	...
	if (io_is_direct(file)) {
		loff_t endbyte;
		ssize_t written_buffered;

		// first try to write straight to disk
		written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
							ppos, count, ocount);
		/*
		 * If the write stopped short of completing, fall back to
		 * buffered writes. Some filesystems do this for writes to
		 * holes, for example. For DAX files, a buffered write will
		 * not succeed (even if it did, DAX does not handle dirty
		 * page-cache pages correctly).
		 */
		if (written < 0 || written == count || IS_DAX(inode))
			goto out;

		pos += written;
		count -= written;
		// whatever could not be written directly goes through the
		// page cache instead
		written_buffered = generic_file_buffered_write(iocb, iov,
						nr_segs, pos, ppos, count,
						written);
		/*
		 * If generic_file_buffered_write() returned a synchronous error
		 * then we want to return the number of bytes which were
		 * direct-written, or the error code if that was zero. Note
		 * that this differs from normal direct-io semantics, which
		 * will return -EFOO even if some bytes were written.
		 */
		if (written_buffered < 0) {
			err = written_buffered;
			goto out;
		}

		/*
		 * We need to ensure that the page cache pages are written to
		 * disk and invalidated to preserve the expected O_DIRECT
		 * semantics.
		 */
		// after the buffered write, flush those pages to disk so the
		// single system call still leaves the data on disk
		endbyte = pos + written_buffered - written - 1;
		err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte);
		...
	}
	...
}
If the file was opened with the O_DIRECT flag, the kernel first tries to write straight to disk. If the direct write falls short without returning an error, it falls back to a buffered write: the data is first written into the page cache, and the cache is then flushed to disk. This is why the man page describes O_DIRECT as "Try to minimize cache effects of the I/O to and from this file." rather than as not using the cache at all.
ssize_t
generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
		unsigned long *nr_segs, loff_t pos, loff_t *ppos,
		size_t count, size_t ocount)
{
	...
	// first flush any cached pages covering the target range to disk
	written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1);
	if (written)
		goto out;

	/*
	 * After a write we want buffered reads to be sure to go to disk to get
	 * the new data. We invalidate clean cached page from the region we're
	 * about to write. We do this *before* the write so that we can return
	 * without clobbering -EIOCBQUEUED from ->direct_IO().
	 */
	if (mapping->nrpages) {
		// once the disk is written, the cached copies are stale, so
		// the clean cached pages must be invalidated. Why before the
		// write rather than after? Per the native comment above: if
		// invalidation ran afterwards and failed, its error would
		// clobber the -EIOCBQUEUED return value from ->direct_IO()
		written = invalidate_inode_pages2_range(mapping,
					pos >> PAGE_CACHE_SHIFT, end);
		/*
		 * If a page can not be invalidated, return 0 to fall back
		 * to buffered write.
		 */
		if (written) {
			if (written == -EBUSY)
				return 0;
			goto out;
		}
	}

	written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs);

	/*
	 * Finally, try again to invalidate clean pages which might have been
	 * cached by non-direct readahead, or faulted in by get_user_pages()
	 * if the source of the write was an mmap'ed region of the file
	 * we're writing. Either one is a pretty crazy thing to do,
	 * so we don't support it 100%. If this invalidation
	 * fails, tough, the write still worked...
	 */
	// why invalidate again when the pages were already invalidated above?
	// there is a nasty scenario, described below
	if (mapping->nrpages) {
		invalidate_inode_pages2_range(mapping,
					pos >> PAGE_CACHE_SHIFT, end);
	}
	...
}
generic_file_direct_write first flushes any cached pages covering the write range to disk, then invalidates the corresponding cache pages, and after the direct write completes it invalidates the pages once more. Why? Consider the following scenario:
a process mmaps a page over a region of the file, then passes that mapped memory to write for a direct IO write, and the write target happens to be exactly the region covered by the mmap.
Should the page cache for that range be invalidated after the direct write or not? This is the one reason given in the English comment above; there may be other reasons I am not yet clear on. If you are unfamiliar with the mmap implementation, see my analysis of mmap.
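To make the scenario concrete, here is a minimal sketch of that self-overlapping case (the file name and the 4096-byte page/block size are assumptions; the file is assumed to already exist and hold at least one page; this illustrates the shape of the problem, not a recommended pattern):

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SZ 4096                 /* assumed page size == block size */

int main(void)
{
	/* "testfile" is assumed to exist and hold at least SZ bytes */
	int fd = open("testfile", O_RDWR | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* map the first page of the file */
	char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	p[0] = 'A';             /* fault the page into the page cache */
	/* direct-write the mapped page back over the very same file
	 * range: the write pins the pages it is about to make stale */
	if (pwrite(fd, p, SZ, 0) < 0)
		perror("pwrite");
	munmap(p, SZ);
	close(fd);
	return 0;
}

Here get_user_pages() on the write path faults the mmap'ed page into the page cache, which is exactly the case the second invalidation after ->direct_IO() is there for.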
generic_file_direct_write calls the direct IO function provided by the filesystem, but as with the filesystem's other methods, most implementations simply wrap the kernel's generic direct IO code. Taking ext2 as the example, what ultimately runs is do_blockdev_direct_IO. This function is shared by the read and write paths, which is worth keeping in mind during the analysis.
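For reference, this is roughly what the chain from ext2 down to the generic helper looks like in a 3.x-era kernel (quoted from memory, so details may differ between versions); it also shows where the NULL end_io/submit_io arguments and the DIO_LOCKING | DIO_SKIP_HOLES flags in the parameter list below come from:

/* fs/ext2/inode.c: ext2 just delegates to the generic block layer helper */
static ssize_t
ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
		loff_t offset, unsigned long nr_segs)
{
	struct file *file = iocb->ki_filp;
	struct address_space *mapping = file->f_mapping;
	struct inode *inode = mapping->host;
	ssize_t ret;

	ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
				 ext2_get_block);
	if (ret < 0 && (rw & WRITE))
		ext2_write_failed(mapping, offset + iov_length(iov, nr_segs));
	return ret;
}

/* include/linux/fs.h: fills in the NULL end_io/submit_io callbacks and
 * the default flags, then ends up in do_blockdev_direct_IO */
static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb,
		struct inode *inode, const struct iovec *iov, loff_t offset,
		unsigned long nr_segs, get_block_t get_block)
{
	return __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
				    offset, nr_segs, get_block, NULL, NULL,
				    DIO_LOCKING | DIO_SKIP_HOLES);
}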
/*
@rw        read/write direction
@iocb      kernel io control block, carrying the file offset and the
           length of the io
@inode     target inode
@bdev      block device the target lives on
@iov       user buffer iov, usually only 1 element
@offset    file offset
@nr_segs   usually 1
@get_block maps a logical block number in the file to a block number
           on disk
@end_io    callback invoked once the io completes, NULL here
@submit_io not sure what this one does, NULL here
@flags     flags, here DIO_LOCKING | DIO_SKIP_HOLES
*/
static inline ssize_t
do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
	struct block_device *bdev, const struct iovec *iov, loff_t offset,
	unsigned long nr_segs,