1. The Question
With read and write, we copy the data from the kernel's buffer to the application's buffer (read), and then copy the data from the application's buffer to the kernel's buffer (write). With mmap and memcpy, we copy the data directly from one kernel buffer mapped into our address space into another kernel buffer mapped into our address space. This copying occurs as a result of page fault handling when we reference memory pages that don't yet exist (there is one fault per page read and one fault per page written). If the overhead for the system call and extra copying differs from the page fault overhead, then one approach will perform better than the other.
Please make sure you understand the description above; if it is already clear to you, you can skip this article. The goal here is only a general introduction to the knowledge that description touches on, not a deep dive into the kernel, virtual memory, swap, or other low-level topics.
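To make the two code paths concrete, here is a minimal sketch under my own assumptions (file names taken from the command line, error handling mostly omitted, the helper names copy_rw and copy_mmap are mine) of copying a file both ways, roughly the experiment the passage above describes:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copy via read/write: every chunk crosses the kernel/user boundary twice. */
void copy_rw(int src, int dst)
{
    char buf[8192];
    ssize_t n;
    while ((n = read(src, buf, sizeof(buf))) > 0)   /* kernel buffer -> user buffer */
        write(dst, buf, (size_t)n);                 /* user buffer -> kernel buffer */
}

/* Copy via mmap/memcpy: both files are mapped, and the copying happens
 * inside the page-fault handler as memcpy touches each page. */
void copy_mmap(int src, int dst, size_t len)
{
    ftruncate(dst, (off_t)len);      /* destination must already have the final size */
    void *s = mmap(NULL, len, PROT_READ, MAP_SHARED, src, 0);
    void *d = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, dst, 0);
    memcpy(d, s, len);               /* faults pull source pages in and dirty destination pages */
    munmap(s, len);
    munmap(d, len);
}

int main(int argc, char *argv[])
{
    if (argc != 3)
        return 1;
    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_RDWR | O_CREAT | O_TRUNC, 0644);
    struct stat st;
    fstat(src, &st);
    copy_mmap(src, dst, (size_t)st.st_size);   /* swap in copy_rw(src, dst) to compare */
    close(src);
    close(dst);
    return 0;
}

Timing both helpers on a large file is one way to see which overhead dominates on a given system.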
2. Collected Material
2.1 English Sources
Linux always tries to use RAM to speed up disk operations by using available memory for buffers (file system metadata) and cache (pages with actual contents of files or block devices).
First, understand cache and buffer. The quote above is from Red Hat: buffer and cache are both used to speed up disk operations, but they hold different things in RAM. The former holds file system metadata, while the latter holds the actual contents of files. Running free on my own machine gives the following:
[root@localhost ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:           3.8G        1.4G        1.0G        5.3M        1.5G        2.1G
Swap:          3.9G          0B        3.9G
Next, an excellent answer from Quora:
The page cache caches pages of files to optimize file I/O. The buffer cache caches disk blocks to optimize block I/O.
Starting with Linux kernel version 2.4, the contents of the two caches were unified. The VM subsystem now drives I/O and it does so out of the page cache. If cached data has both a file and a block representation—as most data does—the buffer cache will simply point into the page cache; thus only one instance of the data is cached in memory. The page cache is what you picture when you think of a disk cache: It caches file data from a disk to make subsequent I/O faster.
The buffer cache remains, however, as the kernel still needs to perform block I/O in terms of blocks, not pages. As most blocks represent file data, most of the buffer cache is represented by the page cache. But a small amount of block data isn’t file backed—metadata and raw block I/O for example—and thus is solely represented by the buffer cache.
My output above does not quite match the Red Hat description: since kernel 2.4, buff and cache have been unified. As the quote points out, when cached data has both a file and a block representation, the buffer cache simply points into the page cache. So an ordinary write or read operation puts the data into the kernel buffer, which here means the page cache, and the kernel decides when to write it back to disk.
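As a small illustration of that point (a sketch with error handling omitted; the file name demo.txt is just a placeholder), write only copies the bytes into the page cache, and fsync is one way to force the writeback instead of waiting for the kernel:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("demo.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    const char msg[] = "hello page cache\n";

    /* write() copies the bytes from the user buffer into the page cache
     * (the "kernel buffer"); when they reach the disk is up to the kernel. */
    write(fd, msg, sizeof(msg) - 1);

    /* fsync() forces the dirty pages of this file to be written back now,
     * instead of waiting for the kernel's periodic writeback. */
    fsync(fd);

    close(fd);
    return 0;
}

fdatasync would be a lighter alternative; the point is only that returning from write does not mean the bytes are on disk.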
Understanding pages here requires a look at virtual memory mapping, so next comes mmap. The following quote is from Stack Overflow:
The only thing the mmap function really does is change some kernel data structures, and possibly the page table. It doesn’t actually put anything into physical memory at all. After you call mmap, the allocated region probably doesn’t even point to physical memory: accessing it will cause a page fault. This kind of page fault is transparently handled by the kernel, in fact, this is one of the kernel’s primary duties.
The main difference between using mmap to read a file and read to read a file is that unmodified pages in an mmap region do not contribute to overall memory pressure, they are almost “free”, memory wise, as long as they are not being used. In contrast, files read with the read function will always contribute to memory pressure whether they are being used or not, and whether they have been modified or not.
Finally, mmap is faster than read only in the use cases which it favors – random access and page reuse. For linearly traversing a file, especially a small file, read will generally be faster since it does not require modifying the page tables, and it takes fewer system calls.
As a recommendation, I can say that any large file which you will be scanning through should generally be read in its entirety with mmap on 64-bit systems, and you can mmap it in chunks on 32-bit systems where virtual memory is less available.
To briefly summarize the quote above: calling mmap changes some kernel data structures to set up the mapping between the virtual address space and the file, but at that point main memory contains no page frames corresponding to the pages in that virtual address range. Only when the mapped addresses are accessed is a page fault raised, and while handling that fault the kernel copies the corresponding file data into main memory. The virtual memory occupied by an mmap region can therefore be "released" at any time, as long as the region is not in use, whereas read and write system calls always go through the page cache. Saying in general that mmap is faster than read or write is not appropriate; it depends on which costs more, the page-fault handling or the system calls plus the repeated copies through the page cache. A small read-only sketch of the mmap path follows below.
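This is a minimal sketch (reusing the demo.txt written in the earlier sketch; error handling omitted): mmap itself only sets up the mapping, and the first access to each page is what actually pulls the file data into memory.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("demo.txt", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    /* mmap only updates kernel data structures and the page table;
     * no file data is read from disk yet. */
    char *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    /* The first access to each page raises a page fault; the kernel
     * handles it by copying that page of the file into memory. */
    long sum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096)
        sum += p[i];
    printf("touched %lld bytes, checksum %ld\n", (long long)st.st_size, sum);

    munmap(p, (size_t)st.st_size);
    close(fd);
    return 0;
}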
2.2 Chinese Sources
A good article: the following quote comes from that article and explains this post's question quite clearly.
For example, when a process calls mmap, it directly establishes the mapping between the disk file and virtual memory. However, the file is not read from disk right after the mmap call. Only when the file contents are actually needed does the CPU trigger a page fault, asking the page fault handler to read the file contents into memory.
A second good article; the relevant passage is excerpted below:
To sum up, ordinary file operations use the page cache mechanism to improve read/write efficiency and protect the disk. As a result, reading a file first copies the file pages from disk into the page cache; since the page cache lives in kernel space and cannot be addressed directly by the user process, the data pages in the page cache must then be copied again into the corresponding user-space memory. It therefore takes two data copies for the process to obtain the file contents. Writing works the same way: the buffer to be written cannot be accessed directly from kernel space, so it must first be copied into the corresponding kernel-space memory and then written back to disk (delayed write-back), again requiring two data copies.
With mmap, by contrast, creating the new virtual memory area and establishing the mapping between the file's disk address and that area involve no file copying at all. Later, when accessing the data triggers a page fault because the memory holds no data yet, the already-established mapping lets a single data copy bring the data from disk into the user-space memory, where the process can use it.
3. Summary
Back to the original question: read and write copy data back and forth between the kernel buffer and the user-space buffer, and the system decides when to sync, unless the user forces it by calling the sync family of functions. mmap establishes the correspondence between the process's virtual address space and the file; the copy from disk into the page cache in main memory happens only when we read or write the mapped virtual addresses and take a page fault. Neither is absolutely faster than the other; they simply suit different situations. A sketch of the write side of a mapping follows below.
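For completeness, here is a sketch of writing through a mapping (again assuming demo.txt exists and is non-empty; error handling omitted): storing through the mapped pages only dirties the page cache, and msync plays the same role for a mapping that the sync family plays for write.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("demo.txt", O_RDWR);
    struct stat st;
    fstat(fd, &st);

    char *p = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);

    /* Writing through the mapping faults the page in and marks it dirty
     * in the page cache; nothing has reached the disk yet. */
    p[0] = 'X';

    /* msync() forces the dirty mapped pages back to the file now,
     * just as fsync() does for data written with write(). */
    msync(p, (size_t)st.st_size, MS_SYNC);

    munmap(p, (size_t)st.st_size);
    close(fd);
    return 0;
}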
NOTE: if you question anything above, please point it out to the author promptly.