So you want to write to a file real fast…
Or: A tale about Linux file write patterns.
So I once wrote a custom core dump handler to be used with Linux’score_pattern. What it does is take a core dump onSTDIN plus a few arguments, and then write the core to a predictablelocation on disk with a time stamp and suitable access rights. Coredumps tend to be rather large, and in general you don’t know inadvance how much data you’ll write to disk. So I built a functionalityto write a chunk of data to disk (say, 16MB) and then check withfstatfs() if the disk has still more than threshold capacity (say,10GB). This way, a rapidly restarting and core-dumping applicationcannot lead to “disk full” follow up failures that will inevitablylead to a denial of service for most data handling services.
So… how do we write a lot of data to disk really fast? – Let us mayberephrase the question: How do we write data to disk in the firstplace? Let’s assume we have already opened file descriptors in andout, and we just want to copy everything from in to out.
One might be tempted to try something like this:
ssize_t read_write(int in, int out)
{
ssize_t n, t = 0;
char buf[1024];
while((n = read(in, buf, 1024)) > 0) {
t += write(out, buf, n);
}
return t;
}
“But…!”, you cry out, “there’s so much wrong with this!” And you areright, of course:
- The return value
nis not checked. It might be-1. This might bebecause e.g. we have got a bad file descriptor, or because thesyscall was interrupted. - A call to
write(out, buf, 1024)will – if it does not return-1– write at least one byte, but we have no guarantee that we willactually write allnbytes to disk. So we have to loop the writeuntil we have writtennbytes.
An updated and semantically correct patternreads like this (in a real program you’d have to do real errorhandling instead of assertions, of course):
ssize_t read_write_bs(int in, int out, ssize_t bs)
{
ssize_t w = 0, r = 0, t, n, m;
char *buf = malloc(bs);
assert(buf != NULL);
t = filesize(in);
while(r < t && (n = read(in, buf, bs))) {
if(n == -1) { assert(errno == EINTR); continue; }
r = n;
w = 0;
while(w < r && (m = write(out, buf + w, (r - w)))) {
if(m == -1) { assert(errno == EINTR); continue; }
w += m;
}
}
free(buf);
return w;
}
We have a total number of bytes to read (t), the number of bytesalready read (r), and the number of bytes already written (w).Only when t == r == w are we done (or if the input stream endsprematurely). Error checking is performed so that we restartinterrupted syscalls and crash on real errors.
What about the bs parameter? Of course you may have already noticedin the first example that we always copied 1024 bytes. Typically, ablock on the file system is 4KB, so we are only writing quarterblocks, which is likely bad for performance. So we’ll try differentblock sizes and compare the results.
We can find out the file system’s block size like this (as usual,real error handling left out):
ssize_t block_size(int fd)
{
struct statfs st;
assert(fstatfs(fd, &st) != -1);
return (ssize_t) st.f_bsize;
}
OK, let’s do some benchmarks! (Full code is on GitHub.) Forsimplicity I’ll try things on my laptop computer with Ext3+dmcrypt andan SSD. This is “read a 128MB file and write it out”, repeated fordifferent block sizes, timing each version three times and printingthe best time in the first column. In parantheses you’ll see thepercentage increase in comparison to the best run of all methods:
read+write 16bs 164ms 191ms 206ms
read+write 256bs 167ms 168ms 187ms (+ 1.8%)
read+write 4bs 169ms 169ms 177ms (+ 3.0%)
read+write bs 184ms 191ms 200ms (+ 12.2%)
read+write 1k 299ms 317ms 329ms (+ 82.3%)
Mh. Seems like multiples of the FS’s block sizes don’t really matterhere. In some runs, the 16x blocksize is best, sometimes it’s the256x. The only obvious point is that writing only a single block atonce is bad, and writing fractions of a block at once is very badindeed performance-wise.
Now what’s there to improve? “Surely it’s the overhead of usingread() to get data,” I hear you saying, “Use mmap() for that!”So we come up with this:
ssize_t mmap_write(int in, int out)
{
ssize_t w = 0, n;
size_t len;
char *p;
len = filesize(in);
p = mmap(NULL, len, PROT_READ, MAP_SHARED, in, 0);
assert(p != NULL);
while(w < len && (n = write(out, p + w, (len - w)))) {
if(n == -1) { assert(errno == EINTR); continue; }
w += n;
}
munmap(p, len);
return w;
}
Admittedly, the pattern is simpler.But, alas, it is even a little bit slower! (YMMV)
read+write 16bs 167ms 171ms 209ms
mmap+write 186ms 187ms 211ms (+ 11.4%)
“Surely copying around useless data is hurting performance,” I hear yousay, “it’s 2014, use zero-copy already!” – OK. So basically there aretwo approaches for this on Linux: One cumbersome but rather old andknown to work, and then there is the new and shinysendfileinterface.
For the spliceapproach, since either reader or writer of your splice call must bepipes (and in our case both are regular files), we need to create apipe solely for the purpose of splicing data from in to the writeend of the pipe, and then again splicing that same chunk from the readend to the out fd:
ssize_t pipe_splice(int in, int out)
{
size_t bs = 65536;
ssize_t w = 0, r = 0, t, n, m;
int pipefd[2];
int flags = SPLICE_F_MOVE | SPLICE_F_MORE;
assert(pipe(pipefd) != -1);
t = filesize(in);
while(r < t && (n = splice(in, NULL, pipefd[1], NULL, bs, flags))) {
if(n == -1) { assert(errno == EINTR); continue; }
r += n;
while(w < r && (m = splice(pipefd[0], NULL, out, NULL, bs, flags))) {
if(m == -1) { assert(errno == EINTR); continue; }
w += m;
}
}
close(pipefd[0]);
close(pipefd[1]);
return w;
}
“This is not true zero copy!”, I hear you cry, and it’s true, the ‘pagestealing’ mechanism has been discontinuedas of 2007. So what we get is an “in-kernel memory copy”, but at leastthe file contents don’t cross the kernel/userspace boundary twiceunnecessarily (we don’t inspect it anyway, right?).
The sendfile() approach is more immediate and clean:
ssize_t do_sendfile(int in, int out)
{
ssize_t t = filesize(in);
off_t ofs = 0;
while(ofs < t) {
if(sendfile(out, in, &ofs, t - ofs) == -1) {
assert(errno == EINTR);
continue;
}
}
return t;
}
So… do we get an actual performance gain?
sendfile 159ms 168ms 175ms
pipe+splice 161ms 162ms 163ms (+ 1.3%)
read+write 16bs 164ms 165ms 178ms (+ 3.1%)
“Yes! I knew it!” you say. But I’m lying here. Every time I executethe benchmark, another different approach is the fastest. Sometimesthe read/write approach comes in first before the two others. So itseems that this is not really a performance saver, is it? I like thesendfile() semantics, though. But beware:
In Linux kernels before 2.6.33, out_fd must refer to a socket. SinceLinux 2.6.33 it can be any file. If it is a regular file, thensendfile() changes the file offset appropriately.
Strangely, sendfile() works on regular files in the default DebianSqueeze Kernel (2.6.32-5) without problems. –
“But,” I hear you saying, “the system has no clue what your intentionsare, give it a few hints!” and you are probably right, that shouldn’t hurt:
void advice(int in, int out)
{
ssize_t t = filesize(in);
posix_fadvise(in, 0, t, POSIX_FADV_WILLNEED);
posix_fadvise(in, 0, t, POSIX_FADV_SEQUENTIAL);
}
But since the file is very probably fully cached, the performance isnot improved significantly. “BUT you should supply a hint on how muchyou will write, too!” – And you are right. And this is where the storybranches off into two cases: Old and new file systems.
I’ll just tell the kernel that I want to write t bytes to disk now,and please reserve space (I don’t care about a “disk full” that Icould catch and act on):
void do_falloc(int in, int out)
{
ssize_t t = filesize(in);
posix_fallocate(out, 0, t);
}
I’m using my workstation’s SSD with XFS now (not my laptop any more).Suddenly everything is much faster, so I’ll simply run the benchmarkson a 512MB file so that it actually takes time:
sendfile + advices + falloc 205ms 208ms 208ms
pipe+splice + advices + falloc 207ms 209ms 210ms (+ 1.0%)
sendfile 226ms 226ms 229ms (+ 10.2%)
pipe+splice 227ms 227ms 231ms (+ 10.7%)
read+write 16bs + advices + falloc 235ms 240ms 240ms (+ 14.6%)
read+write 16bs 258ms 259ms 263ms (+ 25.9%)
Wow, so this posix_fallocate() thing is a real improvement! It seemsreasonable enough, of course: Already the file system can prepare an– if possible contiguous – sequence of blocks in the requested size. Butwait! What about Ext3? Back to the laptop:
sendfile 161ms 171ms 194ms
read+write 16bs 164ms 174ms 189ms (+ 1.9%)
pipe+splice 167ms 170ms 178ms (+ 3.7%)
read+write 16bs + advices + falloc 224ms 229ms 229ms (+ 39.1%)
pipe+splice + advices + falloc 229ms 239ms 241ms (+ 42.2%)
sendfile + advices + falloc 232ms 235ms 249ms (+ 44.1%)
Bummer. That was unexpected. Why is that? Let’s check strace whilewe execute this program:
fallocate(1, 0, 0, 134217728) = -1 EOPNOTSUPP (Operation not supported)
...
pwrite(1, "\0", 1, 4095) = 1
pwrite(1, "\0", 1, 8191) = 1
pwrite(1, "\0", 1, 12287) = 1
pwrite(1, "\0", 1, 16383) = 1
...
What? Who does this? – Glibc does this! It sees the syscall fail andre-creates the semantics by hand. (Beware, Glibc code follows. Safe toskip if you want to keep your sanity.)
/* Reserve storage for the data of the file associated with FD. */
int
posix_fallocate (int fd, __off_t offset, __off_t len)
{
#ifdef __NR_fallocate
# ifndef __ASSUME_FALLOCATE
if (__glibc_likely (__have_fallocate >= 0))
# endif
{
INTERNAL_SYSCALL_DECL (err);
int res = INTERNAL_SYSCALL (fallocate, err, 6, fd, 0,
__LONG_LONG_PAIR (offset >> 31, offset),
__LONG_LONG_PAIR (len >> 31, len));
if (! INTERNAL_SYSCALL_ERROR_P (res, err))
return 0;
# ifndef __ASSUME_FALLOCATE
if (__glibc_unlikely (INTERNAL_SYSCALL_ERRNO (res, err) == ENOSYS))
__have_fallocate = -1;
else
# endif
if (INTERNAL_SYSCALL_ERRNO (res, err) != EOPNOTSUPP)
return INTERNAL_SYSCALL_ERRNO (res, err);
}
#endif
return internal_fallocate (fd, offset, len);
}
And you guessed it, internal_fallocate() just does a pwrite() onthe first byte for every block until the space requirement isfulfilled. This is slowing things down considerably. This is bad. –
“But other people just truncate the file! I saw this!”, you interject,and again you are right.
void enlarge_truncate(int in, int out)
{
ssize_t t = filesize(in);
ftruncate(out, t);
}
Indeed the truncate versions work faster on Ext3:
pipe+splice + advices + trunc 157ms 158ms 160ms
read+write 16bs + advices + trunc 158ms 167ms 188ms (+ 0.6%)
sendfile + advices + trunc 164ms 167ms 181ms (+ 4.5%)
sendfile 164ms 171ms 193ms (+ 4.5%)
pipe+splice 166ms 167ms 170ms (+ 5.7%)
read+write 16bs 178ms 185ms 185ms (+ 13.4%)
Alas, not on XFS. There, the fallocate() system call is just moreperformant. (You can also usexfsctldirectly for that.) –
And this is where the story ends.
In place of a sweeping conclusion, I’m a little bit disappointed thatthere seems to be no general semantics to say “I’ll write n bytesnow, please be prepared”. Obviously, using posix_fallocate() on Ext3hurts very much (this may be why cp is notemployingit). So I guess the best solution is still something like this:
if(fallocate(out, 0, 0, len) == -1 && errno == EOPNOTSUPP)
ftruncate(out, len);
Maybe you have another idea how to speed up the writing process? Thendrop me an email, please.
Update 2014-05-03: Coming back after a couple of days’ vacation, Ifound the post was onHackerNews andgenerated some 23k hits here. I corrected the small mistake in example 2(as pointed out in the comments – thanks!). – I trust that the diligentreader will have noticed that this is not a complete survey of eitherI/O hierarchy, file system and/or hard drive performace. It is, as the subtitleshould have made clear, a “tale about Linux file write patterns”.
Update 2014-06-09: Sebastian pointed out an errorin the mmap write pattern (the write should start at p + w, not at p).Also, the basic read/write pattern contained a subtle error. Tricky business –Thanks!
本文探讨了在Linux环境下高效写入文件的各种方法和技术细节,通过对比不同的写入模式和使用系统调用,为开发者提供了实际可行的优化建议。

被折叠的 条评论
为什么被折叠?



