
So you want to write to a file real fast…

Or: A tale about Linux file write patterns.

So I once wrote a custom core dump handler to be used with Linux’s core_pattern. What it does is take a core dump on STDIN plus a few arguments, and then write the core to a predictable location on disk with a time stamp and suitable access rights. Core dumps tend to be rather large, and in general you don’t know in advance how much data you’ll write to disk. So I built a functionality to write a chunk of data to disk (say, 16MB) and then check with fstatfs() if the disk still has more than threshold capacity (say, 10GB). This way, a rapidly restarting and core-dumping application cannot lead to “disk full” follow-up failures that will inevitably lead to a denial of service for most data handling services.
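That chunk-and-check logic can be sketched like this (a minimal sketch; space_left() and its use of f_bavail are my own choices, not the handler's actual code):

```c
#include <assert.h>
#include <stdio.h>
#include <sys/vfs.h>

/* Return the number of bytes still available to unprivileged users on
 * the file system that fd lives on, or -1 on error. A copy loop could
 * call this after every 16MB chunk and stop below a 10GB threshold. */
long long space_left(int fd)
{
    struct statfs st;
    if (fstatfs(fd, &st) == -1)
        return -1;
    return (long long) st.f_bavail * (long long) st.f_bsize;
}
```

The core-writing loop would then bail out as soon as space_left(out) drops below the threshold, instead of filling up the disk.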

So… how do we write a lot of data to disk really fast? – Let us maybe rephrase the question: How do we write data to disk in the first place? Let’s assume we have already opened file descriptors in and out, and we just want to copy everything from in to out.

One might be tempted to try something like this:

ssize_t read_write(int in, int out)
{
    ssize_t n, t = 0;
    char buf[1024];
    while((n = read(in, buf, 1024)) > 0) {
        t += write(out, buf, n);
    }
    return t;
}

“But…!”, you cry out, “there’s so much wrong with this!” And you are right, of course:

  • The return value n is not checked. It might be -1. This might be because e.g. we have got a bad file descriptor, or because the syscall was interrupted.
  • A call to write(out, buf, 1024) will – if it does not return -1 – write at least one byte, but we have no guarantee that we will actually write all n bytes to disk. So we have to loop the write until we have written n bytes.

An updated and semantically correct pattern reads like this (in a real program you’d have to do real error handling instead of assertions, of course):

ssize_t read_write_bs(int in, int out, ssize_t bs)
{
    ssize_t w = 0, r = 0, t, n, m;

    char *buf = malloc(bs);
    assert(buf != NULL);

    t = filesize(in);

    while(r < t && (n = read(in, buf, bs))) {
        if(n == -1) { assert(errno == EINTR); continue; }
        r += n;
        /* flush the freshly read chunk; it starts at buf + (w - (r - n)) */
        while(w < r && (m = write(out, buf + (w - (r - n)), r - w))) {
            if(m == -1) { assert(errno == EINTR); continue; }
            w += m;
        }
    }

    free(buf);

    return w;
}

We have a total number of bytes to read (t), the number of bytes already read (r), and the number of bytes already written (w). Only when t == r == w are we done (or if the input stream ends prematurely). Error checking is performed so that we restart interrupted syscalls and crash on real errors.

What about the bs parameter? Of course you may have already noticed in the first example that we always copied 1024 bytes. Typically, a block on the file system is 4KB, so we are only writing quarter blocks, which is likely bad for performance. So we’ll try different block sizes and compare the results.

We can find out the file system’s block size like this (as usual, real error handling left out):

ssize_t block_size(int fd)
{
    struct statfs st;
    assert(fstatfs(fd, &st) != -1);
    return (ssize_t) st.f_bsize;
}

OK, let’s do some benchmarks! (Full code is on GitHub.) For simplicity I’ll try things on my laptop computer with Ext3+dm-crypt and an SSD. This is “read a 128MB file and write it out”, repeated for different block sizes, timing each version three times and printing the best time in the first column. In parentheses you’ll see the percentage increase in comparison to the best run of all methods:

read+write 16bs             164ms      191ms      206ms
read+write 256bs            167ms      168ms      187ms  (+ 1.8%)
read+write 4bs              169ms      169ms      177ms  (+ 3.0%)
read+write bs               184ms      191ms      200ms  (+ 12.2%)
read+write 1k               299ms      317ms      329ms  (+ 82.3%)

Mh. Seems like multiples of the FS’s block size don’t really matter here. In some runs, the 16x block size is best, sometimes it’s the 256x. The only obvious point is that writing only a single block at once is bad, and writing fractions of a block at once is very bad indeed performance-wise.

Now what’s there to improve? “Surely it’s the overhead of using read() to get data,” I hear you saying, “Use mmap() for that!” So we come up with this:

ssize_t mmap_write(int in, int out)
{
    ssize_t w = 0, n;
    size_t len;
    char *p;

    len = filesize(in);
    p = mmap(NULL, len, PROT_READ, MAP_SHARED, in, 0);
    assert(p != MAP_FAILED);

    while(w < (ssize_t) len && (n = write(out, p + w, (len - w)))) {
        if(n == -1) { assert(errno == EINTR); continue; }
        w += n;
    }

    munmap(p, len);

    return w;
}

Admittedly, the pattern is simpler. But, alas, it is even a little bit slower! (YMMV)

read+write 16bs               167ms      171ms      209ms
mmap+write                    186ms      187ms      211ms  (+ 11.4%)

“Surely copying around useless data is hurting performance,” I hear you say, “it’s 2014, use zero-copy already!” – OK. So basically there are two approaches for this on Linux: one cumbersome but rather old and known to work, and then there is the new and shiny sendfile interface.

For the splice approach, since either the source or the destination of a splice call must be a pipe (and in our case both are regular files), we need to create a pipe solely for the purpose of splicing data from in to the write end of the pipe, and then again splicing that same chunk from the read end to the out fd:

ssize_t pipe_splice(int in, int out)
{
    size_t bs = 65536;
    ssize_t w = 0, r = 0, t, n, m;
    int pipefd[2];
    int flags = SPLICE_F_MOVE | SPLICE_F_MORE;

    assert(pipe(pipefd) != -1);

    t = filesize(in);

    while(r < t && (n = splice(in, NULL, pipefd[1], NULL, bs, flags))) {
        if(n == -1) { assert(errno == EINTR); continue; }
        r += n;
        while(w < r && (m = splice(pipefd[0], NULL, out, NULL, bs, flags))) {
            if(m == -1) { assert(errno == EINTR); continue; }
            w += m;
        }
    }

    close(pipefd[0]);
    close(pipefd[1]);

    return w;
}

“This is not true zero copy!”, I hear you cry, and it’s true, the ‘page stealing’ mechanism has been discontinued as of 2007. So what we get is an “in-kernel memory copy”, but at least the file contents don’t cross the kernel/userspace boundary twice unnecessarily (we don’t inspect them anyway, right?).

The sendfile() approach is more immediate and clean:

ssize_t do_sendfile(int in, int out)
{
    ssize_t t = filesize(in);
    off_t ofs = 0;

    while(ofs < t) {
        if(sendfile(out, in, &ofs, t - ofs) == -1) {
            assert(errno == EINTR);
            continue;
        }
    }

    return t;
}

So… do we get an actual performance gain?

sendfile                    159ms      168ms      175ms
pipe+splice                 161ms      162ms      163ms  (+ 1.3%)
read+write 16bs             164ms      165ms      178ms  (+ 3.1%)

“Yes! I knew it!” you say. But I’m lying here. Every time I execute the benchmark, a different approach is the fastest. Sometimes the read/write approach comes in first before the two others. So it seems that this is not really a performance saver, is it? I like the sendfile() semantics, though. But beware:

In Linux kernels before 2.6.33, out_fd must refer to a socket. Since Linux 2.6.33 it can be any file. If it is a regular file, then sendfile() changes the file offset appropriately.

Strangely, sendfile() works on regular files in the default Debian Squeeze kernel (2.6.32-5) without problems. –
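Given that caveat, a defensive pattern is to attempt sendfile() once and fall back to a plain read/write loop if the kernel refuses regular files. This is a sketch of mine (copy_fd_fallback and rw_fallback are made-up names), not code from the dump handler:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy len bytes from the start of in to out via read()/write(). */
static ssize_t rw_fallback(int in, int out, size_t len)
{
    char buf[65536];
    size_t done = 0;

    if (lseek(in, 0, SEEK_SET) == -1)   /* we copy from offset 0 */
        return -1;

    while (done < len) {
        ssize_t n = read(in, buf, sizeof buf);
        if (n == -1) { if (errno == EINTR) continue; return -1; }
        if (n == 0) break;
        ssize_t w = 0;
        while (w < n) {
            ssize_t m = write(out, buf + w, n - w);
            if (m == -1) { if (errno == EINTR) continue; return -1; }
            w += m;
        }
        done += n;
    }
    return (ssize_t) done;
}

/* Prefer sendfile(); on kernels where out_fd must be a socket it
 * fails with EINVAL (or ENOSYS), and we retry with the plain loop. */
ssize_t copy_fd_fallback(int in, int out, size_t len)
{
    off_t ofs = 0;
    size_t done = 0;

    while (done < len) {
        ssize_t n = sendfile(out, in, &ofs, len - done);
        if (n == -1) {
            if (errno == EINTR) continue;
            if (done == 0 && (errno == EINVAL || errno == ENOSYS))
                return rw_fallback(in, out, len);
            return -1;
        }
        if (n == 0) break;
        done += n;
    }
    return (ssize_t) done;
}
```

Note the fallback only kicks in before any bytes have been copied, so the two code paths never interleave.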

“But,” I hear you saying, “the system has no clue what your intentions are, give it a few hints!” and you are probably right, that shouldn’t hurt:

void advice(int in, int out)
{
    ssize_t t = filesize(in);
    posix_fadvise(in, 0, t, POSIX_FADV_WILLNEED);
    posix_fadvise(in, 0, t, POSIX_FADV_SEQUENTIAL);
}

But since the file is very probably fully cached, the performance is not improved significantly. “BUT you should supply a hint on how much you will write, too!” – And you are right. And this is where the story branches off into two cases: old and new file systems.

I’ll just tell the kernel that I want to write t bytes to disk now, and please reserve space (I don’t care about a “disk full” that I could catch and act on):

void do_falloc(int in, int out)
{
    ssize_t t = filesize(in);
    posix_fallocate(out, 0, t);
}

I’m using my workstation’s SSD with XFS now (not my laptop any more). Suddenly everything is much faster, so I’ll simply run the benchmarks on a 512MB file so that it actually takes time:

sendfile + advices + falloc            205ms      208ms      208ms
pipe+splice + advices + falloc         207ms      209ms      210ms  (+ 1.0%)
sendfile                               226ms      226ms      229ms  (+ 10.2%)
pipe+splice                            227ms      227ms      231ms  (+ 10.7%)
read+write 16bs + advices + falloc     235ms      240ms      240ms  (+ 14.6%)
read+write 16bs                        258ms      259ms      263ms  (+ 25.9%)

Wow, so this posix_fallocate() thing is a real improvement! It seems reasonable enough, of course: the file system can already prepare an – if possible contiguous – sequence of blocks in the requested size. But wait! What about Ext3? Back to the laptop:

sendfile                               161ms      171ms      194ms
read+write 16bs                        164ms      174ms      189ms  (+ 1.9%)
pipe+splice                            167ms      170ms      178ms  (+ 3.7%)
read+write 16bs + advices + falloc     224ms      229ms      229ms  (+ 39.1%)
pipe+splice + advices + falloc         229ms      239ms      241ms  (+ 42.2%)
sendfile + advices + falloc            232ms      235ms      249ms  (+ 44.1%)

Bummer. That was unexpected. Why is that? Let’s check strace while we execute this program:

fallocate(1, 0, 0, 134217728)           = -1 EOPNOTSUPP (Operation not supported)
...
pwrite(1, "\0", 1, 4095)                = 1
pwrite(1, "\0", 1, 8191)                = 1
pwrite(1, "\0", 1, 12287)               = 1
pwrite(1, "\0", 1, 16383)               = 1
...

What? Who does this? – Glibc does this! It sees the syscall fail and re-creates the semantics by hand. (Beware, Glibc code follows. Safe to skip if you want to keep your sanity.)

/* Reserve storage for the data of the file associated with FD.  */
int
posix_fallocate (int fd, __off_t offset, __off_t len)
{
#ifdef __NR_fallocate
# ifndef __ASSUME_FALLOCATE
  if (__glibc_likely (__have_fallocate >= 0))
# endif
    {
      INTERNAL_SYSCALL_DECL (err);
      int res = INTERNAL_SYSCALL (fallocate, err, 6, fd, 0,
                                  __LONG_LONG_PAIR (offset >> 31, offset),
                                  __LONG_LONG_PAIR (len >> 31, len));

      if (! INTERNAL_SYSCALL_ERROR_P (res, err))
        return 0;

# ifndef __ASSUME_FALLOCATE
      if (__glibc_unlikely (INTERNAL_SYSCALL_ERRNO (res, err) == ENOSYS))
        __have_fallocate = -1;
      else
# endif
        if (INTERNAL_SYSCALL_ERRNO (res, err) != EOPNOTSUPP)
          return INTERNAL_SYSCALL_ERRNO (res, err);
    }
#endif

  return internal_fallocate (fd, offset, len);
}

And you guessed it, internal_fallocate() just does a one-byte pwrite() in every block until the space requirement is fulfilled. This slows things down considerably. This is bad. –

“But other people just truncate the file! I saw this!”, you interject, and again you are right.

void enlarge_truncate(int in, int out)
{
    ssize_t t = filesize(in);
    ftruncate(out, t);
}

Indeed the truncate versions work faster on Ext3:

pipe+splice + advices + trunc        157ms      158ms      160ms
read+write 16bs + advices + trunc    158ms      167ms      188ms  (+ 0.6%)
sendfile + advices + trunc           164ms      167ms      181ms  (+ 4.5%)
sendfile                             164ms      171ms      193ms  (+ 4.5%)
pipe+splice                          166ms      167ms      170ms  (+ 5.7%)
read+write 16bs                      178ms      185ms      185ms  (+ 13.4%)

Alas, not on XFS. There, the fallocate() system call is just more performant. (You can also use xfsctl directly for that.) –

And this is where the story ends.

In place of a sweeping conclusion, I’m a little bit disappointed that there seems to be no general semantics to say “I’ll write n bytes now, please be prepared”. Obviously, using posix_fallocate() on Ext3 hurts very much (this may be why cp is not employing it). So I guess the best solution is still something like this:

if(fallocate(out, 0, 0, len) == -1 && errno == EOPNOTSUPP)
    ftruncate(out, len);
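Spelled out as a complete helper with error handling (prealloc is my name for it; the fallback logic is the same as in the snippet above):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Reserve len bytes in out: use the real fallocate() where the file
 * system supports it (XFS, Ext4, ...), otherwise just extend the file
 * with ftruncate() (Ext3), avoiding glibc's slow pwrite() emulation. */
int prealloc(int out, off_t len)
{
    if (fallocate(out, 0, 0, len) == 0)
        return 0;
    if (errno == EOPNOTSUPP || errno == ENOSYS)
        return ftruncate(out, len);
    return -1;
}
```

Calling the raw fallocate() syscall directly, instead of posix_fallocate(), is the whole point: glibc never gets the chance to emulate it block by block.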

Maybe you have another idea how to speed up the writing process? Then drop me an email, please.

Update 2014-05-03: Coming back after a couple of days’ vacation, I found the post was on Hacker News and generated some 23k hits here. I corrected the small mistake in example 2 (as pointed out in the comments – thanks!). – I trust that the diligent reader will have noticed that this is not a complete survey of either I/O hierarchy, file system and/or hard drive performance. It is, as the subtitle should have made clear, a “tale about Linux file write patterns”.

Update 2014-06-09: Sebastian pointed out an error in the mmap write pattern (the write should start at p + w, not at p). Also, the basic read/write pattern contained a subtle error. Tricky business – Thanks!
