I/O performance in Python - Python Performance 5 of n

I/O performance in Python

The Problem

I am writing a small python script to keep track of various events and messages. It uses a flat file as an index, each record being of the same size and containing details about each message.

This file can get large, in the order of several hundreds of megabytes. The python code is trivial, given that each record is exactly the same size, but what is the fastest way to access and use that index file?

With python (or any programming language, for what it's worth), I have plenty of ways to read a file:

  • I can just rely on read and io.read in python having perfectly good buffering, and just read (or io.read) a record at a time.
  • I can read it in one go, and then operate in memory (eg, a single read in a string, followed by using offsets within the string).
  • I can do my own buffering, read a large chunk at a time, and then operate on each chunk as a set of records (eg, multiple read of some multiple of the size of the record, followed by using offsets within each chunk).
  • I can use fancier libraries that allow me to mmap or use some other crazy approach.

and even more ways to parse it. For now, I just wanted to focus on the reading part, as in my case parsing was a simple split or struct.unpack.
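
Just to give an idea of what the parsing looks like, a record handler is little more than the sketch below. Note that the 16-byte layout here (two unsigned 64-bit fields) is made up purely for illustration, my real records obviously differ:

import struct

# Hypothetical record layout: a 64-bit timestamp and a 64-bit offset,
# packed little-endian into exactly 16 bytes.
RECORD = struct.Struct("<QQ")

def parse(record):
  timestamp, offset = RECORD.unpack(record)
  return timestamp, offset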

The Quest

Disclaimer: most of the examples below are oversimplified and not particularly beautiful. I should have used the with construct, checked for exceptions, or preferred list operations over string operations at times. I was trying to avoid any distractions, by keeping the examples as similar as possible. Please bear with me, and keep your pythonic feelings at bay, if you can :), but do please provide suggestions if you have any!

The first question I wanted to answer was "is solving this problem even worth spending time on? Why can't I just write the simplest code, using read, and be done with it?".

So that's where I started from:

f = open(sys.argv[1], "rb")
for i in itertools.count():
  record = f.read(16)
  if not record:
    break

print i

The counter, printing i, is just to verify that all the records have been read correctly.

The file I've used for benchmarks is about 1.4 Gb in size:

$ ls -al ./test-index.bin
-rw-r--r-- 1 rabexc users 1493893120 Dec 16  2010 ./test-index.bin

My system has about 4 Gb of ram, less than 1 Gb of which is used in steady state:

$ free
             total       used       free     shared    buffers     cached
Mem:       4039164    2363280    1675884          0      38128    1697472
-/+ buffers/cache:     627680    3411484
Swap:      6291452        140    6291312

Given that I'm not using the system for anything else but this test, this means that after I read the file a few times, it should live in cache, and always be accessed from memory.

So, how do I time the code? Well, a few runs of the script with time should do it:

$ time python /tmp/snippets/snippet00.py ./test-index.bin
93368320
real    0m54.960s
user    0m53.715s
sys     0m0.620s

(this is like the 3rd run, which is important to make sure that all the important bits, including the test file, are read from cache rather than actual disk, so I can benchmark the code, and not my hard drive).

This means 55 seconds were spent running the python snippet above, and 0.6 seconds were spent by the operating system to provide the content of the file.

This is quite disappointing: I was hoping for my code to take < 10 seconds. How fast can this go at best? Let's try comparing this number with how long cat takes to read the file:

$ time cat ./test-index.bin &> /dev/null
real    0m0.531s
user    0m0.016s
sys     0m0.512s

Well, a plain cat takes about half a second, and I'm sure there are better ways to read a file than what cat does! It ought to be possible to speed up my python code!

Given that I don't know anything about python internals, let's start by looking at what cat is doing internally with a simple strace:

fadvise64_64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
...
read(3, "QFI\373\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\20\0\0\0\1@\0\0\0"..., 32768) = 32768
write(1, "QFI\373\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\20\0\0\0\1@\0\0\0"..., 32768) = 32768
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
read(3, "\200\0\0\0\0\6\0\0\200\0\0\0\0\214\0\0\200\0\0\0\1\21\0\0\200\0\0\0\1\226\0\0"..., 32768) = 32768
write(1, "\200\0\0\0\0\6\0\0\200\0\0\0\0\214\0\0\200\0\0\0\1\21\0\0\200\0\0\0\1\226\0\0"..., 32768) = 32768
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
read(3, "\0\0\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
write(1, "\0\0\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
...

So:

  1. It tells the operating system it intends to read the file sequentially (fadvise).
  2. It reads the file in chunks of 32K.
  3. Which it immediately writes out to standard output (a rough python equivalent is sketched below).
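
For reference, a rough python equivalent of that strategy would look like the sketch below. Note that os.posix_fadvise only exists from python 3.3 onwards, so this is meant to illustrate the approach, not to be a faithful port of cat:

import os
import sys

fd = os.open(sys.argv[1], os.O_RDONLY)
# Hint to the kernel that the file will be read sequentially, like cat does.
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
while True:
  chunk = os.read(fd, 32768)  # 32K per read, same as cat
  if not chunk:
    break
  os.write(1, chunk)          # copy straight to stdout
os.close(fd)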

The Results

Unless you are interested in spending the next 30 mins of your life learning about experiments that turned out to be of very little help, let me summarize the results first:

  • I now believe that python is actually quite fast at reading the file. The issue seems to be in accessing the buffer 16 bytes at a time. A simple substring using slice syntax over a large buffer costs about 300 ns; multiplied by ~100,000,000 iterations, that adds up to ~30 seconds.

  • I could not find any good way to speed this part up. Anything I did outside the example snippets ended up paying roughly the same cost. Note also that most of the APIs take strings as input, so it is hard to use anything else.

  • Eliminating this cost in example snippets of code was tricky, but brought the performance to be in the same order of magnitude as that of cat.

  • Not sure even eliminating this cost would buy much. Sure, it will make the benchmark faster, but any sort of processing in the middle of the loop will likely end up costing something.

  • Different ways to actually read the file did bring some improvements. Eg, I went from ~30-40 seconds with the best approaches, to about 2 minutes with the worst ones. Given that the naive code takes about 50 seconds, the 10 or so seconds saved are probably not worth any sort of extra complexity in your code.

The Experiments

Naive open with read

You have already seen this snippet of code:

f = open(sys.argv[1], "rb")
for i in itertools.count():
  record = f.read(16)
  if not record:
    break

print i

Running this script takes about 54 seconds. Just to put this in perspective, it is about 100 times slower than cat, although it is really not doing anything with each record.

$ time python /tmp/snippets/snippet00.py ./test-index.bin
93368320
real    0m54.960s
user    0m53.715s
sys     0m0.620s

If I run strace on this script, I see that internally python is reading 4K at a time, and buffering. Good!

read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096

This means it is using a buffer of about 4K, 8 times smaller than the one cat uses. If I increase the buffer size to 32K, with:

f = open(sys.argv[1], "rb", 32768)

it is still taking ~48 seconds, which means that most of the time is still, well, spent by python itself, with sys time still ~0.5 seconds.

If I further increase the buffer size to 1 Mb and 10 Mb, performance does not improve.

But if I leave a large buffer, and also increase the f.read() size from 16 bytes to 32 K or 1 Mb, run time goes down to about 3 seconds.
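
For reference, that fast variant is just the naive snippet with both knobs turned up; a sketch is below (note that the counter now counts 1 Mb chunks rather than records):

import itertools
import sys

f = open(sys.argv[1], "rb", 1048576)   # 1 Mb of buffering
for i in itertools.count():
  chunk = f.read(1048576)              # and 1 Mb per f.read() call
  if not chunk:
    break

print i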

Given that the buffer size controls how many times the read syscall is invoked, while the f.read() size controls how many times the loop body, and thus the python code performing the read, gets executed, it seems like the cost is dominated by the python side of the read.

This is where knowing more about the python internals would come in handy, but by peeking at the python process with a profiler, I don't see that much, except for the fact that python itself is spending lots of CPU, which doesn't sound very surprising.

 35.17%  [.] 0x000bd586
 30.86%  [.] PyEval_EvalFrameEx
  9.15%  [.] PyDict_GetItem
  6.22%  [.] _PyObject_GenericGetAttrWithDict
  2.98%  [.] PyDict_SetItem
  2.13%  [.] Py_UniversalNewlineFread
  2.06%  [.] PyCFunction_Call
  1.90%  [.] PyObject_Malloc
  1.83%  [.] PyObject_GetAttr
  1.78%  [.] PyString_FromStringAndSize
  1.76%  [.] PyObject_Free
  1.37%  [.] PyEval_RestoreThread
  1.30%  [.] PyTuple_New
  0.68%  [.] PyEval_SaveThread
  0.58%  [.] _PyArg_ParseTuple_SizeT
  0.24%  [.] getresgid@plt
  0.00%  [.] _PyModule_Clear
  0.00%  [.] PyType_ClearCache

So the question arises: does it get faster if, for example, I use io.read instead of the plain read? And what if I use os.read on file descriptors instead?

io.open and read

This is what the code looks like:

f = io.open(sys.argv[1], "rb")
for i in itertools.count():
  record = f.read(16)
  if not record:
    break
print i

and this is how fast it is:

$ time python /tmp/snippets/snippet01.py ./test-index.bin
93368320

real    0m42.741s
user    0m42.039s
sys     0m0.636s

I was not expecting the io module to be any faster, really, especially given that, according to strace, it uses the same buffer size. This seems to indicate that the python code path for io.read() is indeed slightly faster? Oh well, a 10 second gain doesn't make it particularly exciting.

open and os.read

Let's try with a simple os.read. Given that it is the lowest level function I have available, I expect it to have the least amount of python code in it. Although I know nothing about python internals, I would expect plain read and io.read to be implemented on top of os.read.

However, calling os.read directly to get 16 bytes at a time will incur heavy syscall costs. But let's try:

f = open(sys.argv[1], "rb")
fd = f.fileno()
for i in itertools.count():
  record = os.read(fd, 16)
  if not record:
    break

print i

And here's how long it takes:

$ time python /tmp/snippets/snippet02.py ./test-index.bin
93368320

real    1m53.372s
user    0m58.292s
sys     0m54.927s

Which is roughly twice as slow as any other solution so far. Unsurprisingly, this means that buffering is important. Note, however, how the sys time has gone up significantly from previous tests, from ~half a second to about 54 seconds, while the user time is pretty much unchanged.

This probably means that we are paying roughly the same cost within python itself, plus the cost of performing many more syscalls, asking the kernel for the file 16 bytes at a time.

But what if I increase buffering here? And read 1 Mb at a time?

f = open(sys.argv[1], "rb")
fd = f.fileno()
for i in itertools.count():
  record = os.read(fd, 1048576)
  if not record:
    break

print i

Time goes down significantly:

$ time python ./snippet02.py /opt/vms/pool/test-index.bin 
1425

real  0m1.446s
user  0m0.012s
sys 0m1.424s

So, where is the cost? Is it in the read function code path, or is it in working 16 bytes at a time in python?

Given that by using a plain read with 1 Mb at a time the run time went down to about 3 seconds, I expect the difference in cost between os.read and read to be that ~1.5 seconds.

So, what about the 40+ seconds caused by reading 16 bytes at a time?

Let's see what happens if I os.read() 1 Mb at a time, and then work on 16 bytes at a time. Here's what the code looks like:

f = open(sys.argv[1], "rb")
fd = f.fileno()
offsets = range(0, 1048765, 16)
i = 0
while True:
  buff = os.read(fd, 1048576)
  if not buff:
    break

  for offset in offsets:
    record = buff[offset:offset + 16]
    i += 1

print i

And here are the timings:

$ time python ./snippet05.py /opt/vms/pool/test-index.bin 
93405900

real    0m38.921s
user    0m37.328s
sys     0m1.520s

What makes these timings interesting is that if I comment out the inner loop, it takes 1.5 seconds once again. So the 37 extra seconds are spent by the inner loop!

If I turn the body of the loop into a simple pass, the run time goes down to about 6 seconds, while if I leave only the record = buff[offset:offset + 16] instruction, this becomes ~30 seconds.

This means that computing the substring is really what I need to make faster, and costs about 300 ns.
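
That figure is easy to sanity check with timeit, with no file I/O involved at all; a minimal sketch (the buffer content is irrelevant here, only its size matters):

import timeit

setup = "buff = b'a' * 1048576; offset = 524288"
stmt = "record = buff[offset:offset + 16]"

# Average cost, in seconds, of slicing 16 bytes out of a 1 Mb buffer.
number = 10000000
total = timeit.timeit(stmt, setup=setup, number=number)
print total / number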

Reading the whole file in one go

Here is what the code looks like:

f = open(sys.argv[1], "rb")
data = f.read()

for i in itertools.count():
  start = i * 16
  end = start + 16
  if start >= len(data):
    break
  record = data[start:end]

print i

and here is the output:

$ time python /tmp/snippets/snippet03.py test-index.bin
93368320

real    1m12.975s
user    1m11.120s
sys     0m1.720s

which is not that surprising: this is actually much slower than any other read approach so far. If the cost is in string processing, we may be paying that cost multiple times now: once to fill data, reading one buffer at a time, and once every time I create a record object from data.

In fact, if I comment out the record = data[start:end] line, the time goes back to almost normal:

$ time python /tmp/snippets/snippet03.py test-index.bin
93368320

real    0m48.553s
user    0m44.427s
sys     0m1.764s

io.open and readinto

By peeking at pydoc io, there seems to be a method readinto that reads data into a bytearray instead of a string, so let's try it:

b = bytearray(16) 
f = io.open(sys.argv[1], "rb")
for i in itertools.count():
  numread = f.readinto(b)
  if not numread:
    break

print i

The first experiment doesn't really go well; this is worse than anything I've seen before:

$ time python /tmp/snippets/snippet04.py test-index.bin
93368320

real    1m40.463s
user    1m38.794s
sys     0m0.648s

By looking at strace, it seems to be doing lots of read(4096) calls, which probably means it reads into an internal buffer, and then copies 16 bytes at a time into the bytearray:

read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096

Which seems to be slower than reading into a plain string? Interesting.

Let's try with a larger bytearray, like 1048576, and see what happens instead:

b = bytearray(1048576)
# b = bytearray(16)
f = io.open(sys.argv[1], "rb")
for i in itertools.count():
  numread = f.readinto(b)
  if not numread:
    break

print i

run time is now:

$ time python /tmp/snippets/snippet04.py test-index.bin
1425

real    0m3.926s
user    0m0.220s
sys     0m1.568s

4 seconds, which is comparable to what I had before with large reads.

If I look with strace, mostly out of curiosity, here's what happens:

mmap2(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7296000
read(3, "\277\267\330\5\301\224\230\374\212\335Y\354\326a\237K\217][\354\334a\257\376c\357,v\352\260\213\364"..., 1048576) = 1048576
munmap(0xb7296000, 1052672)             = 0

which looks like it's reading directly into the bytearray.

What is surprising is the mmap. As pointed out on the reddit thread, the mmap and munmap are used to allocate memory (the flags indicate anonymous memory, and the file descriptor is -1). This might simply be related to the malloc() call in Linux glibc, which according to the man page will use mmap instead of sbrk for any allocation larger than MMAP_THRESHOLD, by default set to 128 K.

What I find suspicious here is that if you look at strace, you see a continuous pattern of mmap and munmap. This probably means a malloc followed by a free shortly after, but I would have expected a good memory allocator to "cache" allocations within the process for some time before returning memory to the operating system, although it's been a while since I've looked at the implementation of malloc.
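
If one wanted to experiment with this, glibc allows raising that threshold at run time through mallopt, so that the 1 Mb buffers are served from the heap (and recycled by the allocator) instead of triggering an mmap/munmap pair on every allocation. A sketch using ctypes, assuming Linux and glibc; the M_MMAP_THRESHOLD constant comes from glibc's malloc.h, as it is not exposed by the python standard library:

import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

M_MMAP_THRESHOLD = -3  # from glibc's malloc.h

# Ask glibc to serve allocations up to 4 Mb from the heap instead of
# a fresh mmap/munmap pair each time. mallopt returns 1 on success.
if libc.mallopt(M_MMAP_THRESHOLD, 4 * 1024 * 1024) != 1:
  print "mallopt failed"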

So, without actually looking at the python code, here's my theory of the internals:

  • If "bytearray > buffer", read directly into bytearray.
  • Otherwise read into a buffer of configured size and copy the requested data into bytearray.

Let's try with a bytearray of 8192 bytes, which is twice as much as the buffer size:

$ time python /tmp/snippets/snippet04.py test-index.bin
182360

real    0m1.256s
user    0m0.704s
sys     0m0.548s

From strace, this is reading 8K at a time, presumably into the bytearray directly, given the run time.

Let's try one more time, with buffering disabled (note the 0 in io.open):

b = bytearray(16)
f = io.open(sys.argv[1], "rb", 0)
for i in itertools.count():
  numread = f.readinto(b)
  if not numread:
    break

print i

With strace, I can now clearly see that python is reading 16 bytes at a time directly:

read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 16
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 16
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 16

However, this is far slower than I was expecting:

$ time python /tmp/snippets/snippet04.py test-index.bin
93368320

real    2m4.992s
user    1m10.476s
sys     0m54.287s

Note the sys time at 54 seconds, probably because of the repeated read calls. The 1 minute and 10 seconds of user time is probably due to the high number of iterations? I would not have expected readinto to be slower than the other primitives.

With python3

Those were mostly run out of curiosity, with the default buffer size and the most naive code.

Naive open and read

$ time python3 /tmp/snippets/snippet00.py test-index.bin

real    0m32.056s
user    0m30.662s
sys     0m1.300s

io.open and read

$ time python3 /tmp/snippets/snippet01.py test-index.bin
93368320

real    0m38.533s
user    0m37.862s
sys     0m0.588s

open and os.read

$ time python3 /tmp/snippets/snippet02.py test-index.bin
93368320

real    2m6.902s
user    1m11.432s
sys     0m55.223s

Reading the file in one go

$ time python3 /tmp/snippets/snippet03.py test-index.bin
93368320

real    1m36.938s
user    1m33.706s
sys     0m1.712s

io.open and readinto

$ time python3 /tmp/snippets/snippet04.py test-index.bin
45590

real    0m0.595s
user    0m0.084s
sys     0m0.512s

From:
http://www.rabexc.org/posts/io-performance-in-python?tag=Technology