Tools for Python performance testing and tuning
Built into Python
In short:
python -m cProfile -s time your_scripts.py
The original article follows:
Writing programs is fun, but making them fast can be a pain. Python programs are no exception to that, but the basic profiling toolchain is actually not that complicated to use. Here, I would like to show you how you can quickly profile and analyze your Python code to find what part of the code you should optimize.
What's profiling?
Profiling a Python program is doing a dynamic analysis that measures the execution time of the program and everything that composes it. That means measuring the time spent in each of its functions. This will give you data about where your program is spending time, and what area might be worth optimizing.
It's a very interesting exercise. Many people focus on local optimizations, such as determining e.g. which of the Python functions range or xrange is going to be faster. It turns out that knowing which one is faster may never be an issue in your program, and that the time gained by one of the functions above might not be worth the time you spend researching that, or arguing about it with your colleague.
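As an aside, that kind of micro-benchmark is usually done with the standard-library timeit module rather than a profiler; a minimal sketch (on Python 2 you could swap in xrange to compare the two):

import timeit

# timeit returns the total seconds taken by `number` executions of
# the statement; the iteration counts here are arbitrary.
print(timeit.timeit('for i in range(1000): pass', number=10000))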
Trying to blindly optimize a program without measuring where it is actually spending its time is a useless exercise. Following your gut alone is not always sufficient.
There are many types of profiling, as there are many things you can measure. In this exercise, we'll focus on CPU utilization profiling, meaning the time spent by each function executing instructions. Obviously, we could do many more kinds of profiling and optimizations, such as memory profiling, which would measure the memory used by each piece of code – something I talk about in The Hacker's Guide to Python.
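As a quick illustration of that other kind of profiling, Python 3 ships a tracemalloc module in the standard library that can attribute memory usage to source lines; a minimal sketch (the workload being measured is just a stand-in):

import tracemalloc

tracemalloc.start()
# Stand-in workload: allocate a bunch of strings.
data = [str(i) * 10 for i in range(100000)]
snapshot = tracemalloc.take_snapshot()
# Show the five source lines responsible for the most memory.
for stat in snapshot.statistics('lineno')[:5]:
    print(stat)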
cProfile
Since Python 2.5, Python provides a C module called cProfile which has a reasonable overhead and offers a good enough feature set. The basic usage goes down to:
>>> import cProfile
>>> cProfile.run('2 + 2')
2 function calls in 0.000 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Though you can also run a script with it, which turns out to be handy:
$ python -m cProfile -s cumtime lwn2pocket.py
72270 function calls (70640 primitive calls) in 4.481 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.004 0.004 4.481 4.481 lwn2pocket.py:2(<module>)
1 0.001 0.001 4.296 4.296 lwn2pocket.py:51(main)
3 0.000 0.000 4.286 1.429 api.py:17(request)
3 0.000 0.000 4.268 1.423 sessions.py:386(request)
4/3 0.000 0.000 3.816 1.272 sessions.py:539(send)
4 0.000 0.000 2.965 0.741 adapters.py:323(send)
4 0.000 0.000 2.962 0.740 connectionpool.py:421(urlopen)
4 0.000 0.000 2.961 0.740 connectionpool.py:317(_make_request)
2 0.000 0.000 2.675 1.338 api.py:98(post)
30 0.000 0.000 1.621 0.054 ssl.py:727(recv)
30 0.000 0.000 1.621 0.054 ssl.py:610(read)
30 1.621 0.054 1.621 0.054 {method 'read' of '_ssl._SSLSocket' objects}
1 0.000 0.000 1.611 1.611 api.py:58(get)
4 0.000 0.000 1.572 0.393 httplib.py:1095(getresponse)
4 0.000 0.000 1.572 0.393 httplib.py:446(begin)
60 0.000 0.000 1.571 0.026 socket.py:410(readline)
4 0.000 0.000 1.571 0.393 httplib.py:407(_read_status)
1 0.000 0.000 1.462 1.462 pocket.py:44(wrapped)
1 0.000 0.000 1.462 1.462 pocket.py:152(make_request)
1 0.000 0.000 1.462 1.462 pocket.py:139(_make_request)
1 0.000 0.000 1.459 1.459 pocket.py:134(_post_request)
[…]
This prints out all the functions called, with the time spent in each and the number of times they have been called.
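If you would rather inspect those numbers from Python instead of reading the raw dump, the standard-library pstats module can load and re-sort saved profile data; a minimal sketch (the profiled statement and the file name are arbitrary):

import cProfile
import pstats

# Save profile data to a file instead of printing it.
cProfile.run('sum(i * i for i in range(100000))', 'myscript.cprof')

# Reload it, sort by cumulative time (like -s cumtime) and show the
# ten most expensive entries.
stats = pstats.Stats('myscript.cprof')
stats.sort_stats('cumulative').print_stats(10)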
Advanced visualization with KCacheGrind
While being useful, the output format is very basic and does not make it easy to grab knowledge for complete programs. For more advanced visualization, I leverage KCacheGrind. If you did any C programming and profiling these last years, you may have used it, as it is primarily designed as a front-end for Valgrind-generated call-graphs.
In order to use it, you need to generate a cProfile result file, then convert it to the KCacheGrind format. To do that, I use pyprof2calltree.
$ python -m cProfile -o myscript.cprof myscript.py
$ pyprof2calltree -k -i myscript.cprof
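For what it's worth, pyprof2calltree also exposes this as a Python API, so the conversion can be scripted; a rough equivalent of the two commands above:

from pyprof2calltree import convert, visualize

# Write a KCacheGrind-readable file from the cProfile dump...
convert('myscript.cprof', 'myscript.kgrind')
# ...or launch KCacheGrind on the data directly, like the -k flag does.
visualize('myscript.cprof')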
And the KCacheGrind window magically appears!
[Screenshot: the KCacheGrind window displaying the profiling data]
Concrete case: Carbonara optimization
I was curious about the performance of Carbonara, the small timeseries library I wrote for Gnocchi. I decided to do some basic profiling to see if there was any obvious optimization to do.
In order to profile a program, you need to run it. But running the whole program in profiling mode can generate a lot of data that you don't care about, and adds noise to what you're trying to understand. Since Gnocchi has thousands of unit tests and a few for Carbonara itself, I decided to profile the code used by these unit tests, as it's a good reflection of basic features of the library.
Note that this is a good strategy for a curious and naive first-pass profiling. There's no way that you can make sure that the hotspots you will see in the unit tests are the actual hotspots you will encounter in production. Therefore, profiling under conditions and with a scenario that mimic what's seen in production is often a necessity if you need to push your program optimization further and want to achieve perceivable and valuable gain.
I activated cProfile using the method described above, creating a cProfile.Profile object around my tests (I actually started to implement that in testtools).
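For illustration, wrapping a block of code with a Profile object can look like the following sketch; this is not the testtools integration, and the helper name and the test body are hypothetical:

import contextlib
import cProfile

@contextlib.contextmanager
def profiled(output_file):
    # Collect profile data for the enclosed block and dump the raw
    # stats to a file usable by pstats or pyprof2calltree.
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        yield profiler
    finally:
        profiler.disable()
        profiler.dump_stats(output_file)

with profiled('test_fetch.cprof'):
    run_test_fetch()  # hypothetical stand-in for the actual test body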
I then ran KCacheGrind as described above. Using KCacheGrind, I generated the following figures.
[Figure: KCacheGrind function list for the test_fetch profile]
The test I profiled here is called test_fetch and is pretty easy to understand: it puts data in a timeseries object, and then fetches the aggregated result. The above list shows that 88% of the ticks are spent in set_values (44 ticks over 50). This function is used to insert values into the timeseries, not to fetch the values. That means that it's really slow to insert data, and pretty fast to actually retrieve it.
Reading the rest of the list indicates that several functions share the rest of the ticks: update, _first_block_timestamp, _truncate, _resample, etc. Some of the functions in the list are not part of Carbonara, so there's no point in looking to optimize them. The only thing that can be optimized is, sometimes, the number of times they're called.
[Figure: KCacheGrind call graph for the test_fetch profile]
The call graph gives me a bit more insight about what's going on here. Using my knowledge about how Carbonara works, I don't think that the whole stack on the left for _first_block_timestamp makes much sense. This function is supposed to find the first timestamp for an aggregate, e.g. with a timestamp of 13:34:45 and a period of 5 minutes, the function should return 13:30:00. The way it currently works is by calling the resample function from Pandas on a timeseries with only one element, but that seems to be very slow. Indeed, currently this function represents 25% of the time spent by set_values (11 ticks out of 44).
Fortunately, I recently added a small function called _round_timestamp that does exactly what _first_block_timestamp needs without calling any Pandas function, so no resample. So I ended up rewriting that function this way:
def _first_block_timestamp(self):
-    ts = self.ts[-1:].resample(self.block_size)
-    return (ts.index[-1] - (self.block_size * self.back_window))
+    rounded = self._round_timestamp(self.ts.index[-1], self.block_size)
+    return rounded - (self.block_size * self.back_window)
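For context, the rounding itself needs nothing more than integer arithmetic on nanoseconds; the following is only a sketch of the idea, not Carbonara's actual _round_timestamp:

import pandas

def _round_timestamp(ts, freq):
    # Round `ts` down to a multiple of `freq` (e.g. 13:34:45 with a
    # 5-minute frequency becomes 13:30:00) without going through
    # Pandas' resample machinery.
    freq_ns = pandas.Timedelta(freq).value  # frequency in nanoseconds
    ts_ns = pandas.Timestamp(ts).value      # nanoseconds since epoch
    return pandas.Timestamp((ts_ns // freq_ns) * freq_ns)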
And then I re-ran the exact same test to compare the output of cProfile.
[Figure: KCacheGrind function list after the optimization]
The list of functions seems quite different this time. The share of time spent in set_values dropped from 88% to 71%.
[Figure: KCacheGrind call stack for set_values after the optimization]
The call stack for set_values shows that pretty well: we can't even see the _first_block_timestamp function, as it is so fast that it totally disappeared from the display. It's now being considered insignificant by the profiler.