In this example [1], gprof data is from a run of gzip compression program. So, what does the profile say?

profile in p-prof

First, gzip spent most of its time calling functions treat_file(), main() and zip() - but not directly in these functions; rather, the magenta color encodes time spent below them - in the functions called from treat_file(), main() and zip(). Since that time is the same for 3 functions and since the time in function itself is insignificant in each case (no red bar) it's quite probable that one of these functions called the second and that one called the third (not necessarily in the order p-prof presents them - I'd bet treat_file() didn't call main(), for example... [2]).

Looking at main(), there's something fishy here: the total length of the bar (relative to the width of the pane) encodes the percentage of the total running time spent calling the function, so it looks like the program spent almost half its time in and below main() - but then, where did it spend the other half?! Clearly, we must not trust the profile blindly - it may be just plain wrong... Let's ignore the "dark time" for the moment and look closer.

focusing on the first function

Selecting the function treat_file() in the main list, we see that it was really called from main() (only) and called zip() and a bunch of other functions. One of the nice features of a profiler is that it may tell you which of the calls visible in the source code is important - in this case, the purpose of treat_file() is clearly to call zip() and everything else in treat_file() is just scaffolding.

OTOH for the purpose of speeding gzip up, the first three functions aren't very interesting - they're just the top of the stack, and time is really spent below.

first non-trivial (performance-wise) function

Let's select the first function with a red bar and look at the text lines for deflate() and its only caller zip(). Parent text says zip called deflate 2368 times. These calls took 9.66 seconds in deflate plus 18.47 seconds in deflate's children, while at the bottom of main pane we're informed that The program spent 9.66 seconds in deflate and another 18.47 seconds in its children. deflate was called 2368 times.

So, the critical path goes from zip() to deflate(). It branches there - flush_block() and fill_window() (with their respective children) called from deflate() are each responsible for almost as much time as deflate() by itself. Text line for flush_block() (as a child of deflate()) is flush_block was called from deflate 4952 times. These calls took 0.00 seconds in flush_block plus 7.50 seconds in its children. while fill_window() has the text fill_window was called from deflate 4443 times. These calls took 4.95 seconds in fill_window plus 2.43 seconds in its children. Tooltips say it more tersely: 0.00+7.5 sec, 4952/4952 calls for flush_block(), 4.95+2.43 sec, 4443/4733 calls for fill_window(). The number after '/' is a total number of calls to the function - we see that fill_window() had also other callers than deflate(). These extra callers are apparently the reason why fill_window() is listed after flush_block() in the children pane, but before it in the main list. Even so, the graphics looks basically the same - deflate() is the principal caller of flush_block() and fill_window().

To optimize this part of the program, we may try a couple of strategies: speed up deflate(), or speed up its children, or decrease the number of times deflate() calls them (in this case, the latter should be possible by simply increasing the size of some buffers, but whether it would actually speed the program isn't that clear - try it out).

Also, before trying to speed up deflate(), we should look at functions listed after if - they look like much better targets for micro-optimization.

the critical function

One third total time in a single function - short_loop() is obviously the place to optimize! - except the authors of gzip didn't need p-prof to tell them that... short_loop() (as well as the next function, do_scan()) is already written in (presumably hand-optimized - who am I to say?) assembly language. That's why p-prof didn't pick up its callers (parents pane is empty), and that explains the time missing in main() - short_loop() and do_scan() together account for 55.90% running time, which isn't included in functions above them (at least main(), treat_file(), zip() and deflate() - others may exist but cannot be found without reading gzip sources, which seems like far too much work to me).

So, in conclusion, gzip is already optimized. I would expect nothing less from FSF, but with p-prof, you can see for yourself.


[1] Based on p-prof 0.08. Later versions look a bit different but really just a bit...

[2] It would be nice if p-prof presented functions taking equal time in their stack order. It would also be a lot of work... :-)