gcc Option Needed for gprof

最新推荐文章于 2022-07-13 08:35:51 发布

原创最新推荐文章于 2022-07-13 08:35:51 发布 · 993 阅读

0 ·

CC 4.0 BY-SA版权

文章标签：

#gcc #function #output #initialization #profiling #each

C/C++ 同时被 2 个专栏收录

167 篇文章

订阅专栏

Unix/Linux

77 篇文章

订阅专栏

本文详细介绍了如何使用gprof和gcov工具来分析和优化程序性能。通过编译时的参数设置，生成性能报告，进而识别程序的瓶颈并进行针对性优化。以一个简单的程序为例，展示如何通过引入sqrt函数来显著提高程序运行速度，并使用kprof工具进行图形化展示。文章还展示了如何利用gprof和gcov进行代码覆盖率分析，从而进一步提升程序效率。

来自《Linux® Debugging and Performance Tuning: Tips and Techniques》ch1

重点如下：

1) 编译时使用 -pg 产生profile gmon.out

2) gprof -b a.out

3) 编译时使用-pg -fprofile-arcs -ftest-coverage 支持产生调用图

4) 运行程序后，使用gcov *.c产生调用图*.c.gcov

5) 使用kprof查看图形界面

gcc Option Needed for gprof

Before programs can be profiled using gprof, they must be compiled with the-pg gcc option. To get complete information about gprof, you can use the commandinfo gprof or man gprof.

Listing 1.5 shows the benefits that profiling can have on a small program. The sample1 program prints the prime numbers up to 50,000. You can use the output from gprof to increase this program's performance by changing the program to sample2, shown later in Listing 1.8.

Listing 1.5. sample1.c

1  #include <stdlib.h>
2  #include <stdio.h>
3
4  int prime (int num);
5
6  int main()
7  {
8   int i;
9   int colcnt = 0;
10  for (i=2; i <= 50000; i++)
11    if (prime(i)) {
12      colcnt++;
13      if (colcnt%9 == 0) {
14         printf("%5d\n",i);
15         colcnt = 0;
16      }
17    else
18       printf("%5d ", i);
19    }
20    putchar('\n');
21    return 0;
22  }
23
24  int prime (int num) {
25      /* check to see if the number is a prime? */
26      int i;
27      for (i=2; i < num; i++)
28      if (num %i == 0)
29        return 0;
30      return 1;
31  }

Building the sample1 Program and Using gprof

The sample1.c program needs to be compiled with the option -pg to have profile data generated, as shown in Figure 1.11.

Figure 1.11. Building and running sample1.

When the sample1 program is run, the gmon.out file is created.

To view the profiling data, the gprof utility must be on your system. If your system isrpm-based, the rpm command shows the version of gprof, as shown inFigure 1.12.

Figure 1.12. The version of gprof.

gprof is in the binutils package. For you to use the utility, the package must be installed on your system. One useful gprof option is-b. The -b option eliminates the text output that explains the data output provided by gprof:

# gprof -b ./sample1

The output shown in Listing 1.6 from gprof gives some high-level information like the total running time, which is 103.74 seconds. The main routine running time is 0.07 seconds, and the prime routine running time is 103.67 seconds. The prime routine is called 49,999 times.

Listing 1.6. Output from gprof for sample1

Flat profile:

Each sample counts as 0.01 seconds.
  %      cumulative   self              self     total
 time     seconds   seconds   calls  ms/call   ms/call   name
 99.93     103.67   103.67    49999    2.07      2.07    prime
  0.07     103.74     0.07                               main

                   Call graph

granularity: each sample hit covers 4 byte(s) for 0.01% of
103.74 seconds

index % time    self   children     called     name
                                                   <spontaneous>
[1]    100.0    0.07       103.67              main [1]
              103.67       0.00  49999/49999        prime [2]
------------------------------------------------
              103.67       0.00  49999/49999   main [1]
[2]     99.9  103.67       0.00  49999         prime [2]
------------------------------------------------

Index by function name
   [1] main                      [2] prime

Next we can use the gcov program to look at the actual number of times each line of the program was executed. (SeeChapter 2, "Code Coverage," for more about gcov.)

We will build the sample1 program with two additional options-fprofile-arcs and-ftest-coverage, as shown in Figure 1.13. These options let you look at the program using gcov, as shown inFigure 1.14.

Figure 1.13. Building sample1 with gcov options.

[View full size image]

Figure 1.14. Running sample1 and creating gcov output.

Running gcov on the source code produces the file sample1.c.gcov. It shows the actual number of times each line of the program was executed.Listing 1.7 is the output of gcov on sample1.

Listing 1.7. Output from gcov for sample1

        -:   0:Source:sample1.c
        -:   0:Graph:sample1.bbg
        -:   0:Data:sample1.da
        -:   1:#include <stdlib.h>
        -:   2:#include <stdio.h>
        -:   3:
        -:   4:int prime (int num);
        -:   5:
        -:   6:int main()
        1:   7: {
        1:   8:  int i;
        1:   9:  int colcnt = 0;
    50000:  10:  for (i=2; i <= 50000; i++)
    49999:  11:    if (prime(i)) {
     5133:  12:      colcnt++;
     5133:  13:      if (colcnt%9 == 0) {
      570:  14:  printf("%5d\n",i);
      570:  15:  colcnt = 0;
        -:  16:    }
        -:  17:  else
     4563:  18:    printf("%5d ", i);
        -:  19:    }
        1:  20:      putchar('\n');
        1:  21:      return 0;
        -:  22: }
        -:  23:
    49999:  24:int prime (int num) {
        -:  25:     /* check to see if the number is a prime?
                    */
    49999:  26:     int i;
121337004:  27:     for (i=2; i < num; i++)
121331871:  28:     if (num %i == 0)
    44866:  29:     return 0;
     5133:  30:     return 1;
        -:  31:     }
        -:  32:

There are 5,133 prime numbers. The expensive operations in the routine prime are thefor loop (line 27) and the if statement (line 28). The "hot spots" are the loop and the if test inside the prime routine. This is where we will work to increase the program's performance. One change that will help this program is to use thesqrt() function, which returns the nonnegative square root function of the number passed in. sample2, shown inListing 1.8, has been changed to use the sqrt function in the newly created function called faster.

Listing 1.8. sample2.c

1 #include <stdlib.h>
2 #include <stdio.h>
3 #include <math.h>
4
5 int prime (int num);
6 int faster (int num);
7
8 int main()
9 {
10   int i;
11   int colcnt = 0;
12   for (i=2; i <= 50000; i++)
13     if (prime(i)) {
14       colcnt++;
15      if (colcnt%9 == 0) {
16         printf("%5d\n",i);
17         colcnt = 0;
18      }
19     else
20         printf("%5d ", i);
21     }
22      putchar('\n');
23      return 0;
24 }
25
26 int prime (int num) {
27     /* check to see if the number is a prime? */
28    int i;
29     for (i=2; i <= faster(num); i++)
30     if (num %i == 0)
31       return 0;
32     return 1;
33  }
34
35 int faster (int num)
36 {
37  return (int) sqrt( (float) num);
38 }

Now you can build the sample2 program (see Figure 1.15) and use gprof to check how long the program will take to run (seeFigure 1.16). Also, the gcov output shows the reduced number of times each line needs to be executed. InListing 1.9, the total running time has been reduced from 103.74 seconds to 2.80 seconds.

Figure 1.15. Building and running sample2.

Figure 1.16. Using gprof on sample2.

Listing 1.9 shows the output of gprof for the sample2 program.

Listing 1.9. Output from gprof for sample2

Flat profile:

Each sample counts as 0.01 seconds.
%     cumulative    self               self       total
time    seconds     seconds   calls   us/call    us/call  name
52.68     1.48      1.48     1061109    1.39        1.39  faster
46.61     2.78      1.30       49999   26.10       55.60  prime
 0.71     2.80      0.02                               main

                Call graph

granularity: each sample hit covers 4 byte(s)  for 0.36% of 2.80
seconds

index   % time    self   children   called       name
                                                    <spontaneous>
[1]     100.0     0.02      2.78                  main [1]
                  1.30      1.48   49999/49999       prime [2]
---------------------------------------------------
                  1.30      1.48   49999/49999       main [1]
[2]      99.3     1.30      1.48   49999          prime [2]
                  1.48      0.00   1061109/1061109   faster [3]
---------------------------------------------------
                  1.48      0.00   1061109/1061109   prime [2]
[3]      52.7     1.48      0.00   1061109        faster [3]
---------------------------------------------------
Index by function name

[3] faster                    [1] main                     [2]
prime

Now we'll run gcov on the sample2 program, as shown in Figures 1.17 and 1.18.

Figure 1.17. Building sample2 with gcov and running sample2.

[View full size image]

Figure 1.18. Running sample2 and getting gcov output.

Listing 1.10 shows gcov output for the sample2 program.

Listing 1.10. Output of sample2.c.gcov

      -:   0:Source:sample2.c
      -:   0:Graph:sample2.bbg
      -:   0:Data:sample2.da
      -:   1:#include <stdlib.h>
      -:   2:#include <stdio.h>
      -:   3:#include <math.h>
      -:   4:
      -:   5:int prime (int num);
      -:   6:int faster (int num);
      -:   7:
      -:   8:int main()
      1:   9:{
      1:  10:   int i;
      1:  11:   int colcnt = 0;
  50000:  12:   for (i=2; i <= 50000; i++)
  49999:  13:      if (prime(i)) {
   5133:  14:      colcnt++;
   5133:  15:      if (colcnt%9 == 0) {
    570:  16:printf("%5d\n",i);
    570:  17:colcnt = 0;
      -:  18:     }
      -:  19:   else
   4563:  20:     printf("%5d ", i);
      -:  21:    }
      1:  22:      putchar('\n');
      1:  23:      return 0;
      -:  24: }
      -:  25:
  49999:  26:int prime (int num) {
      -:  27:     /* check to see if the number is a
                  prime? */
  49999:  28:    int i;
1061109:  29:     for (i=2; i <= faster(num); i++)
1055976:  30:     if (num %i == 0)
  44866:  31:       return 0;
   5133:  32:     return 1;
      -:  33:     }
      -:  34:
      -:  35:int faster (int num)
1061109:  36: {
1061109:  37:  return (int) sqrt( (float) num);
      -:  38: }
      -:  39:

The for loop in the prime routine has been reduced from 121 million executions to 1 million executions. Therefore, the total time has been reduced from 103.74 seconds to 2.80 seconds.

The tools gprof and gcov helped find the "hot spots" in this sample program. After the "hot spots" were found, the program was changed to increase its overall performance. It is interesting how changing a few lines of code can have a great impact on a program's performance.

Listing 1.11, sample3.cpp, has three different functions (1, 2, and 3). It shows a more complex use of profiling, with both flat and graphic profiles. We'll also use kprof, which can use gprof output. It presents the information in list or tree views, which make the information easier to understand when programs are more complicated. Let's start by building the sample3.cpp program and displaying the flat and graphic profiles and then displaying the data using kprof.

Listing 1.11. sample3.cpp

1  #include <iostream>
2
3  void function1(){
4     for(int i=0;i<1000000;i++);
5  }
6
7  void function2(){
8     function1();
9     for (int i=0;i<2000000;i++);
10 }
11
12 void function3(){
13    function1();
14    function2();
15    for (int i=0;i<3000000;i++);
16       function1();
17 }
18
19 int main(){
20    for(int i=0;i<10;i++)
21    function1();
22
23    for (int i=0;i<5000000;i++);
24
25    for(int i=0;i<10;i++)
26       function2();
27       for(int i=0; i<13;i++);
28          {
29          function3();
30          function2();
31          function1();
32          }
33 }

Figure 1.19 shows the commands used to build and run the sample3 program. gprof is also run on sample3 to get the profile data from sample3.

Figure 1.19. Building and capturing gprof output for sample3.

We won't use the -b option on the gprof output on the sample3 program so that we can see all the descriptive information that gprof can display.

The sample3.gprof should look similar to this:

Flat profile:
Each sample counts as 0.01 seconds.
%       cumulative    self             self      total
time    seconds       seconds  calls   ms/call   ms/call   name
43.36   4.21          4.21     12      0.35      0.52      function2()
42.84   8.37          4.16     25      0.17      0.17      function1()
 8.65   9.21          0.84                                 main
 5.15   9.71          0.50      1      0.50      1.35      function3()
 0.00   9.71          0.00      1      0.00      0.00      global constructors
                                                           keyed to function1()
 0.00   9.71          0.00      1      0.00      0.00
__static_initialization_and_destruction_0(int, int)

Field	Description
% time	The percentage of the program's total running time used by this function.
cumulative seconds	A running sum of the number of seconds accounted for by this function and those listed above it.
self seconds	The number of seconds accounted for by this function alone. This is the major sort for this listing.
calls	The number of times this function was invoked if this function is profiled; otherwise, it is blank.
self ms/call	The average number of milliseconds spent in this function per call if this function is profiled; otherwise, it is blank.
total ms/call	The average number of milliseconds spent in this function and its descendents per call if this function is profiled; otherwise, it is blank.
name	The function's name. This is the minor sort for this listing. The index shows the location of the function in the gprof listing. If the index is in parentheses, it shows where it would appear in the gprof listing if it were to be printed.

Call graph (explanation follows)
granularity: each sample hit covers 4 byte(s) for 0.10% of 9.71 seconds
index %  time  self     children      called        name
                                                       <spontaneous>
[1]   100.0    0.84     8.87                        main [1]
               3.86     1.83          11/12               function2() [2]
               1.83     0.00          11/25               function1() [3]
               0.50     0.85          1/1                 function3() [4]
------------------------------------
               0.35     0.17          1/12                function3() [4]
               3.86     1.83          11/12              main [1]
[2]   63.9     4.21     2.00          12            function2() [2]
               2.00     0.00          12/25              function1() [3]
------------------------------------
               0.33     0.00          2/25              function3() [4]
               1.83     0.00          11/25             main [1]
               2.00     0.00          12/25             function2() [2]
[3]   42.8     4.16     0.00          25            function1() [3]
------------------------------------
               0.50      0.85         1/1                main [1]
[4]   13.9     0.50      0.85         1           function3() [4]
               0.35      0.17         1/12               function2() [2]
               0.33      0.00         2/25               function1() [3]
------------------------------------
               0.00      0.00         1/1          __do_global_ctors_aux [13]
[11]  0.0      0.00      0.00         1        global constructors keyed to
function1() [11]
               0.00      0.00         1/1
__static_initialization_and_destruction_0(int, int) [12]
------------------------------------
               0.00      0.00         1/1          global constructors keyed to
function1() [11]
[12]           0.0       0.00         0.00          1
__static_initialization_and_destruction_0(int, int) [12]
------------------------------------

This table describes the program's call tree. It is sorted by the total amount of time spent in each function and its children.

Each entry in this table consists of several lines. The line with the index number at the left margin lists the current function. The lines above it list the functions that called this function, and the lines below it list the functions this one called.

You see the following:

Field	Description
index	A unique number given to each element of the table. Index numbers are sorted numerically. The index number is printed next to every function name so that it is easier to look up the function in the table.
% time	The percentage of the total time that was spent in this function and its children. Note that due to different viewpoints, functions excluded by options, and so on, these numbersdo not add up to 100%.

Field	Description
self	The total amount of time spent in this function.
children	The total amount of time propagated into this function by its children.
called	The number of times the function was called. If the function called itself recursively, the number includes only nonrecursive calls and is followed by a + and the number of recursive calls.
name	The name of the current function. The index number is printed after it. If the function is a member of a cycle, the cycle number is printed between the function's name and the index number.

For the function's parents, the fields have the following meanings:

Field	Description
self	The amount of time that was propagated directly from the function into this parent.
children	The amount of time that was propagated from the function's children into this parent.
called	The number of times this parent called the function and the total number of times the function was called. Recursive calls to the function are not included in the number after the /.
name	The parent's name. The parent's index number is printed after it. If the parent is a member of a cycle, the cycle number is printed between the name and the index number.

If the function's parents cannot be determined, the word <spontaneous> is printed in the name field, and all the other fields are blank.

For the function's children, the fields have the following meanings:

Field	Description
self	The amount of time that was propagated directly from the child into the function.
children	The amount of time that was propagated from the child's children to the function.
called	The number of times the function called this child and the total number of times the child was called. Recursive calls by the child are not listed in the number after the /.
name	The child's name. The child's index number is printed after it. If the child is a member of a cycle, the cycle number is printed between the name and the index number.

If the call graph has any cycles (circles), there is an entry for the cycle as a whole. This entry shows who called the cycle (as parents) and the members of the cycle (as children). The + recursive calls entry shows how many function calls were internal to the cycle. The calls entry for each member shows, for that member, how many times it was called from other members of the cycle.

Index by function name
[11] global constructors keyed to function1() [3] function1() [4] function3()
[12] __static_initialization_and_destruction_0(int, int) [2] function2() [1]
main