An example of profiling an application

Let’s see a simple example:

  # cat add_vec.cpp
  #include <algorithm>
  #include <iostream>
  #include <vector>
  constexpr int vec_size = 10000;
  void mul(std::vector<int>& a, std::vector<int>& b, std::vector<int>& c)
  {
      std::transform(
          a.begin(),
          a.end(),
          b.begin(),
          c.begin(),
          [](auto op1, auto op2) { return op1 * op2; }
      );
  }
  std::vector<int> A(vec_size, 1);
  std::vector<int> B(vec_size, 2);
  std::vector<int> C(vec_size);
  int main()
  {
      mul(A, B, C);
      for (auto const& v : C)
      {
          if (v != 2)
          {
              std::cout << "Oops, something happened!\n";
              return 1;
          }
      }
      return 0;
  }

Build it and run the “perf record” command to profile it:

  # g++ add_vec.cpp -o add_vec
  # perf record ./add_vec
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.003 MB perf.data (27 samples) ]
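
By default, “perf record” samples the CPU cycles event (note the 'cycles' event name in the report below) at a frequency chosen automatically. If you want to sample a different event or at a different rate, the -e and -F options select the event name and sampling frequency; the event and frequency below are purely illustrative:

  # perf record -e instructions -F 999 ./add_vec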

A “perf.data” file will be generated. Use the “perf report” command to analyze it:

  # perf report
  Samples: 27 of event 'cycles:uppp', Event count (approx.): 2149376
  Overhead  Command  Shared Object  Symbol
    33.67%  add_vec  ld-2.28.so     [.] do_lookup_x
    26.37%  add_vec  ld-2.28.so     [.] _dl_lookup_symbol_x
    11.43%  add_vec  [unknown]      [k] 0xffffffffb2200a87
     9.16%  add_vec  libc-2.28.so   [.] _dl_addr
     8.43%  add_vec  add_vec        [.] main
     4.21%  add_vec  ld-2.28.so     [.] _dl_relocate_object
     2.76%  add_vec  libc-2.28.so   [.] strlen
     2.45%  add_vec  ld-2.28.so     [.] strcmp
     1.52%  add_vec  ld-2.28.so     [.] _dl_next_ld_env_entry
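
Note that nearly all of the hot symbols above belong to the dynamic loader (ld-2.28.so): the program performs only 10000 multiplications, so start-up work such as symbol resolution outweighs the computation itself. If the goal is to profile mul, it helps to enlarge the workload first; a minimal sketch, assuming we simply repeat the call an arbitrary number of times:

  // Hypothetical change to main(): repeat mul() so that enough samples
  // fall inside the code we actually want to profile.
  int main()
  {
      for (int i = 0; i < 100000; ++i)  // arbitrary repetition count
      {
          mul(A, B, C);
      }
      for (auto const& v : C)
      {
          if (v != 2)
          {
              std::cout << "Oops, something happened!\n";
              return 1;
          }
      }
      return 0;
  }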

Sticking with the original program for now: do_lookup_x in ld-2.28.so receives the most samples, but main also shows up. Highlight the main function and press ‘a’ (which invokes the “perf annotate” command to show the annotated code):

  Samples: 27 of event 'cycles:uppp', 4000 Hz, Event count (approx.): 2149376
  main  /home/xiaonan/perf_example/add_vec [Percent: local period]
  Percent
                  mov    %rax,%rdi
                  callq  std::vector<int, std::allocator<int> >::begin
                  mov    %rax,-0x28(%rbp)
                  mov    -0x18(%rbp),%rax
                  mov    %rax,%rdi
                  callq  std::vector<int, std::allocator<int> >::end
                  mov    %rax,-0x20(%rbp)
              5c: lea    -0x20(%rbp),%rdx
                  lea    -0x28(%rbp),%rax
                  mov    %rdx,%rsi
                  mov    %rax,%rdi
                  callq  __gnu_cxx::operator!=<int*, std::vector<int, std::allocator<int> > >
                  test   %al,%al
                  je     b6
                  lea    -0x28(%rbp),%rax
                  mov    %rax,%rdi
                  callq  __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >::operator*
                  mov    %rax,-0x10(%rbp)
                  mov    -0x10(%rbp),%rax
  100.00          mov    (%rax),%eax
                  cmp    $0x2,%eax
                  je     a8
  ......
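
The same view can also be produced without the interactive interface: “perf annotate” accepts a symbol name on the command line and reads perf.data from the current directory by default.

  # perf annotate main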

If you want to map the assembly code back to the source code, build the program with the -g option:

  # g++ add_vec.cpp -g -o add_vec
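
Note that -g only adds debug information and does not change the optimization level. For a profile that resembles a production build, combine it with the optimization flags you normally use, for example:

  # g++ -O2 -g add_vec.cpp -o add_vec

Keep in mind that with -O2 the compiler may inline mul and rearrange the loop, so the annotated assembly will look quite different from the unoptimized build shown above.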

Another useful option of “perf record” is -g, which records function call-stack information:

  # perf record -g ./add_vec
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.004 MB perf.data (28 samples) ]
  # perf report
  Samples: 28 of event 'cycles:uppp', Event count (approx.): 2044515
    Children      Self  Command  Shared Object  Symbol
  +   48.61%     0.00%  add_vec  [unknown]      [.] 0x5541f689495641d7
  +   48.61%     0.00%  add_vec  libc-2.28.so   [.] __libc_start_main
  -   48.61%    18.12%  add_vec  add_vec        [.] main
     - 30.48% main
          mul
        - std::transform<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal
           + __gnu_cxx::operator!=<int*, std::vector<int, std::allocator<int> > >
     - 18.12% 0x5541f689495641d7
          __libc_start_main
          main
  ......

This time, each top-level line starts with a ‘+’ (or a ‘-’ once it has been expanded), which lets you unfold the call stack from the upper level down to the lower one. Take the main function as an example:

  int main()
  {
      mul(A, B, C);
      for (auto const& v : C)
      {
          if (v != 2)
          {
              std::cout << "Oops, something happened!\n";
              return 1;
          }
      }
      return 0;
  }

The mul function accounts for ~30.48% of the time, while main's own code (the for-loop check) accounts for ~18.12%. The “Children” column denotes all the time spent in a function including its callees, while the “Self” column excludes sub-function calls; for main, “Self” excludes the time spent in mul. That is why main's “Children” value (48.61%) is roughly its “Self” value (18.12%) plus the mul subtree (30.48%).
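
If the recorded call chains look broken (the raw 0x5541f689495641d7 frame above is typically a sign that the default frame-pointer-based unwinding lost track of the stack), two common remedies, shown here only as a sketch, are to compile with frame pointers preserved or to let perf unwind using DWARF debug information:

  # g++ add_vec.cpp -g -fno-omit-frame-pointer -o add_vec
  # perf record --call-graph dwarf ./add_vec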