From: eLinux.org

Kernel Function Trace

Contents

Introduction

Kernel Function Trace (KFT) is a kernel function tracing system, which
uses the “-finstrument-functions” capability of the gcc compiler to add
instrumentation callouts to every function entry and exit. The KFT
system provides for capturing these callouts and generating a trace of
events, with timing details. KFT is excellent at providing a good timing
overview of kernel procedures, allowing you to see where time is spent
in functions and sub-routines in the kernel.

The main mode of operation with KFT is to use the system with a dynamic
trace configuration. That is, you can set a trace configuration after
kernel startup, using the /proc/kft interface, and retrieve trace data
immediately. However, another (special) mode of operation is available,
called STATIC_RUN mode, where the configuration for a KFT run is
configured and compiled statically into the kernel. This mode is useful
for getting a trace of kernel operation during system bootup (before
user space is running).

The KFT configuration lets you specify how to automatically start and
stop a trace, whether to include interrupts as part of the trace, and
whether to filter the trace data by various criteria (for minimum
function duration, only certain listed functions, etc.) KFT trace data
is retrieved by reading from /proc/kft_data after the trace is
complete.

Tools are supplied to convert numeric trace data to kernel symbols, and
to process and analyze the data in a KFT trace.

Basic Use

Documentation for KFT is available (as of 2.6.12) in
Documentation/kft.txt, after applying the kft-all-in-one.patch.

An online guide is provided at Using Kernel Function
Trace

Here’s a presentation about KFT usage:

KFT used to be called KFI (for Kernel Function Instrumentation). For
prior releases of KFT, see KernelFunctionInstrumentation

Download

Patches

KFT utilities

KFT includes several helper scripts which are located in the kernel
scripts directory:

  • addr2sym - convert function addresses to symbols in the trace data
  • kd - KFT dump - does filtering, sorting, analysis and trace
    formatting of KFT trace logs
  • mkkftrun.pl - used during building the kernel to convert a
    configuration file into a C file to be compiled into the kernel
  • sym2addr - convert function names to addresses in a KFT
    configuration file (for a dynamic trace)

See Documentation/kft.txt, in the kernel source tree after applying the
patch, for instructions on using these programs.

How To Use

  • download both the patch
  • apply the patch in the kernel top-level directory:
  • patch -p1 \<kft.patch
  • read the rest of the instructions in the Documentation/kft.txt file.
    (my apologies for being lazy!)

Adding platform support for the kft clock source

The current patch (from Sep 2005), uses sched_clock() as the clock
source for kft_readclock(). sched_clock() is new in the 2.6 kernel,
and returns a 64-bit value containing nanoseconds (not necessarily
relative to any particular time base, but assumed to be monotonically
increasing, and relatively frequency-stable.)

If your platform has good support for sched_clock(), then KFT should
work for you unmodified. If not, you may wish to do one of two things:

  • improve support for sched_clock() in your board port, or
  • write a custom kft_readclock() routine.

A “good” sched_clock() routine will provide at least microsecond
resolution on return values. Some architectures have sched_clock()
returning values based on the jiffy variable, which on many embedded
platforms only has resolution to 10 milliseconds.

There are some sample custom kft_readclock() routines in the current
patch for different architectures.

Issues

Here is a list of things that need more work:

  • may need to add noinstrument attributes for some time-critical code
    (need to check this)
    • maybe check “Function Trace in KDB” patch for help with this

Overhead

Mitsubishi measured the overhead of KFI (the predecessor to KFT). The
period is from start_kernel() to smp_init().

Platform was: SH7751R 240MHz (Memory Clock 80MHz)

With KFI : 922.419 msec Without KFI : 666.982 msec Overhead : 27.69%

Similar technologies

There are other technologies for doing call traces or kernel profiling
that are similar to KFT. Some of these are mentioned on the Kernel
Instrumentation
page.
One that is very similar is a kernel trace mechanism for use with KDB. A
patch was posted to LKML in January of 2002. See the message:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0201.3/0888.html

Filter Q&A

Tim asked the question:

Q. Is there a way to adjust the trigger or filters to reduce the memory
usage?

A. The memory usage is determined by the size of the log, which is
specified by logentries in the KFT configuration. If logentries is
not specified, it defaults to a rather large number (20,000 in the
current code). To use a smaller trace log, specify a smaller number of
logentries in the KFT configuration.

The use of triggers and filters can help you fit more data (or more
pertinent data) into the log, so you can more readily see the
information you are interested in.

By setting start and stop triggers with a narrower “range” of operation,
then the amount of data put into the log will be more limited. For
example, the default configuration for a static trace uses

trigger start entry start_kernel trigger stop entry to_userspace

This will trace EVERYTHING that the kernel does between those two
routines. However, you can limit tracing to a much smaller time area of
kernel initialization using better triggers. Here is an example showing
a triggers for just watching mem_init():
trigger start entry mem_init trigger stop exit mem_init

Filters are also vital to reduce the number of entries the trace log.
With no time filters in place, KFT will log every single function
executed by the kernel. This will quickly overrun the log (no matter
what size you have reserved with logentries.

When using KFT to find long-duration functions in the kernel, we usually
are not interested in routines that execute quickly, and instead use
something like “filter mintime 500” to filter out routines taking less
than 500 microseconds.

Sample results

Here is an excerpt from a KFI log trace (processed with addr2sym). It
shows all functions which lasted longer than 500 microseconds, from when
the kernel entered start_kernel() to when it entered to_userspace().

kft log output (excerpt)

Kernel Instrumentation Run ID 0

Logging started at 6785045 usec by entry to function start_kernel
Logging stopped at 8423650 usec by entry to function to_userspace

Filters: 500 usecs minimum execution time

Filter Counters:

Execution time filter count = 896348 Total entries filtered = 896348
Entries not found = 24

Number of entries after filters = 1757

  1. Entry Delta PID Function Called At
  2. 1 0 0 start_kernel L6+0x0
  3. 14 8687 0 setup_arch start_kernel+0x35
  4. 39 891 0 setup_memory setup_arch+0x2a8
  5. 53 872 0 register_bootmem_low_pages setup_memory+0x8f
  6. 54 871 0 free_bootmem register_bootmem_low_pages+0x95
  7. 54 871 0 free_bootmem_core free_bootmem+0x34
  8. 930 7432 0 paging_init setup_arch+0x2af
  9. 935 7427 0 zone_sizes_init paging_init+0x4e
  10. 935 7427 0 free_area_init zone_sizes_init+0x83
  11. 935 7427 0 free_area_init_node free_area_init+0x4b
  12. 935 3759 0 __alloc_bootmem_node free_area_init_node+0xc5
  13. 935 3759 0 __alloc_bootmem_core __alloc_bootmem_node+0x43
  14. 4694 3668 0 free_area_init_core free_area_init_node+0x75
  15. 4817 3535 0 memmap_init_zone free_area_init_core+0x2bd
  16. 8807 266911 0 time_init start_kernel+0xb6
  17. 8807 261404 0 get_cmos_time time_init+0x1c
  18. 270211 5507 0 select_timer time_init+0x41
  19. 270211 5507 0 init_tsc select_timer+0x45
  20. 270211 5507 0 calibrate_tsc init_tsc+0x6c
  21. 275718 1638 0 console_init start_kernel+0xbb
  22. 275718 1638 0 con_init console_init+0x59
  23. 275954 733 0 vgacon_save_screen con_init+0x288
  24. 277376 6730 0 mem_init start_kernel+0xf8
  25. 277376 1691 0 free_all_bootmem mem_init+0x52
  26. 277376 1691 0 free_all_bootmem_core free_all_bootmem+0x24
  27. 284118 25027 0 calibrate_delay start_kernel+0x10f
  28. 293860 770 0 __delay calibrate_delay+0x62
  29. 293860 770 0 delay_tsc __delay+0x26
  30. 294951 1534 0 __delay calibrate_delay+0x62
  31. 294951 1534 0 delay_tsc __delay+0x26
  32. 297134 1149 0 __delay calibrate_delay+0xbe
  33. 297134 1149 0 delay_tsc __delay+0x26
  34. .
  35. .
  36. .
  37. 1638605 0 145 filemap_nopage do_no_page+0xef
  38. 1638605 0 145 __lock_page filemap_nopage+0x286
  39. 1638605 0 145 io_schedule __lock_page+0x95
  40. 1638605 0 145 schedule io_schedule+0x24
  41. 1638605 0 5 schedule worker_thread+0x217
  42. 1638605 0 1 to_userspace init+0xa6

The log is attached here:
Media:Kfiboot-9.lst A
Delta value of 0 usually means the exit from the routine was not seen.

kft log analysis with ‘kd’

Below is a kd dump of the data from the above log.

For the purpose of finding areas of big time in the kernel, the
functions with high “Local” time are important. For example,
delay_tsc() is called 156 times, resulting in 619 milliseconds of
duration. Other time-consuming routines were: isapnp_isolate(),
get_cmos_time(), default_idle().

The top line showing schedule() called 192 times and lasting over 5
seconds, is accounted wrong due to the switch in execution control
inside the schedule routine. (The count of 192 calls is correct, but the
duration is wrong.)

  1. $ ~/work/kft/kft/kd -n 30 kftboot-9.lst
  2. Function Count Time Average Local
  3. ------------------------- ----- -------- -------- --------
  4. schedule 192 5173790 26946 5173790
  5. do_basic_setup 1 1159270 1159270 14
  6. do_initcalls 1 1159256 1159256 627
  7. __delay 156 619322 3970 0
  8. delay_tsc 156 619322 3970 619322
  9. __const_udelay 146 608427 4167 0
  10. probe_hwif 8 553972 69246 126
  11. do_probe 31 553025 17839 68
  12. ide_delay_50ms 103 552588 5364 0
  13. isapnp_init 1 383138 383138 18
  14. isapnp_isolate 1 383120 383120 311629
  15. ide_init 1 339778 339778 22
  16. probe_for_hwifs 1 339756 339756 103
  17. ide_scan_pcibus 1 339653 339653 13
  18. init_setup_piix 2 339640 169820 0
  19. ide_scan_pcidev 2 339640 169820 0
  20. piix_init_one 2 339640 169820 0
  21. ide_setup_pci_device 2 339640 169820 242
  22. probe_hwif_init 4 339398 84849 40
  23. time_init 1 266911 266911 0
  24. get_cmos_time 1 261404 261404 261404
  25. ide_generic_init 1 214614 214614 0
  26. ideprobe_init 1 214614 214614 0
  27. wait_for_completion 6 194573 32428 0
  28. default_idle 183 192589 1052 192589
  29. io_schedule 18 171313 9517 0
  30. __wait_on_buffer 14 150369 10740 141
  31. i8042_init 1 137210 137210 295
  32. i8042_port_register 2 135318 67659 301
  33. __serio_register_port 2 135017 67508 0

kft nested call trace with ‘kd -c’

Below is a kd -c trace of the data from a log taken from a PPC440g
platform, from a (dynamic) trace of the function do_fork().

Here is the configuration file that was used:
new begin trigger start entry do_fork trigger stop exit do_fork end

Here is the first part of the trace in nested call format: Times (Entry,
Duration and Local) are in micro-seconds. Note the timer interrupt
during the routine.

  1. Entry Duration Local Pid Trace
  2. 4 20428 209 33 do_fork
  3. 7 6 6 33 | alloc_pidmap
  4. 18 2643 84 33 | copy_process
  5. 21 114 19 33 | | dup_task_struct
  6. 24 8 6 33 | | | prepare_to_copy
  7. 27 2 2 33 | | | | sub_preempt_count
  8. 35 22 9 33 | | | kmem_cache_alloc
  9. 38 2 2 33 | | | | __might_sleep
  10. 43 11 9 33 | | | | cache_alloc_refill
  11. 49 2 2 33 | | | | | sub_preempt_count
  12. 60 65 6 33 | | | __get_free_pages
  13. 63 59 14 33 | | | | __alloc_pages
  14. 65 3 3 33 | | | | | __might_sleep
  15. 71 3 3 33 | | | | | zone_watermark_ok
  16. 77 37 17 33 | | | | | buffered_rmqueue
  17. 80 4 4 33 | | | | | | __rmqueue
  18. 86 3 3 33 | | | | | | sub_preempt_count
  19. 92 3 3 33 | | | | | | bad_range
  20. 98 2 2 33 | | | | | | __mod_page_state
  21. 103 8 5 33 | | | | | | prep_new_page
  22. 106 3 3 33 | | | | | | | set_page_refs
  23. 117 2 2 33 | | | | | zone_statistics
  24. 141 25 4 33 | | do_posix_clock_monotonic_gettime
  25. 143 21 6 33 | | | do_posix_clock_monotonic_get
  26. 146 15 6 33 | | | | do_posix_clock_monotonic_gettime_parts
  27. 149 9 6 33 | | | | | getnstimeofday
  28. 152 3 3 33 | | | | | | do_gettimeofday
  29. 169 3 3 33 | | copy_semundo
  30. 174 41 17 33 | | copy_files
  31. 177 19 9 33 | | | kmem_cache_alloc
  32. 180 2 2 33 | | | | __might_sleep
  33. 185 8 5 33 | | | | cache_alloc_refill
  34. 188 3 3 33 | | | | | sub_preempt_count
  35. 200 3 3 33 | | | count_open_files
  36. 209 2 2 33 | | | sub_preempt_count
  37. 218 19 8 33 | | kmem_cache_alloc
  38. 220 2 2 33 | | | __might_sleep
  39. 225 9 6 33 | | | cache_alloc_refill
  40. 229 3 3 33 | | | | sub_preempt_count
  41. 241 2 2 33 | | sub_preempt_count
  42. 246 216 9 33 | | kmem_cache_alloc
  43. 249 199 199 33 | | | __might_sleep
  44. !!!! start
  45. 253 151 63 33 timer_interrupt
  46. 256 8 6 -1 ! profile_tick
  47. 259 2 2 -1 ! ! profile_hit
  48. 267 61 15 -1 ! update_process_times
  49. 270 8 5 -1 ! ! account_system_time
  50. 273 3 3 -1 ! ! ! update_mem_hiwater
  51. 281 8 5 -1 ! ! run_local_timers
  52. 284 3 3 -1 ! ! ! raise_softirq
  53. 293 27 16 -1 ! ! scheduler_tick

. . .

To see the full trace, go to the
KftDoForkTrace page.

Categories: