Tracing the kernel using ftrace

Ftrace

What is ftrace?

Ftrace is a tracing utility built directly into the Linux kernel. Its interface resides in the tracing directory of the debugfs file system. The official documentation can be found in the Linux kernel source tree at Documentation/trace/ftrace.txt.

Kernel Configuration

CONFIG_FUNCTION_TRACER  
CONFIG_FUNCTION_GRAPH_TRACER  
CONFIG_STACK_TRACER  
CONFIG_DYNAMIC_FTRACE

Note, the following options:

CONFIG_HAVE_FUNCTION_TRACER
CONFIG_HAVE_DYNAMIC_FTRACE
CONFIG_HAVE_FUNCTION_GRAPH_TRACER

are automatically enabled when the architecture supports the corresponding feature. Do not confuse them with the options listed above: the features are only enabled when those configuration options are set, not when only the HAVE options are.

The function tracer uses the -pg option of gcc to have every function in the kernel call a special function, mcount(). This causes a slight overhead because of the frame pointers that come with the -pg option.
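To see what -pg does, you can compile an ordinary user-space file with it (a quick sketch; the exact symbol name varies by architecture, e.g. __gnu_mcount_nc on ARM):

/* probe.c - compile with "gcc -pg -S probe.c" and look in probe.s
 * for the mcount call that -pg inserts at the function entry. */
int add(int a, int b)
{
        return a + b;
}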

What can ftrace do?

rpi3:/d/tracing # cat available_tracers
blk function_graph wakeup_dl wakeup_rt wakeup preemptirqsoff preemptoff irqsoff
function nop

There are ten tracers built into this Raspberry Pi 3 kernel. Apart from nop, the default tracer that does nothing by itself, these are:

  • blk: The block tracer. The tracer used by the blktrace user application.
  • function: Function call tracer to trace all kernel functions.
  • function_graph: traces on both entry and exit of the functions.
  • wakeup: Traces and records the max latency that it takes for the highest priority task to get scheduled after it has been woken up.
  • wakeup_rt: Traces and records the max latency that it takes for just RT tasks.
  • wakeup_dl: Traces and records the max latency that it takes for a SCHED_DEADLINE task to be woken.
  • irqsoff: Traces the areas that disable interrupts and saves the trace with the longest max latency.
  • preemptoff: Similar to irqsoff but traces and records the amount of time for which preemption is disabled.
  • preemptirqsoff: Similar to irqsoff and preemptoff, but traces and records the largest time for which irqs and/or preemption is disabled.

Refer to Documentation/trace/ftrace.txt for the tracers that are not enabled by this kernel configuration.

function tracer

rpi3:/d/tracing # echo function > current_tracer
rpi3:/d/tracing # cat current_tracer
function

rpi3:/d/tracing # echo 1 > tracing_on
rpi3:/d/tracing # cat trace | head -20
# tracer: function
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
          <idle>-0     [003] d..1   490.568451: bcm2836_arm_irqchip_handle_irq <-__irq_svc
          <idle>-0     [003] d..1   490.568455: arm_heavy_mb <-bcm2836_arm_irqchip_handle_irq
          <idle>-0     [003] d..1   490.568457: handle_IPI <-bcm2836_arm_irqchip_handle_irq
          <idle>-0     [003] d..1   490.568459: scheduler_ipi <-handle_IPI
          <idle>-0     [003] d..1   490.568461: irq_enter <-scheduler_ipi
          <idle>-0     [003] d..1   490.568462: rcu_irq_enter <-irq_enter
          <idle>-0     [003] d..1   490.568464: rcu_dynticks_eqs_exit <-rcu_irq_enter
          <idle>-0     [003] d..1   490.568466: __local_bh_disable_ip <-irq_enter
          <idle>-0     [003] d..1   490.568467: tick_irq_enter <-irq_enter
          <idle>-0     [003] d..1   490.568469: tick_check_oneshot_broadcast_this_cpu <-tick_irq_enter
          <idle>-0     [003] d..1   490.568471: ktime_get <-tick_irq_enter

function_graph tracer

rpi3:/d/tracing # echo function_graph > current_tracer
rpi3:/d/tracing # cat trace | head -20
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 1)   0.521 us    |                } /* vma_compute_subtree_gap */
 1)   6.510 us    |              } /* vma_gap_callbacks_rotate */
 1) ! 107.500 us  |            } /* __vma_link_rb */
 1)               |            __vma_link_file() {
 1)               |              _raw_spin_lock_irq() {
 1)   0.782 us    |                preempt_count_add();
 1)   5.312 us    |              }
 1)   1.927 us    |              vma_interval_tree_insert();
 1)               |              _raw_spin_unlock_irq() {
 1)   0.677 us    |                preempt_count_sub();
 1)   5.052 us    |              }
 1) + 24.323 us   |            }
 1)   0.520 us    |            up_write();
 1) ! 152.396 us  |          } /* vma_link */
 1)   0.573 us    |          vm_stat_account();
 1)   0.573 us    |          uprobe_mmap();

The DURATION column shows the time spent in the corresponding function. The function graph tracer records the time the function was entered and exited and reports the difference as the duration. These numbers only appear with the leaf functions and the “}” symbol.

Note that this time also includes the overhead of all functions within a nested function as well as the overhead of the function graph tracer itself. The function graph tracer hijacks the return address of the function in order to insert a trace callback for the function exit. This breaks the CPU’s branch prediction and causes a bit more overhead than the function tracer. The closest true timings only occur for the leaf functions.

The marks in the output above have the following meanings:

  '$' - greater than 1 second
  '@' - greater than 100 milliseconds
  '*' - greater than 10 milliseconds
  '#' - greater than 1000 microseconds
  '!' - greater than 100 microseconds
  '+' - greater than 10 microseconds
  ' ' - less than or equal to 10 microseconds

Using trace_printk()

printk() is the king of all debuggers, but it has a problem. If you are debugging a high-volume area such as the timer interrupt, the scheduler, or the network, printk() can bog down the system or even create a livelock. It is also quite common to see a bug “disappear” when adding a few printk()s. This is due to the sheer overhead that printk() introduces.

Ftrace introduces a new form of printk() called trace_printk(). It can be used just like printk(), and can also be used in any context (interrupt code, NMI code, and scheduler code). What is nice about trace_printk() is that it does not output to the console. Instead it writes to the Ftrace ring buffer and can be read via the trace file.

Writing into the ring buffer with trace_printk() only takes around a tenth of a microsecond or so. But using printk(), especially when writing to the serial console, may take several milliseconds per write. The performance advantage of trace_printk() lets you record the most sensitive areas of the kernel with very little impact.
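Here is a minimal sketch of how trace_printk() is used in kernel code; the interrupt handler below is hypothetical, but trace_printk() takes the same format arguments as printk():

#include <linux/interrupt.h>
#include <linux/kernel.h>

/* hypothetical interrupt handler instrumented with trace_printk() */
static irqreturn_t my_irq_handler(int irq, void *dev)
{
        /* written to the Ftrace ring buffer, not the console;
         * read it back from the trace file */
        trace_printk("irq %d fired, dev=%p\n", irq, dev);
        return IRQ_HANDLED;
}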

Disabling ftrace

tracing_on

rpi3:/d/tracing # echo 0 > tracing_on

This disables recording into the Ftrace ring buffer. Everything else still happens with the tracers, and they still incur most of their overhead. They do notice that the ring buffer is not recording and will not attempt to write any data, but the calls that the tracers make are still performed.

ftrace_enabled
Note, the proc sysctl ftrace_enabled is a big on/off switch for the function tracer. By default it is enabled (when function tracing is enabled in the kernel). If it is disabled, all function tracing is disabled. This includes not only the function tracers for ftrace, but also any other users (perf, kprobes, stack tracing, profiling, etc).

Please disable this with care.

This can be disabled (and enabled) with:

rpi3:/d/tracing # sysctl kernel.ftrace_enabled=0
rpi3:/d/tracing # sysctl kernel.ftrace_enabled=1

or

rpi3:/d/tracing # echo 0 > /proc/sys/kernel/ftrace_enabled
rpi3:/d/tracing # echo 1 > /proc/sys/kernel/ftrace_enabled

Disabling ftrace within kernel
You may want to disable ftrace from within the kernel during driver development. There are two functions inside the kernel, tracing_on() and tracing_off(), which act just as echoing '1' or '0' into the tracing_on file does.
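A common pattern, sketched below with a hypothetical device-state check, is to stop the trace the moment a bug is detected so that the ring buffer preserves the events leading up to it:

#include <linux/kernel.h>

struct my_dev { int state; };      /* hypothetical device */
#define MY_STATE_BROKEN 2          /* hypothetical bad state */

static void my_check_state(struct my_dev *dev)
{
        if (dev->state == MY_STATE_BROKEN) {
                trace_printk("bad device state, freezing trace\n");
                tracing_off();     /* same effect as: echo 0 > tracing_on */
        }
}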

Trace Marker

The trace_marker file is used to synchronize actions in user space with events in kernel space: anything written to it appears in the trace.

rpi3:/d/tracing # echo nop > current_tracer
rpi3:/d/tracing # echo hello world > trace_marker
rpi3:/d/tracing # cat trace
# tracer: nop
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
              sh-307   [000] ....  5493.548616: tracing_mark_write: hello world
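The same marker can be written from a program by opening the file directly. A sketch, with error handling mostly omitted and assuming debugfs is mounted at /sys/kernel/debug:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/sys/kernel/debug/tracing/trace_marker", O_WRONLY);

        if (fd >= 0) {
                /* appears in the trace as tracing_mark_write: hello world */
                write(fd, "hello world\n", 12);
                close(fd);
        }
        return 0;
}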

Tracing kernel crash

ftrace_dump_on_oops is used for kernel crash debugging: on a kernel panic or oops it dumps the entire ftrace buffer to the console in ASCII format. It can be enabled at run time via sysctl, or by adding ftrace_dump_on_oops to the kernel boot command line.

rpi3:/d/tracing # sysctl kernel.ftrace_dump_on_oops=1
kernel.ftrace_dump_on_oops = 1

Dumping the whole ftrace buffer may be time-consuming, depending on the buffer size; you can change the default buffer size through the buffer_size_kb file:

rpi3:/d/tracing # echo 10 > buffer_size_kb

The ftrace buffers are the same size for each CPU by default, so echoing into buffer_size_kb above makes buffer_total_size_kb four times that value, since the Raspberry Pi 3 has 4 CPUs on chip.

rpi3:/d/tracing # cat buffer_total_size_kb
40

Buffer sizes may also vary between individual CPUs; in that case buffer_size_kb will show 'X' when read, and buffer_total_size_kb reports the sum (below, 37 = 7 for cpu0 plus 10 for each of the remaining three CPUs).

rpi3:/d/tracing # echo 7 > per_cpu/cpu0/buffer_size_kb
rpi3:/d/tracing # cat buffer_size_kb
X
rpi3:/d/tracing # cat buffer_total_size_kb
37

Stack Tracing

Enabling stack trace

Stack tracing can be enabled with sysctl or by echoing:

rpi3:/d/tracing # sysctl kernel.stack_tracer_enabled=1
kernel.stack_tracer_enabled = 1

or

rpi3:/d/tracing # echo 1 > /proc/sys/kernel/stack_tracer_enabled
rpi3:/d/tracing # cat stack_max_size
2592
rpi3:/d/tracing # cat stack_trace
        Depth    Size   Location    (38 entries)
        -----    ----   --------
  0)     2648       4   __wake_up_common+0x18/0x144
  1)     2644     132   __wake_up_common_lock+0x88/0xac
  2)     2512      24   __wake_up+0x24/0x2c
  3)     2488      24   tty_wakeup+0x38/0x6c
  4)     2464      24   tty_port_default_wakeup+0x24/0x30
  5)     2440      16   tty_port_tty_wakeup+0x20/0x24
  6)     2424      16   uart_write_wakeup+0x24/0x2c
  7)     2408      32   serial8250_tx_chars+0x1e8/0x1fc
[...]
 25)     1808      32   bcm2836_arm_irqchip_handle_irq+0xa8/0xac
 26)     1776      68   __irq_svc+0x70/0xb0
 27)     1708      28   _raw_spin_unlock_irq+0x34/0x68
 28)     1680      72   finish_task_switch+0x8c/0x220
 29)     1608      28   __schedule+0x34c/0x98c
 30)     1580      84   schedule+0x58/0xb8
 31)     1496     120   schedule_hrtimeout_range_clock+0x12c/0x134
 32)     1376      24   schedule_hrtimeout_range+0x28/0x30
 33)     1352      32   poll_schedule_timeout+0x54/0x84
 34)     1320     776   do_select+0x57c/0x694
 35)      544     336   core_sys_select+0x24c/0x48c
 36)      208     112   SyS_pselect6+0x344/0x36c
 37)       96      96   ret_fast_syscall+0x0/0x28

In the result above, do_select() had the biggest stack, using 776 bytes.

Note that the stack tracer only has no overhead when it is disabled; when it is running you may notice a bit of a performance degradation.

Enabling the stack tracer (CONFIG_STACK_TRACER) will show where the biggest use of the stack takes place. The stack tracer is built from the function tracer infrastructure. It does not use the Ftrace ring buffer.

To see the max stack size during boot up, add “stacktrace” to the kernel command line.

Note that the stack tracer will not trace the max stack size when the kernel is using a separate stack. Because interrupts have their own stack, it will not trace the stack usage there.

Filtering with ftrace

There are two files that can be used for function filtering:

set_ftrace_filter
set_ftrace_notrace

Only the functions listed in set_ftrace_filter will be traced. This helps reduce the performance impact on the system when the function tracer is activated.

The set_ftrace_notrace file is the opposite of set_ftrace_filter. Instead of limiting the trace to a set of functions, functions listed in set_ftrace_notrace will not be traced. Functions such as rcu_read_lock() and spin_lock() fall into this category.
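For example, to limit the function tracer to scheduler-related functions (the filter files accept glob wildcards):

rpi3:/d/tracing # echo 'sched_*' > set_ftrace_filter
rpi3:/d/tracing # echo function > current_tracer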

To remove all functions, simply echo a blank line into the filter file:

rpi3:/d/tracing # echo > set_ftrace_notrace
rpi3:/d/tracing # cat set_ftrace_notrace
rpi3:/d/tracing # 

The functions listed in these files can also be set on the kernel command line. The ftrace_notrace and ftrace_filter options preset these files with a comma-delimited list of functions.

ftrace_notrace=rcu_read_lock,rcu_read_unlock,spin_lock,spin_unlock ftrace_filter=kfree,kmalloc,schedule,vmalloc_fault,spurious_fault

Functions added on the kernel command line determine what will be in the corresponding filter files. These options only pre-load the files; functions can still be removed or added using shell redirection.

The functions listed in set_ftrace_notrace take precedence. That is, if a function is listed in both set_ftrace_notrace and set_ftrace_filter, that function will not be traced.

trace-cmd - utilities for Linux ftrace

Install from source code

cd /tmp/
git clone https://github.com/rostedt/trace-cmd.git
cd trace-cmd
sudo apt-get install libgtk2.0-dev -y
sudo apt-get install libxml2-dev -y
make gui && sudo make install_gui

make install_gui will also install kernelshark to the system.

Build for Android

sudo apt-get install gcc-arm-linux-gnueabi
make LDFLAGS=-static CC=arm-linux-gnueabi-gcc trace-cmd

Copy tracecmd/trace-cmd to target rootfs.

Note, do not try to build trace-cmd on macOS.

This section illustrates how to use trace-cmd to record a kernel trace. Running the command without parameters shows what trace-cmd can do:

rpi3:/d/tracing # trace-cmd

trace-cmd version 2.8.dev

usage:
  trace-cmd [COMMAND] ...

  commands:
     record - record a trace into a trace.dat file
     start - start tracing without recording into a file
     extract - extract a trace from the kernel
     stop - stop the kernel from recording trace data
     restart - restart the kernel trace data recording
     show - show the contents of the kernel tracing buffer
     reset - disable all kernel tracing and clear the trace buffers
     clear - clear the trace buffers
     report - read out the trace stored in a trace.dat file
     stream - Start tracing and read the output directly
     profile - Start profiling and read the output directly
     hist - show a historgram of the trace.dat information
     stat - show the status of the running tracing (ftrace) system
     split - parse a trace.dat file into smaller file(s)
     options - list the plugin options available for trace-cmd report
     listen - listen on a network socket for trace clients
     list - list the available events, plugins or options
     restore - restore a crashed record
     snapshot - take snapshot of running trace
     stack - output, enable or disable kernel stack tracing
     check-events - parse trace event formats

For detailed information about a subcommand, run trace-cmd subcommand --help

Recording a trace

rpi3:/ # trace-cmd record --help

trace-cmd version 2.8.dev

usage:
 trace-cmd record [-v][-e event [-f filter]][-p plugin][-F][-d][-D][-o file] \
           [-q][-s usecs][-O option ][-l func][-g func][-n func] \
           [-P pid][-N host:port][-t][-r prio][-b size][-B buf][command ...]
           [-m max][-C clock]
          -e run command with event enabled
          -f filter for previous -e event
          -R trigger for previous -e event
          -p run command with plugin enabled
          -F filter only on the given process
          -P trace the given pid like -F for the command
          -c also trace the childen of -F (or -P if kernel supports it)
          -C set the trace clock
          -T do a stacktrace on all events
          -l filter function name
          -g set graph function
          -n do not trace function
          -m max size per CPU in kilobytes
          -M set CPU mask to trace
          -v will negate all -e after it (disable those events)
          -d disable function tracer when running
          -D Full disable of function tracing (for all users)
          -o data output file [default trace.dat]
          -O option to enable (or disable)
          -r real time priority to run the capture threads
          -s sleep interval between recording (in usecs) [default: 1000]
          -S used with --profile, to enable only events in command line
          -N host:port to connect to (see listen)
          -t used with -N, forces use of tcp in live trace
          -b change kernel buffersize (in kilobytes per CPU)
          -B create sub buffer and folling events will be enabled here
          -k do not reset the buffers after tracing.
          -i do not fail if an event is not found
          -q print no output to the screen
          --quiet print no output to the screen
          --module filter module name
          --by-comm used with --profile, merge events for related comms
          --profile enable tracing options needed for report --profile
          --func-stack perform a stack trace for function tracer
             (use with caution)
          --max-graph-depth limit function_graph depth

When recording a trace, trace-cmd will fork off a process for each CPU on the system. Each of these processes will open the file in debugfs that represents the CPU the process is dedicated to record from. The process recording CPU0 will open /sys/kernel/debug/tracing/per_cpu/cpu0/trace_pipe_raw, the process recording CPU1 will open a similar file in the cpu1 directory, and so on. The trace_pipe_raw file is a mapping directly to the Ftrace internal buffer for each CPU. Each of these CPU processes will read these files using splice to record into a temporary file during the trace. At the end of the record, the main process will concatenate the temporary files into a single trace.dat file.

There’s no need to manually mount the debugfs filesystem before using the tool as trace-cmd will look to see if and where it is mounted. If debugfs is not mounted, it will automatically mount it at /sys/kernel/debug.

As trace-cmd is a front end to Ftrace, the arguments of record reflect some of the features of Ftrace. The -e option enables an event. The argument following the -e can be an event name, event subsystem name, or the special name all. The all name will make trace-cmd enable all events that the system supports. If a subsystem name is specified, then all events under that subsystem will be enabled during the trace. For example, specifying sched will enable all the events within the sched subsystem. To enable a single event, the event name can be used by itself, or the subsystem:event format can be used. If more than one event or subsystem is to be traced, then multiple -e options may be specified.
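For example, the following records every event in the sched subsystem plus the single irq:irq_handler_entry event while sleep 5 runs:

rpi3:/data # trace-cmd record -e sched -e irq:irq_handler_entry sleep 5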

To show the events the system supports, use the list subcommand with the -e option.

Ftrace also has special plugin tracers that do not simply trace specific events. These tracers include the function, function graph, and latency tracers. Through the debugfs tracing directory, these plugins are enabled by echoing the type of tracer into the current_tracer file. With trace-cmd record, they are enabled with the -p option. Using the tracer plugin name as the argument for -p enables that plugin. You can still specify one or more events with a plugin, but you may only specify a single plugin, or no plugin at all.
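For example, to trace an ls invocation with the function_graph plugin:

rpi3:/data # trace-cmd record -p function_graph ls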

Reading a trace file

rpi3:/d/tracing # trace-cmd report --help

trace-cmd version 2.8.dev

usage:
 trace-cmd report [-i file] [--cpu cpu] [-e][-f][-l][-P][-L][-N][-R][-E]\
           [-r events][-n events][-F filter][-v][-V][-T][-O option]
           [-H [start_system:]start_event,start_match[,pid]/[end_system:]end_event,end_match[,flags]
           [-G]
          -i input file [default trace.dat]
          -e show file endianess
          -f show function list
          -P show printk list
          -E show event files stored
          -F filter to filter output on
          -I filter out events with the HARDIRQ flag set
          -S filter out events with the SOFTIRQ flag set
          -t print out full timestamp. Do not truncate to 6 places.
          -R raw format: ignore print format and only show field data
          -r raw format the events that match the option
          -v will negate all -F after it (Not show matches)
          -T print out the filter strings created and exit
          -V verbose (shows plugins being loaded)
          -L load only local (~/.trace-cmd/plugins) plugins
          -N do not load any plugins
          -n ignore plugin handlers for events that match the option
          -w show wakeup latencies
          -l show latency format (default with latency tracers)
          -O plugin option -O [plugin:]var[=val]
          --check-events return whether all event formats can be parsed
          --stat - show the buffer stats that were reported at the end of the record.
          --uname - show uname of the record, if it was saved
          --profile report stats on where tasks are blocked and such
          -G when profiling, set soft and hard irqs as global
          -H Allows users to hook two events together for timings
             (used with --profile)
          --by-comm used with --profile, merge events for related comms
          --ts-offset will add amount to timestamp of all events of the
                     previous data file.
          --ts2secs HZ, pass in the timestamp frequency (per second)
                     to convert the displayed timestamps to seconds
                     Affects the previous data file, unless there was no
                     previous data file, in which case it becomes default
           --ts-diff Show the delta timestamp between events.

By default trace-cmd report reads the trace.dat file. To read a different trace file, use the -i option, or simply give the trace file as the last argument; the following two commands show the same result:

rpi3:/d/tracing # trace-cmd report -i /data/mytrace.data
rpi3:/d/tracing # trace-cmd report /data/mytrace.data

With a trace.dat file we can easily filter for the events we want, or pick a specific CPU to look at:

rpi3:/data # trace-cmd record -e kmem
rpi3:/data # trace-cmd report --cpu 0 -F 'kmalloc: bytes_req > 1000'
CPU 1 is empty
cpus=4
 jbd2/mmcblk0p4--114   [000]  1876.824662: kmalloc:              call_site=c076dc54 ptr=0xd6d77000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_NOWAIT|__GFP_NOWARN|__GFP_NORETRY|__GFP_NOMEMALLOC
 jbd2/mmcblk0p4--114   [000]  1876.827745: kmalloc:              call_site=c076dc54 ptr=0xd6d77000 bytes_req=2048 bytes_alloc=2048 gfp_flags=GFP_NOWAIT|__GFP_NOWARN|__GFP_NORETRY|__GFP_NOMEMALLOC

Show latency format

The -l option of the report command shows the latency format:

rpi3:/data # trace-cmd record -e sched
rpi3:/data # trace-cmd report -l
cpus=4
trace-cm-1394    1d..3  2535.384180: sched_waking:         comm=kworker/u8:2 pid=1333 prio=120 target_cpu=000
trace-cm-1394    1d..3  2535.384196: sched_migrate_task:   comm=kworker/u8:2 pid=1333 prio=120 orig_cpu=0 dest_cpu=1
trace-cm-1394    1dN.4  2535.384211: sched_wakeup:         comm=kworker/u8:2 pid=1333 prio=120 target_cpu=001
trace-cm-1394    1dNh3  2535.384219: sched_waking:         comm=kworker/1:2 pid=679 prio=120 target_cpu=001
trace-cm-1394    1dNh4  2535.384225: sched_wakeup:         comm=kworker/1:2 pid=679 prio=120 target_cpu=001
trace-cm-1394    1dNh3  2535.384233: sched_waking:         comm=trace-cmd pid=1396 prio=120 target_cpu=000
trace-cm-1394    1dN.2  2535.384244: sched_stat_runtime:   comm=trace-cmd pid=1394 runtime=260938 [ns] vruntime=18863605893 [ns]
  <idle>-0       0dNh2  2535.384251: sched_wakeup:         comm=trace-cmd pid=1396 prio=120 target_cpu=000
trace-cm-1394    1d..2  2535.384253: sched_switch:         prev_comm=trace-cmd prev_pid=1394 prev_prio=120 prev_state=R+ ==> next_comm=kworker/u8:2 next_pid=1333 next_prio=120

Take

trace-cm-1394 1dNh3 2535.384219: sched_waking: comm=kworker/1:2 pid=679 prio=120 target_cpu=001

for example:

1: The first character represents the CPU the event was recorded on (CPU 1).
d: Interrupts were disabled.
N: The "need_resched" flag was set.
h: The event occurred inside a hard interrupt handler.
3: The preemption-depth counter (preemption was disabled, three levels deep).

Recording with filter

trace-cmd record has -l and -n options for function filtering, just like echoing functions into set_ftrace_filter and set_ftrace_notrace:

-l filter function name
-n do not trace function

More than one -l or -n option can be specified on the command line if needed:

rpi3:/data # trace-cmd record -p function -l 'sched_*' -n 'sched_slice'

A list of functions that can be used with the function filter can be found in the file:

/sys/kernel/debug/tracing/available_filter_functions

or by below command:

trace-cmd list -f

Note that this is only useful with the function and function_graph plugins.

The events can also be filtered. To know which fields can be used for filtering a specific event, look at the format file at /sys/kernel/debug/tracing/events/<subsystem>/<event>/format, or run:

rpi3:/data # trace-cmd report --events |busybox less

on a trace.dat file that was created by the local system. The --events argument lists the event formats of all events that were available on the system that created the tracing file:

system: kmem
name: kmalloc
ID: 185
format:
	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;

	field:unsigned long call_site;	offset:8;	size:4;	signed:0;
	field:const void * ptr;	offset:12;	size:4;	signed:0;
	field:size_t bytes_req;	offset:16;	size:4;	signed:0;
	field:size_t bytes_alloc;	offset:20;	size:4;	signed:0;
	field:gfp_t gfp_flags;	offset:24;	size:4;	signed:0;

Using the kmalloc event, we can filter on all requests that were greater than 1000 bytes:

rpi3:/data # trace-cmd record -e kmalloc -f 'bytes_req > 1000'
rpi3:/data # trace-cmd report
CPU 3 is empty
cpus=4
        watchdog-713   [002]   127.479896: kmalloc:              call_site=c02d0fec ptr=0xed1d2000 bytes_req=6144 bytes_alloc=8192 gfp_flags=GFP_KERNEL_ACCOUNT|__GFP_NOWARN|__GFP_ZERO
        watchdog-713   [002]   127.479945: kmalloc:              call_site=c0855400 ptr=0xed2e5c00 bytes_req=1020 bytes_alloc=1024 gfp_flags=GFP_KERNEL
    crash_dump32-941   [000]   127.517092: kmalloc:              call_site=c02770bc ptr=0xeb157000 bytes_req=4096 bytes_alloc=4096 gfp_flags=GFP_KERNEL

The -f option specifies a filter for the event (specified in a -e option) preceding it.
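Multiple -e/-f pairs can be given on one command line; each -f applies only to the -e immediately before it:

rpi3:/data # trace-cmd record -e kmalloc -f 'bytes_req > 1000' -e kfree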

Tracing without writing to file

rpi3:/d/tracing # trace-cmd start -p function_graph -g schedule
rpi3:/d/tracing # sleep 5
rpi3:/d/tracing # trace-cmd stop

rpi3:/d/tracing # cat trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 3)               |  schedule() {
 3)   0.729 us    |    preempt_count_add();
 3)               |    rcu_note_context_switch() {
 3)   0.573 us    |      rcu_sched_qs();
 3)   0.625 us    |      rcu_preempt_qs();
 3)   9.270 us    |    }
 3)               |    _raw_spin_lock() {
 3)   0.521 us    |      preempt_count_add();
 3)   4.844 us    |    }

If we need the trace data in a file for further investigation, the extract command with the -o option can be used:

trace-cmd extract -o trace.dat

Disabling ftrace

To disable all tracing, which will ensure that no overhead is left from using the function tracers or events, the reset command can be used. It will disable all of Ftrace and bring the system back to full performance.

trace-cmd reset

View trace.dat with KernelShark

KernelShark is a front-end reader for trace-cmd output; here is what it looks like: [kernelshark screenshot]

The hollow green bar that is shown in front of some events in the task plot represents when the task was woken up from a sleeping state to when it actually ran. The hollow red bar between some events shows that the task was preempted by another task even though that task was still runnable. [wakeup-latency screenshot]

Latency can be measured with the A,B markers; the screenshot above shows that ed.touchatency with PID 1325 had a 418-microsecond wakeup latency. The same can be done for the preemption latency.

Latency - The latency is broken into 5 fields:

  • Interrupts disabled - ‘d’ if interrupts are disabled, otherwise ‘.’
  • Need reschedule - ‘N’ if the kernel was notified that a schedule is needed, otherwise ‘.’
  • In IRQ - ‘h’ if in a hard IRQ (hardware triggered), ‘s’ if in a soft IRQ (a context where the kernel initiated the IRQ handler) or if soft IRQs are disabled, ‘H’ if in a hard IRQ while soft IRQs are disabled or the hard IRQ triggered while processing a soft IRQ, otherwise ‘.’
  • Preemption counter - The value of the preemption counter. If it is other than zero, the kernel will not preempt the running task, even if a schedule has been requested by some event. If the counter is zero, ‘.’ is shown.
  • Lock depth - The depth of the big kernel lock being held. The big kernel lock is recursive (same task may acquire it multiple times). On the first acquisition, the depth is zero. This field will be zero or greater, or ‘.’ if the big kernel lock is not held. When the big kernel lock is finally removed from the kernel, this field will go away as well.

KernelShark provides filters for events and tasks via the menus, and an advanced filter option can be enabled through Filter » advanced events; refer to the Advanced Event Filter section of the official documentation.

New features in a future version of KernelShark

The author of KernelShark, Steven Rostedt, now works for VMware. He announced on VMware's Open Source Blog that version 1.0 will be released this fall with the following exciting features:

  • Completely rewritten based on Qt.
  • Improved trace file loading time: the new release (1.0) will load a gigabyte file within seconds, where the GTK version takes minutes.
  • KernelShark becomes user modifiable.
  • No more pop-ups over the mouse; instead, the data is displayed in a constant location at the top. Interaction is much faster than before, and zooming in and out is much smoother and can be done with the mouse wheel.
  • Something similar to flame graphs, showing which functions take up the most processing time on a CPU.
  • Sessions can be saved as JSON files, so that KernelShark can come back to the exact place it left off when it exited.
  • A new capture feature with no need to run as root.
  • The ability to trace guests along with the host events.

And many more…

References