Kernel Debugging With Kgdb

In Kernel memory debugging techniques, we used the decode_stacktrace.sh script to translate addresses into source lines, or addr2line when only one entry in the stack is of interest. That is sometimes enough, but in most cases we need to dig deeper to figure out what is going on in the kernel: which variables are involved, or what the registers hold. For that we need online kernel debugging, which is what kgdb/kdb is designed for.

Both kgdb and kdb are kernel debugger front ends that interface to the kernel debug core, and you can switch between them if necessary.

kdb is a shell-like debugger: you can use it to dump memory contents, print backtraces, run lsmod, and so on. kdb is not a source-level debugger. Here is the full list of commands supported by the kernel (v4.19):

Command     Usage                      Description
bc          <bpnum>                    Clear Breakpoint
be          <bpnum>                    Enable Breakpoint
bd          <bpnum>                    Disable Breakpoint
bl          [<vaddr>]                  Display breakpoints
bp          [<vaddr>]                  Set/Display breakpoints
bt          [<vaddr>]                  Stack traceback
btp         <pid>                      Display stack for process <pid>
bta         [D|R|S|T|C|Z|E|U|I|M|A]    Backtrace all processes matching state flag
btc                                    Backtrace current process on each cpu
btt         <vaddr>                    Backtrace process given its struct task address
cpu         <cpunum>                   Switch to new cpu
defcmd      name "usage" "help"        Define a set of commands, down to endefcmd
dmesg       [lines]                    Display kernel log
dumpcommon                             Common kdb debugging
dumpall                                First line debugging
dumpcpu                                Same as dumpall but only tasks on cpus
ef          <vaddr>                    Display exception frame
env                                    Show environment variables
ftdump      [skip_#lines] [cpu]        Dump ftrace log
go          [<vaddr>]                  Continue Execution
grephelp                               Display help on | grep
help(?)                                Display Help Message
kgdb                                   Enter kgdb mode
kill        <-signal> <pid>            Send a signal to a process
lsmod                                  List loaded kernel modules
md          <vaddr>                    Display Memory Contents, also mdWcN, e.g. md8c1
mdr         <vaddr> <bytes>            Display Raw Memory
mdp         <paddr> <bytes>            Display Physical Memory
mds         <vaddr>                    Display Memory Symbolically
mm          <vaddr> <contents>         Modify Memory Contents
ps          [<flags>|A]                Display active task list
pid         <pidnum>                   Switch to another task
per_cpu     <sym> [<bytes>] [<cpu>]    Display per_cpu variables
rd                                     Display Registers
rm          <reg> <contents>           Modify Registers
reboot                                 Reboot the machine immediately
set                                    Set environment variables
sr          <key>                      Magic SysRq key
ss                                     Single Step
summary                                Summarize the system

kgdb serves as a gdb server inside the kernel, so you must use it together with gdb; kgdb itself knows nothing about the Linux kernel.

Build kernel with appropriate configure options

In order to make full use of kdb/kgdb, the following options are recommended:

CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_HAVE_ARCH_KGDB=y
CONFIG_FRAME_POINTER=y
CONFIG_KGDB=y
CONFIG_KGDB_SERIAL_CONSOLE=y
CONFIG_KGDB_KDB=y
CONFIG_KDB_DEFAULT_ENABLE=0x1
CONFIG_KDB_KEYBOARD=y
CONFIG_KDB_CONTINUE_CATASTROPHIC=0
CONFIG_MAGIC_SYSRQ=y
CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE=0x1
CONFIG_MAGIC_SYSRQ_SERIAL=y
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_INFO_DWARF4=y
CONFIG_CONSOLE_POLL=y
CONFIG_GDB_SCRIPTS=y
# CONFIG_STRICT_KERNEL_RWX is not set

You can check the configuration of your running kernel with zcat /proc/config.gz (this file exists only when CONFIG_IKCONFIG_PROC is enabled).

Build gdb client for arm64 (optional)

This part is optional: prefer the gdb shipped with your toolchain, and build your own only if needed. Download the latest version from the GNU website, and make sure to specify the right target and to enable Python support:

mkdir build
cd build
../configure --target=aarch64-linux-gnu --with-python=/usr/bin/python
make
sudo make install

Using kdb

Before using kdb, the first thing to do is enable it by registering the kgdb I/O driver:

echo ttyS0 >/sys/module/kgdboc/parameters/kgdboc
[  100.232070] KGDB: Registered I/O driver kgdboc

When you are done, you can disable it by echoing an empty string to kgdboc:

echo "" >/sys/module/kgdboc/parameters/kgdboc
[  102.221318] KGDB: Unregistered I/O driver kgdboc, debugger disabled

kgdboc stands for kgdb over console.

In this section, we will use the example below for oops analysis:

#include <linux/module.h>

static noinline int hello_oops_init(void)
{
    printk("hello oops\n");

    *(int*)0x150912 = 0x5a5a;

    return 0;
}

static void hello_oops_exit(void)
{
    printk("goodbye oops\n");
}

module_init(hello_oops_init);
module_exit(hello_oops_exit);

MODULE_AUTHOR("oops");
MODULE_DESCRIPTION("oops example");
MODULE_LICENSE("GPL");

When an oops or panic occurs, the kernel enters kdb automatically:

[   82.581270] oops: loading out-of-tree module taints kernel.
[   82.619378] hello oops
[   82.622143] Unable to handle kernel paging request at virtual address dfffff900002a122
[   82.637165] Mem abort info:
[   82.647074]   ESR = 0x96000004
[   82.654150]   Exception class = DABT (current EL), IL = 32 bits
[   82.670107]   SET = 0, FnV = 0
[   82.676473]   EA = 0, S1PTW = 0
[   82.681947] Data abort info:
[   82.685457]   ISV = 0, ISS = 0x00000004
[   82.689841]   CM = 0, WnR = 0
[   82.693166] [dfffff900002a122] address between user and kernel address ranges
[   82.701019] Internal error: Oops: 96000004 [#1] PREEMPT SMP

Entering kdb (current=0xffffffc02a00bd00, pid 727) on processor 3 Oops: (null)
due to oops @ 0xffffff9001c50044
CPU: 3 PID: 727 Comm: insmod Tainted: G    B      O      4.19.108-v8+ #28
Hardware name: Raspberry Pi 3 Model B Rev 1.2 (DT)
pstate: 80000005 (Nzcv daif -PAN -UAO)
pc : hello_oops_init+0x44/0x8c [oops]
lr : hello_oops_init+0x2c/0x8c [oops]
sp : ffffffc02da0f8c0
x29: ffffffc02da0f8c0 x28: dfffff9000000000
x27: ffffff9001c523f8 x26: ffffff900af91b40
x25: 0000000000000000 x24: ffffffc02a00bd00
x23: ffffff900af91b08 x22: ffffffc02a00bd10
x21: 1ffffff805b41f28 x20: ffffff9001c50000
x19: 0000000000150912 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000
x15: 0000000000000000 x14: ffffff90080b49f4
x13: ffffff90082bb190 x12: ffffff820170284c
x11: 1ffffff20170284b x10: ffffff820170284b
x9 : 0000000000000000 x8 : dfffff9000000000
x7 : ffffff820170284c x6 : ffffff900b814259
x5 : ffffffc02a00bd00 x4 : 0000000000000000
more>

Using kgdb

gdb talks to kgdb over the serial port. If you want to use the serial console and kgdb at the same time, you need a proxy. I tried both kdmx and agent-proxy and found that agent-proxy works better for me, although Doug Anderson says in his ELC19 talk that kdmx is more reliable. I am using Ubuntu 18.04 Desktop, so the environment may matter. Set up agent-proxy as follows:

agent-proxy 4440^4441 0 /dev/ttyUSB0,115200

Then use telnet to connect to serial console:

telnet localhost 4440

To quit telnet session, press Ctrl-] then type quit.

Before using kgdb, you need to switch from kdb to kgdb mode with the kgdb command:

[2]kdb> kgdb
Entering please attach debugger or use $D#44+ or $3#33

NOTE: If the kgdboc parameter is set on the kernel command line, gdb can connect to kgdb directly without entering the kgdb command first. With kgdboc set on the command line, I noticed this prompt:

[    8.213562] KGDB: Waiting for connection from remote gdb...

Now we are ready to use gdb to debug the kernel and its modules:

cd /opt/lineageos/kernel/kernel_rpi
aarch64-linux-gnu-gdb samples/hello/oops.o -ex "target remote localhost:4441"

Then list the source code around pc to get the line number:

(gdb) list *(hello_oops_init+0x44)
0xe4 is in hello_oops_init (samples/hello/oops.c:7).
2
3	static noinline int hello_oops_init(void)
4	{
5	    printk("hello oops\n");
6
7	    *(int*)0x150912 = 0x5a5a;
8
9	    return 0;
10	}

To get back to kdb mode, either blindly type $3#33 or send a maintenance packet from gdb:

(gdb) maintenance packet 3
sending: "3"
received: "OK"

Enable kgdb on boot

To make this work on a Raspberry Pi 3B, add the line below to config.txt in the boot partition:

dtoverlay=pi3-disable-bt

And add kernel parameters to cmdline.txt:

kgdboc=serial0,115200 kgdbwait nokaslr

There is a slab-out-of-bounds bug in the dwc_otg driver. With KASAN enabled, the kernel reports it while booting and then stops in the debugger:

[    6.623132] ==================================================================
[    6.649052] console [ttyAMA0] enabled
[    6.655946] BUG: KASAN: slab-out-of-bounds in dwc_otg_hcd_is_bandwidth_allocated+0x68/0x70
[    6.656005] Read of size 8 at addr ffffffc02f6b3a68 by task kworker/1:1/34
[    6.656044]
[    6.677007] CPU: 1 PID: 34 Comm: kworker/1:1 Not tainted 4.19.108-v8+ #57
[    6.677042] Hardware name: Raspberry Pi 3 Model B Rev 1.2 (DT)
[    6.677124] Workqueue: events_power_efficient hub_init_func3
[    6.703594] Call trace:
[    6.703655]  dump_backtrace+0x0/0x3f8
[    6.703706]  show_stack+0x28/0x38
[    6.703776]  dump_stack+0x100/0x168
[    6.703864]  print_address_description+0x58/0x2b0
[    6.722722]  kasan_report+0x174/0x2f8
[    6.722786]  __asan_report_load8_noabort+0x30/0x40
[    6.722869]  dwc_otg_hcd_is_bandwidth_allocated+0x68/0x70
[    6.722943]  dwc_otg_urb_enqueue+0x6c0/0xd78
[    6.723045]  usb_hcd_submit_urb+0x1d8/0x19c8
[    6.745473] mmc-bcm2835 3f300000.mmc: mmc_debug:0 mmc_debug2:0
[    6.746319]  usb_submit_urb+0x560/0x11e8
[    6.746386]  hub_activate+0xa08/0x1348
[    6.746452]  hub_init_func3+0x28/0x38
[    6.746541]  process_one_work+0x6c0/0x1360
[    6.755999] mmc-bcm2835 3f300000.mmc: DMA channel allocated
[    6.766139]  worker_thread+0x400/0xe70
[    6.766208]  kthread+0x278/0x350
[    6.766271]  ret_from_fork+0x10/0x18
[    6.766310]
[    7.823149] Allocated by task 34:
[    7.826573]  kasan_kmalloc.part.0+0x44/0x108
[    7.830935]  kasan_kmalloc+0xb0/0xc8
[    7.834611]  __kmalloc+0x170/0x358
[    7.838118]  usb_get_configuration+0x1a0c/0x48d8
[    7.842841]  usb_new_device+0x89c/0xf68
[    7.846779]  hub_event+0x14fc/0x2f48
[    7.850452]  process_one_work+0x6c0/0x1360
[    7.854644]  worker_thread+0x400/0xe70
[    7.858494]  kthread+0x278/0x350
[    7.861821]  ret_from_fork+0x10/0x18
[    7.865445]
[    7.866999] Freed by task 0:
[    7.869932] (stack is not available)
[    7.873556]
[    7.875133] The buggy address belongs to the object at ffffffc02f6b3a00
[    7.875133]  which belongs to the cache kmalloc-128 of size 128
[    7.887781] The buggy address is located 104 bytes inside of
[    7.887781]  128-byte region [ffffffc02f6b3a00, ffffffc02f6b3a80)
[    7.899604] The buggy address belongs to the page:
[    7.904492] page:ffffffbf00bdacc0 count:1 mapcount:0 mapping:ffffffc031403c00 index:0x0
[    7.912579] flags: 0x200(slab)
[    7.915766] raw: 0000000000000200 dead000000000100 dead000000000200 ffffffc031403c00
[    7.923634] raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
[    7.931440] page dumped because: kasan: bad access detected
[    7.937061]
[    7.938607] Memory state around the buggy address:
[    7.943494]  ffffffc02f6b3900: 00 00 00 00 00 00 00 00 00 00 fc fc fc fc fc fc
[    7.950810]  ffffffc02f6b3980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[    7.958126] >ffffffc02f6b3a00: 00 00 00 00 00 00 00 00 00 00 fc fc fc fc fc fc
[    7.965412]                                                           ^
[    7.972119]  ffffffc02f6b3a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[    7.979434]  ffffffc02f6b3b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[    7.986716] ==================================================================
[    7.993995] Disabling lock debugging due to kernel taint
...
[    8.207453] KGDB: Registered I/O driver kgdboc
[    8.213562] KGDB: Waiting for connection from remote gdb...

Entering kdb (current=0xffffffc031588000, pid 1) on processor 2 due to Keyboard Entry
[2]kdb>

Example: slab-out-of-bounds

The KASAN report says the out-of-bounds read happened at dwc_otg_hcd_is_bandwidth_allocated+0x68, so let's check what is there:

# aarch64-linux-gnu-gdb vmlinux -ex "target remote localhost:4441"

(gdb) list *(dwc_otg_hcd_is_bandwidth_allocated+0x68)
0xffffff900914eba8 is in dwc_otg_hcd_is_bandwidth_allocated (drivers/usb/host/dwc_otg/dwc_otg_hcd.c:4083).
4078	{
4079		int allocated = 0;
4080		dwc_otg_qh_t *qh = (dwc_otg_qh_t *) ep_handle;
4081
4082		if (qh) {
4083			if (!DWC_LIST_EMPTY(&qh->qh_list_entry)) {
4084				allocated = 1;
4085			}
4086		}
4087		return allocated;

The relevant definition is in drivers/usb/host/dwc_common_port/dwc_list.h:

#define DWC_LIST_FIRST(link)	((link)->next)
#define DWC_LIST_END(link)	(link)
#define DWC_LIST_EMPTY(link)	\
	(DWC_LIST_FIRST(link) == DWC_LIST_END(link))

Set a breakpoint to see what we have in qh_list_entry:

(gdb) l dwc_otg_hcd.c:4082
4077	__attribute__((optimize("O0"))) int dwc_otg_hcd_is_bandwidth_allocated(dwc_otg_hcd_t * hcd, void *ep_handle)
4078	{
4079		int allocated = 0;
4080		dwc_otg_qh_t *qh = (dwc_otg_qh_t *) ep_handle;
4081
4082		if (qh) {
4083			if (!DWC_LIST_EMPTY(&qh->qh_list_entry)) {
4084				allocated = 1;
4085			}
4086		}

(gdb) b dwc_otg_hcd.c:4082
Breakpoint 1 at 0xffffff900914eb70: file drivers/usb/host/dwc_otg/dwc_otg_hcd.c, line 4082.
(gdb) c
Continuing.
[Switching to Thread 36]

Thread 40 hit Breakpoint 1, dwc_otg_hcd_is_bandwidth_allocated (hcd=0xffffffc02fa6c380, ep_handle=0xffffffc02f682828)
    at drivers/usb/host/dwc_otg/dwc_otg_hcd.c:4082
4082		if (qh) {
(gdb) p &qh->qh_list_entry
$1 = (dwc_list_link_t *) 0xffffffc02f682868
(gdb) p qh->qh_list_entry
$2 = {next = 0x0, prev = 0x0}

The KASAN report from this run shows the same address:

[    6.577056] Read of size 8 at addr ffffffc02f682868 by task kworker/2:2/84

[    7.686121] The buggy address belongs to the object at ffffffc02f682800
[    7.686121]  which belongs to the cache kmalloc-128 of size 128
[    7.698765] The buggy address is located 104 bytes inside of
[    7.698765]  128-byte region [ffffffc02f682800, ffffffc02f682880)

The faulting address points to qh_list_entry.

From the result above, we know qh_list_entry was not initialized, so KASAN is triggered when its member is read.

Troubleshooting

Timeout when connecting to kgdb

Q: Connecting with aarch64-linux-gnu-gdb vmlinux -ex "target remote localhost:4441" times out:

Reading symbols from vmlinux...done.
Remote debugging using localhost:4441
Ignoring packet error, continuing...
warning: unrecognized item "timeout" in "qSupported" response
Ignoring packet error, continuing...
Remote replied unexpectedly to 'vMustReplyEmpty': timeout

A: Make sure the kernel is in kgdb mode, for example by running the kgdb command at the kdb prompt first.

Auto-loading has been declined by…

Q: Auto-loading has been declined by…

warning: File "/opt/lineageos/kernel/kernel_rpi/scripts/gdb/vmlinux-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
	add-auto-load-safe-path /opt/lineageos/kernel/kernel_rpi/scripts/gdb/vmlinux-gdb.py
line to your configuration file "/home/fdbai/.gdbinit".
To completely disable this security protection add
	set auto-load safe-path /
line to your configuration file "/home/fdbai/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
	info "(gdb)Auto-loading safe path"
Remote debugging using localhost:4441
0xffffff832b40901c in ?? ()
(gdb) list *(do_oops+0x18)
No symbol "do_oops" in current context.

A: Add the following to $HOME/.gdbinit:

add-auto-load-safe-path /opt/lineageos/kernel/kernel_rpi

gdb reports no symbol when trying to list source code

Q: Source code not shown with list command.

(gdb) list *(do_oops+0x18)
No symbol "do_oops" in current context.

A: Make sure you have loaded the right file. Do not load vmlinux if you are debugging a kernel module:

# load module object
aarch64-linux-gnu-gdb samples/hello/oops.o -ex "target remote localhost:4441"

How to exit gdb layout window

Q: After entering tui, how can I get out of TUI window?

A: Press Ctrl+x followed by a.

You can find more key bindings in section 25.2, TUI Key Bindings, of the GDB manual.

The variable was optimized out

Q: When I try to print some variables, gdb prints <optimized out> instead of the value. Why?

(gdb) p qh
$1 = <optimized out>

A: To avoid code optimization, you can use either a GCC pragma:

+#pragma GCC optimize ("O0")

or a GCC attribute:

-int dwc_otg_hcd_is_bandwidth_allocated(dwc_otg_hcd_t * hcd, void *ep_handle)
+__attribute__((optimize("O0"))) int dwc_otg_hcd_is_bandwidth_allocated(dwc_otg_hcd_t * hcd, void *ep_handle)

There are other ways to recover the value of an optimized-out variable; undo has a great writeup on how to do this with the help of the debugging data in ELF files.

Another interesting project is Mozilla rr. The aarch64 architecture is not supported yet, but fortunately Keno has made some progress on this; see this issue for updated progress.

References