1. How Kdump Works

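A kernel crash dump saves part of the contents of RAM to disk or other storage when the kernel hits an abnormal condition. When the kernel panics, it relies on the kexec mechanism to quickly boot a new kernel instance inside a memory region reserved in advance. The size of that region is set with the crashkernel kernel boot parameter.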

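To achieve this "dual-kernel" layout, Kdump uses kexec to boot into the dump-capture kernel immediately after a crash, booting "over" the currently running kernel. Because the capture kernel runs from the reserved region, the memory of the crashed kernel stays intact and can be dumped. The dump-capture kernel can be a separate Linux kernel image built for this purpose, or, on architectures that support relocatable kernels, the primary kernel image can be reused.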

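kexec (kernel execution, by analogy with the Unix/Linux exec system call) is a Linux kernel mechanism that allows a new kernel to be booted from the currently running one. kexec skips the boot loader stage and the hardware initialization performed by system firmware (BIOS or UEFI), loads the new kernel directly into main memory and starts executing it immediately. This avoids the long delay of a full reboot and helps meet high-availability requirements by minimizing downtime.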

Figure 1-1: Kdump architecture diagram

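Kdump is useful for more than analyzing kernel crashes. When studying the kernel, if we want to look at its runtime state or data structures in detail (without writing a kernel module or single-stepping with gdb), we can also take a dump with Kdump and then analyze it against the source code with the Crash tool.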

2. Installing Kdump + Crash on Ubuntu 20.04

$ sudo apt install linux-crashdump
$ sudo apt install crash

After installation, reboot the server for the change to take effect.

The files below show that the installation has already set crashkernel in the kernel boot parameters.

$ sudo cat /etc/default/grub.d/kdump-tools.cfg
GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M"

$ sudo cat /boot/grub/grub.cfg
...
menuentry 'Ubuntu' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-f278a3a6-9739-4b30-b8a1-5de870e7288a' {
	...
	linux	/vmlinuz-5.4.0-80-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro   crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M
	initrd	/initrd.img-5.4.0-80-generic
}
...

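The installation added a crashkernel setting to the kernel command line in /boot/grub/grub.cfg. The kernel uses it to choose the size of the reserved RAM region according to the amount of memory in the host.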

crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M
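
Each entry in this setting reads as "if total RAM falls in start-end, reserve size". As a rough sketch of that selection rule, the hypothetical helper pick_crashkernel below (not part of kdump-tools) walks a spec and prints the reservation for a given amount of RAM in MiB. It assumes only M and G suffixes and treats each range as start-inclusive and end-exclusive, as the kernel documentation describes; the kernel's own parser handles more cases.

```shell
# to_mib SIZE: convert a size such as 320M or 4G to MiB (only M and G handled).
to_mib() {
    case $1 in
        *G) echo $(( ${1%G} * 1024 )) ;;
        *M) echo $(( ${1%M} )) ;;
        *)  return 1 ;;
    esac
}

# pick_crashkernel SPEC RAM_MIB (hypothetical helper)
# Prints the reservation of the first range [start, end) that contains RAM_MIB.
pick_crashkernel() {
    local spec=$1 ram=$2 entry range size start end
    local IFS=,
    for entry in $spec; do
        range=${entry%%:*}          # e.g. 4G-32G
        size=${entry#*:}            # e.g. 512M
        start=$(to_mib "${range%%-*}")
        end=${range#*-}             # empty for an open-ended range such as 128G-
        if [ -n "$end" ]; then
            end=$(to_mib "$end")
        fi
        if [ "$ram" -ge "$start" ] && { [ -z "$end" ] || [ "$ram" -lt "$end" ]; }; then
            echo "$size"
            return 0
        fi
    done
    return 1                        # RAM below the first range: nothing reserved
}

# An 8 GiB host with the spec shown above
pick_crashkernel '2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M' 8192
```

For an 8 GiB host the sketch prints 512M, which agrees with the 512 MB reservation this machine reports after reboot.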

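After the server reboots, the kernel's dmesg output shows the reservation: this machine set aside 512 MB of RAM for the dump-capture kernel. The kdump-config show command reports that Kdump is in the ready state, and service kdump-tools status shows the kdump-tools service as active.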

$ sudo reboot

# Check the kernel log
$ dmesg -T | grep -i crash
[Sun Aug  1 01:07:01 2021] crashkernel reserved: 0x00000000dfe00000 - 0x00000000ffe00000 (512 MB)
[Sun Aug  1 01:07:01 2021] Kernel command line: BOOT_IMAGE=/vmlinuz-5.4.0-80-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M

$ sudo kdump-config show
DUMP_MODE:        kdump
USE_KDUMP:        1
KDUMP_SYSCTL:     kernel.panic_on_oops=1
KDUMP_COREDIR:    /var/crash
crashkernel addr: 0xdfe00000
   /var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinuz-5.4.0-80-generic
kdump initrd:
   /var/lib/kdump/initrd.img: symbolic link to /var/lib/kdump/initrd.img-5.4.0-80-generic
current state:    ready to kdump   # already in the ready state

kexec command:
  /sbin/kexec -p --command-line="BOOT_IMAGE=/vmlinuz-5.4.0-80-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro reset_devices systemd.unit=kdump-tools-dump.service nr_cpus=1" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz

# Check the boot command line
$ sudo cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.4.0-80-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M

# Check the kdump-tools service status (alternatively: service --status-all | grep kdump)
$ sudo service kdump-tools status
● kdump-tools.service - Kernel crash dump capture service
     Loaded: loaded (/lib/systemd/system/kdump-tools.service; enabled; vendor preset: enabled)
     Active: active (exited) since Sat 2021-07-31 13:31:21 UTC; 12min ago
    Process: 937 ExecStart=/etc/init.d/kdump-tools start (code=exited, status=0/SUCCESS)
   Main PID: 937 (code=exited, status=0/SUCCESS)
   
# Check the address range reserved for the crash kernel
$ cat /proc/iomem | grep -i crash
  dfe00000-ffdfffff : Crash kernel
  
# Check the size of the crash kernel reservation (in bytes)
$ sudo  cat /sys/kernel/kexec_crash_size
536870912

At this point the kdump service is in effect: if the system crashes, a dump file is generated and saved under /var/crash.
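
The two reservation checks can be cross-checked against each other. The sketch below takes the "Crash kernel" range from /proc/iomem and the byte count from /sys/kernel/kexec_crash_size, both copied from the output above, and confirms they describe the same 512 MiB region. The helper name iomem_range_bytes is made up for this illustration.

```shell
# iomem_range_bytes START-END (hypothetical helper)
# Size in bytes of an inclusive hexadecimal range as printed in /proc/iomem.
iomem_range_bytes() {
    local start=${1%%-*} end=${1#*-}
    echo $(( 0x$end - 0x$start + 1 ))
}

range=dfe00000-ffdfffff     # "Crash kernel" line from /proc/iomem
sysfs_size=536870912        # value read from /sys/kernel/kexec_crash_size

bytes=$(iomem_range_bytes "$range")
echo "iomem: $bytes bytes ($(( bytes / 1048576 )) MiB)"
if [ "$bytes" -eq "$sysfs_size" ]; then
    echo "matches kexec_crash_size"
fi
```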

Crash is a tool developed by Red Hat for analyzing dump files. Working with it feels much like debugging a snapshot of the kernel with gdb.

3. Test and Verification

The Linux SysRq facility can trigger a kernel panic by hand, which is handy for a quick test:

# sudo does not apply to a shell redirection, so write through tee instead
$ echo 1 | sudo tee /proc/sys/kernel/sysrq
$ echo c | sudo tee /proc/sysrq-trigger
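
If the capture kernel is not loaded, a forced panic produces no dump to analyze. A small guard can check /sys/kernel/kexec_crash_loaded first. The function name capture_kernel_loaded is hypothetical, and the optional file argument exists only so the check can be exercised without a real kdump setup.

```shell
# capture_kernel_loaded [FILE] (hypothetical helper)
# Succeeds only if FILE (default: /sys/kernel/kexec_crash_loaded) contains 1.
capture_kernel_loaded() {
    local f=${1:-/sys/kernel/kexec_crash_loaded}
    [ -r "$f" ] && [ "$(cat "$f")" = "1" ]
}

if capture_kernel_loaded; then
    echo "capture kernel loaded: a forced panic should produce a dump"
else
    echo "capture kernel not loaded: do not trigger a test panic yet"
fi
```

The guard only reports the state; the panic itself is still triggered by hand with the sysrq-trigger write.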

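After the panic, the capture kernel saves the dump and the machine reboots. A directory named after the crash timestamp then appears under /var/crash, containing two files, dmesg.x and dump.x: dmesg.x is the kernel log at the time of the crash, and dump.x is the dumped snapshot of the kernel.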

$ sudo ls -hl /var/crash/202107311331/
total 86M
-rw------- 1 root root 48K Jul 31 13:31 dmesg.202107311331
-rw------- 1 root root 86M Jul 31 13:31 dump.202107311331
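
When several crashes have accumulated, the newest dump can be located from the directory names alone, since the timestamp format sorts chronologically as plain text. The helper latest_dump below is a hypothetical sketch that assumes the directory/dump.<timestamp> layout shown above.

```shell
# latest_dump [ROOT] (hypothetical helper)
# Prints the path of the dump file in the newest timestamped directory under ROOT.
latest_dump() {
    local root=${1:-/var/crash} dir file
    # Names such as 202107311331 sort chronologically as strings.
    dir=$(find "$root" -mindepth 1 -maxdepth 1 -type d -name '[0-9]*' 2>/dev/null | sort | tail -n 1)
    [ -n "$dir" ] || return 1
    file="$dir/dump.${dir##*/}"
    [ -f "$file" ] || return 1
    echo "$file"
}

latest_dump /var/crash || echo "no dump found under /var/crash"
```

Reading /var/crash normally requires root, so on a real system the function would be run from a root shell.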

To use Crash we also need a vmlinux file with debug information, installed as follows:

# Set up the debug-symbol (ddebs) repository
$ echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse
 deb http://ddebs.ubuntu.com $(lsb_release -cs)-updates main restricted universe multiverse
 deb http://ddebs.ubuntu.com $(lsb_release -cs)-proposed main restricted universe multiverse" | sudo tee -a /etc/apt/sources.list.d/ddebs.list

$ sudo apt install ubuntu-dbgsym-keyring
$ sudo apt-get update

$ sudo apt -y install linux-image-$(uname -r)-dbgsym
The following additional packages will be installed:
  linux-image-unsigned-5.4.0-80-generic-dbgsym
The following NEW packages will be installed:
  linux-image-5.4.0-80-generic-dbgsym linux-image-unsigned-5.4.0-80-generic-dbgsym
0 upgraded, 2 newly installed, 0 to remove and 63 not upgraded.
Need to get 896 MB of archives.
After this operation, 6,225 MB of additional disk space will be used.
Get:1 http://ddebs.ubuntu.com focal-updates/main arm64 linux-image-unsigned-5.4.0-80-generic-dbgsym arm64 5.4.0-80.90 [896 MB]
2% [1 linux-image-unsigned-5.4.0-80-generic-dbgsym 24.8 MB/896 MB 3%]

# After installation, check the file
$ sudo  ls -hl /usr/lib/debug/boot/
total 350M
-rw-r--r-- 1 root root 350M Jul  9 15:49 vmlinux-5.4.0-80-generic

Once the linux-image-$(uname -r)-dbgsym package has been installed, the file vmlinux-5.4.0-80-generic is present in the directory /usr/lib/debug/boot/.
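
One detail of the repository setup is worth noting: tee -a appends, so running that step twice leaves duplicate entries in ddebs.list. A sketch of an alternative is to generate the three lines from the release codename and overwrite the file each time. The function name ddebs_lines is made up here; the repository lines are the ones used above.

```shell
# ddebs_lines CODENAME (hypothetical helper)
# Prints the three ddebs repository lines for one Ubuntu release codename.
ddebs_lines() {
    local codename=$1 suite
    for suite in "$codename" "$codename-updates" "$codename-proposed"; do
        echo "deb http://ddebs.ubuntu.com $suite main restricted universe multiverse"
    done
}

# Preview for Ubuntu 20.04, whose codename is focal
ddebs_lines focal
```

Piping ddebs_lines "$(lsb_release -cs)" into sudo tee /etc/apt/sources.list.d/ddebs.list (without -a) then rewrites the file instead of appending to it.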

Everything is now in place, and we can start debugging with Crash:

$ sudo crash /usr/lib/debug/boot/vmlinux-5.4.0-80-generic /var/crash/202107311331/dump.202107311331
...
      KERNEL: /usr/lib/debug/boot/vmlinux-5.4.0-80-generic
    DUMPFILE: /var/crash/202107311331/dump.202107311331  [PARTIAL DUMP]
        CPUS: 8
        DATE: Sat Jul 31 13:30:59 2021
      UPTIME: 00:05:02
LOAD AVERAGE: 0.74, 0.54, 0.25
       TASKS: 393
    NODENAME: headfirstbpf
     RELEASE: 5.4.0-80-generic
     VERSION: #90-Ubuntu SMP Fri Jul 9 17:43:26 UTC 2021
     MACHINE: aarch64  (unknown Mhz)
      MEMORY: 8 GB
       PANIC: "Kernel panic - not syncing: sysrq triggered crash"
         PID: 8139
     COMMAND: "bash"
        TASK: ffff0001e3d7bc00  [THREAD_INFO: ffff0001e3d7bc00]
         CPU: 6
       STATE: TASK_RUNNING (PANIC)

# Use the bt command to view the stack at the time of the crash
crash> bt
PID: 8139   TASK: ffff0001e3d7bc00  CPU: 6   COMMAND: "bash"
 #0 [ffff8000140eba00] machine_kexec at ffff8000100aba84
 #1 [ffff8000140eba60] __crash_kexec at ffff8000101d4e44
 #2 [ffff8000140ebbf0] panic at ffff800010df9c94
 #3 [ffff8000140ebcd0] sysrq_handle_crash at ffff80001089a9fc
 #4 [ffff8000140ebce0] __handle_sysrq at ffff80001089b3fc
 #5 [ffff8000140ebd30] write_sysrq_trigger at ffff80001089babc
 #6 [ffff8000140ebd50] proc_reg_write at ffff800010459d74
 #7 [ffff8000140ebd90] __vfs_write at ffff8000103974b8
 #8 [ffff8000140ebdc0] vfs_write at ffff800010398794
 #9 [ffff8000140ebe00] ksys_write at ffff80001039b6a0
#10 [ffff8000140ebe50] __arm64_sys_write at ffff80001039b754
#11 [ffff8000140ebe70] el0_svc_common.constprop.0 at ffff80001009e958
#12 [ffff8000140ebea0] el0_svc_handler at ffff80001009ea9c
#13 [ffff8000140ebff0] el0_svc at ffff80001008464c
     PC: 0000ffff80556ed0   LR: 0000ffff8050329c   SP: 0000ffffeb06cfd0
    X29: 0000ffffeb06cfd0  X28: 0000aaaabf405000  X27: 0000000000000000
    X26: 0000aaaabf3cc000  X25: 0000ffff805fe630  X24: 0000000000000002
    X23: 0000aaaaeadebaa0  X22: 0000ffff80679710  X21: 0000ffff805fe548
    X20: 0000aaaaeadebaa0  X19: 0000000000000001  X18: 0000000000000000
    X17: 0000ffff804ffc20  X16: 0000ffff805046a0  X15: 000000007fffffde
    X14: 0000000000000000  X13: 0000000000000000  X12: 0000000000000000
    X11: 0000ffffeb06cf98  X10: 0000000000000001   X9: 00000000ffffff80
     X8: 0000000000000040   X7: 0000000000000063   X6: 0000000000000063
     X5: 0000000155510004   X4: 000000000000000a   X3: 0000ffff80678f10
     X2: 0000000000000002   X1: 0000aaaaeadebaa0   X0: 0000000000000001
    ORIG_X0: 0000000000000001  SYSCALLNO: 40  PSTATE: 20001000

Entering the bt command here shows the kernel stack at the moment of the crash.

4. Using Crash Subcommands

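Subcommands behave much like commands in bash: their output can be redirected to a file or piped through grep, awk and similar tools, which makes analysis very convenient.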

For the exact syntax, run man subcommand to see a subcommand's detailed usage (inside Crash, man is an alias for help).
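
As an illustration of that kind of post-processing, the sketch below reduces a backtrace to the bare call chain. It assumes the frame layout shown in the bt output earlier (#N [stack address] function at address); the function name bt_functions is hypothetical, and the sample input is an excerpt of that output.

```shell
# bt_functions (hypothetical helper)
# Reads crash "bt" output on stdin and prints one function name per frame.
bt_functions() {
    awk '$1 ~ /^#[0-9]+$/ && $4 == "at" { print $3 }'
}

bt_functions <<'EOF'
PID: 8139   TASK: ffff0001e3d7bc00  CPU: 6   COMMAND: "bash"
 #0 [ffff8000140eba00] machine_kexec at ffff8000100aba84
 #1 [ffff8000140eba60] __crash_kexec at ffff8000101d4e44
 #2 [ffff8000140ebbf0] panic at ffff800010df9c94
 #3 [ffff8000140ebcd0] sysrq_handle_crash at ffff80001089a9fc
EOF
```

The same awk filter can be applied to a backtrace saved to a file from within a Crash session.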

bt

Displays the kernel stack backtrace and register state of a process.

crash> bt 4468
PID: 4468   TASK: ffff0001e7fa4b00  CPU: 2   COMMAND: "containerd"
 #0 [ffff800014cf3b30] __switch_to at ffff800010089960
 #1 [ffff800014cf3b60] __schedule at ffff800010e0fdc0
 #2 [ffff800014cf3bf0] schedule at ffff800010e102c4
 #3 [ffff800014cf3c10] futex_wait_queue_me at ffff8000101c0e58
 #4 [ffff800014cf3c60] futex_wait at ffff8000101c3540
 #5 [ffff800014cf3da0] do_futex at ffff8000101c6868
 #6 [ffff800014cf3df0] __arm64_sys_futex at ffff8000101c6aa4
 #7 [ffff800014cf3e70] el0_svc_common.constprop.0 at ffff80001009e958
 #8 [ffff800014cf3ea0] el0_svc_handler at ffff80001009ea9c
 #9 [ffff800014cf3ff0] el0_svc at ffff80001008464c
     PC: 0000aaaabcfcdce8   LR: 0000aaaabcf9bb60   SP: 0000ffff873207e0
    X29: 0000ffff873207d8  X28: 0000004000000900  X27: 0000aaaabf01af80
    X26: 0000aaaabe44e130  X25: 0000ffffe6a26498  X24: 0000000000001000
    X23: 0000000000000000  X22: 0000ffffe6a2637f  X21: 0000004000076380
    X20: 0000ffff87320800  X19: 0000aaaabcfa2928  X18: 0000ffff896e6a70
    X17: 0000000000000118  X16: 0000ffff87320898  X15: 0000000000000000
    X14: 0000000000000002  X13: 0000000000000001  X12: 00000044c17200e9
    X11: 003655cd891a685f  X10: 0000000000000018   X9: 000000000001eef0
     X8: 0000000000000062   X7: 0000000029aab5ba   X6: 0000ffff8976f010
     X5: 0000000000000000   X4: 0000000000000000   X3: 0000ffff87320818
     X2: 0000000000000000   X1: 0000000000000080   X0: 0000aaaabeffde70
    ORIG_X0: 0000aaaabeffde70  SYSCALLNO: 62  PSTATE: 80001000

files

files pid shows details of the files opened by the specified process.

crash> files 4468
PID: 4468   TASK: ffff0001e7fa4b00  CPU: 2   COMMAND: "containerd"
ROOT: /    CWD: /var/snap/var/snap/docker/800
 FD       FILE            DENTRY           INODE       TYPE PATH
  0 ffff0001e644d000 ffff0001f2043a80 ffff0001f166b310 CHR  /dev/null
  1 ffff0001e40c9e00 ffff000194e75cc0 ffff0001f2e72440 SOCK UNIX
  2 ffff0001e40c9e00 ffff000194e75cc0 ffff0001f2e72440 SOCK UNIX

task

Displays the task_struct structure of a process.

crash> task -x  4468
PID: 4468   TASK: ffff0001e7fa4b00  CPU: 2   COMMAND: "containerd"
struct task_struct {
  thread_info = {
    flags = 0x0,
    addr_limit = 0xffffffffffff,
    ttbr0 = 0xdff8000226569000,
    {
      preempt_count = 0x100000000,
      preempt = {
        count = 0x0,
        need_resched = 0x1
      }
    }
  },
  state = 0x1,

To see only selected fields, name them with -R; multiple fields are separated by commas:

crash> task -x -R files,state  4468
PID: 4468   TASK: ffff0001e7fa4b00  CPU: 2   COMMAND: "containerd"
  files = 0xffff0001911d5b80,
  state = 0x1,

struct

The struct command shows the fields of a structure in detail. Add the -o option to see each field's offset.

crash> struct task_struct -o
struct task_struct {
     [0] struct thread_info thread_info;
    [32] volatile long state;
    [40] void *stack;
    [48] refcount_t usage;

If you know which data structure lives at a given address, struct can print it as well:

crash> struct task_struct ffff0001e7fa4b00
struct task_struct {
  thread_info = {
    flags = 0,
    addr_limit = 281474976710655,
    ttbr0 = 16138649273915314176,
    {
      preempt_count = 4294967296,
      preempt = {
        count = 0,
        need_resched = 1
      }
    }
  },

ps

The ps command lists all processes in the system. The ST column is the task state: RU = "Running", IN = "Interruptible", UN = "Uninterruptible", ID = "Idle". The TASK column is the address of the task_struct.

crash> ps
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
>     0      0   0  ffff800011b82e40  RU   0.0       0      0  [swapper/0]
>     0      0   1  ffff0001f10ce900  RU   0.0       0      0  [swapper/1]
>     0      0   2  ffff0001f10cda00  RU   0.0       0      0  [swapper/2]
>     0      0   3  ffff0001f10cad00  RU   0.0       0      0  [swapper/3]

crash> ps | grep RU  # show only tasks in the RU state
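
Beyond filtering for one state, saved ps output can be summarized per state. The sketch below is one way to do that, assuming the column layout shown above, where ST is the fifth column once the leading > marker of active tasks is removed. The function name ps_state_counts is hypothetical.

```shell
# ps_state_counts (hypothetical helper)
# Reads crash "ps" output on stdin and prints "STATE COUNT" pairs, sorted by state.
ps_state_counts() {
    sed 's/^>/ /' |
        awk '$1 ~ /^[0-9]+$/ { n[$5]++ } END { for (s in n) print s, n[s] }' |
        sort
}

ps_state_counts <<'EOF'
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
>     0      0   0  ffff800011b82e40  RU   0.0       0      0  [swapper/0]
>     0      0   1  ffff0001f10ce900  RU   0.0       0      0  [swapper/1]
>     0      0   2  ffff0001f10cda00  RU   0.0       0      0  [swapper/2]
>     0      0   3  ffff0001f10cad00  RU   0.0       0      0  [swapper/3]
EOF
```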

vm

vm shows the virtual memory of the specified process.

crash> vm 4468
PID: 4468   TASK: ffff0001e7fa4b00  CPU: 2   COMMAND: "containerd"
       MM               PGD          RSS    TOTAL_VM
ffff0001e7b82940  ffff0001e6569000  44256k  1180404k
      VMA           START       END     FLAGS FILE
ffff0001e54636c0 4000000000 4000800000 40100073
ffff0001e369edd0 4000800000 4004000000 100073
ffff0001edf4b450 aaaabc72c000 aaaabdec2000    875 /snap/snap/docker/800/bin/containerd
ffff0001edf4b6c0 aaaabded1000 aaaabef81000 100871 /snap/snap/docker/800/bin/containerd
ffff0001e5463ee0 aaaabef81000 aaaabeff3000 100873 /snap/snap/docker/800/bin/containerd

irq

The irq command shows interrupt information.

crash> irq
 IRQ   IRQ_DESC/_DATA      IRQACTION      NAME
  0       (unused)          (unused)
  1   ffff0001fc4f3000      (unused)
  2   ffff0001fc4f0c00  ffff0001fc5f3980  "arch_timer"
  3   ffff0001fc4f0400      (unused)
  4   ffff0001f13d3e00  ffff0001e4fab100  "uart-pl011"

kmem

kmem shows system memory information.

crash> kmem -i
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM  1901306       7.3 GB         ----
         FREE   948095       3.6 GB   49% of TOTAL MEM
         USED   953211       3.6 GB   50% of TOTAL MEM
       SHARED   323270       1.2 GB   17% of TOTAL MEM
      BUFFERS    53277     208.1 MB    2% of TOTAL MEM
       CACHED   620919       2.4 GB   32% of TOTAL MEM
         SLAB    68520     267.7 MB    3% of TOTAL MEM

   TOTAL HUGE        0            0         ----
    HUGE FREE        0            0    0% of TOTAL HUGE

   TOTAL SWAP  1048575         4 GB         ----
    SWAP USED        0            0    0% of TOTAL SWAP
    SWAP FREE  1048575         4 GB  100% of TOTAL SWAP

 COMMIT LIMIT  1999228       7.6 GB         ----
    COMMITTED   671869       2.6 GB   33% of TOTAL LIMIT
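
The PAGES and TOTAL columns are related through the page size. The output does not state the page size, so the sketch below assumes 4 KiB pages, which is consistent with the rows above (for example 53277 pages against 208.1 MB for BUFFERS). The function name pages_to_mib is hypothetical.

```shell
# pages_to_mib PAGES [PAGE_KIB] (hypothetical helper)
# Converts a page count to whole MiB; PAGE_KIB defaults to an assumed 4 KiB page.
pages_to_mib() {
    local pages=$1 page_kib=${2:-4}
    echo $(( pages * page_kib / 1024 ))
}

pages_to_mib 53277      # BUFFERS row: prints 208
pages_to_mib 1901306    # TOTAL MEM row: prints 7426 (about 7.3 GiB)
```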
