This article is available at: https://www.ebpf.top/post/bpf_capabilities_debug

Author: kira skyler

Introduction

In Linux, capabilities are a permission mechanism that splits the all-or-nothing privileges of root into multiple independent permission bits. This way, a user or process can be granted only the specific permissions a task needs, rather than full root privileges.

Capabilities are organized into several sets: the Inheritable, Permitted, Effective, Bounding, and Ambient sets. Each set controls the permissions of a process or thread in a different scenario. The sets can change in several situations: switching users may leave the process with a different set of capabilities, and creating a child process or executing a new program transforms the sets according to specific rules.
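The rules applied when a new program is executed are documented in capabilities(7). A minimal sketch of them in C, for a capability-aware binary, might look like the following. The type and function names (`cap_mask`, `proc_caps`, `file_caps`, `caps_after_exec`, `root_file_caps`) are made up for illustration and do not come from the kernel or from libcap.

```c
#include <stdbool.h>
#include <stdint.h>

/* One bit per capability, numbered as in <linux/capability.h> (CAP_CHOWN is bit 0). */
typedef uint64_t cap_mask;

struct proc_caps {            /* capability sets of a thread */
    cap_mask inheritable, permitted, effective, bounding, ambient;
};

struct file_caps {            /* capabilities attached to the executable */
    cap_mask inheritable, permitted;
    bool effective;           /* a single flag in the file, not a set */
};

/* For a process whose real or effective uid is 0, the file sets are treated
 * as all ones; with an effective uid of 0 the file effective flag counts as set. */
struct file_caps root_file_caps(void)
{
    struct file_caps f = { ~(cap_mask)0, ~(cap_mask)0, true };
    return f;
}

/* Apply the execve() transformation described in capabilities(7).
 * file_is_privileged: the file is set-user-ID/set-group-ID or has file capabilities. */
struct proc_caps caps_after_exec(struct proc_caps p, struct file_caps f,
                                 bool file_is_privileged)
{
    struct proc_caps n;
    n.ambient     = file_is_privileged ? 0 : p.ambient;
    n.permitted   = (p.inheritable & f.inheritable)
                  | (f.permitted & p.bounding)
                  | n.ambient;
    n.effective   = f.effective ? n.permitted : n.ambient;
    n.inheritable = p.inheritable;   /* unchanged across execve */
    n.bounding    = p.bounding;      /* unchanged across execve */
    return n;
}
```

Under these rules, a root process whose bounding set lacks a capability, and whose inheritable and ambient sets do not supply it, cannot regain that capability by executing another program.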

Example: granting a process the cap_chown capability allows it to change the owner of a file. Only a process with this capability can reassign a file to an arbitrary user or group.

I once encountered an issue related to capabilities when troubleshooting a custom operating system for a company. The operations team reported that root was unable to use tcpdump, throwing the error tcpdump: Couldn't change ownership of savefile.

Running tcpdump on the command line did indeed produce the error:

# tcpdump -i ens32 -w a.pcap
tcpdump: Couldn't change ownership of savefile

First, let’s use strace to see where tcpdump fails. The chown system call returns EPERM (operation not permitted) when trying to change the owner to 72, which is the uid and gid of the tcpdump user on my system. Apparently, tcpdump changes the owner of the output file when one is specified.

strace tcpdump -i ens32 -w a.pcap
......
chown("a.pcap", 72, 72) = -1 EPERM (Operation not permitted)
write(2, "tcpdump: ", 9tcpdump: ) = 9
write(2, "Couldn't change ownership of sav"..., 37Couldn't change ownership of savefile) = 37
......
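The failing call can be reproduced outside tcpdump with a few lines of C. This is only a sketch of the same system call seen in the strace output, not tcpdump's actual code; the helper name `try_chown` is made up for illustration.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Try to hand `path` over to uid:gid, as the chown() call in the trace does.
 * Returns 0 on success or a negative errno value (for example -EPERM). */
int try_chown(const char *path, uid_t uid, gid_t gid)
{
    if (chown(path, uid, gid) == 0)
        return 0;

    int err = errno;   /* save errno before calling anything else */
    fprintf(stderr, "chown(\"%s\", %u, %u) failed: %s\n",
            path, (unsigned)uid, (unsigned)gid, strerror(err));
    return -err;
}
```

Calling `try_chown("a.pcap", 72, 72)` from a shell that lacks cap_chown should return `-EPERM`, matching the trace; a path that does not exist yields `-ENOENT` instead.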

When a system call returns an unexpected error, I often use ftrace to trace the call path in the kernel and find where the error is generated. How to follow that path to the source of the EPERM is a topic for another time; the function below is where it ended. This was the first time I faced a problem related to capabilities.

bool capable_wrt_inode_uidgid(const struct inode *inode, int cap)
{
    struct user_namespace *ns = current_user_ns();

    return ns_capable(ns, cap) && privileged_wrt_inode_uidgid(ns, inode);
}
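To see why root is refused here, it helps to model the owner check that sits in front of this function. The following is a simplified user-space model of that check as I understand it from 5.10-era kernels. It ignores user-namespace mapping and the separate group check, and `model_chown_ok` and its parameters are illustrative names, not kernel symbols.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Simplified model of the kernel's decision on changing a file's owner.
 * has_cap_chown stands in for capable_wrt_inode_uidgid(..., CAP_CHOWN). */
bool model_chown_ok(uid_t caller_fsuid, uid_t file_uid, uid_t new_uid,
                    bool has_cap_chown)
{
    /* The owner may "change" the owner to the value it already has. */
    if (caller_fsuid == file_uid && new_uid == file_uid)
        return true;

    /* Any real change of owner needs CAP_CHOWN. Note that uid 0 is never
     * tested directly: root passes only because it normally holds the capability. */
    return has_cap_chown;
}
```

With `caller_fsuid` 0, `file_uid` 0, `new_uid` 72 and no cap_chown, the model returns false, which corresponds to the EPERM seen in the trace.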

A web search then confirmed the cause: the current terminal did not have the cap_chown capability. Digging further showed that on this system, root had been customized to run without cap_chown.

[root@localhost ~]# capsh --print
Current: =ep cap_fowner,cap_audit_control-ep
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,
cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,
cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,
cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,
cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,
cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,
cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,
cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,
cap_bpf,cap_checkpoint_restore
Ambient set =
Current IAB: !cap_chown,!cap_fowner,!cap_audit_control
Securebits: 00/0x0/1'b0 (no-new-privs=0)
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=0(root) euid=0(root)
gid=0(root)
groups=0(root)
Guessed mode: UNCERTAIN (0)

The restriction is configured in /etc/security/capability.conf:

# cat /etc/security/capability.conf
!cap_chown,!cap_fowner,!cap_audit_control root
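The same check can be made without capsh by decoding the CapEff line of /proc/<pid>/status, which holds the effective set as a hexadecimal bit mask. The helpers below are a minimal sketch; their names (`parse_cap_line`, `cap_is_set`, `read_own_effective_caps`) are made up for illustration, and the only kernel fact relied on is that CAP_CHOWN is bit 0.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CAP_CHOWN_BIT 0   /* value of CAP_CHOWN in <linux/capability.h> */

/* Parse one status line such as "CapEff:\t000001ffffffffff".
 * Returns true and stores the mask if the line starts with "<key>:". */
bool parse_cap_line(const char *line, const char *key, uint64_t *mask)
{
    size_t klen = strlen(key);
    unsigned long long value;

    if (strncmp(line, key, klen) != 0 || line[klen] != ':')
        return false;
    if (sscanf(line + klen + 1, "%llx", &value) != 1)
        return false;
    *mask = value;
    return true;
}

/* True if capability number `cap` is present in `mask`. */
bool cap_is_set(uint64_t mask, int cap)
{
    return (mask >> cap) & 1;
}

/* Read the effective set of the calling process; returns false on failure. */
bool read_own_effective_caps(uint64_t *mask)
{
    char line[256];
    bool found = false;
    FILE *f = fopen("/proc/self/status", "r");

    if (!f)
        return false;
    while (!found && fgets(line, sizeof line, f))
        found = parse_cap_line(line, "CapEff", mask);
    fclose(f);
    return found;
}
```

After `read_own_effective_caps(&mask)` succeeds, `cap_is_set(mask, CAP_CHOWN_BIT)` tells whether the calling process may change file owners.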

Unleash the Power of eBPF to Track Capability Changes

Tracing how a process's capabilities change and how they are passed on is almost impossible with traditional tools. Capabilities can only be modified through system calls, whether by executing another program with execve or by switching users with setuid, because almost all interaction between user space and the kernel in Linux goes through system calls.

Building a strace-like tool on ptrace to trace the system calls of every process would severely impact machine performance: ptrace is extremely slow and can make programs run tens to hundreds of times slower. Such a tool would also have to read /proc frequently to obtain process capabilities. In addition, some process attributes, such as securebits, cannot be obtained through /proc/pid/status or through ptrace on my 5.10 kernel.
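For reference, a process can query its own securebits with prctl(PR_GET_SECUREBITS). That does not help when observing other processes, which is the gap described above, but it shows what the value contains. The sketch below decodes the same four flags that capsh prints; the function names are made up for illustration.

```c
#include <stdio.h>
#include <sys/prctl.h>
#include <linux/securebits.h>

/* Returns the calling thread's securebits, or -1 on error. */
int own_securebits(void)
{
    return prctl(PR_GET_SECUREBITS, 0, 0, 0, 0);
}

/* Print the base flags; the corresponding *_LOCKED bits are ignored here. */
void print_securebits(int bits)
{
    printf("secure-noroot: %s\n",
           (bits & SECBIT_NOROOT) ? "yes" : "no");
    printf("secure-no-suid-fixup: %s\n",
           (bits & SECBIT_NO_SETUID_FIXUP) ? "yes" : "no");
    printf("secure-keep-caps: %s\n",
           (bits & SECBIT_KEEP_CAPS) ? "yes" : "no");
    printf("secure-no-ambient-raise: %s\n",
           (bits & SECBIT_NO_CAP_AMBIENT_RAISE) ? "yes" : "no");
}
```

Calling `print_securebits(own_securebits())` prints one line per flag for the current process only.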

eBPF has significant advantages:

  • It can trace all system calls by attaching to the tracepoint/raw_syscalls/sys_enter and tracepoint/raw_syscalls/sys_exit tracepoints, without listing each system call by hand, so the tool does not need to change when the set of system calls differs between kernel versions. Its performance overhead is also much lower than that of ptrace.

  • An eBPF program can access the current process’s task_struct, which contains almost all information about the process, so it sees nearly everything the kernel knows about it. This way, process capabilities and securebits can be read in real time.

Let’s take a look at cap.bpf.h. First comes the s_filter structure, used to filter by pid and uid during tracing, followed by definitions of the parameters of several system calls, which are collected on system call entry. Finally, the s_event structure passes the collected information to user space: basic process properties, the user-space stack (which shows how the code reached the system call), and cap_before and cap_after, the capabilities before and after the system call.
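On the user-space side, comparing cap_before and cap_after comes down to two bit operations. The sketch below assumes both values arrive as 64-bit masks with one bit per capability; that layout, the helper names, and the shortened name table are assumptions for illustration, not taken from cap.bpf.h.

```c
#include <stdint.h>
#include <stdio.h>

/* First eight capability names, indexed by bit number as in <linux/capability.h>. */
static const char *const cap_names[] = {
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid",
};
#define KNOWN_CAPS (sizeof cap_names / sizeof cap_names[0])

/* Capabilities present before the system call but missing afterwards. */
uint64_t caps_dropped(uint64_t before, uint64_t after) { return before & ~after; }

/* Capabilities missing before the system call but present afterwards. */
uint64_t caps_gained(uint64_t before, uint64_t after) { return after & ~before; }

/* Print one "+name" or "-name" line per changed capability. */
void print_cap_changes(uint64_t before, uint64_t after)
{
    uint64_t dropped = caps_dropped(before, after);
    uint64_t gained  = caps_gained(before, after);

    for (unsigned i = 0; i < 64; i++) {
        uint64_t bit = (uint64_t)1 << i;
        if (!((dropped | gained) & bit))
            continue;
        char sign = (gained & bit) ? '+' : '-';
        if (i < KNOWN_CAPS)
            printf("%c%s\n", sign, cap_names[i]);
        else
            printf("%ccap_%u\n", sign, i);   /* fall back to the bit number */
    }
}
```

A system call that leaves both masks identical produces no output, so only the calls that actually change capabilities show up.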

In cap.bpf.c, written with libbpf CO-RE, the default filter values for pid and uid are set to -1, meaning no filtering. If filtering is needed, these two values can be modified directly in the user-space code.
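The "-1 means no filtering" convention can be sketched as a small predicate. The struct below is a hypothetical stand-in for s_filter; the real field names and types in cap.bpf.h may differ.

```c
#include <stdbool.h>
#include <stdint.h>

#define FILTER_ANY (-1)   /* "no filtering", matching the default described above */

/* Hypothetical stand-in for s_filter. */
struct trace_filter {
    int64_t pid;
    int64_t uid;
};

/* True if an event from (pid, uid) should be reported under filter `f`. */
bool filter_matches(const struct trace_filter *f, uint32_t pid, uint32_t uid)
{
    if (f->pid != FILTER_ANY && (uint32_t)f->pid != pid)
        return false;
    if (f->uid != FILTER_ANY && (uint32_t)f->uid != uid)
        return false;
    return true;
}
```

Storing the filter fields in a signed type wider than the 32-bit pid and uid keeps -1 from colliding with any real identifier, which is one reason to prefer it over an unsigned sentinel.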