This article can be found at: https://www.ebpf.top/post/bpf_sched_ext_dive_into


In the article Linus Strongly Pushes for Inclusion: BPF Empowers Scheduler Success, we reviewed the journey of BPF's integration into the scheduler within the community. Patch V7 has been prepared for merging into 6.11, and the code repository has since moved to a kernel.org git address, so the merge is only a matter of time. This blog post focuses on the implementation principles of sched_ext, an extensible scheduler class jointly introduced by Meta and Google, also referred to as ext_sched_class. This mechanism allows users to implement a scheduling class through well-defined BPF programs, optimizing scheduling strategies for specific workloads or scenarios.

Linux Process Scheduler

The process scheduler is an essential part of the operating system.

In general, CPU time slices are multiplexed among multiple tasks. Fundamentally, the scheduler makes the following decisions (who to run next, where to run, for how long):

  • Task Selection: Which task should run next?
  • CPU Core Selection: Determine which CPU core or cores the next running task should run on.
  • Time Slice: How long should the selected task run?

For an individual task, running on a specific set of CPU cores for a longer time is ideal. However, considering that CPU time is shared among multiple tasks, allowing one task to run for an extended period can potentially impact other tasks running in the system. Therefore, the scheduler needs to ensure efficiency and fairness for all tasks from a global perspective to achieve overall optimization, making the scheduler’s complexity evident.

Moreover, the scheduler may aim to fully utilize multiple CPU cores by assigning tasks to any available idle CPU core. Although this scheduling pattern may increase overall CPU utilization, higher CPU utilization does not always equate to higher performance. Frequent task migrations between CPUs can lead to cache cold-start issues, as newly assigned CPUs need time to warm up, resulting in decreased performance. In systems with multiple cache domains (such as NUMA and AMD CCX architectures), the cost of cache warm-up increases significantly.

CFS Scheduler

Since version 2.6.23 (2007), the Linux system scheduler has been using the Completely Fair Scheduler (CFS), which has been the default scheduler in the mainline kernel. CFS manages processes using a red-black tree data structure, aiming to provide completely fair scheduling. For more detailed information, refer to Linux Process Management and Linux CFS Scheduler: Principles, Design, and Kernel Implementation (2023).

The CFS scheduler’s goal is to grant each task a fair share of CPU resources:

  • Task Selection: Each task has a virtual runtime (vruntime), and CFS always selects the process with the smallest vruntime, ensuring a relative level of fairness.
  • Time Slice: CFS does not use fixed time slices; instead, it dynamically adjusts based on task priority and CPU time used. This method ensures that high-priority tasks receive more CPU time while low-priority tasks are not starved.

CFS leverages a red-black tree for priority selection, which efficiently handles the insertion, deletion, and retrieval of tasks:

[Figure: cfs-red-black-tree]
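
To make the selection rule concrete, here is a minimal user-space sketch (not the kernel implementation; the task and runqueue structures are simplified stand-ins) of picking the task with the smallest vruntime. The kernel keeps tasks in a red-black tree and caches the leftmost node, making the pick O(1); a linear scan is used here purely for illustration:

/* Conceptual sketch of CFS task selection; compiles as standalone C. */
#include <stdio.h>
#include <stddef.h>

struct task {
    const char *name;
    unsigned long long vruntime; /* weighted CPU time consumed so far */
};

/* Pick the runnable task with the smallest vruntime. */
static struct task *pick_next(struct task *rq, size_t n)
{
    struct task *best = NULL;

    for (size_t i = 0; i < n; i++)
        if (!best || rq[i].vruntime < best->vruntime)
            best = &rq[i];
    return best;
}

int main(void)
{
    struct task rq[] = {
        { "editor", 3000 }, { "compiler", 1200 }, { "daemon", 4500 },
    };

    printf("next: %s\n", pick_next(rq, 3)->name); /* -> compiler */
    return 0;
}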

EEVDF Scheduler

CFS does not enforce scheduling deadlines and often fails to schedule latency-sensitive tasks on time, leading to high tail latency. After more than 15 years of service, CFS was recently replaced: the EEVDF (Earliest Eligible Virtual Deadline First) scheduler became the new default in Linux kernel 6.6 (November 2023). EEVDF, implemented in 2023 by scheduler maintainer Peter Zijlstra, is based on a research paper from the late 1990s and aims to improve scheduling decisions by considering both fairness and per-task virtual deadlines.

  • Task Selection: Each task has a virtual deadline representing the time by which it should receive service. EEVDF prioritizes running the eligible task with the earliest virtual deadline. The scheduler calculates virtual deadlines from task priorities and the CPU time already received, ensuring that latency-sensitive processes get CPU time promptly (a simplified sketch of this selection rule follows the list).
  • Time Slice: Similar to CFS, time slice selection is based on task priority and CPU time used, dynamically adjusting the allocation.
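
As a rough illustration of the selection rule, here is a simplified sketch (not kernel code; eligibility is approximated as "vruntime not ahead of the queue's weighted average", and the virtual deadline as vruntime + slice/weight, following the paper's definitions; the names etask, pick_eevdf, and avg_vruntime are illustrative):

/* Conceptual sketch of EEVDF task selection. */
#include <stddef.h>

struct etask {
    unsigned long long vruntime;  /* virtual time already consumed */
    unsigned long long vdeadline; /* vruntime + slice / weight */
};

/* Among eligible tasks (those not ahead of the average virtual time),
 * pick the one with the earliest virtual deadline. */
static struct etask *pick_eevdf(struct etask *rq, size_t n,
                                unsigned long long avg_vruntime)
{
    struct etask *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (rq[i].vruntime > avg_vruntime)
            continue; /* not eligible: already ran ahead of its share */
        if (!best || rq[i].vdeadline < best->vdeadline)
            best = &rq[i]; /* earliest virtual deadline wins */
    }
    return best;
}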

The implementation details of the EEVDF scheduler include calculating virtual deadlines, maintaining a red-black tree sorted by virtual deadline, handling task migrations and wakeups, and cooperating with the other scheduler classes. Results have shown that EEVDF outperforms CFS in certain scenarios, notably in environments with many latency-sensitive processes: it can significantly reduce scheduling latency and improve responsiveness without sacrificing throughput or energy efficiency.

Predicament of Generic Schedulers

The kernel scheduler has evolved from CFS to EEVDF, improving the scheduling of latency-sensitive tasks in certain scenarios. However, the primary goal of the kernel scheduler is to run correctly and efficiently across most workloads (such as databases, games, and microservices) and hardware platforms (such as x86, ARM, NUMA, non-NUMA, CCX, etc.). While a scheduler optimization may perform well for some workload and hardware combinations, unexpected performance regressions in other scenarios are not uncommon. To avoid such regressions, the kernel community holds changes to scheduler code to a high standard.

Utilizing a custom scheduler tailored for specific workloads/hardware combinations is more beneficial than using a generic scheduler. However, implementing and maintaining a custom scheduler can be challenging. This is not only due to the complexities of replacing or customizing the kernel scheduler but also because custom scheduling strategies are unlikely to be accepted upstream, leading to a heavy maintenance burden in the long term.

To address these issues, sched_ext was introduced. It allows users to write custom scheduling policies in BPF without modifying kernel code, so there is no longer a need to maintain a custom scheduler out of tree. Additionally, BPF provides a safe kernel programming environment: the BPF verifier ensures that a custom scheduler has no memory errors or infinite loops. Moreover, if the custom scheduler misbehaves (for example, by failing to schedule a runnable task within a specified time frame, such as 30 seconds), the kernel portion of sched_ext will terminate the custom scheduler and revert to the default kernel scheduler (CFS or EEVDF).

Last but not least, you can update the BPF scheduler without reinstalling the kernel or restarting the server. In large-scale data centers with tens of thousands of servers, this is quite appealing.

Implementation Mechanism of BPF Scheduler Extender sched_ext

Addition 1: SCHED_EXT Scheduling Class

Schedulers in the kernel are implemented through scheduling classes, each providing specific functionality for different scenarios. A scheduling class can be understood as a generic abstract interface, similar to a base class in object-oriented languages: each class implements the specific functions the interface defines, and the classes are ordered by priority. The scheduling class of a task is either set by default when the process is created or adjusted with the sched_setscheduler system call.

SCHED_EXT is a non-privileged scheduling class, meaning any process can be switched to SCHED_EXT. Its priority sits between SCHED_IDLE and SCHED_NORMAL, so a SCHED_EXT scheduler cannot take over the system in a way that would, for example, lock out normal shell sessions running under SCHED_NORMAL. The relationship between the scheduling class interface, the existing scheduler classes, and the new ext_sched_class is depicted in the following diagram:

[Figure: sched_cls_and_sched_ext]
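
For example, a process can switch itself to SCHED_EXT with the standard sched_setscheduler() call. Below is a hedged sketch: SCHED_EXT is defined as policy 7 in the sched_ext kernel tree, and libc headers may not expose it yet, so it is defined locally here as an assumption.

/* Sketch: switch the current process to SCHED_EXT (requires a kernel
 * built with CONFIG_SCHED_CLASS_EXT). */
#include <sched.h>
#include <stdio.h>

#ifndef SCHED_EXT
#define SCHED_EXT 7 /* policy number in the sched_ext tree; assumption */
#endif

int main(void)
{
    struct sched_param param = { .sched_priority = 0 };

    /* SCHED_EXT is unprivileged: any process may switch itself (pid 0). */
    if (sched_setscheduler(0, SCHED_EXT, &param) < 0) {
        perror("sched_setscheduler");
        return 1;
    }
    printf("policy is now %d\n", sched_getscheduler(0));
    return 0;
}

Until a BPF scheduler is loaded, such a task is simply scheduled as SCHED_NORMAL, as described in the "Switching to sched_ext" section below.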

The latest version deprecates the scx_bpf_switch_all() function: in earlier versions, this convenient helper relied on all newly created tasks being added to the global list scx_tasks, and when a user-defined BPF scheduler was registered, it could switch all processes other than those in dl_sched_class/rt_sched_class over to ext_sched_class with a single call. For more details, refer to removing scx_bpf_switch_all.

Addition 2: eBPF Custom Scheduler Functions

The SCHED_EXT scheduling class adds an interface for user-defined extensions: within the functions of the SCHED_EXT class, a set of eBPF-based extension hooks is defined. Taking enqueue_task_scx as an example, during execution it checks whether the corresponding runnable operation defined in the sched_ext_ops structure has been registered (commonly referred to as ops.runnable). If the loaded BPF program defines this operation, it is called; otherwise, the original flow continues.

[Figure: sched_ext_and_ops]

For example, the implementation of the ext_sched_class function set_next_task_scx, which invokes ops.running, looks like this:

static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
{
    // …
    /* see dequeue_task_scx() on why we skip when !QUEUED */
    if (SCX_HAS_OP(running) && (p->scx.flags & SCX_TASK_QUEUED)) // Is sched_ext_ops.running registered?
        SCX_CALL_OP_TASK(SCX_KF_REST, running, p); // If so, invoke the BPF-defined ops.running

    clr_task_runnable(p, true); 
    // …
}

The ops.running operation in the structure can be implemented by a user-written eBPF program. Below is an excerpt from the initialization of the official sample scheduler scx_simple:

/* tools/sched_ext/scx_simple.bpf.c */
SCX_OPS_DEFINE(simple_ops,
           .select_cpu      = (void *)simple_select_cpu,
           .enqueue         = (void *)simple_enqueue,
           .dispatch        = (void *)simple_dispatch,
           .running   = (void *)simple_running,  // <==== Specify the eBPF program function
           // …
           .name            = "simple"
 );

void BPF_STRUCT_OPS(simple_running, struct task_struct *p) // <== Implement ops.running function
{
    if (fifo_sched)
        return;

    if (vtime_before(vtime_now, p->scx.dsq_vtime))
        vtime_now = p->scx.dsq_vtime;
}

By implementing these operations as BPF struct_ops programs, user-defined scheduling logic takes effect in the kernel.
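
On the user-space side, the scheduler is enabled by loading the BPF object and attaching its struct_ops map. Below is a minimal loader sketch modeled on tools/sched_ext/scx_simple.c; the skeleton header name and symbols are assumed to come from a bpftool-generated skeleton of the BPF program above:

/* Minimal struct_ops loader sketch (assumes a bpftool-generated skeleton). */
#include <signal.h>
#include <unistd.h>
#include <bpf/bpf.h>
#include "scx_simple.bpf.skel.h"

static volatile sig_atomic_t exiting;

static void sig_handler(int sig)
{
    exiting = 1;
}

int main(void)
{
    struct scx_simple *skel;
    struct bpf_link *link;

    signal(SIGINT, sig_handler);

    skel = scx_simple__open_and_load();
    if (!skel)
        return 1;

    /* Attaching the struct_ops map registers the BPF scheduler. */
    link = bpf_map__attach_struct_ops(skel->maps.simple_ops);
    if (!link) {
        scx_simple__destroy(skel);
        return 1;
    }

    while (!exiting)
        pause(); /* the scheduler stays active while the link is held */

    bpf_link__destroy(link); /* unregister: tasks fall back to the default scheduler */
    scx_simple__destroy(skel);
    return 0;
}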

Workflow of SCHED_EXT Scheduling Class

Dispatch Queues (DSQ)

To bridge the scheduler core and the BPF scheduler, sched_ext uses Dispatch Queues (DSQs) as its scheduling queues; a DSQ can operate either as a FIFO or as a priority queue.

Currently, the built-in global DSQ and the per-CPU local DSQs do not support priority ordering; they only support FIFO scheduling.

By default, there is a global FIFO (SCX_DSQ_GLOBAL) and per-CPU local DSQ (SCX_DSQ_LOCAL). The BPF scheduler can manage any number of DSQs using scx_bpf_create_dsq() and scx_bpf_destroy_dsq().

The kernel sample code creates a user-defined queue during initialization as follows:

/* tools/sched_ext/scx_simple.bpf.c */
#define SHARED_DSQ 0
s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
{
	return scx_bpf_create_dsq(SHARED_DSQ, -1);
}

In simple_init, a DSQ with ID 0 is created using scx_bpf_create_dsq; the second parameter is the NUMA node ID (-1 means no NUMA affinity). This DSQ is used for subsequent task scheduling.

Scheduling Cycle Workflow

When a CPU is ready to pick a task, it first looks at its local DSQ; if the local DSQ is not empty, the first task is selected. Otherwise, the CPU tries the built-in global DSQ. If that yields no runnable task either, ops.dispatch() is invoked to dispatch or consume tasks.

The workflow of the BPF scheduler in sched_ext can be analyzed along two dimensions, task wakeup and CPU readiness; only the core flow is outlined here.

  1. When a task is awakened, ops.select_cpu() is the first operation function called.

    Here, there are mainly two purposes:

    • CPU selection optimization hint.

    • If the target CPU is idle, wake up the selected CPU.

The CPU returned by ops.select_cpu() is an optimization hint and is not binding; the actual decision is made in the final step of scheduling. However, if the CPU selected by ops.select_cpu() matches the CPU where the task eventually runs, there can be a slight performance gain thanks to warm CPU caches. A side effect of selecting a CPU is waking it from the idle state. Although the BPF scheduler can wake any CPU using the helper scx_bpf_kick_cpu(), using ops.select_cpu() judiciously is simpler and more efficient. A task can also be dispatched to a DSQ immediately by calling scx_bpf_dispatch() from within ops.select_cpu(). If the SCX_DSQ_LOCAL flag is passed to scx_bpf_dispatch(), the task will be dispatched to the local DSQ of the CPU returned by ops.select_cpu().

The above may seem a bit abstract; here is the implementation of the ops.select_cpu operation in scx_simple:

/* tools/sched_ext/scx_simple.bpf.c */
s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags)
{
	bool is_idle = false;
	s32 cpu;

	cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
	if (is_idle) {
		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0); // Branch 1
	}

	return cpu;  // Branch 2
}

Furthermore, dispatching directly from ops.select_cpu() bypasses the ops.enqueue() callback.

Note that the scheduler core ignores an invalid CPU selection, for example one outside the cpumask allowed for the task.

  2. If the task was not dispatched immediately while choosing the target CPU (i.e., Branch 1 above was not taken and the CPU was returned via Branch 2 in the code), ops.enqueue() is called next. ops.enqueue() can make one of the following decisions:

    • Dispatch the task to the global or a local DSQ by calling scx_bpf_dispatch() with SCX_DSQ_GLOBAL or SCX_DSQ_LOCAL, respectively.

    • Dispatch the task immediately to a custom DSQ by calling scx_bpf_dispatch() with a DSQ ID less than 2^63.

    • Queue tasks on the BPF side.

/* tools/sched_ext/scx_simple.bpf.c */
/*
 * Built-in DSQs such as SCX_DSQ_GLOBAL cannot be used as priority queues
 * (meaning, cannot be dispatched to with scx_bpf_dispatch_vtime()). We
 * therefore create a separate DSQ with ID 0 that we dispatch to and consume
 * from. If scx_simple only supported global FIFO scheduling, then we could
 * just use SCX_DSQ_GLOBAL.
 */
#define SHARED_DSQ 0

void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
{
	if (fifo_sched) { /* FIFO mode: dispatch to the shared DSQ in order */
		scx_bpf_dispatch(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
	} else {          /* weighted vtime mode: priority dispatch */
		u64 vtime = p->scx.dsq_vtime;

		/*
		 * Limit the amount of budget that an idling task can accumulate
		 * to one slice.
		 */
		if (vtime_before(vtime, vtime_now - SCX_SLICE_DFL))
			vtime = vtime_now - SCX_SLICE_DFL;

		scx_bpf_dispatch_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime,
				       enq_flags);
	}
}

  3. When a CPU is ready to schedule, it first checks its local DSQ. If the local DSQ is empty, it checks the global DSQ. If there is still no task to run, ops.dispatch() is invoked, which may dispatch tasks to the local DSQ using the following two functions:

    • scx_bpf_dispatch() dispatches a task to a DSQ. Any target DSQ can be used: SCX_DSQ_LOCAL, SCX_DSQ_LOCAL_ON | cpu, SCX_DSQ_GLOBAL, or a custom DSQ. Currently, scx_bpf_dispatch() cannot be called while holding a BPF lock; this limitation is being worked on and will be lifted. scx_bpf_dispatch() schedules the dispatch rather than performing it immediately; up to ops.dispatch_max_batch tasks can be pending.

    • scx_bpf_consume() moves a task from the specified non-local DSQ to the dispatching CPU's local DSQ. It cannot be called while holding any BPF lock. scx_bpf_consume() flushes pending dispatches before attempting to consume from the specified DSQ.

/* tools/sched_ext/scx_simple.bpf.c */
void BPF_STRUCT_OPS(simple_dispatch, s32 cpu, struct task_struct *prev)
{
	scx_bpf_consume(SHARED_DSQ);
}

  4. After ops.dispatch() returns, if there are tasks in the local DSQ, the CPU runs the first one. If the local DSQ is empty, the CPU follows these steps:

    • Try to consume the global DSQ. If successful, run the task.

    • If ops.dispatch() has dispatched any tasks, retry #3.

    • If the previous task was an SCX task and can still run, continue running it (see SCX_OPS_ENQ_LAST).

    • Remain idle.

Note that the BPF scheduler can always choose to dispatch tasks immediately in ops.enqueue(). If only built-in DSQs are used, there is no need to implement ops.dispatch(): tasks never queue on the BPF side, and both the local and global DSQs are consumed automatically. A minimal sketch of such a scheduler follows.
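
To illustrate, here is a sketch of a minimal BPF scheduler that relies solely on the built-in DSQs and therefore implements neither ops.dispatch() nor a custom DSQ. This is a hypothetical, stripped-down variant of scx_simple; the include path follows the sched_ext tree layout, and the names minimal_enqueue/minimal_ops are illustrative:

/* minimal_global_fifo.bpf.c: every task goes through the global FIFO. */
#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

void BPF_STRUCT_OPS(minimal_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* Queue on the global FIFO; ready CPUs consume it automatically. */
	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

SCX_OPS_DEFINE(minimal_ops,
	       .enqueue = (void *)minimal_enqueue,
	       .name    = "minimal");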

scx_bpf_dispatch() queues a task on the FIFO of the target DSQ; use scx_bpf_dispatch_vtime() for priority queueing. Built-in DSQs such as SCX_DSQ_LOCAL and SCX_DSQ_GLOBAL do not support priority queueing and must be dispatched to with scx_bpf_dispatch(). Refer to the function documentation and the usage in tools/sched_ext/scx_simple.bpf.c for more information.

Switching to sched_ext

CONFIG_SCHED_CLASS_EXT is the configuration option to enable sched_ext, and tools/sched_ext contains sample schedulers. When compiling the kernel, the following configuration options should be enabled to use sched_ext:

CONFIG_BPF=y
CONFIG_SCHED_CLASS_EXT=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_DEBUG_INFO_BTF=y
CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_BPF_JIT_DEFAULT_ON=y
CONFIG_PAHOLE_HAS_SPLIT_BTF=y
CONFIG_PAHOLE_HAS_BTF_TAG=y

The code repository for the latest Patch V7 is https://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git/. sched_ext is only active while a BPF scheduler is loaded and running. When a task's scheduling policy is explicitly set to SCHED_EXT, it is treated as SCHED_NORMAL and scheduled by CFS until a BPF scheduler is loaded. Once one is loaded, such tasks switch over and are scheduled by the sched_ext scheduler program.

Terminating the sched_ext scheduler program, triggering SysRq-S, or the detection of any internal error (including stalled runnable tasks) aborts the BPF scheduler and reverts all tasks back to CFS.

# make -j16 -C tools/sched_ext
# tools/sched_ext/scx_simple
local=0 global=3
local=5 global=24
local=9 global=44
local=13 global=56
local=17 global=72
^CEXIT: BPF scheduler unregistered

The current state of the BPF scheduler can be determined as follows:

# cat /sys/kernel/sched_ext/state
enabled
# cat /sys/kernel/sched_ext/root/ops
simple

tools/sched_ext/scx_show_state.py is a drgn script that provides more detailed information:

# tools/sched_ext/scx_show_state.py
ops           : simple
enabled       : 1
switching_all : 1
switched_all  : 1
enable_state  : enabled (2)
bypass_depth  : 0
nr_rejected   : 0

If the CONFIG_SCHED_DEBUG option is set during compilation, you can determine if a task is running on sched_ext as follows:

# grep ext /proc/self/sched
ext.enabled                                  :                    1

Summary

This post briefly introduced the kernel schedulers CFS and EEVDF, the implementation mechanism and workflow of sched_ext, and the details of the scx_simple BPF scheduler in the kernel tree. If you are interested in the implementation and application of sched_ext, you can build the kernel and experiment with the schedulers in the tools/sched_ext/ directory; you will likely make more discoveries and gain deeper insight into scheduling. With sched_ext merged into the kernel, testing and deployment will become much more convenient.

My knowledge is limited; criticism and corrections are welcome if you find any mistakes.