BPF Scheduler sched_ext Implementation Mechanism, Scheduling Process, and Examples
This article can be found at: https://www.ebpf.top/post/bpf_sched_ext_dive_into
In the article Linus Strongly Pushes for Inclusion: BPF Empowers Scheduler Success, we reviewed the journey of BPF scheduler integration in the community. Patch V7 is being prepared for merging into 6.11, and the code repository has moved to a kernel.org git address, so the merge is only a matter of time. This blog post focuses on the implementation principles of sched_ext, an extensible scheduler class jointly introduced by Meta and Google, referred to as ext_sched_class or sched_ext. This mechanism allows users to optimize scheduling strategies for specific workloads or scenarios by implementing the scheduling class with user-defined BPF programs.
Linux Process Scheduler
The process scheduler is an essential part of the operating system.
In general, CPU time slices are multiplexed among multiple tasks. Fundamentally, the scheduler makes the following decisions (who to run next, where to run, for how long):
- Task Selection: Which task should run next?
- CPU Core Selection: On which CPU core (or cores) should the next task run?
- Time Slice: How long should the selected task run?
For an individual task, running on a specific set of CPU cores for a longer time is ideal. However, considering that CPU time is shared among multiple tasks, allowing one task to run for an extended period can potentially impact other tasks running in the system. Therefore, the scheduler needs to ensure efficiency and fairness for all tasks from a global perspective to achieve overall optimization, making the scheduler’s complexity evident.
Moreover, the scheduler may aim to fully utilize multiple CPU cores by assigning tasks to any available idle CPU core. Although this scheduling pattern may increase overall CPU utilization, higher CPU utilization does not always equate to higher performance. Frequent task migrations between CPUs can lead to cache cold-start issues, as newly assigned CPUs need time to warm up, resulting in decreased performance. In systems with multiple cache domains (such as NUMA and AMD CCX architectures), the cost of cache warm-up increases significantly.
CFS Scheduler
Since version 2.6.23 (2007), the default scheduler in the mainline Linux kernel has been the Completely Fair Scheduler (CFS). CFS manages processes using a red-black tree data structure, aiming to provide completely fair scheduling. For more detailed information, refer to Linux Process Management and Linux CFS Scheduler: Principles, Design, and Kernel Implementation (2023).
The CFS scheduler’s goal is to grant each task a fair share of CPU resources:
- Task Selection: Each task has a virtual runtime (vruntime), and CFS always selects the process with the smallest vruntime, ensuring a relative level of fairness.
- Time Slice: CFS does not use fixed time slices; instead, it dynamically adjusts based on task priority and CPU time used. This method ensures that high-priority tasks receive more CPU time while low-priority tasks are not starved.
CFS leverages a red-black tree ordered by vruntime for task selection, which efficiently handles the insertion, deletion, and retrieval of tasks.
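To make this concrete, here is a minimal sketch assuming simplified types and field names (the real code lives in kernel/sched/fair.c and differs in detail): the leftmost node of the red-black tree holds the entity with the smallest vruntime, so picking the next task is an O(1) lookup of a cached leftmost pointer.

```c
#include <linux/rbtree.h>
#include <linux/types.h>

/* Illustrative sketch only; not the actual CFS code or field names. */
struct entity_sketch {
	struct rb_node	run_node;	/* node in the timeline tree, keyed by vruntime */
	u64		vruntime;	/* weighted virtual runtime */
};

static struct entity_sketch *pick_smallest_vruntime(struct rb_root_cached *timeline)
{
	/* rb_first_cached() returns the cached leftmost node, i.e. the
	 * entity with the smallest key, without walking the tree. */
	struct rb_node *left = rb_first_cached(timeline);

	return left ? rb_entry(left, struct entity_sketch, run_node) : NULL;
}
```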
EEVDF Scheduler
CFS does not enforce scheduling deadlines and often fails to schedule latency-sensitive tasks on time, leading to high tail latency. After more than 15 years of service, CFS was recently replaced: the EEVDF (Earliest Eligible Virtual Deadline First) scheduler became the new default in Linux kernel 6.6 (November 2023). EEVDF was brought into the kernel in 2023 by scheduler maintainer Peter Zijlstra and is based on a research paper from the late 1990s; it optimizes scheduling decisions by considering each process's virtual deadline together with fairness.
- Task Selection: Each task has a virtual deadline representing the time it should be completed. EEVDF prioritizes running tasks with the earliest virtual deadlines. The scheduler calculates virtual deadlines based on task priorities and CPU time received, ensuring that delay-sensitive processes receive CPU time promptly without being preempted by other tasks.
- Time Slice: Similar to CFS, time slice selection is based on task priority and CPU time used, dynamically adjusting the allocation.
The implementation details of the EEVDF scheduler include calculating virtual deadlines, maintaining a red-black tree sorted by virtual deadline, handling process migration and wakeup, and cooperating with the other scheduler classes. Results have shown that EEVDF outperforms CFS in certain scenarios, notably in environments with many latency-sensitive processes: it can significantly reduce scheduling latency and improve responsiveness without sacrificing throughput or energy efficiency.
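As a rough sketch of the core idea, following the original EEVDF paper (notation approximate; the kernel implementation in kernel/sched/fair.c differs in detail), the virtual deadline of a request is its virtual eligible time plus the request length scaled by the task's weight:

```latex
% vd_i: virtual deadline, ve_i: virtual eligible time,
% r_i: requested service time (slice), w_i: weight of task i.
vd_i = ve_i + \frac{r_i}{w_i}
```

A task with a larger weight (higher priority) thus gets an earlier virtual deadline for the same slice length, which is why latency-sensitive tasks are picked sooner.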
Predicament of Generic Schedulers
The scheduler in the kernel has evolved from CFS to EEVDF, aiming to improve the scheduling of latency-sensitive tasks in certain scenarios. However, the primary goal of the kernel scheduler is to ensure compatibility and efficient operation across most workloads (such as databases, games, microservices) and hardware platforms (such as x86, ARM, NUMA, non-NUMA, CCX, etc.). While a scheduler optimization may perform well on some workload and hardware combinations, unexpected performance degradation in other scenarios is not uncommon. To avoid such performance regressions, the kernel community has set a high bar for changes to scheduler code.
Utilizing a custom scheduler tailored for specific workloads/hardware combinations is more beneficial than using a generic scheduler. However, implementing and maintaining a custom scheduler can be challenging. This is not only due to the complexities of replacing or customizing the kernel scheduler but also because custom scheduling strategies are unlikely to be accepted upstream, leading to a heavy maintenance burden in the long term.
To address these issues, sched_ext was introduced. It allows users to write custom scheduling policies in BPF without modifying kernel code, so there is no longer a need to maintain a custom scheduler out of tree. Additionally, BPF provides a safe kernel programming environment: the BPF verifier ensures that your custom scheduler has no memory errors or infinite loops. Moreover, if your custom scheduler misbehaves, for example by failing to schedule a runnable task within a specified time frame (e.g., 30 seconds), the kernel portion of sched_ext will terminate your custom scheduler and fall back to the default kernel scheduler (CFS or EEVDF).
Last but not least, you can update the BPF scheduler without reinstalling the kernel or restarting the server. In large-scale data centers with tens of thousands of servers, this is quite appealing.
Implementation Mechanism of the BPF Scheduler Extension sched_ext
Addition 1: SCHED_EXT Scheduling Class
The implementation of schedulers in the kernel is achieved through scheduling classes, which provide specific functionality for different scenarios. A scheduling class can be understood as a generic abstract structure, similar to a base class in object-oriented languages: each concrete scheduler implements the functions defined by the scheduling class, and the scheduling classes are ordered by priority. The scheduling class of a task is either set by default when the process is created or adjusted with the sched_setscheduler() system call.

SCHED_EXT is a non-privileged class, meaning any process can be switched to SCHED_EXT. SCHED_EXT is positioned between SCHED_IDLE and SCHED_NORMAL in priority; therefore, a SCHED_EXT scheduler cannot take over the system in a way that starves, for example, normal shell sessions running under SCHED_NORMAL. The new ext_sched_cls sits alongside the existing scheduler classes behind the common scheduling class interface.
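As an illustration of switching the current process to SCHED_EXT, here is a minimal userland sketch. One caveat: libc headers may not define SCHED_EXT yet, so the policy value (7 in the sched_ext kernel headers) is defined manually below; verify it against your kernel's include/uapi/linux/sched.h.

```c
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* SCHED_EXT is 7 in the sched_ext kernel headers; libc headers may not
 * carry it yet, so define it here. Verify against your kernel. */
#ifndef SCHED_EXT
#define SCHED_EXT 7
#endif

int main(void)
{
	struct sched_param param = { .sched_priority = 0 };

	/* Switch this process to the SCHED_EXT class. Until a BPF
	 * scheduler is loaded, it keeps being scheduled like
	 * SCHED_NORMAL by CFS/EEVDF. */
	if (sched_setscheduler(getpid(), SCHED_EXT, &param) == -1) {
		perror("sched_setscheduler");
		return 1;
	}
	printf("now running under SCHED_EXT\n");
	return 0;
}
```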
The latest version deprecates the scx_bpf_switch_all() function: in earlier versions, this magical helper built on the fact that all newly created tasks are added to the global list scx_tasks. Once a user-defined BPF scheduler was registered, a single call could switch every process other than those in dl_sched_class/rt_sched_class over to ext_sched_cls. For more details, refer to removing scx_bpf_switch_all.
Addition 2: eBPF Custom Scheduler Functions
In the implementation of the SCHED_EXT scheduling class, an interface for user-defined extensions is added: within the functions of the SCHED_EXT class, a set of eBPF-based extension hooks is defined. Taking enqueue_task_scx as an example, during execution it checks whether the corresponding runnable interface defined in the sched_ext_ops structure has been registered (commonly written as ops.runnable). If the loaded BPF program defines this operation function, it is called; otherwise, the original flow continues.

For example, the implementation of the ext_sched_cls.enqueue_task_scx function is as follows:
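The following is an abridged sketch based on kernel/sched/ext.c from the sched_ext tree; it is simplified for illustration, and details vary across patch versions.

```c
/* Abridged sketch of kernel/sched/ext.c; bookkeeping elided. */
static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags)
{
	/* ... flag handling and accounting elided ... */

	/* If the loaded BPF scheduler implements ops.runnable, invoke it;
	 * otherwise the default flow continues. */
	if (SCX_HAS_OP(runnable))
		SCX_CALL_OP_TASK(SCX_KF_REST, runnable, p, enq_flags);

	/* ... the task is then queued via the dispatch path ... */
}
```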
The operation functions in the sched_ext_ops structure can be implemented by a user-written eBPF program. Below is an example of the initialization from the official sample scheduler scx_simple:
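A sketch based on tools/sched_ext/scx_simple.bpf.c (the sample wraps this in a helper macro, and the exact callback set varies by version):

```c
/* Register the BPF scheduler's callbacks via a struct_ops map. */
SEC(".struct_ops.link")
struct sched_ext_ops simple_ops = {
	.select_cpu	= (void *)simple_select_cpu,
	.enqueue	= (void *)simple_enqueue,
	.dispatch	= (void *)simple_dispatch,
	.running	= (void *)simple_running,
	.stopping	= (void *)simple_stopping,
	.init		= (void *)simple_init,
	.exit		= (void *)simple_exit,
	.name		= "simple",
};
```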
By filling in this structure from a BPF program, user-defined scheduling logic is plugged into the kernel.
Workflow of SCHED_EXT Scheduling Class
Dispatch Queues (DSQ)
To bridge the scheduler core and the BPF scheduler, sched_ext uses dispatch queues (DSQs) as its scheduling queues; a DSQ can function either as a FIFO queue or as a priority queue.
Currently, the built-in global DSQ and the per-CPU local DSQs do not support priority ordering and operate only as FIFOs.
By default, there is one global FIFO (SCX_DSQ_GLOBAL) and one local DSQ per CPU (SCX_DSQ_LOCAL). The BPF scheduler can manage any number of additional DSQs using scx_bpf_create_dsq() and scx_bpf_destroy_dsq().
In the initialization of the kernel sample code, a user-defined queue is created as follows:
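A sketch based on tools/sched_ext/scx_simple.bpf.c, assuming SHARED_DSQ is defined as 0:

```c
#define SHARED_DSQ 0	/* ID of the custom DSQ used by the sample */

s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
{
	/* Create a custom DSQ with ID 0; the second argument is the NUMA
	 * node to allocate it on, with -1 meaning any node. */
	return scx_bpf_create_dsq(SHARED_DSQ, -1);
}
```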
In the function simple_init, a DSQ with ID 0 is created using scx_bpf_create_dsq, with the second parameter being the NUMA node ID; this DSQ is then used for subsequent task scheduling.
Scheduling Cycle Workflow
When a CPU is ready to schedule, it first tries to pick a task from its local queue: if the local DSQ is not empty, the first task is selected. Otherwise, the CPU attempts to consume the built-in global DSQ. If that still yields no runnable task, ops.dispatch() is called to dispatch or consume tasks.
The workflow of the BPF scheduler in sched_ext can be analyzed along two dimensions, task wakeup and CPU readiness; only the core flow is described below.
1. When a task is awakened, ops.select_cpu() is the first operation function called. It serves two main purposes:

   - Providing a CPU selection optimization hint.
   - Waking up the selected CPU if it is idle.
   The CPU returned by ops.select_cpu() is an optimization hint, not a binding constraint; the actual decision is made in the final step of scheduling. However, if the CPU selected by ops.select_cpu() matches the CPU where the task eventually runs, there is a slight performance gain from the various CPU caches. A side effect of selecting a CPU is waking it from the idle state. Although the BPF scheduler can wake any CPU using the helper function scx_bpf_kick_cpu(), using ops.select_cpu() appropriately is simpler and more efficient. Calling scx_bpf_dispatch() from within ops.select_cpu() dispatches the task to a DSQ immediately; if the SCX_DSQ_LOCAL flag is set in that call, the task is dispatched to the local DSQ of the CPU returned by ops.select_cpu().

   The description above may seem abstract, so here is the implementation of the ops.select_cpu function in scx_simple:
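A sketch based on tools/sched_ext/scx_simple.bpf.c, with the sample's statistics bookkeeping omitted:

```c
s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	bool is_idle = false;
	s32 cpu;

	/* Use the built-in default CPU selection policy; is_idle reports
	 * whether the chosen CPU is currently idle. */
	cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
	if (is_idle)
		/* Dispatch straight to the chosen CPU's local DSQ, which
		 * bypasses the later ops.enqueue() callback. */
		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

	return cpu;
}
```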
   Furthermore, dispatching directly from ops.select_cpu() bypasses the ops.enqueue() callback.

   Note that the scheduler core will ignore an invalid CPU selection, for example a CPU outside the cpumask allowed for the task.
2. If the task was not dispatched during CPU selection (branch 1 above), ops.enqueue() is called after the target CPU is chosen (branch 2 in the code). ops.enqueue() can make one of the following decisions:

   - Dispatch the task to the global or a local DSQ by calling scx_bpf_dispatch() with SCX_DSQ_GLOBAL or SCX_DSQ_LOCAL, respectively.
   - Dispatch the task immediately to a custom DSQ by calling scx_bpf_dispatch() with a DSQ ID less than 2^63.
   - Queue the task on the BPF side.

   scx_simple dispatches every task to its shared custom DSQ, as shown below:
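A sketch based on scx_simple's FIFO mode, with statistics and the weighted-vtime branch omitted (SHARED_DSQ is the custom DSQ created in simple_init):

```c
void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* Queue the task on the shared custom DSQ (ID 0) with the default
	 * slice length; the weighted-vtime variant of the sample calls
	 * scx_bpf_dispatch_vtime() here instead. */
	scx_bpf_dispatch(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
}
```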
3. When a CPU is ready to schedule, it first checks its local DSQ. If the local DSQ is empty, it checks the global DSQ. If there is still no task to run, it calls ops.dispatch(), which can populate the local DSQ using the following two functions:

   - scx_bpf_dispatch() dispatches a task to a DSQ. Any target DSQ can be used: SCX_DSQ_LOCAL, SCX_DSQ_LOCAL_ON | cpu, SCX_DSQ_GLOBAL, or a custom DSQ. While it is currently not possible to call scx_bpf_dispatch() while holding a BPF lock, this is under development and will be supported. scx_bpf_dispatch() schedules the dispatch rather than performing it immediately; up to ops.dispatch_max_batch tasks can be pending.
   - scx_bpf_consume() moves a task from the specified non-local DSQ to the dispatching CPU's local DSQ. This function cannot be called while holding any BPF lock. scx_bpf_consume() flushes the pending dispatched tasks before attempting to consume the specified DSQ.

   scx_simple's ops.dispatch() simply consumes the shared DSQ, as shown below:
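A sketch based on tools/sched_ext/scx_simple.bpf.c:

```c
void BPF_STRUCT_OPS(simple_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Move one task from the shared custom DSQ to this CPU's local
	 * DSQ so the CPU has something to run. */
	scx_bpf_consume(SHARED_DSQ);
}
```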
4. After ops.dispatch() returns, if there is a task in the local DSQ, the CPU runs the first one. If the local DSQ is still empty, the CPU takes the following steps:

   - Try to consume the global DSQ; if successful, run that task.
   - If ops.dispatch() has dispatched any tasks, retry step 3.
   - If the previous task is an SCX task and is still runnable, continue running it (see SCX_OPS_ENQ_LAST).
   - Go idle.
Note that the BPF scheduler can always choose to dispatch tasks immediately in ops.enqueue(). If only the built-in DSQs are used, there is no need to implement ops.dispatch(): tasks never queue on the BPF side, and the local and global DSQs are used automatically.
scx_bpf_dispatch() queues tasks on the target DSQ in FIFO order. Use scx_bpf_dispatch_vtime() for priority (vtime) queueing. The built-in DSQs SCX_DSQ_LOCAL and SCX_DSQ_GLOBAL do not support priority queueing and must be used with scx_bpf_dispatch(). Refer to the function documentation and the usage in tools/sched_ext/scx_simple.bpf.c for more information.
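As a hedged sketch of vtime-based priority queueing, loosely following scx_simple's weighted-vtime mode (the function name simple_enqueue_vtime is made up for illustration; the real sample folds this logic into simple_enqueue):

```c
void BPF_STRUCT_OPS(simple_enqueue_vtime, struct task_struct *p, u64 enq_flags)
{
	u64 vtime = p->scx.dsq_vtime;

	/* Queue on the custom SHARED_DSQ ordered by vtime rather than in
	 * FIFO order. Built-in DSQs (SCX_DSQ_LOCAL/SCX_DSQ_GLOBAL) do not
	 * accept vtime dispatch; only user-created DSQs do. */
	scx_bpf_dispatch_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime, enq_flags);
}
```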
Switching to sched_ext
CONFIG_SCHED_CLASS_EXT is the configuration option that enables sched_ext, and tools/sched_ext contains the sample schedulers. To use sched_ext, the following configuration options should be enabled when compiling the kernel:
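Per the sched_ext documentation (Documentation/scheduler/sched-ext.rst):

```
CONFIG_BPF=y
CONFIG_SCHED_CLASS_EXT=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_DEBUG_INFO_BTF=y
CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_BPF_JIT_DEFAULT_ON=y
CONFIG_PAHOLE_HAS_SPLIT_BTF=y
CONFIG_PAHOLE_HAS_BTF_TAG=y
```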
As of the latest Patch V7, the code repository is https://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git/. sched_ext is used only when a BPF scheduler is loaded and running. If a task's scheduling policy is explicitly set to SCHED_EXT before a BPF scheduler is loaded, the task is treated as SCHED_NORMAL and scheduled by CFS; once the BPF scheduler is loaded, such tasks switch to sched_ext and are scheduled by the BPF scheduler program.
Terminating the sched_ext scheduler program, triggering SysRq-S, or the detection of any internal error (including a stalled runnable task) aborts the BPF scheduler and reverts all tasks back to CFS.
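For example, building and running the scx_simple sample might look like the following session (illustrative output in the spirit of the sched_ext documentation; the local/global counters will vary):

```bash
# make -j16 -C tools/sched_ext
# tools/sched_ext/scx_simple
local=0 global=3
local=5 global=24
local=9 global=44
^CEXIT: BPF scheduler unregistered
```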
The current state of the BPF scheduler can be determined as follows:
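Per the sched_ext documentation, the state is exposed under /sys/kernel/sched_ext:

```bash
# cat /sys/kernel/sched_ext/state
enabled
# cat /sys/kernel/sched_ext/root/ops
simple
```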
tools/sched_ext/scx_show_state.py is a drgn script that provides more detailed information:
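Example output (field names per the sched_ext documentation; values will vary):

```bash
# tools/sched_ext/scx_show_state.py
ops           : simple
enabled       : 1
switching_all : 1
switched_all  : 1
enable_state  : enabled (2)
bypass_depth  : 0
nr_rejected   : 0
```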
If the CONFIG_SCHED_DEBUG option is set at compile time, whether a given task is running on sched_ext can be determined as follows:
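Per the sched_ext documentation:

```bash
# grep ext /proc/self/sched
ext.enabled        :        1
```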
Summary
This article briefly introduced the kernel schedulers CFS/EEVDF, the implementation mechanism and workflow of sched_ext, and the specifics of the scx_simple BPF scheduler in the kernel. If you are interested in the implementation and application of sched_ext, you can build the kernel and experiment with the schedulers implemented in the tools/sched_ext/ directory; you are sure to make more discoveries and gain further insight into scheduling. With the merge of sched_ext into the kernel, testing and deployment will become much more convenient.
My knowledge is limited; criticism and corrections are welcome if you find any mistakes.
- Author: DavidDi
- Link: https://www.ebpf.top/en/post/bpf_sched_ext_dive_into/
- License: This work is licensed under the Attribution-NonCommercial-NoDerivs 4.0 International license. Kindly fulfill the requirements of this license when adapting or creating derivatives of this work.
- Last Modified Time: 2024-09-16 11:38:10.402366081 +0800 CST