Read more at: https://www.ebpf.top/post/bpf_sched_ext

1. The Emergence of Pluggable Scheduler [2004]

In 2004, Con Kolivas proposed the idea of a pluggable scheduler to the Linux community, envisioning multiple schedulers in the kernel from which users could choose at boot time. The patch split a large amount of code in kernel/sched.c into a common part and a private part. It also added, in scheduler.c, a set of pointers to the functions that handle scheduling work; these functions were invoked on process events (fork(), exit(), etc.) so that the scheduler could gather the information it needed. Implementing a new scheduler then only required writing replacement functions and plugging them in. However, the submission met strong opposition from kernel developer Ingo Molnar, who believed that pluggable schedulers would discourage patches to the scheduling domains code and instead lead to separate schedulers for specific scenarios such as NUMA scheduling and SMP scheduling.

Ingo Molnar’s position was clear: if everyone tends only to their own small corner, the scheduler as a whole loses coordinated effort and code contributions, and what remains is a collection of schedulers each tailored to a single scenario.

While I couldn’t find more detailed information on this discussion, it was evident that this attempt was destined to fail.
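
The function-pointer mechanism described above can be illustrated with a small user-space sketch. This is not Con Kolivas’s patch: the names `struct sched_driver`, `select_scheduler`, and the two toy policies are hypothetical and exist only to show how a table of hooks lets a scheduler be swapped by name, much like a boot-time choice.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical task record; the kernel's real task structure is far richer. */
struct task {
    int pid;
    int priority;   /* assumption for this sketch: larger value = more urgent */
};

/* Hypothetical table of hooks that each pluggable scheduler fills in. */
struct sched_driver {
    const char *name;
    void (*task_fork)(struct task *t);   /* invoked when a task is created */
    void (*task_exit)(struct task *t);   /* invoked when a task exits */
    struct task *(*pick_next)(struct task *tasks, size_t n);
};

/* Policy 1: FIFO. Tracks the number of live tasks and runs the first one. */
static int fifo_live;
static void fifo_fork(struct task *t) { (void)t; fifo_live++; }
static void fifo_exit(struct task *t) { (void)t; fifo_live--; }
static struct task *fifo_pick(struct task *tasks, size_t n)
{
    return n ? &tasks[0] : NULL;
}

/* Policy 2: strict priority. Clamps negative priorities at fork time. */
static void prio_fork(struct task *t) { if (t->priority < 0) t->priority = 0; }
static void prio_exit(struct task *t) { (void)t; }
static struct task *prio_pick(struct task *tasks, size_t n)
{
    struct task *best = NULL;
    for (size_t i = 0; i < n; i++)
        if (!best || tasks[i].priority > best->priority)
            best = &tasks[i];
    return best;
}

static const struct sched_driver drivers[] = {
    { "fifo", fifo_fork, fifo_exit, fifo_pick },
    { "prio", prio_fork, prio_exit, prio_pick },
};

/* Mimics a boot-time choice: look the driver up by name, default to the first. */
static const struct sched_driver *select_scheduler(const char *name)
{
    for (size_t i = 0; i < sizeof(drivers) / sizeof(drivers[0]); i++)
        if (name && strcmp(drivers[i].name, name) == 0)
            return &drivers[i];
    return &drivers[0];
}
```

With this layout, the common code only ever calls through the table, for example `select_scheduler("prio")->pick_next(tasks, n)`, so adding a third policy means adding one more entry to `drivers` rather than touching the callers.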

2. Roman’s First Attempt at Leveraging BPF Programmability for Scheduler [2021]

Fast forward to 2021, seven years after BPF, hailed as a kernel superpower, was merged into the kernel in 2014. In that time BPF had made remarkable strides in observability, networking, security, and more, becoming a top-tier kernel subsystem. Using BPF’s programmability in the scheduler, moving it from configurable to programmable, emerged as a highly appealing direction.

In September 2021, Facebook’s Roman Gushchin submitted a set of scheduler-related BPF patches (six small patches). With this patch set, Roman aimed to start a discussion on integrating BPF into the scheduler. However, some scheduler experts worried that the mechanisms in the patches would encourage micro-optimizations tailored to specific scenarios and hinder broader optimization of the scheduler (note: similar to the concerns raised in the 2004 discussion). This was naturally a contentious topic: general-purpose and workload-specific needs coexist, and for many individual users a customized policy is the more attractive solution.

Image of Roman Gushchin (from GitHub)

When I first came across the scheduler-related BPF submission, I wrote an article, “When BPF Meets CPU Scheduler”, to explore the subject. However, after several rounds of discussion, Roman’s submission failed to attract significant attention from the scheduler community. I also noticed a July 2022 email thread in which Yafang Shao offered some suggestions and Roman indicated that fixes were planned for the next version, yet no updated patch set appeared. In later replies, Roman said that this area was not his current focus but that he might revisit it in the future. Although the effort was evidently under-prepared, it did open a dialogue between BPF and the kernel scheduler. Roman planted the seed, and it was bound to blossom eventually.

3. Tejun Heo Takes the Baton and Resumes the Battle [2022]

Image of Tejun Heo (2019)

In November 2022, Tejun Heo (Meta) took up the mantle and, along with David Vernet (Meta), Josh Don (Google), and Barret Rhoden (Google), submitted a patch set consisting of 30 small patches to the Linux community (61 files changed, 9672 insertions(+), 136 deletions(-)). This move reignited the BPF offensive on the scheduler, naming it sched_ext, with the core idea being to allow scheduling policies to be implemented as BPF programs. The submission also mentioned receiving support from some active kernel developers at Google.

In the patch submission, Tejun elaborated on the value proposition of introducing sched_ext, dedicating considerable content to its justification. Key points included:

  • Ease of Experimentation and Exploration: Facilitating rapid iteration for implementing new scheduling policies.

    Tejun argued at length for why this is necessary. Some key points:

    • Complexity of Schedulers: Changes in CPU architectures, system evolution, and varying use cases have significantly increased scheduler complexity.

    • Significance of Experimentation: Both AMD and Google have conducted extensive experiments related to schedulers, with Huawei also engaged in similar scheduler programmability efforts.

    • Challenges with the CFS Scheduler: Extending it is difficult and time-consuming; the barrier to entry is high and results come slowly, which keeps the academic community from contributing to scheduler evolution.

    • Advantages and Trade-offs of sched_ext: sched_ext addresses these challenges by providing an accessible framework for scheduling experiments. It offers convenient callbacks and helper functions that simplify common operations and reduce complexity. Because it uses BPF (Berkeley Packet Filter), programs are statically analyzed so that they cannot crash the system, which makes safe experiments and rapid iteration possible. sched_ext streamlines scheduling experiments, eases the integration of machine learning, and, according to the submission, achieved a 15% throughput improvement in Nginx benchmark tests.

    Tejun also stressed that a key advantage of sched_ext is its use of BPF. BPF provides strong safety guarantees by statically analyzing programs at load time to ensure they cannot corrupt or crash the system. Whatever BPF scheduler is loaded, sched_ext can guarantee system integrity, and it provides a mechanism to safely disable the current BPF scheduler and move tasks back to a trusted scheduler.

    In short, due to the increasing complexity of modern systems, exploration in scheduling is crucial. Tools like sched_ext significantly lower the entry barriers, support fast experiments, and have the potential to bring revolutionary advancements in scheduler design and performance.

  • Customizable: Building application-specific schedulers for scenarios that are not supported by generic scheduling strategies.

    With flexible scheduling policies, sched_ext enables users to implement application-specific schedulers. Unlike improving CFS, custom schedulers can be optimized for specific applications or hardware and better adapt to specific industry scheduling scenarios like aviation. While sched_ext may increase fragmentation in scheduler implementations, its experimental platform helps enhance overall performance and development.

  • Rapid Scheduler Deployment: Switching scheduling strategies in production environments without interruption.

    Kernel upgrades are often slow, sometimes taking months, which hurts most when an urgent fix is needed. Livepatch has limited applicability and cannot patch scheduling policies. sched_ext can roll out a new scheduling policy quickly; one example is core scheduling as a mitigation for the L1TF vulnerability. The upstream process for core scheduling was lengthy, whereas a sched_ext policy can be deployed quickly to cope in the meantime. Google has previously mitigated performance issues caused by low-priority workloads through sched affinity; sched_ext allows new policies to be tested and deployed rapidly until upstream solutions are available.
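
The idea of expressing a policy as callbacks, and of disabling a misbehaving policy and handing tasks back to a trusted scheduler, can be sketched as a user-space toy. This is not the sched_ext API: `struct ext_ops`, `schedule_tick`, and the `STALL_LIMIT` starvation check are hypothetical names and assumptions chosen for illustration, and the real safety guarantees come from the BPF verifier and the kernel rather than from anything shown here.

```c
#include <stddef.h>

/* Assumption: ticks a task may wait before the plug-in policy is ejected. */
#define STALL_LIMIT 4

struct task {
    int pid;
    int waited;   /* ticks since this task last ran */
};

/* Hypothetical callback table standing in for a loadable policy. */
struct ext_ops {
    int (*pick)(const struct task *tasks, size_t n);  /* index of task to run */
};

struct sched_state {
    const struct ext_ops *ext;   /* NULL once the plug-in has been disabled */
};

/* Trusted built-in policy: run whichever task has waited longest (n > 0). */
static int builtin_pick(const struct task *tasks, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (tasks[i].waited > tasks[best].waited)
            best = i;
    return (int)best;
}

/* One scheduling tick; returns the index of the task that ran. */
static int schedule_tick(struct sched_state *s, struct task *tasks, size_t n)
{
    int idx = -1;

    /* Eject the plug-in for good if any task has been starved. */
    for (size_t i = 0; i < n; i++)
        if (tasks[i].waited >= STALL_LIMIT)
            s->ext = NULL;

    if (s->ext) {
        idx = s->ext->pick(tasks, n);
        if (idx < 0 || (size_t)idx >= n)
            s->ext = NULL;   /* an invalid answer also ejects the plug-in */
    }
    if (!s->ext)
        idx = builtin_pick(tasks, n);

    /* Account waiting time: the chosen task resets, the others age. */
    for (size_t i = 0; i < n; i++)
        tasks[i].waited = ((int)i == idx) ? 0 : tasks[i].waited + 1;
    return idx;
}

/* A deliberately bad plug-in: always runs task 0 and starves the rest. */
static int greedy_pick(const struct task *tasks, size_t n)
{
    (void)tasks;
    (void)n;
    return 0;
}

static const struct ext_ops greedy_ops = { greedy_pick };
```

Starting with three tasks and `greedy_ops` loaded, the plug-in runs task 0 for four ticks; on the fifth tick the starvation check trips, the plug-in is dropped, and the built-in policy begins serving the tasks that were left waiting.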

Tejun’s summary is excellent: he explained in detail the value of sched_ext and how it delivers that value in various scenarios. In simple terms, the core scheduler is complex, increasingly ill-suited to modern hardware architectures and specific scenarios, and slow to evolve, which hinders research in the scheduler field. sched_ext can address these issues. Thanks to its flexibility, it makes it possible to deploy solutions to production problems quickly, and it injects new vitality into academic scheduler research. These arguments were surely intended to win interest in and support for sched_ext.

4. Tejun Struggles with Repeated Rejections from Scheduler Gurus [2023]

Subsequently, Tejun persisted and submitted V2, V3, and V4 in January, April, and July 2023, respectively.

In the V4 submission, more than half a year after the initial patch set, Tejun expressed confidence that the earlier feedback had been addressed and that sched_ext was now mature enough to be merged into the mainline. In effect he was asking: what does everyone think? I am ready, and I hope for your broad support as well.

Barret Rhoden (Google) [one of the authors of the submission] was the first to respond with support. Barret mentioned ongoing experimental scheduler work based on ghost at Google, which can all be ported to sched_ext; internal testing of sched_ext patches at Google has begun, though not yet enabled due to patch dependencies, with updates planned.

Linux scheduler maintainer Peter Zijlstra also joined the discussion. On July 21, 2023, he explicitly stated that he would not merge the submission (“So, since you wanted it in writing, here goes: NAK”). He added that he saw no value in merging the code and withdrew from the conversation.

Image of Peter Zijlstra in 2009

> I’m still hating the whole thing with a passion.
>
> It’s crystal clear from the rampant abuse of SCHED_DEBUG; folks in general aren’t much into doing things the right way. They tinker with random numbers (some are downright bonkers) until their workload shows a flicker of improvement and call it a wrap.
>
> Without a shadow of a doubt, if I merge this, there will be Enterprise software out there demanding its own BPF scheduler thingamajig to run or it simply won’t budge.
>
> They won’t give a hoot, won’t lend a hand, might even pull a RedHat and keep the code under lock and key for customers only.
>
> We all end up on the losing side in this scenario. Especially me, lugging around the extra burden of maintenance.
>
> And I fail to see any silver lining in incorporating this. Just tinker with off-tree schedulers till you strike gold and then throw in what actually works.
>
> So, since you wanted it in writing, here it goes:
>
> NAK

Naturally, Tejun doesn’t agree with many of the concerns raised. He asserts that certain scheduling issues cannot be resolved simply by tweaking the current scheduler, especially in environments like “super large scale” setups such as Meta. He doesn’t share Peter’s concern about sched_ext adding to the maintenance load, citing BPF working well alongside other kernel modules. Tejun believes enabling users to explore new possibilities is beneficial, even if it sometimes leads to “silly scenarios” where people opt for new functionalities.

In essence, Tejun argues that the opponents are fixated on the potential costs of sched_ext (which he believes are exaggerated) without taking into account the benefits it could bring. He responds:

In many production settings, aspects of workload behavior can be challenging to fully grasp. Workloads are often highly intricate, evolving with contributions from many individuals and dynamically interacting with external entities. Scheduling is an area of keen interest when striving to optimize system performance. Most individuals are not sufficiently familiar with the scheduler codebase to make modifications. Even if they are, setting up benchmarks in production and iterating with different kernels can be nontrivial. It’s not surprising to opt for adjustable parameters as they are the only available option, and tweaking these parameters often yields some gains. However, the widespread availability of sched_ext will facilitate easier and more extensive experimentation, aiding us in acquiring a deeper understanding of scheduling.

Both Meta and Google are committed to sharing what they learn, including code and experiences.

As time passed, Tejun pinged the thread on August 3 and August 10, 2023, hoping for more discussion.

However, Mel Gorman from SUSE Labs also voiced support for Peter Zijlstra’s stance, and Tejun replied to his points specifically.

Lastly, Josh Don (Google) [one of the contributors] responded in support of sched_ext and mentioned that Google and Meta were collaborating to drive this initiative:

I’d like to reiterate Google’s support for this proposal (speaking on behalf of Google) and express that pluggable schedulers have undergone significant experimentation within the ghost framework, displaying conspicuous results. Looking ahead, we plan to redesign the ghost user-space infrastructure to operate atop the sched_ext kernel infrastructure. We believe sched_ext design offers numerous advantages, especially with tightly integrated BPF. We are committed to the concept of pluggable scheduling and are closely partnered with Meta to advance this effort, also deploying it internally.

Following the conclusion of this dialogue in October 2023, the integration of sched_ext hit a standstill.

5. Tejun Heo Refines the Version and Rekindles the Drive, Dawn Appears [2024]

In November 2023 and May 2024, Tejun Heo kept refining the code and submitted V5 and V6 respectively, showing relentless determination to reach the goal (backed by Meta and Google, the confidence was certainly there). V5 changed 74 files, adding 17207 lines and removing 105, whereas V6 changed 96 files, adding 15056 lines and removing 139.

In early May 2024, Tejun Heo started a new round of sched_ext discussions based on V6, mentioning Valve’s plans to use sched_ext for better game scheduling on the Steam Deck. Ubuntu was considering shipping support in its 24.10 release. Meta and Google were gradually adopting it for production use. In addition, there was interest in using it in ChromeOS, and Oculus was also considering adoption. Tejun Heo wrapped up by stating:

> Given that there already is substantial adoption which continues to grow and sched_ext doesn’t affect the built-in schedulers or the rest of the kernel in an invasive manner, I believe it’s reasonable to consider sched_ext for inclusion.

At the LSFMM + BPF conference in early May 2024, David Vernet (Meta), one of Tejun Heo’s collaborators, provided a summary of the progress and developments of sched_ext.

In the V6 discussion, scheduler maintainer Peter Zijlstra remained opposed, though his stance seemed to loosen slightly. He said that the patch set would not be considered until the cgroup issue was resolved, and he complained that since Meta and Google were both willing to push sched_ext jointly, they should first fix the lingering cgroup issue they had left behind before discussing anything further.

I fundamentally believe the approach to be detrimental to the scheduler eco-system. Witness the metric ton of toy schedulers written for it, that’s all effort not put into improving the existing code.

Tejun, while pointing out that the cgroup issue is unrelated to the sched_ext submission, was willing (albeit reluctantly) to help push its resolution forward in order to pave the way for merging sched_ext.

However, Tejun strongly disagreed with Peter’s characterization of sched_ext schedulers as “toy schedulers” and with the assertion that sched_ext would divert attention from the mainline scheduler. Tejun argued that there is no perfect CPU scheduler, and that because the mainline scheduler must cater to all users’ needs, exploring “radical ideas” in it is nearly impossible and the number of people able to work on scheduling is severely limited.

Despite the disagreements, with the growing influence of sched_ext and Tejun’s (Google + Meta) continuous pressure, scheduler authority Peter had to make some compromises. After all, the issues in the open-source community are not solely technical. A glimmer of hope for the integration of sched_ext is finally visible.

6. Plot Twist, Linus Makes the Ultimate Decision to Integrate in 6.11 Version [2024.6.11]

Upon the initial submission of sched_ext, Linus participated in some technical discussions. Watching his team members go back and forth in the mailing list discussions, the big boss Linus couldn’t stay idle.

On June 11, 2024, Linus finally spoke up, deciding to merge sched_ext for the 6.11 release. Linus Torvalds made this decisive call despite opposition from many kernel developers. In his response, he wrote:

I honestly see no reason to delay this any more. This whole patchset was the major (private) discussion at last year’s kernel maintainer summit, and I don’t find any value in having the same discussion (whether off-list or as an actual event) at the upcoming maintainer summit one year later, so to make any kind of sane progress, my current plan is to merge this for 6.11.

With that, the fate of sched_ext, the BPF-based implementation of programmable scheduling, was finally settled. Going forward, we will keep tracking the integration of sched_ext into the 6.11 kernel. In the next article, we will look at the mechanism and implementation examples of sched_ext. Stay tuned, and feel free to offer constructive feedback, as the author’s expertise is limited.