Notes from the Real-time Micro conference at Linux Plumbers 2019

Core Scheduling for RT – Peter Zijlstra

The L1TF/MDS vulnerabilities make HT unsafe when two applications that do not trust each other share the same core. The current mitigation is to disable HT. Core scheduling exists to allow SMT when it is safe, for instance when two threads trust each other. SMT/HT is not always good for determinism, as the execution speed of an individual hardware thread can vary a lot. Core scheduling can be used to force-idle the siblings of a core while an RT task runs, while allowing non-RT tasks to use all threads of the same core (when safe).

Core scheduling will work for any task (not just RT) and is currently implemented via cgroups, which Peter figured was the most natural interface. People want this feature for various reasons, so it will eventually be merged.

Regarding real-time schedulers, SCHED_DEADLINE’s admission control can only admit tasks onto a single hardware thread of each core, by limiting the amount of runtime/period available in the system. But if trusted threads are allowed to share a core’s CPU time, the bandwidth available on that core can be doubled.
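
As a concrete, purely illustrative example: with a utilization-based test that admits tasks on a hardware thread only while sum(runtime_i / period_i) <= 1, a core whose two SMT siblings can be admitted together (because the tasks trust each other) could accept a total utilization of up to 2 instead of 1.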

Thomas suggested making the permission to share a core (SMT) part of setting the scheduling policy.

Load balancing might become tricky, since it does not currently take into account where bandwidth has been allocated. Luca’s capacity-awareness work might help there.

RT_RUNTIME_SHARE has been removed from the RT tree and should be removed upstream as well. RT throttling goes away once the deadline server is available.

RCU configuration, operation, and upcoming changes for real-time workloads – Paul McKenney

RCU callback self-throttling: could we prevent somebody from queueing callbacks while servicing them? It would be valuable, but there are valid use cases (e.g., refcounting). Instead, slowly increase the limit on how many callbacks to invoke, until it reaches a limit large enough to just do them all (Eric Dumazet’s patch).
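
A minimal, user-space-flavored sketch of that ramping idea (all names and constants are illustrative, not taken from the actual patch):

    /*
     * Sketch: invoke callbacks in bounded batches, growing the batch
     * size whenever we fall behind, until the limit is large enough
     * to simply drain the whole queue.
     */
    struct cb {
            struct cb *next;
            void (*func)(struct cb *);
    };

    #define CB_LIMIT_DRAIN_ALL 10000      /* the "considerable limit" */

    static long cb_limit = 10;            /* small initial batch */

    /* Returns the new queue head; NULL once fully drained. */
    static struct cb *invoke_some_callbacks(struct cb *head)
    {
            long n = 0;

            while (head) {
                    struct cb *cb = head;

                    head = cb->next;
                    cb->func(cb);
                    if (++n >= cb_limit && cb_limit < CB_LIMIT_DRAIN_ALL)
                            break;        /* yield; resume on the next pass */
            }
            if (head)
                    cb_limit *= 2;        /* still behind: ramp the batch up */
            return head;
    }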

For RCU callback offloading, if CPU0 has its thread managing the callbacks for other CPUs, but CPU0 then gets a bunch of callbacks of its own, the manager thread can become too busy processing CPU0’s callbacks, starving the callbacks of the other CPUs. One proposed change is to create a separate thread that handles grace periods, so that no single thread has the dual role of managing grace periods and also invoking callbacks.
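
(For reference, callback offloading is selected at boot with the existing rcu_nocbs= kernel parameter; for example, rcu_nocbs=1-7 offloads the callbacks of CPUs 1-7 to kthreads. The CPU list is illustrative.)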

Peter suggested giving rcuog a slightly higher priority than the rcuop threads, so that it can always handle grace periods while the rcuop thread(s) are invoking callbacks. It should be verified whether a single rcuog thread, pinned to a CPU, would not be better than the current approach, which creates sqrt(max available CPUs) such threads. The current impression is that on large systems, the cost of waking up all the threads that handle callbacks would prevent a single rcuog thread from tackling (observing) grace periods that could be expiring. The number of GP threads is calculated from the total CPU count, so the maximum is sqrt(total CPUs).
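
For example (illustrative numbers): on a 64-CPU system this gives at most sqrt(64) = 8 rcuog grace-period threads, each waking the rcuop callback threads of its group of 8 CPUs, instead of one thread having to wake all 64.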

One issue: if the admin offloads the callbacks of all CPUs to one CPU, and that CPU gets overloaded enough that it cannot keep up, the system can OOM (the callback processing, effectively RCU’s garbage collection, becomes too slow and uses up all memory). The options discussed were:

  • First choice: OOM.
  • Second choice: print a warning during the OOM.
  • Third choice: detect offloading issues and delay call_rcu().
  • Fourth choice: detect offloading issues and stop offloading (ignoring what the admin asked for).

If #4 is implemented, a command-line parameter will probably be needed to enable it, as not everyone will want that behavior (some prefer to OOM in this case).

Mathematizing the latency – Daniel Bristot de Oliveira 

For most people, it is not clear what the components of the latency are. Daniel proposes to improve the situation by breaking the latency down into independent variables, applying measurement or probabilistic methods to obtain a value for each variable, and then summing the individual variables to determine the possible worst case.
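
A sketch of the kind of decomposition being proposed (the terms here are illustrative, not Daniel’s exact model):

    latency <= max_irq_disabled_time + max_preempt_disabled_time
               + scheduling_overhead + sum(interrupt_interference)

Each term can then be measured (or modeled probabilistically) in isolation, and the worst case bounded by the sum of the individual bounds.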

How do we define latency: until the context switch takes place, or until the point at which the scheduled process effectively starts running? Daniel proposed the context switch, but Thomas said that it is when the process effectively starts running. Daniel then pointed to the return from the scheduler. Peter Zijlstra noted that the most important thing is to define the model, with its individual pieces; then we will be able to see things more clearly.

The main problem comes with accounting for interrupts. There is no way to timestamp when the hard interrupt actually triggered in the hardware. One may be able to use something like a timestamped networking packet to infer the time, but there is currently no utility to do so. Also, interrupts are not periodic, so it is hard to define their behavior with the simple models used in the RT literature. However, the IRQ prediction used for idle-duration estimation in the mainline kernel is a good starting point for analysis.

Real-Time Softirq Mainlining – Frederic Weisbecker

Softirq design has not changed in about three decades and is full of hacks (Frederic showed some of them in his slides).

Softirqs are now annoying for latency-sensitive tasks, as users want a higher-priority softirq to be able to interrupt other softirqs on RT. Thomas added that this applies not only to -rt people, but to vanilla as well. Mainline people want to find a balance between softirq processing and handling the interrupt itself (the networking case). Currently, the RT kernel has one softirq kthread; multiple threads were tried in the past, but issues were encountered.

Making softirq disabling more fine-grained (as opposed to all on/off) is a wish, and it makes sense outside of RT as well. The question is: do softirq vectors share data among themselves? We do not know, so we do not know what would break; it is a real open problem.
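
For reference, these are the softirq vectors in the mainline kernel (from include/linux/interrupt.h) that a finer-grained disable would have to mask individually:

    enum {
            HI_SOFTIRQ = 0,
            TIMER_SOFTIRQ,
            NET_TX_SOFTIRQ,
            NET_RX_SOFTIRQ,
            BLOCK_SOFTIRQ,
            IRQ_POLL_SOFTIRQ,
            TASKLET_SOFTIRQ,
            SCHED_SOFTIRQ,
            HRTIMER_SOFTIRQ,
            RCU_SOFTIRQ,
            NR_SOFTIRQS
    };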

The following points were raised:

  • Problem: if softirqs run on the return from interrupt, you will never get anything done. If you push them out to other threads, that is perfectly fine for networking, but it breaks others.
  • For mainline, the way forward would be a ksoftirqd pending mask. Thomas agreed, but added that people would then want a way to go back to the RT model, with one thread per vector.
  • RCU should move out of softirq.
  • Good use case: networking conflicts with block/fs today, and this should solve a lot of it – a long-term process, good for Frederic’s patch count.
  • lockdep support is a big chunk of the work.
  • Some drivers might never get converted, which is the same problem we had with the BKL. We need to work with the subsystem experts and fix them.

Full Dynticks – Frederic Weisbecker

Some users are requesting full dynticks for full CPU isolation. Their use case usually involves staying in userspace, polling on devices like PCI or networking devices, not only on bare metal but also in virtual machines: people want to run the RT kernel on the host, plus a VM running the RT kernel while polling inside the VM. They want both host and guest to be tick-free.

Currently, there is one tick every 4 seconds for the timer watchdog, which is not exactly a tick. tsc=reliable on the kernel command line helps to remove it, but above two sockets that claim is a lie (Steven showed a graph of this happening).
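
For reference, a typical isolation setup combines these existing knobs on the kernel command line (the CPU list is illustrative):

    isolcpus=1-7 nohz_full=1-7 rcu_nocbs=1-7 tsc=reliable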

For full dynticks to work, it requires:

  • Fixing /proc/stat for NOHZ
  • An appropriate RCU lifecycle for a task
  • Cleaning up the code in tick-sched

Another suggestion was to make nohz_full mutable via cpusets, but that is black magic!

PREEMPT_RT: status and Q&A – Thomas Gleixner

The CONFIG_PREEMPT_RT switch is now in mainline but is not functional yet. It took 15 years on LKML and 20 years for Thomas.

There are still a few details to be hashed out for the final merge, but having the CONFIG_PREEMPT_RT switch in mainline already helps, because people did not want to change code without knowing whether PREEMPT_RT would ever go mainline. A lot of material is queued for v5.4, taking a significant amount out of the RT patchset, including an outstanding cleanup of printk (discussed in a BoF that included Linus). A more substantial printk chunk is still to land, hopefully in 5.5.

Q: What do you expect from the new mainline kernel with the RT code in, but RT not enabled?

A: Should be the same.

Q: Once the mainline kernel builds a functional PREEMPT_RT – what’s next?

A: Fix the functionalities that are disabled with RT, which people want; the main one is eBPF.

BPF relies heavily on preempt_disable, mainly because of the spinlocks embedded in the bytecode. Nowadays the bytecode is small and should not affect latency too much, but there are already plans to accept more extensive programs, which would cause latencies unacceptable for RT.

The problem with preempt_disable and friends is that they are “scope less,” making it hard to define what they are protecting. So a possible solution is to create specific protections, e.g., for eBPF.  A candidate is the usage of the local lock. On non-rt, it allocates zero space. On RT, it allocates a lock and then behaves semantically like preempt_disable. And then in RT, you have scope to know what is being protected. Once you have a scope, lockdep will work. (percpu variable access bug detected by the local lock).