Two new stalld releases: 1.18 and 1.17.2

Since day 1, stalld has had a limitation: it consumes too much CPU time on very large systems.

The main culprit was parsing the sched/debug file. That approach is also one of stalld's defining characteristics: it offloads all the work to user space without touching the monitored CPUs.

Also since day 1, I have thought about using tracing to collect the wakeups on the monitored CPUs, but I wanted to avoid the overhead of processing trace data, as it could consume as much CPU time as parsing sched/debug.

So, to have the best balance, I had to use eBPF.

Instead of tracing, stalld can now use an eBPF program to track the enqueue and dequeue of tasks in the per-CPU runqueue, saving the minimum required information into a map. This map is processed in user space, so stalld can still detect stalls from a housekeeping CPU.
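To make the idea concrete, here is a minimal sketch of the user-space side in Python. It is not stalld's actual code: the RunqueueMap class, the find_stalled function, and the threshold value are hypothetical names made up for illustration, and a plain dictionary stands in for the eBPF map. It also simplifies the stall criterion to "a task has been sitting in a runqueue longer than a threshold", which may differ from what stalld really checks.

```python
from dataclasses import dataclass, field


@dataclass
class RunqueueMap:
    """Hypothetical stand-in for the eBPF map: cpu -> {pid: enqueue time in ns}."""
    queued: dict = field(default_factory=dict)

    def on_enqueue(self, cpu, pid, now_ns):
        # What the kernel-side program would record when a task is enqueued.
        self.queued.setdefault(cpu, {})[pid] = now_ns

    def on_dequeue(self, cpu, pid):
        # What the kernel-side program would record when a task is dequeued.
        self.queued.get(cpu, {}).pop(pid, None)


def find_stalled(rq_map, now_ns, threshold_ns):
    """User-space pass over the map: return (cpu, pid) pairs waiting too long."""
    stalled = []
    for cpu, tasks in rq_map.queued.items():
        for pid, since_ns in tasks.items():
            if now_ns - since_ns >= threshold_ns:
                stalled.append((cpu, pid))
    return sorted(stalled)
```

The point of the split is that the per-event work is tiny (one map update), while the scan that decides whether something is stalled runs periodically in user space, away from the monitored CPUs.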

I will write a post about the challenges of integrating eBPF into stalld soon, probably after my vacation.

As some distros might not support eBPF well, I will keep stalld 1.17 as a long-term version. It is the last version before adding eBPF to stalld and will receive fixes for a while.

Published by Daniel