Notes from the Real-time Micro conference at Linux Plumbers 2019

Core Scheduling for RT – Peter Zijlstra

The L1TF/MDS vulnerabilities make it unsafe to use HT when two applications that do not trust each other share the same core. The current solution to this problem is to disable HT. Core scheduling serves to allow SMT when it is safe, for instance, when two threads trust each other. SMT/HT is not always good for determinism, as the execution speed of the individual hardware threads can vary a lot. Core scheduling can be used to force-idle siblings for RT tasks while allowing non-RT tasks to use all threads of the same core (when safe).

Core scheduling will work for any task (not just RT) and is currently implemented via cgroups, as Peter figured that was the most natural way. People want this for various reasons, so it will eventually be merged.

Regarding real-time schedulers: today, SCHED_DEADLINE’s admission control can only allow tasks on a single thread of a core, by limiting the amount of runtime/period available in the system. But if we allow trusted threads to share the CPU time, we can double the amount available for that core.

Thomas suggested making the permission to share a core (SMT) part of setting the scheduling policy.

Load balancing might get tricky, since it does not currently take into account where bandwidth is allocated. Luca’s capacity awareness might help there.

RT_RUNTIME_SHARE has been removed from the RT tree and should be removed from upstream as well. RT throttling goes away once the server is available.

RCU configuration, operation, and upcoming changes for real-time workloads – Paul McKenney

RCU callback self-throttling: could we prevent somebody from queueing callbacks while we are servicing them? It would be valuable, but there are valid use cases (e.g., refcounting). Instead, slowly increase the limit on how many callbacks to invoke per pass, until the limit is large enough to invoke them all (Eric Dumazet’s patch).

For RCU callback offloading: if CPU0 has its thread managing the callbacks of other CPUs, but then CPU0 starts queueing a bunch of callbacks of its own, the manager thread can become too busy processing its own callbacks, which starves the callbacks of the other CPUs. One change is to create a separate thread that handles grace periods, instead of one thread having the dual role of managing callbacks and also invoking them.

Peter mentioned that rcuogp0 should run with a slightly higher priority than rcuop0, so that it can always handle grace periods while the rcuop threads are invoking callbacks. It should be verified whether a single rcuogp0 thread, pinned to a CPU, would not be better than the current approach, which creates sqrt(nr_cpus) grace-period threads. The current impression is that on large systems, the cost of waking up all the threads that handle callbacks would prevent a single rcuogp0 thread from tackling (observing) grace periods that could be expiring. The number of GP threads is computed from the total number of CPUs, with a maximum of sqrt(total CPUs).

One issue: if the admin offloads the callbacks of all CPUs to one CPU, and then that CPU gets overloaded enough that it cannot keep up, the system can OOM (as garbage collection becomes too slow and all memory is used up).

  • First choice: OOM
  • Second choice: print a warning during OOM
  • Third choice: detect offloading issues and delay call_rcu()
  • Fourth choice: detect offloading issues and stop offloading (ignoring what the admin asked for)

 If #4 is implemented, it probably needs a command-line parameter to enable it, as not everyone will want it (some prefer to OOM in this case).

Mathematizing the latency – Daniel Bristot de Oliveira 

For most people, it is not clear what the components of latency are. Daniel proposes to improve the situation by breaking the latency down into independent variables and applying measurement or probabilistic methods to obtain a value for each variable. The individual variables are then summed up to determine the possible worst case.

How do we define latency: until the context switch takes place, or until the point at which the scheduled process effectively starts running? Daniel proposed the context switch, but Thomas said it is when the process effectively starts running; Daniel then pointed to the return from the scheduler. Peter Zijlstra noted that the most important thing is to define the model, with its individual pieces, and then we will be able to see things more clearly.

The main problem comes with accounting for interrupts. There is no way to timestamp when the hard interrupt actually fired in the hardware. One may be able to use something like a timestamped networking packet to infer the time, but there is currently no utility to do so. Also, interrupts are not periodic, so it is hard to describe their behavior with the simple models used in the RT literature. However, the IRQ prediction used for idle-duration estimation in the mainline kernel is a good starting point for analysis.

Real-Time Softirq Mainlining – Frederic Weisbecker

The softirq design hasn’t changed in about three decades and is full of hacks (Frederic showed some of them in his slides).

Softirqs are now annoying for latency-sensitive tasks, as users want a higher-priority softirq to be able to interrupt other softirqs in RT. Thomas added that this concerns not only -rt people, but vanilla as well. Mainline kernel people want to find a balance between softirq processing and handling the interrupt itself (the networking case). Currently, the RT kernel has one softirq kthread; multiple threads were tried in the past, but we faced issues.

Making softirq disabling more fine-grained (as opposed to all on/off) is a wish, and it makes sense outside of RT as well. The open question is: do softirqs share data among themselves? We do not know, so we do not know what would break; it is a real open problem.

The following points were raised:

  • Problem: if softirqs run on return from interrupt, you will never get anything done. If you push them out to other threads, that is perfectly fine for networking, but it breaks others.
  • For mainline, the idea would be to have a ksoftirqd pending mask. Thomas agrees, but adds that people would then want a way to go back to that model in RT, where we have a thread per vector.
  • RCU should move out of softirq.
  • Good use case: networking conflicts with block/fs today, and this should solve a lot of it – a long-term process, good for Frederic’s patch count.
  • lockdep support is a big chunk.
  • Some drivers might never get converted, which is the same problem we had with the BKL. We need to work with the subsystem experts and fix them.

Full Dynticks – Frederic Weisbecker

Some users are requesting Full Dynticks for full CPU isolation. Their use case usually involves staying in userspace, polling on devices like PCI or networking cards, not only on bare metal but also in virtual machines: people want to run the RT kernel on the host, plus a VM running the RT kernel while polling inside the VM. They want both host and guest to be tick-free.

Currently, there is one tick every 4 seconds for the clocksource watchdog, which is not exactly a tick. Passing tsc=reliable on the kernel command line helps to remove it, but on systems with more than two sockets, claiming the TSC is reliable is a lie (Steven showed a graph of this happening).

For Full Dynticks to work, it requires:

  • Fixing /proc/stat for NOHZ
  • An appropriate RCU lifecycle for a task
  • Cleaning up the code in tick-sched
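For reference, this kind of isolation setup is typically expressed with kernel boot parameters along these lines (the CPU ranges are illustrative):

```
nohz_full=2-7 rcu_nocbs=2-7 isolcpus=nohz,domain,2-7 tsc=reliable
```

Here tsc=reliable carries the multi-socket caveat mentioned above.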

Another suggestion was to make nohz_full mutable via cpusets, but that is black magic!

PREEMPT_RT: status and Q&A – Thomas Gleixner

The CONFIG_PREEMPT_RT switch is now in mainline, but it is not functional yet. It took 15 years on LKML, and 20 years for Thomas.

There are still a few details to be hashed out for the final merge, but having the CONFIG_PREEMPT_RT switch in mainline helps, because people did not want to change code without knowing that PREEMPT_RT was going mainline. There is lots of stuff queued for v5.4, taking a significant amount out of the RT patchset, including an outstanding cleanup of printk (discussed in a BoF that included Linus). There is also a more substantial printk chunk still to land, hopefully for 5.5.

Q: What do you expect from a mainline kernel with the RT code merged but not enabled?

A: Should be the same.

Q: Once the mainline kernel builds a functional PREEMPT_RT – what’s next?

A: Fix the functionalities that are disabled with RT and that people want, the main one being eBPF.

BPF relies heavily on preempt_disable, mainly because of spinlocks embedded in the bytecode. Nowadays, the bytecode is small and should not affect latency too much, but there are already plans to accept more extensive code, which would cause unacceptable latencies for RT.

The problem with preempt_disable and friends is that they are scope-less, making it hard to define what they are protecting. A possible solution is to create specific protections, e.g., for eBPF. A candidate is the local lock. On non-RT, it occupies zero space; on RT, it becomes a lock that behaves semantically like preempt_disable. With that, RT gains a scope describing what is being protected, and once you have a scope, lockdep will work. (A per-cpu variable access bug has already been detected by the local lock.)

Some jargon of the complexity classes

In the context of formal verification:

PTIME: problems that can be solved with time complexity polynomial in the input size in bits;

PSPACE: problems that may incur memory consumption polynomial in the input size in bits;

NP: problems for which we can guess a solution and check it in time polynomial in the input size;

NP-complete: the hardest problems in NP, generally considered intractable in computer science;

EXPTIME: problems that consume CPU time exponential in the input size in bits;

EXPSPACE: the set of problems that consume at most memory exponential in the input size in bits;

EXPSPACE-complete: the hardest problems in EXPSPACE, harder than EXPTIME problems;

Nonelementary complexities are of the form 2^2^…^n, with the height of the exponent stack at least proportional to the input size in bits;

Undecidable problems do not guarantee termination: in general, it is not possible to design algorithms (procedures that guarantee termination) for undecidable problems.
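For orientation, these classes form a chain of inclusions, with the strict separations given by the time and space hierarchy theorems:

```latex
\mathrm{PTIME} \subseteq \mathrm{NP} \subseteq \mathrm{PSPACE}
\subseteq \mathrm{EXPTIME} \subseteq \mathrm{EXPSPACE},
\qquad
\mathrm{PTIME} \subsetneq \mathrm{EXPTIME},
\quad
\mathrm{PSPACE} \subsetneq \mathrm{EXPSPACE}.
```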

Copy and Paste from: Formal Verification of Timed Systems: A Survey and Perspective; Farn Wang.

Real-time Linux Summit 2019: Schedule

9:00 – 9:30  Opening Talk: Real-time Linux: what is, what is not and what is next!
Daniel Bristot de Oliveira, Red Hat.
9:30 – 10:15  Real-time Linux in Financial Markets
Adrien Mahieux, Orness.
10:15 – 10:30  Pause
10:30 – 11:15  Supporting Real-Time Hardware Acceleration on Dynamically Reconfigurable SoC FPGAs
Marco Pagani, Scuola Superiore Sant’Anna – Université de Lille.
11:15 – 12:00  Real-time usage in the BeagleBone community
Drew Fustini, BeagleBoard.org Foundation.
12:00 – 13:30  Lunch Pause
13:30 – 14:15  SCHED_DEADLINE: What is next (?)
Daniel Bristot de Oliveira & Juri Lelli, Red Hat.
14:15 – 15:00  Synthetic events and basic histograms
Steven Rostedt, VMware.
15:00 – 15:15  Pause
15:15 – 16:00  State of the PREEMPT_RT
Sebastian Andrzej Siewior, Linutronix GmbH.
16:00 – 16:45  PREEMPT RT is Upstream! Q&A Session
Thomas Gleixner, Linutronix GmbH.
16:45 – 17:00  Closing

Real-time Linux – What is, what is not and what is next.

Daniel Bristot de Oliveira, Red Hat

Description

This talk is a reflection on the current state of Real-time Linux: the kind of determinism that it is possible to obtain with Linux, and the kind of determinism that is still not possible to achieve. Knowing what is not possible is not a bad thing – rather, it opens the possibilities for the next opportunities in the development of Real-time Linux.

Bio

Daniel is a Principal Software Engineer at Red Hat, working in the real-time kernel team. He is also a researcher in the Retis Lab at the Scuola Superiore Sant’Anna (Pisa – Italy). He works in the research and development of new real-time features and runtime formal verification methods for the Linux kernel.

Real-time Linux in Financial Markets

Adrien Mahieux, Orness

Description

Provide an insight into how real-time Linux is used, and demystify some ideas about electronic finance. Introduction to finance and stock markets:

– What’s a stock exchange
– What strategies
– Current Status of the markets

Why RT is needed (5min)
– HFT is not what you think: front-running, moving orders
– Usage from the market: gateway & matching engine
– Usage from the client: fight against front-running

Challenges & Tools (10min)
– Work with newer hardware (processor, NICs, FPGA)
– Very custom hardware on x86 systems
– kthread stalls (rcu, rt_mutex), kernel bypass, interrupt silencing, and help from Red Hat kernel engineers
– Measurements: rt_test, sys_jitter, netdata, pcm.

Bio

Before the internet was a worldwide public standard, he used to organize LAN parties: ephemeral infrastructures for hundreds of players, where every available resource was scarce and optimized. Now, in this era of opulence, he helps companies in the search for parsimony through analysis and optimization of their whole stack. This is useful for both HPC (throughput) and HFT (latency).

Supporting Real-Time Hardware Acceleration on Dynamically Reconfigurable SoC FPGAs

Marco Pagani, Scuola Superiore Sant’Anna – Université de Lille

Description

SoCs that include multiple CPUs tightly coupled with an FPGA fabric, like Xilinx’s Zynq and Zynq UltraScale, are popular for developing high-performance applications like computer vision and signal processing. On these platforms, software activities can be accelerated using custom hardware accelerators. Nowadays, dynamic partial reconfiguration makes it possible to virtualize the FPGA resources and host, in time-sharing, more hardware accelerators than would be statically possible. However, without a proper scheduling policy, FPGA reconfiguration may introduce unbounded delays that are unsuitable for real-time applications like autonomous vehicles. This talk presents FRED, a Linux-based framework that enables real-time scheduling of hardware accelerators on the FPGA fabric. FRED exploits dynamic partial reconfiguration and recurrent execution to virtualize the FPGA fabric at production time in a predictable fashion.

Bio

Marco Pagani received his M.Sc. degree in Embedded Computing Systems cum Laude in 2016 jointly from Scuola Superiore Sant’Anna and the University of Pisa. He is currently pursuing the Ph.D. degree in Emerging Digital Technologies at Scuola Superiore Sant’Anna and Université de Lille. His main research interests are software support for real-time applications on heterogeneous computing platforms, real-time hardware acceleration, and real-time operating systems for embedded platforms.

Real-time usage in the BeagleBone community

Drew Fustini, BeagleBoard.org Foundation

Description

Many open-source hardware projects with real-time requirements have been produced by the BeagleBone community. These include motion control for CNC, laser cutting, and 3D printing. Other projects include autopilots, low-latency audio processing, driving large LED displays, and high-speed data acquisition. I’ll provide a look at projects like MachineKit (a LinuxCNC fork), which is migrating from Xenomai to the PREEMPT_RT kernel.

Bio

Drew Fustini is an Open Source Hardware designer at OSH Park, board member of the BeagleBoard Foundation, vice president of the Open Source Hardware Association, and maintainer of the Adafruit BeagleBone Python library

Maintaining out of tree patches over the long term

Daniel Wagner, SUSE

Description

The PREEMPT_RT patchset is the longest-existing large patchset living outside the Linux kernel. Over the years, the realtime developers have had to maintain several stable kernel versions of the patchset. This talk will present the lessons learned from this experience, including the workflow, tooling, and release management that have proven to scale over time. The workflow deals with both upstream changes and changes to the patchset itself. Now that the PREEMPT_RT patchset is about to be merged upstream, we want to share our toolset and methods with others who may be able to benefit from our experience. This talk is for people who want to maintain an external patchset with stable releases.

Bio

Daniel Wagner is a stable-rt tree maintainer and contributor to various upstream projects (preempt_rt, ConnMan, FFADO,…)

Synthetic events and basic histograms

Steven Rostedt, VMware

Description

Synthetic events, introduced in 4.17, allow passing data from one trace event to another and using it to calculate deltas between fields and timestamps. This allows users to create their own custom latency histograms or other data. In 5.0, easier access to arguments was added to kprobes. The combination allows extracting data from the kernel and using it as part of the synthetic events and histogram logic. This talk will describe what is in the kernel today, and some features coming in userspace tools that will make this easier to work with.

Bio

Steven Rostedt currently works for VMware in their Open Source Technology Center. He’s the maintainer of the stable releases for the Real-Time patch (PREEMPT_RT). He is also one of the original developers for the Real-Time patch. Steven is the main developer and maintainer for ftrace, the official tracer of the Linux kernel, as well as the userspace tools trace-cmd and kernelshark. He also develops ktest.pl (in the kernel) and make localmodconfig.

State of the PREEMPT_RT

Sebastian Andrzej Siewior, Linutronix GmbH.

Description

The RT patch has been maintained for a long time. Over the years, more and more bits and pieces were merged into the upstream kernel and removed from the RT patch. As a result, the RT queue became more and more RT-specific, and it got harder to argue that non-RT would benefit from a given change. In v5.3 upstream, the CONFIG_PREEMPT_RT option finally appeared. This talk shows the big changes in the recent releases and what is planned for the future.

Bio

Sebastian has maintained the PREEMPT_RT patch since around v3.8 and contributes to the upstream kernel by posting patches from the RT queue that benefit the kernel even without the RT patch.

[CFP] Real-Time Summit 2019 Call for Presentations

The Real-Time Summit is organized by the Linux Foundation Real-Time Linux (RTL) collaborative project. The event is intended to gather developers and users of Linux as a Real-Time Operating System. The main intent is to provide room for discussion between developers, tooling experts, and users.

The summit will take place alongside the Open Source Summit + Embedded Linux Conference Europe 2019 in Lyon, France. The summit is planned the day after the main conference, Thursday, October 31st, 2019, from 8:00 to 17:00 at the conference venue. If you are already considering your travel arrangements for the Open Source Summit + Embedded Linux Conference Europe 2019 in Lyon, France, and you have a general interest in this topic, please extend your travel by one day to be in Lyon on Thursday, 31st.

If you are interested in presenting, please submit a proposal [1] before September 14th, 2019, at 23:59 EST. Please provide a title, an abstract describing the proposed talk (900 characters maximum), a short biography (900 characters maximum), and a description of the targeted audience (900 characters maximum). Please indicate the slot length you are aiming for: the format is a single track with presentation slots 30, 45, or 60 minutes long. The presentation should use at most half of the slot time, leaving the rest of the slot reserved for discussion. The focus of this event is the discussion.

We are welcoming presentations from both end-users and developers, on topics covering, but not limited to:

  • Real-time Linux development
  • Real-time Linux evaluation
  • Real-time Linux use cases (Success and failures)
  • Real-time Linux tooling (tracing, configuration, …)
  • Real-time Linux academic work, already presented or under development, for direct feedback from the practitioners’ community.

Those can cover recently available technologies, ongoing work, and new ideas.

Important Notes for Speakers:

  • All speakers are required to adhere to the Linux Foundation events’ Code of Conduct. We also highly recommend that speakers take the Linux Foundation online Inclusive Speaker Orientation Course.
  • Avoid sales or marketing pitches and discussing unlicensed or potentially closed-source technologies when preparing your proposal; these talks are almost always rejected due to the fact that they take away from the integrity of our events, and are rarely well-received by conference attendees.
  • All accepted speakers are required to submit their slides prior to the event.

Submission must be received by 11:59 pm PST on September 14th, 2019

[1] Submission page: https://forms.gle/yQeqyrtJYezM5VRJA

Important Dates:

  • CFP Close: Saturday, September 14th, 2019, 11:59PM PST
  • Speaker notification: September 21st, 2019
  • Conference: Thursday, October 31st, 2019

Questions on submitting a proposal? Email Daniel Bristot de Oliveira <bristot@redhat.com>

Paper accepted at SEFM 2019

I had the paper “Efficient Formal Verification for the Linux Kernel” accepted at the 17th International Conference on Software Engineering and Formal Methods (SEFM 2019).

It has the following abstract:

Formal verification of the Linux kernel has been receiving increasing attention in recent years, with the development of many models, from memory subsystems to the synchronization primitives of the real-time kernel. The effort in developing formal verification methods is justified considering the large code-base, the complexity in synchronization required in a monolithic kernel and the support for multiple architectures, along with the usage of Linux on critical systems, from high-frequency trading to self-driven cars. Despite recent developments in the area, none of the proposed approaches are suitable and flexible enough to be applied in an efficient way to a running kernel. Aiming to fill such a gap, this paper proposes a formal verification approach for the Linux kernel, based on automata models. It presents a method to auto-generate verification code from an automaton, which can be integrated into a module and dynamically added into the kernel for efficient on-the-fly verification of the system, using in-kernel tracing features. Finally, a set of experiments demonstrate verification of three models, along with performance analysis of the impact of the verification, in terms of latency and throughput of the system, showing the efficiency of the approach.

Sincerely, this is the paper I most enjoyed writing. It is an easy read with good practical results. But the best thing is that, by having my approach recognized by the formal methods community, I feel more comfortable about using the word “formal”, which is a very strong word. This is one of the cherries of my Ph.D. Now it is time to work on the second cherry… and then we are done.

I am also curious to attend the conference; the list of papers is very interesting, and I need to deepen my knowledge in the area.

Real-time Micro-conference accepted at the Linux Plumbers Conference 2019

From the announcement:

We are pleased to announce that the Real-Time Microconference has been
accepted into the 2019 Linux Plumbers Conference! The PREEMPT_RT patch
set (aka “The Real-Time Patch”) was created in 2004 in the effort to
make Linux into a hard real-time designed operating system. Over the
years much of the RT patch has made it into mainline Linux, which
includes: mutexes, lockdep, high-resolution timers, Ftrace,
RCU_PREEMPT, priority inheritance, threaded interrupts and much more.
There’s just a little left to get RT fully into mainline, and the light
at the end of the tunnel is finally in view. It is expected that the RT
patch will be in mainline within a year, which changes the topics of
discussion. Once it is in Linus’s tree, a whole new set of issues must
be handled. The focus of this year’s Plumbers event will include:

 – Real-time containers
 – Rework of softirqs (Requirement for the PREEMPT-RT merge)  [2]
 – An in-kernel view of latency [1]
 – Improvements in the locking determinism [4]
 – Advances in the RCU for reducing the per-cpu workload
 – The effects of BPF in the kernel latency
 – Core scheduling and Real-time schedulers [5]
 – Maintaining the RT stable trees [3]
 – New tools to test RT kernels [6]
 – New bootup self-tests
 – New types of failures that lockdep can detect after RT is merged

Come and join us in the discussion of making the LWN prediction of RT
coming into mainline “this year” a reality!
We hope to see you there[7]!
LPC[8] will be held in Lisbon, Portugal from Monday, September 9
through Wednesday, September 11.

[1]  Continuation of last year’s “New metrics for the PREEMPT RT”
discussion: 
http://bristot.me/wp-content/uploads/2018/11/new_metrics_for_the_rt.pdf
[2] 
https://lore.kernel.org/lkml/20190228171242.32144-1-frederic@kernel.org/
[3] 
https://wiki.linuxfoundation.org/realtime/preempt_rt_versions
[4] 
https://lwn.net/Articles/767953/ & continuation of last year’s
“SCHED_DEADLINE desiderata and slightly crazy ideas.”
[5] 
https://lwn.net/Articles/780703/
[6] Continuation of last year’s discussion: “How can we catch problems that can break the PREEMPT_RT preemption model?”
http://bristot.me/wp-content/uploads/2018/11/model_checker.pdf
[7] 
https://www.linuxplumbersconf.org/event/4/page/34-accepted-microconferences#realtime
[8] 
https://linuxplumbersconf.org

Kernel Recipes 2019 Talk!

I was invited to give a talk at the Kernel Recipes conference in Paris. So, I decided to talk about formal modeling, but in an easy way. Here is the description of the talk. I hope people enjoy it.

Formal modeling made easy

Modeling parts of Linux has become a recurring topic. For instance, the memory model, the model for PREEMPT_RT synchronization, and so on.
But the term “formal model” causes panic in most developers, mainly because of the complex notations and reasoning that formal languages involve. It seems to be a very theoretical thing, far from our day-to-day reality.

Believe me. Modeling can be more practical than you might guess!

This talk will discuss the challenges and benefits of modeling, based on the experience of developing the PREEMPT_RT model. It will present a methodology for modeling Linux behavior as finite-state machines (automata), using terms that are well known to kernel developers: tracing events! It has a particular focus on how to use models for the formal verification of the Linux kernel, at runtime, with low overhead, and in many cases without even modifying the Linux kernel!