Real-time micro-conference proposal for Linux Plumbers 2019

Last year we had a very fun and productive real-time micro-conference at the Linux Plumbers Conference, so the idea is to repeat it this year!

Are you interested in participating? Do you have any suggestion of topic? Let us know!

This is the current proposal of micro-conference, we can still edit it and add your thoughts.

Proposal

Since 2004 a project has improved the Real-time and low-latency features for Linux. This project has become know as PREEMPT_RT, formally the real-time patch. Over the past decade, many parts of the PREEMPT RT became part of the official Linux code base. Examples of what came from PREEMPT_RT include: Real-time mutexes, high-resolution timers, lockdep, ftrace, RT scheduling, SCHED_DEADLINE, RCU_PREEMPT, generic interrupts, priority inheritance futexes, threaded interrupt handlers and more. The number of patches that needs integration has been reduced on the last years, and the pieces left are now mature enough to make its way into mainline Linux. This year could possibly be the year PREEMPT_RT is merged ™!

In the final lap of this race, the last patches are on the way to be merged, but there are still some pieces missing. When the merge occurs, the preempt-rt will start to follow a new pace: the Linus one. So, it is possible to raise the following discussions:

  1.  The status of the merge, and how can we resolve the last issues that block the merge;
  2. How can we improve the testing of the -rt, to follow the problems raised as Linus tree advances;
  3. What’s next?

Possible topics:

  • Status of the PREEMPT_RT Merge
  • Merge – what is missing and who can help?
  • How do we teach the rest of the kernel developers how not to break PREEMPT_RT?
  • Stable maintainers tools discussion & improvements.
  • Interrupt threads are RT and are not protected by the RT Throttling. How can we prevent interrupt thread starvation from a rogue RT task?
  • Improvements on full CPU isolation
  • Newer methods like proxy execution, hierarchical scheduler?
  • What tools can we add into tools/ that other kernel developers can use to test and learn about PREEMMPT_RT?
  • What tests can we add into tools/testing/selftests?
  • New tools for timing regression test, e.g. locking, overheads…
  • What kernel boot self-tests can be added?
  • Discuss various types of failures that can happen with PREEMPT_RT that normally would not happen in the vanilla kernel, e.g, with lockdep, preemption model.

I will suggest the continuation of the discussions of the topics I presented last year, based in the results of the work I did this year (spoiler: I am doing model verification in the kernel, and it is very efficient!).

Paul already told that he could talk about ongoing work to make RCU’s forward progress be more robust in overloaded cloud deployments. There will be some connection to real-time response involving rcu_poll and the infamous RCU kthread priority!

The continuation of the discussion of topics from last year’s micro-conference, including the development done during this (almost) year, are also welcome!

Early context tracking patch set: fixing perf & ftrace losing events

Some time ago, while using perf to check the automaton model, I noticed that perf was losing events. The same was reproducible with ftrace.

Steve pointed to a problem in the identification of the context execution used by the recursion control.

Currently, recursion control uses the preempt_counter to identify the current context. The NMI/HARD/SOFT IRQ counters are set in the preempt_counter in the irq_enter/exit functions.

In a trace, they are set like this:

 0)   ==========> |
 0)               |  do_IRQ() {		/* First C function */
 0)               |    irq_enter() {
 0)               |      		/* set the IRQ context. */
 0)   1.081 us    |    }
 0)               |    handle_irq() {
 0)               |     		/* IRQ handling code */
 0) + 10.290 us   |    }
 0)               |    irq_exit() {
 0)               |      		/* unset the IRQ context. */
 0)   6.657 us    |    }
 0) + 18.995 us   |  }
 0)   <========== |

As one can see, functions (and events) that take place before the set and after unset the preempt_counter are identified in the wrong context, causing the miss interpretation that recursion is taking place. When this happens, events are dropped.

To resolve this problem, the set/unset of the IRQ/NMI context needs to be done before the execution of the first C execution, and after its return. By doing so, and using this method to identify the context in the trace recursion protection, no more events are lost.

A possible solution is to use a per-cpu variable set and unset in the entry point of NMI/IRQs, before calling the C handler.

This possible solution is presented in this patch series as a proof of concept, for x86_64. Let’s see what kind of comments we will receive!

New tracepoints for External IRQ and NMIs

The irq_vectors class of tracepoints adds two events for each interrupt: One at the entry and one at the return of each interrupt handler. They are very useful in the analysis of latency, as they point to the occurrence of an interrupt, and how long did it take to run.

However, there are two cases of interrupt which these tracepoints were not present: NMIs and regular device interrupt. So I suggested two new tracepoints, for these cases.

The output of these new tracepoints looks like this:

 idle-0     [000] d.h.   102.890935: external_interrupt_entry: vector=35
 idle-0     [000] d.h.   102.890960: external_interrupt_exit: vector=35
 idle-0     [000] d.Z.   179.594315: nmi_entry: vector=2
 idle-0     [000] d.Z.   179.594396: nmi_exit: vector=2

The discussion of the patch set can be found here:

https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1969152.html

V5: x86/jump_label: Bound IPIs sent when updating a static key

Today I’ve sent the v5 of the x86/jump_label: Bound IPIs sent when updating a static key patch set. The list of changes from the previous version is:

  • Renamed jump_label_can_update_check() to jump_label_can_update() (Borislav Petkov)
  • Switch jump_label_can_update() return values to bool (Borislav Petkov)
  • Accept the fact that some lines will be > 80 characters (Borislav Petkov)
  • Local variables are now in the inverted Christmas three order (Borislav Petkov)
  • Removed superfluous helpers. (Borislav Petkov)
  • Renamed text_to_poke to text_patch_loc, and handler to detour (Borislav Petkov & Steven Rostedt)
  • Re-applied the suggestion of not using BUG() from steven (Masami Hiramatsu)
  • arch_jump_label_transform_queue() now returns 0 on success and
    -ENOSPC/EINVAL on errors (Masami Hiramatsu)

I think we are not far from an acceptance. The code introduces a lot of optimizations suggested by Masami, Steven, and Borislav! But it might have one or two rounds of small fixes/polishment.

Here is the link for the submission:

https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1968209.html

Another bug found with the model

While running the model against the 4.19-rt kernel, the following non-possible event chain was faced:

+ T   32019    2564.541336    [000]              preempt_disable ->    q8250 
+ T   32019    2564.541336    [000]             local_irq_enable ->   q13544 
+ T   32019    2564.541337    [000]               preempt_enable ->   q19599 
+ T   32019    2564.541337    [000]                   write_lock ->   q21325 
+ T   32019    2564.541337    [000]                   mutex_lock ->   q16217 
+ T   32019    2564.541338    [000]               mutex_acquired ->   q21325 
+ T   32019    2564.541338    [000]    sched_set_state_sleepable ->    q8561 
+ T   32019    2564.541338    [000]            local_irq_disable ->    q3392 
+ T   32019    2564.541338    [000]              preempt_disable ->   q20336 
+ T   32019    2564.541339    [000]     sched_set_state_runnable ->    q9970 
+ T   32019    2564.541339    [000]             local_irq_enable ->   q15342 
+ T   32019    2564.541339    [000]               preempt_enable ->   q21325 
+ T   32019    2564.541339    [000]               write_acquired ->   q19599 
+ T   32019    2564.541340    [000]            local_irq_disable ->   q14441 
+ T   32019    2564.541340    [000]              preempt_disable ->    q8250 
+ T   32019    2564.541342    [000]             local_irq_enable ->   q13544 
+ I   32019    2564.541344    [000]         hw_local_irq_disable ->   q18001 
+ I   32019    2564.541345    [000]                   mutex_lock
        2564.541345 event mutex_lock is not expected in state q18001
===== reseting model =====

In summary, a thread was running, then IRQs are disabled by the hardware (due to an interrupt), and a mutex lock was tried. This operation is not permitted.

The raw trace of these events shows the following events:

bash 32019 [000]  2564.541337:                   lock:rw_sem_request: write 0xffff9e992418fa90F unlink_file_vma+0x2d
bash 32019 [000]  2564.541337:                 lock:rt_mutex_request: 0xffff9e992418fa98F __down_write_common+0x40
bash 32019 [000]  2564.541338:                lock:rt_mutex_acquired: 0xffff9e992418fa98F __down_write_common+0x40
bash 32019 [000]  2564.541338:                 sched:sched_set_state: sleepable
bash 32019 [000]  2564.541338:                preemptirq:irq_disable: caller=__down_write_common+0x8b parent=(nil)F
bash 32019 [000]  2564.541338:            preemptirq:preempt_disable: caller=__down_write_common+0x95 parent=__down_write_common+0x95
bash 32019 [000]  2564.541339:                 sched:sched_set_state: runnable
bash 32019 [000]  2564.541339:                 preemptirq:irq_enable: caller=__down_write_common+0x337 parent=(nil)F
bash 32019 [000]  2564.541339:             preemptirq:preempt_enable: caller=__down_write_common+0x34b parent=__down_write_common+0x34b
bash 32019 [000]  2564.541339:                  lock:rw_sem_acquired: write 0xffff9e992418fa90F unlink_file_vma+0x2d
bash 32019 [000]  2564.541340:                   lock:rw_sem_release: write 0xffff9e992418fa90F free_pgtables+0xbe
bash 32019 [000]  2564.541340:                preemptirq:irq_disable: caller=__up_write+0x2c parent=(nil)F
bash 32019 [000]  2564.541340:            preemptirq:preempt_disable: caller=__up_write+0x36 parent=__up_write+0x36
bash 32019 [000]  2564.541342:                 preemptirq:irq_enable: caller=__up_write_unlock+0x75 parent=(nil)F
bash 32019 [000]  2564.541344:                preemptirq:irq_disable: caller=trace_hardirqs_off_thunk+0x1a parent=interrupt_entry+0xda
bash 32019 [000]  2564.541344:         irq_vectors:local_timer_entry: vector=236
bash 32019 [000]  2564.541345:                 lock:rt_mutex_request: pendingb_lock+0x0 queue_work_on+0x41

And voilà! a real-time mutex is being taken in the timer IRQ.

This violates two specifications, that specifies that it is not possible to take mutexes in with the interrupts disabled and that IRQ handlers do not take mutexes, as in Figure bellow:


Mutex cannot be taken with interrupts disabled.

Operations that are disabled in the interrupt handlers.


It is possible then to observe the code path that takes the rt_spin_lock (a mutex, in the practice):

 0)   ==========> |
 0)               |  smp_apic_timer_interrupt() {
 0)               |    __irq_enter_early_tp() {
 0)               |      rcu_irq_enter() {
 0)   0.690 us    |        rcu_nmi_enter();
 0)   2.032 us    |      }
 0)   3.361 us    |    }
 0)               |    irq_enter() {
 0)               |      rcu_irq_enter() {
 0)   0.635 us    |        rcu_nmi_enter();
 0)   1.792 us    |      }
 0)               |      tick_irq_enter() {
 0)   0.792 us    |        tick_check_oneshot_broadcast_this_cpu();
 0)   0.897 us    |        ktime_get();
 0)               |        update_ts_time_stats.constprop.32() {
 0)   0.594 us    |          nr_iowait_cpu();
 0)   1.926 us    |        }
 0)               |        tick_do_update_jiffies64.part.17() {
 0)   0.668 us    |          preempt_count_add();
 0)   0.594 us    |          preempt_count_add();
 0)               |          do_timer() {
 0)   0.676 us    |            calc_global_load();
 0)   1.808 us    |          }
 0)   0.638 us    |          preempt_count_sub();
 0)   0.551 us    |          preempt_count_sub();
 0)               |          update_wall_time() {
 0)               |            timekeeping_advance() {
 0)   0.567 us    |              preempt_count_add();
 0)   0.554 us    |              ntp_tick_length();
 0)   0.552 us    |              ntp_tick_length();
 0)   0.560 us    |              preempt_count_add();
 0)               |              timekeeping_update() {
 0)   0.552 us    |                ntp_get_next_leap();
 0)   0.769 us    |                update_vsyscall();
 0)               |                raw_notifier_call_chain() {
 0)   0.554 us    |                  notifier_call_chain();
 0)   1.621 us    |                }
 0)   0.565 us    |                update_fast_timekeeper();
 0)   0.576 us    |                update_fast_timekeeper();
 0)   7.533 us    |              }
 0)   0.614 us    |              preempt_count_sub();
 0)   0.608 us    |              preempt_count_sub();
 0) + 16.109 us   |            }
 0) + 17.268 us   |          }
 0) + 25.610 us   |        }
 0) + 32.431 us   |      }
 0)               |      irqtime_account_irq() {
 0)   0.547 us    |        in_serving_softirq();
 0)   2.045 us    |      }
 0)   0.561 us    |      preempt_count_add();
 0) + 39.845 us   |    }
 0)               |    hrtimer_interrupt() {
 0)   0.626 us    |      preempt_count_add();
 0)   0.786 us    |      ktime_get_update_offsets_now();
 0)               |      __hrtimer_run_queues.constprop.28() {
 0)   1.021 us    |        __remove_hrtimer();
 0)   0.574 us    |        preempt_count_sub();
 0)               |        watchdog_timer_fn() {
 0)               |          completion_done() {
 0)   0.541 us    |            preempt_count_add();
 0)   0.554 us    |            preempt_count_sub();
 0)   2.946 us    |          }
 0)               |          queue_work_on() {
 0)   0.597 us    |            migrate_disable();
 0)               |            rt_spin_lock() {
 0)   0.554 us    |              migrate_disable();
 0)               |              /* rt_mutex_request: pendingb_lock+0x0/0x40 return_to_handler+0x0/0x36 */
 0)               |              /* rt_mutex_acquired: pendingb_lock+0x0/0x40 return_to_handler+0x0/0x36 */
 0)   3.350 us    |            }
 0)               |            __queue_work() {
 0)   0.542 us    |              __rcu_read_lock();
 0)   0.593 us    |              get_work_pool();
 0)               |              rt_spin_lock() {
 0)   0.562 us    |                migrate_disable();
 0)               |                /* rt_mutex_request: cpu_worker_pools+0x0/0x680 return_to_handler+0x0/0x36 */
 0)               |                /* rt_mutex_acquired: cpu_worker_pools+0x0/0x680 return_to_handler+0x0/0x36 */
 0)   2.446 us    |              }
 0)               |              insert_work() {
 0)   0.563 us    |                get_pwq.isra.32();
 0)               |                wake_up_worker() {
 0)   0.547 us    |                  preempt_count_add();
 0)               |                  wake_up_process() {
 0)               |                    try_to_wake_up() {
 0)   0.560 us    |                      preempt_count_add();
 0)   0.558 us    |                      preempt_count_add();
 0)               |                      update_rq_clock() {
 0)   0.590 us    |                        kvm_steal_clock();
 0)   1.820 us    |                      }
 0)               |                      activate_task() {
 0)               |                        enqueue_task_fair() {
 0)               |                          enqueue_entity() {
 0)   0.580 us    |                            update_curr();
 0)   0.626 us    |                            update_cfs_group();
 0)   0.587 us    |                            __enqueue_entity();
 0)   4.352 us    |                          }
 0)   0.553 us    |                          hrtick_update();
 0)   6.598 us    |                        }
 0)   7.895 us    |                      }
 0)               |                      check_preempt_curr() {
 0)   0.629 us    |                        resched_curr();
 0)   2.035 us    |                      }
 0)   0.555 us    |                      preempt_count_sub();
 0)   0.548 us    |                      preempt_count_sub();
 0) + 19.316 us   |                    }
 0) + 20.400 us   |                  }
 0)   0.567 us    |                  preempt_count_sub();
 0) + 23.781 us   |                }
 0) + 26.107 us   |              }
 0)               |              rt_spin_unlock() {
 0)               |                /* rt_mutex_release: cpu_worker_pools+0x0/0x680 return_to_handler+0x0/0x36 */
 0)   0.580 us    |                migrate_enable();
 0)   2.343 us    |              }
 0)   0.576 us    |              __rcu_read_unlock();
 0) + 36.708 us   |            }
 0)               |            queue_delayed_work_on.part.49() {
 0)               |              rt_spin_unlock() {
 0)               |                /* rt_mutex_release: pendingb_lock+0x0/0x40 return_to_handler+0x0/0x36 */
 0)   0.547 us    |                migrate_enable();
 0)   2.034 us    |              }
 0)   0.968 us    |              migrate_enable();
 0)   4.576 us    |            }
 0) + 48.171 us   |          }
 0)   0.704 us    |          ktime_get();
 0)   0.552 us    |          hrtimer_forward();
 0)   0.622 us    |          __touch_watchdog();
 0) + 57.152 us   |        }
 0)   0.543 us    |        preempt_count_add();
 0)   0.771 us    |        enqueue_hrtimer();
 0) + 63.902 us   |      }
 0)               |      __hrtimer_get_next_event() {
 0)   1.038 us    |        __hrtimer_next_event_base();
 0)   0.673 us    |        __hrtimer_next_event_base();
 0)   3.437 us    |      }
 0)   0.558 us    |      preempt_count_sub();
 0)               |      tick_program_event() {
 0)               |        clockevents_program_event() {
 0)   0.658 us    |          ktime_get();
 0)   8.836 us    |          lapic_next_deadline();
 0) + 11.393 us   |        }
 0) + 12.489 us   |      }
 0) + 86.907 us   |    }
 0)               |    irq_exit() {
 0)   0.722 us    |      irqtime_account_irq();
 0)   0.602 us    |      preempt_count_sub();
 0)   0.572 us    |      idle_cpu();
 0)               |      rcu_irq_exit() {
 0)   0.588 us    |        rcu_nmi_exit();
 0)   1.648 us    |      }
 0)   6.500 us    |    }
 0)               |    __irq_exit_late_tp() {
 0)               |      rcu_irq_exit() {
 0)   0.572 us    |        rcu_nmi_exit();
 0)   1.624 us    |      }
 0)   2.690 us    |    }
 0) ! 146.366 us  |  }
 0)   <========== |

In the single core, the following chain of events can take place:

smp_apic_timer_interrupt(){
	hrtimer_interrupt() {
		__hrtimer_run_queues() {
			watchdog_timer_fn() {
				stop_one_cpu_nowait() {
					#ifdef !CONFIG_SMP
						schedule_work() {
							queue_work() {
								queue_work_on() {
								/* phew, long, ah!?.... */	
									local_lock_irqsave() {
										__local_lock_irqsave() {
											__local_lock_irq() {
												spin_lock_irqsave() {
													rt_spin_lock()...
														/* from here on you already know...*/
														/* a lot of } */

And this can cause a schedling while in atomic. Tracking down, the commit that introduced the stop_one_cpu_nowait() to the path was:

commit 9cf57731b63e37ed995b46690adc604891a9a28f
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Thu Jun 7 10:52:03 2018 +0200

    watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work
    
    Oleg suggested to replace the "watchdog/%u" threads with
    cpu_stop_work. That removes one thread per CPU while at the same time
    fixes softlockup vs SCHED_DEADLINE.
    
    But more importantly, it does away with the single
    smpboot_update_cpumask_percpu_thread() user, which allows
    cleanups/shrinkage of the smpboot interface.
    
    Suggested-by: Oleg Nesterov <oleg@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
</mingo@kernel.org></tglx@linutronix.de></peterz@infradead.org></torvalds@linux-foundation.org></peterz@infradead.org></oleg@redhat.com></peterz@infradead.org>

later modified by:

commit be45bf5395e0886a93fc816bbe41a008ec2e42e2
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Jul 13 12:42:08 2018 +0200

    watchdog/softlockup: Fix cpu_stop_queue_work() double-queue bug
    
    When scheduling is delayed for longer than the softlockup interrupt
    period it is possible to double-queue the cpu_stop_work, causing list
    corruption.
    
    Cure this by adding a completion to track the cpu_stop_work's
    progress.
    
    Reported-by: kernel test robot <lkp@intel.com>
    Tested-by: Rong Chen <rong.a.chen@intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Fixes: 9cf57731b63e ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work")
    Link: http://lkml.kernel.org/r/20180713104208.GW2494@hirez.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
</mingo@kernel.org></tglx@linutronix.de></peterz@infradead.org></torvalds@linux-foundation.org></peterz@infradead.org></rong.a.chen@intel.com></lkp@intel.com></peterz@infradead.org>

The BUG report was made, and you can find it here:
https://www.spinics.net/lists/linux-rt-users/msg20376.html

ISORC 2019 Material

Here is the companion material for the paper: Untangling the Intricacies of Thread Synchronization in the PREEMPT_RT Linux Kernel by Daniel Bristot de Oliveira, Rômulo Silva de Oliveira, and Tommaso Cucinotta, to be presented at the ISORC 2019.

NOTE: The latest version of the model and patches can be found here: http://bristot.me/linux-task-model/

The model was developed using the Supremica tool. You can find more information about Supremica here.

The source code of the model can be found here. It can be opened with Supremica. From Supremica it is possible to visualize, edit, analyze, simulate and export the model.

In the Linux part, you can find the patch with perf task_model and tracepoints here. It applies in the version v4.14.15-rt13 of the development tree of the PREEMPT RT.

It is possible to use a virtual machine to run the experiments. Here is the definition of a QEMU/KVM virtual machine and the kernel config file that works with this VM

ERRATA: Prof. Martin Fabian, found that despite the minimized model was “non-blocking” the non-minimized was not. The problem was caused by two missing events (sched_switch_in, sched_switch_in_o) in the state any_thread_running in the specification s19. This difference causes a change in the number of states from 13906 to 13786, and in the transitions from 31708 to 31221. This problem was already resolved, in a version earlier than the submission of the paper. But I (Daniel) forgot to update Table II and the reference model. This error does not change any of the results. The link above points to the correct model. The broken model is here.

Linux Plumbers 2018 report

It was a success!

This year I ended up making five talks! All related to real-time Linux, as expected. As feedback, I think that the most important are:

– The idea of using automata model for explaining Linux’s dynamics and, mainly, to verify the behavior of the system was very welcome by the community. They want even more: To extend the idea to other subsystems of the kernel.

– The semi-partitioned scheduler is a wish for Linux!

– We need new metrics for the RT, and we need to integrate this with probabilistic execution time analysis. I was already planning to do that, but now I am even more motivated!

Deliver at least part of these ideas is my main motivation for the next year, and I hope to talk about them at the Linux Plumbers 2019!

Here are the slides of my presentations:

Mind the gap between real-time Linux and real-time theory

SCHED_DEADLINE desiderata and slightly crazy ideas

How can we catch problems that can break the PREEMPT_RT preemption model?

Beyond the latency: New metrics for the real-time kernel

The talks were recorded, I will post links here as soon as they post it!

Real-time Linux Summit 2018 – My slides

Hi!

Here are the slides of my presentation in this year’s Real-time Linux Summit.

The title is “Mind the gap: between real-time Linux and real-time theory (Part I).” It is about the creation of a formal model for tasks on Linux, in the intersection of the viewpoint of real-time Linux developers and real-time researchers.

I received excellent feedback from the Linux community. There is interest in applying my method for the verification of the PREEMPT-RT kernel, even for areas that I was not expecting, like static code analysis.

For a computer scientist, seeing practical results of applying a formal method (automata) to a real and complex system is the proof that those long hours studying theory are not in vain :-).