While running the model against the 4.19-rt kernel, the following impossible event chain was reported:
+ T 32019 2564.541336 [000] preempt_disable -> q8250
+ T 32019 2564.541336 [000] local_irq_enable -> q13544
+ T 32019 2564.541337 [000] preempt_enable -> q19599
+ T 32019 2564.541337 [000] write_lock -> q21325
+ T 32019 2564.541337 [000] mutex_lock -> q16217
+ T 32019 2564.541338 [000] mutex_acquired -> q21325
+ T 32019 2564.541338 [000] sched_set_state_sleepable -> q8561
+ T 32019 2564.541338 [000] local_irq_disable -> q3392
+ T 32019 2564.541338 [000] preempt_disable -> q20336
+ T 32019 2564.541339 [000] sched_set_state_runnable -> q9970
+ T 32019 2564.541339 [000] local_irq_enable -> q15342
+ T 32019 2564.541339 [000] preempt_enable -> q21325
+ T 32019 2564.541339 [000] write_acquired -> q19599
+ T 32019 2564.541340 [000] local_irq_disable -> q14441
+ T 32019 2564.541340 [000] preempt_disable -> q8250
+ T 32019 2564.541342 [000] local_irq_enable -> q13544
+ I 32019 2564.541344 [000] hw_local_irq_disable -> q18001
+ I 32019 2564.541345 [000] mutex_lock
2564.541345 event mutex_lock is not expected in state q18001
===== reseting model =====
In summary: a thread was running, then IRQs were disabled by the hardware (due to an interrupt), and a mutex lock was attempted. This operation is not permitted.
The raw trace shows the following events:
bash 32019 [000] 2564.541337: lock:rw_sem_request: write 0xffff9e992418fa90 unlink_file_vma+0x2d
bash 32019 [000] 2564.541337: lock:rt_mutex_request: 0xffff9e992418fa98 __down_write_common+0x40
bash 32019 [000] 2564.541338: lock:rt_mutex_acquired: 0xffff9e992418fa98 __down_write_common+0x40
bash 32019 [000] 2564.541338: sched:sched_set_state: sleepable
bash 32019 [000] 2564.541338: preemptirq:irq_disable: caller=__down_write_common+0x8b parent=(nil)
bash 32019 [000] 2564.541338: preemptirq:preempt_disable: caller=__down_write_common+0x95 parent=__down_write_common+0x95
bash 32019 [000] 2564.541339: sched:sched_set_state: runnable
bash 32019 [000] 2564.541339: preemptirq:irq_enable: caller=__down_write_common+0x337 parent=(nil)
bash 32019 [000] 2564.541339: preemptirq:preempt_enable: caller=__down_write_common+0x34b parent=__down_write_common+0x34b
bash 32019 [000] 2564.541339: lock:rw_sem_acquired: write 0xffff9e992418fa90 unlink_file_vma+0x2d
bash 32019 [000] 2564.541340: lock:rw_sem_release: write 0xffff9e992418fa90 free_pgtables+0xbe
bash 32019 [000] 2564.541340: preemptirq:irq_disable: caller=__up_write+0x2c parent=(nil)
bash 32019 [000] 2564.541340: preemptirq:preempt_disable: caller=__up_write+0x36 parent=__up_write+0x36
bash 32019 [000] 2564.541342: preemptirq:irq_enable: caller=__up_write_unlock+0x75 parent=(nil)
bash 32019 [000] 2564.541344: preemptirq:irq_disable: caller=trace_hardirqs_off_thunk+0x1a parent=interrupt_entry+0xda
bash 32019 [000] 2564.541344: irq_vectors:local_timer_entry: vector=236
bash 32019 [000] 2564.541345: lock:rt_mutex_request: pendingb_lock+0x0 queue_work_on+0x41
And voilà! A real-time mutex is being taken in the timer IRQ.
This violates two specifications: one stating that it is not possible to take a mutex with interrupts disabled, and another stating that IRQ handlers do not take mutexes, as in the figure below:
It is then possible to observe the code path that takes the rt_spin_lock (in practice, a mutex):
 0)   ==========> |
 0)               |  smp_apic_timer_interrupt() {
 0)               |    __irq_enter_early_tp() {
 0)               |      rcu_irq_enter() {
 0)   0.690 us    |        rcu_nmi_enter();
 0)   2.032 us    |      }
 0)   3.361 us    |    }
 0)               |    irq_enter() {
 0)               |      rcu_irq_enter() {
 0)   0.635 us    |        rcu_nmi_enter();
 0)   1.792 us    |      }
 0)               |      tick_irq_enter() {
 0)   0.792 us    |        tick_check_oneshot_broadcast_this_cpu();
 0)   0.897 us    |        ktime_get();
 0)               |        update_ts_time_stats.constprop.32() {
 0)   0.594 us    |          nr_iowait_cpu();
 0)   1.926 us    |        }
 0)               |        tick_do_update_jiffies64.part.17() {
 0)   0.668 us    |          preempt_count_add();
 0)   0.594 us    |          preempt_count_add();
 0)               |          do_timer() {
 0)   0.676 us    |            calc_global_load();
 0)   1.808 us    |          }
 0)   0.638 us    |          preempt_count_sub();
 0)   0.551 us    |          preempt_count_sub();
 0)               |          update_wall_time() {
 0)               |            timekeeping_advance() {
 0)   0.567 us    |              preempt_count_add();
 0)   0.554 us    |              ntp_tick_length();
 0)   0.552 us    |              ntp_tick_length();
 0)   0.560 us    |              preempt_count_add();
 0)               |              timekeeping_update() {
 0)   0.552 us    |                ntp_get_next_leap();
 0)   0.769 us    |                update_vsyscall();
 0)               |                raw_notifier_call_chain() {
 0)   0.554 us    |                  notifier_call_chain();
 0)   1.621 us    |                }
 0)   0.565 us    |                update_fast_timekeeper();
 0)   0.576 us    |                update_fast_timekeeper();
 0)   7.533 us    |              }
 0)   0.614 us    |              preempt_count_sub();
 0)   0.608 us    |              preempt_count_sub();
 0) + 16.109 us   |            }
 0) + 17.268 us   |          }
 0) + 25.610 us   |        }
 0) + 32.431 us   |      }
 0)               |      irqtime_account_irq() {
 0)   0.547 us    |        in_serving_softirq();
 0)   2.045 us    |      }
 0)   0.561 us    |      preempt_count_add();
 0) + 39.845 us   |    }
 0)               |    hrtimer_interrupt() {
 0)   0.626 us    |      preempt_count_add();
 0)   0.786 us    |      ktime_get_update_offsets_now();
 0)               |      __hrtimer_run_queues.constprop.28() {
 0)   1.021 us    |        __remove_hrtimer();
 0)   0.574 us    |        preempt_count_sub();
 0)               |        watchdog_timer_fn() {
 0)               |          completion_done() {
 0)   0.541 us    |            preempt_count_add();
 0)   0.554 us    |            preempt_count_sub();
 0)   2.946 us    |          }
 0)               |          queue_work_on() {
 0)   0.597 us    |            migrate_disable();
 0)               |            rt_spin_lock() {
 0)   0.554 us    |              migrate_disable();
 0)               |              /* rt_mutex_request: pendingb_lock+0x0/0x40 return_to_handler+0x0/0x36 */
 0)               |              /* rt_mutex_acquired: pendingb_lock+0x0/0x40 return_to_handler+0x0/0x36 */
 0)   3.350 us    |            }
 0)               |            __queue_work() {
 0)   0.542 us    |              __rcu_read_lock();
 0)   0.593 us    |              get_work_pool();
 0)               |              rt_spin_lock() {
 0)   0.562 us    |                migrate_disable();
 0)               |                /* rt_mutex_request: cpu_worker_pools+0x0/0x680 return_to_handler+0x0/0x36 */
 0)               |                /* rt_mutex_acquired: cpu_worker_pools+0x0/0x680 return_to_handler+0x0/0x36 */
 0)   2.446 us    |              }
 0)               |              insert_work() {
 0)   0.563 us    |                get_pwq.isra.32();
 0)               |                wake_up_worker() {
 0)   0.547 us    |                  preempt_count_add();
 0)               |                  wake_up_process() {
 0)               |                    try_to_wake_up() {
 0)   0.560 us    |                      preempt_count_add();
 0)   0.558 us    |                      preempt_count_add();
 0)               |                      update_rq_clock() {
 0)   0.590 us    |                        kvm_steal_clock();
 0)   1.820 us    |                      }
 0)               |                      activate_task() {
 0)               |                        enqueue_task_fair() {
 0)               |                          enqueue_entity() {
 0)   0.580 us    |                            update_curr();
 0)   0.626 us    |                            update_cfs_group();
 0)   0.587 us    |                            __enqueue_entity();
 0)   4.352 us    |                          }
 0)   0.553 us    |                          hrtick_update();
 0)   6.598 us    |                        }
 0)   7.895 us    |                      }
 0)               |                      check_preempt_curr() {
 0)   0.629 us    |                        resched_curr();
 0)   2.035 us    |                      }
 0)   0.555 us    |                      preempt_count_sub();
 0)   0.548 us    |                      preempt_count_sub();
 0) + 19.316 us   |                    }
 0) + 20.400 us   |                  }
 0)   0.567 us    |                  preempt_count_sub();
 0) + 23.781 us   |                }
 0) + 26.107 us   |              }
 0)               |              rt_spin_unlock() {
 0)               |                /* rt_mutex_release: cpu_worker_pools+0x0/0x680 return_to_handler+0x0/0x36 */
 0)   0.580 us    |                migrate_enable();
 0)   2.343 us    |              }
 0)   0.576 us    |              __rcu_read_unlock();
 0) + 36.708 us   |            }
 0)               |            queue_delayed_work_on.part.49() {
 0)               |              rt_spin_unlock() {
 0)               |                /* rt_mutex_release: pendingb_lock+0x0/0x40 return_to_handler+0x0/0x36 */
 0)   0.547 us    |                migrate_enable();
 0)   2.034 us    |              }
 0)   0.968 us    |              migrate_enable();
 0)   4.576 us    |            }
 0) + 48.171 us   |          }
 0)   0.704 us    |          ktime_get();
 0)   0.552 us    |          hrtimer_forward();
 0)   0.622 us    |          __touch_watchdog();
 0) + 57.152 us   |        }
 0)   0.543 us    |        preempt_count_add();
 0)   0.771 us    |        enqueue_hrtimer();
 0) + 63.902 us   |      }
 0)               |      __hrtimer_get_next_event() {
 0)   1.038 us    |        __hrtimer_next_event_base();
 0)   0.673 us    |        __hrtimer_next_event_base();
 0)   3.437 us    |      }
 0)   0.558 us    |      preempt_count_sub();
 0)               |      tick_program_event() {
 0)               |        clockevents_program_event() {
 0)   0.658 us    |          ktime_get();
 0)   8.836 us    |          lapic_next_deadline();
 0) + 11.393 us   |        }
 0) + 12.489 us   |      }
 0) + 86.907 us   |    }
 0)               |    irq_exit() {
 0)   0.722 us    |      irqtime_account_irq();
 0)   0.602 us    |      preempt_count_sub();
 0)   0.572 us    |      idle_cpu();
 0)               |      rcu_irq_exit() {
 0)   0.588 us    |        rcu_nmi_exit();
 0)   1.648 us    |      }
 0)   6.500 us    |    }
 0)               |    __irq_exit_late_tp() {
 0)               |      rcu_irq_exit() {
 0)   0.572 us    |        rcu_nmi_exit();
 0)   1.624 us    |      }
 0)   2.690 us    |    }
 0) ! 146.366 us  |  }
 0)   <========== |
On a single-core (!CONFIG_SMP) kernel, the following chain of events can take place:
smp_apic_timer_interrupt() {
  hrtimer_interrupt() {
    __hrtimer_run_queues() {
      watchdog_timer_fn() {
        stop_one_cpu_nowait() {
#ifdef !CONFIG_SMP
          schedule_work() {
            queue_work() {
              queue_work_on() {    /* phew, long, ah!?.... */
                local_lock_irqsave() {
                  __local_lock_irqsave() {
                    __local_lock_irq() {
                      spin_lock_irqsave() {
                        rt_spin_lock()... /* from here on you already know... */
                        /* a lot of } */
And this can cause a "scheduling while atomic" bug. Tracking it down, the commit that introduced stop_one_cpu_nowait() to this path was:
commit 9cf57731b63e37ed995b46690adc604891a9a28f
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Thu Jun 7 10:52:03 2018 +0200

    watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work

    Oleg suggested to replace the "watchdog/%u" threads with
    cpu_stop_work. That removes one thread per CPU while at the same time
    fixes softlockup vs SCHED_DEADLINE.

    But more importantly, it does away with the single
    smpboot_update_cpumask_percpu_thread() user, which allows
    cleanups/shrinkage of the smpboot interface.

    Suggested-by: Oleg Nesterov <oleg@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
later modified by:
commit be45bf5395e0886a93fc816bbe41a008ec2e42e2
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Jul 13 12:42:08 2018 +0200

    watchdog/softlockup: Fix cpu_stop_queue_work() double-queue bug

    When scheduling is delayed for longer than the softlockup interrupt
    period it is possible to double-queue the cpu_stop_work, causing list
    corruption.

    Cure this by adding a completion to track the cpu_stop_work's
    progress.

    Reported-by: kernel test robot <lkp@intel.com>
    Tested-by: Rong Chen <rong.a.chen@intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Fixes: 9cf57731b63e ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work")
    Link: http://lkml.kernel.org/r/20180713104208.GW2494@hirez.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
A bug report was sent, and you can find it here:
https://www.spinics.net/lists/linux-rt-users/msg20376.html