Another bug found with the model

While running the model against the 4.19-rt kernel, the following impossible event chain was observed:

+ T   32019    2564.541336    [000]              preempt_disable ->    q8250 
+ T   32019    2564.541336    [000]             local_irq_enable ->   q13544 
+ T   32019    2564.541337    [000]               preempt_enable ->   q19599 
+ T   32019    2564.541337    [000]                   write_lock ->   q21325 
+ T   32019    2564.541337    [000]                   mutex_lock ->   q16217 
+ T   32019    2564.541338    [000]               mutex_acquired ->   q21325 
+ T   32019    2564.541338    [000]    sched_set_state_sleepable ->    q8561 
+ T   32019    2564.541338    [000]            local_irq_disable ->    q3392 
+ T   32019    2564.541338    [000]              preempt_disable ->   q20336 
+ T   32019    2564.541339    [000]     sched_set_state_runnable ->    q9970 
+ T   32019    2564.541339    [000]             local_irq_enable ->   q15342 
+ T   32019    2564.541339    [000]               preempt_enable ->   q21325 
+ T   32019    2564.541339    [000]               write_acquired ->   q19599 
+ T   32019    2564.541340    [000]            local_irq_disable ->   q14441 
+ T   32019    2564.541340    [000]              preempt_disable ->    q8250 
+ T   32019    2564.541342    [000]             local_irq_enable ->   q13544 
+ I   32019    2564.541344    [000]         hw_local_irq_disable ->   q18001 
+ I   32019    2564.541345    [000]                   mutex_lock
        2564.541345 event mutex_lock is not expected in state q18001
===== reseting model =====

In summary, a thread was running, then IRQs are disabled by the hardware (due to an interrupt), and a mutex lock was tried. This operation is not permitted.

The raw trace shows the following events:

bash 32019 [000]  2564.541337:                   lock:rw_sem_request: write 0xffff9e992418fa90F unlink_file_vma+0x2d
bash 32019 [000]  2564.541337:                 lock:rt_mutex_request: 0xffff9e992418fa98F __down_write_common+0x40
bash 32019 [000]  2564.541338:                lock:rt_mutex_acquired: 0xffff9e992418fa98F __down_write_common+0x40
bash 32019 [000]  2564.541338:                 sched:sched_set_state: sleepable
bash 32019 [000]  2564.541338:                preemptirq:irq_disable: caller=__down_write_common+0x8b parent=(nil)F
bash 32019 [000]  2564.541338:            preemptirq:preempt_disable: caller=__down_write_common+0x95 parent=__down_write_common+0x95
bash 32019 [000]  2564.541339:                 sched:sched_set_state: runnable
bash 32019 [000]  2564.541339:                 preemptirq:irq_enable: caller=__down_write_common+0x337 parent=(nil)F
bash 32019 [000]  2564.541339:             preemptirq:preempt_enable: caller=__down_write_common+0x34b parent=__down_write_common+0x34b
bash 32019 [000]  2564.541339:                  lock:rw_sem_acquired: write 0xffff9e992418fa90F unlink_file_vma+0x2d
bash 32019 [000]  2564.541340:                   lock:rw_sem_release: write 0xffff9e992418fa90F free_pgtables+0xbe
bash 32019 [000]  2564.541340:                preemptirq:irq_disable: caller=__up_write+0x2c parent=(nil)F
bash 32019 [000]  2564.541340:            preemptirq:preempt_disable: caller=__up_write+0x36 parent=__up_write+0x36
bash 32019 [000]  2564.541342:                 preemptirq:irq_enable: caller=__up_write_unlock+0x75 parent=(nil)F
bash 32019 [000]  2564.541344:                preemptirq:irq_disable: caller=trace_hardirqs_off_thunk+0x1a parent=interrupt_entry+0xda
bash 32019 [000]  2564.541344:         irq_vectors:local_timer_entry: vector=236
bash 32019 [000]  2564.541345:                 lock:rt_mutex_request: pendingb_lock+0x0 queue_work_on+0x41

And voilà! A real-time mutex is being taken in the timer IRQ.

This violates two specifications, which state that it is not possible to take mutexes with interrupts disabled and that IRQ handlers do not take mutexes, as shown in the figures below:


Mutex cannot be taken with interrupts disabled.
Operations that are disabled in the interrupt handlers.

It is then possible to observe the code path that takes the rt_spin_lock (a mutex, in practice):

 0)   ==========> |
 0)               |  smp_apic_timer_interrupt() {
 0)               |    __irq_enter_early_tp() {
 0)               |      rcu_irq_enter() {
 0)   0.690 us    |        rcu_nmi_enter();
 0)   2.032 us    |      }
 0)   3.361 us    |    }
 0)               |    irq_enter() {
 0)               |      rcu_irq_enter() {
 0)   0.635 us    |        rcu_nmi_enter();
 0)   1.792 us    |      }
 0)               |      tick_irq_enter() {
 0)   0.792 us    |        tick_check_oneshot_broadcast_this_cpu();
 0)   0.897 us    |        ktime_get();
 0)               |        update_ts_time_stats.constprop.32() {
 0)   0.594 us    |          nr_iowait_cpu();
 0)   1.926 us    |        }
 0)               |        tick_do_update_jiffies64.part.17() {
 0)   0.668 us    |          preempt_count_add();
 0)   0.594 us    |          preempt_count_add();
 0)               |          do_timer() {
 0)   0.676 us    |            calc_global_load();
 0)   1.808 us    |          }
 0)   0.638 us    |          preempt_count_sub();
 0)   0.551 us    |          preempt_count_sub();
 0)               |          update_wall_time() {
 0)               |            timekeeping_advance() {
 0)   0.567 us    |              preempt_count_add();
 0)   0.554 us    |              ntp_tick_length();
 0)   0.552 us    |              ntp_tick_length();
 0)   0.560 us    |              preempt_count_add();
 0)               |              timekeeping_update() {
 0)   0.552 us    |                ntp_get_next_leap();
 0)   0.769 us    |                update_vsyscall();
 0)               |                raw_notifier_call_chain() {
 0)   0.554 us    |                  notifier_call_chain();
 0)   1.621 us    |                }
 0)   0.565 us    |                update_fast_timekeeper();
 0)   0.576 us    |                update_fast_timekeeper();
 0)   7.533 us    |              }
 0)   0.614 us    |              preempt_count_sub();
 0)   0.608 us    |              preempt_count_sub();
 0) + 16.109 us   |            }
 0) + 17.268 us   |          }
 0) + 25.610 us   |        }
 0) + 32.431 us   |      }
 0)               |      irqtime_account_irq() {
 0)   0.547 us    |        in_serving_softirq();
 0)   2.045 us    |      }
 0)   0.561 us    |      preempt_count_add();
 0) + 39.845 us   |    }
 0)               |    hrtimer_interrupt() {
 0)   0.626 us    |      preempt_count_add();
 0)   0.786 us    |      ktime_get_update_offsets_now();
 0)               |      __hrtimer_run_queues.constprop.28() {
 0)   1.021 us    |        __remove_hrtimer();
 0)   0.574 us    |        preempt_count_sub();
 0)               |        watchdog_timer_fn() {
 0)               |          completion_done() {
 0)   0.541 us    |            preempt_count_add();
 0)   0.554 us    |            preempt_count_sub();
 0)   2.946 us    |          }
 0)               |          queue_work_on() {
 0)   0.597 us    |            migrate_disable();
 0)               |            rt_spin_lock() {
 0)   0.554 us    |              migrate_disable();
 0)               |              /* rt_mutex_request: pendingb_lock+0x0/0x40 return_to_handler+0x0/0x36 */
 0)               |              /* rt_mutex_acquired: pendingb_lock+0x0/0x40 return_to_handler+0x0/0x36 */
 0)   3.350 us    |            }
 0)               |            __queue_work() {
 0)   0.542 us    |              __rcu_read_lock();
 0)   0.593 us    |              get_work_pool();
 0)               |              rt_spin_lock() {
 0)   0.562 us    |                migrate_disable();
 0)               |                /* rt_mutex_request: cpu_worker_pools+0x0/0x680 return_to_handler+0x0/0x36 */
 0)               |                /* rt_mutex_acquired: cpu_worker_pools+0x0/0x680 return_to_handler+0x0/0x36 */
 0)   2.446 us    |              }
 0)               |              insert_work() {
 0)   0.563 us    |                get_pwq.isra.32();
 0)               |                wake_up_worker() {
 0)   0.547 us    |                  preempt_count_add();
 0)               |                  wake_up_process() {
 0)               |                    try_to_wake_up() {
 0)   0.560 us    |                      preempt_count_add();
 0)   0.558 us    |                      preempt_count_add();
 0)               |                      update_rq_clock() {
 0)   0.590 us    |                        kvm_steal_clock();
 0)   1.820 us    |                      }
 0)               |                      activate_task() {
 0)               |                        enqueue_task_fair() {
 0)               |                          enqueue_entity() {
 0)   0.580 us    |                            update_curr();
 0)   0.626 us    |                            update_cfs_group();
 0)   0.587 us    |                            __enqueue_entity();
 0)   4.352 us    |                          }
 0)   0.553 us    |                          hrtick_update();
 0)   6.598 us    |                        }
 0)   7.895 us    |                      }
 0)               |                      check_preempt_curr() {
 0)   0.629 us    |                        resched_curr();
 0)   2.035 us    |                      }
 0)   0.555 us    |                      preempt_count_sub();
 0)   0.548 us    |                      preempt_count_sub();
 0) + 19.316 us   |                    }
 0) + 20.400 us   |                  }
 0)   0.567 us    |                  preempt_count_sub();
 0) + 23.781 us   |                }
 0) + 26.107 us   |              }
 0)               |              rt_spin_unlock() {
 0)               |                /* rt_mutex_release: cpu_worker_pools+0x0/0x680 return_to_handler+0x0/0x36 */
 0)   0.580 us    |                migrate_enable();
 0)   2.343 us    |              }
 0)   0.576 us    |              __rcu_read_unlock();
 0) + 36.708 us   |            }
 0)               |            queue_delayed_work_on.part.49() {
 0)               |              rt_spin_unlock() {
 0)               |                /* rt_mutex_release: pendingb_lock+0x0/0x40 return_to_handler+0x0/0x36 */
 0)   0.547 us    |                migrate_enable();
 0)   2.034 us    |              }
 0)   0.968 us    |              migrate_enable();
 0)   4.576 us    |            }
 0) + 48.171 us   |          }
 0)   0.704 us    |          ktime_get();
 0)   0.552 us    |          hrtimer_forward();
 0)   0.622 us    |          __touch_watchdog();
 0) + 57.152 us   |        }
 0)   0.543 us    |        preempt_count_add();
 0)   0.771 us    |        enqueue_hrtimer();
 0) + 63.902 us   |      }
 0)               |      __hrtimer_get_next_event() {
 0)   1.038 us    |        __hrtimer_next_event_base();
 0)   0.673 us    |        __hrtimer_next_event_base();
 0)   3.437 us    |      }
 0)   0.558 us    |      preempt_count_sub();
 0)               |      tick_program_event() {
 0)               |        clockevents_program_event() {
 0)   0.658 us    |          ktime_get();
 0)   8.836 us    |          lapic_next_deadline();
 0) + 11.393 us   |        }
 0) + 12.489 us   |      }
 0) + 86.907 us   |    }
 0)               |    irq_exit() {
 0)   0.722 us    |      irqtime_account_irq();
 0)   0.602 us    |      preempt_count_sub();
 0)   0.572 us    |      idle_cpu();
 0)               |      rcu_irq_exit() {
 0)   0.588 us    |        rcu_nmi_exit();
 0)   1.648 us    |      }
 0)   6.500 us    |    }
 0)               |    __irq_exit_late_tp() {
 0)               |      rcu_irq_exit() {
 0)   0.572 us    |        rcu_nmi_exit();
 0)   1.624 us    |      }
 0)   2.690 us    |    }
 0) ! 146.366 us  |  }
 0)   <========== |

On a single-core system (!CONFIG_SMP), the following chain of events can take place:

smp_apic_timer_interrupt(){
	hrtimer_interrupt() {
		__hrtimer_run_queues() {
			watchdog_timer_fn() {
				stop_one_cpu_nowait() {
					#ifndef CONFIG_SMP
						schedule_work() {
							queue_work() {
								queue_work_on() {
								/* phew, long, ah!?.... */	
									local_lock_irqsave() {
										__local_lock_irqsave() {
											__local_lock_irq() {
												spin_lock_irqsave() {
													rt_spin_lock()...
														/* from here on you already know...*/
														/* a lot of } */

And this can cause a "scheduling while atomic" bug. Tracking it down, the commit that introduced the stop_one_cpu_nowait() call to this path was:

commit 9cf57731b63e37ed995b46690adc604891a9a28f
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Thu Jun 7 10:52:03 2018 +0200

    watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work
    
    Oleg suggested to replace the "watchdog/%u" threads with
    cpu_stop_work. That removes one thread per CPU while at the same time
    fixes softlockup vs SCHED_DEADLINE.
    
    But more importantly, it does away with the single
    smpboot_update_cpumask_percpu_thread() user, which allows
    cleanups/shrinkage of the smpboot interface.
    
    Suggested-by: Oleg Nesterov <oleg@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

later modified by:

commit be45bf5395e0886a93fc816bbe41a008ec2e42e2
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Jul 13 12:42:08 2018 +0200

    watchdog/softlockup: Fix cpu_stop_queue_work() double-queue bug
    
    When scheduling is delayed for longer than the softlockup interrupt
    period it is possible to double-queue the cpu_stop_work, causing list
    corruption.
    
    Cure this by adding a completion to track the cpu_stop_work's
    progress.
    
    Reported-by: kernel test robot <lkp@intel.com>
    Tested-by: Rong Chen <rong.a.chen@intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Fixes: 9cf57731b63e ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work")
    Link: http://lkml.kernel.org/r/20180713104208.GW2494@hirez.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

The BUG was reported, and you can find the report here:
https://www.spinics.net/lists/linux-rt-users/msg20376.html

ISORC 2019 Material

Here is the companion material for the paper: Untangling the Intricacies of Thread Synchronization in the PREEMPT_RT Linux Kernel by Daniel Bristot de Oliveira, Rômulo Silva de Oliveira, and Tommaso Cucinotta, to be presented at ISORC 2019.

NOTE: The latest version of the model and patches can be found here: http://bristot.me/linux-task-model/

The model was developed using the Supremica tool. You can find more information about Supremica here.

The source code of the model can be found here. It can be opened with Supremica. From Supremica it is possible to visualize, edit, analyze, simulate and export the model.

In the Linux part, you can find the patch with the perf task_model and tracepoints here. It applies to version v4.14.15-rt13 of the PREEMPT_RT development tree.

It is possible to use a virtual machine to run the experiments. Here are the definition of a QEMU/KVM virtual machine and the kernel config file that work with this VM.

ERRATA: Prof. Martin Fabian found that, although the minimized model was “non-blocking,” the non-minimized one was not. The problem was caused by two missing events (sched_switch_in, sched_switch_in_o) in the state any_thread_running of the specification s19. This difference changes the number of states from 13906 to 13786, and the number of transitions from 31708 to 31221. The problem had already been resolved in a version earlier than the submission of the paper, but I (Daniel) forgot to update Table II and the reference model. This error does not change any of the results. The link above points to the correct model. The broken model is here.

Linux Plumbers 2018 report

It was a success!

This year I ended up giving five talks! All related to real-time Linux, as expected. I think the most important feedback was:

– The idea of using automata models to explain Linux’s dynamics and, mainly, to verify the behavior of the system was very well received by the community. They want even more: to extend the idea to other subsystems of the kernel.

– The semi-partitioned scheduler is a wish for Linux!

– We need new metrics for the real-time kernel, and we need to integrate them with probabilistic execution time analysis. I was already planning to do that, but now I am even more motivated!

Delivering at least part of these ideas is my main motivation for the next year, and I hope to talk about them at Linux Plumbers 2019!

Here are the slides of my presentations:

Mind the gap between real-time Linux and real-time theory

SCHED_DEADLINE desiderata and slightly crazy ideas

How can we catch problems that can break the PREEMPT_RT preemption model?

Beyond the latency: New metrics for the real-time kernel

The talks were recorded; I will post the links here as soon as they are published!

Real-time Linux Summit 2018 – My slides

Hi!

Here are the slides of my presentation in this year’s Real-time Linux Summit.

The title is “Mind the gap: between real-time Linux and real-time theory (Part I).” It is about the creation of a formal model for tasks on Linux, at the intersection of the viewpoints of real-time Linux developers and real-time researchers.

I received excellent feedback from the Linux community. There is interest in applying my method to the verification of the PREEMPT_RT kernel, even in areas I was not expecting, like static code analysis.

For a computer scientist, seeing practical results from applying a formal method (automata) to a real and complex system is proof that those long hours studying theory were not in vain :-).

Modeling the Behavior of Threads in the PREEMPT_RT Linux Kernel Using Automata: All material

From the communication:

“On behalf of the Program Committee for EWiLi 2018, I am pleased to inform you that the following submission has been accepted for publication in the workshop: Modeling the Behavior of Threads in the PREEMPT_RT Linux Kernel Using Automata”

The best thing is that all the reviewers’ comments are constructive. All three reviewers gave the paper an “accept” (a paper can be: accepted, weakly accepted, borderline, weakly rejected, or rejected)!

You can find the paper in this link: Paper

The presentation in this link: Slides

The model in this link: Model

The code is in this link: Code

All open-source!

An Empirical Study on the Adequacy of MBPTA for Tasks Executed on a Complex Computer Architecture with Linux

The paper An Empirical Study on the Adequacy of MBPTA for Tasks Executed on a Complex Computer Architecture with Linux, of which I am a co-author, was awarded travel funding by the IEEE IES S&YP!

Karila and Luis Fernando made a video about the paper, check this out:

Setting real-time priorities to services using systemd

Systemd supports setting real-time priorities for services launched during the boot process. This is done using the following directives in the Service section:

  • CPUSchedulingPolicy=
    Sets the CPU scheduling policy for executed processes. Takes one of other, batch, idle, fifo or rr.
  • CPUSchedulingPriority=
    Sets the CPU scheduling priority for executed processes. The available priority range depends on the selected CPU scheduling policy. For real-time scheduling policies, an integer between 1 (lowest priority) and 99 (highest priority) can be used.

Example

This is the mcelog service unit:

[Unit]
Description=Machine Check Exception Logging Daemon

[Service]
ExecStart=/usr/sbin/mcelog --ignorenodev --daemon --foreground

[Install]
WantedBy=multi-user.target

Add the following directives to the [Service] section:

CPUSchedulingPolicy=fifo
CPUSchedulingPriority=20

For example:

[Unit]
Description=Machine Check Exception Logging Daemon

[Service]
ExecStart=/usr/sbin/mcelog --ignorenodev --daemon --foreground
CPUSchedulingPolicy=fifo
CPUSchedulingPriority=20

[Install]
WantedBy=multi-user.target

When restarted, the mcelog service will start with the priority set by systemd:

# tuna --show_threads | grep mcelog
4680 FIFO 20 0,1 2 0 mcelog

__schedule() being called twice, the second in vain

Here is one example of the trace of the problem I reported in this e-mail to the LKML.

Note: Patch accepted! This patch is the first change proposed based on the analysis of the task model I am developing in my Ph.D. thesis.

Trace setup

The trace setup I am using is the following:

# cd /sys/kernel/debug/tracing/
# echo preempt_schedule_irq schedule schedule_preempt_disabled schedule_idle schedule_user >> set_graph_function
# echo function_graph > current_tracer
# echo sched:sched_switch sched:sched_wakeup > set_event
# echo irq_vectors >> set_event
# echo $(( 1408 * 100 )) > buffer_size_kb
# echo > trace

In addition, I added a trace_printk() at the entry and exit of __schedule(), and another when there is no context switch (because __schedule() is notrace).

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 70641e8f3c89..406a0b7ed6b3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3443,6 +3443,7 @@ static void __sched notrace __schedule(bool preempt)
        struct rq *rq;
        int cpu;
 
+       trace_printk("__schedule_entry\n");
        cpu = smp_processor_id();
        rq = cpu_rq(cpu);
        prev = rq->curr;
@@ -3517,9 +3518,11 @@ static void __sched notrace __schedule(bool preempt)
        } else {
                rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
                rq_unlock_irq(rq, &rf);
+               trace_printk("oh boy, no context switch\n");
        }
 
        balance_callback(rq);
+       trace_printk("__schedule_exit\n");
 }

This is the perf command line I am using to generate a load that makes the problem visible:

# perf record -o /dev/shm/lala

I ran perf for some seconds and then read the trace.

Trace analysis

The kworker-336 is running:

 ------------------------------------------
 0)   ktimers-8    =>  kworker-336  
 ------------------------------------------

  310.241851 |   0)  kworker-336   |               |    finish_task_switch() {
  310.241851 |   0)  kworker-336   |   0.183 us    |      preempt_count_sub();
  310.241853 |   0)  kworker-336   |   1.467 us    |    }
  310.241853 |   0)  kworker-336   |               |  /* __schedule_exit */
  310.241854 |   0)  kworker-336   |   0.229 us    |    preempt_count_sub();
  310.241855 |   0)  kworker-336   |               |    wq_worker_running() {
  310.241856 |   0)  kworker-336   |   0.280 us    |      kthread_data();
  310.241857 |   0)  kworker-336   |   1.569 us    |    }
  310.241858 |   0)  kworker-336   | * 39582.48 us |  }

At this point it calls the scheduler:

  310.242147 |   0)  kworker-336   |               |  schedule() {
  310.242148 |   0)  kworker-336   |               |    wq_worker_sleeping() {
  310.242149 |   0)  kworker-336   |   0.239 us    |      kthread_data();

while in the wake_up_worker() call:

  310.242151 |   0)  kworker-336   |               |      wake_up_worker() {
  310.242151 |   0)  kworker-336   |   0.186 us    |        preempt_count_add();
  310.242152 |   0)  kworker-336   |               |        wake_up_process() {
  310.242153 |   0)  kworker-336   |               |          try_to_wake_up() {

And another kworker is awakened:

  310.242153 |   0)  kworker-336   |   0.200 us    |            preempt_count_add();
  310.242155 |   0)  kworker-336   |   0.186 us    |            preempt_count_add();
  310.242156 |   0)  kworker-336   |               |            update_rq_clock() {
  310.242157 |   0)  kworker-336   |   0.193 us    |              kvm_steal_clock();
  310.242158 |   0)  kworker-336   |   1.610 us    |            }
  310.242159 |   0)  kworker-336   |               |            activate_task() {
  310.242160 |   0)  kworker-336   |               |              enqueue_task_fair() {
  310.242160 |   0)  kworker-336   |               |                enqueue_entity() {
  310.242161 |   0)  kworker-336   |               |                  update_curr() {
  310.242162 |   0)  kworker-336   |   0.249 us    |                    update_min_vruntime();
  310.242163 |   0)  kworker-336   |               |                    cpuacct_charge() {
  310.242164 |   0)  kworker-336   |   0.155 us    |                      __rcu_read_lock();
  310.242165 |   0)  kworker-336   |   0.160 us    |                      __rcu_read_unlock();
  310.242166 |   0)  kworker-336   |   2.495 us    |                    }
  310.242167 |   0)  kworker-336   |   0.161 us    |                    __rcu_read_lock();
  310.242168 |   0)  kworker-336   |   0.166 us    |                    __rcu_read_unlock();
  310.242169 |   0)  kworker-336   |   7.787 us    |                  }
  310.242170 |   0)  kworker-336   |   0.216 us    |                  update_cfs_group();
  310.242171 |   0)  kworker-336   |   0.216 us    |                  __enqueue_entity();
  310.242172 |   0)  kworker-336   | + 11.478 us   |                }
  310.242173 |   0)  kworker-336   |   0.158 us    |                hrtick_update();
  310.242174 |   0)  kworker-336   | + 13.759 us   |              }
  310.242174 |   0)  kworker-336   | + 15.052 us   |            }
  310.242175 |   0)  kworker-336   |               |            check_preempt_curr() {
  310.242176 |   0)  kworker-336   |               |              check_preempt_wakeup() {
  310.242176 |   0)  kworker-336   |   0.209 us    |                update_curr();
  310.242178 |   0)  kworker-336   |   0.158 us    |                wakeup_preempt_entity.isra.39();
  310.242179 |   0)  kworker-336   |   0.291 us    |                set_next_buddy();
  310.242180 |   0)  kworker-336   |   0.246 us    |                resched_curr_lazy();
  310.242182 |   0)  kworker-336   |   5.704 us    |              }
  310.242183 |   0)  kworker-336   |   7.150 us    |            }
  310.242184 |   0)  kworker-336   |               |            /* sched_wakeup: comm=kworker/0:2 pid=104 prio=120 target_cpu=000 */
  310.242184 |   0)  kworker-336   |   0.349 us    |            preempt_count_sub();
  310.242186 |   0)  kworker-336   |   0.214 us    |            preempt_count_sub();
  310.242187 |   0)  kworker-336   | + 34.075 us   |          }
  310.242188 |   0)  kworker-336   | + 35.213 us   |        }
  310.242189 |   0)  kworker-336   |               |        preempt_count_sub() {

And here __schedule is called:

  310.242190 |   0)  kworker-336   |               |          /* __schedule_entry */
  310.242190 |   0)  kworker-336   |               |          rcu_note_context_switch() {
  310.242191 |   0)  kworker-336   |   0.215 us    |            rcu_sched_qs();
  310.242192 |   0)  kworker-336   |   0.214 us    |            rcu_preempt_qs();
  310.242194 |   0)  kworker-336   |   2.982 us    |          }
  310.242194 |   0)  kworker-336   |   0.237 us    |          preempt_count_add();
  310.242196 |   0)  kworker-336   |               |          update_rq_clock() {
  310.242197 |   0)  kworker-336   |   0.209 us    |            kvm_steal_clock();
  310.242198 |   0)  kworker-336   |   1.657 us    |          }
  310.242199 |   0)  kworker-336   |               |          pick_next_task_fair() {
  310.242199 |   0)  kworker-336   |               |            update_curr() {
  310.242201 |   0)  kworker-336   |   0.261 us    |              update_min_vruntime();
  310.242202 |   0)  kworker-336   |               |              cpuacct_charge() {
  310.242203 |   0)  kworker-336   |   0.203 us    |                __rcu_read_lock();
  310.242204 |   0)  kworker-336   |   0.199 us    |                __rcu_read_unlock();
  310.242206 |   0)  kworker-336   |   2.838 us    |              }
  310.242206 |   0)  kworker-336   |   0.201 us    |              __rcu_read_lock();
  310.242207 |   0)  kworker-336   |   0.275 us    |              __rcu_read_unlock();
  310.242209 |   0)  kworker-336   |   8.898 us    |            }
  310.242209 |   0)  kworker-336   |   0.205 us    |            check_cfs_rq_runtime();
  310.242211 |   0)  kworker-336   |               |            pick_next_entity() {
  310.242212 |   0)  kworker-336   |   0.229 us    |              wakeup_preempt_entity.isra.39();
  310.242213 |   0)  kworker-336   |   0.317 us    |              clear_buddies();
  310.242214 |   0)  kworker-336   |   2.959 us    |            }
  310.242215 |   0)  kworker-336   |               |            put_prev_entity() {
  310.242216 |   0)  kworker-336   |   0.246 us    |              update_curr();
  310.242217 |   0)  kworker-336   |   0.214 us    |              check_cfs_rq_runtime();
  310.242218 |   0)  kworker-336   |   0.382 us    |              __enqueue_entity();
  310.242220 |   0)  kworker-336   |   4.481 us    |            }
  310.242220 |   0)  kworker-336   |   0.461 us    |            set_next_entity();
  310.242222 |   0)  kworker-336   | + 22.840 us   |          }

resulting in a context switch:

  310.242223 |   0)  kworker-336   |               |          /* sched_switch: prev_comm=kworker/0:7 prev_pid=336 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:2 next_pid=104 next_prio=120 */
  310.242224 |   0)  kworker-336   |   0.244 us    |          enter_lazy_tlb();
 ------------------------------------------
 0)  kworker-336   =>  kworker-104  
 ------------------------------------------
[ lines deleted ]

Until the point in which the kworker leaves the processor:

  310.243004 |   0)  kworker-104   |               |  /* __schedule_entry */
  310.243004 |   0)  kworker-104   |               |    rcu_note_context_switch() {
  310.243005 |   0)  kworker-104   |   0.171 us    |      rcu_sched_qs();
  310.243006 |   0)  kworker-104   |   0.167 us    |      rcu_preempt_qs();
  310.243007 |   0)  kworker-104   |   2.478 us    |    }
  310.243008 |   0)  kworker-104   |   0.184 us    |    preempt_count_add();
  310.243009 |   0)  kworker-104   |               |    update_rq_clock() {
  310.243009 |   0)  kworker-104   |   0.176 us    |      kvm_steal_clock();
  310.243011 |   0)  kworker-104   |   1.405 us    |    }
  310.243011 |   0)  kworker-104   |               |    deactivate_task() {
  310.243012 |   0)  kworker-104   |               |      dequeue_task_fair() {
  310.243012 |   0)  kworker-104   |               |        dequeue_entity() {
  310.243013 |   0)  kworker-104   |               |          update_curr() {
  310.243014 |   0)  kworker-104   |   0.190 us    |            update_min_vruntime();
  310.243015 |   0)  kworker-104   |               |            cpuacct_charge() {
  310.243015 |   0)  kworker-104   |   0.176 us    |              __rcu_read_lock();
  310.243016 |   0)  kworker-104   |   0.157 us    |              __rcu_read_unlock();
  310.243018 |   0)  kworker-104   |   2.370 us    |            }
  310.243018 |   0)  kworker-104   |   0.180 us    |            __rcu_read_lock();
  310.243019 |   0)  kworker-104   |   0.153 us    |            __rcu_read_unlock();
  310.243020 |   0)  kworker-104   |   6.974 us    |          }
  310.243021 |   0)  kworker-104   |   0.193 us    |          clear_buddies();
  310.243022 |   0)  kworker-104   |   0.164 us    |          update_cfs_group();
  310.243023 |   0)  kworker-104   | + 10.589 us   |        }
  310.243024 |   0)  kworker-104   |   0.162 us    |        hrtick_update();
  310.243025 |   0)  kworker-104   | + 12.879 us   |      }
  310.243026 |   0)  kworker-104   | + 14.158 us   |    }
  310.243026 |   0)  kworker-104   |               |    pick_next_task_fair() {
  310.243027 |   0)  kworker-104   |   0.187 us    |      check_cfs_rq_runtime();
  310.243028 |   0)  kworker-104   |               |      pick_next_entity() {
  310.243029 |   0)  kworker-104   |   0.190 us    |        clear_buddies();
  310.243030 |   0)  kworker-104   |   1.275 us    |      }
  310.243030 |   0)  kworker-104   |               |      put_prev_entity() {
  310.243031 |   0)  kworker-104   |   0.181 us    |        check_cfs_rq_runtime();
  310.243032 |   0)  kworker-104   |   1.379 us    |      }
  310.243033 |   0)  kworker-104   |   0.335 us    |      set_next_entity();
  310.243034 |   0)  kworker-104   |   7.416 us    |    }
  310.243035 |   0)  kworker-104   |               |  /* sched_switch: prev_comm=kworker/0:2 prev_pid=104 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:7 next_pid=336 next_prio=120 */
  310.243036 |   0)  kworker-104   |   0.185 us    |    enter_lazy_tlb();

And the context switches back to the first kworker:

 ------------------------------------------
 0)  kworker-104   =>  kworker-336  
 ------------------------------------------

  310.243037 |   0)  kworker-336   |               |          finish_task_switch() {
  310.243038 |   0)  kworker-336   |   0.181 us    |            preempt_count_sub();
  310.243039 |   0)  kworker-336   |   1.332 us    |          }
  310.243040 |   0)  kworker-336   |               |          /* __schedule_exit */
  310.243040 |   0)  kworker-336   |   0.248 us    |        } /* preempt_count_sub */
  310.243041 |   0)  kworker-336   | ! 889.847 us  |      } /* wake_up_worker */

After returning from wake_up_worker(), preemption is disabled and __schedule() is called again:

  310.243041 |   0)  kworker-336   | ! 892.553 us  |    } /* wq_worker_sleeping */
  310.243042 |   0)  kworker-336   |   0.181 us    |    preempt_count_add();
  310.243043 |   0)  kworker-336   |               |  /* __schedule_entry */
  310.243043 |   0)  kworker-336   |               |    rcu_note_context_switch() {
  310.243044 |   0)  kworker-336   |   0.175 us    |      rcu_sched_qs();
  310.243045 |   0)  kworker-336   |   0.174 us    |      rcu_preempt_qs();
  310.243046 |   0)  kworker-336   |   2.456 us    |    }
  310.243047 |   0)  kworker-336   |   0.188 us    |    preempt_count_add();
  310.243048 |   0)  kworker-336   |               |    update_rq_clock() {
  310.243049 |   0)  kworker-336   |   0.191 us    |      kvm_steal_clock();
  310.243050 |   0)  kworker-336   |   1.429 us    |    }
  310.243051 |   0)  kworker-336   |               |    pick_next_task_fair() {
  310.243051 |   0)  kworker-336   |               |      update_curr() {
  310.243052 |   0)  kworker-336   |   0.221 us    |        update_min_vruntime();
  310.243053 |   0)  kworker-336   |               |        cpuacct_charge() {
  310.243054 |   0)  kworker-336   |   0.191 us    |          __rcu_read_lock();
  310.243055 |   0)  kworker-336   |   0.158 us    |          __rcu_read_unlock();
  310.243056 |   0)  kworker-336   |   2.388 us    |        }
  310.243056 |   0)  kworker-336   |   0.188 us    |        __rcu_read_lock();
  310.243058 |   0)  kworker-336   |   0.154 us    |        __rcu_read_unlock();
  310.243059 |   0)  kworker-336   |   7.124 us    |      }
  310.243059 |   0)  kworker-336   |   0.200 us    |      check_cfs_rq_runtime();
  310.243060 |   0)  kworker-336   |               |      pick_next_entity() {
  310.243061 |   0)  kworker-336   |   0.192 us    |        clear_buddies();
  310.243062 |   0)  kworker-336   |   1.437 us    |      }
  310.243063 |   0)  kworker-336   | + 11.870 us   |    }
  310.243064 |   0)  kworker-336   |   0.190 us    |    preempt_count_sub();

This is the trace_printk() that shows that no context switch took place:

  310.243065 |   0)  kworker-336   |               |  /* oh boy, no context switch */
  310.243065 |   0)  kworker-336   |               |  /* __schedule_exit */
  310.243065 |   0)  kworker-336   |   0.216 us    |    preempt_count_sub();
  310.243067 |   0)  kworker-336   |               |    wq_worker_running() {
  310.243067 |   0)  kworker-336   |   0.214 us    |      kthread_data();
  310.243069 |   0)  kworker-336   |   1.390 us    |    }
  310.243069 |   0)  kworker-336   | ! 921.052 us  |  }

In this case, the second __schedule() is always called in vain.