Linux scheduling latency debug and analysis

This page contains the how-to for using rtla timerlat for scheduling latency debugging and analysis. It served as the script for the talk rtla timerlat: debugging real-time Linux scheduling latency at the Embedded Open Source Summit 2023 in Prague, CZ. Below you can find the slides used in the talk. This page will be updated as new features are added to rtla timerlat.

Reducing the scheduling latency is one of the main goals of the PREEMPT_RT patchset. The scheduling latency is historically measured using a black box approach: a measuring thread at the highest priority sets a timer using an external clock reference and goes to sleep waiting for it. When the timer fires, the thread is awakened and reads the current time. A scheduling latency sample is then reported as the delta between the time read and the time at which the timer was set to expire. For instance, cyclictest operates in this way.

Fig 1 illustrates a timeline for this measurement.
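For reference, this is how a typical cyclictest session is started (flags as in the cyclictest man page; the values are only illustrative):

$ sudo cyclictest --smp -p 95 -m

Here --smp runs one measuring thread per CPU, -p sets the SCHED_FIFO priority, and -m locks memory to avoid page faults during the measurement.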

This method works well to sample the scheduling latency. However, it has some well-known limitations. The principal one is that, as an opaque value, it provides no hints about the composition of an unexpectedly high sample. When a value higher than an acceptable threshold is observed, these tools can at most stop an external tracing session, whose setup is left to a tracing expert. Timerlat fills this gap by integrating measurement and tracing, along with a series of features that support the analysis of the scheduling latency.

rtla timerlat

Timerlat is composed of two components. The timerlat tracer is the in-kernel component that provides the timer handling, tracing, and (optionally) the in-kernel workload. rtla timerlat is a user-space tool that controls the tracing session, allows additional configuration, collects data, and (optionally) creates a user-space workload. Figure 2 shows a basic example of rtla timerlat output.

$ sudo timerlat
                                     Timer Latency                                              
  0 00:01:00   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)
CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max
  0 #60002     |        1         0         0         2 |        2         1         2         5
  1 #60002     |        1         0         1         1 |        2         1         2         7
  2 #60002     |        1         0         1         1 |        2         1         2         7
  3 #60002     |        1         0         1         6 |        2         1         2        14
  4 #60002     |        1         0         1        15 |        2         2         2        22
  5 #60002     |        1         0         1         1 |        2         1         2         9
  6 #60002     |        1         0         1         3 |        2         2         2         9
  7 #60002     |        1         0         1         1 |        2         1         2         4

Fig 2: Timerlat basic output
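Before running it, it is worth checking that the running kernel provides the timerlat tracer (CONFIG_TIMERLAT_TRACER); a minimal check, assuming tracefs is mounted at /sys/kernel/tracing:

$ grep timerlat /sys/kernel/tracing/available_tracers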

At first look, one notices that timerlat presents two latencies instead of one. The timerlat timer is managed by the in-kernel timerlat tracer. The in-kernel tracer installs a special timer IRQ handling function that reports the IRQ latency as soon as the handler starts to run. The IRQ handler then wakes up the measuring thread, which reports the thread latency. Figure 3 shows the relation between these two values.

Fig 3: Timerlat IRQ and thread latency

Timerlat auto analysis

Another key timerlat feature is the auto analysis, activated via the -a <threshold> option. Leveraging the tracing backend, rtla timerlat sets up the tracing to stop if a latency crosses the given <threshold>. rtla timerlat then parses the trace, providing an automatic analysis of the root cause. For instance, setting the latency threshold to 30 us:

$ sudo timerlat -a 30 
                                     Timer Latency                                              
  0 00:00:01   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)
CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max
  0 #763       |        1         0         1         9 |        8         4         8        18
  1 #763       |        1         0         1         8 |       12         4        12        21
  2 #763       |        1         0         1         5 |       13         4        15        23
  3 #763       |        1         0         1         8 |       16         4        14        21
  4 #763       |       12         0         1        16 |       28         5        12        28
  5 #763       |        1         0         1         8 |       12         4        11        22
  6 #763       |       32         0         1        32 |       52         5        13        52
  7 #763       |        0         0         1        11 |        7         5        12        20
rtla timerlat hit stop tracing
## CPU 6 hit stop tracing, analyzing it ##
  IRQ handler delay:                          31.00 us (59.56 %)
  IRQ latency:                                32.17 us
  Timerlat IRQ duration:                       9.57 us (18.38 %)
  Blocking thread:                             8.77 us (16.84 %)
                   objtool:1164402             8.77 us
    Blocking thread stack trace
		-> timerlat_irq
		-> __hrtimer_run_queues
		-> hrtimer_interrupt
		-> __sysvec_apic_timer_interrupt
		-> sysvec_apic_timer_interrupt
		-> asm_sysvec_apic_timer_interrupt
		-> _raw_spin_unlock_irqrestore
		-> cgroup_rstat_flush_locked
		-> cgroup_rstat_flush_irqsafe
		-> mem_cgroup_flush_stats
		-> mem_cgroup_wb_stats
		-> balance_dirty_pages
		-> balance_dirty_pages_ratelimited_flags
		-> btrfs_buffered_write
		-> btrfs_do_write_iter
		-> vfs_write
		-> __x64_sys_pwrite64
		-> do_syscall_64
		-> entry_SYSCALL_64_after_hwframe
------------------------------------------------------------------------
  Thread latency:                             52.05 us (100%)

Saving trace to timerlat_trace.txt

Fig 4: Auto analysis example

The auto analysis splits the thread latency value into the series of variables that compose it, so the user can focus on the most relevant aspects. The following section presents the essential background for interpreting the auto analysis.

Latency Analysis

Borrowing terms from real-time scheduling theory, we can classify the causes for latency as:

  • Interference: When a higher-priority task preempts a lower-priority one
  • Blocking: When a low-priority task delays a higher-priority one
  • Release Jitter: The timer IRQ delay caused by external events, e.g., hardware
  • Execution time: The work necessary for the accomplishment of the goal

Unlike the first three, the execution time is not a delay imposed by other tasks: it is the time taken by the work itself, which in timerlat's case is to wake up the sampling thread.

Interference

As explained, a higher-priority task causes interference. Threads are the first type of task that comes to mind, but there are more contexts in which code can run, and these map to tasks in the following priority order:

  • NMI: Nonmaskable interrupts preempt any other type of task
  • IRQ: IRQs preempt any other type of task, except NMIs
  • Softirq: Softirqs preempt threads only (on PREEMPT_RT, softirqs run as regular threads)
  • Threads: Threads can only preempt other threads

Thus, for instance, a thread suffers interference from IRQs.

Blocking

Blocking is caused by a lower-priority task delaying a higher-priority one, and it is generally associated with some form of synchronization. For example, by disabling preemption, a lower-priority thread can postpone the scheduling of a higher-priority thread. Likewise, when any thread disables interrupts, it can block an IRQ.

Release jitter

In this context, release jitter is observed when the hardware delays the delivery of the IRQ handler because of factors external to the operating system. For instance, the IRQ handler starts late on an idle CPU because the processor is leaving a deep idle state.

Example

For example, the auto analysis in Fig 4 makes it possible to draw the following timeline, highlighting some of these variables:

Fig 5: Auto analysis sample timeline
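As a sanity check of the decomposition reported in Fig 4: 31.00 us (IRQ handler delay) + 9.57 us (timerlat IRQ duration) + 8.77 us (blocking) ≈ 49.34 us of the 52.05 us thread latency; the remaining ~2.7 us are presumably overheads, such as scheduling and measurement costs, that the report does not itemize.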

IRQ latency analysis

By splitting the latency into IRQ and Thread latency, timerlat facilitates the isolation of the root cause.

The IRQ can face interference from NMIs and other IRQs. Each platform's interrupt controller schedules the dispatch of concurrent IRQs according to a defined priority, so it can happen that the timer IRQ is dispatched after another, higher-priority IRQ.

An IRQ can also face blocking from a lower-priority IRQ dispatched earlier because IRQ handlers run with IRQs masked. Threads and softirqs can likewise block IRQs by disabling interrupts.

The execution time of the timer IRQ handler also influences the IRQ latency. However, because the latency is measured right at the beginning of the handler, most of the execution time affects only the thread latency.

In the example in Fig 5, the IRQ handler faced 32.17 us of delay. No interference was observed; the blocking time accounts for 31 us of it.

The blocking was caused by the lower-priority objtool:1164402 thread. The blocking thread stack trace shows that a cgroup operation disabled IRQs around a raw_spin_lock operation, causing the delay. Once the IRQ handler started running, it reported the 32.17 us IRQ latency. The execution time of the IRQ handler was 9.57 us.

Fig 6 presents the patch that added the IRQ-disabled section. In this case, it is a legitimate use case for the raw spinlock.

Fig 6: A legitimate patch that causes IRQ blocking

The timer IRQ can also be delayed by factors external to the OS, as Fig 7 shows. In this case, the IRQ handler took 39.01 us to start. It was not a blocking case because the blocking thread was swapper, which is the idle thread; thus, it is possible to assume that external factors caused the delay. One can point to release jitter when the IRQ is delayed even though IRQs are not disabled, for example, when the blocking thread is running in user space.

## CPU 9 hit stop tracing, analyzing it ##
  IRQ handler delay:      (exit from idle)    39.01 us (76.59 %)
  IRQ latency:                                40.49 us
  Timerlat IRQ duration:                       5.85 us (11.49 %)
  Blocking thread:                             3.99 us (7.83 %)
                 swapper/9:0                   3.99 us
    Blocking thread stack trace
		-> timerlat_irq
		-> __hrtimer_run_queues
		-> hrtimer_interrupt
		-> __sysvec_apic_timer_interrupt
		-> sysvec_apic_timer_interrupt
		-> asm_sysvec_apic_timer_interrupt
		-> pv_native_safe_halt
		-> default_idle
		-> default_idle_call
		-> do_idle
		-> cpu_startup_entry
		-> start_secondary
		-> __pfx_verify_cpu
------------------------------------------------------------------------
  Thread latency:                             50.93 us (100%)

Max timerlat IRQ latency from idle: 40.49 us in cpu 9

Fig 7: exit from idle latency example

rtla timerlat adds an option that can aid in reducing the exit from idle latency. The --dma-latency <latency> option instructs the idle driver about the acceptable exit from idle latency. The most common <latency> value is 0. cyclictest sets it to 0 by default; rtla does not, because doing so might mask a bad idle setup. If setting --dma-latency is insufficient, more general hardware tuning is needed.
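For instance, a session combining the auto analysis with the idle constraint could be started as:

$ sudo timerlat -a 30 --dma-latency 0

While the session runs, rtla keeps /dev/cpu_dma_latency open with the given value, preventing the CPUs from entering idle states whose exit latency exceeds it.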

Thread latency analysis

Similarly to the IRQ latency, the thread can face interference from NMIs and IRQs. Additionally, threads can suffer interference from softirq and other threads.

In the regular kernel, softirqs have their own execution context, which preempts threads. On the PREEMPT_RT, softirqs run in the thread context, so they are scheduled as threads.

Threads can cause both interference and blocking, depending on the thread's priority. If the priority is higher than the timerlat thread's, it is considered interference.

If the priority is lower than the timerlat thread's, it causes blocking, either by running with interrupts and/or preemption disabled or via the scheduling overhead of the context switch to the timerlat thread. The blocking thread stack trace helps to identify the root cause of the blocking.

Figure 8 shows a case of low IRQ latency followed by high thread latency. The example was taken from a regular (non-PREEMPT_RT) kernel. In this example, the thread latency was mainly caused by an in-kernel operation of the btrfs file system. As the kernel is non-preemptive, this operation causes blocking until a scheduling point is reached. It is worth noting that rtla timerlat supports all preemption models.

As the blocking time extends, so does the window for interference. In the same example, the timerlat thread suffered interference from the timer IRQ, from the TIMER and RCU softirqs, and from the migration thread, which runs as the highest-priority thread in the system.

## CPU 18 hit stop tracing, analyzing it ##
  IRQ handler delay:                           0.00 us (0.00 %)
  IRQ latency:                                 1.64 us
  Timerlat IRQ duration:                       9.52 us (1.80 %)
  Blocking thread:                           501.68 us (94.96 %)
             kworker/u40:0:306130            501.68 us
    Blocking thread stack trace
		-> timerlat_irq
		-> __hrtimer_run_queues
		-> hrtimer_interrupt
		-> __sysvec_apic_timer_interrupt
		-> sysvec_apic_timer_interrupt
		-> asm_sysvec_apic_timer_interrupt
		-> ZSTD_compressBlock_fast
		-> ZSTD_buildSeqStore
		-> ZSTD_compressBlock_internal
		-> ZSTD_compressContinue_internal
		-> ZSTD_compressEnd
		-> ZSTD_compressStream2
		-> ZSTD_endStream
		-> zstd_compress_pages
		-> btrfs_compress_pages
		-> compress_file_range
		-> async_cow_start
		-> btrfs_work_helper
		-> process_one_work
		-> worker_thread
		-> kthread
		-> ret_from_fork
  IRQ interference                             3.68 us (0.70 %)
               local_timer:236                 3.68 us
  Softirq interference                         4.21 us (0.80 %)
                     TIMER:1                   3.71 us
                       RCU:9                   0.49 us
  Thread interference                          6.21 us (1.17 %)
              migration/18:125                 6.21 us
------------------------------------------------------------------------
  Thread latency:                            528.31 us (100%)

Max timerlat IRQ latency from idle: 10.34 us in cpu 12
  Saving trace to timerlat_trace.txt

Fig 8: Thread latency example

Timerlat hist

By default (as in the previous examples), timerlat runs in the top mode, showing basic statistics about the IRQ and thread latency. The hist mode, instead, runs silently, displaying a histogram of both latencies at the end of the session. Fig 9 shows an example of timerlat hist output, running on CPUs 0 to 3 for 60 seconds.

# timerlat hist -c 0-3 -d 60
# RTLA timerlat histogram
# Time unit is microseconds (us)
# Duration:   0 00:01:00
Index   IRQ-000   Thr-000   IRQ-001   Thr-001   IRQ-002   Thr-002   IRQ-003   Thr-003
0         53321         0      1658         0      2106         0      8210         0 
1          6677     45721     58339        19     57895       316     51776      1292 
2             2     14258         3     59787         0     59457         1     58458 
3             0        14         0       170         0       216         6       219 
4             1         3         0        18         0        10         6        14 
5             0         3         0         3         0         1         2         7 
6             0         0         1         1         0         0         0         7 
7             0         1         0         0         0         0         0         2 
8             0         0         0         2         0         0         0         1 
9             0         0         0         0         0         1         0         1 
11            0         1         0         0         0         0         0         0 
12            0         0         0         1         0         0         0         0 
over:         0         0         0         0         0         0         0         0 
count:    60001     60001     60001     60001     60001     60001     60001     60001 
min:          0         1         0         1         0         1         0         1 
avg:          0         1         0         2         0         1         0         1 
max:          4        11         6        12         1         9         5         9 

Fig 9: timerlat hist output
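The histogram granularity and size can be adjusted; for example, assuming the -b (bucket size, in microseconds) and -E (number of entries) options documented in rtla-timerlat-hist(1):

# timerlat hist -c 0-3 -d 60 -b 10 -E 100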

timerlat tracing

rtla timerlat is a front end for the timerlat tracer, so it is natural for it to also serve as a front end for general tracing. The tracing capabilities of rtla timerlat are enabled via the -t option. With this option set, timerlat sets up a tracing session with the timerlat tracer and the osnoise events enabled. If timerlat hits a stop tracing condition, set with the options -T <threshold> (stop if a thread latency is higher than <threshold>) or -i <threshold> (stop if an IRQ latency is higher than <threshold>), rtla timerlat saves the content of the tracing buffer to a timerlat_trace.txt file. The last line in Fig 4 shows the timerlat message about the trace file: that is because the -a <threshold> option is equivalent to “-t -i <threshold> -T <threshold>.”
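For example, a session that traces and saves the buffer once a thread latency above 50 us is hit could be started as:

$ sudo timerlat -t -T 50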

It is possible to enable other tracing events using the -e <event> option, as in the example below, which enables the sched, workqueue, irq_vectors, and irq event classes.
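A possible command along these lines, assuming -e accepts a whole event class name to enable all of its events:

$ sudo timerlat top -t -e sched -e workqueue -e irq_vectors -e irq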

timerlat also supports tracing event filtering and triggering. These are enabled via the --filter <filter> and --trigger <trigger> options, which apply to the preceding -e <event>. When the --trigger option enables a histogram, timerlat saves the histogram output to a file. For example, the following command line collects a histogram for each type of task during the entire session:

rtla timerlat top -t \
    -e osnoise:nmi_noise     --trigger="hist:key=cpu,duration/1000:sort=cpu,duration/1000:vals=hitcount" \
    -e osnoise:irq_noise     --trigger="hist:key=desc,cpu,duration/1000:sort=cpu,duration/1000:vals=hitcount" \
    -e osnoise:softirq_noise --trigger="hist:key=vector,cpu,duration/1000:sort=cpu,duration/1000:vals=hitcount" \
    -e osnoise:thread_noise  --trigger="hist:key=comm,cpu,duration/1000:sort=cpu,duration/1000:vals=hitcount"

This command is powerful because it provides a complete view of the interference, even when timerlat's sampling could not capture a given scenario, being a step toward discovering the hypothetical worst-case latency. The video below shows one example of this histogram usage.

Coming up…

Currently, timerlat runs with an in-kernel thread workload. Many users, however, prefer running a user-space workload. Timerlat is adding this feature, enabled with the -u option. The video below shows an example of this feature.
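For example, assuming an rtla version that already ships the user-space workload support, it can be started as:

$ sudo timerlat top -u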