Introduction

In the power management area, analyzing the behavior of the system is required to understand where the power leaks. A system under heavy load won't save power, because performance is what is being requested. But in a normal use case, for example a mobile or a desktop, the user will use one application at a time and the other processes on the system will sleep. This is particularly true when we look at the system over a small time frame. For example, on my system, a dual-core i7, there are 269 processes and the system is 98.2% idle; the small fraction of cpu time remaining is used by a process doing polling.

Fortunately, at the beginning of this new century, most programmers use an event-based main loop on file descriptors. There are file descriptors for signals, sockets, timers, files, etc., and a subset of these file descriptors is triggered by interrupts, such as the timers and the network io.

Besides that, the kernel itself relies internally on the interrupt mechanism to manage the cpus.

But even if the system is idle, we have to magnify the small time frames and look for when the cpu is doing something ... or nothing.

The key word here is "idle": we want the cpus to be "idle" as much as possible, we want to understand why the cpus are not "idle" and what we can do to make them more "idle". What wakes up a cpu and makes it exit the idle state? An interrupt.

This document is intended to identify the wakeup sources aka the interrupts for an ARM Cortex-A9 cpu.

Basically we can identify the interrupts from the proc file /proc/interrupts.

Exynos Origen board

           CPU0       CPU1       
 67:          0          0       GIC  dma-pl330.0
 68:          0          0       GIC  dma-pl330.1
 74:       6864          0       GIC  mct_tick0_irq
 80:          0       4051       GIC  mct_tick1_irq
 84:       2367          0       GIC  exynos4210-uart.0, exynos4210-uart.0
 86:        168          0       GIC  exynos4210-uart.2, exynos4210-uart.2, exynos4210-uart.22
 89:         10          0       GIC  mct_comp_irq
 90:        895          0       GIC  s3c2440-i2c.0
 91:          0          0       GIC  s3c2440-i2c.1
105:          0          0       GIC  mmc0
107:       2569          0       GIC  mmc1
429:          0          0  exynos-eint  Menu
430:          0          0  exynos-eint  Home
431:          0          0  exynos-eint  Back
432:          0          0  exynos-eint  Up
433:          0          0  exynos-eint  Down
IPI0:         0          1  CPU wakeup interrupts
IPI1:         0          0  Timer broadcast interrupts
IPI2:      5120       5515  Rescheduling interrupts
IPI3:         0          0  Function call interrupts
IPI4:        12         11  Single function call interrupts
IPI5:         0          0  CPU stop interrupts
Err:          0

Snowball board

           CPU0       CPU1       
 29:       3150       4124       GIC  twd
 36:        783          0       GIC  Nomadik Timer Tick
 44:          0          0       GIC  nmk-i2c
 46:          0          0       GIC  pl022
 50:          0          0       GIC  rtc-pl031
 53:          0          0       GIC  nmk-i2c
 54:          0          0       GIC  nmk-i2c
 57:       1592          0       GIC  dma40
 58:         73          0       GIC  uart-pl011
 72:          0          0       GIC  ab8500
 79:        187          0       GIC  prcmu
 87:          0          0       GIC  nmk-i2c
 92:       4687          0       GIC  mmci-pl18x (cmd)
131:      14897          0       GIC  mmci-pl18x (cmd)
198:          0          0  Nomadik-GPIO  userpb
317:          0          0  Nomadik-GPIO  extkb1
318:          0          0  Nomadik-GPIO  extkb2
327:          0          0  Nomadik-GPIO  extkb3
328:          0          0  Nomadik-GPIO  extkb4
384:          0          0  Nomadik-GPIO  mmci-pl18x (cd)
483:          0          0    ab8500  ab8500-ponkey-dbf
484:          0          0    ab8500  ab8500-ponkey-dbr
495:          0          0    ab8500  ab8500-rtc
516:          0          0    ab8500  ab8500-gpadc
556:          0          0    ab8500  usb-link-status
IPI0:          0          0  CPU wakeup interrupts
IPI1:          0        167  Timer broadcast interrupts
IPI2:       3864       4908  Rescheduling interrupts
IPI3:          0          0  Function call interrupts
IPI4:          3         67  Single function call interrupts
IPI5:          0          0  CPU stop interrupts
Err:          0

Pandaboard Omap4 board

           CPU0       CPU1       
 29:        293        395       GIC  twd
 41:          0          0       GIC  l3-dbg-irq
 42:          0          0       GIC  l3-app-irq
 44:          0          0       GIC  DMA
 69:         61          0       GIC  gp_timer
 88:          0          0       GIC  i2c.9
 89:          0          0       GIC  i2c.10
 93:          0          0       GIC  i2c.11
 94:          0          0       GIC  i2c.12
106:         93          0       GIC  OMAP UART2
169:          0          0      PRCM  hwmod_io
IPI0:         0          0  Timer broadcast interrupts
IPI1:      1424       1260  Rescheduling interrupts
IPI2:         0          0  Function call interrupts
IPI3:        81         90  Single function call interrupts
IPI4:         0          0  CPU stop interrupts
Err:          0

The GIC is the Generic Interrupt Controller, the chipset in charge of managing the interrupts on the ARM architecture, similar to the APIC on the x86 platform.

The documentation is available under the title:

ARM Generic Interrupt Controller - Architecture Specification

Timer

Timer interrupts are among the most complex and interesting interrupts. They are widely used by the system for:

  • all userspace timer setup via syscalls
    • select, poll, timer_create, setitimer, ...
  • the network stack, especially for the tcp protocol
  • the kernel for the next event and all timeout'ed io
  • the kernel for kworker handling the delayed work queues, timeouts
  • the tick sched

There are two kinds of timer: the local timer and the global timer.

The x86 platform clearly shows a line "Local Timer Interrupt" in the proc file, but on ARM platforms the local timers are referred to as twd, which stands for Timer Watchdog. It is the generic framework for the local timers. If the board-specific code uses this framework, the local timer appears in the /proc/interrupts file as twd with the interrupt number 29. You may check this in the examples above.

When the SoC does not use this generic framework, given the high variety of ARM platforms and the information shown in /proc/interrupts, it is not obvious which interrupts are the local timer and the global timer. The technical documentation of the board, dead code listing and mailing list harassment may be needed.

The Cortex-A9 MPCore Technical Reference Manual explains the timers in detail in Chapter 4.

Local Timer

The local timer is tied to a cpu: there is one local timer per cpu. When the timer expires, the cpu is woken up. The local timer is the one used by the system when the processor is not idle, because of its accuracy and because it is faster.

When the cpu enters idle in retention mode or a deeper sleep state on the Cortex-A9, the local timers are shut down as well. In the kernel code, the local timer, seen as a clock event device, is set with the flag CLOCK_EVT_FEAT_C3STOP, which means it will be stopped when we go to a deep idle state. The name of this flag is inherited from the Intel semantics of the C-states and does not really make sense on ARM. The function clockevents_notify, called with the ENTER/EXIT parameter, tells the time framework that the cpu is entering the idle state; the time framework, in turn, checks CLOCK_EVT_FEAT_C3STOP for the local timer and shuts it down.

In order to prevent the loss of the time service, a broadcast timer is used instead. A cpu which is not idle is elected to take into account the next timer expiration and to broadcast an event, with a software interrupt (an IPI, described below), to all the cpus concerned by the next event due. In case the cpu going idle is the last one on the system (the others are already idle), the time framework switches the clock device from the local timer to the global timer, which is not impacted by the retention mode. The local timer is switched back when the cpu exits the idle state.

That should be kept in mind when looking at the /proc/interrupts file: when the number of interrupts for the global timer is much greater than for the local timer, the system is mostly idle, which happens with a minimal bare busybox system. If you use an ubuntu distro with some applications, e.g. thunderbird, emacs, a screensaver, etc., you should see the local timer used much more often than the global timer.

The file /proc/timer_list contains a lot of useful information about the timers. If the kernel is compiled with CONFIG_TIMER_STATS, the file /proc/timer_stats contains statistics about timer usage. The acquisition must be enabled with: echo 1 > /proc/timer_stats

IPI : Inter Processor Interrupt

When a cpu is idle, it is in a state of waiting for an interrupt. Usually these interrupts are hardware interrupts, like keyboard, mouse, network or io, and timer interrupts, but it could also be an IPI.

An IPI is a software interrupt. On the ARM architecture, the software interrupt is sent through the Generic Interrupt Controller (GIC) via the smp_cross_call() function which is the callback function for raise_soft_irq(). The number of IPIs is limited by the hardware to 16 interrupts. At the moment, 6 of them are defined, as the following array shows.

static const char *ipi_types[NR_IPI] = {
#define S(x,s)  [x] = s
        S(IPI_WAKEUP, "CPU wakeup interrupts"),
        S(IPI_TIMER, "Timer broadcast interrupts"),
        S(IPI_RESCHEDULE, "Rescheduling interrupts"),
        S(IPI_CALL_FUNC, "Function call interrupts"),
        S(IPI_CALL_FUNC_SINGLE, "Single function call interrupts"),
        S(IPI_CPU_STOP, "CPU stop interrupts"),
};

The function  handle_IPI()  is in charge of handling any soft interrupt and calling the corresponding function depending on the soft interrupt number.

IPI0    IPI_WAKEUP              no callback
IPI1    IPI_TIMER               ipi_timer()
IPI2    IPI_RESCHEDULE          scheduler_ipi()
IPI3    IPI_CALL_FUNC           generic_smp_call_function_interrupt()
IPI4    IPI_CALL_FUNC_SINGLE    generic_smp_call_function_single_interrupt()
IPI5    IPI_CPU_STOP            ipi_cpu_stop()

Please note that this list is platform specific and could be slightly different.

The IPI can be used to emulate an interrupt from a timer. In this case, the cpu in charge of broadcasting the timer event uses a specific IPI number to wake up all the cpus whose local timer is down, and that triggers the timer interrupt for them. The IPI also allows a function to be invoked on a cpu other than the one executing the instruction flow.

IPI0 : CPU wakeup interrupts

Not used.

IPI1 : Timer broadcast interrupts

An emulated timer interrupt, already explained above with the timer interrupt. It occurs when a cpuidle driver is enabled in the system and supports a deep sleep state. This interrupt does not occur when no cpuidle driver is set or when the idle states do not power down the local timers.

As an exercise, we will spot this behavior by putting the system in a situation where it has to generate a lot of timer broadcast interrupts. To reach this situation, we will isolate a cpu as much as possible from any activity with a cpuset, forcing the cpu to reach a deep idle state; offlining it is a bad idea. The next step is to run, on this cpu only, an infinite loop doing a sleep which is long enough to enter the deep idle state but short enough to receive a number of these IPIs per second large enough to be highlighted.

This test is done on a snowball, where the cpuidle driver behaves exactly as expected. The other boards have coupled idle states or only non-deep idle states.

1. Isolating the cpu

Let's setup the cpuset:

# mount the cgroup with the cpuset option
mount -t cgroup -ocpuset cgroup /sys/fs/cgroup

# clone the cpuset configuration from the cgroup parent
# when we create a new cgroup
echo 1 > /sys/fs/cgroup/cgroup.clone_children

# create a new cpuset
mkdir /sys/fs/cgroup/cpu1

# assign cpu1 to this newly created cgroup
echo 1 > /sys/fs/cgroup/cpu1/cpuset.cpus

# make this cpu exclusive
echo 1 > /sys/fs/cgroup/cpu1/cpuset.cpu_exclusive

# assign the current task to this cpuset
# NOTE : all the next commands *must* be run from this shell

echo $$ > /sys/fs/cgroup/cpu1/tasks

Ok, we are done with the cpuset setup.

2. Generate the timer interrupts

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        useconds_t usec;

        if (argc < 2) {
                fprintf(stderr, "%s <usec>\n", argv[0]);
                return 1;
        }

        usec = atoi(argv[1]);

        while (1) {
                if (usleep(usec)) {
                        perror("usleep");
                        break;
                }
        }

        return 0;
}

The following figure shows the number of broadcast timer interrupts per second when we run the program above with a sleep time of 10ms, 5ms and 1ms. These timer values are long enough for the cpuidle driver to take the decision to go to a deep idle state.

plot_timer_broadcast.jpg

Note that the number of broadcast interrupts per second matches the number of times we expect the timer to expire per second. That is true for the 10ms and 5ms sleep times. But for 1ms, we expect 1000 broadcast interrupts and we get roughly 800 per second, which is a sign that, 20% of the time, the timer duration is too short for the processor to go idle during this period.

Generated by: smp_timer_broadcast(const struct cpumask *mask)

Note: used as the broadcast ops of the struct clock_event_device.

Source of wakeups:

  • tick_handle_oneshot_broadcast() => tick_do_broadcast() => broadcast()

IPI2 : Rescheduling interrupts

On a multicore system, when one cpu is idle and the other one is running, the scheduler may take the decision to wake up the idle processor to run a task. This IPI is used for that. On a UP system, this interrupt never occurs.

The number of these IPIs can be large if the system is running a multi-threaded program whose threads compete for a mutex. As an example, the following code leads the system to generate a lot of rescheduling interrupts. For each processor, a new thread is created and its affinity is set; the thread routine then acquires a mutex and releases it right after. As each thread is tied to a cpu, when one thread releases the lock, the kernel wakes up another cpu in order to let the thread attached to it acquire the lock.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <poll.h>
#include <pthread.h>
#include <sched.h>

#define MAXTHREAD 16
pthread_t threads[MAXTHREAD];
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static void *thread_routine(void *data)
{
        while (1) {
                pthread_mutex_lock(&mutex);
                pthread_mutex_unlock(&mutex);
        }
        return NULL;
}

int main(int argc, char *argv[])
{
        int ret, i;
        int nrthreads;
        cpu_set_t cpumask;

        nrthreads = sysconf(_SC_NPROCESSORS_ONLN);
        if (nrthreads < 0) {
                perror("sysconf");
                return 1;
        }

        if (nrthreads > MAXTHREAD)
                nrthreads = MAXTHREAD;

        CPU_ZERO(&cpumask);

        pthread_mutex_init(&mutex, NULL);

        for (i = 0; i < nrthreads; i++) {

                ret = pthread_create(&threads[i], NULL, thread_routine, NULL);
                if (ret) {
                        /* pthread_create returns a positive error number */
                        fprintf(stderr, "failed to pthread_create: %s\n",
                                strerror(ret));
                        return 1;
                }

                CPU_SET(i, &cpumask);
                pthread_setaffinity_np(threads[i], sizeof(cpumask), &cpumask);
                CPU_CLR(i, &cpumask);
        }

        return poll(0, 0, -1);
}

The figure below shows the number of rescheduling interrupts per second when the program above is run during the period 60 - 120 seconds.

plot_resched.jpg

Generated by : smp_send_reschedule(int cpu)

Source of wakeups:

  • scheduler_tick() => trigger_load_balance() => nohz_balancer_kick() => smp_send_reschedule()

  • resched_cpu() => resched_task() => smp_send_reschedule()

  • wake_up_idle_cpu() => smp_send_reschedule()

  • signal_wake_up() => kick_process() => smp_send_reschedule()

  • wake_up_process() => try_to_wake_up() => ttwu_queue() => ttwu_queue_remote() => smp_send_reschedule()

IPI3 : Function call interrupts

A flow of execution on one cpu may want to execute part of the code on other cpus because it depends on the context of these cpus. One good example is the initialization of the broadcast timer framework, where the booting cpu runs a remote function on all the other cpus with the on_each_cpu function.

The resulting  smp_cross_call  will be invoked on all online cpus.

Generated by :

  • smp_call_function_many(const struct cpumask *mask, smp_call_func_t func, void *info, bool wait)
  • smp_call_function(smp_call_func_t func, void *info, int wait)

Source of wakeups:

  • flush_tlb_mm() => on_each_cpu_mask() => smp_call_function_many()

  • flush_tlb_page() => on_each_cpu_mask() => smp_call_function_many()

  • flush_tlb_range() => on_each_cpu_mask() => smp_call_function_many()

  • flush_tlb_all() => on_each_cpu() => smp_call_function()

  • flush_tlb_kernel_page() => on_each_cpu() => smp_call_function()

  • flush_tlb_kernel_range() => on_each_cpu() => smp_call_function()

  • twd_rate_change() => smp_call_function()

  • access_remote_process() => copy_to_user_page() => flush_ptrace_access() => smp_call_function()

IPI4 : Single function call interrupts

This IPI is the same as IPI3 but occurs only on the targeted cpu instead of all online cpus.

We will spot this IPI with a simple example, which is not necessarily a usual use case. When the frequency is changed on the system, the cpus must reprogram their local timers because they depend on the frequency. The notification mechanism triggers a call to the twd_cpufreq_transition function, which invokes a remote function call on the targeted cpu.

In order to trigger the IPI, we will change the frequency through the cpufreq framework. But first we have to switch the governor to 'userspace'.

echo userspace > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor

And then, let's be tough:

while true; do
        echo 200000 > /sys/devices/system/cpu/cpu1/cpufreq/scaling_setspeed
        echo 800000 > /sys/devices/system/cpu/cpu1/cpufreq/scaling_setspeed
done

plot_single_function.jpg

As expected, the number of single function call interrupts (IPI4) per second increases while the frequency is being changed, in the interval of 60 - 120 seconds.

Generated by:

  • __smp_call_function_single(int cpu, struct call_single_data *data, int wait)
  • smp_call_function_single(int cpu, smp_call_func_t func, void *info, int wait)

  • smp_call_function_any(const struct cpumask *mask, smp_call_func_t func, void *info, int wait)

Source of wakeups:

  • relay_late_setup_files() => smp_call_function_single()

  • try_remote_softirq() => smp_call_function_single()

  • (perf subsystem) => cpu_function_call() => smp_call_function_single()

  • hrtick_update() => hrtick_start_fair() => hrtick_start() => smp_call_function_single()

  • pick_next_task_fair() => hrtick_start_fair() => hrtick_start() => smp_call_function_single()

  • twd_cpufreq_transition() => smp_call_function_single()

IPI5 : CPU stop interrupts

An IPI sent to shut down the cpus.

Power management considerations

Some of the idle states, when they are not so deep, can be used independently on each cpu, but on some specific platforms the deeper C-states need the whole cluster or package to be idle, especially when the cpus share a common power rail. On a dual-core cpu, it could be interesting to have control over such timers in order to have one of the cores solicited as little as possible, so that it reaches a deep idle state. Then, when the other core enters the same idle state, the cluster can go down.

A patchset was written by Viresh Kumar to migrate the timers and the workqueues from an idle cpu to a non-idle cpu, in order to prevent unwanted wakeups.

https://lkml.org/lkml/2012/9/27/188

Vincent Guittot gave a presentation introducing this concept:

http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/08/lpc2012-sched-timer-workqueue.pdf

As mentioned before, the clock event devices go to sleep when the deep idle state is selected, and a broadcast timer is used as a replacement. The cpu in charge of the broadcast sends an IPI to all the cpus with a nonfunctional local tick device. That can also be a source of unnecessary wakeups, as we may want to handle the timer callback from the running cpu instead of waking up all the idle cpus.

RCU deferred maintenance code

RCU callback processing is driven by the scheduling-clock interrupt.

Pending RCU callback functions on a processor lead to a series of scheduling clock interrupts. Those interrupts continue to fire until all of the RCU callback functions have been completed. These RCU related scheduling clock interrupts represent a source of potential wakeups from idle state.

There are two possibilities to reduce the number of these RCU related wakeups:

a) Enforcing Idle

The idle period can be extended by reducing the frequency of RCU state machine invocations. Even when the core has pending RCU callback functions, the scheduling-clock interrupt can be disabled during idle periods: the RCU wakeup source is then no longer the scheduling-clock interrupt but a timer whose period is four times longer than that of the scheduling-clock interrupt itself.

A recent Linux kernel has the option CONFIG_RCU_FAST_NO_HZ to enable this feature.

b) RCU processing offload

RCU callback processing can be offloaded from softirq to kthread context. On asymmetric multi-processing systems, this further allows forcing those kthreads to run on an energy-efficient core instead of a high-performance core.

A recent Linux kernel has the option CONFIG_RCU_NOCB_CPU to offload RCU callback invocation from the set of CPUs specified at boot time by the rcu_nocbs parameter.

Hardware IRQ

Except for the timer, which has already been described above, this kind of interrupt is of little interest to us because it is hardwired and it is difficult to change its behavior. Most of these interrupts result from external events (network, keyboard, mouse, ...).

NMI : Non Maskable interrupts

Even if this kind of interrupt does not exist on ARM, it is interesting to describe what it is for.

Nothing can prevent this interrupt from being raised on the system, which is why it is called 'non maskable'. It is raised when a critical system error occurs.

Normally, you shouldn't see any NMI on your system, but under certain circumstances you may see a large number of them. That is most likely due to the NMI watchdog, a mechanism which dumps the stack trace of the kernel when it hangs in an interrupt context.

WorkingGroups/PowerManagement/Doc/WakeUpSources (last modified 2013-03-01 17:25:32)