The big.LITTLE in-kernel switcher (IKS) is a piece of code that manages CPU resources across a cluster of Cortex-A15 cores and a cluster of Cortex-A7 cores. The goal is to balance performance against power consumption by migrating execution state between the two clusters. Additional information on the switcher concept can be found in the following LWN article:

What follows is a detailed explanation of how the in-kernel big.LITTLE switcher implementation works. Knowledge of the concepts covered in the above article is assumed. More information is also available directly in the code and in the various commit logs for the related patches.

High-level Overview

The switcher comprises four parts:

1. Low-level power management

This is responsible for powering up and down individual CPUs, enabling and disabling caches, power planes, etc. This is shared for the most part with the suspend/resume and CPU hotplug code. In a real product, this part is highly platform specific and possibly implemented partly in Secure World.

2. Switcher core

This part handles the switching process itself, including the saving of the execution context for the outbound CPU, migration of interrupts from the outbound CPU to the inbound CPU, and restoration of the execution state on the inbound CPU. While the inbound CPU resumes normal kernel code execution, the outbound CPU takes care of shutting itself down. This is all platform independent code calling into the low-level power management code described in (1) above.

3. The core cpufreq layer

This includes the standard Linux cpufreq subsystem, augmented with a special driver that provides an adaptation layer between that subsystem and the switcher core code described in (2). The role of this driver is to present the two clusters as different operating points and to convert CPU frequency changes into cluster switch requests. This part is generic code as well. However, platform specific adaptations are required to map the various intra-cluster operating points to actual clock frequency changes, just like on traditional systems.

4. The cpufreq policy governors

These are kernel modules or user-space daemons that monitor system activity and request CPU frequency changes from the cpufreq core described in (3). Various cpufreq governors already exist for different usage profiles, and new ones can also be created. It is up to the system integrator to select and tune the appropriate governor for the intended workload. Therefore this part won't be covered here, but documentation on cpufreq governors can be found in the Documentation/cpu-freq/ directory in the Linux kernel source tree or on the Internet.

Implementation Details

Let's approach this in a top-down fashion, i.e. in the reverse order from the list in the previous section. As mentioned, the cpufreq policies and governors are outside the scope of this document.

The switcher cpufreq driver

This is found in the file drivers/cpufreq/arm_big_little.c. The core principle is to present the A7 and the A15 cluster frequencies to the cpufreq core on a single, uniform and contiguous scale, as would be the case on a single-cluster architecture. From there the cpufreq core is concerned only with those virtual frequencies and is unaware of the cluster intricacies. It is up to the driver to take the frequencies requested by the cpufreq core and do the cluster management. If the needed operating point can be accommodated on the current cluster, the frequency is adjusted. If not, a cluster switch is requested, and the resulting operating frequency is adjusted and communicated back to the core on its virtual scale.

The current driver is fairly generic: it gets a handle on the cluster operating points through an ops structure and manipulates the clock using the clk_get_rate()/clk_set_rate() interface. It is also able to recognize and deal with CPU hotplug events when the switcher is enabled or disabled at run time.
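To make the idea of a virtual frequency scale concrete, here is a minimal, self-contained C sketch of how requests on such a scale could be mapped back to a cluster and an actual frequency. The frequency tables and the divisor used to fold the A7 points onto the virtual scale are illustrative assumptions, not the actual values used by drivers/cpufreq/arm_big_little.c:

  #include <stdio.h>

  enum cluster { CLUSTER_A7, CLUSTER_A15 };

  /*
   * Illustrative operating points in kHz.  On the virtual scale the A7
   * points are divided by an assumed factor so that both clusters fit on
   * one contiguous, monotonic frequency axis.
   */
  static const unsigned int a7_freqs[]  = {  350000,  500000,  700000, 1000000 };
  static const unsigned int a15_freqs[] = {  500000,  700000, 1000000, 1200000 };
  #define NR_FREQS        4
  #define A7_VIRT_DIVISOR 2       /* assumed virtualization factor */

  /* Convert an actual cluster frequency to the virtual scale seen by cpufreq. */
  static unsigned int actual_to_virt(enum cluster c, unsigned int khz)
  {
          return c == CLUSTER_A7 ? khz / A7_VIRT_DIVISOR : khz;
  }

  /*
   * Given a virtual frequency request from the cpufreq core, pick the
   * cluster that should run and the actual frequency to program.  A
   * cluster switch is needed whenever the result differs from the
   * currently running cluster.
   */
  static enum cluster virt_to_target(unsigned int virt_khz,
                                     unsigned int *actual_khz)
  {
          const unsigned int *table = a7_freqs;
          enum cluster c = CLUSTER_A7;
          unsigned int i;

          /* Anything above the top virtual A7 point must run on the A15s. */
          if (virt_khz > actual_to_virt(CLUSTER_A7, a7_freqs[NR_FREQS - 1])) {
                  table = a15_freqs;
                  c = CLUSTER_A15;
          }

          for (i = 0; i < NR_FREQS - 1; i++)
                  if (actual_to_virt(c, table[i]) >= virt_khz)
                          break;
          *actual_khz = table[i];
          return c;
  }

  int main(void)
  {
          unsigned int actual;
          enum cluster c = virt_to_target(600000, &actual);

          printf("virtual 600000 kHz -> %s at %u kHz\n",
                 c == CLUSTER_A7 ? "A7" : "A15", actual);
          return 0;
  }

In the real driver, the chosen actual frequency would then be applied with clk_set_rate() on the current cluster's clock, or a cluster switch would be requested first when the target cluster differs from the one currently running.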

The core switcher code

This is found in the file arch/arm/common/bL_switcher.c. The main entry point for this code is the function bL_switch_request(). Its arguments are the logical CPU number for which a switch is requested, and the destination cluster for that logical CPU. The request is forwarded to the appropriate kernel thread responsible for performing switch operations for that CPU. There is one thread per logical CPU waiting for those switch requests to occur.
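The dispatch model, one dedicated thread per logical CPU that sleeps until a switch request arrives for that CPU, can be illustrated with the user-space analogue below. It is only a sketch: the pthread mutex/condition-variable pair stands in for the kernel thread and wait queue used in arch/arm/common/bL_switcher.c, and names such as switch_thread() and switch_request() are made up for the example:

  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  #define NR_CPUS 4

  struct switch_slot {
          pthread_mutex_t lock;
          pthread_cond_t  wake;
          int             pending;        /* a switch request is queued */
          unsigned int    target_cluster; /* requested destination cluster */
  };

  static struct switch_slot slots[NR_CPUS];

  /* One of these runs per logical CPU, mirroring the per-CPU kernel threads. */
  static void *switch_thread(void *arg)
  {
          struct switch_slot *slot = arg;

          for (;;) {
                  pthread_mutex_lock(&slot->lock);
                  while (!slot->pending)
                          pthread_cond_wait(&slot->wake, &slot->lock);
                  slot->pending = 0;
                  unsigned int cluster = slot->target_cluster;
                  pthread_mutex_unlock(&slot->lock);

                  /* In the kernel, this is where bL_switch_to() would run. */
                  printf("switching this CPU to cluster %u\n", cluster);
          }
          return NULL;
  }

  /* Analogue of bL_switch_request(): hand the request to the right thread. */
  static void switch_request(unsigned int cpu, unsigned int cluster)
  {
          struct switch_slot *slot = &slots[cpu];

          pthread_mutex_lock(&slot->lock);
          slot->target_cluster = cluster;
          slot->pending = 1;
          pthread_cond_signal(&slot->wake);
          pthread_mutex_unlock(&slot->lock);
  }

  int main(void)
  {
          pthread_t threads[NR_CPUS];

          for (int i = 0; i < NR_CPUS; i++) {
                  pthread_mutex_init(&slots[i].lock, NULL);
                  pthread_cond_init(&slots[i].wake, NULL);
                  pthread_create(&threads[i], NULL, switch_thread, &slots[i]);
          }

          switch_request(2, 1);   /* ask logical CPU 2 to move to cluster 1 */
          sleep(1);               /* give the worker a chance to run */
          return 0;
  }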

The core switch operation is handled by bL_switch_to(), which must be called on the CPU for which a switch is requested. What this code does (a condensed sketch follows the list below):

  • Return early if the current cluster is already the wanted one.
  • Close the gate in the kernel entry vector for both the inbound and outbound CPU.
  • Wake up the inbound CPU so it can perform its reset sequence in parallel up to the kernel entry vector gate.
  • Migrate all interrupts in the GIC targeting the outbound CPU interface to the inbound CPU interface, including SGIs. This is performed by gic_migrate_target() in arch/arm/common/gic.c.
  • Shut down the local timer for the outbound CPU.
  • Call cpu_pm_enter(), which takes care of flushing the VFP state to RAM and saving the GIC CPU interface configuration to RAM.
  • Call cpu_suspend(), which saves the CPU state (general purpose registers, page table address) onto the stack, stores the resulting stack pointer in an array indexed by processor number, and then calls the provided shutdown function. This happens in arch/arm/kernel/sleep.S.
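Put together, the outbound half of this sequence can be summarized by the condensed, kernel-style sketch below. This is not the actual bL_switcher.c code: the entry-vector gate handling, the inbound wake-up, the local timer shutdown and all error handling are reduced to comments, the inbound_cpu_if parameter and the bL_shutdown() callback name are placeholders invented for the sketch, and only the function names already mentioned in this document (gic_migrate_target(), cpu_pm_enter(), cpu_suspend(), cpu_pm_exit(), bL_cpu_power_down()) are real:

  /* Condensed sketch of bL_switch_to(); see the real code for the details. */
  static int bL_shutdown(unsigned long arg)       /* placeholder callback name */
  {
          /* Runs on the outbound CPU once its state has been saved. */
          bL_cpu_power_down();
          return 1;                               /* never actually reached */
  }

  static int bL_switch_to_sketch(unsigned int new_cluster,
                                 unsigned int inbound_cpu_if)
  {
          /* 1. Return early if we already run on the wanted cluster. */
          /* 2. Close the kernel entry gate for both CPUs (omitted here). */
          /* 3. Wake up the inbound CPU (omitted here). */

          local_irq_disable();
          local_fiq_disable();

          /* 4. Redirect all interrupts, including SGIs, to the inbound CPU. */
          gic_migrate_target(inbound_cpu_if);

          /* 5. Shut down the local timer on the outbound CPU (omitted here). */

          /* 6. Flush the VFP state and save the GIC CPU interface config. */
          cpu_pm_enter();

          /*
           * 7. Save the general purpose registers and page table address,
           *    record the stack pointer, then call the shutdown function.
           *    Execution resumes past this call on the inbound CPU.
           */
          cpu_suspend(0, bL_shutdown);

          /* From here on we are running on the inbound CPU. */
          cpu_pm_exit();

          /* Restore the local timer and re-enable interrupts (omitted here). */
          local_fiq_enable();
          local_irq_enable();
          return 0;
  }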

At this point, the provided shutdown function executed by the outbound CPU ungates the inbound CPU. Therefore the inbound CPU:

  • Picks up the saved stack pointer in the array indexed by processor number above. At the moment the corresponding code in arch/arm/kernel/sleep.S only looks at the CPU number field in the MPIDR so the current code works unmodified even if the new CPU comes from a different cluster.
  • The MMU and caches are re-enabled using the saved state on the provided stack, just as if this were a resume operation from a suspended state.
  • Then cpu_suspend() returns, although this is on the inbound CPU rather than the outbound CPU which called it initially.
  • The function cpu_pm_exit() is called, whose effect is to restore the GIC CPU interface state using the state previously saved by the outbound CPU.
  • The local timer on the inbound CPU is restored.
  • bL_switch_to() returns and normal kernel execution resumes on the new CPU.

However, the outbound CPU is potentially still running in parallel while the inbound CPU resumes normal kernel execution, hence the need for per-CPU stack isolation when executing bL_do_switch(). After the outbound CPU has ungated the inbound CPU, it calls bL_cpu_power_down() to:

  • Clean its L1 cache.
  • If it is the last CPU still alive in its cluster (last man standing), also clean its L2 cache and disable cache snooping from the other cluster.
  • Enter WFI so the CPU can be taken into reset.

Two optimizations were added to the above sequence as well:

  1. The outbound CPU is kept alive until the inbound CPU is done resuming its state. This allows the inbound CPU to snoop the outbound CPU's cache rather than having to fetch everything from RAM.
  2. After the outbound CPU has sent a wake-up request to the inbound CPU, it waits until the inbound CPU signals its readiness to proceed with the resume tasks before saving its state. This allows the outbound CPU to schedule other tasks during the delay required to power up the inbound CPU, a delay that can be quite significant when the inbound cluster has to be activated as well.

Simple trace events are included to allow tracing of CPU migration events and the associated latency. Usage details can be found in the message log for the commit adding this support titled "ARM: bL_switcher: Basic trace events support".

The GIC state migration

To properly transfer interrupts from the outbound CPU to the inbound CPU, a few things have to be done in sequence. This is implemented in gic_migrate_target(). The steps are:

  1. No IRQs should be "active" when the switch is happening. This is enforced by having bL_switch_to() disable interrupts during the critical switch code path before calling gic_migrate_target().
  2. gic_cpu_map is updated so new SGIs are directed to the CPU interface of the inbound CPU.
  3. The target mask for each peripheral interrupt is inspected, and if it matches the outbound CPU then it is updated to target the inbound CPU instead.
  4. Pending states for SGIs on the outbound CPU are cleared and re-issued on the inbound CPU.

At this point the GIC interface for the outbound CPU should be quiet. Eventually, IRQs are unmasked on the inbound CPU and interrupt servicing resumes.
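As an illustration of step 3, retargeting the peripheral interrupts amounts to rewriting the per-interrupt target bytes in the GIC distributor. The sketch below is a simplified stand-in for that part of gic_migrate_target(): it assumes a memory-mapped GICv2 distributor, uses register offsets from the GICv2 architecture specification, and only hints at the SGI transfer of step 4 in a comment:

  #include <stdint.h>

  #define GICD_ITARGETSR  0x800   /* one target byte per interrupt ID (GICv2) */

  /*
   * Retarget every peripheral interrupt (ID >= 32) currently routed to the
   * outbound CPU interface so that it targets the inbound CPU interface.
   * 'dist_base' is the mapped GIC distributor base and 'nr_irqs' the number
   * of implemented interrupt IDs.
   */
  static void retarget_spis(volatile uint8_t *dist_base, unsigned int nr_irqs,
                            unsigned int outbound_if, unsigned int inbound_if)
  {
          volatile uint8_t *targets = dist_base + GICD_ITARGETSR;
          uint8_t outbound_mask = 1u << outbound_if;
          uint8_t inbound_mask  = 1u << inbound_if;
          unsigned int irq;

          /* IDs 0-31 (SGIs/PPIs) are banked per CPU and not retargeted here. */
          for (irq = 32; irq < nr_irqs; irq++) {
                  uint8_t cur = targets[irq];

                  if (cur & outbound_mask)
                          targets[irq] = (cur & ~outbound_mask) | inbound_mask;
          }

          /*
           * Step 4 (not shown): read the outbound CPU's pending SGIs from the
           * GICD_CPENDSGIRn registers, clear them there, and re-issue the same
           * SGIs to the inbound CPU through GICD_SGIR.
           */
  }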

The low-level power management layer

This part is mostly platform specific. The bL_platform_power_ops structure declared in arch/arm/include/asm/bL_entry.h must be filled with the appropriate methods to provide CPU power up and power down functionalities, as well as handling cluster power up/down when necessary. This structure must be registered using bL_platform_power_register().
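As a shape-of-the-code illustration, a platform registration could look like the sketch below. The .power_up/.power_down field names, their signatures and the return value of bL_platform_power_register() are assumptions made for the example; the authoritative definition of the ops structure is in arch/arm/include/asm/bL_entry.h:

  #include <linux/init.h>
  #include <asm/bL_entry.h>

  /*
   * Hypothetical platform hooks.  A real implementation talks to the
   * platform's power controller, possibly through Secure World.
   */
  static void myplat_power_up(unsigned int cpu, unsigned int cluster)
  {
          /* Deassert resets, enable clocks, power up the CPU (and the
             cluster first, if it is currently down). */
  }

  static void myplat_power_down(void)
  {
          /* Clean caches, possibly tear down the cluster, then enter WFI
             (see the helpers described below). */
  }

  static struct bL_platform_power_ops myplat_power_ops = {
          .power_up   = myplat_power_up,          /* assumed field name */
          .power_down = myplat_power_down,        /* assumed field name */
  };

  static int __init myplat_bL_power_init(void)
  {
          return bL_platform_power_register(&myplat_power_ops);
  }
  early_initcall(myplat_bL_power_init);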

This part is very tricky and difficult to implement correctly. Many races are possible and they cannot be avoided using the traditional exclusion locking mechanisms provided by the kernel. For example, if the power_down method has determined that it is about to power down the last CPU in its cluster, it should also power down the cluster. However, to power down the cluster, it is necessary to also disable that cluster's interface in the cache coherency interconnect (CCI). This can be done only once the CPU cache has been disabled and flushed to RAM, at which point the LDREX and STREX instructions normally used for locking are no longer usable. In the meantime, a previously disabled CPU in the same cluster might have been powered up, possibly racing to set the cluster back up as well.

The implemented algorithms to safely power up and down CPUs and clusters are described in the Linux kernel source tree. The following documents are provided:

  • Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
  • Documentation/arm/big.LITTLE/vlocks.txt

Helper methods to coordinate between CPUs coming down and CPUs going up are provided so that cluster teardown and setup operations are not performed for the same cluster simultaneously. These helpers implement specially crafted algorithms to work around the restrictions imposed by the uncertainty about the cache state on the other CPUs.

For use in the power_down() implementation:

  • bL_cpu_going_down(unsigned int cluster, unsigned int cpu)

  • bL_outbound_enter_critical(unsigned int cluster)
  • bL_outbound_leave_critical(unsigned int cluster)

  • bL_cpu_down(unsigned int cluster, unsigned int cpu)

The definition for those helpers is provided in the source code found in arch/arm/common/bL_entry.c.
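To show how these helpers fit together, here is a heavily simplified outline of a power_down() implementation. It is a sketch of the intent only: the real constraints (which cache operations happen when, at what point LDREX/STREX stop being usable, the exact last-man handshake) are spelled out in the documents listed above, the return-value convention assumed for bL_outbound_enter_critical() is a guess, and every myplat_*() helper is hypothetical:

  /* Sketch only: the authoritative rules are in cluster-pm-race-avoidance.txt. */
  static void myplat_power_down(void)
  {
          unsigned int mpidr = read_cpuid_mpidr();        /* asm/cputype.h */
          unsigned int cpu = mpidr & 0xff;
          unsigned int cluster = (mpidr >> 8) & 0xff;
          int last_man = myplat_dec_cpu_count(cluster) == 0;  /* hypothetical */

          bL_cpu_going_down(cluster, cpu);

          if (last_man && bL_outbound_enter_critical(cluster)) {
                  /*
                   * We won the race to take the cluster down: clean the L2
                   * cache, disable this cluster's CCI interface, and mark the
                   * cluster as down (all inside the hypothetical helper).
                   */
                  myplat_cluster_teardown(cluster);
                  bL_outbound_leave_critical(cluster);
          } else {
                  /* Only this CPU goes down; the cluster stays up. */
                  myplat_cpu_cache_clean();
          }

          bL_cpu_down(cluster, cpu);

          /* Finally enter WFI and wait for power-off or reset. */
          myplat_cpu_enter_wfi();
  }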

On the power-up path, proper synchronization with the above functions is already implemented in generic code. The platform must only provide a power_up_setup() helper to do platform-specific setup in preparation for turning the cluster on, such as entering coherency by enabling the appropriate CCI interface, etc. It must be written in assembly for now, since it runs before the MMU can be turned on.

It is a good idea to inspect and study the implementation for the Dual Cluster System Control Block (DCSCB) in arch/arm/mach-vexpress/dcscb.c and arch/arm/mach-vexpress/dcscb_setup.S at this point. This code is well commented and should be a good example to follow.

The boot protocol

It is possible for switches to occur simultaneously on multiple CPU pairs. Therefore, re-entry into the kernel must be controlled on a per-CPU basis. The traditional holding pen protocol used on Versatile Express and some other SMP systems doesn't work for the switcher, as it lets only one CPU at a time into the kernel. This is why an alternative entry point is provided by the code in arch/arm/common/bL_head.S with the gate control in arch/arm/common/bL_entry.c.
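Conceptually, the gate is just a table of entry addresses with one slot per (cluster, CPU) pair: the kernel publishes the inbound CPU's continuation address in its slot and sends an event, while the code in arch/arm/common/bL_head.S spins in WFE until its own slot becomes non-zero and then branches there. The C sketch below shows the kernel-side half of that idea; the array and function names are made up for the example, and the real data structures live in arch/arm/common/bL_entry.c:

  #include <stdint.h>

  #define MAX_CLUSTERS     2
  #define CPUS_PER_CLUSTER 4

  /*
   * One entry address per (cluster, CPU).  A zero value means "gate closed":
   * a CPU coming out of reset must keep waiting.  The name is made up; the
   * real array lives in arch/arm/common/bL_entry.c and is read by bL_head.S
   * with the MMU off, so writes must also be cleaned from the cache.
   */
  static volatile uintptr_t bl_entry_vectors[MAX_CLUSTERS][CPUS_PER_CLUSTER];

  /* Open the gate for one specific CPU by publishing its entry point. */
  static void set_entry_vector(unsigned int cluster, unsigned int cpu,
                               void (*entry)(void))
  {
          bl_entry_vectors[cluster][cpu] = (uintptr_t)entry;
          /*
           * In the kernel this is followed by a cache clean to the point of
           * coherency, a DSB, and a SEV so that a CPU waiting in WFE in
           * bL_head.S re-reads its slot and branches to 'entry'.
           */
  }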

At the boot ROM and firmware level, it is important not to make any CPU special: even CPU 0 should be able to shut itself down and be resumed without the firmware presuming this is a cold boot. There is no need for a separate secondary start address per CPU at the firmware level; the kernel has to provide its own CPU reset vectors and gating mechanism anyway.

So, given a single location for the kernel secondary start address, the boot protocol should implement the following algorithm:

From CPU reset:

  • Initialize CPU specific things such as the architected timer frequency, etc. RAM must not be altered.
  • Load the secondary start address value. Its location is machine specific. For example, Versatile Express uses SYS_FLAGS in the SYSREGS space. RAM must not be altered.
  • If the secondary start address is not zero then this is a warm boot and execution should branch to that address right away. RAM must not be altered.
  • Otherwise (the secondary start address is zero), and only then, CPU 0 becomes special: if this is CPU 0 then perform a standard cold boot. Obviously, RAM can be altered in that case.
  • If this is not CPU 0 then the CPU must execute a WFE and reload the secondary start address until it is non-zero, then branch to that address. RAM must not be altered.

Here's some example code for Versatile Express implementing the above:


        @ Program architected timer frequency
        mrc     p15, 0, r0, c0, c1, 1   @ CPUID_EXT_PFR1
        lsr     r0, r0, #16
        ands    r0, r0, #1              @ Check generic timer support
        beq     1f
        ldr     r0, =24000000           @ 24MHz timer frequency
        mcr     p15, 0, r0, c14, c0, 0  @ CNTFRQ

        /*
         * If SYS_FLAGS is already set, this is a warm boot and we blindly
         * branch to the indicated address right away, irrespective of the
         * CPU we are.
         */
1:      ldr     r4, =0x1c010030         @ V2M SYS_FLAGS register
        ldr     r0, [r4]
        cmp     r0, #0
        bxne    r0

        /*
         * Otherwise this is a cold boot.  In this case it depends if
         * we are the primary CPU or not.  The primary CPU boots the system
         * while the secondaries wait for the primary to set SYS_FLAGS.
         */
        mrc     p15, 0, r0, c0, c0, 5   @ MPIDR
        tst     r0, #0xff               @ CPU number within the cluster
        tsteq   r0, #(0xff << 8)        @ cluster number
        bleq    primary_cold_boot       @ CPU 0 in cluster 0: cold boot

2:      wfe
        ldr     r0, [r4]
        cmp     r0, #0
        bxne    r0
        b       2b

The goal here is to keep the boot protocol as simple and flexible as possible. More complex CPU gating and dispatching should be performed by updatable kernel code, not by firmware ROM.
