Introduction

This article is meant to help people understand the steps involved in porting the IKS (In-Kernel Switcher) solution to their hardware and to plan for that activity. The first part concentrates on the minimum areas to be addressed for a successful port. The second part looks at some of the interactive governor tuneables that need adjustment in order to attain the desired performance level.

For more details on what the IKS solution is, please refer to this companion document or this LWN article.

Porting Guide

The topics below are presented in the order in which they are needed as the kernel is booting. It is strongly suggested to complement this information by studying the code that was written for the TC2 reference platform.

Enabling the IKS functionality

The IKS functionality is enabled by setting the CONFIG_BL_SWITCHER kernel flag to 'y'. From there the switcher (IKS) will be activated by default when the kernel boots.
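
In other words, the resulting kernel configuration should contain:

CONFIG_BL_SWITCHER=y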

At that point it is possible to switch between IKS and MP by passing 'bL_switcher_active=no' to the kernel on the boot command line or via sysfs:

$ echo [0,1] > /sys/kernel/bL_switcher/active

Note that the above mechanisms are for debugging purposes only. Linaro neither recommends nor supports a production system that would alternate between the two models.

The device tree

The layout of clusters and of the CPUs within those clusters must be detailed in the device tree source file. When the kernel starts this information is parsed and used during the initialisation of the SMP subsystem. Cluster and CPU nodes can be laid out as follows:

        clusters {
                #address-cells = <1>;
                #size-cells = <0>;

                cluster0: cluster@0 {
                        reg = <0>;
                        cores {
                                #address-cells = <1>;
                                #size-cells = <0>;

                                core0: core@0 {
                                        reg = <0>;
                                };

                                core1: core@1 {
                                        reg = <1>;
                                };
                        };
                };

                cluster1: cluster@1 {
                        reg = <1>;
                        cores {
                                #address-cells = <1>;
                                #size-cells = <0>;

                                core2: core@0 {
                                        reg = <0>;
                                };

                                core3: core@1 {
                                        reg = <1>;
                                };

                                core4: core@2 {
                                        reg = <2>;
                                };
                        };
                };
        };

        cpus {
                #address-cells = <1>;
                #size-cells = <0>;

                cpu0: cpu@0 {
                        device_type = "cpu";
                        compatible = "arm,cortex-a15";
                        reg = <0>;
                        cluster = <&cluster0>;
                        core = <&core0>;
                        clock-frequency = <1000000000>;
                };

                cpu1: cpu@1 {
                        device_type = "cpu";
                        compatible = "arm,cortex-a15";
                        reg = <1>;
                        cluster = <&cluster0>;
                        core = <&core1>;
                        clock-frequency = <1000000000>;
                };

                cpu2: cpu@2 {
                        device_type = "cpu";
                        compatible = "arm,cortex-a7";
                        reg = <0x100>;
                        cluster = <&cluster1>;
                        core = <&core2>;
                        clock-frequency = <800000000>;
                };

                cpu3: cpu@3 {
                        device_type = "cpu";
                        compatible = "arm,cortex-a7";
                        reg = <0x101>;
                        cluster = <&cluster1>;
                        core = <&core3>;
                        clock-frequency = <800000000>;
                };

                cpu4: cpu@4 {
                        device_type = "cpu";
                        compatible = "arm,cortex-a7";
                        reg = <0x102>;
                        cluster = <&cluster1>;
                        core = <&core4>;
                        clock-frequency = <800000000>;
                };
        };

Multi Cluster Power Management

The multi-cluster power management (MCPM) API is at the heart of both the IKS and MP solutions. It includes a set of machine-specific methods to perform low-level actions. Those methods form a platform-specific backend that must be registered with the MCPM core. Before moving on to the next two subsections it is worth getting acquainted with cluster management race avoidance; documentation on the subject can be found in "Documentation/arm/cluster-pm-race-avoidance.txt" in the kernel repository.

Board Specific Backend

To keep the MCPM core generic, a platform-specific API was introduced. That API (struct mcpm_platform_ops) is used to manage platform-specific cluster and CPU power up and down operations. It is communicated to the MCPM core by calling 'mcpm_platform_register()', something that needs to be done _before_ 'smp_init()' gets called.

The best way to do so is probably to use an 'early_initcall', as done in the TC2 implementation:

static const struct mcpm_platform_ops tc2_pm_power_ops = {
        .power_up       = tc2_pm_power_up,
        .power_down     = tc2_pm_power_down,
        .suspend        = tc2_pm_suspend,
        .powered_up     = tc2_pm_powered_up,
};

static int __init tc2_pm_init(void)
{
        int ret;

        ret = psci_probe();
        if (!ret) {
                pr_debug("psci found. Aborting native init\n");
                return -ENODEV;
        }

        if (!vexpress_spc_check_loaded())
                return -ENODEV;

        tc2_pm_usage_count_init();

        ret = mcpm_platform_register(&tc2_pm_power_ops);
        if (!ret)
                ret = mcpm_sync_init(tc2_pm_power_up_setup);
        if (!ret)
                pr_info("TC2 power management initialized\n");
        return ret;
}

early_initcall(tc2_pm_init);

Two things are worth pointing out in the above:

  • On the TC2, 'vexpress_spc_check_loaded()' verifies that the serial power controller (SPC), a component that plays a crucial role in cluster and CPU management on that specific target, has been initialised. If the platform being ported to has a similar mechanism, this would be a perfect place to ensure that it is initialised properly.
  • By way of 'mcpm_sync_init', the MCPM API provides the functionality to install a very low-level cluster management function to be called every time a cluster is powered up. Please refer to the TC2 implementation for more details on this.

It is very important to have a good understanding of the callback methods defined in 'struct mcpm_platform_ops'. For that purpose the following summary should be complemented with a careful reading of 'mcpm_entry.h', where 'struct mcpm_platform_ops' is defined, and of the examples in 'tc2_pm.c' and 'dcscb.c'. A minimal 'power_up' sketch is also given after the list:

  • power_up: Powers up a CPU on a given cluster. It is recommended to keep track of CPUs and clusters as they are powered up, something that should be done with concurrency in mind. Note that CCI management should also take place at this time.
  • power_down: Expected to take down the CPU on which the code is running, including flushing the CPU's caches and, if needed, switching off the CCI interface and snooping. Once again concurrency should be considered when doing so.
  • suspend: Same as 'power_down', but sets the re-entry point of the CPU before going down.
  • powered_up: Mainly there to clean up after the power up sequence and get ready for the next shutdown, but also to make sure the current cluster can't be switched off.
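
As an illustration, here is a minimal and hypothetical 'power_up' sketch in the spirit of what the TC2 backend does. The 'my_*' names and the 'my_soc_cpu_power_up()' helper are made up for the example; the point is the locking and use-count pattern, not the actual power controller accesses:

#include <linux/irqflags.h>
#include <linux/spinlock.h>

#define MY_MAX_CPUS     4
#define MY_MAX_CLUSTERS 2

/* Hypothetical low-level helper driving the platform's power controller. */
extern void my_soc_cpu_power_up(unsigned int cpu, unsigned int cluster);

static arch_spinlock_t my_pm_lock = __ARCH_SPIN_LOCK_UNLOCKED;
static int my_pm_use_count[MY_MAX_CPUS][MY_MAX_CLUSTERS];

static int my_pm_power_up(unsigned int cpu, unsigned int cluster)
{
        if (cluster >= MY_MAX_CLUSTERS || cpu >= MY_MAX_CPUS)
                return -EINVAL;

        /*
         * Serialise against concurrent power up/down requests before
         * looking at the use counts or touching the hardware.
         */
        local_irq_disable();
        arch_spin_lock(&my_pm_lock);

        my_pm_use_count[cpu][cluster]++;
        if (my_pm_use_count[cpu][cluster] == 1) {
                /*
                 * First user of this CPU: power it up. This is also where
                 * CCI snoops would be enabled if the whole cluster is
                 * coming online.
                 */
                my_soc_cpu_power_up(cpu, cluster);
        }

        arch_spin_unlock(&my_pm_lock);
        local_irq_enable();

        return 0;
}

The TC2 implementation in 'tc2_pm.c' follows the same pattern, with the SPC doing the actual work.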

SMP operations

The MCPM API is wrapped in the 'mcpm_smp_ops' structure and can be found in 'arch/arm/common/mcpm_platsmp.c'. Said structure needs to be passed to 'smp_set_ops()', most likely as part of the initialisation function that is referenced by '.smp_init' in the machine declaration macro.

Taking the TC2 example, the machine declaration looks like:

DT_MACHINE_START(VEXPRESS_DT, "ARM-Versatile Express")
        .dt_compat      = v2m_dt_match,
        .smp_init       = smp_init_ops(vexpress_smp_init_ops),
        .map_io         = v2m_dt_map_io,
        .init_early     = v2m_dt_init_early,
        .init_irq       = irqchip_init,
        .init_time      = v2m_dt_timer_init,
        .init_machine   = v2m_dt_init,
MACHINE_END

In the above, 'vexpress_smp_init_ops()' is assigned to '.smp_init' in the machine description structure. From there 'vexpress_smp_init_ops()' simply feeds 'mcpm_smp_ops' to 'smp_set_ops()' if the MCPM kernel configuration flag has been enabled:

void __init vexpress_smp_init_ops(void)
{
        struct smp_operations *ops = &vexpress_smp_ops;
#ifdef CONFIG_MCPM
        extern struct smp_operations mcpm_smp_ops;
        if (of_find_compatible_node(NULL, NULL, "arm,cci"))
                ops = &mcpm_smp_ops;
#endif
        smp_set_ops(ops);
}

The Clock Framework

The big.LITTLE solutions developed by Linaro use a generic cpufreq driver that provides a very good foundation for CPU frequency scaling. If the target platform is to use this driver then a clock for each cluster must be registered with the common clock API. Note that the generic cpufreq driver assumes that the clock for each cluster is named 'clusterX', where 'X' is the cluster number.

As an example, the TC2 platform has two clusters and as such the SPC initialisation code registers two clocks with the common clock framework, i.e. "cluster0" and "cluster1".

It is expected that the 'clk_ops' for each registered clock have the capability to set the frequency of the cluster they represent. Clocks are registered using 'clk_register()' and 'clk_register_clkdev()', the latter being specifically important for name lookup by the cpufreq driver.

static struct clk_ops clk_spc_ops = { 
        .recalc_rate = spc_recalc_rate,
        .round_rate = spc_round_rate,
        .set_rate = spc_set_rate,
};

struct clk *vexpress_clk_register_spc(const char *name, int cluster_id)
{
        struct clk_init_data init;
        struct clk_spc *spc;
        struct clk *clk;

        if (!name) {
                pr_err("Invalid name passed");
                return ERR_PTR(-EINVAL);
        }   

        spc = kzalloc(sizeof(*spc), GFP_KERNEL);
        if (!spc) {
                pr_err("could not allocate spc clk\n");
                return ERR_PTR(-ENOMEM);
        }   

        spc->hw.init = &init;
        spc->cluster = cluster_id;

        init.name = name;
        init.ops = &clk_spc_ops;
        init.flags = CLK_IS_ROOT | CLK_GET_RATE_NOCACHE;
        init.num_parents = 0;

        clk = clk_register(NULL, &spc->hw);
        if (!IS_ERR_OR_NULL(clk))
                return clk;

        pr_err("clk register failed\n");
        kfree(spc);

        /* Return an error pointer so the IS_ERR() check in callers works. */
        return ERR_PTR(-EINVAL);
}

void __init vexpress_clk_of_register_spc(void)
{
...
...
...
                clk = vexpress_clk_register_spc(name, cluster_id);
                if (IS_ERR(clk))
                        return;

                pr_debug("Registered clock '%s'\n", name);
                clk_register_clkdev(clk, name, NULL);
        }
}
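
For reference, the generic cpufreq driver resolves those clocks by name. A simplified sketch of the lookup (not the actual driver code) looks like this:

char name[14];
struct clk *clk;

snprintf(name, sizeof(name), "cluster%d", cluster);

/* Succeeds only if clk_register_clkdev() registered that name. */
clk = clk_get(NULL, name);
if (IS_ERR(clk))
        pr_err("no clock registered for cluster %d\n", cluster);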

The Generic Cpufreq driver

As mentioned above, the IKS and MP projects have produced a generic cpufreq driver that provides a very good starting point for managing DVFS on a big.LITTLE implementation. The driver has a generic and a platform-specific portion, with the generic part being enabled by switching on the ARM_BIG_LITTLE_CPUFREQ flag in the kernel config.

By itself the generic portion of the driver will not be registered with the cpufreq core. For that to happen a platform specific shim needs to be provided. The said shim should provide a platform specific instantiation of a 'struct cpufreq_arm_bL_ops' that will convey methods to manage the frequency table for each discovered cluster; an illustrative sketch of such methods is given after the structure below:

static struct cpufreq_arm_bL_ops vexpress_bL_ops = {
        .name   = "vexpress-bL",
        .get_freq_tbl = vexpress_get_freq_tbl,
        .put_freq_tbl = vexpress_put_freq_tbl,
};
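
What 'get_freq_tbl' returns depends entirely on how the platform publishes its operating points; on the TC2 the tables come from the SPC. Purely as an illustration, a minimal pair of methods built around a static table could look as follows (the 'example_*' names and the frequencies are made up):

#include <linux/cpufreq.h>
#include <linux/types.h>

static struct cpufreq_frequency_table example_freq_table[] = {
        { .index = 0, .frequency = 500000 },    /* kHz, illustrative values */
        { .index = 1, .frequency = 800000 },
        { .index = 2, .frequency = 1000000 },
        { .index = 3, .frequency = CPUFREQ_TABLE_END },
};

static int example_get_freq_tbl(u32 cluster,
                                struct cpufreq_frequency_table **table)
{
        *table = example_freq_table;
        return 0;
}

static void example_put_freq_tbl(u32 cluster)
{
        /* Nothing to release for a static table. */
}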

From there the initialisation function for the platform specific cpufreq driver should simply call 'bL_cpufreq_register()' with the aforementioned 'struct cpufreq_arm_bL_ops':

static int vexpress_bL_init(void)
{
        if (!vexpress_spc_check_loaded()) {
                pr_info("%s: No SPC found\n", __func__);
                return -ENOENT;
        }

        return bL_cpufreq_register(&vexpress_bL_ops);
}
module_init(vexpress_bL_init);

static void vexpress_bL_exit(void)
{
        bL_cpufreq_unregister(&vexpress_bL_ops);
}
module_exit(vexpress_bL_exit);

It is very important to keep in mind that a clock for each cluster needs to be registered with the common clock API for the generic big.LITTLE cpufreq driver to work properly.

Interactive Governor Tuneables

The IKS solution can work with any governor and isn't specifically tied to the interactive governor. We decided to concentrate on the latter because of the targeted audience and the nature of the workbench used in the performance assessments. Note that tuneables for MP are not yet available due to ongoing development.

When tuning the IKS solution on a given platform it is important to get a reference point, a yardstick that gives an idea of how the system performs. Since there are two A15 processors on the TC2 we decided that our reference system would run with only those two A15 processors - even with optimal settings it is clear that the system can _not_ yield better performance than with this configuration.

Background

Before hoping to get good performance out of IKS a system _must_ be proven to perform optimally when using only the A15 cores.

Most of the software performance gained from a big.LITTLE architecture comes from how the cpufreq governors are tuned. In our dealings with the interactive governor we concentrated on the following six tuneables:

  • go_hispeed_load: CPU load at which the frequency should jump to 'hispeed_freq'.
  • hispeed_freq: frequency to jump to when 'go_hispeed_load' is crossed.
  • above_hispeed_delay: when running above 'hispeed_freq', amount of time to spend at one frequency before moving up to the next one (in usec).
  • timer_rate: interval at which the CPU load is sampled to decide whether a frequency adjustment is needed (in usec).
  • min_sample_time: amount of time that must be spent at a given frequency before moving down to the next one (in usec).
  • target_loads: an example is probably best for this one:
    • 85 600000:90 1000000:97

The above indicates that for all frequencies below 600MHz the target load should be 85%, from 600MHz up to (but _below_) 1GHz the target load should be 90%, and for 1GHz and above, 97%.

TC2 settings

  • target_loads: "85 1000000:97": targeting a CPU load average of 85% up until 1GHz, where the target load becomes 97%.
  • go_hispeed_load: 90%
  • hispeed_freq: 500MHz
  • above_hispeed_delay: 5000 usec
  • timer_rate: 10000 usec
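
Assuming the interactive governor is selected, those values can be applied at run time through sysfs. The path below is the usual location for the interactive governor's tuneables but may vary between kernel trees:

$ cd /sys/devices/system/cpu/cpufreq/interactive
$ echo "85 1000000:97" > target_loads
$ echo 90 > go_hispeed_load
$ echo 500000 > hispeed_freq
$ echo 5000 > above_hispeed_delay
$ echo 10000 > timer_rate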

The above settings allowed IKS on the TC2 implementation to reach a 60/90 ratio, that is, 90% of the performance yielded by an A15-only configuration using 60% of the power. That performance ratio was obtained when running the bbench application with audio playing in the background - it is expected that other scenarios will yield different performance metrics. It cannot be stressed enough that the same settings will very likely produce different results on another platform.
