Scheduler mini-summit

(aka 'Scheduling to save power')

Details

Attendees

  • Paul McKenney (IBM/Linaro)

  • Peter Zijlstra (Red Hat)
  • Paul Turner (Google)
  • Vincent Guittot (ST-E/Linaro)
  • Suresh Siddha (Intel)
  • Venkatesh Pallipadi (Google)
  • Kevin Hilman (TI)
  • ?? (ARM)
  • Amit Kucheria (Canonical/Linaro)
  • -Ingo Molnar (Red Hat)-

  • -Mike Galbraith (SUSE)-

Proposed topics for the agenda

Topology and ARM architecture

ARM's big.LITTLE architecture

  • Heterogeneous multi-processing (HMP), with considerable variation permitted:
    1. The number of big and LITTLE CPUs need not be equal. For example, Cortex-A15 (big) and Cortex-A7 (LITTLE) permit at most four CPUs of each type, but within that constraint a system could have any number of each.
    2. OMAP's upcoming implementation cannot power down CPU 0 for TrustZone reasons.
    3. Nvidia's upcoming implementation cannot have both big and LITTLE CPUs active at the same time.
    4. UX500 cannot power down CPU 1 alone; it must power down the entire cluster containing CPU 1.
    5. In general, ARM manufacturers have great freedom in defining power domains.
    6. Data caches are PIPT; instruction caches are PIPT on Cortex-A15, but VIPT (aliasing) on Cortex-A7.
  • How to teach Linux to use it efficiently, i.e., how to make the scheduler aware of the 'power cost' of scheduling on a particular core
    1. Power consumption of CPU when running (also vs. performance)
    2. Power penalty for turning CPU on/off
    3. Power consumption for the rest of the system (display: backlight and e-ink; radios: Wi-Fi and cellular; storage; ...)
    4. Application knowledge (if any) of duration of CPU-bound period
    5. Different ARM end-user devices will have very different power-consumption characteristics; consider the e-ink Amazon Kindle compared to the backlit Samsung Galaxy Tab.
    6. Even within a given end-user device class, different ARM SoCs have different power-consumption characteristics; for example, Samsung Galaxy phones use either TI OMAP4 or Samsung's own Exynos SoC.

    7. Many SoCs have thermal limitations, and running all big CPUs at full frequency for extended periods of time will usually exceed the thermal envelope.

    8. Scheduler interaction may be required for profiling (e.g., perf) due to differences between PMUs between big and LITTLE CPUs.
    9. Any way to estimate cache footprint in order to estimate migration penalties?
  • Whitepaper on big.LITTLE
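The 'power cost' factors listed above (running power vs. performance, and the penalty for turning a CPU on) can be combined into a rough per-task energy estimate. The following is only a sketch under stated assumptions: the `Core` fields, the function names, and every number in the usage note are hypothetical illustrations, not measurements or an existing kernel interface.

```python
from dataclasses import dataclass


@dataclass
class Core:
    """Hypothetical per-core description; all fields are illustrative."""
    name: str
    active_power_mw: float  # power drawn while running a task
    relative_speed: float   # work units completed per millisecond
    wake_energy_uj: float   # one-off energy to power the core on
    is_on: bool             # whether the core is already powered up


def task_energy_uj(core, work_units):
    """Estimate the energy (in microjoules) to run a task on one core."""
    runtime_ms = work_units / core.relative_speed
    energy = core.active_power_mw * runtime_ms  # mW * ms = uJ
    if not core.is_on:
        energy += core.wake_energy_uj  # pay the power-on penalty
    return energy


def cheapest_core(cores, work_units, deadline_ms=None):
    """Pick the lowest-energy core that still meets an optional deadline.

    Returns None when no core can finish the work in time.
    """
    candidates = [
        c for c in cores
        if deadline_ms is None or work_units / c.relative_speed <= deadline_ms
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda c: task_energy_uj(c, work_units))
```

For example, with a powered-down big core (1000 mW, speed 2.0, 500 uJ to wake) and a running LITTLE core (200 mW, speed 1.0), a 10-unit task goes to the LITTLE core unless a deadline shorter than 10 ms forces it onto the big core. A real policy would also need the system-wide and thermal factors from the list, which this sketch ignores.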

Generic Topology

  • Is there a generic way to use topology information in the scheduler?
    1. Clock trees
    2. Power domains
    3. Caches on/off based on which CPUs are sharing a given cache
    4. Caches on/off based on cost of flushing the cache vs. the power it consumes if left on and the expected idle time
    5. Cost of waking up a given CPU depends on both the CPU type (big vs. LITTLE) and on its power state (e.g., caches already powered on or not?)
  • Can we make the same decision across all architectures for a given topology description?
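The cache on/off trade-off above (cost of flushing vs. power consumed if left on, over the expected idle time) amounts to a break-even calculation. A minimal sketch follows, assuming a constant retention power and a fixed flush energy. The refill energy term is an added assumption not mentioned in the list, and all parameter names are hypothetical.

```python
def break_even_idle_ms(retention_power_mw, flush_energy_uj, refill_energy_uj=0.0):
    """Idle time beyond which powering the cache off saves energy.

    mW * ms = uJ, so dividing uJ by mW yields milliseconds.
    """
    if retention_power_mw <= 0:
        raise ValueError("retention power must be positive")
    return (flush_energy_uj + refill_energy_uj) / retention_power_mw


def should_power_off_cache(retention_power_mw, flush_energy_uj,
                           expected_idle_ms, refill_energy_uj=0.0):
    """True when the expected idle period exceeds the break-even time."""
    threshold = break_even_idle_ms(
        retention_power_mw, flush_energy_uj, refill_energy_uj)
    return expected_idle_ms > threshold
```

With hypothetical values of 50 mW retention power, 400 uJ to flush, and 100 uJ to refill, the break-even idle time is 10 ms; shorter expected idle periods favor leaving the cache powered.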

Some figures on multi-core ARM

Spurious wake ups on multi-core ARM SoCs

  • Hardware-accelerated media workloads with periodic short-duration CPU demands pose challenges for current Linux kernel power management.
    1. Intervals between CPU bursts are relatively long; race to idle does not provide significant gains.
    2. Increasing frequency consumes more power than would be saved by racing to idle, because power consumption rises roughly as the square of the frequency.
  • However, different SoCs have different frequency/power characteristics, so different SoCs will have different optimal power-management strategies.

    1. Leakage current varies as a function of voltage (and in turn frequency)
    2. Different HW accelerators have different CPU requirements
    3. Different process technologies have different power-dissipation characteristics
    4. Different decoding algorithms have different parallelization characteristics, requiring different power-management strategies
  • Need to compare different power-management strategies (for example, sched_mc vs. CPU hotplug) using realistic workloads.
  • See some experiments for example comparisons
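The race-to-idle argument above can be illustrated with a simple per-period energy model. This is a sketch under assumptions, not a measured result: it takes active power to be `k * f**2` plus a frequency-independent static (leakage) term, which simplifies the voltage-dependent leakage noted in the list, and all coefficients and names are hypothetical.

```python
def period_energy_uj(freq_ghz, work_mcycles, period_ms,
                     k_mw_per_ghz2, static_mw, idle_mw):
    """Energy (uJ) for one period of a periodic workload.

    The CPU runs `work_mcycles` at `freq_ghz`, then idles until the
    period ends. Megacycles divided by GHz gives milliseconds.
    """
    busy_ms = work_mcycles / freq_ghz
    if busy_ms > period_ms:
        raise ValueError("work does not fit in the period at this frequency")
    # Assumed model: dynamic power grows as the square of the frequency.
    active_mw = k_mw_per_ghz2 * freq_ghz ** 2 + static_mw
    return active_mw * busy_ms + idle_mw * (period_ms - busy_ms)


def best_frequency(freqs_ghz, **model):
    """Return the candidate frequency with the lowest per-period energy."""
    return min(freqs_ghz, key=lambda f: period_energy_uj(f, **model))
```

Under this model, with a hypothetical 2-megacycle burst every 20 ms and low static power, `best_frequency([0.5, 1.0], ...)` picks the slower frequency; raising the static power enough makes racing to idle at the higher frequency cheaper. That sensitivity to leakage is one reason different SoCs can end up with different optimal strategies.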

Hotplug

  • Device manufacturers currently make heavy use of CPU hotplug in production
  • What are the alternatives?
  • What is required to make the alternatives show power savings equal to that of CPU hotplug?

Load balance

Getting the most benefit out of each wakeup

  • It is very expensive to wake up a core for a small task; if we must wake up a CPU, we need to get our money's worth.
  • Sometimes waking up a CPU is slower than letting the task wait for an already powered up CPU to become available.
  • Although many mobile workloads are extremely predictable, the greater the user interaction for a given workload, the less predictable that workload will be.
  • Thermal issues limit the amount of time that a given CPU can run above a given frequency. A CPU-bound task might need to migrate among CPUs for thermal reasons.
  • Frequency changes take time to carry out. The CPU is running in the meantime at an intermediate frequency.
  • Recent CPUs (Cortex-A7 and Cortex-A15) have synchronized time across all cores.
  • How to decide?
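One possible shape for the wake-or-wait decision is sketched below. It is an illustration only: the function name, parameters, and the `min_runtime_us` threshold are hypothetical, and a real policy would also have to weigh the thermal, predictability, and frequency-transition points listed above.

```python
def choose_placement(wake_latency_us, expected_wait_us, task_runtime_us,
                     min_runtime_us=0.0):
    """Decide whether to wake an idle CPU or queue on a running one.

    Returns "wake" or "wait". Ties go to "wait", since that avoids
    the wake-up cost.
    """
    # Tasks too short to amortize a wake-up never justify one.
    if task_runtime_us < min_runtime_us:
        return "wait"
    finish_if_wake = wake_latency_us + task_runtime_us
    finish_if_wait = expected_wait_us + task_runtime_us
    return "wake" if finish_if_wake < finish_if_wait else "wait"
```

For instance, a 500 us wake-up latency against a 200 us expected wait favors waiting, which matches the observation that waking a CPU is sometimes slower than waiting for one that is already powered up.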

Power or performance?

  • How to set the cursor between performance and power saving, and what does power saving mean?
  • What should the userspace API look like? Quality-of-service specification? If so, how do we avoid consuming more power computing quality of service than we end up saving?
  • Typical expected mobile workload has background tasks that run on LITTLE CPUs, with occasional page-rendering or media applications that need a short burst of big-CPU performance.
  • Real-time implications: Real-time processes need a given level of performance in order to operate correctly.

CPU isolation, cgroups, cpusets

  • How do these fit in? Can they achieve CPU-hotplug-like reductions in power consumption? If so, under what conditions?
  • Encouraging results on getting the core to 'shut up'
  • Some experimental results on cpusets

Tests and Benchmarks

Linsched

  • PJT: update on status?
  • Can it be used to model HMP-like systems?

Benchmarking

  • How to benchmark a load balance policy for a mobile/embedded device?
  • What is a typical mobile/embedded workload and what figures of merit should be output?
  • Do existing synthetic mobile/embedded benchmarks need modification to properly measure big.LITTLE?
  • Are there similar server or PC workloads?
  • Which server/PC workloads should be used as test loads in any case in order to avoid server-side regressions?
  • Combinatorial workloads, given that mobile devices are seeing general-purpose usage?

Discussion

WorkingGroups/PowerManagement/Archives/ConfNotes/2012-02-Connect-SFO-Scheduler-minisummit (last modified 2013-08-21 14:11:23)