GCC Optimisation Opportunities 2

Topics:

  • Goal - stretch goal of improving performance by 5% over six months
    • In the 4.5 series, between 4.5-2010.11 and 4.5-2011.05
    • Note that an upstream improvement may not turn into a 4.5 improvement. Will measure.
    • Should separately track individual improvement gains in upstream
    • Track upstream performance, publish
      • Michael: generate historical data
      • Michael: perhaps specific Linaro patches to show our changes
  • Performance of what?
    • Cortex-A9 with NEON only
    • Monitor Cortex-A8: 'do no harm', so no non-trivial regressions allowed
    • Profiles, workloads, and benchmarks
      • Profile: hand-held devices such as phones and tablets
      • What are people using to measure our result?
        • SPEC 2k, EEMBC (many of the suites), CoreMark, Dhrystone (hah!)
        • SPECs: 2k vs 2k6, SPECINT vs SPECFP, base vs peak, -ffast-math vs full IEEE
        • Integer kernels more likely to occur in EEMBC, FP kernels in SPECFP
        • How can we evaluate this to see where the improvements are?
          • Ulrich: contact Mike Meissner to get a feel for it. Check the last GCC conference proceedings
        • ACTION: decide on 2k vs 2k6. Criteria: can it run, and how long does it take to complete
        • EEMBCs:
          • ARM uses Consumer (might actually be DENBench which reports as consumer v2), Networking v2, Office, Telecom
          • CSL uses Automotive, Consumer, Networking, Office, Telecom
          • +coremark
    • Given profile and workload,
      • Concentrate on Consumer, DENBench, and SPECINT 2kX
    • Also track size and speed of things with -Os
  • Approach - top down vs benchmarks up
    • Top down topics:
      • Architecture features not used
        • NEON!
        • ARMv6 SIMD, ARMv5 saturated ops
        • ALU shifts
        • Other hardware features
      • GCC features not used
        • Inlining heuristics
        • SMS
        • Hot/cold partitioning
    • Bottom up is most productive
    • Look for the biggest regression
      • Ramana: to check if we can benchmark and compare against armcc
      • Chuck in LLVM as well
      • Need to be able to reproduce
      • Define 'the method' to allow reproducing
        • How to build
        • Environment to build in
    • What tools do we need?
      • Have perf
      • Do we have smarter simulation models or tools?
      • May need access to hardware folks as some of the needed data is not documented or published
      • Save off final binaries to allow reproducing
    • What level should we try at? -O2? -O3? -Ofast? -flto? PGO?
      • GCC approach:
        • -O2 *always* makes things better (worse is a bug)
        • -O3 *normally* makes things better. May perform the same but produce significantly larger code
        • -O3 is heuristic based
      • What consumers use - they should use -O3!
      • Not what they benchmark at?
      • Use -O3. Monitor -O2
      • Use -ffast-math, as all benchmarks should treat floats as 'approximate real numbers' rather than as strict IEEE floating-point numbers
      • Together, -O3 and -ffast-math are -Ofast
    • Existing information:
      • Ramana: check any 'full fury' or similar efforts going on inside ARM. Gives upper bound on what we can achieve
      • Are Dave's string routine optimisations interesting? No, as they depend too much on the memory system
      • Dave: attack the top two-ish routines in a few of the benchmarks to set bounds
      • Check existing hand-written kernels such as FIR or IDCT
  • Investigation
  • Creating blueprints
  • Doing it
  • Upstream? 4.5?
    • New development should only be upstream
    • Could backport improvements that are coming up in 4.6, but 4.6 is available soon-ish
    • Consumers have picked up 4.5 and will use it for some time
    • Will backport non-Linaro changes if upstream release is far away
    • Some criteria:
      • If it has anything to do with heuristics, don't backport
      • If it is simple and in the backend, say adding a new instruction, then backport
    • Decision: keep 4.5 in development in parallel with our 4.6, backport into both until 4.5 is in 'maintenance', i.e. everyone has moved on
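
Sketch: the 5% stretch goal and the 'do no harm' rule from the notes above could be checked mechanically once run times are collected. The following is only a minimal illustration under stated assumptions, not an agreed method: the notes do not say how results are aggregated, so the geometric mean of per-benchmark speedups is assumed here; the 1% tolerance standing in for 'non-trivial' is hypothetical; and the benchmark names and timings are made up.

```python
from math import exp, log


def geomean_speedup(baseline, candidate):
    """Geometric mean of per-benchmark speedups (baseline time / candidate time).

    Assumption: the notes do not fix an aggregation method; the geometric
    mean is used here only as one common choice.
    """
    ratios = [baseline[name] / candidate[name] for name in baseline]
    return exp(sum(log(r) for r in ratios) / len(ratios))


def regressions(baseline, candidate, tolerance=0.01):
    """Benchmarks where the candidate is slower by more than `tolerance`.

    Returns (name, fractional slowdown) pairs, worst first. The default
    tolerance of 1% is a hypothetical stand-in for 'non-trivial'.
    """
    slow = {
        name: candidate[name] / baseline[name] - 1.0
        for name in baseline
        if candidate[name] > baseline[name] * (1.0 + tolerance)
    }
    return sorted(slow.items(), key=lambda item: item[1], reverse=True)


# Hypothetical run times in seconds for two toolchain releases.
old_times = {"bench_a": 10.0, "bench_b": 20.0, "bench_c": 8.0}
new_times = {"bench_a": 9.0, "bench_b": 19.0, "bench_c": 8.4}

speedup = geomean_speedup(old_times, new_times)
meets_goal = speedup >= 1.05                # the 5% stretch goal
worst = regressions(old_times, new_times)   # 'look for the biggest regression'
```

The same two functions could be run once per core (Cortex-A9 for the goal, Cortex-A8 for the regression check) and once per flag set (-O3 as the target, -O2 and -Os as monitors).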

WorkingGroups/ToolChain/GCCOptimizations-2011-01-1 (last modified 2011-01-28 22:33:21)