GCC Optimization Opportunities

Click here for a recording of the call.

  • Costs tuning [Andrew Stubbs]
    • GCC 4.6 has a new costs model, but are we making full use of it?
    • What about optimizing for size?
    • Do the optimizers take any notice?
      • We discovered recently that combine is happy to take two insns and combine them into a pattern that matches a splitter that then explodes into three insns (partly due to being no longer able to generate pseudo-registers).
    [Steven Bosscher adds ...]
    • Comparing ARM cost models and param settings to x86_64
      • Compare, for some set of functions/benchmarks, the results of estimate_num_insns, estimate_operator_cost, and estimate_move_cost, between ARM and x86_64. Rationalize or fix any significant differences. See whether heuristics based on these functions require tuning for ARM.
      • Go through params.def and see if there are further ARM tuning opportunities. There are more than 100 DEFPARAMs and many of them guide heuristics but have only been tuned on x86_64. (There is set_default_param_value, but most backends do not change the defaults.)
  • Instruction set coverage [Andrew Stubbs]
  • Constant pools [Andrew Stubbs]
    • it might be a very handy space optimization to have small functions share one constant pool, but the way the passes work one function at a time makes this hard. (625233)

  • Inlining [Mark Mitchell]
    • on it's own; and
    • in profile-directed optimization
      • Are we inlining enough in hot spots?
      • Are we inlining too much in cold spots?
        • Binary size is thought to affect system performance through cache misses and page faults
  • Size optimization [Mark Mitchell]
    • Set -Os automatically in cold code.
    • Are we inlining all cases where doing so would reduce code size?
      • Speculative inlining (try it, and back out if necessary), perhaps?
  • ARMv5 saturated instructions. [Michael Hope]
  • ARMv6 SIMD instructions. [Michael Hope]
  • ARMv7 unaligned accesses [Michael Hope]
  • Unaligned built-in memcpy. [Michael Hope]
  • Register allocation. [Michael Hope]
    • Is the RA too Intel-centric, assuming a small number of special purpose registers?
    • Could it be improved for the ARM architecture?
  • Conditional instructions. [Michael Hope]
    • Does the middle-end optimizations inhibit the use of conditional instructions in the back end?
    [Steven Bosscher adds ...]
    • How to model conditional execution before register allocation?
    • How exploit opportunities better in GCC (ifcvt is inadequate and too late in the pipeline)?
    • Also look at LLVM here, it appears to have a better cost model for if-conversion than GCC (taking into account a target-dependent branch misprediction penalty, for example).
  • Basic block re-ordering for speed/size. [Steven Bosscher]
    • The existing basic block reordering pass in GCC implements only a reordering strategy for speed.
    • The pass does not run at all for functions optimized for size.
  • Load/store multiple instructions. [Julian Brown]
    • GCC only generates these from load_multiple/store_multiple (from a couple of places where hard registers are already known, e.g. function prologues & epilogues)

    • All ARM chips support much more generality
      • sparse sets of registers
      • transfers of more than four registers
    • ARM backend has some support using generated MD files.
    • Maybe a pass can be added to do it better.
      • [Ramana Radhakrishnan adds ....]
      • such a pass has to run after RA because increasing register number need increasing addresses
      • or, we need to teach RA to assign registers according to ldm/stm constraints
        • it might also have some effects with generated schedules where we now might change the order in which values are produced.
        • [Peter Maydell adds ...] this might also facilitate the use of ldm/stm in asm inserts
  • Shifted operands in ALU ops [Julian Brown]
    • GCC generally does quite well at using these
    • Maybe there are places where they can be used more effectively though
    • Add "special tricks" (GCC already knows many).
  • Comparison flag setting [Julian Brown]
    • 's' instruction suffix use should be reviewed.
  • NEON [Andrew Stubbs]
    • There's already a lot of work going on here.


Meeting Minutes: 2010-12-15 9am UTC

Attendees

Costs

Richard explained the motivation behind the recent costs model infrastructure work.

  • The code was littered with code conditional on specific ARM variants. This was getting hard to maintain.
  • The new cost model abstracts away the specific cpu names, and has the code query the reasons/features behind that decision via a set of hooks.
  • Also refactored the generic code to hook into cpu specific ways.
  • Did not hookize cases where the same decision is made for all ARM cores.

and the current state ...

  • Still a work in progress
  • Most existing costs abstracted, but small number remain hard-coded. These are mainly X-scale specific.
  • The number of abstractions are intended to grow as new cores with new needs are added.
  • We may want to retrofit costs models for older cores (probably not a task interesting to Linaro).
  • May also want to add new abstractions for existing cores to improve performance.
    • E.g. Branch costs are not really exposed properly at the moment.

Areas for improvement

  • Costs of conditional returns is not modelled at all.
    • Currently just done blindly.
  • A8 has no separate costs model (A9 does).
    • The default costs model is ARM 9e.
  • A9 prefetch support.
  • Costs of *not* executing an instruction (in instructions with condition codes).
    • This needs work on if-conversion.
      • Current implementation assumes 1 cycle for each not-executed instruction.
      • A9 takes more than one cycle.
        • may even take longer to *not* execute than to execute, for instructions that write to registers.
          • mitigated when another instruction *does* write to the same register.
  • Optimizing for size
    • Does the code assume that thumb insns are 32-bit or 16-bit?
    • We don't know what size until the constraints are checked quite late in the day.
      • Thumb1 sees fewer registers, so size depends on RA.
      • Thumb1 has smaller immediates.
      • etc.

Recent comparison of size from 3 toolchains. http://lists.linaro.org/pipermail/linaro-toolchain/2010-December/000581.html

  • We're producing both slowest and smallest code.
  • GCC 4.5 inlining heuristics suspected wrong.

Inlining

Speculative inlining

  • Do the inline if the heuristics say they should be, and back it out if it turns out to be wrong.
  • Thought to be a prohibitively expensive option.

Param tuning

  • Tweak the tuning and benchmark.
  • Param best values may change from one compile release to the next.

Is the cost function broken?

  • An inaccurate cost estimate may well throw off the inlining heuristics.

Builtins

  • Assumed to have a default cost of 1 instruction.
    • Often true (?), but byteswap operation varies by core variant, for example.
  • Is it hookized?
  • We should tune the builtin costs for ARM.

Unsupported Instruction set

Unaligned access instruction support

  • Richard considers that an unaligned access is always faster than two aligned accesses followed by a merge.
    • Provided that the unaligned load isn't implemented via an OS trap.

LDRD/STRD support

  • Recent patch from Carrot Wei rejected and problems not addressed, yet.

Generic "Best meld" optimization

Richard proposed implementing a cost model for when no specific core is named.

  • Assumes v7 features, for example.
  • Provides best compromise of features.
  • Avoid very bad things on certain v7 cores.
  • There will be some things where it'll be wrong everywhere.
    • E.g. LDRD is much faster on some cores, LDM is better elsewhere (including A9).

LDM/STM

Current state

  • LDM/STM supported for up to 4 registers.
  • Introduced late, via peephole.
    • This requires consequtive insns.

Possible new implementation:

  • Invent a new constraint that requires a register of a certain class, but greater than that of another operand.
  • Implement support for this in the RA
  • Requires inserting LDM/STM insns before RA.
    • Possibly generate LDM initially in expand
      • Load elimination etc should be handled in tree optimizers, so this is valid.
  • Require code to split LDM/STM into other loads if registers cannot be found.
    • Reload might be able to do this.
  • Careful not to generate insns with so many operands they cause mass spilling.

Stack slot reassignment

  • Might help optimize multiple loads.
  • Retrospective reassignment means rewriting offsets which might break constraints on existing insns
    • Probably harder than retrospective register reassignment
  • Probably not profitable use of time.

Use of match_parallel in ldmstm.md

  • Andrew: Current implementation has ldm2, ldm3 and ldm4 patterns, but match_parallel permits variable numbers of operations.
    • Would it not be better to have one variable length pattern?
  • Richard: Yes, you can match arbitrary patterns, but you can't add constraints without the explicit RTL.
    • Only possible with hard registers.
  • Ulrich has tried this, and found that reload could not cope.
    • Eliminable registers not handled properly - a bug maybe, but not unexpected without explicit RTL.

Scheduling LDM

  • The registers are loaded in a known order.
  • Dependent instructions are executed as the values come in.
  • The scheduler should be taught to take advantage of this.

Basic Block Reordering

If-conversion

  • BB reordering can disrupt if-conversion.
    • It might move unlikely blocks out of order, thus hiding them from if-conversion.
  • BB not broken.
  • If-conversion should be made smart enough to re-inline basic blocks.

Per function optimisation

ARM does not have a per-function optimization attributes.

Ramana has tried per-function ARM/Thumb attributes

  • Unsuccessful due to broken inlining.

O2/Os attributes exist on other platforms

  • global variables for optimization settings still in force
    • mixed decisions follow.

Literal pools

It would be nice to have constant/literal pools shared between multiple small functions.

  • However, the function that generates constant pools can only see one function at a time.
  • Possible implementation:
    • Carry info across calls (deferred write).
      • Need to dodge garbage collection.
    • Use a hook to write the pool before prologue, if needed.
    • Use another hook to write left over data after the last function.
    • Need much more accurate epilogue size estimation.
    • Hot/cold code optimization for cache optimization not available right now.
      • Thought to be a big loss now.
  • New items can only be added to forward constant pools, but code can backward reference existing items already emitted.

Register Allocation

Is the RA Intel-centric?

  • Consensus: No.
  • ARM (general purpose) register set is not that large.
  • Comparable to x86_64/S390/SuperH.

Shifted ALU operands

Many instructions provide 'free' shifting, but GCC doesn't always use it.

  • Expand always generates the shift as a separate insn.
  • Combine will fix it if the shifted value is only used once, and it's in the same basic block.
  • CSE, and other optimisations, might hoist multiple use operations out of a basic block.
    • Combine doesn't even consider multiple use data-flows.
    • However, this is still a win if there's another use of the shifted value that doesn't get it free.
      • Especially if the unshifted value is now unused.
  • Possible solution:
    • Have gimple detect the case and generate a specific code for it (like multiply-and-accumulate).
    • Expand that directly to the combined insn.
    • There's precedent for this with vector ops and multiply-and-accumulate.
    • This project needs to be proposed upstream before implementation to prevent trouble later.

64-bit Integer Arithmetic

Traditionally not very good in ARM GCC.

Subtraction patterns implemented as multiple instruction sequences, rather than splitting into multiple insns.

  • Always been that way.
  • May be a missed optimization.

64-bit Logic operations split into 32-bit operations

  • (Not applicable to arithmetic operations.)
  • Can be counter-productive since combine improved.
    • The operations can no longer be recognised as single operations.
    • Makes it more difficult to do algorithmic transformations.
    • May be some testcases in bugzilla.
    • SPEC chess benchmark might show the problems.

Conditional Instructions

Conditional comparisons

  • Very useful for optimizing logic
  • In order to get the compiler to generate them branch costs must be set too high.

Costs

  • if-conversion costs are not well optimized
    • it doesn't take into account costs of not executing an instruction
    • basic heuristic is to use conversion if sum of taken/not-taken is less that 8 instructions.
      • not good if 1 true, and 7 false insns
  • branch probabilities not taken into account.
  • Original implementation for Itanium assumes VLIW, so costs are different.

If-conversion algorithm

  • Basic Block reordering moving things out-of-line might prevent if-conversion linear checks form working
    • (Need to check implementation.)
  • Best solution to make if conversion smarter
    • Undo the reordering, perhaps.
  • Anticipate branch prediction
    • Knowing how the hardware will predict the branch will inform the optimization
    • Prediction may need to be CPU core specific.
  • Profile-guided optimizations may be interesting?

NEON

Julian and Ira are not interested in having a discussion like this. They feel that they need much more information and understanding before they get to that point.

WorkingGroups/ToolChain/GCCOptimizations-2010-12 (last modified 2011-01-06 00:15:00)