Work In Progress at time of leaving Linaro
GCC 4.7 and Linaro GCC 4.6 have a new command-line option "-mtune=generic-armv7a". This was originally committed as just a framework, and the tuning parameters were to be tweaked later, but the tuning never happened.
The problems were:
- The A8 and A9 tuning options are not terribly good, with A8 tuning sometimes performing better on an A9 than the specific A9 tuning, and vice-versa.
- The benchmark results for enabling and disabling various features were inconclusive.
As it happens, the current behaviour of the generic tuning option is actually quite reasonable: on many benchmarks it performs more or less in the range you'd expect from a non-specific tuning. In some cases it has actually out-performed the A8/A9 specific tuning. Part of the reason for this is that it uses the fall-back single-issue pipeline definition, and it has been suggested that this can have fewer register allocation issues (and such) than the dual-issue models used for A8 and A9 (this assertion has not really been tested).
The generic tuning work has therefore been on hold while the A8 and A9 performance issues are straightened out. The improvements for 64-bit integers in NEON are part of this effort.
NEON and 64-bit integers
The ARM NEON unit has some support for 64-bit integer arithmetic, but the compiler was not taking advantage of all of it. Additionally, some of the code generated for 64-bit arithmetic in core registers (with and without NEON available) was/is suboptimal, so that has been folded into this project.
One's Complement (NOT)
One's complement is available in NEON, but was not supported in GCC, so all "not" operations had to occur in core-registers.
This patch was completed and committed upstream in April 2012.
NEON 64-bit Immediate Constants
NEON supports constants in the form of replicated vectors, including a 64-bit singleton vector that can be used for 64-bit integers.
GCC did not support these for two reasons:
- It expected that all NEON constants would be of type const_vector, whereas regular integer arithmetic code is generated with const_int.
- gen_rtx_CONST_VECTOR automatically converts vectors with only one element to const_int.
A patch for this was completed and committed upstream in April 2012; however, there may still be some missed cases where const_vector is accepted without also allowing const_int.
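To make the "64-bit singleton vector" constant form a little more concrete, here is a minimal C sketch of the shape accepted by the 64-bit move-immediate (VMOV.I64), where each byte of the constant is either 0x00 or 0xFF. The function names are hypothetical and are not taken from GCC or from the patch. The check covers only this one form, so a constant it rejects may still be loadable through one of the narrower replicated forms.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if every byte of value is 0x00 or 0xFF, i.e. the constant has the
   shape of a VMOV.I64 immediate.  Hypothetical helper, not GCC code. */
bool fits_neon_i64_immediate(uint64_t value)
{
    for (int i = 0; i < 8; i++) {
        uint8_t byte = (uint8_t)(value >> (8 * i));
        if (byte != 0x00 && byte != 0xFF)
            return false;
    }
    return true;
}

/* Expand an 8-bit mask into the 64-bit constant it selects:
   bit i set means byte i is 0xFF, otherwise byte i is 0x00. */
uint64_t neon_i64_from_mask(unsigned mask)
{
    assert(mask <= 0xFFu);
    uint64_t value = 0;
    for (int i = 0; i < 8; i++) {
        if (mask & (1u << i))
            value |= (uint64_t)0xFF << (8 * i);
    }
    return value;
}
```

Zero, which the negation work relies on, trivially fits this form.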
Two's Complement (NEG)
NEON does not support 64-bit integer negation directly, but it can be implemented by subtracting the value from zero. This ought to be at least as efficient as doing it in core-registers, and more efficient in the case that the variable is already present in NEON registers or can be loaded there directly.
This patch depended on the NEON immediate constants patch (to load the zero efficiently) and was completed and committed upstream in April 2012.
NEON supports 64-bit shifts, but GCC did not support them at all. In order to implement any operation in NEON the machine description must also provide a fall-back mode for doing the operation in core-registers, if the register allocator determines that to be a better option. For this reason there are two parts to this patch:
Shifts in core registers
GCC already has generic, target independent support for generating 64-bit shifts in 32-bit registers in the expand pass, but it does not provide any means for doing this post-reload. This means splitters need to be provided that work in both ARM and Thumb modes and produce pre-optimized instruction sequences (since this happens way too late for the combine pass to do the job).
Additionally, we can use this opportunity to take advantage of ARM's implementation-defined behaviour of out-of-range shift-amounts in a way that the generic algorithm cannot or does not. Therefore the patch must be implemented for use by both the expand pass and split2 pass.
The patch was submitted upstream here, and has been committed both upstream and into Linaro GCC 4.7.
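The use of out-of-range shift amounts can be sketched in C. This is only a model under stated assumptions, not the sequence the patch emits: arm_lsl and arm_lsr are hypothetical helpers that imitate an ARM register-controlled shift (only the bottom byte of the amount is used, and an amount of 32 or more gives zero), and shl64_in_core_regs is an illustrative name.

```c
#include <assert.h>
#include <stdint.h>

/* Model of an ARM register-controlled LSL: only the bottom byte of the
   amount is examined, and amounts of 32 or more produce zero. */
uint32_t arm_lsl(uint32_t value, uint32_t amount)
{
    amount &= 0xFF;
    return amount >= 32 ? 0 : value << amount;
}

/* Same model for a logical right shift. */
uint32_t arm_lsr(uint32_t value, uint32_t amount)
{
    amount &= 0xFF;
    return amount >= 32 ? 0 : value >> amount;
}

/* 64-bit left shift of hi:lo by n (0..63) with no branch on n.
   Whichever of (n - 32) and (32 - n) goes negative wraps to a bottom byte
   of at least 224, so that term simply contributes zero. */
uint64_t shl64_in_core_regs(uint32_t lo, uint32_t hi, uint32_t n)
{
    assert(n < 64);
    uint32_t new_hi = arm_lsl(hi, n)
                    | arm_lsl(lo, n - 32)
                    | arm_lsr(lo, 32 - n);
    uint32_t new_lo = arm_lsl(lo, n);
    return ((uint64_t)new_hi << 32) | new_lo;
}
```

A target independent expansion cannot assume these out-of-range semantics, so it typically needs a comparison against 32 and two code paths instead.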
Shifts in NEON registers
Adding the NEON shift patterns is not totally straight-forward for a few reasons:
- NEON implements right-shifts as a negative left-shift.
- If the register allocator decides to do the shift in core-registers then the amount must not be negated.
- Technically, the vshl and vshr instructions take a DImode register as the shift amount, but only pay attention to the bottom 8 bits, so choosing what mode to use to avoid needless extends and truncates is not easy.
To solve these problems I have chosen to introduce the negation only after register allocation, and use SImode values.
Delaying the negation means that we can't optimize the case where the code says "a >> -n", but the alternative might mean double negation of a value, in the core-regs case.
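The "negative left-shift" behaviour, and the reason an SImode amount is enough, can be modelled in C. This is a sketch, assuming the register form of the NEON shift treats the bottom byte of the amount as a signed value; model_vshl_u64 and lshr64_via_vshl are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a 64-bit unsigned NEON shift by register: the bottom byte of
   the amount is read as a signed value; positive shifts left, negative
   shifts right, and a magnitude of 64 or more yields zero. */
uint64_t model_vshl_u64(uint64_t value, uint64_t shift_reg)
{
    int shift = (int)(shift_reg & 0xFF);
    if (shift >= 128)
        shift -= 256;               /* reinterpret the byte as signed */
    if (shift >= 64 || shift <= -64)
        return 0;
    return shift >= 0 ? value << shift : value >> -shift;
}

/* Logical right shift by n, expressed as a left shift by -n.  The negation
   is done on a 32-bit value (as it would be in a core register); the upper
   bits are garbage to the shift, which only looks at the bottom byte. */
uint64_t lshr64_via_vshl(uint64_t value, uint32_t n)
{
    assert(n < 64);
    uint32_t negated = 0u - n;
    return model_vshl_u64(value, negated);
}
```

If the same operation ends up in core-registers instead, n has to be used as it is, which is why the negation is only introduced once the allocation is known.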
There's also the question of how to write this in RTL so that the compiler does not try to generate shifts without the proper negated value, or else emit warnings about negative shift amounts. This probably isn't a real problem at present because the patterns will not exist until late in the compilation process, but future proofing isn't a bad plan. I have mulled over the possibility of implementing the right-shifts like this:
    (set (reg:SI Z) (neg:SI (reg:SI Z)))
    (set (reg:DI X) (lshiftrt:DI (reg:DI Y) (neg:SI (reg:SI Z))))
.. the idea being that the compiler will, in theory, be able to see what's really going on and, one day, may be able to reason about it usefully. However, there's not really any opportunity to do so as this pattern wouldn't occur until split2.
Instead, I have chosen to reuse the existing unspec to hide the non-standard shift characteristics, so my patch has it emit code like this:
    (set (reg:SI Z) (neg:SI (reg:SI Z)))
    (set (reg:DI X) (unspec:DI [(reg:DI Y) (reg:SI Z)] UNSPEC_ASHIFT_UNSIGNED))
(Note I've simplified that because the neg must happen in core-registers and then be copied to NEON.)
The choice to use an SImode shift amount means that the value must typically be moved from core-registers for every shift. However, this is probably the most likely scenario in any case, and it avoids the useless extend. This value is unlikely to ever need moving back to core-registers, and that would be the expensive case. The main disadvantage is that the available 32-bit move instructions limit us to using an even-numbered register in the lower half of the NEON register set (even-numbered because the shift expects a 64-bit register, so we can only use 32-bit registers that overlap the low-part of a 64-bit register).
Core -> NEON Extends
When GCC copies a 32-bit value from core-registers to NEON and extends it to 64-bit, it emits code like this:
    ;; zero-extend, value in r0
    mov r1, #0
    fmdrr d16, r0, r1

    ;; sign-extend, value in r0
    asrs r1, r0, #31
    fmdrr d16, r0, r1
This is fine, but it consumes a whole core-register (a limited resource!) just to fill it with ones or zeros. Additionally, although vmov/fmdrr will take any arbitrary two registers, I don't think GCC will allocate them that freely, in most cases.
I've not yet benchmarked it, but it seems clear to me that moving it first, and extending it second would be superior. My patch emits code like this:
    vdup.32 d16, r0
    vshr.u64 d16, d16, #32
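The move-then-extend sequence can be checked with a small C model. This is a sketch only: model_vdup32 and the two extend functions are hypothetical names, and the signed variant is my assumption of how the same idea would carry over with an arithmetic shift, since only the unsigned sequence is shown above.

```c
#include <assert.h>
#include <stdint.h>

/* Model of "vdup.32 d16, r0": the 32-bit value lands in both halves
   of the 64-bit register. */
uint64_t model_vdup32(uint32_t r0)
{
    return ((uint64_t)r0 << 32) | r0;
}

/* Model of the following "vshr.u64 d16, d16, #32": the copy in the upper
   half moves down and zeros are shifted in above it. */
uint64_t zero_extend_via_neon(uint32_t r0)
{
    uint64_t result = model_vdup32(r0) >> 32;
    assert((result >> 32) == 0);
    return result;
}

/* Assumed signed counterpart: an arithmetic shift right by 32 fills the
   upper half with copies of the sign bit.  Written without a signed shift
   so the model does not depend on implementation-defined behaviour. */
uint64_t sign_extend_via_neon(uint32_t r0)
{
    uint64_t dup = model_vdup32(r0);
    uint64_t shifted = dup >> 32;
    if (dup >> 63)
        shifted |= 0xFFFFFFFF00000000ULL;
    return shifted;
}
```

Either way, no second core register is needed to hold the upper half.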
The lower subreg pass has caused us some problems both with and without NEON enabled, but primarily when NEON is enabled. The problem is that, when NEON is enabled, we do not want DImode registers to be decomposed into two SImode registers, and mostly lower-subreg recognises this and does not do so, but if the code contains a pseudo-to-pseudo copy then both the input and output registers are decomposed, unless there's some other operation that prevents it.
I do not understand the motivation for doing this. The normal rule is that it only decomposes DImode registers that are accessed via SUBREG, somewhere, and not accessed directly by any non-move insn (other than shifts or zero-extends that have special handling). So, if a pseudo-register is loaded from memory (a move) and then used in a DImode add, for instance, then it will not be decomposed, but, if the pseudo-register is loaded, copied to another pseudo, and the new pseudo is used in the add, then the first pseudo will be decomposed, and the second not, with the result that register allocation gets confused, and poor code generation follows.
There's an example in GCC pr43137. The bug here has been fixed in mainline, but returns when the neon-shift/neon-extend patches are applied because they effectively reverse the fix (deliberately).
It's like the pseudo-to-pseudo copy counts as a pair of SUBREG copies, and therefore triggers decomposition. Now, I can see why a copy would not prevent decomposition, but why should it cause decomposition any more than a load or anything else? The lower-subreg code has support for forward-propagating decomposition across pseudo-copies, so it isn't for that purpose; in fact, it seems to make the propagation redundant. There doesn't appear to be any attempt at backward propagation, so maybe it was an attempt at doing that?
I discussed this upstream with Richard Sandiford, and we agreed to try disabling the pseudo-to-pseudo copy feature in the subreg1 pass.
I have tried this, and it appears to solve the problem, but I'm currently waiting for benchmark results to show that it does not cause any regressions. I have submitted a toolchain to Michael's build system named gcc-4.8~svn187203-ams-lower-subreg-test3. If the regression test results are ok, and the benchmark results (with, but particularly without NEON) are satisfactory, then the patch should be submitted upstream. Since NEON is not important here, it is probably best to benchmark this on A8 to side-step the A9 benchmarking problems.
The patch is not posted upstream yet.
DImode operators and Lower Subreg
In the course of investigating Lower Subreg it became increasingly clear to me that the machine description should only use DImode registers directly, as opposed to within a subreg wrapper, if there is a real hardware instruction to implement it. Even then, the instruction would have to give a sufficient performance boost to justify the added register allocation pain. On ARM (without NEON), about the only true DImode instructions are the loads, stores, and widening multiplies.
Therefore, when NEON is not available the machine description should expand DImode operations using subreg and the multiple steps to do the job in 32-bit registers, and when NEON is available it should expand to give the proper DImode operation, and then split them later, if necessary.
Or should it? I've made quite a confident sounding statement here, but there are reasons to do it the opposite way: operators such as widening-multiply have DImode extends in the definition, and if we eliminate DImode extend patterns at expand time then it becomes harder (impossible?) for combine to automatically create them from separate extend and multiply instructions. Similarly, if all DImode registers are decomposed then we will end up never using the 64-bit loads and stores (unless a peephole can recombine them later). This issue probably needs more investigation and benchmarking, but I would suggest that widening multiplies are not a problem because they are combined by a tree pass.
Unfortunately, there are many, many insns in the ARM machine description that use DImode as if the operation really existed; even worse, several of them emit a hard-coded pair of instructions! Perhaps they pre-date the lower-subreg pass, and therefore once made no difference? I'm not convinced this was ever true because they would always have impeded the ability of the other passes to combine the constituent 32-bit operations with other instructions until the first split pass, which is too late. (For example, the zero-extend introduces a constant zero that can be used to optimize many following operations, not just those that currently have special patterns to recognise that case.)
It might be nice if we could expand to use DImode for all DImode operations, go all the way through to first split pass, but stop just prior to register allocation, then try to judge whether the operation will be implemented in true DImode or decomposed into an SImode sequence and (if necessary) repeat all the RTL optimization passes with the new state, before continuing with register allocation and onwards as usual. This would be a somewhat expensive optimization, although it could be short-cut for functions that do not use 64-bit (the pass-manager would need a small patch to achieve all this, of course). Anyway, it's a thought.
So far I've begun "fixing" the logical operators: and, or, xor, and not. This is by no means all of them (search arm.md for "di2" and "di3" and inspect the RTL for "reg:DI" to see how many).
There are two steps to fix each pattern: first fix the RTL when NEON is not available, and second to give as good RTL as possible when NEON is not used, even though it is available (i.e. the register allocator chooses to assign the register to the core). Obviously, none of this must do any harm to those instructions that do get assigned to NEON registers.
Logical operators when NEON is disabled
This has been reported upstream as GCC pr53189.
The main purpose of this patch is to split and, or, xor, and not into their constituent parts right from the expand pass. It also requires a few adjustments to not break NEON, of course, although I've not tried to improve NEON output here.
The patch is HERE; it is not posted upstream yet. It depends on the neon-extends patch (textually, if not logically).
And, HERE is another version of the patch; this one does the same but makes anddi3/iordi3/xordi3/one_cmpldi2 conditional on NEON, thus requiring expand to DTRT when NEON is not available. This might have some unexpected consequences (when there's no NEON) if an optimizer tests whether they exist, or not. That said, TARGET_HAVE_anddi3 is still defined, and the optab will still contain the operator, so it only becomes a problem if something tests the condition.
Neither patch is completely polished. In particular, the one's complement patterns should probably be moved to neon.md (and maybe the di3 expanders are unnecessary in arm.md). Also, the IWMMXT stuff needs to be checked.
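As a rough illustration of what splitting a logical operator into its constituent parts buys, here is a C sketch of a 64-bit AND done as two independent 32-bit ANDs, together with the zero-extend case mentioned earlier, where the high word folds to a known constant. The di_pair type and all function names are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit value held as two 32-bit halves, as it would be in core registers. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} di_pair;

di_pair split_di(uint64_t value)
{
    di_pair p = { (uint32_t)value, (uint32_t)(value >> 32) };
    return p;
}

uint64_t join_di(di_pair p)
{
    return ((uint64_t)p.hi << 32) | p.lo;
}

/* A 64-bit AND is just two independent 32-bit ANDs. */
di_pair and_di(di_pair a, di_pair b)
{
    di_pair r = { a.lo & b.lo, a.hi & b.hi };
    return r;
}

/* When one operand is a zero-extended 32-bit value its high half is a
   known zero, so the high AND folds away entirely.  An optimizer can only
   see this if the halves are exposed before the optimization passes run. */
di_pair and_di_with_zero_extended(uint32_t a, di_pair b)
{
    di_pair folded = { a & b.lo, 0 };
    di_pair general = and_di(split_di((uint64_t)a), b);
    assert(folded.lo == general.lo && folded.hi == general.hi);
    return folded;
}
```

The same reasoning applies to or, xor, and not, each of which also acts on the two halves independently.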
Logical operators when NEON is enabled
The point of this patch is not to make operations in NEON work better, but to optimize the case where the register allocator chooses not to do the operation in NEON: that is, to do it in core registers. The aim is to produce code that is as good as that produced when NEON is disabled.
This is easier said than done because the split does not happen until post-reload, and that is way too late for most optimizations. This means that the splitters must be extra smart to make up for it.
This patch will work much better when combined with the lower-subreg-disable-decomposition-of-pseudo-copies patch.
The patch is HERE; it is not posted upstream yet. It depends on the core-and64 patch.
Multiplication in NEON
NEON has no direct support for 64-bit multiplication, but it can be done relatively easily in a few instructions so it is probably worth having.
The patch is here.
This has not been posted upstream yet, but is probably just about ready to go, although the patch depends on the neon-shifts patch.
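For reference, the arithmetic behind building a 64-bit multiply out of 32-bit pieces can be written in C as below. This shows only the identity involved, not the NEON instruction sequence used by the patch (which is not reproduced here), and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit multiply (modulo 2^64) from 32-bit halves:
     a * b = a_lo*b_lo + ((a_lo*b_hi + a_hi*b_lo) << 32)
   The a_hi*b_hi term would be shifted left by 64 bits, so it drops out. */
uint64_t mul64_from_32bit_parts(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    /* One widening 32x32->64 multiply for the low halves. */
    uint64_t low_product = (uint64_t)a_lo * b_lo;

    /* The cross terms only need their low 32 bits, so they may wrap. */
    uint32_t cross = a_lo * b_hi + a_hi * b_lo;

    uint64_t result = low_product + ((uint64_t)cross << 32);
    assert(result == a * b);    /* sanity check against the native multiply */
    return result;
}
```

The shift by 32 in the final step is one way to see why a multiply built like this would lean on 64-bit shift support.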
Richard Earnshaw suggested merging all the target32 movdi variants.
Register class pre-allocation. Meeting Minutes
- The purpose for this is primarily to side-step the late split problem for DImode operations whose registers get allocated in core registers. Deciding this in advance, and splitting those operations early, would help with code generation, but it wouldn't help simplify machine descriptions much because they would still have to handle the case where the register allocator is forced to make a different decision.
- The implementation might be an earlyish RTL pass, or some kind of analyser tool (similar to DF), or might even be a late tree-ssa pass. Any RTL solution would need an additional split mechanism of some kind to decompose insns pre-allocated to core-registers. A tree-pass could split the operations in gimple, or else annotate the variables as hints for expand and/or the machine description.
- Force operations into NEON.
- Currently, no operation that begins or finishes in a hard register (such as function parameters or return values) can ever happen in NEON because the register allocator refuses to "reload" them unless it absolutely has to (say, if there just isn't an alternative for core-registers), no matter what the relative costs are (as far as I can make out). This might be part of the pre-allocator idea.
- Fix remaining fake DImode operators (see "DImode operators and Lower Subreg" above).
Do 64-bit only in NEON (perhaps).
- This would very much simplify the code by eliminating the core-reg fall-back code, resolve task 3 above ("Force operations into NEON"), and kill the notion of "pre-allocation". Of course, the easy option is not always the best one.
- A8 would probably be tuned to use only core 64-bit.
- Small test cases would probably be less efficient due to the extra register moves, but "real" code may not have this problem.
- With shifts and muldi3 added, the only (?) missing operation is divide, and that would be a library call in any case.
AndrewStubbs/Sandbox (last modified 2012-05-31 08:05:09)