New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/compile: avoid reloads of temp register on ppc64le, others #17110
Comments
Yeah, this happens because a single AMOVD becomes two real instructions so there's not really a stage where this can be fixed in the current way of working. I guess the move needs to be decomposed earlier. I think a similar issue affects ARM64 (and probably the other fixed-width instruction architectures). |
Did this very thing on Sparc decades ago; the inner loop of fpppp (which was huge) had 42 different "base registers" which was of course far more than the number of registers. |
I'd like to push this off until 1.9. Note that this is problem occurs on other architectures, not just PPC. |
OK, so do you want the title changed? |
That would be fine. Any architecture with load/store instructions using a register + small immediate to form the address is eligible for this optimization. I suspect it might need linker or at least assembler assistance, since we'll be emitting relocations along the lines of
and we'd want to usually guess right at the rough magnitude of the difference of those two globals. |
Where this problem shows up most often is when referencing constants, which is common in many of functions in the math package. It seems that with constants, they should all be in rodata, and then the base could be the start of the rodata (named: runtime.rodata). Then the offset of the constant from the start of rodata would be known at compile time. Likewise something similar could be done for static data. |
@dr2chase I'd like to try and look at this further and try to get it working. If you have some clues or pointers that would be helpful.
|
Change https://golang.org/cl/68250 mentions this issue: |
On PPC64 (and a few other architectures), accessing global requires multiple instructions and use of temp register. The compiler emits a single MOV prog, and the assembler expands it to multiple instructions. If globals are accessed multiple times, each time it generates a reload of the temp register. As this is done by the assembler, the compiler cannot optimize it. This CL makes the compiler not fold address of global into load and store. If a global is accessed multiple times, or multiple fields of a struct are accessed, the compiler can CSE the address. Currently, this doesn't help the case where different globals are accessed, even though they may be close to each other in the address space (which we don't know at compile time). It helps a little bit in go1 benchmark: name old time/op new time/op delta BinaryTree17-2 4.84s ± 1% 4.84s ± 1% ~ (p=0.796 n=10+10) Fannkuch11-2 4.10s ± 0% 4.08s ± 0% -0.58% (p=0.000 n=9+8) FmtFprintfEmpty-2 97.9ns ± 1% 96.8ns ± 1% -1.08% (p=0.000 n=10+10) FmtFprintfString-2 147ns ± 0% 147ns ± 1% ~ (p=0.129 n=9+10) FmtFprintfInt-2 152ns ± 0% 152ns ± 0% ~ (p=0.294 n=10+8) FmtFprintfIntInt-2 218ns ± 1% 217ns ± 0% -0.64% (p=0.000 n=10+8) FmtFprintfPrefixedInt-2 263ns ± 1% 256ns ± 0% -2.77% (p=0.000 n=10+8) FmtFprintfFloat-2 375ns ± 1% 368ns ± 0% -1.95% (p=0.000 n=10+7) FmtManyArgs-2 849ns ± 0% 850ns ± 0% ~ (p=0.621 n=8+9) GobDecode-2 12.3ms ± 1% 12.2ms ± 1% -0.94% (p=0.003 n=10+10) GobEncode-2 10.3ms ± 1% 10.5ms ± 1% +2.03% (p=0.000 n=10+10) Gzip-2 414ms ± 1% 414ms ± 0% ~ (p=0.842 n=9+10) Gunzip-2 66.3ms ± 0% 66.4ms ± 0% ~ (p=0.077 n=9+9) HTTPClientServer-2 66.3µs ± 5% 66.4µs ± 1% ~ (p=0.661 n=10+9) JSONEncode-2 23.9ms ± 1% 23.9ms ± 1% ~ (p=0.905 n=10+9) JSONDecode-2 119ms ± 1% 116ms ± 0% -2.65% (p=0.000 n=10+10) Mandelbrot200-2 5.11ms ± 0% 4.92ms ± 0% -3.71% (p=0.000 n=10+10) GoParse-2 5.81ms ± 1% 5.84ms ± 1% ~ (p=0.052 n=10+10) RegexpMatchEasy0_32-2 315ns ± 0% 317ns ± 0% +0.67% (p=0.000 n=10+10) RegexpMatchEasy0_1K-2 658ns ± 0% 638ns ± 0% -3.01% (p=0.000 n=9+9) RegexpMatchEasy1_32-2 315ns ± 1% 317ns ± 0% +0.56% (p=0.000 n=9+9) RegexpMatchEasy1_1K-2 935ns ± 0% 926ns ± 0% -0.96% (p=0.000 n=9+9) RegexpMatchMedium_32-2 394ns ± 0% 396ns ± 1% +0.46% (p=0.001 n=10+10) RegexpMatchMedium_1K-2 65.1µs ± 0% 64.5µs ± 0% -0.90% (p=0.000 n=9+9) RegexpMatchHard_32-2 3.16µs ± 0% 3.17µs ± 0% +0.35% (p=0.000 n=10+9) RegexpMatchHard_1K-2 89.4µs ± 0% 89.3µs ± 0% ~ (p=0.136 n=9+9) Revcomp-2 703ms ± 2% 694ms ± 2% -1.41% (p=0.009 n=10+10) Template-2 107ms ± 1% 107ms ± 1% ~ (p=0.053 n=9+10) TimeParse-2 526ns ± 0% 524ns ± 0% -0.34% (p=0.002 n=9+9) TimeFormat-2 534ns ± 0% 504ns ± 1% -5.51% (p=0.000 n=10+10) [Geo mean] 93.8µs 93.1µs -0.70% It also helps in the case mentioned in issue #17110, main.main in package math's test. Now it generates 4 loads of R31 instead of 10, for the same piece of code. This causes a slight increase of binary size: cmd/go increases 0.66%. If this is a good idea, we should do it on other architectures where accessing global is expensive. Updates #17110. Change-Id: I2687af6eafc04f2a57c19781ec300c33567094b6 Reviewed-on: https://go-review.googlesource.com/68250 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
@cherrymui @jeremyfaller Will there be support in the new linker to fix this problem? It still exists, mostly in the math package. |
Here's an example of some code from math.sin. R31 is reloaded with the same value over and over.
|
I don't think the new linker does anything with this. This is all in the compiler. I wonder if marking global address (i.e. a MOVDaddr with the operand being SB) not rematerializeable helps. If we do that, the register allocator will try to keep the address in a register, instead of rematerialize it each time. It doesn't last across a call, though. |
There are two problems here that I am aware of. First, when loading a constant or static variable, the base address is always generated using 2 instructions in asm9.go. The relocated value for the upper part of the base address is loaded into r31 (temp register) then added into the relocated value for the lower part of the base address. Because it is done this way, the register allocator is not able to find the first value (in r31) in common, even though it frequently is reloading the same value over and over. This could be split out into 2 instructions in obj9.go (or elsewhere?) and I would be willing to experiment with this. The second problem is that these values are created as relocated values relative to the constant or static variable. The actual values aren't known until link time and I don't think they really look like the same value to the register allocator so it wouldn't know they are the same even if it had the information to try. We would need to do something like David Chase mentions in his post on Nov. 2016. Maybe I'm jumping to conclusions, but I thought maybe since it looks like a TOC is now always being generated that would help this problem. I thought all addresses of this type would be in the TOC and the offset within the TOC would be known so that the same base could be used when referencing them. Or maybe I'm making some incorrect assumptions on how this works. |
I agree - we'll need to put all the constants together in a group (or groups, if there are too many?), add an SSA instruction to get the base of that group, and have each constant load be an offset from its group base. There's no PC-relative load in powerpc? That would make life much easier. |
This is a place where a globally reserved SB register can be useful. Maybe we could do that now with the internal ABI. |
There is not, but I'm not actually sure it would help. cmd/compile has code to maintain R2 as a TOC pointer (it's needed for shared libraries) which could be globally enabled but the load instructions only have a 16 bit displacement and I don't think you know enough at this point of codegen to know if the base value can be shared between two loads? |
Power10 has PC relative loads and that seems to improve this. This problem has been around for a while and for < Power10 there doesn't seem to be a nice solution so I think this could be closed. |
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (
go version
)?go version devel +9fcbde7 Wed Sep 14 10:07:17 2016 -0500 linux/ppc64le
What operating system and processor architecture are you using (
go env
)?Ubuntu 16.04 ppc64le
What did you do?
Built the test from the math package, then inspected the objdump.
What did you expect to see?
Looking for opportunities for improvement.
What did you see instead?
In main.main, this code is generated:
111c8: 1f 00 e0 3f lis r31,31
111cc: 98 29 9f e8 ld r4,10648(r31)
111d0: 1f 00 e0 3f lis r31,31
111d4: 90 29 bf e8 ld r5,10640(r31)
111d8: 1f 00 e0 3f lis r31,31
111dc: e0 29 df e8 ld r6,10720(r31)
111e0: 1f 00 e0 3f lis r31,31
111e4: d8 29 ff e8 ld r7,10712(r31)
111e8: 1f 00 e0 3f lis r31,31
111ec: d0 29 1f e9 ld r8,10704(r31)
111f0: 1f 00 e0 3f lis r31,31
111f4: a0 29 3f e9 ld r9,10656(r31)
111f8: 1f 00 e0 3f lis r31,31
111fc: b0 29 5f e9 ld r10,10672(r31)
11200: 1f 00 e0 3f lis r31,31
11204: b8 29 7f e9 ld r11,10680(r31)
11208: 1f 00 e0 3f lis r31,31
1120c: c0 29 9f e9 ld r12,10688(r31)
r31 is reloaded each time with the same value. I believe this happens because this is a MOVD based on SB which implicitly generates two instructions, and uses r31 for the temp register and possibly that is not visible to the register allocator.
The text was updated successfully, but these errors were encountered: