cmd/compile: avoid reloads of temp register on ppc64le, others #17110

laboger · 2016-09-14T21:15:17Z

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (`go version`)?

go version devel +9fcbde7 Wed Sep 14 10:07:17 2016 -0500 linux/ppc64le

What operating system and processor architecture are you using (`go env`)?

Ubuntu 16.04 ppc64le

What did you do?

Built the test from the math package, then inspected the objdump.

What did you expect to see?

Looking for opportunities for improvement.

What did you see instead?

In main.main, this code is generated:
111c8: 1f 00 e0 3f lis r31,31
111cc: 98 29 9f e8 ld r4,10648(r31)
111d0: 1f 00 e0 3f lis r31,31
111d4: 90 29 bf e8 ld r5,10640(r31)
111d8: 1f 00 e0 3f lis r31,31
111dc: e0 29 df e8 ld r6,10720(r31)
111e0: 1f 00 e0 3f lis r31,31
111e4: d8 29 ff e8 ld r7,10712(r31)
111e8: 1f 00 e0 3f lis r31,31
111ec: d0 29 1f e9 ld r8,10704(r31)
111f0: 1f 00 e0 3f lis r31,31
111f4: a0 29 3f e9 ld r9,10656(r31)
111f8: 1f 00 e0 3f lis r31,31
111fc: b0 29 5f e9 ld r10,10672(r31)
11200: 1f 00 e0 3f lis r31,31
11204: b8 29 7f e9 ld r11,10680(r31)
11208: 1f 00 e0 3f lis r31,31
1120c: c0 29 9f e9 ld r12,10688(r31)

r31 is reloaded each time with the same value. I believe this happens because this is a MOVD based on SB which implicitly generates two instructions, and uses r31 for the temp register and possibly that is not visible to the register allocator.

The text was updated successfully, but these errors were encountered:

mwhudson · 2016-10-05T22:02:12Z

Yeah, this happens because a single AMOVD becomes two real instructions so there's not really a stage where this can be fixed in the current way of working. I guess the move needs to be decomposed earlier.

I think a similar issue affects ARM64 (and probably the other fixed-width instruction architectures).

dr2chase · 2016-10-06T00:19:55Z

Did this very thing on Sparc decades ago; the inner loop of fpppp (which was huge) had 42 different "base registers" which was of course far more than the number of registers.

dr2chase · 2016-11-02T19:46:26Z

I'd like to push this off until 1.9. Note that this is problem occurs on other architectures, not just PPC.

laboger · 2016-11-02T20:18:02Z

OK, so do you want the title changed?

dr2chase · 2016-11-02T20:52:57Z

That would be fine. Any architecture with load/store instructions using a register + small immediate to form the address is eligible for this optimization. I suspect it might need linker or at least assembler assistance, since we'll be emitting relocations along the lines of

  base := &global1
  x := *(base + (&global2-&global1))

and we'd want to usually guess right at the rough magnitude of the difference of those two globals.

laboger · 2017-07-07T19:18:08Z

Where this problem shows up most often is when referencing constants, which is common in many of functions in the math package. It seems that with constants, they should all be in rodata, and then the base could be the start of the rodata (named: runtime.rodata). Then the offset of the constant from the start of rodata would be known at compile time.

Likewise something similar could be done for static data.

laboger · 2017-09-19T13:40:08Z

@dr2chase I'd like to try and look at this further and try to get it working. If you have some clues or pointers that would be helpful.

I can see that linker assistance is needed but not sure how to do that.
I see that the base register needs to be split from the load of the constant because now they are always done together, but where does that need to be done so that common base registers can be recognized.
Any help appreciated...

gopherbot · 2017-10-04T19:36:23Z

Change https://golang.org/cl/68250 mentions this issue: cmd/compile: don't fold address of global into load/store on PPC64

On PPC64 (and a few other architectures), accessing global requires multiple instructions and use of temp register. The compiler emits a single MOV prog, and the assembler expands it to multiple instructions. If globals are accessed multiple times, each time it generates a reload of the temp register. As this is done by the assembler, the compiler cannot optimize it. This CL makes the compiler not fold address of global into load and store. If a global is accessed multiple times, or multiple fields of a struct are accessed, the compiler can CSE the address. Currently, this doesn't help the case where different globals are accessed, even though they may be close to each other in the address space (which we don't know at compile time). It helps a little bit in go1 benchmark: name old time/op new time/op delta BinaryTree17-2 4.84s ± 1% 4.84s ± 1% ~ (p=0.796 n=10+10) Fannkuch11-2 4.10s ± 0% 4.08s ± 0% -0.58% (p=0.000 n=9+8) FmtFprintfEmpty-2 97.9ns ± 1% 96.8ns ± 1% -1.08% (p=0.000 n=10+10) FmtFprintfString-2 147ns ± 0% 147ns ± 1% ~ (p=0.129 n=9+10) FmtFprintfInt-2 152ns ± 0% 152ns ± 0% ~ (p=0.294 n=10+8) FmtFprintfIntInt-2 218ns ± 1% 217ns ± 0% -0.64% (p=0.000 n=10+8) FmtFprintfPrefixedInt-2 263ns ± 1% 256ns ± 0% -2.77% (p=0.000 n=10+8) FmtFprintfFloat-2 375ns ± 1% 368ns ± 0% -1.95% (p=0.000 n=10+7) FmtManyArgs-2 849ns ± 0% 850ns ± 0% ~ (p=0.621 n=8+9) GobDecode-2 12.3ms ± 1% 12.2ms ± 1% -0.94% (p=0.003 n=10+10) GobEncode-2 10.3ms ± 1% 10.5ms ± 1% +2.03% (p=0.000 n=10+10) Gzip-2 414ms ± 1% 414ms ± 0% ~ (p=0.842 n=9+10) Gunzip-2 66.3ms ± 0% 66.4ms ± 0% ~ (p=0.077 n=9+9) HTTPClientServer-2 66.3µs ± 5% 66.4µs ± 1% ~ (p=0.661 n=10+9) JSONEncode-2 23.9ms ± 1% 23.9ms ± 1% ~ (p=0.905 n=10+9) JSONDecode-2 119ms ± 1% 116ms ± 0% -2.65% (p=0.000 n=10+10) Mandelbrot200-2 5.11ms ± 0% 4.92ms ± 0% -3.71% (p=0.000 n=10+10) GoParse-2 5.81ms ± 1% 5.84ms ± 1% ~ (p=0.052 n=10+10) RegexpMatchEasy0_32-2 315ns ± 0% 317ns ± 0% +0.67% (p=0.000 n=10+10) RegexpMatchEasy0_1K-2 658ns ± 0% 638ns ± 0% -3.01% (p=0.000 n=9+9) RegexpMatchEasy1_32-2 315ns ± 1% 317ns ± 0% +0.56% (p=0.000 n=9+9) RegexpMatchEasy1_1K-2 935ns ± 0% 926ns ± 0% -0.96% (p=0.000 n=9+9) RegexpMatchMedium_32-2 394ns ± 0% 396ns ± 1% +0.46% (p=0.001 n=10+10) RegexpMatchMedium_1K-2 65.1µs ± 0% 64.5µs ± 0% -0.90% (p=0.000 n=9+9) RegexpMatchHard_32-2 3.16µs ± 0% 3.17µs ± 0% +0.35% (p=0.000 n=10+9) RegexpMatchHard_1K-2 89.4µs ± 0% 89.3µs ± 0% ~ (p=0.136 n=9+9) Revcomp-2 703ms ± 2% 694ms ± 2% -1.41% (p=0.009 n=10+10) Template-2 107ms ± 1% 107ms ± 1% ~ (p=0.053 n=9+10) TimeParse-2 526ns ± 0% 524ns ± 0% -0.34% (p=0.002 n=9+9) TimeFormat-2 534ns ± 0% 504ns ± 1% -5.51% (p=0.000 n=10+10) [Geo mean] 93.8µs 93.1µs -0.70% It also helps in the case mentioned in issue #17110, main.main in package math's test. Now it generates 4 loads of R31 instead of 10, for the same piece of code. This causes a slight increase of binary size: cmd/go increases 0.66%. If this is a good idea, we should do it on other architectures where accessing global is expensive. Updates #17110. Change-Id: I2687af6eafc04f2a57c19781ec300c33567094b6 Reviewed-on: https://go-review.googlesource.com/68250 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>

laboger · 2020-04-09T19:49:39Z

@cherrymui @jeremyfaller Will there be support in the new linker to fix this problem? It still exists, mostly in the math package.

laboger · 2020-04-09T20:20:47Z

Here's an example of some code from math.sin. R31 is reloaded with the same value over and over.

                x = Sin(Pi * (x - 2))
  0x94cb4               3fe0001b                ADDIS $0,$27,R31                     // lis r31,27              
  0x94cb8               c85fba90                LFD -17776(R31),F2                   // lfd f2,-17776(r31)      
  0x94cbc               fc211028                FSUB F1,F2,F1                        // fsub f1,f1,f2           
  0x94cc0               3fe0001b                ADDIS $0,$27,R31                     // lis r31,27              
  0x94cc4               c85fbaf8                LFD -17672(R31),F2                   // lfd f2,-17672(r31)      
  0x94cc8               fc2100b2                FMUL F1,F2,F1                        // fmul f1,f1,f2           
  0x94ccc               d8210020                STFD 32(R1),F1                       // stfd f1,32(r1)          
  0x94cd0               480027f1                CALL math.Sin(SB)                    // bl 0x974c0              
  0x94cd4               c8210028                LFD 40(R1),F1                        // lfd f1,40(r1)           
        return -x
  0x94cd8               fc200850                FNEG F1,F1                           // fneg f1,f1      
  0x94cdc               d8210060                STFD 96(R1),F1                       // stfd f1,96(r1)  
  0x94ce0               ebe10000                MOVD 0(R1),R31                       // ld r31,0(r1)    
  0x94ce4               7fe803a6                MOVD R31,LR                          // mtlr r31        
  0x94ce8               38210038                ADD R1,$56,R1                        // addi r1,r1,56   
  0x94cec               4e800020                RET                                  // bclr 20,lt,0    
                x = Cos(Pi * (0.5 - x))
  0x94cf0               3fe0001b                ADDIS $0,$27,R31                     // lis r31,27              
  0x94cf4               c85fb8e0                LFD -18208(R31),F2                   // lfd f2,-18208(r31)      
  0x94cf8               fc220828                FSUB F2,F1,F1                        // fsub f1,f2,f1           
  0x94cfc               3fe0001b                ADDIS $0,$27,R31                     // lis r31,27              
  0x94d00               c85fbaf8                LFD -17672(R31),F2                   // lfd f2,-17672(r31)      
  0x94d04               fc2100b2                FMUL F1,F2,F1                        // fmul f1,f1,f2           
  0x94d08               d8210020                STFD 32(R1),F1                       // stfd f1,32(r1)          
  0x94d0c               480027d5                CALL math.Cos(SB)                    // bl 0x974e0              
  0x94d10               c8210028                LFD 40(R1),F1                        // lfd f1,40(r1)           
  0x94d14               4bffffc4                BR 0x94cd8                           // b 0x94cd8               
                x = Sin(Pi * x)
  0x94d18               3fe0001b                ADDIS $0,$27,R31                     // lis r31,27              
  0x94d1c               c85fbaf8                LFD -17672(R31),F2                   // lfd f2,-17672(r31)      
  0x94d20               fc2100b2                FMUL F1,F2,F1                        // fmul f1,f1,f2           
  0x94d24               d8210020                STFD 32(R1),F1                       // stfd f1,32(r1)          
  0x94d28               48002799                CALL math.Sin(SB)                    // bl 0x974c0              
  0x94d2c               c8210028                LFD 40(R1),F1                        // lfd f1,40(r1)           
  0x94d30               4bffffa8                BR 0x94cd8                           // b 0x94cd8

cherrymui · 2020-04-10T00:13:23Z

I don't think the new linker does anything with this. This is all in the compiler.

I wonder if marking global address (i.e. a MOVDaddr with the operand being SB) not rematerializeable helps. If we do that, the register allocator will try to keep the address in a register, instead of rematerialize it each time. It doesn't last across a call, though.

laboger · 2020-04-10T14:31:46Z

There are two problems here that I am aware of.

First, when loading a constant or static variable, the base address is always generated using 2 instructions in asm9.go. The relocated value for the upper part of the base address is loaded into r31 (temp register) then added into the relocated value for the lower part of the base address. Because it is done this way, the register allocator is not able to find the first value (in r31) in common, even though it frequently is reloading the same value over and over. This could be split out into 2 instructions in obj9.go (or elsewhere?) and I would be willing to experiment with this.

The second problem is that these values are created as relocated values relative to the constant or static variable. The actual values aren't known until link time and I don't think they really look like the same value to the register allocator so it wouldn't know they are the same even if it had the information to try.

We would need to do something like David Chase mentions in his post on Nov. 2016. Maybe I'm jumping to conclusions, but I thought maybe since it looks like a TOC is now always being generated that would help this problem. I thought all addresses of this type would be in the TOC and the offset within the TOC would be known so that the same base could be used when referencing them. Or maybe I'm making some incorrect assumptions on how this works.

randall77 · 2020-04-10T16:34:57Z

I agree - we'll need to put all the constants together in a group (or groups, if there are too many?), add an SSA instruction to get the base of that group, and have each constant load be an offset from its group base.

There's no PC-relative load in powerpc? That would make life much easier.

cherrymui · 2020-04-11T16:52:46Z

This is a place where a globally reserved SB register can be useful. Maybe we could do that now with the internal ABI.

mwhudson · 2020-04-11T21:37:40Z

There's no PC-relative load in powerpc? That would make life much easier.

There is not, but I'm not actually sure it would help. cmd/compile has code to maintain R2 as a TOC pointer (it's needed for shared libraries) which could be globally enabled but the load instructions only have a 16 bit displacement and I don't think you know enough at this point of codegen to know if the base value can be shared between two loads?

laboger · 2023-08-09T18:37:50Z

Power10 has PC relative loads and that seems to improve this. This problem has been around for a while and for < Power10 there doesn't seem to be a nice solution so I think this could be closed.

quentinmit added the NeedsFix The path to resolution is known, but the work has not been done. label Oct 3, 2016

quentinmit added this to the Go1.8 milestone Oct 3, 2016

laboger mentioned this issue Oct 5, 2016

cmd/compile: Possible to synthesize base registers for PPC64 addressing? #17347

Closed

rsc modified the milestones: Go1.9, Go1.8 Nov 2, 2016

rsc changed the title ~~cmd/compile: avoid reloads of temp register on ppc64le~~ cmd/compile: avoid reloads of temp register on ppc64le, others Nov 2, 2016

randall77 modified the milestones: Go1.10, Go1.9 May 31, 2017

randall77 added the Performance label May 31, 2017

laboger mentioned this issue Oct 27, 2017

cmd/internal/obj/ppc64: assumes addresses of static data fit in 32 bits #22459

Closed

bradfitz modified the milestones: Go1.10, Go1.11 Nov 28, 2017

bradfitz modified the milestones: Go1.11, Unplanned May 18, 2018

laboger mentioned this issue Jun 27, 2018

cmd/compile: loads/constants not lifted out of loop #15808

Open

gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: avoid reloads of temp register on ppc64le, others #17110

cmd/compile: avoid reloads of temp register on ppc64le, others #17110

laboger commented Sep 14, 2016

mwhudson commented Oct 5, 2016

dr2chase commented Oct 6, 2016

dr2chase commented Nov 2, 2016

laboger commented Nov 2, 2016

dr2chase commented Nov 2, 2016

laboger commented Jul 7, 2017

laboger commented Sep 19, 2017

gopherbot commented Oct 4, 2017

laboger commented Apr 9, 2020

laboger commented Apr 9, 2020

cherrymui commented Apr 10, 2020

laboger commented Apr 10, 2020

randall77 commented Apr 10, 2020

cherrymui commented Apr 11, 2020

mwhudson commented Apr 11, 2020

laboger commented Aug 9, 2023

Navigation Menu

cmd/compile: avoid reloads of temp register on ppc64le, others #17110

cmd/compile: avoid reloads of temp register on ppc64le, others #17110

Comments

laboger commented Sep 14, 2016

What version of Go are you using (go version)?

What operating system and processor architecture are you using (go env)?

What did you do?

What did you expect to see?

What did you see instead?

mwhudson commented Oct 5, 2016

dr2chase commented Oct 6, 2016

dr2chase commented Nov 2, 2016

laboger commented Nov 2, 2016

dr2chase commented Nov 2, 2016

laboger commented Jul 7, 2017

laboger commented Sep 19, 2017

gopherbot commented Oct 4, 2017

laboger commented Apr 9, 2020

laboger commented Apr 9, 2020

cherrymui commented Apr 10, 2020

laboger commented Apr 10, 2020

randall77 commented Apr 10, 2020

cherrymui commented Apr 11, 2020

mwhudson commented Apr 11, 2020

laboger commented Aug 9, 2023

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?