Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: performance regression on ppc64 due to change in scheduling CL 270940 #57976

Closed
laboger opened this issue Jan 24, 2023 · 4 comments
Assignees
Labels
arch-ppc64x compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@laboger
Copy link
Contributor

laboger commented Jan 24, 2023

What version of Go are you using (go version)?

$ go version
latest

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
ppc64le

What did you do?

I noticed some regressions in performance for a few benchmarks since the scheduling change was made in CL 270940.

What did you expect to see?

Same or better performance.

What did you see instead?

10-20% degradation for some crypto benchmarks. So far I've mostly looked in the crypto package, but I'm still looking.
This example is from crypto/internal/bigmod when comparing the commit before the CL was merged against the latest. I verified the degradation happened with CL 270940, and ran against latest to make sure it hadn't been fixed yet.

name            old time/op  new time/op   delta
ModAdd           119ns ± 0%    133ns ± 0%  +12.21%  (p=0.029 n=4+4)
ModSub          87.1ns ± 0%  103.8ns ± 0%  +19.09%  (p=0.029 n=4+4)
MontgomeryRepr  2.50µs ± 0%   2.61µs ± 0%   +4.10%  (p=0.029 n=4+4)
MontgomeryMul   2.50µs ± 0%   2.62µs ± 0%   +5.06%  (p=0.029 n=4+4)
ModMul          5.00µs ± 0%   5.22µs ± 0%   +4.26%  (p=0.029 n=4+4)
ExpBig          5.74ms ± 0%   5.84ms ± 0%   +1.78%  (p=0.029 n=4+4)
Exp             6.64ms ± 0%   7.01ms ± 0%   +5.54%  (p=0.029 n=4+4)

The degradation in Add and Sub are due to a change in the sub loop:
Before:

  0x11e8e8              48000044                BR 0x11e92c                          // b 0x11e92c      
                res := xLimbs[i] - yLimbs[i] - c
  0x11e8ec              78681f24                RLDICR R3,$3,$60,R8                  // rldicr r8,r3,3,60       
  0x11e8f0              7d474214                ADD R8,R7,R10                        // add r10,r7,r8           
  0x11e8f4              e96a0000                MOVD 0(R10),R11                      // ld r11,0(r10)           
  0x11e8f8              7d08482a                MOVD (R9)(R8),R8                     // ldx r8,r8,r9            
  0x11e8fc              7d085850                SUB R8,R11,R8                        // subf r8,r8,r11          
  0x11e900              7d054050                SUB R5,R8,R8                         // subf r8,r5,r8           
                xLimbs[i] = ctSelect(on, res&_MASK, xLimbs[i])
  0x11e904              3d800019                ADDIS $0,$25,R12                     // lis r12,25              
  0x11e908              e98c3938                MOVD 14648(R12),R12                  // ld r12,14648(r12)       
  0x11e90c              7d8e4038                AND R8,R12,R14                       // and r14,r12,r8          
        mask := -uint(on)
  0x11e910              7de400d0                NEG R4,R15                           // neg r15,r4      
        return y ^ (mask & (y ^ x))
  0x11e914              7dce5a78                XOR R11,R14,R14                      // xor r14,r14,r11 
  0x11e918              7dee7038                AND R14,R15,R14                      // and r14,r15,r14 
  0x11e91c              7dcb5a78                XOR R11,R14,R11                      // xor r11,r14,r11 
                xLimbs[i] = ctSelect(on, res&_MASK, xLimbs[i])
  0x11e920              f96a0000                MOVD R11,0(R10)                      // std r11,0(r10)  
        for i := 0; i < size; i++ {
  0x11e924              38630001                ADD R3,$1,R3                         // addi r3,r3,1    
                c = res >> _W
  0x11e928              79050fe0                RLDICL R8,$1,$63,R5                  // rldicl r5,r8,1,63       
        for i := 0; i < size; i++ {
  0x11e92c              7c233000                CMP R3,R6                            // cmpd r3,r6      
  0x11e930              4180ffbc                BLT 0x11e8ec                         // blt 0x11e8ec    

After:

  0x11eda8              48000048                BR 0x11edf0                          // b 0x11edf0      
  0x11edac              39030001                ADD R3,$1,R8                         // addi r8,r3,1    
                res := xLimbs[i] - yLimbs[i] - c
  0x11edb0              786a1f24                RLDICR R3,$3,$60,R10                 // rldicr r10,r3,3,60      
  0x11edb4              7d675214                ADD R10,R7,R11                       // add r11,r7,r10          
  0x11edb8              e98b0000                MOVD 0(R11),R12                      // ld r12,0(r11)           
  0x11edbc              7d4a482a                MOVD (R9)(R10),R10                   // ldx r10,r10,r9          
  0x11edc0              7d4a6050                SUB R10,R12,R10                      // subf r10,r10,r12        
  0x11edc4              7d455050                SUB R5,R10,R10                       // subf r10,r5,r10         
                xLimbs[i] = ctSelect(on, res&_MASK, xLimbs[i])
  0x11edc8              3dc00019                ADDIS $0,$25,R14                     // lis r14,25              
  0x11edcc              e9ce3a50                MOVD 14928(R14),R14                  // ld r14,14928(r14)       
  0x11edd0              7dcf5038                AND R10,R14,R15                      // and r15,r14,r10         
                c = res >> _W
  0x11edd4              79450fe0                RLDICL R10,$1,$63,R5                 // rldicl r5,r10,1,63      
        mask := -uint(on)
  0x11edd8              7d4400d0                NEG R4,R10                           // neg r10,r4      
        return y ^ (mask & (y ^ x))
  0x11eddc              7def6278                XOR R12,R15,R15                      // xor r15,r15,r12 
  0x11ede0              7d4a7838                AND R15,R10,R10                      // and r10,r10,r15 
  0x11ede4              7d4a6278                XOR R12,R10,R10                      // xor r10,r10,r12 
                xLimbs[i] = ctSelect(on, res&_MASK, xLimbs[i])
  0x11ede8              f94b0000                MOVD R10,0(R11)                      // std r10,0(r11)  
        for i := 0; i < size; i++ {
  0x11edec              7d034378                OR R8,R8,R3                          // or r3,r8,r8     
  0x11edf0              7c233000                CMP R3,R6                            // cmpd r3,r6      
  0x11edf4              4180ffb8                BLT 0x11edac                         // blt 0x11edac    

In the latest code, the increment of r3 is at the top of the loop the result is put into r8, but then moves it back to r3 at the bottom even though r3 is never clobbered in the loop. In the previous code r3 was incremented at the bottom of the loop and stayed in r3 throughout the loop. I also see that the shift of res is in a different place although I think the degradation is due to using another register and moving it.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jan 24, 2023
@randall77
Copy link
Contributor

I've seen problems like this before, where the loop increment is scheduled before all the uses of the pre-incremented value have been scheduled, so it needs to put the incremented value into a different register and move it back at the end of the loop.
In fact it is worse on x86 because it is 2-arg assembly.

For example, see here: #16122 (comment)

The new scheduler might make it easier to solve this for good. I will ponder.

@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 30, 2023
@mknyszek mknyszek added this to the Go1.21 milestone Jan 30, 2023
@randall77
Copy link
Contributor

@laboger If you could please try https://go-review.googlesource.com/c/go/+/463751 and see if it helps.

@laboger
Copy link
Contributor Author

laboger commented Feb 1, 2023

@randall77 Yes that generates code the way it did before the scheduling change and gets the previous performance.

@gopherbot
Copy link

Change https://go.dev/cl/463751 mentions this issue: cmd/compile: schedule values with no in-block uses later

johanbrandhorst pushed a commit to Pryz/go that referenced this issue Feb 12, 2023
When scheduling a block, deprioritize values whose results aren't used
until subsequent blocks.

For golang#58166, this has the effect of pushing the induction variable increment
to the end of the block, past all the other uses of the pre-incremented value.

Do this only with optimizations on. Debuggers have a preference for values
in source code order, which this CL can degrade.

Fixes golang#58166
Fixes golang#57976

Change-Id: I40d5885c661b142443c6d4702294c8abe8026c4f
Reviewed-on: https://go-review.googlesource.com/c/go/+/463751
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@golang golang locked and limited conversation to collaborators Feb 1, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-ppc64x compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
None yet
Development

No branches or pull requests

4 participants