cmd/compile: performance regression on ppc64 due to change in scheduling CL 270940 #57976

laboger · 2023-01-24T20:02:54Z

What version of Go are you using (`go version`)?

$ go version
latest

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (`go env`)?

go env Output

$ go env
ppc64le

What did you do?

I noticed some regressions in performance for a few benchmarks since the scheduling change was made in CL 270940.

What did you expect to see?

Same or better performance.

What did you see instead?

10-20% degradation for some crypto benchmarks. So far I've mostly looked in the crypto package, but I'm still looking.
This example is from crypto/internal/bigmod when comparing the commit before the CL was merged against the latest. I verified the degradation happened with CL 270940, and ran against latest to make sure it hadn't been fixed yet.

name            old time/op  new time/op   delta
ModAdd           119ns ± 0%    133ns ± 0%  +12.21%  (p=0.029 n=4+4)
ModSub          87.1ns ± 0%  103.8ns ± 0%  +19.09%  (p=0.029 n=4+4)
MontgomeryRepr  2.50µs ± 0%   2.61µs ± 0%   +4.10%  (p=0.029 n=4+4)
MontgomeryMul   2.50µs ± 0%   2.62µs ± 0%   +5.06%  (p=0.029 n=4+4)
ModMul          5.00µs ± 0%   5.22µs ± 0%   +4.26%  (p=0.029 n=4+4)
ExpBig          5.74ms ± 0%   5.84ms ± 0%   +1.78%  (p=0.029 n=4+4)
Exp             6.64ms ± 0%   7.01ms ± 0%   +5.54%  (p=0.029 n=4+4)

The degradation in Add and Sub are due to a change in the sub loop:
Before:

  0x11e8e8              48000044                BR 0x11e92c                          // b 0x11e92c      
                res := xLimbs[i] - yLimbs[i] - c
  0x11e8ec              78681f24                RLDICR R3,$3,$60,R8                  // rldicr r8,r3,3,60       
  0x11e8f0              7d474214                ADD R8,R7,R10                        // add r10,r7,r8           
  0x11e8f4              e96a0000                MOVD 0(R10),R11                      // ld r11,0(r10)           
  0x11e8f8              7d08482a                MOVD (R9)(R8),R8                     // ldx r8,r8,r9            
  0x11e8fc              7d085850                SUB R8,R11,R8                        // subf r8,r8,r11          
  0x11e900              7d054050                SUB R5,R8,R8                         // subf r8,r5,r8           
                xLimbs[i] = ctSelect(on, res&_MASK, xLimbs[i])
  0x11e904              3d800019                ADDIS $0,$25,R12                     // lis r12,25              
  0x11e908              e98c3938                MOVD 14648(R12),R12                  // ld r12,14648(r12)       
  0x11e90c              7d8e4038                AND R8,R12,R14                       // and r14,r12,r8          
        mask := -uint(on)
  0x11e910              7de400d0                NEG R4,R15                           // neg r15,r4      
        return y ^ (mask & (y ^ x))
  0x11e914              7dce5a78                XOR R11,R14,R14                      // xor r14,r14,r11 
  0x11e918              7dee7038                AND R14,R15,R14                      // and r14,r15,r14 
  0x11e91c              7dcb5a78                XOR R11,R14,R11                      // xor r11,r14,r11 
                xLimbs[i] = ctSelect(on, res&_MASK, xLimbs[i])
  0x11e920              f96a0000                MOVD R11,0(R10)                      // std r11,0(r10)  
        for i := 0; i < size; i++ {
  0x11e924              38630001                ADD R3,$1,R3                         // addi r3,r3,1    
                c = res >> _W
  0x11e928              79050fe0                RLDICL R8,$1,$63,R5                  // rldicl r5,r8,1,63       
        for i := 0; i < size; i++ {
  0x11e92c              7c233000                CMP R3,R6                            // cmpd r3,r6      
  0x11e930              4180ffbc                BLT 0x11e8ec                         // blt 0x11e8ec

After:

  0x11eda8              48000048                BR 0x11edf0                          // b 0x11edf0      
  0x11edac              39030001                ADD R3,$1,R8                         // addi r8,r3,1    
                res := xLimbs[i] - yLimbs[i] - c
  0x11edb0              786a1f24                RLDICR R3,$3,$60,R10                 // rldicr r10,r3,3,60      
  0x11edb4              7d675214                ADD R10,R7,R11                       // add r11,r7,r10          
  0x11edb8              e98b0000                MOVD 0(R11),R12                      // ld r12,0(r11)           
  0x11edbc              7d4a482a                MOVD (R9)(R10),R10                   // ldx r10,r10,r9          
  0x11edc0              7d4a6050                SUB R10,R12,R10                      // subf r10,r10,r12        
  0x11edc4              7d455050                SUB R5,R10,R10                       // subf r10,r5,r10         
                xLimbs[i] = ctSelect(on, res&_MASK, xLimbs[i])
  0x11edc8              3dc00019                ADDIS $0,$25,R14                     // lis r14,25              
  0x11edcc              e9ce3a50                MOVD 14928(R14),R14                  // ld r14,14928(r14)       
  0x11edd0              7dcf5038                AND R10,R14,R15                      // and r15,r14,r10         
                c = res >> _W
  0x11edd4              79450fe0                RLDICL R10,$1,$63,R5                 // rldicl r5,r10,1,63      
        mask := -uint(on)
  0x11edd8              7d4400d0                NEG R4,R10                           // neg r10,r4      
        return y ^ (mask & (y ^ x))
  0x11eddc              7def6278                XOR R12,R15,R15                      // xor r15,r15,r12 
  0x11ede0              7d4a7838                AND R15,R10,R10                      // and r10,r10,r15 
  0x11ede4              7d4a6278                XOR R12,R10,R10                      // xor r10,r10,r12 
                xLimbs[i] = ctSelect(on, res&_MASK, xLimbs[i])
  0x11ede8              f94b0000                MOVD R10,0(R11)                      // std r10,0(r11)  
        for i := 0; i < size; i++ {
  0x11edec              7d034378                OR R8,R8,R3                          // or r3,r8,r8     
  0x11edf0              7c233000                CMP R3,R6                            // cmpd r3,r6      
  0x11edf4              4180ffb8                BLT 0x11edac                         // blt 0x11edac

In the latest code, the increment of r3 is at the top of the loop the result is put into r8, but then moves it back to r3 at the bottom even though r3 is never clobbered in the loop. In the previous code r3 was incremented at the bottom of the loop and stayed in r3 throughout the loop. I also see that the shift of res is in a different place although I think the degradation is due to using another register and moving it.

The text was updated successfully, but these errors were encountered:

randall77 · 2023-01-24T22:35:44Z

I've seen problems like this before, where the loop increment is scheduled before all the uses of the pre-incremented value have been scheduled, so it needs to put the incremented value into a different register and move it back at the end of the loop.
In fact it is worse on x86 because it is 2-arg assembly.

For example, see here: #16122 (comment)

The new scheduler might make it easier to solve this for good. I will ponder.

randall77 · 2023-02-01T17:14:43Z

@laboger If you could please try https://go-review.googlesource.com/c/go/+/463751 and see if it helps.

laboger · 2023-02-01T18:20:01Z

@randall77 Yes that generates code the way it did before the scheduling change and gets the previous performance.

gopherbot · 2023-02-01T18:41:01Z

Change https://go.dev/cl/463751 mentions this issue: cmd/compile: schedule values with no in-block uses later

When scheduling a block, deprioritize values whose results aren't used until subsequent blocks. For golang#58166, this has the effect of pushing the induction variable increment to the end of the block, past all the other uses of the pre-incremented value. Do this only with optimizations on. Debuggers have a preference for values in source code order, which this CL can degrade. Fixes golang#58166 Fixes golang#57976 Change-Id: I40d5885c661b142443c6d4702294c8abe8026c4f Reviewed-on: https://go-review.googlesource.com/c/go/+/463751 Run-TryBot: Keith Randall <khr@golang.org> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Keith Randall <khr@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>

gopherbot added the compiler/runtime label Jan 24, 2023

prattmic added this to Go Compiler / Runtime Jan 25, 2023

mknyszek assigned randall77 Jan 30, 2023

mknyszek added the NeedsInvestigation label Jan 30, 2023

mknyszek added this to the Go1.21 milestone Jan 30, 2023

mknyszek added Performance arch-ppc64x labels Jan 30, 2023

randall77 mentioned this issue Feb 1, 2023

cmd/compile: multiple performance regressions due to CL 270940, 448036, or 454036 #58166

Open

gopherbot closed this as completed in 6224db9 Feb 1, 2023

github-project-automation bot moved this to Done in Go Compiler / Runtime Feb 1, 2023

mknyszek removed this from Go Compiler / Runtime Apr 19, 2023

golang locked and limited conversation to collaborators Feb 1, 2024

gopherbot added the FrozenDueToAge label Feb 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: performance regression on ppc64 due to change in scheduling CL 270940 #57976

cmd/compile: performance regression on ppc64 due to change in scheduling CL 270940 #57976

laboger commented Jan 24, 2023

randall77 commented Jan 24, 2023

randall77 commented Feb 1, 2023

laboger commented Feb 1, 2023

gopherbot commented Feb 1, 2023

cmd/compile: performance regression on ppc64 due to change in scheduling CL 270940 #57976

cmd/compile: performance regression on ppc64 due to change in scheduling CL 270940 #57976

Comments

laboger commented Jan 24, 2023

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

What did you do?

What did you expect to see?

What did you see instead?

randall77 commented Jan 24, 2023

randall77 commented Feb 1, 2023

laboger commented Feb 1, 2023

gopherbot commented Feb 1, 2023

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?