cmd/compile: inefficient code generation on amd64 #40426

rokkerruslan · 2020-07-27T10:38:03Z

Probably, inefficient code generation on amd64.

Probably a slight regression in the performance of the resulting code beetween g.14.6 and master (8696ae8).

Preface

Examining the assembly code listing, I found out that the compiler generates sub-optimal code for some part of the code.
I will not describe the whole journey. I found a code (code from benchmark game) illustrating the problem and will use it for an example.

All source code: https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/nbody-go-3.html

Let's pay attention to these lines (60, 61, 62), it's "advance" func:

dx := sys.s[i].x - sys.s[j].x
dy := sys.s[i].y - sys.s[j].y
dz := sys.s[i].z - sys.s[j].z

Let's take a look at the result of the code generation:

  main.go:60    LEAQ 0(CX)(CX*2), DI         // DI = i, BX = j
  main.go:60    LEAQ main.sys+160(SB), R8

  main.go:60    MOVSD_XMM 0(R8)(DI*8), X4    // load x +0x0
  main.go:60    LEAQ 0(BX)(BX*2), R9         // calculate offset for .x
  main.go:60    LEAQ 0(R8)(R9*8), R10        // sys.s[j] + offset
  main.go:60    SUBSD 0(R10), X4

  main.go:61    MOVSD_XMM 0x8(R8)(DI*8), X5  // load y +0x8
  main.go:61    LEAQ 0(R8)(R9*8), R10
  main.go:61    LEAQ 0x8(R10), R10
  main.go:61    SUBSD 0(R10), X5

  main.go:62    MOVSD_XMM 0x10(R8)(DI*8), X6 // load z +0x10
  main.go:62    LEAQ 0(R8)(R9*8), DI
  main.go:62    LEAQ 0x10(DI), DI
  main.go:62    SUBSD 0(DI), X6

We need one instruction to load the value of the left operand and two to calculate the address of the right one.

I rolled back to version go1.14 and found that the codegen there is different. For the same lines, the following code will be generated:

  main.go:60    0x49f77d    LEAQ 0(CX)(CX*2), DI              // DI = 0
  main.go:60    0x49f781    LEAQ main.sys+160(SB), R8         // R8 = Positions addr

  main.go:60    0x49f788    MOVSD_XMM 0(R8)(DI*8), X4         // load sys.s[i].x
  main.go:60    0x49f78e    LEAQ 0(BX)(BX*2), R9
  main.go:60    0x49f792    MOVSD_XMM 0(R8)(R9*8), X5         // load sys.s[j].x
  main.go:60    0x49f798    SUBSD X5, X4

  main.go:61    0x49f79c    MOVSD_XMM 0x8(R8)(DI*8), X5
  main.go:61    0x49f7a3    MOVSD_XMM 0x8(R8)(R9*8), X6       // dy := sys.s[i].y - sys.s[j].y
  main.go:61    0x49f7aa    SUBSD X6, X5

  main.go:62    0x49f7ae    MOVSD_XMM 0x10(R8)(DI*8), X6
  main.go:62    0x49f7b5    MOVSD_XMM 0x10(R8)(R9*8), X7      // dz := sys.s[i].z - sys.s[j].z
  main.go:62    0x49f7bc    SUBSD X7, X6

It uses one instruction to load both the left and right side of the expression.

How does it affect code performance?

It depends on how often such a code generation pattern is used in the code and in how hot the pieces of code are.
But we can write benchmark:

package main

import (
    "testing"
)

func BenchmarkT(b *testing.B) {
    offsetMomentum()
    c := 500000

    b.ResetTimer()
    for n := 0; n < b.N; n++ {
        for i := 0; i < c; i++ {
            advance(0.01)
        }
    }

    _ = energy()
}

Results:

$ bin for i in {1..20}; do go.1.14 test -bench=. >> old.txt; done
$ bin for i in {1..20}; do go.master test -bench=. >> new.txt; done
$ bin benchstat old.txt new.txt
name  old time/op  new time/op  delta
T-4   48.9ms ± 5%  54.5ms ±10%  +11.46%  (p=0.000 n=19+19)

When did the issue start?

git bisect report beetween master(8696ae8) and go1.14:

... skip all steps

98cb76799c3053e779c4e1b61bb50705d25dd77f is the first bad commit
commit 98cb76799c3053e779c4e1b61bb50705d25dd77f
Author: Keith Randall <khr@golang.org>
Date:   Thu Jan 30 10:17:01 2020 -0800

    cmd/compile: insert complicated x86 addressing modes as a separate pass

    Use a separate compiler pass to introduce complicated x86 addressing
    modes.  Loads in the normal architecture rules (for x86 and all other
    platforms) can have constant offsets (AuxInt values) and symbols (Aux
    values), but no more.

    The complex addressing modes (x+y, x+2*y, etc.) are introduced in a
    separate pass that combines loads with LEAQx ops.

    Organizing rewrites this way simplifies the number of rewrites
    required, as there are lots of different rule orderings that have to
    be specified to ensure these complex addressing modes are always found
    if they are possible.

    Update #36468

    Change-Id: I5b4bf7b03a1e731d6dfeb9ef19b376175f3b4b44
    Reviewed-on: https://go-review.googlesource.com/c/go/+/217097
    Run-TryBot: Keith Randall <khr@golang.org>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
    Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>

 src/cmd/compile/internal/ssa/addressingmodes.go |   154 +
 src/cmd/compile/internal/ssa/compile.go         |     1 +
 src/cmd/compile/internal/ssa/gen/AMD64.rules    |   699 +-
 src/cmd/compile/internal/ssa/rewrite.go         |    40 +
 src/cmd/compile/internal/ssa/rewriteAMD64.go    | 10427 ++++++----------------
 test/codegen/memops.go                          |    88 +
 6 files changed, 3233 insertions(+), 8176 deletions(-)
 create mode 100644 src/cmd/compile/internal/ssa/addressingmodes.go

Issue linked with changes - #36468

/cc @randall77

The text was updated successfully, but these errors were encountered:

davecheney · 2020-07-27T10:42:50Z

/cc @randall77

randall77 · 2020-07-27T16:40:21Z

Yes, looks like we need corresponding float support. I have a CL.

gopherbot · 2020-07-27T16:57:58Z

Change https://golang.org/cl/244859 mentions this issue: cmd/compile: add floating point load+op operations to addressing modes pass

randall77 · 2020-07-27T17:00:40Z

With that CL you should get better code than even 1.14:

        0x000d 00013 (/Users/khr/gowork/issue40426.go:69)       LEAQ    (CX)(CX*2), DI
        0x0011 00017 (/Users/khr/gowork/issue40426.go:69)       LEAQ    "".sys+160(SB), R8
        0x0018 00024 (/Users/khr/gowork/issue40426.go:69)       MOVSD   (R8)(DI*8), X4
        0x001e 00030 (/Users/khr/gowork/issue40426.go:69)       LEAQ    (BX)(BX*2), R9
        0x0022 00034 (/Users/khr/gowork/issue40426.go:69)       SUBSD   (R8)(R9*8), X4
        0x0028 00040 (/Users/khr/gowork/issue40426.go:70)       MOVSD   8(R8)(DI*8), X5
        0x002f 00047 (/Users/khr/gowork/issue40426.go:70)       SUBSD   8(R8)(R9*8), X5
        0x0036 00054 (/Users/khr/gowork/issue40426.go:71)       MOVSD   16(R8)(DI*8), X6
        0x003d 00061 (/Users/khr/gowork/issue40426.go:71)       SUBSD   16(R8)(R9*8), X6

Not sure if it would be faster. Please try it and let me know.

rokkerruslan · 2020-07-28T05:42:02Z

Yes, that CL improve stats:

$ benchstat regression.txt fix.txt 
name  old time/op  new time/op  delta
T-4   54.5ms ±10%  47.7ms ± 0%  -12.44%  (p=0.000 n=19+20)`

And compare with go1.14:

$ benchstat go.14.txt fix.txt 
name  old time/op  new time/op  delta
T-4   48.9ms ± 5%  47.7ms ± 0%  -2.41%  (p=0.000 n=19+20)

ALTree added NeedsInvestigation Performance labels Jul 27, 2020

gopherbot closed this as completed in c4fed25 Jul 27, 2020

golang locked and limited conversation to collaborators Jul 28, 2021

gopherbot added the FrozenDueToAge label Jul 28, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: inefficient code generation on amd64 #40426

cmd/compile: inefficient code generation on amd64 #40426

rokkerruslan commented Jul 27, 2020 •

edited

Loading

davecheney commented Jul 27, 2020

randall77 commented Jul 27, 2020

gopherbot commented Jul 27, 2020

randall77 commented Jul 27, 2020

rokkerruslan commented Jul 28, 2020

cmd/compile: inefficient code generation on amd64 #40426

cmd/compile: inefficient code generation on amd64 #40426

Comments

rokkerruslan commented Jul 27, 2020 • edited Loading

Probably, inefficient code generation on amd64.

Preface

How does it affect code performance?

When did the issue start?

davecheney commented Jul 27, 2020

randall77 commented Jul 27, 2020

gopherbot commented Jul 27, 2020

randall77 commented Jul 27, 2020

rokkerruslan commented Jul 28, 2020

rokkerruslan commented Jul 27, 2020 •

edited

Loading