Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: inefficient code generation on amd64 #40426

Closed
rokkerruslan opened this issue Jul 27, 2020 · 5 comments
Closed

cmd/compile: inefficient code generation on amd64 #40426

rokkerruslan opened this issue Jul 27, 2020 · 5 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance

Comments

@rokkerruslan
Copy link

rokkerruslan commented Jul 27, 2020

Probably, inefficient code generation on amd64.

Probably a slight regression in the performance of the resulting code beetween g.14.6 and master (8696ae8).

Preface

Examining the assembly code listing, I found out that the compiler generates sub-optimal code for some part of the code.
I will not describe the whole journey. I found a code (code from benchmark game) illustrating the problem and will use it for an example.

All source code: https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/nbody-go-3.html

Let's pay attention to these lines (60, 61, 62), it's "advance" func:

dx := sys.s[i].x - sys.s[j].x
dy := sys.s[i].y - sys.s[j].y
dz := sys.s[i].z - sys.s[j].z

Let's take a look at the result of the code generation:

  main.go:60    LEAQ 0(CX)(CX*2), DI         // DI = i, BX = j
  main.go:60    LEAQ main.sys+160(SB), R8

  main.go:60    MOVSD_XMM 0(R8)(DI*8), X4    // load x +0x0
  main.go:60    LEAQ 0(BX)(BX*2), R9         // calculate offset for .x
  main.go:60    LEAQ 0(R8)(R9*8), R10        // sys.s[j] + offset
  main.go:60    SUBSD 0(R10), X4

  main.go:61    MOVSD_XMM 0x8(R8)(DI*8), X5  // load y +0x8
  main.go:61    LEAQ 0(R8)(R9*8), R10
  main.go:61    LEAQ 0x8(R10), R10
  main.go:61    SUBSD 0(R10), X5

  main.go:62    MOVSD_XMM 0x10(R8)(DI*8), X6 // load z +0x10
  main.go:62    LEAQ 0(R8)(R9*8), DI
  main.go:62    LEAQ 0x10(DI), DI
  main.go:62    SUBSD 0(DI), X6

We need one instruction to load the value of the left operand and two to calculate the address of the right one.

I rolled back to version go1.14 and found that the codegen there is different. For the same lines, the following code will be generated:

  main.go:60    0x49f77d    LEAQ 0(CX)(CX*2), DI              // DI = 0
  main.go:60    0x49f781    LEAQ main.sys+160(SB), R8         // R8 = Positions addr

  main.go:60    0x49f788    MOVSD_XMM 0(R8)(DI*8), X4         // load sys.s[i].x
  main.go:60    0x49f78e    LEAQ 0(BX)(BX*2), R9
  main.go:60    0x49f792    MOVSD_XMM 0(R8)(R9*8), X5         // load sys.s[j].x
  main.go:60    0x49f798    SUBSD X5, X4

  main.go:61    0x49f79c    MOVSD_XMM 0x8(R8)(DI*8), X5
  main.go:61    0x49f7a3    MOVSD_XMM 0x8(R8)(R9*8), X6       // dy := sys.s[i].y - sys.s[j].y
  main.go:61    0x49f7aa    SUBSD X6, X5

  main.go:62    0x49f7ae    MOVSD_XMM 0x10(R8)(DI*8), X6
  main.go:62    0x49f7b5    MOVSD_XMM 0x10(R8)(R9*8), X7      // dz := sys.s[i].z - sys.s[j].z
  main.go:62    0x49f7bc    SUBSD X7, X6

It uses one instruction to load both the left and right side of the expression.

How does it affect code performance?

It depends on how often such a code generation pattern is used in the code and in how hot the pieces of code are.
But we can write benchmark:

package main

import (
    "testing"
)

func BenchmarkT(b *testing.B) {
    offsetMomentum()
    c := 500000

    b.ResetTimer()
    for n := 0; n < b.N; n++ {
        for i := 0; i < c; i++ {
            advance(0.01)
        }
    }

    _ = energy()
}

Results:

$ bin for i in {1..20}; do go.1.14 test -bench=. >> old.txt; done
$ bin for i in {1..20}; do go.master test -bench=. >> new.txt; done
$ bin benchstat old.txt new.txt
name  old time/op  new time/op  delta
T-4   48.9ms ± 5%  54.5ms ±10%  +11.46%  (p=0.000 n=19+19)

When did the issue start?

git bisect report beetween master(8696ae8) and go1.14:

... skip all steps

98cb76799c3053e779c4e1b61bb50705d25dd77f is the first bad commit
commit 98cb76799c3053e779c4e1b61bb50705d25dd77f
Author: Keith Randall <khr@golang.org>
Date:   Thu Jan 30 10:17:01 2020 -0800

    cmd/compile: insert complicated x86 addressing modes as a separate pass

    Use a separate compiler pass to introduce complicated x86 addressing
    modes.  Loads in the normal architecture rules (for x86 and all other
    platforms) can have constant offsets (AuxInt values) and symbols (Aux
    values), but no more.

    The complex addressing modes (x+y, x+2*y, etc.) are introduced in a
    separate pass that combines loads with LEAQx ops.

    Organizing rewrites this way simplifies the number of rewrites
    required, as there are lots of different rule orderings that have to
    be specified to ensure these complex addressing modes are always found
    if they are possible.

    Update #36468

    Change-Id: I5b4bf7b03a1e731d6dfeb9ef19b376175f3b4b44
    Reviewed-on: https://go-review.googlesource.com/c/go/+/217097
    Run-TryBot: Keith Randall <khr@golang.org>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
    Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>

 src/cmd/compile/internal/ssa/addressingmodes.go |   154 +
 src/cmd/compile/internal/ssa/compile.go         |     1 +
 src/cmd/compile/internal/ssa/gen/AMD64.rules    |   699 +-
 src/cmd/compile/internal/ssa/rewrite.go         |    40 +
 src/cmd/compile/internal/ssa/rewriteAMD64.go    | 10427 ++++++----------------
 test/codegen/memops.go                          |    88 +
 6 files changed, 3233 insertions(+), 8176 deletions(-)
 create mode 100644 src/cmd/compile/internal/ssa/addressingmodes.go

Issue linked with changes - #36468

/cc @randall77

@davecheney
Copy link
Contributor

/cc @randall77

@ALTree ALTree added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance labels Jul 27, 2020
@randall77
Copy link
Contributor

Yes, looks like we need corresponding float support. I have a CL.

@gopherbot
Copy link

Change https://golang.org/cl/244859 mentions this issue: cmd/compile: add floating point load+op operations to addressing modes pass

@randall77
Copy link
Contributor

With that CL you should get better code than even 1.14:

        0x000d 00013 (/Users/khr/gowork/issue40426.go:69)       LEAQ    (CX)(CX*2), DI
        0x0011 00017 (/Users/khr/gowork/issue40426.go:69)       LEAQ    "".sys+160(SB), R8
        0x0018 00024 (/Users/khr/gowork/issue40426.go:69)       MOVSD   (R8)(DI*8), X4
        0x001e 00030 (/Users/khr/gowork/issue40426.go:69)       LEAQ    (BX)(BX*2), R9
        0x0022 00034 (/Users/khr/gowork/issue40426.go:69)       SUBSD   (R8)(R9*8), X4
        0x0028 00040 (/Users/khr/gowork/issue40426.go:70)       MOVSD   8(R8)(DI*8), X5
        0x002f 00047 (/Users/khr/gowork/issue40426.go:70)       SUBSD   8(R8)(R9*8), X5
        0x0036 00054 (/Users/khr/gowork/issue40426.go:71)       MOVSD   16(R8)(DI*8), X6
        0x003d 00061 (/Users/khr/gowork/issue40426.go:71)       SUBSD   16(R8)(R9*8), X6

Not sure if it would be faster. Please try it and let me know.

@rokkerruslan
Copy link
Author

Yes, that CL improve stats:

$ benchstat regression.txt fix.txt 
name  old time/op  new time/op  delta
T-4   54.5ms ±10%  47.7ms ± 0%  -12.44%  (p=0.000 n=19+20)`

And compare with go1.14:

$ benchstat go.14.txt fix.txt 
name  old time/op  new time/op  delta
T-4   48.9ms ± 5%  47.7ms ± 0%  -2.41%  (p=0.000 n=19+20)

@golang golang locked and limited conversation to collaborators Jul 28, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
None yet
Development

No branches or pull requests

5 participants