Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: 63% benchmark slowdown in 1.7 -> 1.8rc2 #18740

Closed
ScottMansfield opened this issue Jan 22, 2017 · 12 comments
Closed

cmd/compile: 63% benchmark slowdown in 1.7 -> 1.8rc2 #18740

ScottMansfield opened this issue Jan 22, 2017 · 12 comments

Comments

@ScottMansfield
Copy link

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

$ go version
go version go1.7 darwin/amd64

$ go1.8rc2 version
go version go1.8rc2 darwin/amd64

What operating system and processor architecture are you using (go env)?

$ go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/smansfield/gopkg"
GORACE=""
GOROOT="/usr/local/Cellar/go/1.7/libexec"
GOTOOLDIR="/usr/local/Cellar/go/1.7/libexec/pkg/tool/darwin_amd64"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/f2/wwlcfvc5267dkl3c2_5_mcsw0000gp/T/go-build461692826=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"

What did you do?

If possible, provide a recipe for reproducing the error.
A complete runnable program is good.
A link on play.golang.org is best.

I have the disasm output for both 1.7 and 1.8 with a command to run benchtime 10s

1.7: https://play.golang.org/p/N04n8MwmS4

1.8: https://play.golang.org/p/sZpOz59PbS

What did you expect to see?

Approximately equal or better performance

What did you see instead?

Worse performance by far for 1.8, by 63% - 65%

What's interesting in the 1.8 output is that the profiling shows a single MOVL taking 4.98 of the 9.37 seconds

4.98s 1144938: MOVL 0x30(SP), CX

Unfortunately the code to do all this is private, but I might be able to extract an example if it's necessary. Right now I don't have one outside my project.

@cespare
Copy link
Contributor

cespare commented Jan 22, 2017

I think a code example will probably be necessary.

@ALTree ALTree changed the title 63% benchmark slowdown in 1.7 -> 1.8rc2 cmd/compile: 63% benchmark slowdown in 1.7 -> 1.8rc2 Jan 22, 2017
@bradfitz
Copy link
Contributor

Of at least that function.

Unless @randall77 or @josharian can tell by glancing at the assembly output.

@bradfitz bradfitz added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Jan 23, 2017
@bradfitz bradfitz added this to the Go1.9Maybe milestone Jan 23, 2017
@ScottMansfield
Copy link
Author

I'm working on extracting the function and the inputs the benchmarks generate

@randall77
Copy link
Contributor

I see nothing obviously wrong with the assembly. I see no reason why that particular instruction should be slow - it's just reading a stack slot that was written dozens of instructions ago.

@ScottMansfield
Copy link
Author

Got it isolated. Right now I have an example of the benchmark case that had the worst slowdown. Unfortunately the benchmark included somewhat large generated input data, so I have had to manually include that as very large arrays. I can work on a smaller one, but this shows pretty easily the slow down.

I can't paste it at play.golang.org (I get a server error) so I put it in a github gist: https://gist.github.com/ScottMansfield/c129bd11b66a19e50ed1bab1444bd389

I see not quite the same, but still significant, difference. Running the benchmark 30 times each I get:

$ benchstat go1.7 go1.8rc2
name          old time/op  new time/op  delta
Issue18740-8  26.1µs ± 3%  40.5µs ± 2%  +55.31%  (p=0.000 n=30+29)

@ALTree
Copy link
Member

ALTree commented Jan 23, 2017

go1.7, for binary.LittleEndian.Uint32(b), is generating

movzwl 0x38(%rsp),%eax
movzbl 0x3a(%rsp),%ecx
shl    $0x10,%ecx
or     %ecx,%eax
movzbl 0x3b(%rsp),%ecx
shl    $0x18,%ecx
or     %ecx,%eax

while go1.8 is generating

mov    0x38(%rsp),%eax

this is due to 26a6131.

Unfortunately the single mov generated by 1.8 is stalling the CPU at the following instruction, another mov generated by the if curId == id line, (which is present, unchanged, in both 1.7 and 1.8):

mov    0x68(%rsp),%ecx

If you substitute the following line:

curID := binary.LittleEndian.Uint32(b)

with:

_ = b[3]
curId := uint32(b[0]) | uint32(b[2])<<16 | uint32(b[1])<<8 | uint32(b[3])<<24

the performance is restored to go1.7 levels (at least on my machine).

@dgryski
Copy link
Contributor

dgryski commented Jan 23, 2017

So, increased memory latency due to unaligned memory access in 1.8. Interesting to read and compare with http://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/

@randall77
Copy link
Contributor

This looks like some sort of partial read/write stall.
Compare these two benchmarks:

var g [4096]byte
var sink uint32

func BenchmarkA(b *testing.B) {
	for j := 0; j < b.N; j++ {
		s := uint32(0)
		for i := 0; i < 4096; i += 4 {
			s += binary.LittleEndian.Uint32(g[i:])
		}
		sink = s
	}
}

func BenchmarkC(b *testing.B) {
	var buf [4]byte
	for j := 0; j < b.N; j++ {
		s := uint32(0)
		for i := 0; i < 4096; i += 4 {
			copy(buf[:], g[i:])
			s += binary.LittleEndian.Uint32(buf[:])
		}
		sink = s
	}
}

1.7 to 1.8:

BenchmarkA-8     1948          1340          -31.21%
BenchmarkC-8     5419          7477          +37.98%

BenchmarkA is faster probably exactly because of the load combining in the binary.LittleEndian.Uint32 call that 1.8 does. Alignment doesn't matter, I get approximately the same performance if we load from g[i+1:] instead.

The copy is what makes BenchmarkC fall over. I think what is happening is that the copy is done inside memmove with 2 2-byte load/store, and then we read the result with a 4-byte load. I'm guessing the processor is barfing (stalling somehow) on such a combination. In 1.7, we do 2 2-byte load/stores and then 4 1-byte loads, which the processor can handle effortlessly.

If I fix memmove to use a 4-byte load/store in the 4 byte case, that fixes BenchmarkC.

BenchmarkA-8     1948          1349          -30.75%
BenchmarkC-8     5419          4987          -7.97%

Patch:

--- a/src/runtime/memmove_amd64.s
+++ b/src/runtime/memmove_amd64.s
@@ -146,10 +146,16 @@ move_1or2:
 move_0:
        RET
 move_3or4:
+       CMPQ    BX, $4
+       JB      move_3
+       MOVL    (SI), AX
+       MOVL    AX, (DI)
+       RET
+move_3:
        MOVW    (SI), AX
-       MOVW    -2(SI)(BX*1), CX
+       MOVB    2(SI), CX
        MOVW    AX, (DI)
-       MOVW    CX, -2(DI)(BX*1)
+       MOVB    CX, 2(DI)
        RET
 move_5through7:
        MOVL    (SI), AX

@TocarIP : Intel guys, does this sound right? Expected?

I'll prepare a CL.

@randall77 randall77 self-assigned this Jan 23, 2017
@randall77 randall77 modified the milestones: Go1.8, Go1.9Maybe Jan 23, 2017
@randall77 randall77 removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Jan 23, 2017
@randall77
Copy link
Contributor

@TocarIP
Copy link
Contributor

TocarIP commented Jan 23, 2017

Looks like store-forwarding issue.
Optimization manual says (2.4.4.4):
The following rules must be met for store to load forwarding to occur
...
The store must be equal or greater in size than the size of data being loaded

@TocarIP
Copy link
Contributor

TocarIP commented Jan 23, 2017

Just to be sure I've collected perf events ( ld_blocks_store_forward).
As expected, there is a 3 orders of magnitude difference
301,876,563 vs 276,025 (before vs after patch)

@gopherbot
Copy link

CL https://golang.org/cl/35567 mentions this issue.

@golang golang locked and limited conversation to collaborators Jan 23, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

8 participants