cmd/compile: 63% benchmark slowdown in 1.7 -> 1.8rc2 #18740

ScottMansfield · 2017-01-22T03:17:50Z

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (`go version`)?

$ go version
go version go1.7 darwin/amd64

$ go1.8rc2 version
go version go1.8rc2 darwin/amd64

What operating system and processor architecture are you using (`go env`)?

$ go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/smansfield/gopkg"
GORACE=""
GOROOT="/usr/local/Cellar/go/1.7/libexec"
GOTOOLDIR="/usr/local/Cellar/go/1.7/libexec/pkg/tool/darwin_amd64"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/f2/wwlcfvc5267dkl3c2_5_mcsw0000gp/T/go-build461692826=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"

What did you do?

If possible, provide a recipe for reproducing the error.
A complete runnable program is good.
A link on play.golang.org is best.

I have the disasm output for both 1.7 and 1.8 with a command to run benchtime 10s

1.7: https://play.golang.org/p/N04n8MwmS4

1.8: https://play.golang.org/p/sZpOz59PbS

What did you expect to see?

Approximately equal or better performance

What did you see instead?

Worse performance by far for 1.8, by 63% - 65%

What's interesting in the 1.8 output is that the profiling shows a single MOVL taking 4.98 of the 9.37 seconds

4.98s 1144938: MOVL 0x30(SP), CX

Unfortunately the code to do all this is private, but I might be able to extract an example if it's necessary. Right now I don't have one outside my project.

The text was updated successfully, but these errors were encountered:

cespare · 2017-01-22T03:20:14Z

I think a code example will probably be necessary.

bradfitz · 2017-01-23T00:57:07Z

Of at least that function.

Unless @randall77 or @josharian can tell by glancing at the assembly output.

ScottMansfield · 2017-01-23T02:04:24Z

I'm working on extracting the function and the inputs the benchmarks generate

randall77 · 2017-01-23T03:08:25Z

I see nothing obviously wrong with the assembly. I see no reason why that particular instruction should be slow - it's just reading a stack slot that was written dozens of instructions ago.

ScottMansfield · 2017-01-23T03:23:40Z

Got it isolated. Right now I have an example of the benchmark case that had the worst slowdown. Unfortunately the benchmark included somewhat large generated input data, so I have had to manually include that as very large arrays. I can work on a smaller one, but this shows pretty easily the slow down.

I can't paste it at play.golang.org (I get a server error) so I put it in a github gist: https://gist.github.com/ScottMansfield/c129bd11b66a19e50ed1bab1444bd389

I see not quite the same, but still significant, difference. Running the benchmark 30 times each I get:

$ benchstat go1.7 go1.8rc2
name          old time/op  new time/op  delta
Issue18740-8  26.1µs ± 3%  40.5µs ± 2%  +55.31%  (p=0.000 n=30+29)

ALTree · 2017-01-23T10:59:59Z

go1.7, for binary.LittleEndian.Uint32(b), is generating

movzwl 0x38(%rsp),%eax
movzbl 0x3a(%rsp),%ecx
shl    $0x10,%ecx
or     %ecx,%eax
movzbl 0x3b(%rsp),%ecx
shl    $0x18,%ecx
or     %ecx,%eax

while go1.8 is generating

mov    0x38(%rsp),%eax

this is due to 26a6131.

Unfortunately the single mov generated by 1.8 is stalling the CPU at the following instruction, another mov generated by the if curId == id line, (which is present, unchanged, in both 1.7 and 1.8):

mov    0x68(%rsp),%ecx

If you substitute the following line:

curID := binary.LittleEndian.Uint32(b)

with:

_ = b[3]
curId := uint32(b[0]) | uint32(b[2])<<16 | uint32(b[1])<<8 | uint32(b[3])<<24

the performance is restored to go1.7 levels (at least on my machine).

dgryski · 2017-01-23T12:42:44Z

So, increased memory latency due to unaligned memory access in 1.8. Interesting to read and compare with http://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/

randall77 · 2017-01-23T16:13:56Z

This looks like some sort of partial read/write stall.
Compare these two benchmarks:

var g [4096]byte
var sink uint32

func BenchmarkA(b *testing.B) {
	for j := 0; j < b.N; j++ {
		s := uint32(0)
		for i := 0; i < 4096; i += 4 {
			s += binary.LittleEndian.Uint32(g[i:])
		}
		sink = s
	}
}

func BenchmarkC(b *testing.B) {
	var buf [4]byte
	for j := 0; j < b.N; j++ {
		s := uint32(0)
		for i := 0; i < 4096; i += 4 {
			copy(buf[:], g[i:])
			s += binary.LittleEndian.Uint32(buf[:])
		}
		sink = s
	}
}

1.7 to 1.8:

BenchmarkA-8     1948          1340          -31.21%
BenchmarkC-8     5419          7477          +37.98%

BenchmarkA is faster probably exactly because of the load combining in the binary.LittleEndian.Uint32 call that 1.8 does. Alignment doesn't matter, I get approximately the same performance if we load from g[i+1:] instead.

The copy is what makes BenchmarkC fall over. I think what is happening is that the copy is done inside memmove with 2 2-byte load/store, and then we read the result with a 4-byte load. I'm guessing the processor is barfing (stalling somehow) on such a combination. In 1.7, we do 2 2-byte load/stores and then 4 1-byte loads, which the processor can handle effortlessly.

If I fix memmove to use a 4-byte load/store in the 4 byte case, that fixes BenchmarkC.

BenchmarkA-8     1948          1349          -30.75%
BenchmarkC-8     5419          4987          -7.97%

Patch:

--- a/src/runtime/memmove_amd64.s
+++ b/src/runtime/memmove_amd64.s
@@ -146,10 +146,16 @@ move_1or2:
 move_0:
        RET
 move_3or4:
+       CMPQ    BX, $4
+       JB      move_3
+       MOVL    (SI), AX
+       MOVL    AX, (DI)
+       RET
+move_3:
        MOVW    (SI), AX
-       MOVW    -2(SI)(BX*1), CX
+       MOVB    2(SI), CX
        MOVW    AX, (DI)
-       MOVW    CX, -2(DI)(BX*1)
+       MOVB    CX, 2(DI)
        RET
 move_5through7:
        MOVL    (SI), AX

@TocarIP : Intel guys, does this sound right? Expected?

I'll prepare a CL.

randall77 · 2017-01-23T16:30:15Z

CL is https://go-review.googlesource.com/c/35567/

TocarIP · 2017-01-23T16:33:31Z

Looks like store-forwarding issue.
Optimization manual says (2.4.4.4):
The following rules must be met for store to load forwarding to occur
...
The store must be equal or greater in size than the size of data being loaded

TocarIP · 2017-01-23T17:33:42Z

Just to be sure I've collected perf events ( ld_blocks_store_forward).
As expected, there is a 3 orders of magnitude difference
301,876,563 vs 276,025 (before vs after patch)

gopherbot · 2017-01-23T18:19:01Z

CL https://golang.org/cl/35567 mentions this issue.

ALTree changed the title ~~63% benchmark slowdown in 1.7 -> 1.8rc2~~ cmd/compile: 63% benchmark slowdown in 1.7 -> 1.8rc2 Jan 22, 2017

bradfitz added the WaitingForInfo label Jan 23, 2017

bradfitz added this to the Go1.9Maybe milestone Jan 23, 2017

randall77 self-assigned this Jan 23, 2017

randall77 added the Performance label Jan 23, 2017

randall77 modified the milestones: Go1.8, Go1.9Maybe Jan 23, 2017

randall77 removed the WaitingForInfo label Jan 23, 2017

gopherbot closed this as completed in a96e117 Jan 23, 2017

golang locked and limited conversation to collaborators Jan 23, 2018

gopherbot added the FrozenDueToAge label Jan 23, 2018

rsc unassigned randall77 Jun 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: 63% benchmark slowdown in 1.7 -> 1.8rc2 #18740

cmd/compile: 63% benchmark slowdown in 1.7 -> 1.8rc2 #18740

ScottMansfield commented Jan 22, 2017

cespare commented Jan 22, 2017

bradfitz commented Jan 23, 2017

ScottMansfield commented Jan 23, 2017

randall77 commented Jan 23, 2017

ScottMansfield commented Jan 23, 2017

ALTree commented Jan 23, 2017 •

edited

Loading

dgryski commented Jan 23, 2017

randall77 commented Jan 23, 2017

randall77 commented Jan 23, 2017

TocarIP commented Jan 23, 2017

TocarIP commented Jan 23, 2017

gopherbot commented Jan 23, 2017

cmd/compile: 63% benchmark slowdown in 1.7 -> 1.8rc2 #18740

cmd/compile: 63% benchmark slowdown in 1.7 -> 1.8rc2 #18740

Comments

ScottMansfield commented Jan 22, 2017

What version of Go are you using (go version)?

What operating system and processor architecture are you using (go env)?

What did you do?

What did you expect to see?

What did you see instead?

cespare commented Jan 22, 2017

bradfitz commented Jan 23, 2017

ScottMansfield commented Jan 23, 2017

randall77 commented Jan 23, 2017

ScottMansfield commented Jan 23, 2017

ALTree commented Jan 23, 2017 • edited Loading

dgryski commented Jan 23, 2017

randall77 commented Jan 23, 2017

randall77 commented Jan 23, 2017

TocarIP commented Jan 23, 2017

TocarIP commented Jan 23, 2017

gopherbot commented Jan 23, 2017

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?

ALTree commented Jan 23, 2017 •

edited

Loading