Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: benchmark regression in github.com/cespare/xxhash between 1.9.2 and 1.10beta2 #23424

Closed
cespare opened this issue Jan 12, 2018 · 15 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Milestone

Comments

@cespare
Copy link
Contributor

cespare commented Jan 12, 2018

I have a small benchmark that shows a significant regression between go1.9.2 and go1.10beta2.

You can check out the code at https://github.com/cespare/xxhash. This is done with the latest commit as of writing (e4e2bd419cbedb5b2ec48e4a88e4c993ddb74555).

I'm on linux/amd64. The benchmark in question is BenchmarkStringHash. The code is simply:

var result uint64

func BenchmarkStringHash(b *testing.B) {
	const s = "abcdefghijklmnop"
	var r uint64
	b.ReportAllocs()
	for n := 0; n < b.N; n++ {
		r = Sum64([]byte(s))
	}
	result = r
}

Between 1.9.2 and 1.10beta2 I see a 2x slowdown:

StringHash-72                      10.8ns ± 0%    22.5ns ± 0%  +108.33%  (p=0.008 n=5+5)

Note that the implementation of Sum64 is in assembly. If I use the noasm tag to select a pure-Go implementation, I still see a large slowdown:

StringHash-72                      12.3ns ± 0%    24.1ns ± 0%   +95.93%  (p=0.008 n=5+5)

Because of the slowdown in the assembly case where the code for Sum64 should be identical, I infer that the difference is inside the BenchmarkStringHash function itself (related to the string->[]byte conversion or the function call). Perhaps that's an incorrect assumption, though.

I've poked around a profile a bit but I haven't spotted a smoking gun. Perhaps someone else can make more sense of it.

Here's some pprof disasm output for 1.9.2 (fast):

$ go1.9.2 test -bench StringHash -benchtime 5s -cpuprofile go1.9.2.prof
goos: linux
goarch: amd64
pkg: github.com/cespare/xxhash
BenchmarkStringHash-72          1000000000              10.8 ns/op             0 B/op          0 allocs/op
PASS
ok      github.com/cespare/xxhash       12.010s
$ PPROF_BINARY_PATH=. go1.9.2 tool pprof go1.9.2.prof
File: xxhash.test
Type: cpu
Time: Jan 12, 2018 at 2:57am (UTC)
Duration: 12.01s, Total samples = 11.84s (98.60%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) disasm StringHash
Total: 11.84s
ROUTINE ======================== github.com/cespare/xxhash.BenchmarkStringHash
     1.55s     11.84s (flat, cum)   100% of Total
         .          .     4f0040: MOVQ FS:0xfffffff8, CX                         ;xxhash_test.go:18
         .          .     4f0049: CMPQ 0x10(CX), SP
         .          .     4f004d: JBE 0x4f00ed
         .          .     4f0053: SUBQ $0x60, SP
         .          .     4f0057: MOVQ BP, 0x58(SP)
         .          .     4f005c: LEAQ 0x58(SP), BP
         .          .     4f0061: MOVQ 0x68(SP), AX
         .          .     4f0066: MOVB $0x1, 0x122(AX)                     ;benchmark.go:119
         .          .     4f006d: XORL CX, CX
         .          .     4f006f: XORL DX, DX
         .          .     4f0071: JMP 0x4f00d0                             ;xxhash_test.go:22
     330ms      330ms     4f0073: MOVQ CX, 0x30(SP)                           ;github.com/cespare/xxhash.BenchmarkStringHash xxhash_test.go:22
      10ms       10ms     4f0078: LEAQ 0x38(SP), AX
         .          .     4f007d: MOVQ AX, 0(SP)                                 ;xxhash_test.go:23
         .          .     4f0081: LEAQ 0x4c3dc(IP), AX
     280ms      280ms     4f0088: MOVQ AX, 0x8(SP)                             ;github.com/cespare/xxhash.BenchmarkStringHash xxhash_test.go:23
         .          .     4f008d: MOVQ $0x10, 0x10(SP)                     ;xxhash_test.go:23
         .      4.69s     4f0096: CALL runtime.stringtoslicebyte(SB)         ;github.com/cespare/xxhash.BenchmarkStringHash xxhash_test.go:23
         .          .     4f009b: MOVQ 0x20(SP), AX                           ;xxhash_test.go:23
     340ms      340ms     4f00a0: MOVQ 0x18(SP), CX                           ;github.com/cespare/xxhash.BenchmarkStringHash xxhash_test.go:23
         .          .     4f00a5: MOVQ 0x28(SP), DX                           ;xxhash_test.go:23
         .          .     4f00aa: MOVQ CX, 0(SP)
         .          .     4f00ae: MOVQ AX, 0x8(SP)
     320ms      320ms     4f00b3: MOVQ DX, 0x10(SP)                           ;github.com/cespare/xxhash.BenchmarkStringHash xxhash_test.go:23
      20ms      5.62s     4f00b8: CALL github.com/cespare/xxhash.Sum64(SB)
         .          .     4f00bd: MOVQ 0x30(SP), AX                           ;xxhash_test.go:23
         .          .     4f00c2: LEAQ 0x1(AX), CX                             ;xxhash_test.go:22
     250ms      250ms     4f00c6: MOVQ 0x18(SP), DX                           ;github.com/cespare/xxhash.BenchmarkStringHash xxhash_test.go:23
         .          .     4f00cb: MOVQ 0x68(SP), AX                           ;xxhash_test.go:23
         .          .     4f00d0: MOVQ 0xf0(AX), BX                           ;xxhash_test.go:22
         .          .     4f00d7: CMPQ BX, CX
         .          .     4f00da: JL 0x4f0073
         .          .     4f00dc: MOVQ DX, 0x118d1d(IP)                   ;xxhash_test.go:25
         .          .     4f00e3: MOVQ 0x58(SP), BP                           ;xxhash_test.go:26
         .          .     4f00e8: ADDQ $0x60, SP
         .          .     4f00ec: RET
         .          .     4f00ed: CALL runtime.morestack_noctxt(SB)           ;xxhash_test.go:18
         .          .     4f00f2: ?
         .          .     4f00f3: ?
         .          .     4f00f4: ?
         .          .     4f00f5: ?
(pprof)

and here's the same output for 1.10beta2 (slow):

$ go1.10beta2 test -bench StringHash -benchtime 5s -cpuprofile go1.10beta2.prof
goos: linux
goarch: amd64
pkg: github.com/cespare/xxhash
BenchmarkStringHash-72          300000000               22.5 ns/op             0 B/op          0 allocs/op
PASS
ok      github.com/cespare/xxhash       9.209s
$ PPROF_BINARY_PATH=. go1.10beta2 tool pprof go1.10beta2.prof
File: xxhash.test
Type: cpu
Time: Jan 12, 2018 at 2:58am (UTC)
Duration: 9.21s, Total samples = 9s (97.76%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) disasm StringHash
Total: 9s
ROUTINE ======================== github.com/cespare/xxhash.BenchmarkStringHash
     2.46s         9s (flat, cum)   100% of Total
         .          .     4e7470: MOVQ FS:0xfffffff8, CX                         ;xxhash_test.go:18
         .          .     4e7479: CMPQ 0x10(CX), SP
         .          .     4e747d: JBE 0x4e7513
         .          .     4e7483: SUBQ $0x60, SP
         .          .     4e7487: MOVQ BP, 0x58(SP)
         .          .     4e748c: LEAQ 0x58(SP), BP
         .          .     4e7491: MOVQ 0x68(SP), AX
         .          .     4e7496: MOVB $0x1, 0x122(AX)                     ;benchmark.go:119
         .          .     4e749d: XORL CX, CX
         .          .     4e749f: XORL DX, DX
         .          .     4e74a1: JMP 0x4e74f6                             ;xxhash_test.go:22
         .          .     4e74a3: MOVQ CX, 0x30(SP)
         .          .     4e74a8: LEAQ 0x38(SP), AX                           ;xxhash_test.go:23
         .          .     4e74ad: MOVQ AX, 0(SP)
     120ms      120ms     4e74b1: LEAQ 0x4ca45(IP), AX                     ;github.com/cespare/xxhash.BenchmarkStringHash xxhash_test.go:23
         .          .     4e74b8: MOVQ AX, 0x8(SP)                             ;xxhash_test.go:23
         .          .     4e74bd: MOVQ $0x10, 0x10(SP)
         .      1.64s     4e74c6: CALL runtime.stringtoslicebyte(SB)         ;github.com/cespare/xxhash.BenchmarkStringHash xxhash_test.go:23
         .          .     4e74cb: MOVUPS 0x18(SP), X0                       ;xxhash_test.go:23
     1.92s      1.92s     4e74d0: MOVQ 0x28(SP), AX                           ;github.com/cespare/xxhash.BenchmarkStringHash xxhash_test.go:23
         .          .     4e74d5: MOVUPS X0, 0(SP)                             ;xxhash_test.go:23
     260ms      260ms     4e74d9: MOVQ AX, 0x10(SP)                           ;github.com/cespare/xxhash.BenchmarkStringHash xxhash_test.go:23
         .      4.90s     4e74de: CALL github.com/cespare/xxhash.Sum64(SB)
         .          .     4e74e3: MOVQ 0x30(SP), AX                           ;xxhash_test.go:23
         .          .     4e74e8: LEAQ 0x1(AX), CX                             ;xxhash_test.go:22
      80ms       80ms     4e74ec: MOVQ 0x18(SP), DX                           ;github.com/cespare/xxhash.BenchmarkStringHash xxhash_test.go:23
         .          .     4e74f1: MOVQ 0x68(SP), AX                           ;xxhash_test.go:23
         .          .     4e74f6: MOVQ 0xf0(AX), BX                           ;xxhash_test.go:22
         .          .     4e74fd: CMPQ BX, CX
      80ms       80ms     4e7500: JL 0x4e74a3                               ;github.com/cespare/xxhash.BenchmarkStringHash xxhash_test.go:22
                                                                          ;xxhash_test.go:25
         .          .     4e7502: MOVQ DX, github.com/cespare/xxhash.result(SB)
         .          .     4e7509: MOVQ 0x58(SP), BP                           ;xxhash_test.go:26
         .          .     4e750e: ADDQ $0x60, SP
         .          .     4e7512: RET
         .          .     4e7513: CALL runtime.morestack_noctxt(SB)           ;xxhash_test.go:18
         .          .     4e7518: ?
         .          .     4e7519: PUSHL BX
         .          .     4e751a: ?
         .          .     4e751b: ?
(pprof)
@minaevmike
Copy link
Contributor

minaevmike commented Jan 12, 2018

hi @cespare, i think that you are right - the slowdown in string to []byte conversion.
diff

--- a/xxhash_test.go
+++ b/xxhash_test.go
@@ -16,11 +16,11 @@ import (
 var result uint64
 
 func BenchmarkStringHash(b *testing.B) {
-       const s = "abcdefghijklmnop"
+       var s = []byte("abcdefghijklmnop")
        var r uint64
        b.ReportAllocs()
        for n := 0; n < b.N; n++ {
-               r = Sum64([]byte(s))
+               r = Sum64(s)
        }
        result = r
 }

befor this change i have (on linux machine)

$ go test -bench StringHash -benchtime 5s
goos: linux
goarch: amd64
pkg: github.com/cespare/xxhash
BenchmarkStringHash-8   	500000000	        10.4 ns/op	       0 B/op	       0 allocs/op
PASS
ok  	github.com/cespare/xxhash	6.480s
 go1.10beta2 test -bench StringHash -benchtime 5s
goos: linux
goarch: amd64
pkg: github.com/cespare/xxhash
BenchmarkStringHash-8   	300000000	        20.3 ns/op	       0 B/op	       0 allocs/op
PASS
ok  	github.com/cespare/xxhash	8.136s

after

$ go1.10beta2 test -bench StringHash -benchtime 5s
goos: linux
goarch: amd64
pkg: github.com/cespare/xxhash
BenchmarkStringHash-8   	2000000000	         5.17 ns/op	       0 B/op	       0 allocs/op
PASS
ok  	github.com/cespare/xxhash	10.875s
$ go test -bench StringHash -benchtime 5s
goos: linux
goarch: amd64
pkg: github.com/cespare/xxhash
BenchmarkStringHash-8   	2000000000	         5.21 ns/op	       0 B/op	       0 allocs/op
PASS
ok  	github.com/cespare/xxhash	10.951s

@dsnet dsnet added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 12, 2018
@dsnet dsnet added this to the Go1.10 milestone Jan 12, 2018
@ALTree
Copy link
Member

ALTree commented Jan 12, 2018

Intel i7-4510U, I get (1.9.2 vs tip):

name          old time/op    new time/op    delta
StringHash-4    14.7ns ± 1%    26.4ns ± 1%  +79.08%  (p=0.008 n=5+5)

It seems like the MOVUPS / MOVQ sequence between the stringtoslicebyte and Sum64 calls is stalling the CPU. Before (1.9.2) we had:

0x0050 00080 (prova_test.go:16)	MOVQ	32(SP), AX
0x0055 00085 (prova_test.go:16)	MOVQ	24(SP), CX
0x005a 00090 (prova_test.go:16)	MOVQ	40(SP), DX
0x005f 00095 (prova_test.go:16)	MOVQ	CX, (SP)
0x0063 00099 (prova_test.go:16)	MOVQ	AX, 8(SP)
0x0068 00104 (prova_test.go:16)	MOVQ	DX, 16(SP)

now (tip) we have

0x0050 00080 (prova_test.go:16)	MOVUPS	24(SP), X0
0x0055 00085 (prova_test.go:16)	MOVQ	40(SP), AX
0x005a 00090 (prova_test.go:16)	MOVUPS	X0, (SP)
0x005e 00094 (prova_test.go:16)	MOVQ	AX, 16(SP)

I'm not very good at reading perf output, but it seems that the second sequence is stalling the CPU at Backend_Bound.Memory_Bound.Store_Bound.

@cespare can you confirm that the slowdown was introduced by 9c99512?

cc @randall77 @TocarIP

@TocarIP
Copy link
Contributor

TocarIP commented Jan 12, 2018

Reverting 9c99512 seems to fix this.

@TocarIP
Copy link
Contributor

TocarIP commented Jan 12, 2018

This is somewhat unrelated, but moving string to [] byte conversion out of the loop results in :

StringHash-6    43.1ns ± 0%    12.0ns ± 0%  -72.14%  (p=0.000 n=10+9)

Making it ~2x faster than 1.9

@randall77
Copy link
Contributor

I concur, sounds like we need to revert that CL. Or at least the part of it that does the 8->16 byte combination. (Maybe we keep the smaller ones?)

Do we understand why using sse registers is so much slower? Does this optimization break store forwarding or something?

@josharian
Copy link
Contributor

Do we understand why using sse registers is so much slower?

If this were ARM/NEON, I'd say it was a memory hazard. I couldn't say about amd64, though.

@TocarIP
Copy link
Contributor

TocarIP commented Jan 12, 2018

Looks like problem with forwarding 64-bit store to 128-bit load, there is ~1 ld_blocks.store_forward event per iteration. Honestly I'm surprised by performance impact. We did sse loads of 64-bit values in memmove, duffcopy, opcopy etc. for a while without problems.

@ulikunitz
Copy link
Contributor

The unaligned MOVUPS 0x18(SP), X0 is clearly the issue here. While MOVUPS allows unaligned reads and writes, they are not as fast as aligned read and writes.

@TocarIP
Copy link
Contributor

TocarIP commented Jan 12, 2018

Doesn't look like it. Rsp contains 0xc420034f08, so 0x18(SP) = 0xC420034F20 is a 32-byte aligned address.

@cespare
Copy link
Contributor Author

cespare commented Jan 12, 2018

@minaevmike @TocarIP -- somewhat offtopic, but FYI this benchmark was added specifically to check that Sum64 doesn't cause the argument to escape (see cespare/xxhash#2) so doing the string->slice inside the loop is intentional.

@cespare
Copy link
Contributor Author

cespare commented Jan 13, 2018

Thanks all for investigating.

I can confirm that Go 1.10beta2 + reverting 9c99512 fixes my regression. It also fixes another, similar regression in the same benchmark suite (which I suspected was the same issue.)

@cespare
Copy link
Contributor Author

cespare commented Jan 13, 2018

I was also going to file a bug for slowdowns of a bunch of benchmarks in github.com/spaolacci/murmur3, but happily the same revert also fixes those.

If you're interested in reproducing those:

  1. I've made a PR with some benchmark improvements (Write benchmarks using sub-benchmarks spaolacci/murmur3#20) that hasn't been merged yet, so check out my fork at https://github.com/cespare/murmur3 (these results from cespare/murmur3@4208fb8).
  2. Run benchmarks like go test -run xxx -bench . -benchtime 100ms -count 5

Comparing 1.9.2 to 1.10beta2:

name              old time/op    new time/op    delta
32/1-72             8.14ns ± 0%    7.63ns ± 0%   -6.31%  (p=0.008 n=5+5)
32/2-72             8.92ns ± 0%    8.17ns ± 0%   -8.45%  (p=0.000 n=4+5)
32/4-72             8.25ns ± 0%    7.97ns ± 0%   -3.37%  (p=0.008 n=5+5)
32/8-72             9.90ns ± 0%    9.39ns ± 0%   -5.19%  (p=0.000 n=5+4)
32/16-72            13.3ns ± 0%    12.4ns ± 0%   -6.49%  (p=0.008 n=5+5)
32/32-72            18.0ns ± 0%    18.1ns ± 0%   +0.56%  (p=0.008 n=5+5)
32/64-72            29.4ns ± 0%    29.5ns ± 0%   +0.34%  (p=0.008 n=5+5)
32/128-72           56.9ns ± 0%    54.1ns ± 0%   -4.92%  (p=0.008 n=5+5)
32/256-72            108ns ± 0%     105ns ± 0%   -2.78%  (p=0.008 n=5+5)
32/512-72            211ns ± 0%     212ns ± 0%   +0.47%  (p=0.008 n=5+5)
32/1024-72           416ns ± 0%     417ns ± 0%   +0.24%  (p=0.016 n=5+4)
32/2048-72           828ns ± 0%     829ns ± 0%   +0.14%  (p=0.048 n=5+5)
32/4096-72          1.65µs ± 0%    1.65µs ± 0%     ~     (p=0.079 n=5+5)
32/8192-72          3.30µs ± 0%    3.30µs ± 0%     ~     (p=0.143 n=5+5)
Partial32/8-72       191ns ± 1%     229ns ± 1%  +20.13%  (p=0.008 n=5+5)
Partial32/16-72      205ns ± 0%     254ns ± 0%  +23.71%  (p=0.008 n=5+5)
Partial32/32-72      284ns ± 0%     334ns ± 0%  +17.61%  (p=0.016 n=4+5)
Partial32/64-72      238ns ± 0%     278ns ± 0%  +16.97%  (p=0.008 n=5+5)
Partial32/128-72     268ns ± 0%     303ns ± 0%  +13.23%  (p=0.008 n=5+5)
64/1-72             17.0ns ± 0%    26.8ns ± 0%  +57.65%  (p=0.008 n=5+5)
64/2-72             17.9ns ± 0%    27.3ns ± 0%  +52.51%  (p=0.016 n=4+5)
64/4-72             18.8ns ± 0%    28.1ns ± 0%  +49.47%  (p=0.016 n=4+5)
64/8-72             21.1ns ± 0%    29.2ns ± 0%  +38.39%  (p=0.008 n=5+5)
64/16-72            19.1ns ± 0%    21.0ns ± 0%   +9.95%  (p=0.008 n=5+5)
64/32-72            21.2ns ± 0%    22.5ns ± 0%   +6.13%  (p=0.008 n=5+5)
64/64-72            25.9ns ± 0%    27.3ns ± 0%   +5.41%  (p=0.016 n=4+5)
64/128-72           35.9ns ± 1%    36.9ns ± 0%   +2.79%  (p=0.008 n=5+5)
64/256-72           57.1ns ± 0%    59.9ns ± 0%   +4.98%  (p=0.008 n=5+5)
64/512-72            100ns ± 1%     106ns ± 0%   +6.53%  (p=0.016 n=5+4)
64/1024-72           184ns ± 0%     201ns ± 0%   +9.24%  (p=0.016 n=4+5)
64/2048-72           360ns ± 0%     394ns ± 0%   +9.44%  (p=0.008 n=5+5)
64/4096-72           700ns ± 0%     771ns ± 0%  +10.21%  (p=0.008 n=5+5)
64/8192-72          1.38µs ± 0%    1.52µs ± 0%  +10.51%  (p=0.008 n=5+5)
128/1-72            17.9ns ± 0%    26.7ns ± 0%  +49.16%  (p=0.008 n=5+5)
128/2-72            18.6ns ± 0%    27.3ns ± 0%  +46.77%  (p=0.008 n=5+5)
128/4-72            19.5ns ± 0%    28.1ns ± 0%  +44.10%  (p=0.008 n=5+5)
128/8-72            22.0ns ± 0%    29.3ns ± 0%  +33.18%  (p=0.016 n=5+4)
128/16-72           19.7ns ± 0%    21.0ns ± 0%   +6.38%  (p=0.008 n=5+5)
128/32-72           22.2ns ± 0%    22.9ns ± 0%   +3.15%  (p=0.008 n=5+5)
128/64-72           26.9ns ± 0%    27.7ns ± 0%   +2.97%  (p=0.008 n=5+5)
128/128-72          36.8ns ± 0%    37.6ns ± 0%   +2.28%  (p=0.008 n=5+5)
128/256-72          57.6ns ± 1%    60.4ns ± 0%   +4.90%  (p=0.016 n=5+4)
128/512-72           100ns ± 0%     107ns ± 0%   +7.00%  (p=0.008 n=5+5)
128/1024-72          185ns ± 1%     201ns ± 0%   +8.77%  (p=0.008 n=5+5)
128/2048-72          361ns ± 1%     394ns ± 0%   +9.14%  (p=0.008 n=5+5)
128/4096-72          702ns ± 1%     771ns ± 0%   +9.77%  (p=0.016 n=5+4)
128/8192-72         1.38µs ± 0%    1.52µs ± 0%  +10.38%  (p=0.016 n=5+4)

name              old speed      new speed      delta
32/1-72            123MB/s ± 0%   131MB/s ± 0%   +6.74%  (p=0.008 n=5+5)
32/2-72            224MB/s ± 0%   245MB/s ± 0%   +9.24%  (p=0.008 n=5+5)
32/4-72            485MB/s ± 0%   502MB/s ± 0%   +3.49%  (p=0.008 n=5+5)
32/8-72            808MB/s ± 0%   852MB/s ± 0%   +5.52%  (p=0.008 n=5+5)
32/16-72          1.21GB/s ± 0%  1.29GB/s ± 0%   +6.74%  (p=0.008 n=5+5)
32/32-72          1.78GB/s ± 0%  1.77GB/s ± 0%   -0.51%  (p=0.008 n=5+5)
32/64-72          2.17GB/s ± 0%  2.17GB/s ± 0%   -0.37%  (p=0.008 n=5+5)
32/128-72         2.25GB/s ± 0%  2.37GB/s ± 0%   +5.18%  (p=0.008 n=5+5)
32/256-72         2.36GB/s ± 0%  2.43GB/s ± 0%   +2.65%  (p=0.008 n=5+5)
32/512-72         2.43GB/s ± 0%  2.41GB/s ± 0%   -0.49%  (p=0.008 n=5+5)
32/1024-72        2.46GB/s ± 0%  2.45GB/s ± 0%   -0.26%  (p=0.008 n=5+5)
32/2048-72        2.47GB/s ± 0%  2.47GB/s ± 0%   -0.16%  (p=0.016 n=4+5)
32/4096-72        2.48GB/s ± 0%  2.48GB/s ± 0%   -0.07%  (p=0.032 n=5+5)
32/8192-72        2.48GB/s ± 0%  2.48GB/s ± 0%     ~     (p=0.222 n=5+5)
Partial32/8-72    41.8MB/s ± 1%  34.8MB/s ± 1%  -16.64%  (p=0.008 n=5+5)
Partial32/16-72   77.8MB/s ± 0%  63.0MB/s ± 0%  -19.11%  (p=0.008 n=5+5)
Partial32/32-72    112MB/s ± 0%    96MB/s ± 0%  -14.94%  (p=0.008 n=5+5)
Partial32/64-72    268MB/s ± 0%   229MB/s ± 0%  -14.46%  (p=0.008 n=5+5)
Partial32/128-72   477MB/s ± 0%   421MB/s ± 0%  -11.73%  (p=0.008 n=5+5)
64/1-72           58.7MB/s ± 0%  37.3MB/s ± 0%  -36.43%  (p=0.008 n=5+5)
64/2-72            112MB/s ± 0%    73MB/s ± 0%  -34.67%  (p=0.008 n=5+5)
64/4-72            212MB/s ± 0%   142MB/s ± 0%  -33.09%  (p=0.008 n=5+5)
64/8-72            379MB/s ± 0%   274MB/s ± 0%  -27.70%  (p=0.008 n=5+5)
64/16-72           839MB/s ± 0%   761MB/s ± 0%   -9.29%  (p=0.008 n=5+5)
64/32-72          1.51GB/s ± 0%  1.42GB/s ± 0%   -5.76%  (p=0.008 n=5+5)
64/64-72          2.47GB/s ± 0%  2.34GB/s ± 0%   -5.12%  (p=0.008 n=5+5)
64/128-72         3.57GB/s ± 1%  3.47GB/s ± 0%   -2.68%  (p=0.008 n=5+5)
64/256-72         4.49GB/s ± 0%  4.28GB/s ± 0%   -4.67%  (p=0.008 n=5+5)
64/512-72         5.15GB/s ± 1%  4.79GB/s ± 0%   -6.99%  (p=0.016 n=5+4)
64/1024-72        5.55GB/s ± 0%  5.09GB/s ± 0%   -8.32%  (p=0.008 n=5+5)
64/2048-72        5.68GB/s ± 0%  5.19GB/s ± 0%   -8.59%  (p=0.016 n=5+4)
64/4096-72        5.85GB/s ± 0%  5.31GB/s ± 0%   -9.24%  (p=0.008 n=5+5)
64/8192-72        5.93GB/s ± 0%  5.37GB/s ± 0%   -9.49%  (p=0.008 n=5+5)
128/1-72          55.9MB/s ± 0%  37.4MB/s ± 0%  -33.06%  (p=0.008 n=5+5)
128/2-72           108MB/s ± 0%    73MB/s ± 0%  -31.94%  (p=0.016 n=5+4)
128/4-72           205MB/s ± 0%   142MB/s ± 0%  -30.57%  (p=0.008 n=5+5)
128/8-72           364MB/s ± 0%   273MB/s ± 0%  -24.97%  (p=0.016 n=5+4)
128/16-72          810MB/s ± 0%   761MB/s ± 0%   -6.05%  (p=0.008 n=5+5)
128/32-72         1.44GB/s ± 0%  1.40GB/s ± 0%   -2.94%  (p=0.008 n=5+5)
128/64-72         2.38GB/s ± 0%  2.31GB/s ± 0%   -2.63%  (p=0.008 n=5+5)
128/128-72        3.48GB/s ± 0%  3.40GB/s ± 0%   -2.27%  (p=0.008 n=5+5)
128/256-72        4.45GB/s ± 1%  4.24GB/s ± 0%   -4.64%  (p=0.008 n=5+5)
128/512-72        5.10GB/s ± 0%  4.77GB/s ± 0%   -6.51%  (p=0.008 n=5+5)
128/1024-72       5.53GB/s ± 0%  5.08GB/s ± 0%   -8.19%  (p=0.008 n=5+5)
128/2048-72       5.67GB/s ± 1%  5.19GB/s ± 0%   -8.50%  (p=0.008 n=5+5)
128/4096-72       5.83GB/s ± 1%  5.31GB/s ± 0%   -8.98%  (p=0.008 n=5+5)
128/8192-72       5.93GB/s ± 0%  5.37GB/s ± 0%   -9.42%  (p=0.008 n=5+5)

Comparing 1.9.2 to [1.10beta2 + revert of 9c99512]:

name              old time/op    new time/op    delta
32/1-72             8.14ns ± 0%    7.74ns ± 0%  -4.96%  (p=0.008 n=5+5)
32/2-72             8.92ns ± 0%    8.56ns ± 0%  -4.08%  (p=0.000 n=4+5)
32/4-72             8.25ns ± 0%    7.97ns ± 0%  -3.40%  (p=0.008 n=5+5)
32/8-72             9.90ns ± 0%    9.67ns ± 0%  -2.36%  (p=0.000 n=5+4)
32/16-72            13.3ns ± 0%    13.1ns ± 0%  -1.21%  (p=0.008 n=5+5)
32/32-72            18.0ns ± 0%    18.1ns ± 0%  +0.56%  (p=0.008 n=5+5)
32/64-72            29.4ns ± 0%    29.6ns ± 0%  +0.68%  (p=0.008 n=5+5)
32/128-72           56.9ns ± 0%    53.8ns ± 0%  -5.41%  (p=0.008 n=5+5)
32/256-72            108ns ± 0%     105ns ± 0%  -2.78%  (p=0.008 n=5+5)
32/512-72            211ns ± 0%     211ns ± 0%    ~     (all equal)
32/1024-72           416ns ± 0%     417ns ± 0%  +0.24%  (p=0.008 n=5+5)
32/2048-72           828ns ± 0%     828ns ± 0%    ~     (p=0.444 n=5+5)
32/4096-72          1.65µs ± 0%    1.65µs ± 0%    ~     (p=0.079 n=5+5)
32/8192-72          3.30µs ± 0%    3.30µs ± 0%    ~     (p=0.683 n=5+5)
Partial32/8-72       191ns ± 1%     189ns ± 1%    ~     (p=0.079 n=5+5)
Partial32/16-72      205ns ± 0%     200ns ± 0%  -2.44%  (p=0.008 n=5+5)
Partial32/32-72      284ns ± 0%     280ns ± 0%  -1.41%  (p=0.029 n=4+4)
Partial32/64-72      238ns ± 0%     236ns ± 0%  -0.84%  (p=0.008 n=5+5)
Partial32/128-72     268ns ± 0%     264ns ± 0%  -1.35%  (p=0.008 n=5+5)
64/1-72             17.0ns ± 0%    16.8ns ± 0%  -1.18%  (p=0.000 n=5+4)
64/2-72             17.9ns ± 0%    17.5ns ± 0%    ~     (p=0.079 n=4+5)
64/4-72             18.8ns ± 0%    18.6ns ± 0%  -1.06%  (p=0.029 n=4+4)
64/8-72             21.1ns ± 0%    21.2ns ± 0%  +0.47%  (p=0.016 n=5+4)
64/16-72            19.1ns ± 0%    18.6ns ± 0%  -2.62%  (p=0.008 n=5+5)
64/32-72            21.2ns ± 0%    21.1ns ± 0%  -0.47%  (p=0.008 n=5+5)
64/64-72            25.9ns ± 0%    26.3ns ± 0%  +1.54%  (p=0.016 n=4+5)
64/128-72           35.9ns ± 1%    36.4ns ± 0%  +1.39%  (p=0.008 n=5+5)
64/256-72           57.1ns ± 0%    56.2ns ± 0%  -1.51%  (p=0.008 n=5+5)
64/512-72            100ns ± 1%      98ns ± 0%  -1.01%  (p=0.016 n=5+4)
64/1024-72           184ns ± 0%     183ns ± 0%    ~     (p=0.079 n=4+5)
64/2048-72           360ns ± 0%     358ns ± 0%  -0.56%  (p=0.008 n=5+5)
64/4096-72           700ns ± 0%     697ns ± 0%  -0.37%  (p=0.008 n=5+5)
64/8192-72          1.38µs ± 0%    1.38µs ± 0%  -0.29%  (p=0.016 n=5+4)
128/1-72            17.9ns ± 0%    17.6ns ± 0%  -1.68%  (p=0.008 n=5+5)
128/2-72            18.6ns ± 0%    18.2ns ± 0%  -2.15%  (p=0.000 n=5+4)
128/4-72            19.5ns ± 0%    19.2ns ± 0%  -1.54%  (p=0.008 n=5+5)
128/8-72            22.0ns ± 0%    22.0ns ± 0%    ~     (all equal)
128/16-72           19.7ns ± 0%    19.4ns ± 0%  -1.52%  (p=0.008 n=5+5)
128/32-72           22.2ns ± 0%    22.0ns ± 0%  -0.90%  (p=0.008 n=5+5)
128/64-72           26.9ns ± 0%    27.3ns ± 0%  +1.64%  (p=0.008 n=5+5)
128/128-72          36.8ns ± 0%    37.5ns ± 0%  +1.90%  (p=0.016 n=5+4)
128/256-72          57.6ns ± 1%    57.4ns ± 0%    ~     (p=0.206 n=5+5)
128/512-72           100ns ± 0%     100ns ± 0%  -0.30%  (p=0.008 n=5+5)
128/1024-72          185ns ± 1%     184ns ± 0%    ~     (p=0.167 n=5+5)
128/2048-72          361ns ± 1%     358ns ± 0%  -0.83%  (p=0.008 n=5+5)
128/4096-72          702ns ± 1%     697ns ± 0%  -0.77%  (p=0.008 n=5+5)
128/8192-72         1.38µs ± 0%    1.38µs ± 0%  -0.41%  (p=0.008 n=5+5)

name              old speed      new speed      delta
32/1-72            123MB/s ± 0%   129MB/s ± 0%  +5.19%  (p=0.008 n=5+5)
32/2-72            224MB/s ± 0%   234MB/s ± 0%  +4.26%  (p=0.008 n=5+5)
32/4-72            485MB/s ± 0%   502MB/s ± 0%  +3.53%  (p=0.008 n=5+5)
32/8-72            808MB/s ± 0%   827MB/s ± 0%  +2.42%  (p=0.008 n=5+5)
32/16-72          1.21GB/s ± 0%  1.22GB/s ± 0%  +1.24%  (p=0.008 n=5+5)
32/32-72          1.78GB/s ± 0%  1.77GB/s ± 0%  -0.60%  (p=0.008 n=5+5)
32/64-72          2.17GB/s ± 0%  2.17GB/s ± 0%  -0.44%  (p=0.008 n=5+5)
32/128-72         2.25GB/s ± 0%  2.38GB/s ± 0%  +5.74%  (p=0.008 n=5+5)
32/256-72         2.36GB/s ± 0%  2.43GB/s ± 0%  +2.86%  (p=0.008 n=5+5)
32/512-72         2.43GB/s ± 0%  2.42GB/s ± 0%  -0.18%  (p=0.008 n=5+5)
32/1024-72        2.46GB/s ± 0%  2.45GB/s ± 0%  -0.08%  (p=0.008 n=5+5)
32/2048-72        2.47GB/s ± 0%  2.47GB/s ± 0%  -0.09%  (p=0.016 n=4+5)
32/4096-72        2.48GB/s ± 0%  2.48GB/s ± 0%    ~     (p=0.056 n=5+5)
32/8192-72        2.48GB/s ± 0%  2.48GB/s ± 0%    ~     (p=0.310 n=5+5)
Partial32/8-72    41.8MB/s ± 1%  42.3MB/s ± 1%  +1.25%  (p=0.016 n=5+5)
Partial32/16-72   77.8MB/s ± 0%  79.8MB/s ± 0%  +2.56%  (p=0.008 n=5+5)
Partial32/32-72    112MB/s ± 0%   114MB/s ± 0%  +1.38%  (p=0.008 n=5+5)
Partial32/64-72    268MB/s ± 0%   271MB/s ± 0%  +0.96%  (p=0.008 n=5+5)
Partial32/128-72   477MB/s ± 0%   484MB/s ± 0%  +1.42%  (p=0.008 n=5+5)
64/1-72           58.7MB/s ± 0%  59.6MB/s ± 0%  +1.46%  (p=0.016 n=5+4)
64/2-72            112MB/s ± 0%   114MB/s ± 0%  +2.12%  (p=0.008 n=5+5)
64/4-72            212MB/s ± 0%   214MB/s ± 1%    ~     (p=0.151 n=5+5)
64/8-72            379MB/s ± 0%   377MB/s ± 0%  -0.67%  (p=0.008 n=5+5)
64/16-72           839MB/s ± 0%   860MB/s ± 0%  +2.49%  (p=0.008 n=5+5)
64/32-72          1.51GB/s ± 0%  1.52GB/s ± 0%  +0.75%  (p=0.008 n=5+5)
64/64-72          2.47GB/s ± 0%  2.43GB/s ± 0%  -1.47%  (p=0.008 n=5+5)
64/128-72         3.57GB/s ± 1%  3.52GB/s ± 0%  -1.40%  (p=0.008 n=5+5)
64/256-72         4.49GB/s ± 0%  4.55GB/s ± 0%  +1.54%  (p=0.008 n=5+5)
64/512-72         5.15GB/s ± 1%  5.20GB/s ± 0%  +0.97%  (p=0.008 n=5+5)
64/1024-72        5.55GB/s ± 0%  5.58GB/s ± 0%  +0.61%  (p=0.008 n=5+5)
64/2048-72        5.68GB/s ± 0%  5.72GB/s ± 0%  +0.60%  (p=0.008 n=5+5)
64/4096-72        5.85GB/s ± 0%  5.87GB/s ± 0%  +0.36%  (p=0.008 n=5+5)
64/8192-72        5.93GB/s ± 0%  5.95GB/s ± 0%  +0.32%  (p=0.016 n=5+5)
128/1-72          55.9MB/s ± 0%  56.8MB/s ± 0%  +1.67%  (p=0.008 n=5+5)
128/2-72           108MB/s ± 0%   110MB/s ± 0%  +2.36%  (p=0.008 n=5+5)
128/4-72           205MB/s ± 0%   209MB/s ± 0%  +1.73%  (p=0.016 n=5+4)
128/8-72           364MB/s ± 0%   363MB/s ± 0%  -0.25%  (p=0.008 n=5+5)
128/16-72          810MB/s ± 0%   823MB/s ± 0%  +1.53%  (p=0.008 n=5+5)
128/32-72         1.44GB/s ± 0%  1.46GB/s ± 0%  +1.01%  (p=0.008 n=5+5)
128/64-72         2.38GB/s ± 0%  2.34GB/s ± 0%  -1.50%  (p=0.008 n=5+5)
128/128-72        3.48GB/s ± 0%  3.41GB/s ± 0%  -1.97%  (p=0.008 n=5+5)
128/256-72        4.45GB/s ± 1%  4.46GB/s ± 0%    ~     (p=0.151 n=5+5)
128/512-72        5.10GB/s ± 0%  5.13GB/s ± 0%  +0.75%  (p=0.008 n=5+5)
128/1024-72       5.53GB/s ± 0%  5.55GB/s ± 0%    ~     (p=0.151 n=5+5)
128/2048-72       5.67GB/s ± 1%  5.71GB/s ± 0%  +0.80%  (p=0.008 n=5+5)
128/4096-72       5.83GB/s ± 1%  5.87GB/s ± 0%  +0.70%  (p=0.008 n=5+5)
128/8192-72       5.93GB/s ± 0%  5.95GB/s ± 0%  +0.41%  (p=0.008 n=5+5)

@ulikunitz
Copy link
Contributor

@TocarIP You are right, the load is aligned. But in that case the store is unaligned. So what can be seen in the profile may be an effect of out-of-order execution.

In any case I can also confirm that the reversal of 9c99612 fixes the issue with cespare's code.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/88135 mentions this issue: cmd/compile: don't combine 64-bit loads/stores on amd64

@cespare
Copy link
Contributor Author

cespare commented Jan 17, 2018

I sent a CL which removes the 64-bit combining but leaves 8/16/32-bit in place.

That fixes all of my benchmark regressions.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Projects
None yet
Development

No branches or pull requests

9 participants