-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/compile: 63% benchmark slowdown in 1.7 -> 1.8rc2 #18740
Comments
I think a code example will probably be necessary. |
Of at least that function. Unless @randall77 or @josharian can tell by glancing at the assembly output. |
I'm working on extracting the function and the inputs the benchmarks generate |
I see nothing obviously wrong with the assembly. I see no reason why that particular instruction should be slow - it's just reading a stack slot that was written dozens of instructions ago. |
Got it isolated. Right now I have an example of the benchmark case that had the worst slowdown. Unfortunately the benchmark included somewhat large generated input data, so I have had to manually include that as very large arrays. I can work on a smaller one, but this shows pretty easily the slow down. I can't paste it at play.golang.org (I get a server error) so I put it in a github gist: https://gist.github.com/ScottMansfield/c129bd11b66a19e50ed1bab1444bd389 I see not quite the same, but still significant, difference. Running the benchmark 30 times each I get:
|
go1.7, for
while go1.8 is generating
this is due to 26a6131. Unfortunately the single
If you substitute the following line:
with:
the performance is restored to go1.7 levels (at least on my machine). |
So, increased memory latency due to unaligned memory access in 1.8. Interesting to read and compare with http://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/ |
This looks like some sort of partial read/write stall.
1.7 to 1.8:
BenchmarkA is faster probably exactly because of the load combining in the The copy is what makes BenchmarkC fall over. I think what is happening is that the copy is done inside memmove with 2 2-byte load/store, and then we read the result with a 4-byte load. I'm guessing the processor is barfing (stalling somehow) on such a combination. In 1.7, we do 2 2-byte load/stores and then 4 1-byte loads, which the processor can handle effortlessly. If I fix memmove to use a 4-byte load/store in the 4 byte case, that fixes BenchmarkC.
Patch:
@TocarIP : Intel guys, does this sound right? Expected? I'll prepare a CL. |
Looks like store-forwarding issue. |
Just to be sure I've collected perf events ( ld_blocks_store_forward). |
CL https://golang.org/cl/35567 mentions this issue. |
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (
go version
)?$ go version
go version go1.7 darwin/amd64
$ go1.8rc2 version
go version go1.8rc2 darwin/amd64
What operating system and processor architecture are you using (
go env
)?$ go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/smansfield/gopkg"
GORACE=""
GOROOT="/usr/local/Cellar/go/1.7/libexec"
GOTOOLDIR="/usr/local/Cellar/go/1.7/libexec/pkg/tool/darwin_amd64"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/f2/wwlcfvc5267dkl3c2_5_mcsw0000gp/T/go-build461692826=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"
What did you do?
If possible, provide a recipe for reproducing the error.
A complete runnable program is good.
A link on play.golang.org is best.
I have the disasm output for both 1.7 and 1.8 with a command to run benchtime 10s
1.7: https://play.golang.org/p/N04n8MwmS4
1.8: https://play.golang.org/p/sZpOz59PbS
What did you expect to see?
Approximately equal or better performance
What did you see instead?
Worse performance by far for 1.8, by 63% - 65%
What's interesting in the 1.8 output is that the profiling shows a single MOVL taking 4.98 of the 9.37 seconds
4.98s 1144938: MOVL 0x30(SP), CX
Unfortunately the code to do all this is private, but I might be able to extract an example if it's necessary. Right now I don't have one outside my project.
The text was updated successfully, but these errors were encountered: