runtime: performance regression in ec9c84c8 #17250

Closed
mvdan opened this issue Sep 27, 2016 · 9 comments


mvdan commented Sep 27, 2016

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go version devel +8f9e2ab Mon Sep 26 20:31:56 2016 +0000 linux/amd64

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/mvdan/go"
GORACE=""
GOROOT="/home/mvdan/tip"
GOTOOLDIR="/home/mvdan/tip/pkg/tool/linux_amd64"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build104152461=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"

What did you do?

  • Run benchmarks of my package (tip 3311275c)
  • Pull tip and rebuild it (196df6f0)
  • Re-run benchmarks.
  • Compare old and new benchmarks with benchstat.

What did you expect to see?

  • Similar or slightly better benchmark results

What did you see instead?

name                 old time/op    new time/op    delta
Parse/Whitespace-4      639ns ± 1%     651ns ± 1%  +1.85%          (p=0.002 n=6+6)
Parse/Comment-4         347ns ± 1%     354ns ± 1%  +2.16%          (p=0.002 n=6+6)
Parse/LongLit-4        1.02µs ± 3%    1.01µs ± 0%    ~             (p=0.301 n=6+6)
Parse/Cmds-4           10.4µs ± 1%    10.4µs ± 0%    ~             (p=0.513 n=6+6)
Parse/Quoted-4         1.20µs ± 1%    1.25µs ± 1%  +4.36%          (p=0.002 n=6+6)
Parse/NestedStmts-4    5.87µs ± 0%    5.87µs ± 1%    ~             (p=0.873 n=4+5)
Parse/Assign-4         1.66µs ± 1%    1.68µs ± 1%  +0.96%          (p=0.009 n=6+6)
Parse/Clauses-4        3.64µs ± 1%    3.62µs ± 0%  -0.72%          (p=0.013 n=6+6)
Parse/Binary-4         3.71µs ± 3%    3.68µs ± 1%    ~             (p=0.784 n=6+6)
Parse/Redirect-4       3.43µs ± 1%    3.59µs ± 1%  +4.76%          (p=0.002 n=6+6)
Parse/Arithm-4         5.50µs ± 1%    5.44µs ± 2%    ~             (p=0.065 n=6+6)
Fprint-4               5.12µs ± 1%    5.25µs ± 5%  +2.46%          (p=0.017 n=6+6)

I focused on Parse/Quoted and tried to find which commit was causing such a noticeable slowdown. I narrowed it down to ec9c84c:

 $ benchstat 196df6f0c ec9c84c8
name            old time/op    new time/op    delta
Parse/Quoted-4    1.20µs ± 0%    1.25µs ± 1%  +3.77%          (p=0.004 n=6+5)

name            old alloc/op   new alloc/op   delta
Parse/Quoted-4      744B ± 0%      744B ± 0%    ~     (all samples are equal)

name            old allocs/op  new allocs/op  delta
Parse/Quoted-4      14.0 ± 0%      14.0 ± 0%    ~     (all samples are equal)

This test mainly takes a slice of bytes and iterates through it, with some simple logic and garbage generation. I'm not sure what metrics would be useful here, or how to narrow the problem down to a small, self-contained play.golang.org benchmark that isn't part of a big package.
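
To give an idea of the shape of the code, a stripped-down sketch (hypothetical names and input, not the actual parser code; just a byte-slice scan with some slice growth for garbage) would look something like this:

package parser

import "testing"

// Hypothetical stand-ins for the real benchmark: the actual code lives in
// github.com/mvdan/sh/parser; this only mimics its general shape.
var quotedSrc = []byte(`"foo ${bar} baz $(cmd) qux"`)

var sink [][]byte

// scanQuoted walks the byte slice, applying some simple per-byte logic and
// appending sub-slices, which grows the backing array and generates garbage.
func scanQuoted(src []byte) [][]byte {
	var parts [][]byte
	start := 0
	for i, b := range src {
		if b == '$' || b == '"' {
			if i > start {
				parts = append(parts, src[start:i])
			}
			start = i + 1
		}
	}
	if start < len(src) {
		parts = append(parts, src[start:])
	}
	return parts
}

func BenchmarkParseQuoted(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = scanQuoted(quotedSrc)
	}
}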

Here are the pprof results from before and after the commit. Before:

 $ ./1_parser.test -test.bench=Parse/Quoted -test.cpuprofile=1_cpu.out -test.benchtime=5s
 $ go tool pprof 1_parser.test 1_cpu.out
(pprof) top
5520ms of 8110ms total (68.06%)
Dropped 73 nodes (cum <= 40.55ms)
Showing top 10 nodes out of 82 (cum >= 180ms)
      flat  flat%   sum%        cum   cum%
    1760ms 21.70% 21.70%     3290ms 40.57%  runtime.mallocgc
    1070ms 13.19% 34.90%     1400ms 17.26%  github.com/mvdan/sh/parser.(*parser).advanceLitDquote
     820ms 10.11% 45.01%      820ms 10.11%  runtime.heapBitsSetType
     480ms  5.92% 50.92%     1340ms 16.52%  runtime.growslice
     300ms  3.70% 54.62%      300ms  3.70%  runtime.memclr
     260ms  3.21% 57.83%     3580ms 44.14%  github.com/mvdan/sh/parser.(*parser).wordPart
     250ms  3.08% 60.91%      440ms  5.43%  runtime.scanobject
     220ms  2.71% 63.63%     4570ms 56.35%  github.com/mvdan/sh/parser.(*parser).wordParts
     180ms  2.22% 65.84%     1670ms 20.59%  github.com/mvdan/sh/parser.(*parser).next
     180ms  2.22% 68.06%      180ms  2.22%  runtime.greyobject

After:

 $ ./2_parser.test -test.bench=Parse/Quoted -test.cpuprofile=2_cpu.out -test.benchtime=5s
 $ go tool pprof 2_parser.test 2_cpu.out
(pprof) top
5790ms of 8240ms total (70.27%)
Dropped 74 nodes (cum <= 41.20ms)
Showing top 10 nodes out of 81 (cum >= 140ms)
      flat  flat%   sum%        cum   cum%
    1910ms 23.18% 23.18%     3470ms 42.11%  runtime.mallocgc
    1460ms 17.72% 40.90%     1860ms 22.57%  github.com/mvdan/sh/parser.(*parser).advanceLitDquote
     780ms  9.47% 50.36%      780ms  9.47%  runtime.heapBitsSetType
     410ms  4.98% 55.34%      410ms  4.98%  runtime.memclr
     370ms  4.49% 59.83%     1220ms 14.81%  runtime.growslice
     220ms  2.67% 62.50%     3910ms 47.45%  github.com/mvdan/sh/parser.(*parser).wordPart
     180ms  2.18% 64.68%      180ms  2.18%  runtime.greyobject
     160ms  1.94% 66.63%     4760ms 57.77%  github.com/mvdan/sh/parser.(*parser).wordParts
     160ms  1.94% 68.57%      340ms  4.13%  runtime.scanobject
     140ms  1.70% 70.27%      140ms  1.70%  runtime.memmove

I can upload the test binaries if that would help.


mvdan commented Sep 27, 2016

Just to clarify, the before/after comparisons at the end are between 196df6f and ec9c84c - the latter is the commit introducing the performance regression, and the former is its direct parent.

bradfitz added this to the Go1.8 milestone Sep 27, 2016
aclements commented

Thanks for the report. This commit wasn't expected to change performance at all.

I'm able to reproduce the slowdown on my workstation using the TimeFormat benchmark from $GOROOT/test/bench/go1.

My simplest hypothesis was that I'd simply screwed up the change and thrown off the scheduling of GC, but that appears not to be the case. The number of GCs is unchanged and all of the heap-based scheduling is identical to the megabyte. The wall-clock times for GC are also nearly identical (easily within the noise). Furthermore, there's only one GC during the benchmark itself and its times are the same to within a microsecond.

Before:

$ GOMAXPROCS=1 GODEBUG=gctrace=1 ./issue-17250/bench.196df6f -test.bench TimeFormat
...
gc 9 @3.564s 0%: 0.011+1.6+0.010 ms clock, 0.011+0/1.6/0+0.010 ms cpu, 427->427->286 MB, 573 MB goal, 1 P
 5000000               346 ns/op

After:

$ GOMAXPROCS=1 GODEBUG=gctrace=1 ./issue-17250/bench.ec9c84c -test.bench TimeFormat                                                         
...
gc 9 @3.618s 0%: 0.012+1.6+0.009 ms clock, 0.012+0/1.6/0+0.009 ms cpu, 427->427->286 MB, 573 MB goal, 1 P
 5000000               360 ns/op

This all suggests a subtler effect, and most likely on malloc rather than GC.


mvdan commented Sep 27, 2016

@aclements thanks for the quick reply. Let me know if you need me to try out a patch.

aclements commented

It turns out this is just an unfortunate code alignment issue, at least for TimeFormat. It's not even a data alignment issue. By (unintentionally) taking advantage of an odd property of incremental builds versus a full make.bash build, I was able to build two go1 benchmark binaries at ec9c84c where one was significantly slower than the build at 196df6f, while the other performed identically. The only difference between the two binaries is the size and alignment of some of the text symbols. runtime.gcController is at the exact same address in both binaries, so only the text was affected.

I also built a version at ec9c84c with a patch (CL 30013) that moves the new memstats.gc_trigger field to where the old field was removed from gcController, thus restoring the sizes and alignments of those two structures. This had no effect; the performance was still worse. That alone doesn't rule out cache contention, but I also observed the slowdown at GOMAXPROCS=1, so it's not cache contention.

These sorts of alignment shifts happen all the time as the runtime, compiler, and common libraries change. Usually some of the shifts are helpful and others are detrimental, so they wind up cancelling out, but sometimes, by chance, you wind up with more detrimental shifts in a single change. That seems to be what's going on here. The bad news is it's not clear what we can do about it, but the good news is they cancel out in the noise over the long run. Because of this, I'm going to close this bug as resolved. However, @mvdan, if you find that this is a stable change over the longer term with other compiler and runtime changes, please ping this bug.

The process to build the two binaries was:

git checkout ec9c84c && ./make.bash
(cd ../test/bench/go1 && go test -c -o bench1)
touch runtime/mgc.go
(cd ../test/bench/go1 && go test -c -o bench2)

The following symbols differed in size (delta from first/slower to second/faster binary):

  -33 regexp/syntax.(*parser).collapse
  -32 runtime.pclntab
  -12 text/template.(*Template).Clone
  -11 runtime.adjustframe
   -3 bufio.(*Reader).ReadBytes
   +3 encoding/json.(*decodeState).literalStore
   +3 net/textproto.(*Reader).readContinuedLineSlice
   +6 vendor/golang_org/x/net/http2/hpack.(*Decoder).parseFieldIndexed
   +9 encoding/gob.(*Encoder).encodeMap

I don't believe any of these are directly involved in TimeFormat, so the effect must have come from the perturbed alignment of the other symbols between them.

(I looked into why (*parser).collapse changed and, curiously, it's because the stack spill slots are different. Its frame is just large enough that a few stack-relative MOVs switch between the short five-byte encoding and the longer eight-byte encoding.)
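
One way to compute this kind of symbol-size diff is sketched below (an assumed approach using the standard debug/elf package, with hypothetical names; not necessarily how the list above was produced):

// symsizediff is a hypothetical helper: it prints every symbol whose size
// differs between two ELF binaries, e.g. the bench1 and bench2 builds above.
package main

import (
	"debug/elf"
	"fmt"
	"log"
	"os"
)

// symSizes maps each named symbol in the binary at path to its size in bytes.
func symSizes(path string) map[string]uint64 {
	f, err := elf.Open(path)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	syms, err := f.Symbols()
	if err != nil {
		log.Fatal(err)
	}
	sizes := make(map[string]uint64)
	for _, s := range syms {
		if s.Name != "" {
			sizes[s.Name] = s.Size
		}
	}
	return sizes
}

func main() {
	slow, fast := symSizes(os.Args[1]), symSizes(os.Args[2])
	for name, slowSize := range slow {
		if fastSize, ok := fast[name]; ok && fastSize != slowSize {
			fmt.Printf("%+5d %s\n", int64(fastSize)-int64(slowSize), name)
		}
	}
}

Running go tool nm -size on each binary and diffing the output by hand should show much the same information.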


mvdan commented Sep 28, 2016

Oh, interesting.

The process to build the two binaries was

Mine was git checkout, git clean -dffx and GOROOT_BOOTSTRAP=/usr/lib/go ./make.bash for each of the two commits. I have no idea whether the git clean (which removes untracked files, i.e. compile output and generated files) affects any of this, but better safe than sorry.

I didn't know alignment issues like these could have such a big impact on performance, but what you explain makes sense. Thanks for the thorough explanation.

I was able to build two go1 benchmark binaries at ec9c84c where one of them was significantly slower than the build at 196df6f, while the other performed identically.

Just to be sure I understand: this means that if I repeat my process enough times, I should sometimes get the 3-4% slowdown and sometimes no slowdown at all? I'm not sure I understand what you mean by incremental builds, but I think I'm not doing that, precisely because of the git clean before each build.

aclements commented

Just to be sure I understand: this means that if I repeat my process enough times, I should sometimes get the 3-4% slowdown and sometimes no slowdown at all? I'm not sure I understand what you mean by incremental builds, but I think I'm not doing that, precisely because of the git clean before each build.

You're doing a full build because of the make.bash, versus just touching the runtime and letting go build figure out that it has to rebuild things. If you repeatedly do the full build, I wouldn't expect any change because you should get the same binary each time, which will have the same alignment of everything. However, if you rebuild after later commits that change the compiler or runtime again (I'm not sure there have been any yet), I would expect these alignments to be perturbed again and settle back to their average "goodness".


mvdan commented Sep 28, 2016

That makes sense. What I don't understand is how you managed to get two test binaries, one that showed the slowdown and one that didn't - presumably because the alignment was different in each.

Is there a doc that outlines the workflow when working with tip? I have no idea if full builds are even necessary after pulling a bunch of commits on tip, versus something like go install std.

aclements commented

What I don't understand is how you managed to get two test binaries, one that showed the slowdown and one that didn't - presumably because the alignment was different in each.

This was entirely accidental. In principle, the incremental build should produce the same binary as the full build given the same source tree. I don't know why it doesn't, and this is probably a (minor) bug in the compiler, but I just took advantage of it when I noticed. :)

Is there a doc that outlines the workflow when working with tip? I have no idea if full builds are even necessary after pulling a bunch of commits on tip, versus something like go install std.

You should always do a full make.bash after pulling. If you're just modifying something in the standard library (including the runtime), you should be able to just use go install, since it will rebuild standard library packages if necessary. If you're modifying the compiler, things get hairier, but you generally want a make.bash.


mvdan commented Sep 29, 2016

Right, thanks for the explanation!
