cmd/compile: go1.7beta2 performance regressions #16117
I'm away from my arm(64) machines for a couple of weeks. Help bisecting would be lovely.
benchviz for Pine64
CC @randall77
@quentinmit and I are looking at this. We found that it also slows down on linux/386, and linux/amd64 with
Could be the runtime, or it could be front-end compiler changes. I think there were near-zero 386- or arm-specific fixes this cycle, so it might be worth comparing the generated code to rule out the compiler (perhaps after some profiling to pinpoint the change).
All those packages use reflection. Could it be related to the type reworking?
Reflection is definitely part of the puzzle. Here's the state of our investigation so far:
The slow version contains a call to this routine that the faster version does not.
CC @crawshaw
@RLH Is it easy for you to find the significant callers of this function? It's possible that the significant callers are from code that is looking for a specific name, which we could probably speed up somewhat by doing things like checking the length before building the string we return.
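The length-check idea Ian suggests might look roughly like this. This is a hypothetical sketch, not reflect's actual code: the one-byte length prefix and the `nameEqual` helper are assumptions made for illustration.

```go
package main

import "fmt"

// nameEqual checks an encoded name against a wanted name.
// Comparing lengths first rejects most candidates without
// materializing a Go string (hypothetical encoding: first byte
// is the name length, remaining bytes are the name).
func nameEqual(encoded []byte, want string) bool {
	n := int(encoded[0])
	if n != len(want) {
		return false // cheap rejection, no string built
	}
	return string(encoded[1:1+n]) == want
}

func main() {
	enc := append([]byte{5}, "Field"...)
	fmt.Println(nameEqual(enc, "Field"), nameEqual(enc, "X"))
	// → true false
}
```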
I have an existence proof that code alignment can cost you 10% on Revcomp. |
@ianlancetaylor If I'm wielding pprof correctly, the callers are:
Like @RLH, I only see 3% attributed to this function, though, not 10-20%.
Apologies for the kibitzing, but if you rearrange the
I fooled around with
I don't think it's worth recovering the performance there. We could either build a map in the reflect package to make FieldByName fast, or better, we could modify the template package so that, like encoders and decoders, on first encounter with a type it builds a contraption for reading type data. That way FieldByName is only called once per type instead of once per template execution.
Using the simple cache in https://golang.org/cl/24322 I see about a 10% speedup in the go1 template test. Whether this is worth doing for 1.7 I don't know.
CL https://golang.org/cl/24322 mentions this issue.
For
On arm64,
This was removed in CL 19695, but it slows down reflect.New, which ends up on the hot path of things like JSON decoding. There is no immediate cost in binary size, but it will make it harder to further shrink run time type information in Go 1.8.

Before: BenchmarkNew-40    30000000    36.3 ns/op
After:  BenchmarkNew-40    50000000    29.5 ns/op

Fixes golang#16161
Updates golang#16117

Change-Id: If7cb7f3e745d44678f3f5cf3a5338c59847529d2
Reviewed-on: https://go-review.googlesource.com/24400
Reviewed-by: Ian Lance Taylor <iant@golang.org>
CL https://golang.org/cl/24410 mentions this issue.
Improves JSON decoding on linux/amd64.

name              old time/op    new time/op    delta
CodeUnmarshal-40  89.3ms ± 2%    86.3ms ± 2%    -3.31%  (p=0.000 n=22+22)

name              old speed      new speed      delta
CodeUnmarshal-40  21.7MB/s ± 2%  22.5MB/s ± 2%  +3.44%  (p=0.000 n=22+22)

Updates golang#16117

Change-Id: I52acf31d7729400cfe6693e46292d41e1addba3d
Reviewed-on: https://go-review.googlesource.com/24410
Run-TryBot: David Crawshaw <crawshaw@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
CL https://golang.org/cl/24433 mentions this issue.
The encoding/json package uses NumMethod()==0 as a fast check for interface satisfaction. In the case when a type has no methods at all, we don't need to grab the RWMutex.

Improves JSON decoding benchmark on linux/amd64:

name           old time/op    new time/op    delta
CodeDecoder-8  44.2ms ± 2%    40.6ms ± 1%    -8.11%  (p=0.000 n=10+10)

name           old speed      new speed      delta
CodeDecoder-8  43.9MB/s ± 2%  47.8MB/s ± 1%  +8.82%  (p=0.000 n=10+10)

For #16117

Change-Id: Id717e7fcd2f41b7d51d50c26ac167af45bae3747
Reviewed-on: https://go-review.googlesource.com/24433
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: David Crawshaw <crawshaw@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
CL https://golang.org/cl/24452 is also for this issue. With it and the above, the JSON decoding performance hit on 386 has gone from ~15% to ~4%:
(On AMD64 with SSA, decoding is 9% faster than it was in 1.6.) I don't see any other easy fixes, but I'll spend another day looking. @cherrymui could you please compare JSON decoding on arm64 between 1.6 and tip again?
CL https://golang.org/cl/24469 mentions this issue.
CL https://golang.org/cl/24503 mentions this issue.
Is it right that, in Go 1.7, quite a few of these regress? Is there any plan to gain that performance back in a later release?
@noblehng, in later releases, all architectures will use SSA. That's a major theme for Go 1.8.
@bradfitz, I'm looking at this CL https://go-review.googlesource.com/#/c/20968/. I'm not familiar with the runtime, but I'm wondering whether using the compact form for the compiled binary but a pre-unmarshalled form for runtime reflection is doable in a later release.
On linux/386 compared to tip:

name                     old time/op  new time/op  delta
DecodeInterfaceSlice-40  1.23ms ± 1%  1.17ms ± 1%  -4.93%  (p=0.000 n=9+10)

Recovers about half the performance regression from Go 1.6 on 386.

For #16117.

Change-Id: Ie8676d92a4da3e27ff21b91a98b3e13d16730ba1
Reviewed-on: https://go-review.googlesource.com/24468
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
CL https://golang.org/cl/24510 mentions this issue.
I can take a look at the gzip slowdown for arm and 386. There was much optimization work done on gzip, but it was primarily targeted at amd64; some of it completely neglected other architectures or even consciously favored amd64 at their expense. The fact that gzip now uses 4-byte matching instead of 3-byte matching is something I would have expected all architectures to benefit from, so a hit from 53efe1e (http://golang.org/cl/20929) is surprising to me. As another optimization, we switched the bit buffer to use a uint64 instead of a uint32, so I'm further surprised that arm64 took a greater hit than arm32. If there's nothing easily actionable, we can punt on the optimizations until Go1.8.
Several minor changes that remove a good chunk of the overhead added to the reflect Name method over the 1.7 cycle, as seen from the non-SSA architectures. In particular, there are ~20 fewer instructions in reflect.name.name on 386, and the method now qualifies for inlining.

The simple JSON decoding benchmark on darwin/386:

name           old time/op    new time/op    delta
CodeDecoder-8  49.2ms ± 0%    48.9ms ± 1%    -0.77%  (p=0.000 n=10+9)

name           old speed      new speed      delta
CodeDecoder-8  39.4MB/s ± 0%  39.7MB/s ± 1%  +0.77%  (p=0.000 n=10+9)

On darwin/amd64 the effect is less pronounced:

name           old time/op    new time/op    delta
CodeDecoder-8  38.9ms ± 0%    38.7ms ± 1%    -0.38%  (p=0.005 n=10+10)

name           old speed      new speed      delta
CodeDecoder-8  49.9MB/s ± 0%  50.1MB/s ± 1%  +0.38%  (p=0.006 n=10+10)

Counterintuitively, I get much more useful benchmark data out of my MacBook Pro than a linux workstation with more expensive Intel chips. While the laptop has fewer cores and an active GUI, the single-threaded performance is significantly better (nearly 1.5x decoding throughput), so the differences are more pronounced.

For #16117.

Change-Id: I4e0cc1cc2d271d47d5127b1ee1ca926faf34cabf
Reviewed-on: https://go-review.googlesource.com/24510
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Sameer pointed this bug out to me. Regressions on microbenchmarks may be very misleading, since they are just busy loops around tiny amounts of code. More interesting would be to run the x/bench subrepo benchmarks on these systems and see if there is anything significant there. If not, waiting for SSA on these other systems seems like a decent plan. I haven't looked at any of the CLs, but please remember to make very small, very high confidence changes. Thanks.
I compared the JSON target in x/benchmarks with JSONDecode from test/bench/go1 and got similar results: a slowdown from 1.6 to tip on 386, and https://golang.org/cl/24521 recovers it. I'm just about ready to call gob and json done. If @dsnet has anything for gzip, great, but I think everything else can wait for 1.8.
For the json benchmark in
go1.7beta2
tip (a2a4db7)
Looking at
With GOARCH=386 on an otherwise amd64 host, I'm also seeing a 35% slowdown in BinaryTree17 that looks like it happened when we merged dev.garbage: https://rawgit.com/aclements/74e66853d989adce240fadd62c554d0b/raw/491a81c5874e4f78d01133666a4ae916a166ceef/go1-1.7-386.svg (note that this isn't a complete run yet; it's only a sampling of the commits). @RLH, could you look at the BinaryTree17 profiles before (d8d3351) and after (56b5491) the merge with GOARCH=386 and GOHOSTARCH=386?
What Old:
@dsnet We are running the Gzip benchmark as found in test/bench/go1/gzip_test.go
tl;dr: leave gzip as is for go1.7; we can address arm performance issues in go1.8.

Alright, I think I've figured out what's causing the slowdown for ARM. While all of the go1.7 optimizations as a whole have helped arm and arm64 (as seen in my post above), there are some cases where performance regresses (as is the case for test/bench/go1/gzip_test.go).

One of the major changes made to compress/flate was that we started using 4-byte hashing instead of 3-byte hashing. The new hash function is stronger and leads to fewer collisions and thus fewer searches, which is good for performance. On amd64, we take advantage of the fact that most CPUs can perform 4-byte unaligned loads pretty efficiently (#14267). However, this optimization is not possible on arm, so the arm targets have to pay an extra cost to use the stronger hash function. On Go1.6, the simpler hash function compiled on arm32 to approximately 17 instrs:
On Go1.7, the stronger hash function compiled on arm32 to approximately 77 instrs:
On Go1.7, the stronger hash function compiles on amd64 to approximately 31 instrs:
Any changes to fix this would probably be too big to make this late in the release cycle; we can think about better approaches in Go 1.8. I don't know enough about ARM to know if it will ever have unaligned reads/writes, so it may be worth investigating using the previous hash function over the new one for non-x86 architectures.
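For reference, the 4-byte multiplicative hash under discussion looks roughly like this. The constants below (table size, multiplier) are illustrative assumptions, not the exact values compress/flate uses; the point is that the input is one 4-byte load, which is cheap only where unaligned 32-bit reads are fast.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// hashBits sizes the hash table; illustrative value.
const hashBits = 17

// hash4 reads 4 bytes as a single little-endian word (one unaligned
// load on amd64; several byte loads and shifts on old-backend arm)
// and mixes them with a multiplicative hash.
func hash4(b []byte) uint32 {
	u := binary.LittleEndian.Uint32(b)
	return (u * 2654435761) >> (32 - hashBits) // Knuth's constant, illustrative
}

func main() {
	// The result always fits in the hash table's index range.
	fmt.Println(hash4([]byte("gopher")) < 1<<hashBits)
	// → true
}
```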
If you use the dev.ssa branch and hack in the config to make arm use SSA, how do things look? I.e. will SSA for arm save us for free?
@dsnet, which function did you disassemble? Is it hash4? @josharian, SSA can get the performance back on ARM:
This is tip of
Correct, it is hash4. It's awesome to see the performance boosted with SSA :)
For
With SSA on:
So it is about 50% shorter. Currently it does not do a combined unaligned load; it seems CSE helps the most. It might be possible to do a combined load. Newer ARM chips support unaligned loads, but currently the compiler (both backends) generates the same instructions; only the assembler rewrites some instructions based on GOARM.
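The combined-load point can be illustrated with a sketch: on architectures without fast unaligned access, a 32-bit read must be assembled from four byte loads and shifts, which is exactly the shape a compiler's combined-load rule recognizes and collapses into a single load where legal. The helper name below is an assumption for illustration, not a compiler rule.

```go
package main

import "fmt"

// load32Bytes reads a 32-bit little-endian value byte by byte.
// This is the code shape the old backend must emit on arm; an SSA
// combined-load rewrite can turn it into one 32-bit load on
// architectures where unaligned access is cheap.
func load32Bytes(b []byte) uint32 {
	return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
}

func main() {
	b := []byte{0x78, 0x56, 0x34, 0x12}
	fmt.Printf("%#x\n", load32Bytes(b))
	// → 0x12345678
}
```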
CL https://golang.org/cl/24640 mentions this issue.
Ctz is a hot spot in the Go 1.7 memory manager. In SSA it's implemented as an intrinsic that compiles to a few instructions, but on the old backend (all architectures other than amd64), it's implemented as a fairly complex Go function. As a result, switching to bitmap-based allocation was a significant hit to allocation-heavy workloads like BinaryTree17 on non-SSA platforms. For unknown reasons, this hit 386 particularly hard.

We can regain a lot of the lost performance by implementing Ctz in assembly on the 386. This isn't as good as an intrinsic, since it still generates a function call and prevents useful inlining, but it's much better than the pure Go implementation:

name                      old time/op  new time/op  delta
BinaryTree17-12           3.59s ± 1%   3.06s ± 1%   -14.74%  (p=0.000 n=19+20)
Fannkuch11-12             3.72s ± 1%   3.64s ± 1%   -2.09%   (p=0.000 n=17+19)
FmtFprintfEmpty-12        52.3ns ± 3%  52.3ns ± 3%  ~        (p=0.829 n=20+19)
FmtFprintfString-12       156ns ± 1%   148ns ± 3%   -5.20%   (p=0.000 n=18+19)
FmtFprintfInt-12          137ns ± 1%   136ns ± 1%   -0.56%   (p=0.000 n=19+13)
FmtFprintfIntInt-12       227ns ± 2%   225ns ± 2%   -0.93%   (p=0.000 n=19+17)
FmtFprintfPrefixedInt-12  210ns ± 1%   208ns ± 1%   -0.91%   (p=0.000 n=19+17)
FmtFprintfFloat-12        375ns ± 1%   371ns ± 1%   -1.06%   (p=0.000 n=19+18)
FmtManyArgs-12            995ns ± 2%   978ns ± 1%   -1.63%   (p=0.000 n=17+17)
GobDecode-12              9.33ms ± 1%  9.19ms ± 0%  -1.59%   (p=0.000 n=20+17)
GobEncode-12              7.73ms ± 1%  7.73ms ± 1%  ~        (p=0.771 n=19+20)
Gzip-12                   375ms ± 1%   374ms ± 1%   ~        (p=0.141 n=20+18)
Gunzip-12                 61.8ms ± 1%  61.8ms ± 1%  ~        (p=0.602 n=20+20)
HTTPClientServer-12       87.7µs ± 2%  86.9µs ± 3%  -0.87%   (p=0.024 n=19+20)
JSONEncode-12             20.2ms ± 1%  20.4ms ± 0%  +0.53%   (p=0.000 n=18+19)
JSONDecode-12             65.3ms ± 0%  65.4ms ± 1%  ~        (p=0.385 n=16+19)
Mandelbrot200-12          4.11ms ± 1%  4.12ms ± 0%  +0.29%   (p=0.020 n=19+19)
GoParse-12                3.75ms ± 1%  3.61ms ± 2%  -3.90%   (p=0.000 n=20+20)
RegexpMatchEasy0_32-12    104ns ± 0%   103ns ± 0%   -0.96%   (p=0.000 n=13+16)
RegexpMatchEasy0_1K-12    805ns ± 1%   803ns ± 1%   ~        (p=0.189 n=18+18)
RegexpMatchEasy1_32-12    111ns ± 0%   111ns ± 3%   ~        (p=1.000 n=14+19)
RegexpMatchEasy1_1K-12    1.00µs ± 1%  1.00µs ± 1%  +0.50%   (p=0.003 n=19+19)
RegexpMatchMedium_32-12   133ns ± 2%   133ns ± 2%   ~        (p=0.218 n=20+20)
RegexpMatchMedium_1K-12   41.2µs ± 1%  42.2µs ± 1%  +2.52%   (p=0.000 n=18+16)
RegexpMatchHard_32-12     2.35µs ± 1%  2.38µs ± 1%  +1.53%   (p=0.000 n=18+18)
RegexpMatchHard_1K-12     70.9µs ± 2%  72.0µs ± 1%  +1.42%   (p=0.000 n=19+17)
Revcomp-12                1.06s ± 0%   1.05s ± 0%   -1.36%   (p=0.000 n=20+18)
Template-12               86.2ms ± 1%  84.6ms ± 0%  -1.89%   (p=0.000 n=20+18)
TimeParse-12              425ns ± 2%   428ns ± 1%   +0.77%   (p=0.000 n=18+19)
TimeFormat-12             517ns ± 1%   519ns ± 1%   +0.43%   (p=0.001 n=20+19)
[Geo mean]                74.3µs       73.5µs       -1.05%

Prior to this commit, BinaryTree17-12 on 386 was 33% slower than at the go1.6 tag. With this commit, it's 13% slower. On arm and arm64, BinaryTree17-12 is only ~5% slower than it was at go1.6. It may be worth implementing Ctz for them as well.

I consider this change low risk, since the functions it replaces are simple, very well specified, and well tested.

For #16117.

Change-Id: Ic39d851d5aca91330134596effd2dab9689ba066
Reviewed-on: https://go-review.googlesource.com/24640
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
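The "fairly complex Go function" for Ctz on the old backend is in the spirit of a branch-free De Bruijn lookup, sketched below. This is an illustration of the technique, not the runtime's exact code.

```go
package main

import "fmt"

// deBruijn32 is a De Bruijn sequence: multiplying an isolated
// low bit by it places a unique 5-bit pattern in the top bits.
const deBruijn32 = 0x077CB531

// deBruijnIdx32 maps that 5-bit pattern back to the bit position.
var deBruijnIdx32 = [32]byte{
	0, 1, 28, 2, 29, 14, 24, 3,
	30, 22, 20, 15, 25, 17, 4, 8,
	31, 27, 13, 23, 21, 19, 16, 7,
	26, 12, 18, 6, 11, 5, 10, 9,
}

// ctz32 counts trailing zero bits. x&-x isolates the lowest set
// bit; the multiply-and-shift indexes the lookup table.
func ctz32(x uint32) int {
	if x == 0 {
		return 32
	}
	return int(deBruijnIdx32[(x&-x)*deBruijn32>>27])
}

func main() {
	fmt.Println(ctz32(8), ctz32(1), ctz32(0))
	// → 3 0 32
}
```

An SSA intrinsic replaces all of this with a single hardware instruction (e.g. BSF/TZCNT on x86), which is why the old backend pays so much more here.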
CL https://golang.org/cl/24642 mentions this issue.
CL https://golang.org/cl/24643 mentions this issue.
This issue has headed in several different directions. I don't think we currently plan to do any further work to speed up the reflect package, which is the core of the original report. It would be nice to know what the performance losses are on tip. I'm going to close this issue. If there is further work to be done for 1.7, let's move it to more focused issues.
CL https://golang.org/cl/24521 mentions this issue.
Please answer these questions before submitting your issue. Thanks!

1. What version of Go are you using (go version)?
go 1.6.2, go1.7beta2

2. What operating system and processor architecture are you using (go env)?
linux/arm64, linux/arm

3. What did you do?
run the standard go1 benchmarks

4. What did you expect to see?
little to no performance regression

5. What did you see instead?
see:
https://gist.github.com/ajstarks/fa05919377de73a552870ce9e45e4844 for linux/arm64 (pine64)
https://gist.github.com/ajstarks/c937a1a80e76b4692cb52c1d57a4494d for linux/arm (CHIP)
regressions occur on GobEncode, GobDecode, JSONEncode, and JSONDecode