
cmd/compile: go1.7beta2 performance regressions #16117

Closed
ajstarks opened this issue Jun 19, 2016 · 49 comments

@ajstarks
Contributor

ajstarks commented Jun 19, 2016

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?

go 1.6.2, go1.7beta2

  2. What operating system and processor architecture are you using (go env)?

linux/arm64, linux/arm

  3. What did you do?

run the standard go1 benchmarks

  4. What did you expect to see?

little to no performance regression

  5. What did you see instead?

see:
https://gist.github.com/ajstarks/fa05919377de73a552870ce9e45e4844 for linux/arm64 (pine64)
https://gist.github.com/ajstarks/c937a1a80e76b4692cb52c1d57a4494d for linux/arm (CHIP)

regressions occur on GobEncode, GobDecode, JSONEncode, and JSONDecode

@josharian josharian changed the title go1.7beta2 performance regressions on ARM all: go1.7beta2 performance regressions on ARM Jun 19, 2016
@josharian josharian added this to the Go1.7 milestone Jun 19, 2016
@josharian
Contributor

I'm away from my arm(64) machines for a couple of weeks. Help bisecting would be lovely.

@ajstarks
Contributor Author

benchviz for Pine64
pine64.pdf

@ianlancetaylor ianlancetaylor changed the title all: go1.7beta2 performance regressions on ARM cmd/compile: go1.7beta2 performance regressions on ARM Jun 20, 2016
@ianlancetaylor
Contributor

CC @randall77

@cherrymui
Member

@quentinmit and I are looking at this. We found that it also slows down on linux/386, and on linux/amd64 with -ssa=0, with GobDecode, JSONDecode, and Template being the most significant. Gzip is getting faster, though. So perhaps it is not the compiler?

@josharian
Contributor

Could be the runtime, or it could be front-end compiler changes. I think there were near zero 386 or arm-specific fixes this cycle, so it might be worth comparing the generated code to rule out the compiler (perhaps after some profiling to pinpoint the change).

@josharian
Contributor

All those packages use reflection. Could it be related to the type reworking?

@quentinmit quentinmit changed the title cmd/compile: go1.7beta2 performance regressions on ARM cmd/compile: go1.7beta2 performance regressions Jun 21, 2016
@quentinmit
Contributor

Reflection is definitely part of the puzzle. Here's the state of our investigation so far:

  • Template: ~22% performance hit on arm32, arm64, and 386. Largest hit due to http://golang.org/cl/20968. @RLH is trying to use VTune to pin down the slow code.
  • Revcomp: ~30% performance hit on 386 only. Entirely due to http://golang.org/cl/20901. @dr2chase is looking at this.
  • GobDecode: ~10% performance hit on arm32, arm64, and 386. Very noisy, no obvious culprit.
  • JsonDecode: ~30% performance hit on arm32, arm64, ~15% on 386. Less noisy, still no obvious culprit.
  • Gzip: ~20% performance hit on arm32, arm64. Still need to bisect (not as easy as bisecting on 386).

@RLH
Contributor

RLH commented Jun 21, 2016

The slow version contains a call to this routine that the faster version does not contain. This accounts for around 3% of the total time. It is the only routine that stood out in the top 20 routines.

inst retired | cpu_clk unhalted
reflect.(*name).name  3.3%  2.9%  type.go  0x8166a40  go1-template-bad


@ianlancetaylor
Contributor

CC @crawshaw

@ianlancetaylor
Contributor

@RLH Is it easy for you to find the significant callers of reflect.(*name).name?

It's possible that the significant callers are from code that is looking for a specific name, which we could probably speed up somewhat by doing things like checking the length before building the string we return.
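The length-check idea can be sketched roughly as follows. This is a hedged illustration, not the runtime's actual code: the real name data is a byte blob with flag and varint-style length bytes, so the two-byte big-endian length prefix and the `nameMatches` helper here are stand-ins for illustration only.

```go
package main

import "fmt"

// nameMatches reports whether the encoded name equals target without
// first materializing a string. The encoding here is hypothetical:
// a two-byte big-endian length followed by the name bytes. The point
// is the fast path: reject on length before touching any name bytes
// or allocating.
func nameMatches(encoded []byte, target string) bool {
	n := int(encoded[0])<<8 | int(encoded[1])
	if n != len(target) { // cheap length check first
		return false
	}
	data := encoded[2 : 2+n]
	for i := 0; i < n; i++ {
		if data[i] != target[i] {
			return false
		}
	}
	return true
}

func main() {
	enc := []byte{0, 3, 'F', 'o', 'o'}
	fmt.Println(nameMatches(enc, "Foo"), nameMatches(enc, "Bar"))
}
```

For a caller scanning many fields for one specific name, most candidates fail the length comparison and never reach the byte loop.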

@dr2chase
Contributor

I have an existence proof that code alignment can cost you 10% on Revcomp.

@quentinmit
Contributor

@ianlancetaylor If I'm wielding pprof correctly, the callers are:

                                             1.64s 84.97% |   reflect.(*structType).FieldByName /usr/local/google/home/quentin/go-386/src/reflect/type.go
                                             0.29s 15.03% |   reflect.(*structType).Field /usr/local/google/home/quentin/go-386/src/reflect/type.go
     1.93s  2.05%  2.05%      1.93s  2.05%                | reflect.(*name).name /usr/local/google/home/quentin/go-386/src/reflect/type.go

Like @RLH, I only see 3% attributed to this function, though, not 10-20%.

@josharian
Contributor

Apologies for the kibitzing, but if you rearrange the name method so that it gets inlined, does that help? (Glancing at the code on my phone, it appears that the indirection through data could be simplified to possibly achieve this.) Just because only 3% is spent in the function doesn't mean that's the only cost: there's the call itself, and the impact on optimization that comes from having a call in the middle of an otherwise basic block.

@ianlancetaylor
Contributor

I fooled around with name.name as called by FieldByName a bit but only got about a 1% improvement on 386.

@crawshaw
Member

I don't think it's worth recovering the performance of FieldByName in BenchmarkTemplate. It has always been a linear scan through field names, so it has never been as fast as it could be.

We could either build a map in the reflect package to make FieldByName fast, or better we could modify the template package so that, like encoders and decoders, on first encounter with a type it builds a contraption for reading type data. That way FieldByName is only called once per type instead of once per template execution.

@ianlancetaylor
Contributor

Using the simple cache in https://golang.org/cl/24322 I see about a 10% speedup in the go1 template test.

Whether this is worth doing for 1.7 I don't know.

@gopherbot

CL https://golang.org/cl/24322 mentions this issue.

@cherrymui
Member

For Gzip on arm, commit 53efe1e (http://golang.org/cl/20929) causes 20% slow-down on arm64, and 5% slow-down on arm32.

@cherrymui
Member

On arm64, JSONDecode is slowed down 6% by CL http://golang.org/cl/19694, and slowed down 11% by http://golang.org/cl/19695. @quentinmit confirmed that they also slow down JSONDecode on 386.

mk0x9 pushed a commit to mk0x9/go that referenced this issue Jun 23, 2016
This was removed in CL 19695 but it slows down reflect.New, which ends
up on the hot path of things like JSON decoding.

There is no immediate cost in binary size, but it will make it harder to
further shrink run time type information in Go 1.8.

Before

	BenchmarkNew-40         30000000                36.3 ns/op

After

	BenchmarkNew-40         50000000                29.5 ns/op

Fixes golang#16161
Updates golang#16117

Change-Id: If7cb7f3e745d44678f3f5cf3a5338c59847529d2
Reviewed-on: https://go-review.googlesource.com/24400
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@gopherbot

CL https://golang.org/cl/24410 mentions this issue.

mk0x9 pushed a commit to mk0x9/go that referenced this issue Jun 23, 2016
Improves JSON decoding on linux/amd64.

name                   old time/op    new time/op    delta
CodeUnmarshal-40         89.3ms ± 2%    86.3ms ± 2%  -3.31%  (p=0.000 n=22+22)

name                   old speed      new speed      delta
CodeUnmarshal-40       21.7MB/s ± 2%  22.5MB/s ± 2%  +3.44%  (p=0.000 n=22+22)

Updates golang#16117

Change-Id: I52acf31d7729400cfe6693e46292d41e1addba3d
Reviewed-on: https://go-review.googlesource.com/24410
Run-TryBot: David Crawshaw <crawshaw@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@gopherbot

CL https://golang.org/cl/24433 mentions this issue.

gopherbot pushed a commit that referenced this issue Jun 24, 2016
The encoding/json package uses NumMethod()==0 as a fast check for
interface satisfaction. In the case when a type has no methods at
all, we don't need to grab the RWMutex.

Improves JSON decoding benchmark on linux/amd64:

	name           old time/op    new time/op    delta
	CodeDecoder-8    44.2ms ± 2%    40.6ms ± 1%  -8.11%  (p=0.000 n=10+10)

	name           old speed      new speed      delta
	CodeDecoder-8  43.9MB/s ± 2%  47.8MB/s ± 1%  +8.82%  (p=0.000 n=10+10)

For #16117

Change-Id: Id717e7fcd2f41b7d51d50c26ac167af45bae3747
Reviewed-on: https://go-review.googlesource.com/24433
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: David Crawshaw <crawshaw@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@crawshaw
Member

CL https://golang.org/cl/24452 is also for this issue.

With it and the above, the JSON decoding performance hit on 386 has gone from ~15% to ~4%:

name             old time/op    new time/op    delta
CodeDecoder-8      48.4ms ± 0%    50.3ms ± 0%  +3.89%  (p=0.000 n=10+10)

name             old speed      new speed      delta
CodeDecoder-8    40.1MB/s ± 0%  38.6MB/s ± 0%  -3.75%  (p=0.000 n=10+10)

(On AMD64 with SSA, decoding is 9% faster than it was in 1.6.)

I don't see any other easy fixes, but I'll spend another day looking. @cherrymui could you please compare JSON decoding on arm64 between 1.6 and tip again?

@gopherbot

CL https://golang.org/cl/24469 mentions this issue.

@gopherbot

CL https://golang.org/cl/24503 mentions this issue.

@noblehng

Is it right that, in Go 1.7, some reflect performance was traded away for more compact compiled binaries?
Especially Field() and FieldByName(), which call reflect.(*name).name, and anything that uses Name().
Thanks to better codegen, the amd64 SSA backend can cover that cost elsewhere, but the other architectures cannot.

Is there any plan to gain that performance back in a later release?

@bradfitz
Contributor

@noblehng, in later releases, all architectures will use SSA. That's a major theme for Go 1.8.

@noblehng

@bradfitz, I'm looking at this CL: https://go-review.googlesource.com/#/c/20968/.
To get a method name or struct field name, Go 1.6 needed only a struct field access, but Go 1.7 basically has to unmarshal it from a binary encoding. Sure, adding a global cache will help, but a struct field access will still be faster, even with SSA.

I'm not familiar with the runtime, but I'm wondering whether a later release could keep the compact form in the compiled binary while using a pre-unmarshaled form for runtime reflection.

@bradfitz
Contributor

@noblehng, yes, that's what this bug is tracking (the across-the-board performance loss, largely from reflect changes). @crawshaw is working on it. I was merely pointing out that more SSA is coming, which is basically unrelated (except that it covered up the loss on amd64, as you noted).

gopherbot pushed a commit that referenced this issue Jun 28, 2016
On linux/386 compared to tip:

	name                     old time/op  new time/op  delta
	DecodeInterfaceSlice-40  1.23ms ± 1%  1.17ms ± 1%  -4.93%  (p=0.000 n=9+10)

Recovers about half the performance regression from Go 1.6 on 386.

For #16117.

Change-Id: Ie8676d92a4da3e27ff21b91a98b3e13d16730ba1
Reviewed-on: https://go-review.googlesource.com/24468
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@gopherbot

CL https://golang.org/cl/24510 mentions this issue.

@dsnet
Member

dsnet commented Jun 28, 2016

I can take a look at the gzip slowdown for arm and 386. There was much optimization work done on gzip, but it was primarily targeted at amd64; some of it completely neglected other architectures or even consciously favored amd64 at their expense.

The fact that gzip now uses 4-byte matching instead of 3-byte matching is something I would have expected all architectures to benefit from, so a hit from 53efe1e (http://golang.org/cl/20929) is surprising to me. As another optimization, we switched the bit buffer to a uint64 instead of a uint32, so I'm further surprised that arm64 took a greater hit than arm32.

If there's nothing easily actionable, we can punt on the optimizations until Go1.8.

gopherbot pushed a commit that referenced this issue Jun 28, 2016
Several minor changes that remove a good chunk of the overhead added
to the reflect Name method over the 1.7 cycle, as seen from the
non-SSA architectures.

In particular, there are ~20 fewer instructions in reflect.name.name
on 386, and the method now qualifies for inlining.

The simple JSON decoding benchmark on darwin/386:

	name           old time/op    new time/op    delta
	CodeDecoder-8    49.2ms ± 0%    48.9ms ± 1%  -0.77%  (p=0.000 n=10+9)

	name           old speed      new speed      delta
	CodeDecoder-8  39.4MB/s ± 0%  39.7MB/s ± 1%  +0.77%  (p=0.000 n=10+9)

On darwin/amd64 the effect is less pronounced:

	name           old time/op    new time/op    delta
	CodeDecoder-8    38.9ms ± 0%    38.7ms ± 1%  -0.38%  (p=0.005 n=10+10)

	name           old speed      new speed      delta
	CodeDecoder-8  49.9MB/s ± 0%  50.1MB/s ± 1%  +0.38%  (p=0.006 n=10+10)

Counterintuitively, I get much more useful benchmark data out of my
MacBook Pro than a linux workstation with more expensive Intel chips.
While the laptop has fewer cores and an active GUI, the single-threaded
performance is significantly better (nearly 1.5x decoding throughput)
so the differences are more pronounced.

For #16117.

Change-Id: I4e0cc1cc2d271d47d5127b1ee1ca926faf34cabf
Reviewed-on: https://go-review.googlesource.com/24510
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@rsc
Contributor

rsc commented Jun 28, 2016

Sameer pointed this bug out to me. Regressions on microbenchmarks may be very misleading since they are just busy loops around tiny amounts of code. More interesting would be to run the x/bench subrepo benchmarks on these systems and see if there is anything significant there. If not, waiting for SSA on these other systems seems like a decent plan. I haven't looked at any of the CLs, but please remember to make very small, very high confidence changes. Thanks.

@crawshaw
Member

I compared the JSON target in x/benchmarks with JSONDecode from test/bench/go1 and got similar results: a slowdown from 1.6 to tip on 386, and https://golang.org/cl/24521 recovers it.

I'm just about ready to call gob and json done. If @dsnet has anything for gzip, great, but I think everything else can wait for 1.8.

@cherrymui
Member

For json benchmark in golang.org/x/benchmarks on ARM64,
go1.6.2

GOPERF-METRIC:allocated=7871382
GOPERF-METRIC:allocs=105419
GOPERF-METRIC:cputime=293700000
GOPERF-METRIC:gc-pause-one=671156
GOPERF-METRIC:gc-pause-total=151010
GOPERF-METRIC:rss=183361536
GOPERF-METRIC:sys-gc=5376000
GOPERF-METRIC:sys-heap=168591360
GOPERF-METRIC:sys-other=7205176
GOPERF-METRIC:sys-stack=753664
GOPERF-METRIC:sys-total=181926200
GOPERF-METRIC:time=36993340
GOPERF-METRIC:virtual-mem=165421056

go1.7beta2

GOPERF-METRIC:allocated=7871557
GOPERF-METRIC:allocs=105417
GOPERF-METRIC:cputime=355650000
GOPERF-METRIC:gc-pause-one=539134
GOPERF-METRIC:gc-pause-total=121305
GOPERF-METRIC:rss=174964736
GOPERF-METRIC:sys-gc=6182912
GOPERF-METRIC:sys-heap=162234368
GOPERF-METRIC:sys-other=7548456
GOPERF-METRIC:sys-stack=819200
GOPERF-METRIC:sys-total=176784936
GOPERF-METRIC:time=44925700
GOPERF-METRIC:virtual-mem=162672640

tip (a2a4db7)

GOPERF-METRIC:allocated=7870908
GOPERF-METRIC:allocs=105404
GOPERF-METRIC:cputime=280480000
GOPERF-METRIC:gc-pause-one=567936
GOPERF-METRIC:gc-pause-total=122674
GOPERF-METRIC:rss=183672832
GOPERF-METRIC:sys-gc=6170624
GOPERF-METRIC:sys-heap=169902080
GOPERF-METRIC:sys-other=7459128
GOPERF-METRIC:sys-stack=819200
GOPERF-METRIC:sys-total=184351032
GOPERF-METRIC:time=35300696
GOPERF-METRIC:virtual-mem=150028288

Looking at the time metric (is that the right one?), it seems go1.7beta2 is ~20% slower than go1.6.2, but tip gets it back.

@aclements
Member

With GOARCH=386 on an otherwise amd64 host, I'm also seeing a 35% slowdown in BinaryTree17 that looks like it happened when we merged dev.garbage: https://rawgit.com/aclements/74e66853d989adce240fadd62c554d0b/raw/491a81c5874e4f78d01133666a4ae916a166ceef/go1-1.7-386.svg (note that this isn't a complete run yet; it's only a sampling of the commits).

@RLH, could you look at the BinaryTree17 profiles before (d8d3351) and after (56b5491) the merge with GOARCH=386 and GOHOSTARCH=386?

@dsnet
Member

dsnet commented Jun 29, 2016

What gzip benchmark is being run? When I run the compress/flate benchmarks on the architectures in question, I actually see improvement (as I would expect) and not regressions. I'm using the builders to run the tests.

Old: go1.6.2
New: go1.7beta2

GOARCH=386 on "Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz"

benchmark                               old MB/s     new MB/s     speedup
BenchmarkDecodeDigitsSpeed1e4-12        46.53        53.72        1.15x
BenchmarkDecodeDigitsSpeed1e5-12        59.11        66.08        1.12x
BenchmarkDecodeDigitsSpeed1e6-12        62.76        69.40        1.11x
BenchmarkDecodeDigitsDefault1e4-12      46.77        52.87        1.13x
BenchmarkDecodeDigitsDefault1e5-12      67.79        66.43        0.98x
BenchmarkDecodeDigitsDefault1e6-12      74.19        70.38        0.95x
BenchmarkDecodeDigitsCompress1e4-12     44.59        54.09        1.21x
BenchmarkDecodeDigitsCompress1e5-12     67.84        66.26        0.98x
BenchmarkDecodeDigitsCompress1e6-12     73.92        69.36        0.94x
BenchmarkDecodeTwainSpeed1e4-12         45.16        49.88        1.10x
BenchmarkDecodeTwainSpeed1e5-12         62.03        67.09        1.08x
BenchmarkDecodeTwainSpeed1e6-12         67.35        72.44        1.08x
BenchmarkDecodeTwainDefault1e4-12       49.03        53.31        1.09x
BenchmarkDecodeTwainDefault1e5-12       77.19        79.99        1.04x
BenchmarkDecodeTwainDefault1e6-12       76.65        87.58        1.14x
BenchmarkDecodeTwainCompress1e4-12      46.22        50.78        1.10x
BenchmarkDecodeTwainCompress1e5-12      77.16        80.89        1.05x
BenchmarkDecodeTwainCompress1e6-12      80.55        88.95        1.10x
BenchmarkEncodeDigitsSpeed1e4-12        24.65        33.23        1.35x
BenchmarkEncodeDigitsSpeed1e5-12        35.38        37.92        1.07x
BenchmarkEncodeDigitsSpeed1e6-12        36.60        40.28        1.10x
BenchmarkEncodeDigitsDefault1e4-12      16.76        16.07        0.96x
BenchmarkEncodeDigitsDefault1e5-12      8.40         15.81        1.88x
BenchmarkEncodeDigitsDefault1e6-12      7.37         14.58        1.98x
BenchmarkEncodeDigitsCompress1e4-12     16.80        23.88        1.42x
BenchmarkEncodeDigitsCompress1e5-12     8.00         16.08        2.01x
BenchmarkEncodeDigitsCompress1e6-12     7.52         14.82        1.97x
BenchmarkEncodeTwainSpeed1e4-12         23.41        31.69        1.35x
BenchmarkEncodeTwainSpeed1e5-12         33.96        40.54        1.19x
BenchmarkEncodeTwainSpeed1e6-12         35.11        42.05        1.20x
BenchmarkEncodeTwainDefault1e4-12       13.88        20.56        1.48x
BenchmarkEncodeTwainDefault1e5-12       8.79         14.09        1.60x
BenchmarkEncodeTwainDefault1e6-12       8.46         13.01        1.54x
BenchmarkEncodeTwainCompress1e4-12      14.91        20.72        1.39x
BenchmarkEncodeTwainCompress1e5-12      7.36         12.76        1.73x
BenchmarkEncodeTwainCompress1e6-12      6.51         11.58        1.78x
GOARCH=arm on "Marvell PJ4Bv7 Processor rev 2"

benchmark                              old MB/s     new MB/s     speedup
BenchmarkDecodeDigitsSpeed1e4-4        4.94         5.58         1.13x
BenchmarkDecodeDigitsSpeed1e5-4        5.63         6.60         1.17x
BenchmarkDecodeDigitsSpeed1e6-4        5.71         6.72         1.18x
BenchmarkDecodeDigitsDefault1e4-4      5.24         5.88         1.12x
BenchmarkDecodeDigitsDefault1e5-4      6.49         7.04         1.08x
BenchmarkDecodeDigitsDefault1e6-4      6.69         7.16         1.07x
BenchmarkDecodeDigitsCompress1e4-4     5.20         5.88         1.13x
BenchmarkDecodeDigitsCompress1e5-4     6.45         7.03         1.09x
BenchmarkDecodeDigitsCompress1e6-4     6.71         7.16         1.07x
BenchmarkDecodeTwainSpeed1e4-4         5.54         5.86         1.06x
BenchmarkDecodeTwainSpeed1e5-4         7.03         7.59         1.08x
BenchmarkDecodeTwainSpeed1e6-4         7.20         7.75         1.08x
BenchmarkDecodeTwainDefault1e4-4       5.79         6.21         1.07x
BenchmarkDecodeTwainDefault1e5-4       8.18         8.83         1.08x
BenchmarkDecodeTwainDefault1e6-4       8.68         9.31         1.07x
BenchmarkDecodeTwainCompress1e4-4      5.74         6.37         1.11x
BenchmarkDecodeTwainCompress1e5-4      8.27         8.86         1.07x
BenchmarkDecodeTwainCompress1e6-4      8.73         9.29         1.06x
BenchmarkEncodeDigitsSpeed1e4-4        2.66         3.99         1.50x
BenchmarkEncodeDigitsSpeed1e5-4        3.19         4.87         1.53x
BenchmarkEncodeDigitsSpeed1e6-4        3.24         5.19         1.60x
BenchmarkEncodeDigitsDefault1e4-4      1.97         2.63         1.34x
BenchmarkEncodeDigitsDefault1e5-4      0.90         1.92         2.13x
BenchmarkEncodeDigitsDefault1e6-4      0.79         1.81         2.29x
BenchmarkEncodeDigitsCompress1e4-4     1.98         2.63         1.33x
BenchmarkEncodeDigitsCompress1e5-4     0.89         1.92         2.16x
BenchmarkEncodeDigitsCompress1e6-4     0.80         1.81         2.26x
BenchmarkEncodeTwainSpeed1e4-4         2.93         4.10         1.40x
BenchmarkEncodeTwainSpeed1e5-4         3.95         5.53         1.40x
BenchmarkEncodeTwainSpeed1e6-4         4.07         5.70         1.40x
BenchmarkEncodeTwainDefault1e4-4       2.01         2.52         1.25x
BenchmarkEncodeTwainDefault1e5-4       1.13         1.81         1.60x
BenchmarkEncodeTwainDefault1e6-4       1.04         1.69         1.62x
BenchmarkEncodeTwainCompress1e4-4      2.00         2.49         1.25x
BenchmarkEncodeTwainCompress1e5-4      0.90         1.66         1.84x
BenchmarkEncodeTwainCompress1e6-4      0.79         1.53         1.94x
GOARCH=arm64 on "???" <= linux-arm64-buildlet

benchmark                              old MB/s     new MB/s     speedup
BenchmarkDecodeDigitsSpeed1e4-8        21.21        24.35        1.15x
BenchmarkDecodeDigitsSpeed1e5-8        22.18        26.06        1.17x
BenchmarkDecodeDigitsSpeed1e6-8        22.25        26.01        1.17x
BenchmarkDecodeDigitsDefault1e4-8      21.99        24.39        1.11x
BenchmarkDecodeDigitsDefault1e5-8      26.28        27.16        1.03x
BenchmarkDecodeDigitsDefault1e6-8      26.90        27.50        1.02x
BenchmarkDecodeDigitsCompress1e4-8     21.97        24.48        1.11x
BenchmarkDecodeDigitsCompress1e5-8     26.20        27.15        1.04x
BenchmarkDecodeDigitsCompress1e6-8     26.90        27.51        1.02x
BenchmarkDecodeTwainSpeed1e4-8         22.37        24.38        1.09x
BenchmarkDecodeTwainSpeed1e5-8         25.77        27.79        1.08x
BenchmarkDecodeTwainSpeed1e6-8         26.22        27.94        1.07x
BenchmarkDecodeTwainDefault1e4-8       24.05        25.91        1.08x
BenchmarkDecodeTwainDefault1e5-8       30.73        32.86        1.07x
BenchmarkDecodeTwainDefault1e6-8       31.73        33.98        1.07x
BenchmarkDecodeTwainCompress1e4-8      24.04        25.79        1.07x
BenchmarkDecodeTwainCompress1e5-8      30.83        32.86        1.07x
BenchmarkDecodeTwainCompress1e6-8      31.92        33.98        1.06x
BenchmarkEncodeDigitsSpeed1e4-8        7.46         12.11        1.62x
BenchmarkEncodeDigitsSpeed1e5-8        9.67         14.73        1.52x
BenchmarkEncodeDigitsSpeed1e6-8        9.90         15.62        1.58x
BenchmarkEncodeDigitsDefault1e4-8      5.67         7.11         1.25x
BenchmarkEncodeDigitsDefault1e5-8      2.46         4.96         2.02x
BenchmarkEncodeDigitsDefault1e6-8      2.13         4.66         2.19x
BenchmarkEncodeDigitsCompress1e4-8     5.66         7.14         1.26x
BenchmarkEncodeDigitsCompress1e5-8     2.46         4.98         2.02x
BenchmarkEncodeDigitsCompress1e6-8     2.15         4.66         2.17x
BenchmarkEncodeTwainSpeed1e4-8         7.35         11.99        1.63x
BenchmarkEncodeTwainSpeed1e5-8         10.28        16.26        1.58x
BenchmarkEncodeTwainSpeed1e6-8         10.63        16.69        1.57x
BenchmarkEncodeTwainDefault1e4-8       5.30         6.60         1.25x
BenchmarkEncodeTwainDefault1e5-8       3.03         4.83         1.59x
BenchmarkEncodeTwainDefault1e6-8       2.78         4.54         1.63x
BenchmarkEncodeTwainCompress1e4-8      5.25         6.50         1.24x
BenchmarkEncodeTwainCompress1e5-8      2.43         4.46         1.84x
BenchmarkEncodeTwainCompress1e6-8      2.18         4.16         1.91x

@quentinmit
Contributor

@dsnet We are running the Gzip benchmark as found in test/bench/go1/gzip_test.go

@dsnet
Member

dsnet commented Jun 29, 2016

tl;dr: leave gzip as-is for Go 1.7; we can address arm performance issues in Go 1.8.

Alright, I think I've figured out what's causing the slowdown for ARM. While the Go 1.7 optimizations as a whole have helped arm and arm64 (as seen in my post above), there are some cases where performance regresses (as is the case for test/bench/go1/gzip_test.go).

One of the major changes made to compress/flate was that we started using 4-byte hashing instead of 3-byte hashing. The new hash function is stronger and leads to fewer collisions and thus fewer searches, which is good for performance. On amd64, we take advantage of the fact that most CPUs can perform 4-byte unaligned loads pretty efficiently (#14267). However, this optimization is not possible on arm, so the arm targets have to pay extra cost to use the stronger hash function.
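The two load strategies can be sketched in Go as follows. This is a rough illustration, not the actual compress/flate source; the multiplier and 15-bit shift mirror what is visible in the disassembly further down (IMULL $0x1e35a7bd; SHRL $0xf).

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const hashmul = 0x1e35a7bd // multiplier visible in the disassembly

// hash4Portable assembles the 32-bit word one byte at a time; each
// index can carry its own bounds check, which is what the long arm32
// listing below is doing.
func hash4Portable(b []byte) uint32 {
	x := uint32(b[3]) | uint32(b[2])<<8 | uint32(b[1])<<16 | uint32(b[0])<<24
	return (x * hashmul) >> 15
}

// hash4Wide reads the same four bytes in a single (potentially
// unaligned) load, the strategy that pays off on amd64.
func hash4Wide(b []byte) uint32 {
	return (binary.BigEndian.Uint32(b) * hashmul) >> 15
}

func main() {
	b := []byte("gzip")
	fmt.Println(hash4Portable(b) == hash4Wide(b)) // true
}
```

Both variants compute the same hash; the difference is purely how many loads and bounds checks the generated code needs, which is where arm32 loses out.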

On Go1.6, the simpler hash function compiled on arm32 to approximately 17 instrs:

102638: MOVW 0x44(R5), R1
10263c: MOVW $2, R2
102640: ADD R2, R1, R2
102644: ADD $72, R5, R1
102648: MOVW 0x4(R1), R3
10264c: CMP R3, R2
102650: B.CC 0x10265c
102654: BL runtime.panicindex(SB)
102658: UNDEF
10265c: MOVW (R1), R1
102660: MOVB (R1)(R2), R0
102664: MOVW 0x74(R5), R1
102668: LSL $6, R1, R1
10266c: ADD R1, R0, R0
102670: MOVW 0x46c(R15), R1
102674: AND R1, R0, R0
102678: MOVW R0, 0x74(R5)

On Go1.7, the stronger hash function compiled on arm32 to approximately 77 instrs:

e7d74: MOVW 0x5e8(R15), R11
e7d78: MOVW (R5)(R11), R2
e7d7c: MOVW 0x5e0(R15), R11
e7d80: MOVW (R5)(R11), R0
e7d84: MOVW $4, R1
e7d88: ADD R1, R0, R3
e7d8c: MOVW R2, R1
e7d90: MOVW R3, R2
e7d94: MOVW 0x5d4(R15), R11
e7d98: MOVW (R5)(R11), R3
e7d9c: CMP R3, R2
e7da0: B.HI 0xe8a90
e7da4: CMP R2, R1
e7da8: B.HI 0xe8a90
e7dac: SUB R1, R2, R2
e7db0: SUB R1, R3, R3
e7db4: MOVW 0x5b8(R15), R11
e7db8: MOVW (R5)(R11), R4
e7dbc: CMP $0, R3
e7dc0: B.EQ 0xe7dc8
e7dc4: ADD R1, R4, R4
e7dc8: MOVW R2, R8
e7dcc: MOVW R3, R7
e7dd0: MOVW R4, 0x50(R13)
e7dd4: MOVW R2, 0x54(R13)
e7dd8: MOVW R3, 0x58(R13)
e7ddc: MOVW $0, R0
e7de0: ADD $80, R13, R0
e7de4: MOVW 0x4(R0), R1
e7de8: MOVW $3, R2
e7dec: CMP R2, R1
e7df0: B.HI 0xe7dfc
e7df4: BL runtime.panicindex(SB)
e7df8: UNDEF
e7dfc: MOVW (R0), R0
e7e00: MOVB 0x3(R0), R0
e7e04: ADD $80, R13, R1
e7e08: MOVW 0x4(R1), R2
e7e0c: MOVW $2, R3
e7e10: CMP R3, R2
e7e14: B.HI 0xe7e20
e7e18: BL runtime.panicindex(SB)
e7e1c: UNDEF
e7e20: MOVW (R1), R1
e7e24: MOVB 0x2(R1), R1
e7e28: LSL $8, R1, R1
e7e2c: ORR R1, R0, R0
e7e30: ADD $80, R13, R1
e7e34: MOVW 0x4(R1), R2
e7e38: MOVW $1, R3
e7e3c: CMP R3, R2
e7e40: B.HI 0xe7e4c
e7e44: BL runtime.panicindex(SB)
e7e48: UNDEF
e7e4c: MOVW (R1), R1
e7e50: MOVB 0x1(R1), R1
e7e54: LSL $16, R1, R1
e7e58: ORR R1, R0, R0
e7e5c: ADD $80, R13, R1
e7e60: MOVW 0x4(R1), R2
e7e64: CMP $0, R2
e7e68: B.HI 0xe7e74
e7e6c: BL runtime.panicindex(SB)
e7e70: UNDEF
e7e74: MOVW (R1), R1
e7e78: MOVB (R1), R1
e7e7c: LSL $24, R1, R1
e7e80: ORR R1, R0, R0
e7e84: MOVW 0x4ec(R15), R1
e7e88: MUL R0, R1, R0
e7e8c: LSR $15, R0, R2
e7e90: CMP $0, R5
e7e94: MOVW.EQ R5, (R5)
e7e98: MOVW 0x4dc(R15), R11
e7e9c: ?
e8a90: BL runtime.panicslice(SB)
e8a94: UNDEF

On Go1.7, the stronger hash function compiles on amd64 to approximately 31 instrs:

4bb0d9: MOVQ 0xa0080(AX), CX
4bb0e0: MOVQ 0xa0070(AX), BX
4bb0e7: LEAQ 0x4(DX), SI
4bb0eb: CMPQ SI, DX
4bb0ee: JA 0x4bbc9f
4bb0f4: CMPQ CX, SI
4bb0f7: JA 0x4bbc9f
4bb0fd: SUBQ DX, CX
4bb100: TESTQ CX, CX
4bb103: JE 0x4bbc98
4bb109: MOVZX 0x3(BX)(DX*1), CX
4bb10e: MOVZX 0x2(BX)(DX*1), SI
4bb113: MOVZX 0x1(BX)(DX*1), DI
4bb118: MOVZX 0(BX)(DX*1), DX
4bb11c: SHLL $0x8, SI
4bb11f: ORL SI, CX
4bb121: SHLL $0x10, DI
4bb124: ORL DI, CX
4bb126: SHLL $0x18, DX
4bb129: ORL DX, CX
4bb12b: IMULL $0x1e35a7bd, CX, CX
4bb131: SHRL $0xf, CX
4bb134: MOVL CX, 0xa00c8(AX)
4bb261: XORL SI, SI
4bb5a8: XORL DX, DX
4bb644: XORL BX, BX
4bbab2: XORL CX, CX
4bbc98: XORL DX, DX
4bbc9a: JMP 0x4bb109
4bbc9f: CALL runtime.panicslice(SB)
4bbca4: UD2

Any changes to fix this would probably be too big to do this late in the release cycle. We can think about better approaches in Go 1.8. I don't know enough about ARM to know whether it will ever have efficient unaligned reads/writes, so it may be worth investigating using the previous hash function instead of the new one on non-x86 architectures.

@josharian
Contributor

If you use the dev.ssa branch and hack in the config to make arm use SSA, how do things look? That is, will SSA for arm save us for free?

@cherrymui
Member

@dsnet, which function did you disassemble? Is it hash4 in compress/flate?

@josharian, SSA can get the performance back on ARM:

Gzip-4                      3.68s ± 0%      2.59s ± 1%  -29.76%

Gzip-4                   5.27MB/s ± 0%   7.50MB/s ± 1%  +42.36%

This is tip of dev.ssa plus my pending CLs, comparing SSA off vs. on. Not sure about the hash function specifically (as I don't know which function it is...)

@dsnet
Member

dsnet commented Jun 30, 2016

Correct, it is hash4. It is awesome to see the performance boosted by SSA :)

@cherrymui
Member

For hash4, somehow I see a different disassembly than what you saw:
On ARM32, with SSA off:

    0x0000 00000    TEXT    "".hash4(SB), $0-16
    0x0000 00000    MOVW    8(g), R1
    0x0004 00004    CMP R1, R13
    0x0008 00008    BLS 204
    0x000c 00012    MOVW.W  R14, -4(R13)
    0x0010 00016    FUNCDATA    $0, gclocals·8355ad952265fec823c17fcf739bd009(SB)
    0x0010 00016    FUNCDATA    $1, gclocals·69c1753bd5f81501d95132d08af04464(SB)
    0x0010 00016    MOVW    $0, R0
    0x0014 00020    MOVW    $"".b(FP), R0
    0x0018 00024    MOVW    4(R0), R1
    0x001c 00028    MOVW    $3, R2
    0x0020 00032    CMP R2, R1
    0x0024 00036    BHI $1, 48
    0x0028 00040    PCDATA  $0, $1
    0x0028 00040    CALL    runtime.panicindex(SB)
    0x002c 00044    UNDEF
    0x0030 00048    MOVW    (R0), R0
    0x0034 00052    MOVBU   3(R0), R0
    0x0038 00056    MOVW    $"".b(FP), R1
    0x003c 00060    MOVW    4(R1), R2
    0x0040 00064    MOVW    $2, R3
    0x0044 00068    CMP R3, R2
    0x0048 00072    BHI $1, 84
    0x004c 00076    PCDATA  $0, $1
    0x004c 00076    CALL    runtime.panicindex(SB)
    0x0050 00080    UNDEF
    0x0054 00084    MOVW    (R1), R1
    0x0058 00088    MOVBU   2(R1), R1
    0x005c 00092    MOVW    R1<<8, R1
    0x0060 00096    ORR R1, R0, R0
    0x0064 00100    MOVW    $"".b(FP), R1
    0x0068 00104    MOVW    4(R1), R2
    0x006c 00108    MOVW    $1, R3
    0x0070 00112    CMP R3, R2
    0x0074 00116    BHI $1, 128
    0x0078 00120    PCDATA  $0, $1
    0x0078 00120    CALL    runtime.panicindex(SB)
    0x007c 00124    UNDEF
    0x0080 00128    MOVW    (R1), R1
    0x0084 00132    MOVBU   1(R1), R1
    0x0088 00136    MOVW    R1<<16, R1
    0x008c 00140    ORR R1, R0
    0x0090 00144    MOVW    $"".b(FP), R1
    0x0094 00148    MOVW    4(R1), R2
    0x0098 00152    CMP $0, R2
    0x009c 00156    BHI $1, 168
    0x00a0 00160    PCDATA  $0, $1
    0x00a0 00160    CALL    runtime.panicindex(SB)
    0x00a4 00164    UNDEF
    0x00a8 00168    MOVW    (R1), R1
    0x00ac 00172    MOVBU   (R1), R1
    0x00b0 00176    MOVW    R1<<24, R1
    0x00b4 00180    ORR R1, R0
    0x00b8 00184    MOVW    $506832829, R1
    0x00bc 00188    MULU    R1, R0
    0x00c0 00192    MOVW    R0>>15, R0
    0x00c4 00196    MOVW    R0, "".~r1+12(FP)
    0x00c8 00200    MOVW.P  4(R13), R15
    0x00cc 00204    NOP
    0x00cc 00204    MOVW    R14, R3
    0x00d0 00208    CALL    runtime.morestack_noctxt(SB)
    0x00d4 00212    JMP 0
    0x00d8 00216    JMP 0(PC)
    0x00dc 00220    WORD    $506832829

With SSA on:

    0x0000 00000    TEXT    "".hash4(SB), $4-16
    0x0000 00000    MOVW    8(g), R1
    0x0004 00004    CMP R1, R13
    0x0008 00008    BLS 92
    0x000c 00012    MOVW.W  R14, -8(R13)
    0x0010 00016    FUNCDATA    $0, gclocals·8355ad952265fec823c17fcf739bd009(SB)
    0x0010 00016    FUNCDATA    $1, gclocals·69c1753bd5f81501d95132d08af04464(SB)
    0x0010 00016    MOVW    "".b+4(FP), R0
    0x0014 00020    CMP $3, R0
    0x0018 00024    BLS 84
    0x001c 00028    MOVW    "".b(FP), R0
    0x0020 00032    MOVBU   3(R0), R1
    0x0024 00036    MOVBU   2(R0), R2
    0x0028 00040    MOVBU   1(R0), R3
    0x002c 00044    MOVBU   (R0), R0
    0x0030 00048    ORR R2<<8, R1, R1
    0x0034 00052    MOVW    R3, R2
    0x0038 00056    ORR R2<<16, R1, R1
    0x003c 00060    ORR R0<<24, R1, R0
    0x0040 00064    MOVW    $506832829, R1
    0x0044 00068    MUL R0, R1, R0
    0x0048 00072    SRL $15, R0, R0
    0x004c 00076    MOVW    R0, "".~r1+12(FP)
    0x0050 00080    MOVW.P  8(R13), R15
    0x0054 00084    PCDATA  $0, $1
    0x0054 00084    CALL    runtime.panicindex(SB)
    0x0058 00088    UNDEF
    0x005c 00092    NOP
    0x005c 00092    MOVW    R14, R3
    0x0060 00096    CALL    runtime.morestack_noctxt(SB)
    0x0064 00100    JMP 0
    0x0068 00104    JMP 0(PC)
    0x006c 00108    WORD    $506832829

So it is about 50% shorter. Currently it does not do a combined unaligned load; the win appears to come mostly from CSE (for example, the slice header is loaded from the stack once rather than before every byte load).

It might be possible to do a combined load. Newer ARM cores support unaligned loads, but currently the compiler (both backends) generates the same instructions regardless; only the assembler rewrites some instructions based on GOARM.
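One source-level way to get a combined load, where the architecture allows it, is to go through encoding/binary instead of indexing byte by byte; a compiler can recognize that pattern as a single (possibly unaligned) 32-bit load plus a byte swap. A minimal sketch of the two forms, just to illustrate the difference being discussed:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// loadBytes assembles a 32-bit value byte by byte: four loads, three
// shifts, and three ORs on a backend that cannot combine them.
func loadBytes(b []byte) uint32 {
	return uint32(b[3]) | uint32(b[2])<<8 | uint32(b[1])<<16 | uint32(b[0])<<24
}

// loadCombined expresses the same big-endian load through encoding/binary;
// on architectures with unaligned access this can compile down to a single
// 32-bit load (plus a swap where the byte order differs).
func loadCombined(b []byte) uint32 {
	return binary.BigEndian.Uint32(b)
}

func main() {
	b := []byte{0x01, 0x02, 0x03, 0x04}
	fmt.Printf("%#x %#x\n", loadBytes(b), loadCombined(b))
}
```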

@gopherbot

CL https://golang.org/cl/24640 mentions this issue.

gopherbot pushed a commit that referenced this issue Jun 30, 2016
Ctz is a hot-spot in the Go 1.7 memory manager. In SSA it's
implemented as an intrinsic that compiles to a few instructions, but
on the old backend (all architectures other than amd64), it's
implemented as a fairly complex Go function. As a result, switching to
bitmap-based allocation was a significant hit to allocation-heavy
workloads like BinaryTree17 on non-SSA platforms.

For unknown reasons, this hit 386 particularly hard. We can regain a
lot of the lost performance by implementing Ctz in assembly on the
386. This isn't as good as an intrinsic, since it still generates a
function call and prevents useful inlining, but it's much better than
the pure Go implementation:

name                      old time/op    new time/op    delta
BinaryTree17-12              3.59s ± 1%     3.06s ± 1%  -14.74%  (p=0.000 n=19+20)
Fannkuch11-12                3.72s ± 1%     3.64s ± 1%   -2.09%  (p=0.000 n=17+19)
FmtFprintfEmpty-12          52.3ns ± 3%    52.3ns ± 3%     ~     (p=0.829 n=20+19)
FmtFprintfString-12          156ns ± 1%     148ns ± 3%   -5.20%  (p=0.000 n=18+19)
FmtFprintfInt-12             137ns ± 1%     136ns ± 1%   -0.56%  (p=0.000 n=19+13)
FmtFprintfIntInt-12          227ns ± 2%     225ns ± 2%   -0.93%  (p=0.000 n=19+17)
FmtFprintfPrefixedInt-12     210ns ± 1%     208ns ± 1%   -0.91%  (p=0.000 n=19+17)
FmtFprintfFloat-12           375ns ± 1%     371ns ± 1%   -1.06%  (p=0.000 n=19+18)
FmtManyArgs-12               995ns ± 2%     978ns ± 1%   -1.63%  (p=0.000 n=17+17)
GobDecode-12                9.33ms ± 1%    9.19ms ± 0%   -1.59%  (p=0.000 n=20+17)
GobEncode-12                7.73ms ± 1%    7.73ms ± 1%     ~     (p=0.771 n=19+20)
Gzip-12                      375ms ± 1%     374ms ± 1%     ~     (p=0.141 n=20+18)
Gunzip-12                   61.8ms ± 1%    61.8ms ± 1%     ~     (p=0.602 n=20+20)
HTTPClientServer-12         87.7µs ± 2%    86.9µs ± 3%   -0.87%  (p=0.024 n=19+20)
JSONEncode-12               20.2ms ± 1%    20.4ms ± 0%   +0.53%  (p=0.000 n=18+19)
JSONDecode-12               65.3ms ± 0%    65.4ms ± 1%     ~     (p=0.385 n=16+19)
Mandelbrot200-12            4.11ms ± 1%    4.12ms ± 0%   +0.29%  (p=0.020 n=19+19)
GoParse-12                  3.75ms ± 1%    3.61ms ± 2%   -3.90%  (p=0.000 n=20+20)
RegexpMatchEasy0_32-12       104ns ± 0%     103ns ± 0%   -0.96%  (p=0.000 n=13+16)
RegexpMatchEasy0_1K-12       805ns ± 1%     803ns ± 1%     ~     (p=0.189 n=18+18)
RegexpMatchEasy1_32-12       111ns ± 0%     111ns ± 3%     ~     (p=1.000 n=14+19)
RegexpMatchEasy1_1K-12      1.00µs ± 1%    1.00µs ± 1%   +0.50%  (p=0.003 n=19+19)
RegexpMatchMedium_32-12      133ns ± 2%     133ns ± 2%     ~     (p=0.218 n=20+20)
RegexpMatchMedium_1K-12     41.2µs ± 1%    42.2µs ± 1%   +2.52%  (p=0.000 n=18+16)
RegexpMatchHard_32-12       2.35µs ± 1%    2.38µs ± 1%   +1.53%  (p=0.000 n=18+18)
RegexpMatchHard_1K-12       70.9µs ± 2%    72.0µs ± 1%   +1.42%  (p=0.000 n=19+17)
Revcomp-12                   1.06s ± 0%     1.05s ± 0%   -1.36%  (p=0.000 n=20+18)
Template-12                 86.2ms ± 1%    84.6ms ± 0%   -1.89%  (p=0.000 n=20+18)
TimeParse-12                 425ns ± 2%     428ns ± 1%   +0.77%  (p=0.000 n=18+19)
TimeFormat-12                517ns ± 1%     519ns ± 1%   +0.43%  (p=0.001 n=20+19)
[Geo mean]                  74.3µs         73.5µs        -1.05%

Prior to this commit, BinaryTree17-12 on 386 was 33% slower than at
the go1.6 tag. With this commit, it's 13% slower.

On arm and arm64, BinaryTree17-12 is only ~5% slower than it was at
go1.6. It may be worth implementing Ctz for them as well.

I consider this change low risk, since the functions it replaces are
simple, very well specified, and well tested.

For #16117.

Change-Id: Ic39d851d5aca91330134596effd2dab9689ba066
Reviewed-on: https://go-review.googlesource.com/24640
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
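For context on the commit above: before SSA intrinsics, a pure-Go Ctz typically uses a de Bruijn multiply plus a small lookup table — a multiply, a shift, and a table load (and on the old backends, a full function call around them), versus the single BSF/TZCNT-style instruction an intrinsic can emit. A self-contained sketch of that technique, using the well-known de Bruijn constant (as in math/bits); this is an illustration, not necessarily the runtime's exact code:

```go
package main

import "fmt"

// deBruijn64 is a 64-bit de Bruijn sequence: multiplying an isolated low
// bit by it and taking the top 6 bits yields a distinct index for every
// bit position. The inverse table is built at init from the constant
// itself, so the lookup is correct by construction.
const deBruijn64 uint64 = 0x03f79d71b4ca8b09

var deBruijnIdx [64]byte

func init() {
	for i := uint(0); i < 64; i++ {
		deBruijnIdx[(deBruijn64<<i)>>58] = byte(i)
	}
}

// ctz64 counts trailing zero bits: isolate the lowest set bit with x&-x,
// then recover its position via the de Bruijn multiply and table lookup.
func ctz64(x uint64) int {
	if x == 0 {
		return 64
	}
	return int(deBruijnIdx[((x&-x)*deBruijn64)>>58])
}

func main() {
	fmt.Println(ctz64(1), ctz64(96), ctz64(1<<63))
}
```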
@gopherbot

CL https://golang.org/cl/24642 mentions this issue.

@gopherbot

CL https://golang.org/cl/24643 mentions this issue.

@ianlancetaylor
Contributor

This issue has headed in several different directions.

I don't think we currently plan to do any further work to speed up the reflect package, which is the core of the original report. It would be nice to know what the performance losses are on tip.

I'm going to close this issue. If there is further work to be done for 1.7 let's move it to more focused issues.

@gopherbot

CL https://golang.org/cl/24521 mentions this issue.

@golang golang locked and limited conversation to collaborators Sep 30, 2017