runtime: performance degradation for versions > 1.4.x #14189

Closed
toddboom opened this issue Feb 2, 2016 · 19 comments

@toddboom

toddboom commented Feb 2, 2016

This issue is a follow-up to a thread that started on the mailing list here:
https://groups.google.com/forum/#!topic/golang-dev/sfwfQTu7eM4

We recently reverted back to Go 1.4.3 for the most recent release of InfluxDB after testing against 1.5.x and collecting feedback from both our own benchmarks and the experiences of InfluxDB users. There were significant performance hits with both 1.5.x and 1.6beta1, across a variety of dimensions. For example, with write performance, we saw a 33% reduction in overall throughput, coupled with a 26% increase in average response time. We expected that the GC improvements would lead to a reduction in response times, but that hasn't proven to be the case. It's far from exhaustive, but here are the findings from a medium-scale write test:

version     points/sec   response time
---------------------------------------
Go 1.4.3       398,747         33.97ms
Go 1.5.3       300,750         43.12ms
Go 1.6rc1      315,545         39.44ms

As requested in the mailing list thread, we ran the test with GODEBUG=gctrace=1 and recorded the output alongside our log messages for context. It doesn't look like the GC-specific lines should be too hard to grep out if our other log lines are too noisy.

We also ran with cpuprofile and memprofile on, and collected the output. We've included results for these tests on all three versions of Go listed above, and provided the compiled binaries as well. You can find them in this Google Drive folder:
https://drive.google.com/open?id=0B3YrTka8Ai_McUNCak1YSW9ZM28

I hope this is enough information to assist in debugging the underlying issue. Definitely let me know if there are any other tests I can run to help. We're looking forward to getting to better performance than Go 1.4.3!

@bradfitz changed the title from "Performance degradation for versions > 1.4.x" to "runtime: performance degradation for versions > 1.4.x" on Feb 2, 2016
@RLH
Contributor

RLH commented Feb 2, 2016

gctrace reports that you have 8 HW threads (GOMAXPROCS=8) and that the GC is
constraining the heap to half a gigabyte. Allowing the GC to grow the heap to
fill the available RAM may solve this problem. Perhaps setting GOGC=1600 and
rerunning these numbers will help.
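
For illustration, a minimal sketch of the same tuning applied in-process (setting GOGC in the environment is the usual route; runtime/debug.SetGCPercent is the equivalent programmatic knob):

package main

import "runtime/debug"

func main() {
	// Equivalent to starting the process with GOGC=1600: the next GC is
	// triggered when the heap grows to 16x the live data left by the
	// previous collection, trading memory footprint for fewer GC cycles.
	prev := debug.SetGCPercent(1600)
	_ = prev // previous setting (100 unless GOGC was already set)

	// ... run the write workload here ...
}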

@toddboom
Author

toddboom commented Feb 2, 2016

@RLH I'll try that now with 1.6rc1 and report back. Thanks!

@toddboom
Author

toddboom commented Feb 2, 2016

@RLH So, some preliminary results. Interesting.

Using GOGC=1600, I got the following:

version     points/sec   response time
---------------------------------------
Go 1.4.3       496,271         24.67ms
Go 1.6rc1      452,760         17.08ms

So, the response time for both came down, with a greater improvement on 1.6rc1. It's now in line with what I'd expect for the new GC, but we're still seeing better performance with 1.4.3 on overall throughput. Does the GOGC parameter behave the same way on 1.4.x as it does on later versions?

@RLH
Contributor

RLH commented Feb 2, 2016

Yes, GOGC behaves the same.

@methane
Contributor

methane commented Feb 2, 2016

Does GODEBUG=gcstoptheworld=1 help?

@RLH
Contributor

RLH commented Feb 2, 2016

The unrelated discussion concerning tsm1.Cache needs to be moved to its own
issue.

On Tue, Feb 2, 2016 at 1:20 PM, INADA Naoki notifications@github.com
wrote:

Sorry for going slightly off-topic.

It seems tsm1.Cache is GC unfriendly; there are many pointers in the cache.
Each value type has a time time.Time field, and time.Time contains a
*Location pointer, so changing time time.Time to unixnano int64 may reduce GC cost.

I wrote a quick benchmark:

func BenchmarkCacheFloatEntries(b *testing.B) {
	for i := 0; i < b.N; i++ {
		cache := NewCache(10000)
		for j := 0; j < 10000; j++ {
			v := NewValue(time.Unix(1, 0), float64(j))
			cache.Write("test", []Value{v})
		}
	}
}

before:
$ go test -run=xxx -bench=BenchmarkCache
PASS
BenchmarkCacheFloatEntries-4 300 4019495 ns/op

after:
$ go test -run=xxx -bench=BenchmarkCache
PASS
BenchmarkCacheFloatEntries-4 500 3720449 ns/op

patch: https://gist.github.com/methane/59c8777907b10dc983a1
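
For illustration, a minimal sketch of the layout change described above, using hypothetical type names rather than the actual tsm1 types (the real change is in the gist linked above):

package main

import "time"

// GC-unfriendly layout: time.Time embeds a *time.Location, so every cached
// value carries a pointer the collector has to scan.
type floatValueOld struct {
	t time.Time
	v float64
}

// GC-friendly layout: keeping only the Unix nanosecond timestamp makes the
// struct pointer-free, so large slices of these values need no scanning.
type floatValueNew struct {
	unixnano int64
	v        float64
}

func main() {
	old := floatValueOld{t: time.Unix(1, 0), v: 42}
	nv := floatValueNew{unixnano: old.t.UnixNano(), v: old.v}
	_ = nv
}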


Reply to this email directly or view it on GitHub
#14189 (comment).

@aclements
Member

It looks like our GC rate in 1.6 is 1.5x what it was in 1.4. Based on your logs, go-1.4.3.txt does 762 GCs in 8.73 minutes = 87 GCs/minute and go-1.6.txt does 856 GCs in 6.35 minutes = 135 GCs/minute. I haven't analyzed the heap sizes yet, but that suggests we're keeping the heap somewhat smaller than 1.4 did, which simply forces the GC to do more work. It would also be good to compare the scan rates (bytes/sec). Comparing the heap size CDFs and the scan rates should give us a better sense of whether this is a raw GC throughput problem or a heap trigger problem.
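
For reference, the arithmetic behind those rates as a small Go snippet (numbers taken from the sentence above):

package main

import "fmt"

func main() {
	// GC frequency from the two gctrace logs cited above.
	rate14 := 762.0 / 8.73 // go-1.4.3.txt: about 87 GCs/minute
	rate16 := 856.0 / 6.35 // go-1.6.txt:   about 135 GCs/minute
	fmt.Printf("1.4.3: %.0f/min, 1.6: %.0f/min, ratio %.2fx\n",
		rate14, rate16, rate16/rate14)
}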

@iand
Contributor

iand commented Feb 18, 2016

We have a similar issue that is preventing us from moving to Go 1.5 on our production services. I have some GODEBUG gctrace outputs comparing Go 1.4.3 and 1.5.2 under various values of GOGC. Would these be of interest for this issue?

@RLH
Contributor

RLH commented Feb 18, 2016

As of today our focus is 1.6 and we won't be changing 1.5. Do you have 1.6
numbers we can compare to your 1.4 and 1.5 numbers? Such numbers would be
valuable.

@iand
Contributor

iand commented Feb 18, 2016

I will work on getting the equivalent 1.6 numbers.

@iand
Contributor

iand commented Feb 18, 2016

Here are details of a similar degradation we are experiencing that is keeping us on Go 1.4.3. For context, the service I took these measurements from is one instance in a cluster of real-time advertising bidders. It receives bid requests via HTTP and performs two key operations that I timed. The first, "augment", makes a single HTTP call to a Couchbase cluster (using https://github.com/couchbase/go-couchbase) which returns about 5kB of protobuf-encoded data. The second, "evaluate", is a pure Go section of the code that makes no network calls and is heavily optimised to avoid garbage generation. Responses from our service are expected to complete within 50ms.

I ran each test for 10 minutes with gctrace=1 and various values of GOGC. Our instrumentation takes samples every 10 seconds. The service is running on an Amazon c4.xlarge instance with 4 CPUs and 7.5GB RAM. At the time of the Go 1.5 and 1.4.3 tests the service was receiving about 1100 requests per second. When I ran the 1.6 tests the rate was about 1400 per second. We rate this service on this instance type at about 2000 per second max (under Go 1.4.3).

 time  go version  GOGC
---------------------------
 9:42       1.5.2   100
 9:52       1.5.2   200
10:02       1.5.2   400
10:12       1.5.2   800
10:22       1.4.3  default
13:56       1.6     100
14:06       1.6     200
14:16       1.6     400
14:26       1.6     800

Charts of timings for the 99.9th percentile (orange) and maximum values (blue) of the augment operation. (Go 1.4.3 is the last section of the top graph, starting at 10:22.)

[chart: augment timings, Go 1.5]
[chart: augment timings, Go 1.6]

time     99.9% augment  max augment  
------------------------------------
 9:42       17.8ms         216.3ms
 9:52       12.0ms         204.9ms
10:02        7.9ms         225.2ms
10:12        6.3ms         134.0ms
10:22        5.6ms         236.0ms
13:56       35.8ms         237.2ms
14:06        8.0ms         226.9ms
14:16        5.7ms         240.6ms
14:26        6.5ms         472.6ms

And same for evaluate operation:

[chart: evaluate timings, Go 1.5]
[chart: evaluate timings, Go 1.6]

time     99.9% evaluate  max evaluate  
------------------------------------
 9:42       1.5ms         27.5ms
 9:52       1.2ms          8.8ms
10:02       1.1ms         15.7ms
10:12       1.1ms         13.5ms
10:22       1.2ms        141.3ms
13:56       1.2ms        256.0ms
14:06       1.1ms        221.1ms
14:16       1.2ms        234.0ms
14:26       1.2ms        233.3ms

Overall I would characterise the results as: Go 1.5 and 1.6 produce slower timings than 1.4.3, but they are more consistent. The 99.9th percentiles were larger, but there were fewer instances where the maximum time exceeded our deadlines.

Attached are the gctrace outputs from each of the tests.

go1.4-gogcdefault.txt
go1.5-gogc100.txt
go1.5-gogc200.txt
go1.5-gogc400.txt
go1.5-gogc800.txt
go1.6-gogc100.txt
go1.6-gogc200.txt
go1.6-gogc400.txt
go1.6-gogc800.txt

@RLH
Contributor

RLH commented Feb 18, 2016

Thanks Ian,

Looking at the raw 1.6 gctrace numbers in go1.6-gogc100, go1.6-gogc200, and
go1.6-gogc400, the GC took up to 3% of the CPU depending on the heap size,
and the largest stop-the-world pause was less than 3 ms. In fact, of the
300 or so GC cycles over the three 10-minute runs, only one showed a
stop-the-world pause greater than 2 ms, and only a handful were greater than 1 ms.

The 10-minute go1.6-gogc800 run had an interesting single anomaly. Only 21 GC
cycles happened over the entire run, and all of them except one completed in
less than 300 ms including the concurrent phases. While the total CPU used by
the GC was much less than 1%, there was a single instance (gc 13, at 208
seconds into the run) of a 241 ms stop-the-world pause. I can't account for
this single very large pause time.

I'm concluding that the latency problems are not related to the GC.

The only vaguely similar numbers that I have seen were when the OS /
service provider noticed that the application wasn't fully utilizing the
allocated CPU resources and took them away after an asynchronous
preemption.

It looks like GOMAXPROCS is set to 4. Was anything else running on the
machine that might have reduced the number of HW threads available in some
non-deterministic fashion? Of particular interest would be an interruption
that might last 200 ms.

  • Rick

@iand
Contributor

iand commented Feb 18, 2016

There are other support processes on that instance (e.g. a consul agent), but they have very low CPU utilization and memory usage. GOMAXPROCS was 4 for that service (in 1.5 builds we don't set it at all, so it's left at the Go default).

I saw the interrupt/spike in the GC-times instrumentation we also capture. I don't know for sure what it is, but it could have been a refresh of an in-memory data cache that we use as part of our bid prediction logic. I can re-run the test with those settings if you think it is worth it.

With the analysis you gave, I agree that it's probably not the time spent in GC that is causing the latency. However, there is lower latency when there are fewer GC cycles. Do you think it could be thread contention? Would repeating with different GOMAXPROCS values give any insight?

@RLH
Contributor

RLH commented Feb 18, 2016

If a CPU is being taken away, setting GOMAXPROCS to 3 will leave the process
a free CPU for the other non-Go work. It is worth a try, but it doesn't
explain why 1.4 doesn't have this problem.
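
A minimal sketch of that experiment, assuming the cap is applied in-process rather than via the GOMAXPROCS environment variable:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Cap the scheduler at 3 OS threads running Go code, leaving one of
	// the instance's 4 vCPUs for the co-located processes; same effect
	// as starting the service with GOMAXPROCS=3.
	prev := runtime.GOMAXPROCS(3)
	fmt.Println("previous GOMAXPROCS:", prev)
}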

@aclements
Member

@iand, CPU profiles may be enlightening. It's not exactly the right tool, but if this is being caused by something in the runtime or the standard libraries, it's possible it would show up as out of place in a profile.
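
For reference, a minimal sketch of collecting a CPU profile with runtime/pprof (a long-running service would more likely expose net/http/pprof instead); serveBids is a placeholder for the real request loop:

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Sample CPU usage for the duration of the run; compare the 1.4.3,
	// 1.5.2 and 1.6 profiles afterwards with `go tool pprof`.
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	serveBids()
}

func serveBids() {}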

@iand
Contributor

iand commented Feb 19, 2016

I retested the Go 1.5.2 and 1.6 versions of our service with varying GOMAXPROCS values (3, 6 and 8). GOGC was left at the default. I didn't see any appreciable change in latency. If I squinted hard I might say that the network operation "augment" saw a slight improvement at higher GOMAXPROCS and the CPU operation "evaluate" saw a slight worsening. I can post images of our metrics, but it hardly seems worth the noise.

I took multiple cpu profiles of the service with default settings for GOGC and GOMAXPROCS under Go 1.4.3, 1.5.2 and 1.6. I can make those available with binaries privately if you think they will help. Email me at me@iandavis.com and I will share a link to them. I have had a superficial look at them and nothing leaps out as obviously different.

@gopherbot

CL https://golang.org/cl/21503 mentions this issue.

gopherbot pushed a commit that referenced this issue Apr 5, 2016
Two GC-related functions, scang and casgstatus, wait in an active spin loop.
Active spinning is never a good idea in user-space. Once we wait several
times more than the expected wait time, something unexpected is happening
(e.g. the thread we are waiting for is descheduled or handling a page fault)
and we need to yield to the OS scheduler. Moreover, the expected wait time is
very high for these functions: scang wait time can be tens of milliseconds,
casgstatus can be hundreds of microseconds. It does not make sense to spin
even for that time.

go install -a std profile on a 4-core machine shows that 11% of time is spent
in the active spin in scang:

  6.12%    compile  compile                [.] runtime.scang
  3.27%    compile  compile                [.] runtime.readgstatus
  1.72%    compile  compile                [.] runtime/internal/atomic.Load

The active spin also increases tail latency in the case of the slightest
oversubscription: GC goroutines spend whole quantum in the loop instead of
executing user code.

Here is scang wait time histogram during go install -a std:

13707.0000 - 1815442.7667 [   118]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎...
1815442.7667 - 3617178.5333 [     9]: ∎∎∎∎∎∎∎∎∎
3617178.5333 - 5418914.3000 [    11]: ∎∎∎∎∎∎∎∎∎∎∎
5418914.3000 - 7220650.0667 [     5]: ∎∎∎∎∎
7220650.0667 - 9022385.8333 [    12]: ∎∎∎∎∎∎∎∎∎∎∎∎
9022385.8333 - 10824121.6000 [    13]: ∎∎∎∎∎∎∎∎∎∎∎∎∎
10824121.6000 - 12625857.3667 [    15]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
12625857.3667 - 14427593.1333 [    18]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
14427593.1333 - 16229328.9000 [    18]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
16229328.9000 - 18031064.6667 [    32]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
18031064.6667 - 19832800.4333 [    28]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
19832800.4333 - 21634536.2000 [     6]: ∎∎∎∎∎∎
21634536.2000 - 23436271.9667 [    15]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
23436271.9667 - 25238007.7333 [    11]: ∎∎∎∎∎∎∎∎∎∎∎
25238007.7333 - 27039743.5000 [    27]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
27039743.5000 - 28841479.2667 [    20]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
28841479.2667 - 30643215.0333 [    10]: ∎∎∎∎∎∎∎∎∎∎
30643215.0333 - 32444950.8000 [     7]: ∎∎∎∎∎∎∎
32444950.8000 - 34246686.5667 [     4]: ∎∎∎∎
34246686.5667 - 36048422.3333 [     4]: ∎∎∎∎
36048422.3333 - 37850158.1000 [     1]: ∎
37850158.1000 - 39651893.8667 [     5]: ∎∎∎∎∎
39651893.8667 - 41453629.6333 [     2]: ∎∎
41453629.6333 - 43255365.4000 [     2]: ∎∎
43255365.4000 - 45057101.1667 [     2]: ∎∎
45057101.1667 - 46858836.9333 [     1]: ∎
46858836.9333 - 48660572.7000 [     2]: ∎∎
48660572.7000 - 50462308.4667 [     3]: ∎∎∎
50462308.4667 - 52264044.2333 [     2]: ∎∎
52264044.2333 - 54065780.0000 [     2]: ∎∎

and the zoomed-in first part:

13707.0000 - 19916.7667 [     2]: ∎∎
19916.7667 - 26126.5333 [     2]: ∎∎
26126.5333 - 32336.3000 [     9]: ∎∎∎∎∎∎∎∎∎
32336.3000 - 38546.0667 [     8]: ∎∎∎∎∎∎∎∎
38546.0667 - 44755.8333 [    12]: ∎∎∎∎∎∎∎∎∎∎∎∎
44755.8333 - 50965.6000 [    10]: ∎∎∎∎∎∎∎∎∎∎
50965.6000 - 57175.3667 [     5]: ∎∎∎∎∎
57175.3667 - 63385.1333 [     6]: ∎∎∎∎∎∎
63385.1333 - 69594.9000 [     5]: ∎∎∎∎∎
69594.9000 - 75804.6667 [     6]: ∎∎∎∎∎∎
75804.6667 - 82014.4333 [     6]: ∎∎∎∎∎∎
82014.4333 - 88224.2000 [     4]: ∎∎∎∎
88224.2000 - 94433.9667 [     1]: ∎
94433.9667 - 100643.7333 [     1]: ∎
100643.7333 - 106853.5000 [     2]: ∎∎
106853.5000 - 113063.2667 [     0]:
113063.2667 - 119273.0333 [     2]: ∎∎
119273.0333 - 125482.8000 [     2]: ∎∎
125482.8000 - 131692.5667 [     1]: ∎
131692.5667 - 137902.3333 [     1]: ∎
137902.3333 - 144112.1000 [     0]:
144112.1000 - 150321.8667 [     2]: ∎∎
150321.8667 - 156531.6333 [     1]: ∎
156531.6333 - 162741.4000 [     1]: ∎
162741.4000 - 168951.1667 [     0]:
168951.1667 - 175160.9333 [     0]:
175160.9333 - 181370.7000 [     1]: ∎
181370.7000 - 187580.4667 [     1]: ∎
187580.4667 - 193790.2333 [     2]: ∎∎
193790.2333 - 200000.0000 [     0]:

Here is casgstatus wait time histogram:

  631.0000 -  5276.6333 [     3]: ∎∎∎
 5276.6333 -  9922.2667 [     5]: ∎∎∎∎∎
 9922.2667 - 14567.9000 [     2]: ∎∎
14567.9000 - 19213.5333 [     6]: ∎∎∎∎∎∎
19213.5333 - 23859.1667 [     5]: ∎∎∎∎∎
23859.1667 - 28504.8000 [     6]: ∎∎∎∎∎∎
28504.8000 - 33150.4333 [     6]: ∎∎∎∎∎∎
33150.4333 - 37796.0667 [     2]: ∎∎
37796.0667 - 42441.7000 [     1]: ∎
42441.7000 - 47087.3333 [     3]: ∎∎∎
47087.3333 - 51732.9667 [     0]:
51732.9667 - 56378.6000 [     1]: ∎
56378.6000 - 61024.2333 [     0]:
61024.2333 - 65669.8667 [     0]:
65669.8667 - 70315.5000 [     0]:
70315.5000 - 74961.1333 [     1]: ∎
74961.1333 - 79606.7667 [     0]:
79606.7667 - 84252.4000 [     0]:
84252.4000 - 88898.0333 [     0]:
88898.0333 - 93543.6667 [     0]:
93543.6667 - 98189.3000 [     0]:
98189.3000 - 102834.9333 [     0]:
102834.9333 - 107480.5667 [     1]: ∎
107480.5667 - 112126.2000 [     0]:
112126.2000 - 116771.8333 [     0]:
116771.8333 - 121417.4667 [     0]:
121417.4667 - 126063.1000 [     0]:
126063.1000 - 130708.7333 [     0]:
130708.7333 - 135354.3667 [     0]:
135354.3667 - 140000.0000 [     1]: ∎

Ideally we would eliminate the waiting by switching to an async
state machine for GC, but for now we just yield to the OS scheduler
after a reasonable wait time.

To choose yielding parameters I've measured
golang.org/x/benchmarks/http tail latencies with different yield
delays and oversubscription levels.

With no oversubscription (to the degree possible):

scang yield delay = 1, casgstatus yield delay = 1
Latency-50   1.41ms ±15%  1.41ms ± 5%    ~     (p=0.611 n=13+12)
Latency-95   5.21ms ± 2%  5.15ms ± 2%  -1.15%  (p=0.012 n=13+13)
Latency-99   7.16ms ± 2%  7.05ms ± 2%  -1.54%  (p=0.002 n=13+13)
Latency-999  10.7ms ± 9%  10.2ms ±10%  -5.46%  (p=0.004 n=12+13)

scang yield delay = 5000, casgstatus yield delay = 3000
Latency-50   1.41ms ±15%  1.41ms ± 8%    ~     (p=0.511 n=13+13)
Latency-95   5.21ms ± 2%  5.14ms ± 2%  -1.23%  (p=0.006 n=13+13)
Latency-99   7.16ms ± 2%  7.02ms ± 2%  -1.94%  (p=0.000 n=13+13)
Latency-999  10.7ms ± 9%  10.1ms ± 8%  -6.14%  (p=0.000 n=12+13)

scang yield delay = 10000, casgstatus yield delay = 5000
Latency-50   1.41ms ±15%  1.45ms ± 6%    ~     (p=0.724 n=13+13)
Latency-95   5.21ms ± 2%  5.18ms ± 1%    ~     (p=0.287 n=13+13)
Latency-99   7.16ms ± 2%  7.05ms ± 2%  -1.64%  (p=0.002 n=13+13)
Latency-999  10.7ms ± 9%  10.0ms ± 5%  -6.72%  (p=0.000 n=12+13)

scang yield delay = 30000, casgstatus yield delay = 10000
Latency-50   1.41ms ±15%  1.51ms ± 7%  +6.57%  (p=0.002 n=13+13)
Latency-95   5.21ms ± 2%  5.21ms ± 2%    ~     (p=0.960 n=13+13)
Latency-99   7.16ms ± 2%  7.06ms ± 2%  -1.50%  (p=0.012 n=13+13)
Latency-999  10.7ms ± 9%  10.0ms ± 6%  -6.49%  (p=0.000 n=12+13)

scang yield delay = 100000, casgstatus yield delay = 50000
Latency-50   1.41ms ±15%  1.53ms ± 6%  +8.48%  (p=0.000 n=13+12)
Latency-95   5.21ms ± 2%  5.23ms ± 2%    ~     (p=0.287 n=13+13)
Latency-99   7.16ms ± 2%  7.08ms ± 2%  -1.21%  (p=0.004 n=13+13)
Latency-999  10.7ms ± 9%   9.9ms ± 3%  -7.99%  (p=0.000 n=12+12)

scang yield delay = 200000, casgstatus yield delay = 100000
Latency-50   1.41ms ±15%  1.47ms ± 5%    ~     (p=0.072 n=13+13)
Latency-95   5.21ms ± 2%  5.17ms ± 2%    ~     (p=0.091 n=13+13)
Latency-99   7.16ms ± 2%  7.02ms ± 2%  -1.99%  (p=0.000 n=13+13)
Latency-999  10.7ms ± 9%   9.9ms ± 5%  -7.86%  (p=0.000 n=12+13)

With slight oversubscription (another instance of http benchmark
was running in background with reduced GOMAXPROCS):

scang yield delay = 1, casgstatus yield delay = 1
Latency-50    840µs ± 3%   804µs ± 3%  -4.37%  (p=0.000 n=15+18)
Latency-95   6.52ms ± 4%  6.03ms ± 4%  -7.51%  (p=0.000 n=18+18)
Latency-99   10.8ms ± 7%  10.0ms ± 4%  -7.33%  (p=0.000 n=18+14)
Latency-999  18.0ms ± 9%  16.8ms ± 7%  -6.84%  (p=0.000 n=18+18)

scang yield delay = 5000, casgstatus yield delay = 3000
Latency-50    840µs ± 3%   809µs ± 3%  -3.71%  (p=0.000 n=15+17)
Latency-95   6.52ms ± 4%  6.11ms ± 4%  -6.29%  (p=0.000 n=18+18)
Latency-99   10.8ms ± 7%   9.9ms ± 6%  -7.55%  (p=0.000 n=18+18)
Latency-999  18.0ms ± 9%  16.5ms ±11%  -8.49%  (p=0.000 n=18+18)

scang yield delay = 10000, casgstatus yield delay = 5000
Latency-50    840µs ± 3%   823µs ± 5%  -2.06%  (p=0.002 n=15+18)
Latency-95   6.52ms ± 4%  6.32ms ± 3%  -3.05%  (p=0.000 n=18+18)
Latency-99   10.8ms ± 7%  10.2ms ± 4%  -5.22%  (p=0.000 n=18+18)
Latency-999  18.0ms ± 9%  16.7ms ±10%  -7.09%  (p=0.000 n=18+18)

scang yield delay = 30000, casgstatus yield delay = 10000
Latency-50    840µs ± 3%   836µs ± 5%    ~     (p=0.442 n=15+18)
Latency-95   6.52ms ± 4%  6.39ms ± 3%  -2.00%  (p=0.000 n=18+18)
Latency-99   10.8ms ± 7%  10.2ms ± 6%  -5.15%  (p=0.000 n=18+17)
Latency-999  18.0ms ± 9%  16.6ms ± 8%  -7.48%  (p=0.000 n=18+18)

scang yield delay = 100000, casgstatus yield delay = 50000
Latency-50    840µs ± 3%   836µs ± 6%    ~     (p=0.401 n=15+18)
Latency-95   6.52ms ± 4%  6.40ms ± 4%  -1.79%  (p=0.010 n=18+18)
Latency-99   10.8ms ± 7%  10.2ms ± 5%  -4.95%  (p=0.000 n=18+18)
Latency-999  18.0ms ± 9%  16.5ms ±14%  -8.17%  (p=0.000 n=18+18)

scang yield delay = 200000, casgstatus yield delay = 100000
Latency-50    840µs ± 3%   828µs ± 2%  -1.49%  (p=0.001 n=15+17)
Latency-95   6.52ms ± 4%  6.38ms ± 4%  -2.04%  (p=0.001 n=18+18)
Latency-99   10.8ms ± 7%  10.2ms ± 4%  -4.77%  (p=0.000 n=18+18)
Latency-999  18.0ms ± 9%  16.9ms ± 9%  -6.23%  (p=0.000 n=18+18)

With significant oversubscription (background http benchmark
was running with full GOMAXPROCS):

scang yield delay = 1, casgstatus yield delay = 1
Latency-50   1.32ms ±12%  1.30ms ±13%    ~     (p=0.454 n=14+14)
Latency-95   16.3ms ±10%  15.3ms ± 7%  -6.29%  (p=0.001 n=14+14)
Latency-99   29.4ms ±10%  27.9ms ± 5%  -5.04%  (p=0.001 n=14+12)
Latency-999  49.9ms ±19%  45.9ms ± 5%  -8.00%  (p=0.008 n=14+13)

scang yield delay = 5000, casgstatus yield delay = 3000
Latency-50   1.32ms ±12%  1.29ms ± 9%    ~     (p=0.227 n=14+14)
Latency-95   16.3ms ±10%  15.4ms ± 5%  -5.27%  (p=0.002 n=14+14)
Latency-99   29.4ms ±10%  27.9ms ± 6%  -5.16%  (p=0.001 n=14+14)
Latency-999  49.9ms ±19%  46.8ms ± 8%  -6.21%  (p=0.050 n=14+14)

scang yield delay = 10000, casgstatus yield delay = 5000
Latency-50   1.32ms ±12%  1.35ms ± 9%     ~     (p=0.401 n=14+14)
Latency-95   16.3ms ±10%  15.0ms ± 4%   -7.67%  (p=0.000 n=14+14)
Latency-99   29.4ms ±10%  27.4ms ± 5%   -6.98%  (p=0.000 n=14+14)
Latency-999  49.9ms ±19%  44.7ms ± 5%  -10.56%  (p=0.000 n=14+11)

scang yield delay = 30000, casgstatus yield delay = 10000
Latency-50   1.32ms ±12%  1.36ms ±10%     ~     (p=0.246 n=14+14)
Latency-95   16.3ms ±10%  14.9ms ± 5%   -8.31%  (p=0.000 n=14+14)
Latency-99   29.4ms ±10%  27.4ms ± 7%   -6.70%  (p=0.000 n=14+14)
Latency-999  49.9ms ±19%  44.9ms ±15%  -10.13%  (p=0.003 n=14+14)

scang yield delay = 100000, casgstatus yield delay = 50000
Latency-50   1.32ms ±12%  1.41ms ± 9%  +6.37%  (p=0.008 n=14+13)
Latency-95   16.3ms ±10%  15.1ms ± 8%  -7.45%  (p=0.000 n=14+14)
Latency-99   29.4ms ±10%  27.5ms ±12%  -6.67%  (p=0.002 n=14+14)
Latency-999  49.9ms ±19%  45.9ms ±16%  -8.06%  (p=0.019 n=14+14)

scang yield delay = 200000, casgstatus yield delay = 100000
Latency-50   1.32ms ±12%  1.42ms ±10%   +7.21%  (p=0.003 n=14+14)
Latency-95   16.3ms ±10%  15.0ms ± 7%   -7.59%  (p=0.000 n=14+14)
Latency-99   29.4ms ±10%  27.3ms ± 8%   -7.20%  (p=0.000 n=14+14)
Latency-999  49.9ms ±19%  44.8ms ± 8%  -10.21%  (p=0.001 n=14+13)

All numbers are on 8 cores and with GOGC=10 (http benchmark has
tiny heap, few goroutines and low allocation rate, so by default
GC barely affects tail latency).

10us/5us yield delays seem to provide a reasonable compromise
and give a 5-10% tail latency reduction. That's what is used in this change.

go install -a std results on 4 core machine:

name      old time/op  new time/op  delta
Time       8.39s ± 2%   7.94s ± 2%  -5.34%  (p=0.000 n=47+49)
UserTime   24.6s ± 2%   22.9s ± 2%  -6.76%  (p=0.000 n=49+49)
SysTime    1.77s ± 9%   1.89s ±11%  +7.00%  (p=0.000 n=49+49)
CpuLoad    315ns ± 2%   313ns ± 1%  -0.59%  (p=0.000 n=49+48) # %CPU
MaxRSS    97.1ms ± 4%  97.5ms ± 9%    ~     (p=0.838 n=46+49) # bytes

Update #14396
Update #14189

Change-Id: I3f4109bf8f7fd79b39c466576690a778232055a2
Reviewed-on: https://go-review.googlesource.com/21503
Run-TryBot: Dmitry Vyukov <dvyukov@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
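
For readers skimming the CL, a simplified, illustrative sketch of the spin-then-yield pattern it describes; this is not the runtime's actual scang/casgstatus code, and spinBudget stands in for the tuned 10us/5us delays:

package main

import (
	"runtime"
	"sync/atomic"
)

// waitFor spins briefly on a condition and then yields to the scheduler
// instead of burning the rest of its quantum.
func waitFor(flag *uint32) {
	const spinBudget = 100 // placeholder bound, not the tuned delays
	for i := 0; atomic.LoadUint32(flag) == 0; i++ {
		if i >= spinBudget {
			runtime.Gosched() // let the scheduler run something else
		}
	}
}

func main() {
	var done uint32
	go func() { atomic.StoreUint32(&done, 1) }()
	waitFor(&done)
}
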
@toddboom
Author

I just wanted to revisit this issue and let everyone know that as of v0.13.0 (released today), InfluxDB is now being built with Go 1.6.2.

We did a lot of work over the last couple of months to reduce unnecessary allocations, but the GC performance in Go has also been steadily improving. Based on our own internal testing (influxdata/influxdb#6511) and user feedback (influxdata/influxdb#5217), we're consistently seeing better performance across the board when using Go 1.6.2. This is without tuning the GOGC parameter, which improves performance further still.

I'm going to close this issue out because I don't think it represents an actionable item for the Go team any longer. For anyone else who still sees better performance on Go 1.4.3, I'd recommend working to identify and reduce unnecessary allocations - you'll be happier for it!
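
For anyone chasing allocations the same way, a minimal, hypothetical benchmark sketch (the package and names are made up, not InfluxDB's actual write path) showing how to surface allocs/op with the standard testing package:

package cache_test

import "testing"

// Run with `go test -bench=WritePoints -benchmem` (or rely on
// b.ReportAllocs below) and watch the allocs/op column across Go
// versions and code changes.
func BenchmarkWritePoints(b *testing.B) {
	b.ReportAllocs()
	buf := make([]byte, 0, 1024)
	for i := 0; i < b.N; i++ {
		buf = append(buf[:0], "cpu,host=a value=1"...)
	}
	_ = buf
}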

Finally, we're seeing even better performance with tip, so we'll be looking forward to Go 1.7 later this year. Thanks to the entire Go team for all of the help and responsiveness along the way!

@aclements
Member

@toddboom, that's great! Thanks for updating the issue.

golang locked and limited conversation to collaborators May 12, 2017