x/net/http2: investigate Server speed regressions from more select cases #20302

Closed
bradfitz opened this issue May 9, 2017 · 23 comments

@bradfitz
Contributor

bradfitz commented May 9, 2017

Go bug for tracking kubernetes/kubernetes#45216

It appears that the new select cases in http2.serverConn.serve slowed things down.

But tip (~Go 1.9) has had a number of select improvements since Go 1.8, so the regression may not show up as much there. See the comments and numbers in the linked bug.

/cc @tombergan

@bradfitz bradfitz added this to the Go1.9 milestone May 9, 2017
@bradfitz bradfitz self-assigned this May 9, 2017
@gopherbot

CL https://golang.org/cl/43034 mentions this issue.

@tombergan
Contributor

From the profile linked in kubernetes/kubernetes#45216, this looks like lock contention.

runtime.selectgoImpl: 13.06s of 451.69s
runtime.selunlock: 0.58s of 179.29s
runtime.sellock: 1.74s of 251.83s

Compare this to the profile from Go 1.7:

runtime.selectgoImpl: 4.65s of 12.69s
runtime.selunlock: doesn't register
runtime.sellock: doesn't register
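
For reference, a comparable CPU profile can be collected from the package's own benchmarks with the standard go test / pprof tooling (the benchmark pattern and test binary name below are just the usual defaults, not what the Kubernetes team used):

$ go test -bench=Gets -cpuprofile=cpu.out
$ go tool pprof http2.test cpu.out
(pprof) top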

@josharian
Contributor

Pity--looks like mutex profiling doesn't include futex. I wonder how to detect which case is contended; if it is in fact lock contention, knowing which lock it is should give reproduction hints.

@bradfitz
Contributor Author

bradfitz commented May 9, 2017

/cc @mdempsky @randall77 @aclements for any runtime help.

This is a pretty serious speed regression for Kubernetes, but one we can't yet repro in a stand-alone benchmark. Any clues?

@josharian
Contributor

@bradfitz it looks like BenchmarkGets has no concurrency (at least according to the trace viewer), so the select statement isn't under any contention. So if the problem is lock contention, it won't show up. Maybe b.RunParallel to the rescue?

@josharian
Contributor

(I tried hacking it in myself but broke all sorts of things--obviously requires more domain expertise than I have to get it right.)

@tombergan
Contributor

Here is a version of BenchmarkServerGets that adds concurrency. I cannot reproduce the bug though. The benchmark has the same runtime before and after commenting out the select cases listed in this comment.

I wonder if there's something about their testbed setup that makes really long scheduling delays possible? Or maybe they're running in a VM that has a weird implementation of sys_futex?

@josharian
Contributor

@tombergan thanks. That benchmark patch doesn't apply cleanly for me (mail a CL next time?), but I applied it manually, and got:

BenchmarkServerGets-8   	--- FAIL: BenchmarkServerGets-8
	server_test.go:2989: Error: timeout waiting for frame
	server_test.go:235: Framer read log:
		2017-05-10 07:25:51.540231465 Framer 0xc4205082a0: read SETTINGS len=24, settings: MAX_FRAME_SIZE=1048576, MAX_CONCURRENT_STREAMS=101, MAX_HEADER_LIST_SIZE=1048896, INITIAL_WINDOW_SIZE=1048576
		2017-05-10 07:25:51.54027134 Framer 0xc4205082a0: read WINDOW_UPDATE len=4 (conn) incr=983041
		2017-05-10 07:25:51.540286228 Framer 0xc4205082a0: read SETTINGS flags=ACK len=0
		2017-05-10 07:25:51.540439446 Framer 0xc4205082a0: read HEADERS flags=END_HEADERS stream=1 len=49
		2017-05-10 07:25:51.540454635 Framer 0xc4205082a0: read DATA flags=END_STREAM stream=1 len=12 data="Hello, world"

and lots more errors.

Anyway, from visually inspecting the patch, it does set up a concurrent get, but not N concurrent gets, so it doesn't appear that it would really introduce contention. You want something (waves hands) like:

setUpServer()

b.RunParallel(func(pb *testing.PB) {
    for pb.Next() {
        get()
    }
})
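
Fleshed out, a benchmark of that shape might look like the sketch below. It uses a stock httptest server and http.Get as stand-ins for the http2 serverTester plumbing, so it only demonstrates the b.RunParallel structure, not the real BenchmarkServerGets:

package bench_test

import (
    "io"
    "net/http"
    "net/http/httptest"
    "testing"
)

// Sketch only: the b.RunParallel shape suggested above.
func BenchmarkParallelGets(b *testing.B) {
    ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        io.WriteString(w, "Hello, world")
    }))
    defer ts.Close()

    b.ResetTimer()
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            res, err := http.Get(ts.URL)
            if err != nil {
                b.Error(err)
                return
            }
            io.Copy(io.Discard, res.Body) // drain so the connection can be reused (io.Discard needs Go 1.16+)
            res.Body.Close()
        }
    })
}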

You can check for full concurrency using runtime tracing. Be sure to use a short benchtime to avoid overwhelming amounts of data.

$ go test -bench=Gets -benchtime=50ms -trace=trace.out
$ go tool trace trace.out

You should see all procs active the whole time, not just one or two. If it is fully parallel, and there are contention effects, you should also be able to see the effect when you mess with -cpu:

$ go test -bench=Gets -cpu=1,2,4,8,16,32,64,128

@tombergan
Contributor

tombergan commented May 10, 2017

Sorry the patch didn't apply cleanly, not sure why that happened.

HTTP/2 frames must be serialized onto a TCP connection. Whether or not the "gets" run concurrently on the client side is not relevant to server-side concurrency. Previously, the test forced serial execution on the server because it waited for a response immediately after sending each request. My patched test reads the responses in a background goroutine so the server can respond concurrently with the client sending new requests. The client still needs to send requests serially (the st.writeHeaders call) because, again, HTTP/2 frames are serialized.
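
Schematically, the patched client side has this shape (a sketch only; sendRequest and readResponse are hypothetical stand-ins for the serverTester frame helpers, not real API):

package sketch

import "testing"

// Requests are written serially, since HTTP/2 frames share one TCP
// connection, while responses are drained on a background goroutine so
// the server's handlers can overlap.
func runSerialRequestsConcurrentReads(b *testing.B, sendRequest func(streamID uint32), readResponse func()) {
    done := make(chan struct{})
    go func() {
        defer close(done)
        for i := 0; i < b.N; i++ {
            readResponse() // e.g. wait for HEADERS, then DATA with END_STREAM
        }
    }()
    for i := 0; i < b.N; i++ {
        sendRequest(uint32(2*i + 1)) // client-initiated streams use odd IDs
    }
    <-done
}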

The server will serialize response frames, but each HTTP request handler should run concurrently. That serialization is managed by the serverConn.serve loop, which is the method containing the large select statement. That select waits on new frames from the client, new write requests from HTTP handlers, completed net.Conn.Write calls (which happen in a background goroutine), timers, and other small things. Assuming lock contention is the problem, the contention is likely on the serverConn's wantWriteFrameCh, since that channel is written to by all request handler goroutines, while the other channels have only one writer.
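
The suspected mechanism is easy to demonstrate in isolation: selectgo locks every channel a select mentions, so a channel shared by thousands of goroutines becomes a lock hot spot even if its case never fires. A minimal, self-contained illustration (not the http2 code; the counts are arbitrary, with 5000 chosen to echo kubemark-5000):

package main

import (
    "fmt"
    "runtime"
    "sync"
    "time"
)

// Each goroutine selects in a tight loop on its own always-ready channel
// plus one shared channel that never fires. Every select still locks the
// shared channel, so its lock is contended by all goroutines. Remove the
// "case <-shared" line and compare the elapsed time to see the difference.
func main() {
    const (
        workers    = 5000
        iterations = 2000
    )
    shared := make(chan struct{}) // stands in for a channel like gracefulShutdownCh
    var wg sync.WaitGroup

    start := time.Now()
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            local := make(chan int, 1)
            local <- 0 // keep the local case always ready
            for i := 0; i < iterations; i++ {
                select {
                case v := <-local:
                    local <- v + 1
                case <-shared: // never ready, but still locked on every select
                    return
                }
            }
        }()
    }
    wg.Wait()
    fmt.Printf("GOMAXPROCS=%d elapsed=%v\n", runtime.GOMAXPROCS(0), time.Since(start))
}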

Thanks for the tips about runtime tracing. Hopefully I have time later today to try that out.

@josharian
Contributor

Cool. Let me know if you have any problems with tracing; happy to help. Feel free to also ping me on some form of chat if interactive discussion would be helpful. Brad can give you my contact info.

@bradfitz
Contributor Author

@tombergan, yes, that's the sort of benchmark I was imagining: one connection, bunch of concurrent requests streaming from long-running handlers doing Write+Flush for b.N iterations or something. That sounds like what they're hitting with their "Watch" APIs.
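
A handler of that kind would look roughly like the sketch below (generic, not the benchmark that was eventually written):

package sketch

import (
    "fmt"
    "net/http"
)

// Long-running streaming handler: every iteration does a Write followed by
// a Flush, so over HTTP/2 each iteration has to go through the
// serverConn.serve loop to get its DATA frame scheduled.
func streamingHandler(iterations int) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        flusher, ok := w.(http.Flusher)
        if !ok {
            http.Error(w, "streaming unsupported", http.StatusInternalServerError)
            return
        }
        for i := 0; i < iterations; i++ {
            fmt.Fprintf(w, "event %d\n", i)
            flusher.Flush()
        }
    })
}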

@gopherbot

CL https://golang.org/cl/43230 mentions this issue.

@tombergan
Contributor

For the diagnosis, see this comment onwards.

@josharian
Contributor

Depending on the fix, this might be for Go 1.8.2, since it causes serious problems and there isn't a clear workaround. Marking as Go 1.8.2, with the expectation that others may disagree and move it back.

@josharian josharian modified the milestones: Go1.8.2, Go1.9 May 11, 2017
@gopherbot

CL https://golang.org/cl/43455 mentions this issue.

@bradfitz
Contributor Author

I filed a more general bug at #20351

gopherbot pushed a commit to golang/net that referenced this issue May 13, 2017
gracefulShutdownCh is shared by all connections in a server. When a
server accumulates many connections (e.g., 5000 in the kubemark-5000
benchmark), we have 5000 serverConn.serve goroutines selecting on this
channel. This means 5000 goroutines hammer the channel's lock, which
causes severe lock contention.

The fix in this CL is to make a local proxy for gracefulShutdownCh in
each connection so that each connection selects on gracefulShutdownCh
at most once per connection rather than once per serverConn.serve loop
iteration.

This fix is intended to be backported quickly into Go 1.8.2. The
downside of this fix is 2KB extra stack usage per connection. A better
fix will be implemented in Go 1.9.

Unfortunately, I have been unable to reproduce this problem locally.
This fix was verified by the kubernetes team. See:
kubernetes/kubernetes#45216 (comment)

Updates golang/go#20302

Change-Id: I19ab19268a6ccab9b6e9dffa0cfbc89b8c7d0f19
Reviewed-on: https://go-review.googlesource.com/43455
Run-TryBot: Tom Bergan <tombergan@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
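
The shape of that fix, reduced to its essentials (a simplified sketch, not the actual x/net/http2 code):

package sketch

// awaitShutdown is a simplified version of the "local proxy channel"
// pattern described above: one extra goroutine per connection (the ~2KB
// stack cost mentioned in the commit) waits on the shared shutdown
// channel once and forwards the signal to a per-connection channel.
func awaitShutdown(shared, connClosed <-chan struct{}) <-chan struct{} {
    local := make(chan struct{})
    go func() {
        select {
        case <-shared:
            close(local)
        case <-connClosed: // connection went away first; stop waiting
        }
    }()
    return local
}

The hot serve loop then selects on the returned per-connection channel instead of the shared gracefulShutdownCh, so thousands of connections no longer touch the shared channel's lock on every loop iteration.
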
@gopherbot

CL https://golang.org/cl/43459 mentions this issue.

gopherbot pushed a commit to golang/net that referenced this issue May 15, 2017
This drops the number of select cases in serverConn.serve from 10 to
6. It was 7 in Go 1.7 and increased to 10 in Go 1.8.

* replace testing-only testHookCh with always-on grab-bag serveMsgCh
  to be used for both test purposes, and infrequently used message
  types

* remove the settingsTimer.C case for the initial settings timer
  and use serveMsgCh and time.AfterFunc instead

* ... and do the same for the idle timeout timer.

* ... and for the shutdown timer.

* remove wantStartPushCh and just send the *startPushRequest to
  serveMsgCh too

I could go further with this (and plan to, later), but these are the
safe and easy ones that don't require more work elsewhere.

The speedup grows with how often the request/response path goes through
the serverConn.serve loop (once per Request.Body.Read or
ResponseWriter.Flush):

  name            old time/op    new time/op  delta
  ServerGets-4    138µs ± 3%     134µs ± 2%   -2.54%   (p=0.002 n=10+10)
  ServerPosts-4   176µs ±27%     154µs ± 2%   -12.62%  (p=0.011 n=10+10)

Updates kubernetes/kubernetes#45216
Updates golang/go#20302

Change-Id: I18019554089d7e3d76355d7137b5957e9597e803
Reviewed-on: https://go-review.googlesource.com/43034
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Tom Bergan <tombergan@google.com>
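
The timer-related part of that change follows a simple pattern, sketched below (simplified, not the actual x/net/http2 code): instead of a dedicated select case per timer channel, fire the timer with time.AfterFunc and funnel the result through the one grab-bag serveMsgCh the loop already selects on. The timeout constant here is illustrative only.

package sketch

import "time"

const firstSettingsTimeout = 2 * time.Second // illustrative value

type settingsTimerFired struct{}

type conn struct {
    serveMsgCh  chan interface{} // grab-bag channel for infrequent messages
    doneServing chan struct{}
}

func (c *conn) sendServeMsg(msg interface{}) {
    select {
    case c.serveMsgCh <- msg:
    case <-c.doneServing: // serve loop already exited; don't block the timer goroutine
    }
}

func (c *conn) serve() {
    defer close(c.doneServing)

    // One time.AfterFunc per timer instead of one select case per timer.
    settingsTimer := time.AfterFunc(firstSettingsTimeout, func() {
        c.sendServeMsg(settingsTimerFired{})
    })
    defer settingsTimer.Stop()

    for {
        select {
        case msg := <-c.serveMsgCh:
            if _, ok := msg.(settingsTimerFired); ok {
                return // e.g. the client never completed the SETTINGS handshake
            }
            // ... handle other infrequent message types here ...
        // ... plus, in the real loop, the hot cases: frames read, frames to write, etc.
        }
    }
}
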
@bradfitz bradfitz modified the milestones: Go1.8.2, Go1.8.3 May 18, 2017
@bradfitz bradfitz removed this from the Go1.8.2 milestone May 18, 2017
gopherbot pushed a commit to golang/net that referenced this issue May 23, 2017
…cefulShutdownCh

(Same commit message as the gracefulShutdownCh fix above.)
(cherry picked from commit d3ede01)
Reviewed-on: https://go-review.googlesource.com/43459
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Chris Broadfoot <cbro@golang.org>
@gopherbot

CL https://golang.org/cl/43996 mentions this issue.

gopherbot pushed a commit that referenced this issue May 23, 2017
…tdownCh lock contention slowdown

This updates the bundled x/net/http2 repo to git rev 186fd3fc (from
the net repo's release-branch.go1.8) for:

    [release-branch.go1.8] http2: fix lock contention slowdown due to gracefulShutdownCh
    https://golang.org/cl/43459

Fixes #20302

Change-Id: Ia01a44c6749292de9c16ca330bdebe1e52458b18
Reviewed-on: https://go-review.googlesource.com/43996
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Chris Broadfoot <cbro@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue May 23, 2017
This will be used to allow http2 servers to register a shutdown function
so that net/http.Server.Shutdown will work when the http2 server is
configured via a manual call to http2.ConfigureServer. Currently, Shutdown
only works when the http2 server is configured automatically by the
net/http package.

Updates #20302
Updates #18471

Change-Id: Ifc2b5f3126126a106b49ea4a7e999279852b9cc9
Reviewed-on: https://go-review.googlesource.com/44003
Run-TryBot: Tom Bergan <tombergan@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
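
For context, the manual-configuration path this enables looks roughly like the sketch below (cert.pem/key.pem and the timings are placeholders; error handling is trimmed):

package main

import (
    "context"
    "log"
    "net/http"
    "time"

    "golang.org/x/net/http2"
)

// Sketch of the manual http2 setup described above: with the shutdown hook
// registered, net/http.Server.Shutdown also drains the HTTP/2 side.
func main() {
    srv := &http.Server{Addr: ":8443", Handler: http.DefaultServeMux}
    if err := http2.ConfigureServer(srv, &http2.Server{}); err != nil {
        log.Fatal(err)
    }

    go func() {
        if err := srv.ListenAndServeTLS("cert.pem", "key.pem"); err != http.ErrServerClosed {
            log.Fatal(err)
        }
    }()

    time.Sleep(5 * time.Second) // stand-in for waiting on a real shutdown signal

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    if err := srv.Shutdown(ctx); err != nil {
        log.Fatal(err)
    }
}
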
@tombergan
Contributor

Reopening for the Go 1.9 fixes.

@tombergan tombergan reopened this May 23, 2017
@gopherbot

CL https://golang.org/cl/44003 mentions this issue.

@tombergan tombergan modified the milestones: Go1.9, Go1.8.3 May 23, 2017
gopherbot pushed a commit to golang/net that referenced this issue May 24, 2017
This is a better fix than https://golang.org/cl/43455. Instead of
creating a separate goroutine to wait for the global shutdown channel,
we reuse the new serveMsgCh, which was added in a prior CL.

We also use the new net/http.Server.RegisterOnShutdown method to
register a shutdown callback for each http2.Server.

Updates golang/go#20302
Updates golang/go#18471

Change-Id: Icf29d5e4f65b3779d1fb4ea92924e4fb6bdadb2a
Reviewed-on: https://go-review.googlesource.com/43230
Run-TryBot: Tom Bergan <tombergan@google.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@tombergan
Contributor

Still TODO is to bundle https://golang.org/cl/44003 into net/http.

@gopherbot

CL https://golang.org/cl/44006 mentions this issue.

@bradfitz
Contributor Author

@tombergan, I'm going to switch this back to Go 1.8.3 so it shows up in the GitHub search link from the Go 1.8.3 release notes.

/cc @broady

@bradfitz bradfitz modified the milestones: Go1.8.3, Go1.9 May 24, 2017
@golang golang locked and limited conversation to collaborators May 24, 2018