Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/http: go 1.17.6's http2-related cancellation change causes long-lived goroutine with custom Transport types #50652

Closed
knz opened this issue Jan 17, 2022 · 7 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

Comments

@knz
Copy link

knz commented Jan 17, 2022

There is a regression in go 1.17.6 that causes a long-lived goroutine to remain blocked after http.client/setRequestCancel() when using a non-http Transport custom type.
This regression was introduced by https://go-review.googlesource.com/c/go/+/361919

We've analyzed the cause of the issue.
If all the following conditions are met:

  • there's no .Cancel channel defined on the Request
  • there's a non-zero Timeout on the Client, set to X seconds
  • the client's Transport field is set to an object that has a different type than those recognized in knownRoundTripperImpl() (either http.Transport, http.http2Transport or http2noDialH2RoundTripper)
  • the request completes

Then the goroutine spawned at the end of setRequestCancel() remains running for X seconds even if the request completes in a short time.

Prior to go 1.17.6, this goroutine would terminate shortly after the body was read successfully.
The problem was introduced by this change, which removed the .stop() call that would clean up the waiting goroutine.

This is a problem because it's common for the client Timeout to be set to a large amount of time (say, 30 seconds), and then issue thousands of requests to a local HTTP server. With this regression, we see thousands of goroutines remaining running, creating lag and bloated goroutine dumps.

Note also that this problem does not occur when using http's own Transport type, because then we get knownTransport == true and setRequestCancel returns a regular deadline-based context. This is probably why go's own test suite did not detect the regression?

What version of Go are you using (go version)?

go version go1.17.6

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

Issue reproduces on all platforms

What did you do?

  1. created http.Client with custom Transport (custom go type)
  2. set non-zero Timeout (say,30 seconds)
  3. sent request to some server with this client
  4. observed remaining goroutines after request completes, remaining blocked for the duration set in Timeout

(This is part of CockroachDB's unit tests. Internal reference: cockroachdb/cockroach#74657)

What did you expect to see?

No remaining goroutine.

What did you see instead?

 Leaked goroutine: goroutine 9374 [select]:
        net/http.setRequestCancel.func4(0x0, 0xc00706f920, 0xc001c80af0, 0xc005640bcc, 0xc006a39320)
                /usr/local/go/src/net/http/client.go:398 +0xb0
        created by net/http.setRequestCancel
                /usr/local/go/src/net/http/client.go:397 +0x497
@seankhliao seankhliao added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 17, 2022
@seankhliao
Copy link
Member

cc @neild

@knz
Copy link
Author

knz commented Jan 17, 2022

I was able to work around with

                               cancelCh := make(chan struct{})
                               req.Cancel = cancelCh
                               defer func() { close(cancelCh) }()

However this workaround is undesirable, because .Cancel is deprecated and it makes staticcheck unhappy.

@knz knz changed the title net/http: go 1.17.6's http2-related cancellation change causes long-lived goroutine with non-canonical Transports net/http: go 1.17.6's http2-related cancellation change causes long-lived goroutine with custom Transports types Jan 18, 2022
@knz knz changed the title net/http: go 1.17.6's http2-related cancellation change causes long-lived goroutine with custom Transports types net/http: go 1.17.6's http2-related cancellation change causes long-lived goroutine with custom Transport types Jan 18, 2022
@neild
Copy link
Contributor

neild commented Jan 18, 2022

Are you closing the response body? This should halt the cancel goroutine:
https://go.googlesource.com/go/+/refs/tags/go1.17.6/src/net/http/client.go#982

Note that the user is expected to always close the response body after making a request (Client.Do docs):

If the returned error is nil, the Response will contain a non-nil Body which the user is expected to close.

@knz
Copy link
Author

knz commented Jan 18, 2022

@neild that is contradictory with the sentence immediately below, which is the one we've been basing our work on:

The request Body, if non-nil, will be closed by the underlying Transport, even on errors.

craig bot pushed a commit to cockroachdb/cockroach that referenced this issue Jan 18, 2022
74910: server: work around a go runtime bug in TestServeIndexHTML r=jtsiros a=knz

Fixes #74657.
Works around golang/go#50652

@rickystewart  was this the only failure in the attempt to upgrade to 1.17.6? I'm somewhat expecting more tests that use the HTTP interface with authentication to fail with the same error, and they all need the same workaround.

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
@neild
Copy link
Contributor

neild commented Jan 18, 2022

The transport closes the request body. The caller needs to close the response body.

Edit: s/call/close/

@knz
Copy link
Author

knz commented Jan 19, 2022

Interesting.
I think the documentation could do a better job of:

  • highlighting the difference in responsibility between request and reponse.
  • highlighting that it's not just "underlying RoundTripper (typically Transport) may not be able to re-use a persistent TCP connection" if the response body is not close; that resources will also be leaked.

adityamaru added a commit to adityamaru/cockroach that referenced this issue Jan 19, 2022
This change `Close()`s the Reader before resetting it when we
encounter a resumable error in the ResumingReader. This is particularly
important for the http external storage provide, since forgetting to
call Close() results in goroutine leaks from go1.17.6 onwards.

See: golang/go#50652

Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Jan 19, 2022
This change `Close()`s the Reader before resetting it when we
encounter a resumable error in the ResumingReader. This is particularly
important for the http external storage provide, since forgetting to
call Close() results in goroutine leaks from go1.17.6 onwards.

See: golang/go#50652

Release note: None
craig bot pushed a commit to cockroachdb/cockroach that referenced this issue Jan 19, 2022
74920: server: create new endpoint that return the roles of the sql user r=maryliag a=maryliag

The commit creates a new endpoint `/sqlroles` that returns
a list of roles of the SQL user logged in.

Partially addresses #74817

Release note (api change): Creation of new endpoint `/sqlroles` that
returns a list of the SQL roles for the SQL user logged in.

75088: sql: migrate has_sequence_privilege from evalPrivilegeCheck to ctx.Pl… r=otan a=ecwall

…anner.HasPrivilege

refs #66173

Migrate has_sequence_privilege from evalPrivilegeCheck to ctx.Planner.HasPrivilege.

Release note: None

75116: bazel: update comments in `BUILD.bazel`, include reference to `dev -h` r=irfansharif a=rickystewart

Most of this stuff is out-of-date at this point.

Release note: None

75145: sql: deflake TestTelemetry r=rytaft a=rytaft

This commit deflakes `TestTelemetry` by adding a more precise
`feature-allowlist`.

Fixes #75138

Release note: None

75149: cloud: close Reader before resetting in ResumingReader r=knz a=adityamaru

This change `Close()`s the Reader before resetting it when we
encounter a resumable error in the ResumingReader. This is particularly
important for the http external storage provide, since forgetting to
call Close() results in goroutine leaks from go1.17.6 onwards.

See: golang/go#50652

Fixes: #75143

Release note: None

Co-authored-by: Marylia Gutierrez <marylia@cockroachlabs.com>
Co-authored-by: Evan Wall <wall@cockroachlabs.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
Co-authored-by: Aditya Maru <adityamaru@gmail.com>
@neild
Copy link
Contributor

neild commented Jan 19, 2022

The documentation is quite specific on the need to close the body; it's even the third sentence in the package docs. ("The client must close the response body when finished with it")

If you've got a concrete suggestion on how the docs can be improved, I'm happy to review a CL.

Closing this issue as working-as-intended.

@neild neild closed this as completed Jan 19, 2022
gtr pushed a commit to gtr/cockroach that referenced this issue Jan 24, 2022
This change `Close()`s the Reader before resetting it when we
encounter a resumable error in the ResumingReader. This is particularly
important for the http external storage provide, since forgetting to
call Close() results in goroutine leaks from go1.17.6 onwards.

See: golang/go#50652

Release note: None
craig bot pushed a commit to cockroachdb/cockroach that referenced this issue Jan 25, 2022
75136: *: always call Close() on http.Response.Body r=Azhng a=knz

Since go 1.17.6, forgetting to call Close() results in goroutine
leaks.

See: golang/go#50652

Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
@golang golang locked and limited conversation to collaborators Jan 19, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

4 participants