Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/http: Transport.CancelRequest is inherently racy #11013

Closed
jacobsa opened this issue Jun 1, 2015 · 13 comments
Closed

net/http: Transport.CancelRequest is inherently racy #11013

jacobsa opened this issue Jun 1, 2015 · 13 comments
Milestone

Comments

@jacobsa
Copy link
Contributor

jacobsa commented Jun 1, 2015

The implementation of http.Transport.CancelRequest looks for the supplied request struct in a map of in-flight requests, doing nothing if it doesn't exist. If I understand correctly, that means that this snippet is racy, in the sense that cancellation of the context may be lost if it is seen before RoundTrip gets around to sticking the request in the map:

func cancellableDo(ctx context.Context, req *http.Request) (err error) {
  // Propagate cancellation of the context to the HTTP request.
  go func() {
    <-context.Done()
    http.DefaultTransport.CancelRequest(req)
  }()

  // Call through.
  err = http.DefaultTransport.RoundTrip(req)
  return
}

This is made more likely to happen by RoundTripper implementations like oauth2.Transport, which may make network calls before calling through to RoundTrip. In the end I'm forced to cancel in a hacky loop to ensure that the HTTP package pays attention.

Is there anything that can be done to eliminate this race?

@mikioh mikioh changed the title http: Transport.CancelRequest is inherently racy net/http: Transport.CancelRequest is inherently racy Jun 1, 2015
@ianlancetaylor ianlancetaylor added this to the Go1.5Maybe milestone Jun 3, 2015
@ianlancetaylor
Copy link
Contributor

CC @bradfitz

@DanielMorsing
Copy link
Contributor

This came up on the list again. https://groups.google.com/d/topic/golang-nuts/m-Ocj7YVahU/discussion

I'm starting to think that it's impossible to do cancellation without races with the current interface. We might have to make a new RoundTrip interface which takes a context object and store the cancellation on there, like what http://blog.golang.org/context describes.

@bradfitz
Copy link
Contributor

As pointed out in that thread, the documentation is correct. You can't cancel it until it's been started.

This code is also "racy":

var mu sync.Mutex
m := map[string]string{}
go func() { mu.Lock(); m["foo"] = "bar"; mu.Unlock() }()
go func() { mu.Lock(); delete(m, "foo"); mu.Unlock() }()

... sometimes the map entry will be there and sometimes it won't, because you're not waiting to delete it until it's been added.

That doesn't mean there's a bug with maps or with delete.

We could add more words if that would help, but we can't start adding contexts to APIs for cancellation because the standard library a) doesn't have them, and b) is frozen.

@DanielMorsing
Copy link
Contributor

Having the outcome of the map example be non-deterministic is fine, but for the Transport, the stakes are a bit higher. If you cancel a request before it has been started, RoundTrip might block indefinitely when the request does issue, leading to a leaked goroutine and a leaked net.Conn since you never end up calling resp.Body.Close() on the response.

Clearly this is a punt to go1.6 issue, but still think we need to address this. The current API makes it too easy to write code that is subtly wrong.

@jacobsa
Copy link
Contributor Author

jacobsa commented Jun 11, 2015

@bradfitz: Of course there's no bug, in the sense that the API does what is promises. But I'm saying I think it's a less useful API than it could be. Your example is not fair: the user can synchronize on the completion of an insertion into a map they own, but it's impossible for them to synchronize on RoundTrip inserting into its internal map. So it's impossible for them to ensure that a cancellation will not be lost.

You can't change the existing CancelRequest function because of the API freeze, I understand. But surely there is another solution that is backwards compatible, by adding a new method? context.Context would be the "best" thing to use if designing from scratch since the community has standardized on it, but that doesn't mean the standard library can't have its own bespoke version (e.g. pass in a cancellation channel).

@bradfitz
Copy link
Contributor

@jacobsa, the use cases this was originally designed for was calling Transport.CancelRequest after Transport.RoundTrip has returned (and then you know it's in the in-flight map). See #3672 for history.

And for those concerned about Transport.RoundTrip taking too long to return the headers can use Transport.ResponseHeaderTimeout. History is in #3362.

If you really want to know what phase of its lifecycle a Transport's *Request is in, perhaps that ties into #9194

In addition to lack of contexts, it's also unfortunate that CancelRequest doesn't return a value too.

But unless there's something concrete to do here, I'm tempted to close this bug. So please propose something if you have an idea.

Maybe Transport could keep track of the set of requests which have been canceled but not yet started (and then fail the request if it does come in later), and start a timer to remove that map entry in N seconds? But what is N?

@jacobsa
Copy link
Contributor Author

jacobsa commented Jun 11, 2015

@bradfitz: Thanks for the background. I see, it makes a lot more sense when you view it from the lens of wanting to cancel once you've already received the response headers and are still reading the body. I came at it from "wanting to cancel at any time".

I'm sure you're aware that setting a timeout up-front is less satisfying than being able to cancel at an arbitrary time, especially when you're trying to bridge a context.Context API with net/http. But it's still helpful of course, since it bounds your latency from "context is cancelled" to "I actually return from my function that calls into net/http". Still, it has the same trouble you mentioned: how do I choose the timeout?

Here are two concrete proposals:

  1. Add a callback that lets me know when the request becomes cancellable, so I can synchronize on that. (As you said, this might tie in with net/http: Transport: add ConnState callback for http.Transport #9194.)
  2. Add a chan struct{} field to http.Request. Approximately the same semantics of Context.Done: closing it means the request is regarded is cancelled, nil means not cancellable in this manner.

I personally like (2). The cost is that it would require net/http to maintain a background goroutine with a list of cancellation channels for in-flight requests (if any had ever been made), calling reflect.Select to wait for one of them to be closed or for a request to be finished. But this is what users have to do today if they want to support cancellation in the context.Context style, and they would still need to do it if (1) were chosen.

@bradfitz
Copy link
Contributor

I kinda like (2) also. But: what is its type?

a) <-chan struct{} // closed on cancelation
b) chan struct{} // closed on cancelation

And if (b), who can close it? You wouldn't want multiple goroutines calling close on it. One might panic then.

What about Transport.CancelRequest? It would reasonably want to cancel it also. Maybe? I suppose we don't want to allocate the channel by default (and Transport.CancelRequest can't assume it's already there, nor set it from another goroutine), so I CancelRequest will have to forever use its existing mechanism. That leads me to say (a), which is also the type returned by Context.Done in http://godoc.org/golang.org/x/net/context#Context

@jacobsa
Copy link
Contributor Author

jacobsa commented Jun 12, 2015

I think <-chan struct{} would be better, yeah.

Regarding Transport.CancelRequest: I think you could actually leave this as-is, documenting its downsides and maybe marking it as deprecated, but have it continue to do the heavy lifting of cancellation. That is:

  • RoundTrip inserts into the map then tells the cancellation goroutine about the channel.
  • The cancellation goroutine calls CancelRequest when one of the channels it knows about is closed.

Odd to be using the deprecated API to implement the new one internally, but it's safe because you can synchronize on the insert into the map. Put another way: if the old API weren't already promised, it could just be an implementation detail.

@jacobsa
Copy link
Contributor Author

jacobsa commented Jun 28, 2015

@bradfitz: What do you think of my previous comment? If you're good with the idea, I'm happy to send a CL.

@bradfitz
Copy link
Contributor

I don't care much about the implementation. What's easiest or most efficient. I think both the easiest and most efficient is selecting from the *Request.Cancel <-chan struct{} channel (nil or not) would be easiest, without a new map insert or goroutine or anything.

Please send a CL, ideally soon. The tree is increasingly frozen.

@jacobsa
Copy link
Contributor Author

jacobsa commented Jun 29, 2015

@bradfitz: Oh duh, I was way overcomplicating it, you're right. Code review that just adds new select cases (no new goroutines) is here.

@gopherbot
Copy link

CL https://golang.org/cl/11601 mentions this issue.

@mikioh mikioh modified the milestones: Go1.5, Go1.5Maybe Jun 30, 2015
dcbw added a commit to dcbw/flannel that referenced this issue Aug 11, 2015
Due to golang/go#11013 race conditions
in the remote testcases mean that HTTP requests aren't actually
canceled.  In client.go::httpDo() the goroutine that actually
runs the http.Do() may not execute until after that routine's
select{} loop, at which point the Context may have already been
canceled.  Unfortunatel with Go < 1.5 there is a race in the
http.Transport package that drops cancellations on the floor.

Work around the Go http.Transport cancellation issue by checking
a private cancellation variable, since context.Context has no
mechanism to check whether it has already been canceled.  This
can be removed once Flannel can depend on Go 1.5, at which point
the code should start to use http.Request.Cancel instead of
calling http.transport.CancelRequest().

Second, use a WaitGroup in doTestWatch() to assert that the
outstanding watch is actually canceled.
dcbw added a commit to dcbw/flannel that referenced this issue Aug 11, 2015
Due to golang/go#11013 race conditions
in the remote testcases mean that HTTP requests aren't actually
canceled.  In client.go::httpDo() the goroutine that actually
runs the http.Do() may not execute until after that routine's
select{} loop, at which point the Context may have already been
canceled.  Unfortunatel with Go < 1.5 there is a race in the
http.Transport package that drops cancellations on the floor.

Work around the Go http.Transport cancellation issue by checking
a private cancellation variable, since context.Context has no
mechanism to check whether it has already been canceled.  This
can be removed once Flannel can depend on Go 1.5, at which point
the code should start to use http.Request.Cancel instead of
calling http.transport.CancelRequest().

Second, use a WaitGroup in doTestWatch() to assert that the
outstanding watch is actually canceled.
dcbw added a commit to dcbw/flannel that referenced this issue Aug 11, 2015
Due to golang/go#11013 race conditions
in the remote testcases mean that HTTP requests aren't actually
canceled.  In client.go::httpDo() the goroutine that actually
runs the http.Do() may not execute until after that routine's
select{} loop, at which point the Context may have already been
canceled.  Unfortunatel with Go < 1.5 there is a race in the
http.Transport package that drops cancellations on the floor.

Work around the Go http.Transport cancellation issue by checking
a private cancellation variable, since context.Context has no
mechanism to check whether it has already been canceled.  This
can be removed once Flannel can depend on Go 1.5, at which point
the code should start to use http.Request.Cancel instead of
calling http.transport.CancelRequest().

Second, use a WaitGroup in doTestWatch() to assert that the
outstanding watch is actually canceled.

This issue can be reproduced by using the remote_test.go hunks
but not the client.go hunks.  Specifically, doTestWatch() calls
cancel() before watch.go::WatchLeases() calls sm.WatchLeases(),
which in this case is RemoteManager.WatchLeases(), which then
calls RemoteManager.httpGet() -> httpDo(), which then calls
http.Client.Do() which has the aformentioned cancellation race.
dcbw added a commit to dcbw/flannel that referenced this issue Aug 11, 2015
Due to golang/go#11013 race conditions
in the remote testcases mean that HTTP requests aren't actually
canceled.  In client.go::httpDo() the goroutine that actually
runs the http.Do() may not execute until after that routine's
select{} loop, at which point the Context may have already been
canceled.  Unfortunately with Go < 1.5 there is a race in the
http.Transport package that drops cancellations on the floor.

This issue can be reproduced by using the remote_test.go hunks
that use a WaitGroup to wait for subnet.WatchLeases() to exit
after the Context is canceled, but not the client.go hunks.
Specifically, doTestWatch() calls cancel() before
watch.go::WatchLeases() calls sm.WatchLeases(), which in this
case is RemoteManager.WatchLeases(), which then calls
RemoteManager.httpGet() -> httpDo(), which then calls
http.Client.Do() which has the aformentioned cancellation race.

Work around the Go http.Transport cancellation issue by checking
a private cancellation variable, since context.Context has no
mechanism to check whether it has already been canceled.  This
can be removed once Flannel can depend on Go 1.5, at which point
the code should start to use http.Request.Cancel instead of
calling http.transport.CancelRequest().  This obviously doesn't
fix the race, but I think that has to wait for Go 1.5.

Second, use a WaitGroup in doTestWatch() to assert that the
outstanding watch is actually canceled.
dcbw added a commit to dcbw/flannel that referenced this issue Aug 11, 2015
Due to golang/go#11013 race conditions
in the remote testcases mean that HTTP requests aren't actually
canceled.  In client.go::httpDo() the goroutine that actually
runs the http.Do() may not execute until after that routine's
select{} loop, at which point the Context may have already been
canceled.  Unfortunately with Go < 1.5 there is a race in the
http.Transport package that drops cancellations on the floor.

This issue can be reproduced by using the remote_test.go hunks
that use a WaitGroup to wait for subnet.WatchLeases() to exit
after the Context is canceled, but not the client.go hunks.
Specifically, doTestWatch() calls cancel() before
watch.go::WatchLeases() calls sm.WatchLeases(), which in this
case is RemoteManager.WatchLeases(), which then calls
RemoteManager.httpGet() -> httpDo(), which then calls
http.Client.Do() which has the aformentioned cancellation race.

To fix the races more completely (though not as completely
as Go 1.5 will do) we have to copy most of http.Transport
into the tree and implement some cancellation fixups ourselves.
A few parts of the new Transport are notable:

1) CancelRequest() - if the request is not yet being tracked,
the cancelation is saved in another map for later

2) getConn() - the real work is done here; when adding a
new request to the internal table, check if it the client has
asked to cancel it, and actually cancel it.  Second if we
are going to use an idle connection for the request, and the
request has already been canceled, then cancel the idle
connection too.

Second, use a WaitGroup in doTestWatch() to assert that the
outstanding watch is actually canceled.
@golang golang locked and limited conversation to collaborators Jun 29, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants