net/http: http tries to reuse dead TLS connections #3514
Comments
Adam, thoughts on this? I didn't know TLS even had a "close_notify" packet type.

For background: the HTTP transport code opens two goroutines per connection, one for reading and one for writing. It stops using a TCP connection for HTTP (or HTTPS) once the remote side says "Connection: close" (or it's HTTP/1.0 without Connection: keep-alive), or once the reader goroutine peeks a byte and gets EOF.

There might be two issues here:

1) Amazon ELB is not sending a Connection: close on its 503 response?

2) Peeking a byte on a tls.Conn should return EOF or something once it gets a close_notify packet? Or, if it already returns a different error, the net/http Transport code should learn what it is. I see no error vars declared in crypto/tls, though.

Owner changed to @bradfitz.
Status changed to Accepted.
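As a rough illustration of the peek-based liveness check described above (a sketch only, not the actual Transport internals; the helper name `peekAlive` is made up, and the real client runs this from a dedicated read goroutine so the check doesn't block):

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net"
)

// peekAlive reports whether the peer has already closed the connection.
// It blocks until a byte is readable or EOF arrives, which is why the
// real Transport performs this check asynchronously in its read loop.
func peekAlive(br *bufio.Reader) bool {
	_, err := br.Peek(1)
	return err != io.EOF
}

func main() {
	conn, err := net.Dial("tcp", "example.org:80")
	if err != nil {
		fmt.Println("dial:", err)
		return
	}
	defer conn.Close()

	br := bufio.NewReader(conn)
	fmt.Fprint(conn, "GET / HTTP/1.1\r\nHost: example.org\r\nConnection: close\r\n\r\n")
	io.Copy(io.Discard, br) // drain the response

	// After the server closes its side, Peek sees EOF and the
	// connection must not be handed out for another request.
	fmt.Println("reusable:", peekAlive(br))
}
```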
In that case, it's totally legit AFAIU for us to re-use that connection. If the Go client has an outbound request ready before it sees the close_notify packet (assume we do/will handle it correctly), there's a potential race between us sending a request and us hearing that the connection is dead.
Agreed. I just want to set expectations: even if we have a tls.Conn and/or http.Transport bug here and they get fixed, there will still be a race if you have a busy client. But that's likely true with any other language too. If you find that is NOT the case, I'd be curious what other libraries do to avoid this. Treating a 503 as a dead connection always seems lame, for instance, if that's the answer.
TLS has a secure EOF called "close_notify". In theory, every half-connection is closed with a close_notify, and any transport EOF without a close_notify is a truncation error. In practice, things are very messy. However, I can well believe that we ignore close_notify. If treating it as EOF fixes this problem, then that sounds like a reasonable thing to do.
I believe close_notify is already treated as EOF. It's just that we don't end up reading it straight after reading the response. The sequence goes:

1. Request 1 is sent as normal.
2. Response 1 is received as normal (and it happens to be a 503).
3. Request 2 is sent.
4. We try to read response 2, but then realize that close_notify was sent straight after response 1. Request 2 was totally ignored by the server. EOF is returned.
#7's analysis sounds plausible to me. Surely there is a race: we could plausibly finish reading the response and then decide to send another request before the close_notify has arrived.

That suggests a more general experiment that could be carried out. Write an HTTP/1.1 server that answers with a framed response (or a Content-Length), so that the client would think the connection could be reused, but then hangs up on the client, say, 2 seconds after writing the response. If we are keeping connections open in general, then this would simulate the server timing us out at exactly the moment we decide to issue a new request. It seems like we should redial in this situation. Do we?
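A minimal sketch of that experiment (my own illustration, not code from this thread; it hijacks the connection so the close timing is entirely under our control):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	// Serve a properly framed response, then hang up ~2 seconds later,
	// simulating a server that times out an idle keep-alive connection
	// just as the client decides to reuse it.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		conn, _, err := w.(http.Hijacker).Hijack()
		if err != nil {
			return
		}
		fmt.Fprint(conn, "HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello")
		go func() {
			time.Sleep(2 * time.Second)
			conn.Close()
		}()
	})
	log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}
```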
Here's a repro (HTTPS + fake keep-alive response with forced disconnect). With the 500 millisecond sleep uncommented, it's reliable. Without, you get varying errors:

```
serve_test.go:699: Get https://127.0.0.1:38706/: EOF
serve_test.go:699: Get https://127.0.0.1:58785/: http: can't write HTTP request on broken connection
```

```go
// Trying to repro golang.org/issue/3514
func TestTLSServerClosesConnection(t *testing.T) {
	closec := make(chan bool)
	closedc := make(chan bool)
	ts := httptest.NewTLSServer(HandlerFunc(func(w ResponseWriter, r *Request) {
		if strings.Contains(r.URL.Path, "/kill") {
			conn, _, _ := w.(Hijacker).Hijack()
			conn.Close()
			return
		}
		if strings.Contains(r.URL.Path, "/keep-alive-then-die") {
			conn, _, _ := w.(Hijacker).Hijack()
			conn.Write([]byte("HTTP/1.1 200 OK\r\nContent-Length: 3\r\n\r\nfoo"))
			<-closec
			conn.Close()
			closedc <- true
			return
		}
		fmt.Fprintf(w, "hello")
	}))
	defer ts.Close()

	noVerifyTransport := &Transport{
		TLSClientConfig: &tls.Config{
			InsecureSkipVerify: true,
		},
	}
	client := &Client{Transport: noVerifyTransport}

	_, err := client.Get(ts.URL + "/kill")
	if err == nil {
		t.Error("expected error from /kill request")
	}

	res, err := client.Get(ts.URL + "/keep-alive-then-die")
	if err != nil {
		t.Fatal(err)
	}
	slurp, err := ioutil.ReadAll(res.Body)
	if err != nil {
		t.Fatal(err)
	}
	if string(slurp) != "foo" {
		t.Errorf("Got %q, want foo", slurp)
	}

	closec <- true
	<-closedc
	// time.Sleep(500 * time.Millisecond)

	res, err = client.Get(ts.URL + "/")
	if err != nil {
		t.Fatal(err)
	}
	slurp, err = ioutil.ReadAll(res.Body)
	if err != nil {
		t.Fatal(err)
	}
	if string(slurp) != "hello" {
		t.Errorf("Got %q, want hello", slurp)
	}
}
```

The latter bug ("http: can't write HTTP request on broken connection") occurs before we've written anything on the wire, so we could in theory back up a bit and pick a new conn. But the EOF one seems more common, and that's the one where we can't really do anything. Short of maybe issue #4677. Not quite sure what to do here, if anything.

Status changed to Thinking.
I'm running into a problem that sounds very similar, and wanted to add a few observations. My test program is pasted at the bottom.

- This problem gets worse with concurrency. Running 1000 concurrent GET requests to https://www.google.com consistently fails 0.1-1% of the time:

```
$ go run test_get.go -c -s -n 1000 -u www.google.com | grep failed
2/1000 failed
```

  Up that to 10000 concurrent requests and the rate skyrockets to >80%(!):

```
$ go run test_get.go -c -s -n 10000 -u www.google.com | grep failed
8441/10000 failed
```

- This problem also gets worse when redirects are involved, even without concurrency. (For these I used google.com, which 301s to www.google.com.)

```
$ go run test_get.go -c -s -n 1000 -u google.com | grep failed
311/1000 failed
$ go run test_get.go -s -n 1000 -u google.com | grep failed
7/1000 failed
```

- Just as a control, which is what makes me suspect it's this bug, all of the above report 0 failures without -s.

```
$ cat test_get.go
```

```go
package main

import (
	"flag"
	"fmt"
	"net/http"
)

var (
	n = flag.Int("n", 100, "number of GETs to run")
	s = flag.Bool("s", false, "whether to use HTTPS")
	c = flag.Bool("c", false, "whether to run concurrently")
	u = flag.String("u", "google.com", "URL to fetch, without scheme")
)

func main() {
	flag.Parse()
	out := make(chan error, *n)
	var url string
	if *s {
		url = fmt.Sprintf("https://%v", *u)
	} else {
		url = fmt.Sprintf("http://%v", *u)
	}
	for i := 0; i < *n; i++ {
		f := func() {
			res, err := http.Get(url)
			if res != nil {
				res.Body.Close()
			}
			if err != nil {
				out <- err
			} else {
				out <- nil
			}
		}
		if *c {
			go f()
		} else {
			f()
		}
	}
	failed := 0
	for i := 0; i < *n; i++ {
		err := <-out
		if err != nil {
			fmt.Println(err)
			failed++
		}
	}
	fmt.Printf("%d/%d failed\n", failed, *n)
}
```
Adam, I'm suspecting that the crypto/tls package could be more aggressive at noticing that an alertCloseNotify packet is the next thing on the wire (potentially already read by us) and returning (non-zero, io.EOF) from its io.Reader instead of its (non-zero, nil) responses. But that would require crypto/tls using bufio on its input Reader, so we could do bigger reads, get lucky more often, and read in more packets than we asked for. Then, if we were using bufio, we could add *Conn.maybeReadRecord, which returned a packet if available (looking at the unconsumed bytes in the bufio.Reader) without blocking longer, per the io.Reader contract. Then, if what's behind the application data is an alertCloseNotify, we could return io.EOF along with the same Read.

As a side note, it would be nice if the crypto/tls package had support for easily writing out the symmetric keys as needed by Wireshark (http://wiki.wireshark.org/SSL) so theories like this could be more easily investigated with production servers (servers whose private keys I don't know and which are doing weird things).

I will probably hack up the crypto/tls package locally to debug this, but I'm curious about your thoughts on the above two items.
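To make that idea concrete outside of TLS, here is a stripped-down sketch of the general pattern with made-up names (my own illustration, not the crypto/tls implementation): a reader that, when it can already tell the source is exhausted, returns io.EOF together with the final data rather than on a later Read.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
)

// eagerEOFReader returns (n, io.EOF) on the Read that consumes the last
// available byte, rather than (n, nil) followed by (0, io.EOF) later.
// This mirrors the idea of surfacing a buffered close_notify alongside
// the final chunk of application data.
type eagerEOFReader struct {
	br *bufio.Reader
}

func (r *eagerEOFReader) Read(p []byte) (int, error) {
	n, err := r.br.Read(p)
	if err != nil {
		return n, err
	}
	// With a real network connection, Peek could block, so a real
	// implementation would only inspect bytes already buffered;
	// for this in-memory source the check is immediate.
	if r.br.Buffered() == 0 {
		if _, peekErr := r.br.Peek(1); peekErr == io.EOF {
			return n, io.EOF
		}
	}
	return n, nil
}

func main() {
	src := bufio.NewReader(bytes.NewReader([]byte("final response body")))
	r := &eagerEOFReader{br: src}
	buf := make([]byte, 64)
	n, err := r.Read(buf)
	fmt.Printf("read %q, err=%v\n", buf[:n], err) // err is io.EOF on the same call
}
```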
Have added a TODO for supporting key-log in crypto/tls (although only at P3).

As for the EOFs: isn't this equally a problem for plain HTTP? Most sites don't send close_notify alerts, so HTTPS ends up working just the same as HTTP: we just get a TCP FIN. Why is HTTPS particularly problematic in this respect?
Adam, I have no idea why yet. HTTP hasn't been problematic, but HTTPS is terrible. It might just be Amazon, though. I wonder if they're often sending bad hints in their HTTP replies, as they did earlier in this thread: saying they're going to keep the connection alive but then killing it.

I might have to sit on a connection for 10-20 milliseconds before re-using it, just to see if the other side had a close on the way. Or maybe learn at runtime, per-hostname, which hosts lie about keep-alive ability. But I'd rather not.
Comment 25 by jon@colchiscapital.com:

I'm also experiencing this problem while doing large volumes of requests via HTTPS. I'd be interested to know if other HTTP clients' reuse implementations have the same issue or not. It may be that they simply retry and hide this issue from the consumer. A look at the popular Python transports makes me think that's the case there.
Sent part 1: https://golang.org/cl/76400046/
And part 2: https://golang.org/cl/79240044

Status changed to Started.
This issue was closed by revision cc2c5fc.

Status changed to Fixed.
This issue was updated by revision f61f18d.

An io.Reader is permitted to return either (n, nil) or (n, io.EOF) on EOF or other error. The tls package previously always returned (n, nil) for a read of size n if n bytes were available, not surfacing errors at the same time.

Amazon's HTTPS frontends like to hang up on clients without sending the appropriate HTTP headers. (In their defense, they're allowed to hang up any time, but generally a server hangs up after a bit of inactivity, not immediately.) In any case, the Go HTTP client tries to re-use connections by looking at whether the response headers say to keep the connection open, and because the connection looks okay, under heavy load it's possible we'll reuse it immediately, writing the next request, just as the Transport's always-reading goroutine returns from tls.Conn.Read and sees (0, io.EOF).

But because Amazon does send an AlertCloseNotify record before it hangs up on us, and the tls package does its own internal buffering (up to 1024 bytes) of pending data, we have the AlertCloseNotify in an unread buffer when our Conn.Read (to the HTTP Transport code) reads its final bit of data in the HTTP response body.

This change makes that final Read return (n, io.EOF) when an AlertCloseNotify record is buffered right after, if we'd otherwise return (n, nil). A dependent change in the HTTP code then notes whether a client connection has seen an io.EOF and uses that as an additional signal to not reuse an HTTPS connection.

With both changes, the majority of Amazon request failures go away. Without either one, 10-20 goroutines hitting the S3 API leads to such an error rate that empirically up to 5 retries are needed to complete an API call.

LGTM=agl, rsc
R=agl, rsc
CC=golang-codereviews
https://golang.org/cl/76400046
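As a usage note (my own example, not part of the CLs): the io.Reader contract means callers must consume the returned bytes even when the same call reports io.EOF. testing/iotest.DataErrReader simulates exactly that final-data-plus-EOF behavior:

```go
package main

import (
	"fmt"
	"io"
	"strings"
	"testing/iotest"
)

// readAll demonstrates the io.Reader contract the fix relies on: when
// Read returns (n > 0, io.EOF), the caller must consume those n bytes
// before treating the stream as finished.
func readAll(r io.Reader) []byte {
	var out []byte
	buf := make([]byte, 8)
	for {
		n, err := r.Read(buf)
		out = append(out, buf[:n]...) // use the data even if err != nil
		if err != nil {
			return out // io.EOF (or another error) ends the loop
		}
	}
}

func main() {
	// DataErrReader makes the final Read return (n, io.EOF) on the same
	// call, which is the behavior tls.Conn.Read gained in this change.
	r := iotest.DataErrReader(strings.NewReader("response body"))
	fmt.Printf("%q\n", readAll(r))
}
```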
Yes, in cc2c5fc.
This issue was closed.