net/http: http tries to reuse dead TLS connections #3514

Closed
edsrzf opened this issue Apr 12, 2012 · 30 comments

@edsrzf

edsrzf commented Apr 12, 2012

The symptom is that requests occasionally fail with io.EOF. I've observed this when
making requests against AWS's ELB (Elastic Load Balancer), but it might happen elsewhere.

When ELB's load suddenly increases, it can respond with HTTP 503 errors while it warms
up. When it does this over HTTPS, it also sends a TLS close_notify alert but does not
close the connection itself.

Go's http package receives ELB's 503 responses just fine. However, it does not read the
close_notify alert and so doesn't realize that the TLS connection is effectively closed.
It then tries to reuse the same connection for its next request.

The next request's writes all succeed, but ELB is completely ignoring everything. Then
http attempts to read the response, the tls.Conn sees the close_notify alert from the
previous response, and the read returns io.EOF.
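
For readers who want a local stand-in, here is a hedged sketch of a server that mimics
that ELB behavior: it answers, then sends close_notify without closing the TCP
connection. It uses tls.Conn.CloseWrite (which only exists in Go 1.8 and later) and
assumes a server.crt/server.key pair; illustrative only, not the attached repro.

// Sketch of a server that sends close_notify but keeps the TCP
// connection open, like ELB's 503 behavior described above.
package main

import (
        "crypto/tls"
        "fmt"
        "net"
)

func main() {
        cert, err := tls.LoadX509KeyPair("server.crt", "server.key")
        if err != nil {
                panic(err)
        }
        ln, err := tls.Listen("tcp", "127.0.0.1:8443", &tls.Config{
                Certificates: []tls.Certificate{cert},
        })
        if err != nil {
                panic(err)
        }
        for {
                c, err := ln.Accept()
                if err != nil {
                        return
                }
                go func(c net.Conn) {
                        buf := make([]byte, 4096)
                        c.Read(buf) // consume the request; fine for a sketch
                        fmt.Fprint(c, "HTTP/1.1 503 Service Unavailable\r\nContent-Length: 0\r\n\r\n")
                        // Send close_notify but leave the TCP connection open;
                        // the client may not notice until its next read.
                        c.(*tls.Conn).CloseWrite()
                }(c)
        }
}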

I can provide steps to reproduce the problem, but unfortunately they're kind of involved.

1. Sign up for an account on http://www.iron.io/
2. Create a project and take note of its project ID.
3. Get an API token from https://hud.iron.io/tokens
4. Run the attached program until it panics

Attachments:

  1. mq_eof.go (499 bytes)
@bradfitz
Contributor

Comment 1:

Adam, thoughts on this?
I didn't know TLS even had a "close_notify" packet type.
For background: the HTTP transport code opens two goroutines: one for reading and one
for writing.  It stops using a TCP connection for HTTP (or HTTPS) once the remote side
says "Connection: close" (or HTTP/1.0 without Connection keep-alive), or once the reader
goroutine peeks a byte and it's EOF.
There might be two issues here:
1) Amazon ELB is not sending a Connection: close on its 503 response?
2) Peeking a byte on a tls.Conn should return EOF or something once it gets a
close_notify packet?  Or, if it already returns a different error, the net/http
Transport code should learn what it is.  I see no vars declared in crypto/tls, though.
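
To make the reader-goroutine mechanism concrete, here is a minimal, self-contained
sketch (not the actual net/http internals) of the peek-based liveness check described
above: a goroutine peeks one byte on an otherwise idle connection, and an immediate
EOF marks the connection as dead before the next request is written to it.

// The server hangs up right away, standing in for a timed-out keep-alive.
package main

import (
        "bufio"
        "fmt"
        "io"
        "net"
)

func main() {
        ln, err := net.Listen("tcp", "127.0.0.1:0")
        if err != nil {
                panic(err)
        }
        go func() {
                c, _ := ln.Accept()
                c.Close() // peer hangs up on the idle connection
        }()

        conn, err := net.Dial("tcp", ln.Addr().String())
        if err != nil {
                panic(err)
        }
        br := bufio.NewReader(conn)
        // Peek blocks until a byte arrives or the peer closes; EOF here
        // means the connection must not be reused for another request.
        if _, err := br.Peek(1); err == io.EOF {
                fmt.Println("EOF on idle connection: redial instead of reusing")
        }
}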

Owner changed to @bradfitz.

Status changed to Accepted.

@edsrzf
Author

edsrzf commented Apr 12, 2012

Comment 2:

I can confirm that ELB doesn't send a Connection: close on its 503. The only header it
sends with that is Content-Length: 0.

@bradfitz
Contributor

Comment 3:

In that case, it's totally legit AFAIU for us to re-use that connection.  If the Go
client has an outbound request ready before it sees the close_notify packet (assume we
do/will handle it correctly), there's a potential race between us sending a request and
us hearing that the connection is dead.

@edsrzf
Author

edsrzf commented Apr 12, 2012

Comment 4:

I understand that ELB's behaving badly here, but other clients I've tried handle it
gracefully.
I also understand the potential race, but catching a dead TLS connection sometimes seems
better than catching it never.

@bradfitz
Contributor

Comment 5:

Agreed. I just want to set expectations: even if we have a tls.Conn and/or
http.Transport bug here and they get fixed, there will still be a race if you have a
busy client.
But that's likely true with any other language too. If you find that is NOT the case,
I'd be curious what other libraries do to avoid this. Always treating a 503 as a dead
connection seems lame, for instance, if that's the answer.

@agl
Contributor

agl commented Apr 12, 2012

Comment 6:

TLS has a secure EOF called "close_notify". In theory, every half-connection is closed
with a close_notify and any transport EOF without a close_notify is a truncation error.
In practice, things are very messy.
However, I can well believe that we ignore close_notify. If treating it as EOF fixes
this problem then that sounds like a reasonable thing to do.

@edsrzf
Author

edsrzf commented Apr 12, 2012

Comment 7:

I believe close_notify is already treated as EOF. It's just that we don't end up reading
it straight after reading the response.
The sequence goes:
1. Request 1 is sent as normal
2. Response 1 is received as normal (and it happens to be a 503)
3. Request 2 is sent
4. We try to read response 2, but then realize that close_notify was sent straight after
response 1. Request 2 was totally ignored by the server. EOF is returned.

@bradfitz
Contributor

Comment 8:

I don't agree with comment #7's analysis.
The Transport always has a goroutine reading, even if a request hasn't been sent.

@rsc
Contributor

rsc commented Apr 13, 2012

Comment 9:

#7's analysis sounds plausible to me.  Surely there is a race: we could plausibly finish
reading the response and then decide to send another request before the close notify has
arrived.  
That suggests a more general experiment that could be carried out.  Write an HTTP/1.1
server that answers with a framed request (or a Content-Length), so that the client
would think the connection could be reused, but then hang up on the client, say, 2
seconds after writing the response.  If we are keeping connections open in general then
this would simulate the server timing us out at exactly the moment we decide to issue a
new request.  It seems like we should redial in this situation.  Do we?
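
A rough sketch of that experiment follows; the request handling is deliberately
minimal and illustrative, not a real test. The server sends a correctly framed
keep-alive response, then closes the connection two seconds later, simulating a
keep-alive timeout firing just as the client decides to reuse the connection.

package main

import (
        "fmt"
        "net"
        "time"
)

func main() {
        ln, err := net.Listen("tcp", "127.0.0.1:8080")
        if err != nil {
                panic(err)
        }
        for {
                conn, err := ln.Accept()
                if err != nil {
                        return
                }
                go func(c net.Conn) {
                        buf := make([]byte, 4096)
                        c.Read(buf) // consume the request; fine for a sketch
                        // Content-Length framing tells the client the
                        // connection is reusable.
                        fmt.Fprint(c, "HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello")
                        time.Sleep(2 * time.Second) // then hang up
                        c.Close()
                }(conn)
        }
}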

@edsrzf
Author

edsrzf commented Jul 12, 2012

Comment 10:

I think the problem I was having ended up being my fault. I was retrying requests that
resulted in 503, and when that happened I was forgetting to close the response body.
I just need to remember my mantra: "It's always your fault."
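
For reference, the pattern that avoids that mistake: always drain and close the
response body, even on a 503, so the Transport can return the connection to its idle
pool before any retry. A minimal sketch (getOnce is a hypothetical helper):

package main

import (
        "io"
        "io/ioutil"
        "net/http"
)

// getOnce issues one GET and always drains and closes the body, even on
// an error status, so the keep-alive connection stays reusable.
func getOnce(url string) (int, error) {
        resp, err := http.Get(url)
        if err != nil {
                return 0, err
        }
        defer resp.Body.Close()
        io.Copy(ioutil.Discard, resp.Body) // drain any leftover bytes
        return resp.StatusCode, nil
}

func main() {
        status, err := getOnce("https://example.com/")
        if err == nil && status == http.StatusServiceUnavailable {
                // back off and retry here; the body above is already closed
        }
}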

@edsrzf
Author

edsrzf commented Jul 12, 2012

Comment 11:

It still seems strange that the problem only occurred when using HTTPS, though. I
haven't fully thought through why that might be.

@rsc
Contributor

rsc commented Sep 12, 2012

Comment 12:

Labels changed: added priority-later, removed priority-triage.

@rsc
Contributor

rsc commented Sep 12, 2012

Comment 13:

Labels changed: added go1.1maybe.

@bradfitz
Contributor

Comment 14:

Here's a repro (HTTPS + fake keep-alive response with forced disconnect).  With the 500
millisecond sleep uncommented, it's reliable.  Without, you get varying errors:
serve_test.go:699:      Get https://127.0.0.1:38706/: EOF
serve_test.go:699:      Get https://127.0.0.1:58785/: http: can't write HTTP request on
broken connection
// Trying to repro golang.org/issue/3514
func TestTLSServerClosesConnection(t *testing.T) {
        closec := make(chan bool)
        closedc := make(chan bool)
        ts := httptest.NewTLSServer(HandlerFunc(func(w ResponseWriter, r *Request) {
                if strings.Contains(r.URL.Path, "/kill") {
                        conn, _, _ := w.(Hijacker).Hijack()
                        conn.Close()
                        return
                }
                if strings.Contains(r.URL.Path, "/keep-alive-then-die") {
                        conn, _, _ := w.(Hijacker).Hijack()
                        conn.Write([]byte("HTTP/1.1 200 OK\r\nContent-Length: 3\r\n\r\nfoo"))
                        <-closec
                        conn.Close()
                        closedc <- true
                        return
                }
                fmt.Fprintf(w, "hello")
        }))
        defer ts.Close()
        noVerifyTransport := &Transport{
                TLSClientConfig: &tls.Config{
                        InsecureSkipVerify: true,
                },
        }
        client := &Client{Transport: noVerifyTransport}
        _, err := client.Get(ts.URL + "/kill")
        if err == nil {
                t.Error("expected error from /kill request")
        }
        res, err := client.Get(ts.URL + "/keep-alive-then-die")
        if err != nil {
                t.Fatal(err)
        }
        slurp, err := ioutil.ReadAll(res.Body)
        if err != nil {
                t.Fatal(err)
        }
        if string(slurp) != "foo" {
                t.Errorf("Got %q, want foo", slurp)
        }
        closec <- true
        <-closedc
        // time.Sleep(500 * time.Millisecond)
        res, err = client.Get(ts.URL + "/")
        if err != nil {
                t.Fatal(err)
        }
        slurp, err = ioutil.ReadAll(res.Body)
        if err != nil {
                t.Fatal(err)
        }
        if string(slurp) != "hello" {
                t.Errorf("Got %q, want hello", slurp)
        }
}
The latter bug ("http: can't write HTTP request on broken connection") occurs before
we've written anything on the wire, so we could in theory back up a bit and pick a new
conn.
But the EOF one seems more common, and that's the one where we can't really do
anything, short of maybe issue #4677.
Not quite sure what to do here, if anything.

Status changed to Thinking.

@dborowitz
Contributor

Comment 15:

I'm running into a problem that sounds very similar, and wanted to add a few
observations. My test program is pasted at the bottom.
- This problem gets worse with concurrency. Running 1000 concurrent GET requests to
https://www.google.com consistently fails 0.1-1% of the time:
$ go run test_get.go -c -s -n 1000 -u www.google.com | grep failed
2/1000 failed
Up that to 10000 concurrent requests and the rate skyrockets to >80%(!):
$ go run test_get.go -c -s -n 10000 -u www.google.com | grep failed
8441/10000 failed
- This problem also gets worse when redirects are involved, even without concurrency.
(For these I used google.com, which 301s to www.google.com.)
$ go run test_get.go -c -s -n 1000 -u google.com | grep failed
311/1000 failed
$ go run test_get.go -s -n 1000 -u google.com | grep failed
7/1000 failed
- Just as a control, which is what makes me suspect it's this bug, all of the above
report 0 failures without -s.
$ cat test_get.go 
package main
import (
        "flag"
        "fmt"
        "net/http"
)
var (
        n = flag.Int("n", 100, "number of GETs to run")
        s = flag.Bool("s", false, "whether to use HTTPS")
        c = flag.Bool("c", false, "whether to run concurrently")
        u = flag.String("u", "google.com", "URL to fetch, without scheme")
)
func main() {
        flag.Parse()
        out := make(chan error, *n)
        var url string
        if *s {
                url = fmt.Sprintf("https://%v", *u)
        } else {
                url = fmt.Sprintf("http://%v", *u)
        }
        for i := 0; i < *n; i++ {
                f := func() {
                        res, err := http.Get(url)
                        if res != nil {
                                res.Body.Close()
                        }
                        if err != nil {
                                out <- err
                        } else {
                                out <- nil
                        }
                }
                if *c {
                        go f()
                } else {
                        f()
                }
        }
        failed := 0
        for i := 0; i < *n; i++ {
                err := <-out
                if err != nil {
                        fmt.Println(err)
                        failed++
                }
        }
        fmt.Printf("%d/%d failed\n", failed, *n)
}

@rsc
Contributor

rsc commented Mar 12, 2013

Comment 16:

[The time for maybe has passed.]

Labels changed: removed go1.1maybe.

@rsc
Contributor

rsc commented Nov 27, 2013

Comment 17:

Labels changed: added go1.3maybe.

@rsc
Contributor

rsc commented Dec 4, 2013

Comment 18:

Labels changed: added release-none, removed go1.3maybe.

@rsc
Contributor

rsc commented Dec 4, 2013

Comment 19:

Labels changed: added repo-main.

@bradfitz
Contributor

Comment 20:

I've started to hit this a lot, doing many HTTPS connections to Amazon.
Re-adding Go1.3-Maybe, if the fix is simple.

Labels changed: added release-go1.3maybe, removed release-none.

@bradfitz
Contributor

Comment 21:

Adam,
I suspect the crypto/tls package could be more aggressive about noticing that an
alertCloseNotify record is the next thing on the wire (potentially already read by us)
and returning (non-zero, io.EOF) from its io.Readers instead of its (non-zero, nil)
responses.
But that would require crypto/tls to use bufio on its input Reader, so we could do
bigger reads, get lucky more often, and read in more records than we asked for. Then,
with bufio in place, we could add a *Conn.maybeReadRecord that returns a record if one
is already available (looking at the un-consumed bytes in the bufio.Reader) without
blocking, per the io.Reader contract. Then, if an alertCloseNotify sits behind the
application data, we could return io.EOF along with the same Read.
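
A user-space sketch of that buffering hook, for illustration only (the real change
would live inside crypto/tls): after a Read, bufio exposes how many bytes it has
already read ahead (Buffered()), so one can inspect what immediately follows without
another blocking read.

package main

import (
        "bufio"
        "fmt"
        "strings"
)

func main() {
        // Application data followed by a stand-in for a close_notify record.
        src := strings.NewReader("response data<close_notify>")
        br := bufio.NewReader(src)

        buf := make([]byte, len("response data"))
        br.Read(buf) // bufio reads ahead into its internal buffer
        fmt.Printf("read %q\n", buf)

        // Without blocking, check whether more bytes were read ahead.
        if n := br.Buffered(); n > 0 {
                rest, _ := br.Peek(n)
                fmt.Printf("already buffered: %q\n", rest)
        }
}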
As a side note, it would be nice if the crypto/tls package had support for easily
writing out the symmetric keys as needed by wireshark (http://wiki.wireshark.org/SSL) so
theories like this could be more easily investigated with production servers (servers
whose private keys I don't know and are doing weird things).
I will locally hack up the crypto/tls package probably to debug this, but curious your
thoughts on the above two items.
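
(A note for later readers: the key-log wish eventually landed in Go 1.8 as
tls.Config.KeyLogWriter, which emits the NSS key log format Wireshark reads. A minimal
use looks like this; debugging aid only, never enable it in production.)

package main

import (
        "crypto/tls"
        "net/http"
        "os"
)

func main() {
        // Dump TLS session keys so Wireshark (http://wiki.wireshark.org/SSL)
        // can decrypt a capture of this client's HTTPS traffic.
        f, err := os.Create("sslkeylog.txt")
        if err != nil {
                panic(err)
        }
        defer f.Close()

        client := &http.Client{
                Transport: &http.Transport{
                        TLSClientConfig: &tls.Config{KeyLogWriter: f},
                },
        }
        resp, err := client.Get("https://example.com/")
        if err != nil {
                panic(err)
        }
        resp.Body.Close()
}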

@agl
Contributor

agl commented Mar 20, 2014

Comment 22:

Have added a TODO for supporting key-log in crypto/tls (although only at P3).
As for the EOFs:
Isn't this equally a problem for HTTP? Most sites don't send close_notify alerts so it
ends up working just the same as HTTP: we just get a TCP FIN. Why is HTTPS particularly
problematic in this respect?

@bradfitz
Contributor

Comment 23:

Adam, I have no idea why yet.  HTTP hasn't been problematic, but HTTPS is terrible.  It
might just be Amazon, though.  I wonder if they're often sending bad hints in their HTTP
replies, as they did earlier in this thread, saying they're going to do a keep-alive
connection but then killing it.
I might have to sit on a connection for 10-20 milliseconds before re-using it, just to
see if the other side had a close on the way. Or maybe learn at runtime, per-hostname,
which hosts lie about keep-alive ability. But I'd rather not.

@agl
Contributor

agl commented Mar 20, 2014

Comment 24:

Could you retry requests on warm connections that result in EOF? There's always going
to be some amount of window where the server might have given up on a connection.
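
A hedged sketch of that retry idea for idempotent GETs (getWithRetry is a hypothetical
helper; errors.Is needs Go 1.13+, so at the time of this thread one would unwrap the
*url.Error by hand instead):

package main

import (
        "errors"
        "fmt"
        "io"
        "io/ioutil"
        "net/http"
)

// getWithRetry retries once if a request on a reused (warm) connection
// dies with EOF, since the retry redials a fresh connection. Only safe
// as-is for idempotent requests like GET.
func getWithRetry(client *http.Client, url string) ([]byte, error) {
        for attempt := 0; ; attempt++ {
                resp, err := client.Get(url)
                if err != nil {
                        if attempt == 0 && errors.Is(err, io.EOF) {
                                continue // stale keep-alive conn; redial once
                        }
                        return nil, err
                }
                body, rerr := ioutil.ReadAll(resp.Body)
                resp.Body.Close()
                return body, rerr
        }
}

func main() {
        body, err := getWithRetry(http.DefaultClient, "https://example.com/")
        if err != nil {
                panic(err)
        }
        fmt.Printf("got %d bytes\n", len(body))
}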

@gopherbot

Comment 25 by jon@colchiscapital.com:

I'm also experiencing this problem while doing large volumes of requests via HTTPS. 
I'd be interested to know if other HTTP clients' reuse implementations have the same
issue or not. It may be that they simply retry and hide this issue from the consumer. A
look at the popular Python transports makes me think that's the case there.

@bradfitz
Contributor

Comment 26:

Sent part 1: https://golang.org/cl/76400046/
And part 2: https://golang.org/cl/79240044

Status changed to Started.

@bradfitz
Contributor

Comment 27:

This issue was closed by revision cc2c5fc.

Status changed to Fixed.

@bradfitz
Contributor

Comment 28:

This issue was updated by revision f61f18d.

An io.Reader is permitted to return either (n, nil)
or (n, io.EOF) on EOF or other error.
The tls package previously always returned (n, nil) for a read
of size n if n bytes were available, not surfacing errors at
the same time.
Amazon's HTTPS frontends like to hang up on clients without
sending the appropriate HTTP headers. (In their defense,
they're allowed to hang up any time, but generally a server
hangs up after a bit of inactivity, not immediately.) In any
case, the Go HTTP client tries to re-use connections by
looking at whether the response headers say to keep the
connection open, and because the connection looks okay, under
heavy load it's possible we'll reuse it immediately, writing
the next request, just as the Transport's always-reading
goroutine returns from tls.Conn.Read and sees (0, io.EOF).
But because Amazon does send an AlertCloseNotify record before
it hangs up on us, and the tls package does its own internal
buffering (up to 1024 bytes) of pending data, we have the
AlertCloseNotify in an unread buffer when our Conn.Read (to
the HTTP Transport code) reads its final bit of data in the
HTTP response body.
This change makes that final Read return (n, io.EOF) when
an AlertCloseNotify record is buffered right after, if we'd
otherwise return (n, nil).
A dependent change in the HTTP code then notes whether a
client connection has seen an io.EOF and uses that as an
additional signal to not reuse a HTTPS connection. With both
changes, the majority of Amazon request failures go
away. Without either one, 10-20 goroutines hitting the S3 API
leads to such an error rate that empirically up to 5 retries
are needed to complete an API call.
LGTM=agl, rsc
R=agl, rsc
CC=golang-codereviews
https://golang.org/cl/76400046
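
To illustrate the io.Reader contract the commit message leans on: a correct caller
must consume the n bytes before acting on the error, because a reader may return data
and io.EOF from the same call. A minimal sketch:

package main

import (
        "fmt"
        "io"
        "strings"
)

func main() {
        r := strings.NewReader("response body")
        buf := make([]byte, 32)
        for {
                n, err := r.Read(buf)
                if n > 0 {
                        fmt.Printf("read %q\n", buf[:n]) // consume data first
                }
                if err == io.EOF {
                        break // EOF may arrive with or after the final bytes
                }
                if err != nil {
                        panic(err)
                }
        }
}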

@freeformz

@bradfitz Was the dependent change to http ever made? f61f18d is only tls changes.

@bradfitz
Contributor

Yes, in cc2c5fc

This issue was closed.