
net/http, x/net/http2: TCP connection is not closed after http client request times out after missing RST (TCP) or GOAWAY(HTTP/2) frame #36026

Open
jim-minter opened this issue Dec 6, 2019 · 14 comments
Labels: NeedsInvestigation
Milestone: Backlog

Comments

@jim-minter

What version of Go are you using (go version)?

$ go version
go version go1.13.4 linux/amd64

Does this issue reproduce with the latest release?

I believe so.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/data/home/jminter/.cache/go-build"
GOENV="/data/home/jminter/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/data/home/jminter/go"
GOPRIVATE=""
GOPROXY="direct"
GOROOT="/usr/lib/golang"
GOSUMDB="off"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/golang/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build951550205=/tmp/go-build -gno-record-gcc-switches"

What did you do?

With a simple client like this, and using it to make a dumb GET request to some server:

c := &http.Client{
	Timeout: 30 * time.Second,
}

If the server doesn't respond within 30 seconds, the client gets a timeout error as expected. But Go does not close the underlying TCP connection to the server, so if the client keeps retrying, its subsequent requests unfortunately get queued on the same TCP connection.

If the server doesn't close the connection, or if there's an underlying break in TCP connectivity which means the client and server aren't actually communicating, the client doesn't redial to establish a new connection, at least perhaps not until the local kernel socket send buffer fills up, or a TCP keepalive timeout fires, or something similar (conjecture).

You can see this when it happens because netstat continues to show the connection, and you can see its local Recv-Q figure incrementing on each retry.

Setting DisableKeepAlives works around this problem, but it's a pretty big hammer.
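
A minimal sketch of that workaround, for reference (the 30-second timeout simply mirrors the snippet above):

```go
c := &http.Client{
	Timeout: 30 * time.Second,
	Transport: &http.Transport{
		// Every request dials its own TCP connection, so a blackholed
		// connection cannot be reused by later requests.
		DisableKeepAlives: true,
	},
}
```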

IMO it would make more sense for Go to abandon and close the underlying TCP connection if an HTTP timeout has occurred on it.

What did you expect to see?

I'd like to see Go net/http close the underlying TCP connection when an HTTP request running on it times out.

What did you see instead?

It keeps the underlying TCP connection open.

@bradfitz
Contributor

bradfitz commented Dec 6, 2019

Got a repro?

@jim-minter
Author

@bradfitz I understand my problem better now, and I can reproduce this easily, but there's not a straightforward free-standing reproducer.

I think this is actually primarily a problem with Azure load balancers being badly behaved, but there's also a difference in Go client behaviour between HTTP/1.1 and HTTP/2. I do wonder whether there is scope for Go to be more defensive here.

The principal issue is that Azure load balancers by default don't send a TCP RST for incoming packets on a connection they don't recognise.

With my issue, I've got a long-standing connection to Azure CosmosDB (which sits behind an Azure load balancer), and I log onto a VPN which changes the default route on my laptop. The same thing can also happen if there's a network break that lasts longer than the Azure LB idle timeout, or, similarly, if I suspend my laptop for longer than the idle timeout.

When any of these things happen, the open TCP connection gets into a nasty state. The connection can't make progress because Azure doesn't recognise it or has timed it out, but it also doesn't die quickly because Azure blackholes it instead of sending a RST back to the client. Linux keeps sending packets to Azure hoping for a response until it hits the tcp_retries2 timeout and finally gives up.

In the case of a Go HTTP client with a Timeout set, using HTTP/2, when the underlying TCP connection goes bad, individual requests time out, but Go does not collect the connection. All future requests that are multiplexed onto that connection also time out because the underlying connection is basically dead - it cannot make progress.

For me, adding the following after c.Do() helps, although I suspect it's racy in the face of multiple concurrent requests on a single connection:

if err, ok := err.(net.Error); ok && err.Timeout() {
	c.CloseIdleConnections()
}

Setting DisableKeepAlives: true works around the issue safely in the HTTP/2 case, but as mentioned, is a big hammer.

If I force the Go client to HTTP/1.1 (with keep-alives enabled), then when the Timeout hits the connection is automatically collected, which seems to be much better behaviour.

I think in my case, for safety, it makes sense for me to disable the HTTP/2 client for now.
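
For reference, one way to keep the client on HTTP/1.1 is to give the Transport a non-nil, empty TLSNextProto map (a sketch; setting the GODEBUG=http2client=0 environment variable is another option):

```go
tr := &http.Transport{
	// A non-nil, empty TLSNextProto map disables the bundled HTTP/2
	// support, so the client only negotiates HTTP/1.1.
	TLSNextProto: map[string]func(string, *tls.Conn) http.RoundTripper{},
}
c := &http.Client{
	Timeout:   30 * time.Second, // illustrative value
	Transport: tr,
}
```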

I wonder whether, as a safety feature, it would make sense for the Go HTTP/2 client to abandon a connection for future requests if it has had a timeout - or whether this would be too intrusive? Conjecturing, it would also be a useful safety mechanism for any client sitting behind a stupid firewall that drops RSTs, if that brokenness still happens these days.

@toothrot toothrot added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 9, 2019
@toothrot toothrot added this to the Backlog milestone Dec 9, 2019
@odeke-em
Member

Thank you for the report @jim-minter!

To help get us to the last mile and make this issue actionable for Brad and myself, here is some seed code to help with the repro:

package main

import (
	"fmt"
	"log"
	"net"
	"net/http"
	"net/http/httptest"
	"os"
	"os/exec"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	fmt.Printf("Our PID: %d", os.Getpid())
	stallResp := make(chan bool)
	cst := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Printf("Request proto: %s\n", r.Proto)
		<-stallResp
		w.Write([]byte("Done here"))
	}))
	if err := http2.ConfigureServer(cst.Config, new(http2.Server)); err != nil {
		log.Fatalf("Failed to configure HTTP/2 server: %v", err)
	}
	cst.TLS = cst.Config.TLSConfig
	cst.StartTLS()
	defer cst.Close()

	tr := &http.Transport{TLSClientConfig: cst.Config.TLSConfig}
	if err := http2.ConfigureTransport(tr); err != nil {
		log.Fatalf("Failed to configure HTTP2 transport: %v", err)
	}
	tr.TLSClientConfig.InsecureSkipVerify = true
	client := &http.Client{
		Timeout:   400 * time.Millisecond,
		Transport: tr,
	}
	req, _ := http.NewRequest("GET", cst.URL, nil)
	res, err := client.Do(req)
	if err == nil {
		res.Body.Close()
		log.Fatal("Unexpectedly didn't get an error")
	}
	ne := err.(net.Error)
	if !ne.Timeout() {
		log.Fatal("Failed to get a timeout")
	}

	close(stallResp)

	// Now finally print the summary from lsof every 3 seconds.
	for {
		output, _ := lsofUs()
		fmt.Printf("%s\n\n", output)
		<-time.After(3 * time.Second)
	}
}

func lsofUs() ([]byte, error) {
	return exec.Command("lsof", "-p", fmt.Sprintf("%d", os.Getpid())).CombinedOutput()
}

It contains a server and a client, both in HTTP/2 mode. Please tweak it to make it representative of the failures you reported; every 3 seconds it'll print the lsof -p PID output, which also shows the states of the sockets, e.g.

main    34481 emmanuelodeke    8u    IPv4 0xc0636412fd530a8f       0t0      TCP localhost:49402->localhost:49401 (ESTABLISHED)
main    34481 emmanuelodeke    9u    IPv4 0xc06364130796210f       0t0      TCP localhost:49401->localhost:49402 (ESTABLISHED)

which you can then use to verify your report, and which we can also use to diagnose the issue. Thank you!

@jim-minter
Author

@odeke-em, thanks for spurring me into action. Here's a reproducer that works on Linux (note: uses sudo and iptables):

package main

import (
	"crypto/tls"
	"log"
	"net"
	"net/http"
	"net/http/httptest"
	"os/exec"
	"time"
)

// configure returns a TLS test server and corresponding client.  If http2 is
// set, the server is configured to enable HTTP/2; otherwise HTTP/1.x will be
// negotiated.
func configure(dialer func(network, addr string) (net.Conn, error), http2 bool) (s *httptest.Server, c *http.Client) {
	s = httptest.NewUnstartedServer(http.HandlerFunc(func(http.ResponseWriter, *http.Request) {}))
	if http2 {
		s.TLS = &tls.Config{
			NextProtos: []string{"h2"},
		}
	}
	s.StartTLS()

	c = s.Client()
	c.Timeout = 500 * time.Millisecond
	c.Transport.(*http.Transport).Dial = dialer
	c.Transport.(*http.Transport).ForceAttemptHTTP2 = true

	return
}

// doRequest makes a GET request from the client to the server and logs the
// first line of the response.
func doRequest(s *httptest.Server, c *http.Client) {
	resp, err := c.Get(s.URL)
	if err != nil {
		log.Printf("doRequest: %v", err)
		return
	}

	log.Printf("doRequest: %s %s", resp.Proto, resp.Status)

	err = resp.Body.Close()
	if err != nil {
		log.Printf("doRequest: %v", err)
	}
}

// iptables manages a local iptables rule blackholing localhost traffic on a
// given TCP port.  It simulates bad behaviour by Azure load balancers, which by
// default don't send a TCP RST for incoming packets on a connection they don't
// recognise.  If command is "-I", the rule is inserted; if it is "-D", it is
// deleted.  For more details, see:
// https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-tcp-reset
func iptables(command, port string) error {
	log.Printf("iptables: %s %s", command, port)
	return exec.Command("sudo", "iptables", command, "INPUT", "-s", "127.0.0.1", "-p", "tcp", "--sport", port, "-j", "DROP").Run()
}

func test(http2 bool) error {
	var localPort string

	// dialer wraps net.Dial and leaks the local port of the connection it has
	// just dialed to localPort
	dialer := func(network, addr string) (net.Conn, error) {
		log.Printf("dialing %s", addr)
		conn, err := net.Dial(network, addr)
		if err != nil {
			return nil, err
		}

		_, localPort, err = net.SplitHostPort(conn.(*net.TCPConn).LocalAddr().String())
		if err != nil {
			return nil, err
		}

		return conn, nil
	}

	s, c := configure(dialer, http2)

	// first request is expected to succeed (no reason for it to fail)
	doRequest(s, c)

	// now simulate the underlying connection being broken
	defer iptables("-D", localPort)
	err := iptables("-I", localPort)
	if err != nil {
		return err
	}

	// second request is expected to time out because the underlying connection
	// is busted
	doRequest(s, c)

	// with HTTP/1.x, the third and subsequent requests succeed because the
	// broken underlying connection is discarded and a new one is dialed.

	// with HTTP/2, the third and subsequent requests fail because the
	// underlying connection is *not* discarded.  Very approximately, I think
	// that HTTP/2 should not make new requests on an underlying connection if
	// it has seen a request time out on that connection.
	doRequest(s, c)
	doRequest(s, c)

	return nil
}

func main() {
	log.Print("testing HTTP/1.x")
	err := test(false)
	if err != nil {
		panic(err)
	}

	log.Print("testing HTTP/2")
	err = test(true)
	if err != nil {
		panic(err)
	}
}

I get the following output:

2019/12/24 09:03:57 testing HTTP/1.x
2019/12/24 09:03:57 dialing 127.0.0.1:43105
2019/12/24 09:03:57 doRequest: HTTP/1.1 200 OK
2019/12/24 09:03:57 iptables: -I 47758
2019/12/24 09:03:58 doRequest: Get https://127.0.0.1:43105: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2019/12/24 09:03:58 dialing 127.0.0.1:43105
2019/12/24 09:03:58 doRequest: HTTP/1.1 200 OK
2019/12/24 09:03:58 doRequest: HTTP/1.1 200 OK
2019/12/24 09:03:58 iptables: -D 47758
2019/12/24 09:03:58 testing HTTP/2
2019/12/24 09:03:58 dialing 127.0.0.1:41429
2019/12/24 09:03:58 doRequest: HTTP/2.0 200 OK
2019/12/24 09:03:58 iptables: -I 42876
2019/12/24 09:03:58 doRequest: Get https://127.0.0.1:41429: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2019/12/24 09:03:59 doRequest: Get https://127.0.0.1:41429: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2019/12/24 09:03:59 doRequest: Get https://127.0.0.1:41429: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2019/12/24 09:03:59 iptables: -D 42876

Thanks!

@odeke-em
Member

Awesome, thank you for the complete repro @jim-minter!

I'll update this issue to match the problem and also kindly ask our HTTP/2 experts @bradfitz @rs @tombergan:

1. How do we deal with a missing GOAWAY frame from the server? It sounds like an HTTP/2 spec violation to me.
2. Should we be checking x/net/http2's RoundTripOpts for errors that are non-retryable, such as context.Canceled while waiting for headers? And how do we differentiate a timeout caused by no response from a server that's simply still processing?

As reported in this issue, the HTTP/1 stack already closes connections on such errors, but it assumes that the HTTP/2 stack will do the same itself, as per

go/src/net/http/transport.go

Lines 2576 to 2582 in c170b14

// Close HTTP/1 (pc.alt == nil) connection.
// HTTP/2 closes its connection itself.
if pc.alt == nil {
	if err != errCallerOwnsConn {
		pc.conn.Close()
	}
	close(pc.closech)

@odeke-em odeke-em changed the title net/http: TCP connection is not closed after http client request times out net/http, x/net/http2: TCP connection is not closed after http client request times out after missing RST (TCP) or GOAWAY(HTTP/2) frame Dec 26, 2019
@horgh
Contributor

horgh commented Jan 12, 2021

#30702 seems related.

I've observed a similar situation in that a dead connection will continue to be used for new requests when using HTTP/2, but with HTTP/1.1 this is not the case. For example, I can trigger this by blocking the server IP with iptables after establishing the connection; my requests then fail for minutes at a time with HTTP/2 due to the cached dead connection continuing to be used. For my application I'm thinking of disabling HTTP/2 for the time being because of this.

@networkimprov

cc @fraenkel

@fraenkel
Contributor

My best recommendation is to enable the PingTimeout to detect connections that have been improperly torn down.

@horgh
Contributor

horgh commented Jan 14, 2021

Cool, thank you! Yes, enabling that does help this situation. I set ReadIdleTimeout to enable that functionality.

@fraenkel
Contributor

Yes. That will enable the health checks.

bulldozer-bot bot pushed a commit to palantir/conjure-go-runtime that referenced this issue Jun 21, 2021
Configure the HTTP2 transport to handle broken/idle connections

Two interrelated timeout values were added: ReadIdleTimeout and PingTimeout.
ReadIdleTimeout enables a ping health check at the configured interval whenever no frame has been received on the HTTP/2 connection. PingTimeout configures how long to wait for a ping response before closing the connection. Both of these timeouts assist in cleaning up broken or idle connections, forcing the client to reconnect.
This is the current known workaround for the following Go issue:
golang/go#36026
@ibejohn818

ibejohn818 commented May 8, 2022

In go1.14.x we were experiencing the same TCP connection leak. We use a shared http.Client w/RateLimiter to make hundreds of simultaneous http1.1 requests. Servers accepting requests have a hard timeout of 5m as does the shared http.Client Transport config. (Accepting servers are also behind multiple proxy layers)

Setting DisableKeepAlives: true in the http.Client Transport config, and also using http.NewRequestWithContext and canceling the context (defer cancel()) when I was done with the connection (success or error), has cleared up the leaking connections.
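
A rough sketch of that pattern, assuming client is the shared http.Client described above (the URL and timeout are placeholders):

```go
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer cancel() // always cancel when done with the request, success or error

req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://example.com/", nil)
if err != nil {
	log.Fatal(err)
}
resp, err := client.Do(req)
if err != nil {
	log.Printf("request failed: %v", err)
	return
}
defer resp.Body.Close()
```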

@davecheney
Contributor

@ibejohn818 my recommendation would be to upgrade to a supported version of Go, 1.18.1 being the recommended version at the time of writing. If that doesn’t address your issue I suggest creating a new issue with details of how to reproduce the problem you are seeing.

@mbyio

mbyio commented Mar 7, 2023

I believe we encountered this issue last week, and it caused an outage. We "fixed" the outage by restarting the Go processes, which obviously forced them to reconnect. We have a workaround in place for now. I haven't started trying to reproduce it yet.

We were using Go 1.18.x and instrumented the code with Datadog and the Go net/http/httptrace library. We were connecting from our Go service to an AWS ALB serving HTTPS (HTTP 1.1 and HTTP 2), and we were setting a Timeout on the Go HTTP Client, and wrapping it with https://github.com/hashicorp/go-retryablehttp. We have many other services that have never encountered this issue before, but we also don't usually use the Timeout setting.

Thanks to the tracing, we can see that for each failed HTTP request to the ALB, the round tripper recorded several pairs of GetConn and GotConn traces. On each of the GotConn traces, the Reused and WasIdle flags are set to true. No other traces were recorded, then the overall HTTP request failed due to the Timeout setting.
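
For anyone who wants to observe the same signal without Datadog, a minimal net/http/httptrace hookup that logs the Reused and WasIdle flags might look like this (req and client are assumed to already exist):

```go
trace := &httptrace.ClientTrace{
	GetConn: func(hostPort string) {
		log.Printf("GetConn: %s", hostPort)
	},
	GotConn: func(info httptrace.GotConnInfo) {
		// In the failure mode described above, these log lines keep showing
		// reused=true wasIdle=true because the dead connection is handed out again.
		log.Printf("GotConn: reused=%v wasIdle=%v idleTime=%v",
			info.Reused, info.WasIdle, info.IdleTime)
	},
}
req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
resp, err := client.Do(req)
if err != nil {
	log.Printf("request failed: %v", err)
} else {
	resp.Body.Close()
}
```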

EdSchouten added a commit to buildbarn/bb-storage that referenced this issue Aug 30, 2023
I have noticed that HTTP/2 tends to be unreliable in certain
circumstances due to bugs in Go's HTTP/2 library. See the following
upstream issues:

golang/go#36026
golang/go#59690

Let's add a configuration option here, so that HTTP/2 can be disabled on
an individual basis if needed.
@edwardrf

For anyone facing the same issue where the HTTP/2 TCP connection doesn't get closed properly, this is how you can enable the ping health check to close the bad TCP connection:

t1 := &http.Transport{}
t2, err := http2.ConfigureTransports(t1)
if err != nil {
	log.Printf("Failed to configure HTTP2 transport: %v", err)
} else {
	// Add timeout settings to the http2 transport to prevent a bad TCP connection from hanging requests for too long
	// Doc: https://pkg.go.dev/golang.org/x/net/http2#Transport
	//  - ReadIdleTimeout is how long to wait, with no frame received on a connection, before sending a ping
	//  - PingTimeout is how long to wait for the ping response before closing the connection
	// So in total, if a TCP connection goes bad, it takes the combined time before the connection is closed
	t2.ReadIdleTimeout = 10 * time.Second
	t2.PingTimeout = 5 * time.Second
}

client := &http.Client{
	Transport: t1,
	Timeout:   15 * time.Second,
}

req, err := http.NewRequest("GET", "https://www.google.com/", nil)
if err != nil {
	log.Printf("Failed to create request: %v", err)
}
resp, err := client.Do(req)
if err != nil {
	log.Printf("Failed to do request: %v", err)
} else {
	resp.Body.Close()
}

dmitryax pushed a commit to open-telemetry/opentelemetry-collector that referenced this issue Dec 8, 2023
**Description:** 
This PR introduces options to enable `http/2` health check by exposing
`HTTP2ReadIdleTimeout` and `HTTP2PingTimeout`

The golang issues are:
golang/go#59690
golang/go#36026

In summary, if due to an environmental issue the underlying TCP connection used by the HTTP/2 client in the exporter becomes unstable/unusable/unreachable, then unlike HTTP/1, the HTTP/2 client does not forcibly close the connection and redial a new one; instead it keeps using it for ~15 minutes (with the default value of the OS `tcp_retries2` setting) until the OS cleans it up and a new TCP connection gets established.
From an OTEL user's perspective, this shows up as a spike in export failures/timeouts for ~15 minutes. This will happen for every connection that got into a bad state; after 15 minutes things recover, until the next time a TCP connection goes bad.


**Testing:**
- Run OTEL with one of the exporters that uses HTTP/2 client, example
`signalfx` exporter
- For simplicity use a single pipeline/exporter
- In a different shell, run this to watch the TCP state of the established connection:
```
while (true); do date; sudo netstat -anp | grep -E '<endpoint_ip_address(es)>' | sort -k 5; sleep 2; done
```
- From the netstat output, take note of the source port and the source IP address
- Replace the <> placeholders with the values from the previous step:
`sudo iptables -A OUTPUT -s <source_IP> -p tcp --sport <source_Port> -j DROP`
- Note how the OTEL exporter's exports start timing out

Expected Result:
- A new connection should be established, similarly to HTTP/1, and exports should succeed

Actual Result:
- The exports keep failing for ~15 minutes, or for whatever the OS `tcp_retries2` is configured to
- After 15 minutes, a new TCP connection is created and exports start working

**Documentation:** 
Readme is updated with new settings

Signed-off-by: Dani Louca <dlouca@splunk.com>
pantuza pushed a commit to pantuza/opentelemetry-collector that referenced this issue Dec 8, 2023