
net/http, x/net/http2: TCP connection is not closed after http client request times out after missing RST (TCP) or GOAWAY(HTTP/2) frame #36026

Open
jim-minter opened this issue Dec 6, 2019 · 14 comments
Labels: NeedsInvestigation
Milestone: Backlog

Comments

@jim-minter

What version of Go are you using (go version)?

$ go version
go version go1.13.4 linux/amd64

Does this issue reproduce with the latest release?

I believe so.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/data/home/jminter/.cache/go-build"
GOENV="/data/home/jminter/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/data/home/jminter/go"
GOPRIVATE=""
GOPROXY="direct"
GOROOT="/usr/lib/golang"
GOSUMDB="off"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/golang/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build951550205=/tmp/go-build -gno-record-gcc-switches"

What did you do?

With a simple client like this, and using it to make a dumb GET request to some server:

c := &http.Client{
	Timeout: 30 * time.Second,
}

If the server doesn't respond within 30 seconds, the client gets a timeout error as expected. But Go does not close the underlying TCP connection to the server, so if the client keeps retrying, its subsequent requests unfortunately get queued on the same TCP connection.

If the server doesn't close the connection, or if there's an underlying break in TCP connectivity which means the client and server aren't actually communicating, the client doesn't redial to establish a new connection, at least perhaps not until the local kernel socket send buffer fills up, or a TCP keepalive timeout fires, or something similar (conjecture).

You can see this when it happens because netstat continues to show the connection, and you can see its local Recv-Q figure incrementing on each retry.

Setting DisableKeepAlives works around this problem, but it's a pretty big hammer.
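
A minimal sketch of that workaround, for reference (the 30-second timeout simply mirrors the snippet above):

```go
c := &http.Client{
	Timeout: 30 * time.Second,
	Transport: &http.Transport{
		// Every request dials its own TCP connection, so a blackholed
		// connection cannot be reused by later requests.
		DisableKeepAlives: true,
	},
}
```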

IMO it would make more sense for Go to abandon and close the underlying TCP connection if an HTTP timeout has occurred on it.

What did you expect to see?

I'd like to see Go net/http close the underlying TCP connection when an HTTP request running on it times out.

What did you see instead?

It keeps the underlying TCP connection open.

@bradfitz
Contributor

bradfitz commented Dec 6, 2019

Got a repro?

@jim-minter
Author

@bradfitz I understand my problem better now, and I can reproduce this easily, but there's not a straightforward free-standing reproducer.

I think this is actually primarily a problem with Azure load balancers being badly behaved, but there's also a difference in Go client behaviour between HTTP/1.1 and HTTP/2. I do wonder whether there is scope for Go to be more defensive here.

The principal issue is that Azure load balancers by default don't send a TCP RST for incoming packets on a connection they don't recognise.

With my issue, I've got a long-standing connection to Azure CosmosDB (which sits behind an Azure load balancer), and I log onto a VPN which changes the default route on my laptop. The same thing can also happen if there's a network break that lasts longer than the Azure LB idle timeout, or, similarly, if I suspend my laptop for longer than the idle timeout.

When any of these things happen, the open TCP connection gets into a nasty state. The connection can't make progress because Azure doesn't recognise it or has timed it out, but it also doesn't die quickly because Azure blackholes it instead of sending a RST back to the client. Linux keeps sending packets to Azure hoping for a response until it hits the tcp_retries2 timeout and finally gives up.

In the case of a Go HTTP client with a Timeout set, using HTTP/2, when the underlying TCP connection goes bad, individual requests time out, but Go does not collect the connection. All future requests that are multiplexed onto that connection also time out because the underlying connection is basically dead - it cannot make progress.

For me, adding the following after c.Do() helps, although I suspect it's racy in the face of multiple concurrent requests on a single connection:

if err, ok := err.(net.Error); ok && err.Timeout() {
	c.CloseIdleConnections()
}

Setting DisableKeepAlives: true works around the issue safely in the HTTP/2 case, but as mentioned, is a big hammer.

If I force the Go client to HTTP/1.1 (with keep-alives enabled), then when the Timeout hits the connection is automatically collected, which seems to be much better behaviour.

I think in my case, for safety, it makes sense for me to disable the HTTP/2 client for now.
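
For reference, one way to keep the client on HTTP/1.1 is to give the Transport a non-nil, empty TLSNextProto map (a sketch; setting the GODEBUG=http2client=0 environment variable is another option):

```go
tr := &http.Transport{
	// A non-nil, empty TLSNextProto map disables the bundled HTTP/2
	// support, so the client only negotiates HTTP/1.1.
	TLSNextProto: map[string]func(string, *tls.Conn) http.RoundTripper{},
}
c := &http.Client{
	Timeout:   30 * time.Second, // illustrative value
	Transport: tr,
}
```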

I wonder whether, as a safety feature, it would make sense for the Go HTTP/2 client to abandon a connection for future requests if it has had a timeout - or whether this would be too intrusive? Conjecturing, it would also be a useful safety mechanism for any client sitting behind a stupid firewall that drops RSTs, if that brokenness still happens these days.

@toothrot toothrot added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 9, 2019
@toothrot toothrot added this to the Backlog milestone Dec 9, 2019
@odeke-em
Member

Thank you for the report @jim-minter!

To help get us to the last mile and make this issue actionable for Brad and myself, here is some seed code to help with the repro:

package main

import (
	"fmt"
	"log"
	"net"
	"net/http"
	"net/http/httptest"
	"os"
	"os/exec"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	fmt.Printf("Our PID: %d", os.Getpid())
	stallResp := make(chan bool)
	cst := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Printf("Request proto: %s\n", r.Proto)
		<-stallResp
		w.Write([]byte("Done here"))
	}))
	if err := http2.ConfigureServer(cst.Config, new(http2.Server)); err != nil {
		log.Fatalf("Failed to configure HTTP/2 server: %v", err)
	}
	cst.TLS = cst.Config.TLSConfig
	cst.StartTLS()
	defer cst.Close()

	tr := &http.Transport{TLSClientConfig: cst.Config.TLSConfig}
	if err := http2.ConfigureTransport(tr); err != nil {
		log.Fatalf("Failed to configure HTTP2 transport: %v", err)
	}
	tr.TLSClientConfig.InsecureSkipVerify = true
	client := &http.Client{
		Timeout:   400 * time.Millisecond,
		Transport: tr,
	}
	req, _ := http.NewRequest("GET", cst.URL, nil)
	res, err := client.Do(req)
	if err == nil {
		res.Body.Close()
		log.Fatal("Unexpectedly didn't get an error")
	}
	ne := err.(net.Error)
	if !ne.Timeout() {
		log.Fatal("Failed to get a timeout")
	}

	close(stallResp)

	// Now finally print the summary from lsof every 3 seconds.
	for {
		output, _ := lsofUs()
		fmt.Printf("%s\n\n", output)
		<-time.After(3 * time.Second)
	}
}

func lsofUs() ([]byte, error) {
	return exec.Command("lsof", "-p", fmt.Sprintf("%d", os.Getpid())).CombinedOutput()
}

It contains a server and a client, both in HTTP/2 mode. Please tweak it to make it representative of the failures you reported; every 3 seconds it'll print the lsof -p PID output, which also shows the states of the sockets, e.g.

main    34481 emmanuelodeke    8u    IPv4 0xc0636412fd530a8f       0t0      TCP localhost:49402->localhost:49401 (ESTABLISHED)
main    34481 emmanuelodeke    9u    IPv4 0xc06364130796210f       0t0      TCP localhost:49401->localhost:49402 (ESTABLISHED)

which you can then use to verify your report, and which we can also use to diagnose the issue. Thank you!

@jim-minter
Author

@odeke-em, thanks for spurring me into action. Here's a reproducer that works on Linux (note: uses sudo and iptables):

package main

import (
	"crypto/tls"
	"log"
	"net"
	"net/http"
	"net/http/httptest"
	"os/exec"
	"time"
)

// configure returns a TLS test server and corresponding client.  If http2 is
// set, the server is configured to enable HTTP/2; otherwise HTTP/1.x will be
// negotiated.
func configure(dialer func(network, addr string) (net.Conn, error), http2 bool) (s *httptest.Server, c *http.Client) {
	s = httptest.NewUnstartedServer(http.HandlerFunc(func(http.ResponseWriter, *http.Request) {}))
	if http2 {
		s.TLS = &tls.Config{
			NextProtos: []string{"h2"},
		}
	}
	s.StartTLS()

	c = s.Client()
	c.Timeout = 500 * time.Millisecond
	c.Transport.(*http.Transport).Dial = dialer
	c.Transport.(*http.Transport).ForceAttemptHTTP2 = true

	return
}

// doRequest makes a GET request from the client to the server and logs the
// first line of the response.
func doRequest(s *httptest.Server, c *http.Client) {
	resp, err := c.Get(s.URL)
	if err != nil {
		log.Printf("doRequest: %v", err)
		return
	}

	log.Printf("doRequest: %s %s", resp.Proto, resp.Status)

	err = resp.Body.Close()
	if err != nil {
		log.Printf("doRequest: %v", err)
	}
}

// iptables manages a local iptables rule blackholing localhost traffic on a
// given TCP port.  It simulates bad behaviour by Azure load balancers, which by
// default don't send a TCP RST for incoming packets on a connection they don't
// recognise.  If command is "-I", the rule is inserted; if it is "-D", it is
// deleted.  For more details, see:
// https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-tcp-reset
func iptables(command, port string) error {
	log.Printf("iptables: %s %s", command, port)
	return exec.Command("sudo", "iptables", command, "INPUT", "-s", "127.0.0.1", "-p", "tcp", "--sport", port, "-j", "DROP").Run()
}

func test(http2 bool) error {
	var localPort string

	// dialer wraps net.Dial and leaks the local port of the connection it has
	// just dialed to localPort
	dialer := func(network, addr string) (net.Conn, error) {
		log.Printf("dialing %s", addr)
		conn, err := net.Dial(network, addr)
		if err != nil {
			return nil, err
		}

		_, localPort, err = net.SplitHostPort(conn.(*net.TCPConn).LocalAddr().String())
		if err != nil {
			return nil, err
		}

		return conn, nil
	}

	s, c := configure(dialer, http2)

	// first request is expected to succeed (no reason for it to fail)
	doRequest(s, c)

	// now simulate the underlying connection being broken
	defer iptables("-D", localPort)
	err := iptables("-I", localPort)
	if err != nil {
		return err
	}

	// second request is expected to time out because the underlying connection
	// is busted
	doRequest(s, c)

	// with HTTP/1.x, the third and subsequent requests succeed because the
	// broken underlying connection is discarded and a new one is dialed.

	// with HTTP/2, the third and subsequent requests fail because the
	// underlying connection is *not* discarded.  Very approximately, I think
	// that HTTP/2 should not make new requests on an underlying connection if
	// it has seen a request time out on that connection.
	doRequest(s, c)
	doRequest(s, c)

	return nil
}

func main() {
	log.Print("testing HTTP/1.x")
	err := test(false)
	if err != nil {
		panic(err)
	}

	log.Print("testing HTTP/2")
	err = test(true)
	if err != nil {
		panic(err)
	}
}

I get the following output:

2019/12/24 09:03:57 testing HTTP/1.x
2019/12/24 09:03:57 dialing 127.0.0.1:43105
2019/12/24 09:03:57 doRequest: HTTP/1.1 200 OK
2019/12/24 09:03:57 iptables: -I 47758
2019/12/24 09:03:58 doRequest: Get https://127.0.0.1:43105: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2019/12/24 09:03:58 dialing 127.0.0.1:43105
2019/12/24 09:03:58 doRequest: HTTP/1.1 200 OK
2019/12/24 09:03:58 doRequest: HTTP/1.1 200 OK
2019/12/24 09:03:58 iptables: -D 47758
2019/12/24 09:03:58 testing HTTP/2
2019/12/24 09:03:58 dialing 127.0.0.1:41429
2019/12/24 09:03:58 doRequest: HTTP/2.0 200 OK
2019/12/24 09:03:58 iptables: -I 42876
2019/12/24 09:03:58 doRequest: Get https://127.0.0.1:41429: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2019/12/24 09:03:59 doRequest: Get https://127.0.0.1:41429: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2019/12/24 09:03:59 doRequest: Get https://127.0.0.1:41429: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2019/12/24 09:03:59 iptables: -D 42876

Thanks!

@odeke-em
Member

Awesome, thank you for the complete repro @jim-minter!

I'll update this issue to match the problem and also kindly ask our HTTP/2 experts @bradfitz @rs @tombergan:

1. How do we deal with a missing GOAWAY frame from the server? It sounds like an HTTP/2 spec violation to me.
2. Should we be checking x/net/http2's RoundTripOpts for errors that are non-retryable, such as context.Canceled while waiting for headers? And how do we differentiate a timeout caused by no response from a server that's simply still processing?

As reported in this issue, the HTTP/1 stack already closes connections on such errors, but it assumes that the HTTP/2 stack will do the same itself, as per

go/src/net/http/transport.go

Lines 2576 to 2582 in c170b14

// Close HTTP/1 (pc.alt == nil) connection.
// HTTP/2 closes its connection itself.
if pc.alt == nil {
	if err != errCallerOwnsConn {
		pc.conn.Close()
	}
	close(pc.closech)

@odeke-em odeke-em changed the title net/http: TCP connection is not closed after http client request times out net/http, x/net/http2: TCP connection is not closed after http client request times out after missing RST (TCP) or GOAWAY(HTTP/2) frame Dec 26, 2019
@horgh
Contributor

horgh commented Jan 12, 2021

#30702 seems related.

I've observed a similar situation in that a dead connection will continue to be used for new requests when using HTTP/2, but with HTTP/1.1 this is not the case. For example, I can trigger this by blocking the server IP with iptables after establishing the connection; my requests then fail for minutes at a time with HTTP/2 due to the cached dead connection continuing to be used. For my application I'm thinking of disabling HTTP/2 for the time being because of this.

@networkimprov

cc @fraenkel

@fraenkel
Contributor

My best recommendation is to enable the PingTimeout to detect connections that have been improperly torn down.

@horgh
Contributor

horgh commented Jan 14, 2021

Cool, thank you! Yes, enabling that does help this situation. I set ReadIdleTimeout to enable that functionality.

@fraenkel
Contributor

Yes. That will enable the health checks.

bulldozer-bot bot pushed a commit to palantir/conjure-go-runtime that referenced this issue Jun 21, 2021
Configure the HTTP2 transport to handle broken/idle connections

Two interrelated timeout values were added: ReadIdleTimeout and PingTimeout.
ReadIdleTimeout enables a ping health check at the configured interval whenever no frame has been received on the HTTP/2 connection. PingTimeout configures how long to wait for a ping response before closing the connection. Both of these timeouts assist in cleaning up broken or idle connections, forcing the client to reconnect.
This is the current known workaround for the following Go issue:
golang/go#36026
@ibejohn818

ibejohn818 commented May 8, 2022

In go1.14.x we were experiencing the same TCP connection leak. We use a shared http.Client w/RateLimiter to make hundreds of simultaneous http1.1 requests. Servers accepting requests have a hard timeout of 5m as does the shared http.Client Transport config. (Accepting servers are also behind multiple proxy layers)

Setting DisableKeepAlives: true in the http.Client Transport config, and also using http.NewRequestWithContext and canceling the context (defer cancel()) when I was done with the connection (success or error), has cleared up the leaking connections.
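
A rough sketch of that pattern, assuming client is the shared http.Client described above (the URL and timeout are placeholders):

```go
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer cancel() // always cancel when done with the request, success or error

req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://example.com/", nil)
if err != nil {
	log.Fatal(err)
}
resp, err := client.Do(req)
if err != nil {
	log.Printf("request failed: %v", err)
	return
}
defer resp.Body.Close()
```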

@davecheney
Contributor

@ibejohn818 my recommendation would be to upgrade to a supported version of Go, 1.18.1 being the recommended version at the time of writing. If that doesn’t address your issue I suggest creating a new issue with details of how to reproduce the problem you are seeing.

@mbyio

mbyio commented Mar 7, 2023

I believe we encountered this issue last week, and it caused an outage. We "fixed" the outage by restarting the Go processes, which obviously forced them to reconnect. We have a workaround in place for now. I haven't started trying to reproduce it yet.

We were using Go 1.18.x and instrumented the code with Datadog and the Go net/http/httptrace library. We were connecting from our Go service to an AWS ALB serving HTTPS (HTTP 1.1 and HTTP 2), and we were setting a Timeout on the Go HTTP Client, and wrapping it with https://github.com/hashicorp/go-retryablehttp. We have many other services that have never encountered this issue before, but we also don't usually use the Timeout setting.

Thanks to the tracing, we can see that for each failed HTTP request to the ALB, the round tripper recorded several pairs of GetConn and GotConn traces. On each of the GotConn traces, the Reused and WasIdle flags are set to true. No other traces were recorded, then the overall HTTP request failed due to the Timeout setting.
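
For anyone who wants to observe the same signal without Datadog, a minimal net/http/httptrace hookup that logs the Reused and WasIdle flags might look like this (req and client are assumed to already exist):

```go
trace := &httptrace.ClientTrace{
	GetConn: func(hostPort string) {
		log.Printf("GetConn: %s", hostPort)
	},
	GotConn: func(info httptrace.GotConnInfo) {
		// In the failure mode described above, these log lines keep showing
		// reused=true wasIdle=true because the dead connection is handed out again.
		log.Printf("GotConn: reused=%v wasIdle=%v idleTime=%v",
			info.Reused, info.WasIdle, info.IdleTime)
	},
}
req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
resp, err := client.Do(req)
if err != nil {
	log.Printf("request failed: %v", err)
} else {
	resp.Body.Close()
}
```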

EdSchouten added a commit to buildbarn/bb-storage that referenced this issue Aug 30, 2023
I have noticed that HTTP/2 tends to be unreliable in certain
circumstances due to bugs in Go's HTTP/2 library. See the following
upstream issues:

golang/go#36026
golang/go#59690

Let's add a configuration option here, so that HTTP/2 can be disabled on
an individual basis if needed.
@edwardrf

For anyone facing the same issue where the HTTP/2 TCP connection doesn't get closed properly, this is how you can enable the ping health check to close the bad TCP connection:

t1 := &http.Transport{}
t2, err := http2.ConfigureTransports(t1)
if err != nil {
	log.Printf("Failed to configure HTTP2 transport: %v", err)
} else {
	// Add timeout settings to the http2 transport to prevent a bad TCP connection from hanging requests for too long
	// Doc: https://pkg.go.dev/golang.org/x/net/http2#Transport
	//  - ReadIdleTimeout is how long to wait, with no frame received on a connection, before sending a ping
	//  - PingTimeout is how long to wait for the ping response before closing the connection
	// So in total, if a TCP connection goes bad, it takes the combined time before the connection is closed
	t2.ReadIdleTimeout = 10 * time.Second
	t2.PingTimeout = 5 * time.Second
}

client := &http.Client{
	Transport: t1,
	Timeout:   15 * time.Second,
}

req, err := http.NewRequest("GET", "https://www.google.com/", nil)
if err != nil {
	log.Printf("Failed to create request: %v", err)
}
resp, err := client.Do(req)
if err != nil {
	log.Printf("Failed to do request: %v", err)
} else {
	resp.Body.Close()
}

dmitryax pushed a commit to open-telemetry/opentelemetry-collector that referenced this issue Dec 8, 2023
**Description:** 
This PR introduces options to enable `http/2` health check by exposing
`HTTP2ReadIdleTimeout` and `HTTP2PingTimeout`

The golang issues are:
golang/go#59690
golang/go#36026

In summary, if due to an environmental issue the underlying TCP connection used by the HTTP/2 client in the exporter becomes unstable/unusable/unreachable, then unlike HTTP/1, the HTTP/2 client does not forcibly close the connection and redial a new one; instead it keeps using it for ~15 minutes (with the default value of the OS `tcp_retries2` setting) until the OS cleans it up and a new TCP connection gets established.
From an OTEL user's perspective, this shows up as a spike in export failures/timeouts for ~15 minutes. This will happen for every connection that got into a bad state; after 15 minutes things recover, until the next time a TCP connection goes bad.


**Testing:**
- Run OTEL with one of the exporters that uses HTTP/2 client, example
`signalfx` exporter
- For simplicity use a single pipeline/exporter
- In a different shell, run this to watch the TCP state of the established connection:
```
while (true); do date; sudo netstat -anp | grep -E '<endpoint_ip_address(es)>' | sort -k 5; sleep 2; done
```
- From the netstat output, take note of the source port and the source IP address
- Replace the <> placeholders with the values from the previous step:
`sudo iptables -A OUTPUT -s <source_IP> -p tcp --sport <source_Port> -j DROP`
- Note how the OTEL exporter's exports start timing out

Expected Result:
- A new connection should be established, similarly to HTTP/1, and exports should succeed

Actual Result:
- The exports keep failing for ~15 minutes, or for whatever the OS `tcp_retries2` is configured to
- After 15 minutes, a new TCP connection is created and exports start working

**Documentation:** 
Readme is updated with new settings

Signed-off-by: Dani Louca <dlouca@splunk.com>
pantuza pushed a commit to pantuza/opentelemetry-collector that referenced this issue Dec 8, 2023