
net/http: Transport wastes TLS handshake progress #50984

Open
rhysh opened this issue Feb 2, 2022 · 7 comments

Labels: NeedsInvestigation — Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone: Backlog

rhysh (Contributor) commented Feb 2, 2022

What version of Go are you using (go version)?

$ go1.16 version
go version go1.16.13 darwin/arm64

$ go1.17 version
go version go1.17.6 darwin/arm64

$ go1.18 version
go version devel go1.18-41f485b9a7 Mon Jan 31 13:43:52 2022 +0000 darwin/arm64
# aka go1.18beta2

Does this issue reproduce with the latest release?

Yes, this is present in the latest stable release, go1.17.6, and the latest tagged beta. It is not present in go1.16.13. It was introduced by the work for #32406.

What operating system and processor architecture are you using (go env)?

go env Output
$ go1.17 env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/Users/rhys/Library/Caches/go-build"
GOENV="/Users/rhys/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/rhys/go/pkg/mod"
GONOPROXY="*"
GONOSUMDB="*"
GOOS="darwin"
GOPATH="/Users/rhys/go"
GOPRIVATE="*"
GOPROXY="direct"
GOROOT="/Users/rhys/go/version/go1.17"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/rhys/go/version/go1.17/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="go1.17.6"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/lf/n4ry0qv5639_3z0nhyhy3z7c0000gs/T/go-build3534659518=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

An application I work with uses net/http to make outbound requests over HTTP/1.1+TLS. It uses Context values to control the lifecycle of those requests and to set timeouts. It cleans up the Context values (to avoid leaking timers, as recommended by go vet) when the outbound HTTPS call is complete.

Here's an overview of the most relevant code:

func callWithTimeout(ctx context.Context) {
    ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
    defer cancel()
    doRequest(ctx) // uses net/http package for HTTPS call several layers down
}

I've included a full reproducer at the bottom; it's also available at https://go.dev/play/p/_vg00gqqbpG, although the Playground's network stack interferes slightly with the results.

What did you expect to see?

I expected that code pattern (use Context to set a timeout, clean up immediately after the call) to be a best practice. I expected it to lead to efficient use of compute resources. I expected it to provide good performance (not much unnecessary latency).

What did you see instead?

After upgrading the application from Go 1.16 to Go 1.17, the load balancer in front of the HTTPS service this application calls saw about a 4x increase in the volume of inbound TLS handshake requests, while reporting no change in the typical number of concurrent connections. A 4x increase in TLS handshake volume for this service is enough to throw off capacity planning.

The reproducer shows that canceling the Context value used for a request that is already complete harms the connection pool. It shows that when a request finds an empty pool and therefore triggers creation of a new connection, the link between that request and the new connection outlives the request's need for that specific connection.

When another request to the same backend completes and makes its connection available in the Transport's pool, the request that triggered a new dial may use the now-idle connection instead. That allows the request to complete even before the connection it triggered is ready.

All recent Go versions include that link when executing a TCP handshake, leading to some TCP connections being closed as soon as they become ready, thereby wasting the work of the DNS lookup and TCP handshake. New in Go 1.17 is an additional link to the TLS handshake, allowing that more-expensive work to be discarded as well. From the reproducer:

	// New in Go 1.17 via https://golang.org/issue/32406, Go's HTTP client
	// connection pool now also discards useful progress on outbound TLS
	// handshakes.
	//
	// At t=0ms, the first request triggers a new connection
	// At t=0ms, the TCP handshake for the first connection is complete and the TLS handshake begins
	// At t=200ms, the TLS handshake completes and the first HTTP request begins
	// At t=250ms, the first HTTP request completes and the first connection enters the idle pool
	// At t=400ms, the second request removes the first connection from the idle pool
	// At t=420ms, the third request finds an empty pool and dials a second connection
	// At t=420ms, the second connection TLS handshake begins
	// At t=450ms, the second HTTP request completes and hands its connection to the pool
	// At t=450ms, the third request intercepts the first connection before it enters the pool
	// At t=500ms, the third HTTP request completes and returns to the application code
	// At t=500ms, the application code has the full HTTP response, so cancels its Context
	// At t=500ms, Go 1.17's call to tls.Conn.HandshakeContext aborts
	// At t=620ms, Go 1.16's call to tls.Conn.Handshake completes, and goes into the idle pool
	// At t=800ms, the fourth request removes the first connection from the idle pool
	// At t=800ms, the fifth request uses an idle connection (Go 1.16) or dials fresh (Go 1.17+)
	t.Run("minimal pool with slow TLS", varyCancel(
		2,                    // ideal pool size
		0,                    // delay in TCP
		200*time.Millisecond, // delay in TLS
		50*time.Millisecond,  // delay in HTTP
		[]time.Duration{
			0,                      // t=0 ms     create connection 1
			400 * time.Millisecond, // t=400 ms   use warm connection 1 until t=450
			20 * time.Millisecond,  // t=420 ms   trigger new connection 2 (BUG: work may be discarded!)
			380 * time.Millisecond, // t=800 ms   observe pool size (use 1)
			0,                      // t=800 ms   observe pool size (use 2, or dial 3)
		}))

With Go 1.17+, the idle pool remains smaller than what the application needs in the steady state. More requests find an empty pool and have to wait for a connection to become available (generally winning the race against a freshly-dialed connection). The reproducer shows an increase in the average request duration, though I haven't measured that in a production setting. (TLS handshakes are expensive, not slow.)
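
To make the contrast concrete, the reproducer's "no timeout/cancel" and "with timeout/cancel" subtests boil down to the two call patterns below. This is a distilled sketch (assuming a shared *http.Client; imports context, io, net/http, and time), not code copied from the test. Only the second pattern, which cancels its Context as soon as the call returns, triggers the discarded handshakes on Go 1.17+.

// Pattern A ("no timeout/cancel"): rely on the parent Context only.
func fetchNoTimeout(ctx context.Context, client *http.Client, url string) error {
	req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

// Pattern B ("with timeout/cancel"): per-call timeout whose cancel runs as
// soon as the call completes, possibly while a connection this request
// triggered is still mid-handshake.
func fetchWithTimeout(ctx context.Context, client *http.Client, url string) error {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	return fetchNoTimeout(ctx, client, url)
}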

Reproducer results
$ go1.16 test -v
=== RUN   TestHTTPDialCancel
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TCP
    dial_test.go:148: NOTE: expect failure here for all recent Go versions
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TCP/no_timeout/cancel
    dial_test.go:132: average duration 102.287083ms
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TCP/with_timeout/cancel
    dial_test.go:124: TCP handshake count; 3 > 2
    dial_test.go:130: timeout=20m0s tcp=3 tls=1 http=5
    dial_test.go:132: average duration 113.971666ms
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TLS
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TLS/no_timeout/cancel
    dial_test.go:132: average duration 101.454508ms
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TLS/with_timeout/cancel
    dial_test.go:132: average duration 101.3856ms
=== RUN   TestHTTPDialCancel/large_pool_with_slow_TLS
=== RUN   TestHTTPDialCancel/large_pool_with_slow_TLS/no_timeout/cancel
    dial_test.go:132: average duration 75.285054ms
=== RUN   TestHTTPDialCancel/large_pool_with_slow_TLS/with_timeout/cancel
    dial_test.go:132: average duration 76.206017ms
--- FAIL: TestHTTPDialCancel (7.69s)
    --- FAIL: TestHTTPDialCancel/minimal_pool_with_slow_TCP (1.78s)
        --- PASS: TestHTTPDialCancel/minimal_pool_with_slow_TCP/no_timeout/cancel (0.87s)
        --- FAIL: TestHTTPDialCancel/minimal_pool_with_slow_TCP/with_timeout/cancel (0.92s)
    --- PASS: TestHTTPDialCancel/minimal_pool_with_slow_TLS (1.73s)
        --- PASS: TestHTTPDialCancel/minimal_pool_with_slow_TLS/no_timeout/cancel (0.87s)
        --- PASS: TestHTTPDialCancel/minimal_pool_with_slow_TLS/with_timeout/cancel (0.86s)
    --- PASS: TestHTTPDialCancel/large_pool_with_slow_TLS (4.17s)
        --- PASS: TestHTTPDialCancel/large_pool_with_slow_TLS/no_timeout/cancel (2.07s)
        --- PASS: TestHTTPDialCancel/large_pool_with_slow_TLS/with_timeout/cancel (2.10s)
FAIL
exit status 1
FAIL	code.justin.tv/rhys/nursery/issue/golang/go/50xxx	7.844s
$ go1.17 test -v
=== RUN   TestHTTPDialCancel
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TCP
    dial_test.go:148: NOTE: expect failure here for all recent Go versions
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TCP/no_timeout/cancel
    dial_test.go:132: average duration 103.162191ms
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TCP/with_timeout/cancel
    dial_test.go:124: TCP handshake count; 3 > 2
    dial_test.go:130: timeout=20m0s tcp=3 tls=1 http=5
    dial_test.go:132: average duration 114.508925ms
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TLS
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TLS/no_timeout/cancel
    dial_test.go:132: average duration 102.52155ms
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TLS/with_timeout/cancel
    dial_test.go:124: TCP handshake count; 3 > 2
    dial_test.go:127: TLS handshake count; 3 > 2
    dial_test.go:130: timeout=20m0s tcp=3 tls=3 http=5
    dial_test.go:132: average duration 113.051833ms
=== RUN   TestHTTPDialCancel/large_pool_with_slow_TLS
=== RUN   TestHTTPDialCancel/large_pool_with_slow_TLS/no_timeout/cancel
    dial_test.go:132: average duration 77.059902ms
=== RUN   TestHTTPDialCancel/large_pool_with_slow_TLS/with_timeout/cancel
    dial_test.go:124: TCP handshake count; 24 > 8
    dial_test.go:127: TLS handshake count; 24 > 8
    dial_test.go:130: timeout=20m0s tcp=24 tls=24 http=44
    dial_test.go:132: average duration 87.613746ms
--- FAIL: TestHTTPDialCancel (8.04s)
    --- FAIL: TestHTTPDialCancel/minimal_pool_with_slow_TCP (1.81s)
        --- PASS: TestHTTPDialCancel/minimal_pool_with_slow_TCP/no_timeout/cancel (0.87s)
        --- FAIL: TestHTTPDialCancel/minimal_pool_with_slow_TCP/with_timeout/cancel (0.93s)
    --- FAIL: TestHTTPDialCancel/minimal_pool_with_slow_TLS (1.89s)
        --- PASS: TestHTTPDialCancel/minimal_pool_with_slow_TLS/no_timeout/cancel (0.87s)
        --- FAIL: TestHTTPDialCancel/minimal_pool_with_slow_TLS/with_timeout/cancel (1.02s)
    --- FAIL: TestHTTPDialCancel/large_pool_with_slow_TLS (4.34s)
        --- PASS: TestHTTPDialCancel/large_pool_with_slow_TLS/no_timeout/cancel (2.09s)
        --- FAIL: TestHTTPDialCancel/large_pool_with_slow_TLS/with_timeout/cancel (2.25s)
FAIL
exit status 1
FAIL	code.justin.tv/rhys/nursery/issue/golang/go/50xxx	8.780s
$ go1.18 test -v
=== RUN   TestHTTPDialCancel
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TCP
    dial_test.go:148: NOTE: expect failure here for all recent Go versions
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TCP/no_timeout/cancel
    dial_test.go:132: average duration 105.759366ms
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TCP/with_timeout/cancel
    dial_test.go:124: TCP handshake count; 3 > 2
    dial_test.go:130: timeout=20m0s tcp=3 tls=1 http=5
    dial_test.go:132: average duration 113.322008ms
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TLS
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TLS/no_timeout/cancel
    dial_test.go:132: average duration 101.212549ms
=== RUN   TestHTTPDialCancel/minimal_pool_with_slow_TLS/with_timeout/cancel
    dial_test.go:124: TCP handshake count; 3 > 2
    dial_test.go:127: TLS handshake count; 3 > 2
    dial_test.go:130: timeout=20m0s tcp=3 tls=3 http=5
    dial_test.go:132: average duration 116.365683ms
=== RUN   TestHTTPDialCancel/large_pool_with_slow_TLS
=== RUN   TestHTTPDialCancel/large_pool_with_slow_TLS/no_timeout/cancel
    dial_test.go:132: average duration 76.463704ms
=== RUN   TestHTTPDialCancel/large_pool_with_slow_TLS/with_timeout/cancel
    dial_test.go:124: TCP handshake count; 24 > 8
    dial_test.go:127: TLS handshake count; 24 > 8
    dial_test.go:130: timeout=20m0s tcp=24 tls=24 http=44
    dial_test.go:132: average duration 90.389791ms
--- FAIL: TestHTTPDialCancel (8.01s)
    --- FAIL: TestHTTPDialCancel/minimal_pool_with_slow_TCP (1.79s)
        --- PASS: TestHTTPDialCancel/minimal_pool_with_slow_TCP/no_timeout/cancel (0.87s)
        --- FAIL: TestHTTPDialCancel/minimal_pool_with_slow_TCP/with_timeout/cancel (0.92s)
    --- FAIL: TestHTTPDialCancel/minimal_pool_with_slow_TLS (1.88s)
        --- PASS: TestHTTPDialCancel/minimal_pool_with_slow_TLS/no_timeout/cancel (0.87s)
        --- FAIL: TestHTTPDialCancel/minimal_pool_with_slow_TLS/with_timeout/cancel (1.01s)
    --- FAIL: TestHTTPDialCancel/large_pool_with_slow_TLS (4.33s)
        --- PASS: TestHTTPDialCancel/large_pool_with_slow_TLS/no_timeout/cancel (2.08s)
        --- FAIL: TestHTTPDialCancel/large_pool_with_slow_TLS/with_timeout/cancel (2.25s)
FAIL
exit status 1
FAIL	code.justin.tv/rhys/nursery/issue/golang/go/50xxx	8.825s
Reproducer code, dial_test.go
// Copyright 2022 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"io"
	"io/ioutil"
	"log"
	"net"
	"net/http"
	"net/http/httptest"
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

func TestHTTPDialCancel(t *testing.T) {
	withServer := func(tcpDelay, tlsDelay, httpDelay time.Duration, fn func(srv *httptest.Server)) (int, int, int) {
		var (
			tcpDialCount      int32
			tlsHandshakeCount int32
			httpHandleCount   int32
		)

		srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			atomic.AddInt32(&httpHandleCount, 1)
			time.Sleep(httpDelay)
		}))

		srv.EnableHTTP2 = false

		srv.Config.ErrorLog = log.New(ioutil.Discard, "", 0)

		// Pause for a moment during the handshake so we can see what happens when
		// we cancel the Context of a completed HTTP Request.
		srv.TLS = &tls.Config{}
		srv.TLS.GetConfigForClient = func(chi *tls.ClientHelloInfo) (*tls.Config, error) {
			atomic.AddInt32(&tlsHandshakeCount, 1)
			time.Sleep(tlsDelay)
			return nil, nil
		}

		srv.StartTLS()
		defer srv.Close()

		// Before making any requests, add a delay to the TCP Dialer so we can see
		// what happens when we cancel the Context of a completed HTTP Request.
		srv.Client().Transport.(*http.Transport).DialContext = func(ctx context.Context, network, addr string) (net.Conn, error) {
			atomic.AddInt32(&tcpDialCount, 1)
			time.Sleep(tcpDelay)
			return (&net.Dialer{}).DialContext(ctx, network, addr)
		}
		// Allow a large connection pool
		srv.Client().Transport.(*http.Transport).MaxIdleConnsPerHost = 100

		fn(srv)

		return int(atomic.LoadInt32(&tcpDialCount)), int(atomic.LoadInt32(&tlsHandshakeCount)), int(atomic.LoadInt32(&httpHandleCount))
	}

	doRequest := func(ctx context.Context, srv *httptest.Server, timeout time.Duration) error {
		if timeout > 0 {
			// BUG: canceling the context associated with an already-complete
			// HTTP request leads to an increase in TLS handshake count.
			sub, cancel := context.WithTimeout(ctx, timeout)
			defer cancel()
			ctx = sub
		}

		req, err := http.NewRequestWithContext(ctx, "GET", srv.URL, nil)
		if err != nil {
			return fmt.Errorf("NewRequestWithContext: %w", err)
		}

		resp, err := srv.Client().Do(req)
		if err != nil {
			return fmt.Errorf("Do Request: %w", err)
		}
		defer resp.Body.Close()

		_, err = io.Copy(ioutil.Discard, resp.Body)
		if err != nil {
			return fmt.Errorf("Discard Body: %w", err)
		}

		return nil
	}

	callWithDelays := func(t *testing.T, tcpDelay, tlsDelay, httpDelay time.Duration, delays []time.Duration, timeout time.Duration) (int, int, int, time.Duration) {
		var total time.Duration
		tcp, tls, http := withServer(tcpDelay, tlsDelay, httpDelay, func(srv *httptest.Server) {
			ctx := context.Background()
			var wg sync.WaitGroup
			for _, delay := range delays {
				time.Sleep(delay)
				wg.Add(1)
				go func() {
					defer wg.Done()
					start := time.Now()
					err := doRequest(ctx, srv, timeout)
					if err != nil {
						t.Errorf("HTTP request failed: %v", err)
					}
					atomic.AddInt64((*int64)(&total), int64(time.Now().Sub(start)))
				}()
			}
			wg.Wait()
		})
		return tcp, tls, http, total / time.Duration(len(delays))
	}

	varyCancel := func(poolSize int, tcpDelay, tlsDelay, httpDelay time.Duration, delays []time.Duration) func(t *testing.T) {
		return func(t *testing.T) {
			fn := func(timeout time.Duration) func(t *testing.T) {
				return func(t *testing.T) {
					tcp, tls, http, avg := callWithDelays(t, tcpDelay, tlsDelay, httpDelay, delays, timeout)
					if tcp > poolSize {
						t.Errorf("TCP handshake count; %d > %d", tcp, poolSize)
					}
					if tls > poolSize {
						t.Errorf("TLS handshake count; %d > %d", tls, poolSize)
					}
					if t.Failed() {
						t.Logf("timeout=%s tcp=%d tls=%d http=%d", timeout, tcp, tls, http)
					}
					t.Logf("average duration %s", avg)
				}
			}
			// No timeout, so no context.WithCancel / WithTimeout call
			t.Run("no timeout/cancel", fn(0))
			// Huge timeout, key change is the presence of "defer cancel()" once
			// the outbound request is complete
			t.Run("with timeout/cancel", fn(20*time.Minute))
		}
	}

	// Go's HTTP client connection pool has discarded useful progress on
	// outbound TCP handshakes for several releases.
	//
	// NOTE: this test doesn't work well on the Playground.
	t.Run("minimal pool with slow TCP", func(t *testing.T) {
		t.Logf("NOTE: expect failure here for all recent Go versions")
		varyCancel(
			2,                    // ideal pool size
			200*time.Millisecond, // delay in TCP
			0,                    // delay in TLS
			50*time.Millisecond,  // delay in HTTP
			[]time.Duration{
				0,                      // t=0 ms     create connection 1
				400 * time.Millisecond, // t=400 ms   use warm connection 1 until t=450
				20 * time.Millisecond,  // t=420 ms   trigger new connection 2 (BUG: work may be discarded!)
				380 * time.Millisecond, // t=800 ms   observe pool size (use 1)
				0,                      // t=800 ms   observe pool size (use 2, or dial 3)
			})(t)
	})

	// New in Go 1.17 via https://golang.org/issue/32406, Go's HTTP client
	// connection pool now also discards useful progress on outbound TLS
	// handshakes.
	//
	// At t=0ms, the first request triggers a new connection
	// At t=0ms, the TCP handshake for the first connection is complete and the TLS handshake begins
	// At t=200ms, the TLS handshake completes and the first HTTP request begins
	// At t=250ms, the first HTTP request completes and the first connection enters the idle pool
	// At t=400ms, the second request removes the first connection from the idle pool
	// At t=420ms, the third request finds an empty pool and dials a second connection
	// At t=420ms, the second connection TLS handshake begins
	// At t=450ms, the second HTTP request completes and hands its connection to the pool
	// At t=450ms, the third request intercepts the first connection before it enters the pool
	// At t=500ms, the third HTTP request completes and returns to the application code
	// At t=500ms, the application code has the full HTTP response, so cancels its Context
	// At t=500ms, Go 1.17's call to tls.Conn.HandshakeContext aborts
	// At t=620ms, Go 1.16's call to tls.Conn.Handshake completes, and goes into the idle pool
	// At t=800ms, the fourth request removes the first connection from the idle pool
	// At t=800ms, the fifth request uses an idle connection (Go 1.16) or dials fresh (Go 1.17+)
	t.Run("minimal pool with slow TLS", varyCancel(
		2,                    // ideal pool size
		0,                    // delay in TCP
		200*time.Millisecond, // delay in TLS
		50*time.Millisecond,  // delay in HTTP
		[]time.Duration{
			0,                      // t=0 ms     create connection 1
			400 * time.Millisecond, // t=400 ms   use warm connection 1 until t=450
			20 * time.Millisecond,  // t=420 ms   trigger new connection 2 (BUG: work may be discarded!)
			380 * time.Millisecond, // t=800 ms   observe pool size (use 1)
			0,                      // t=800 ms   observe pool size (use 2, or dial 3)
		}))

	// The impact of discarding useful progress on TLS handshakes is unbounded:
	// A client running Go 1.17 or newer, which creates a context for each
	// request which it cancels when the request is complete, may steadily churn
	// through new TLS connections. It can do this even when its maximum
	// outbound concurrency is below the MaxIdleConnsPerHost limit.
	t.Run("large pool with slow TLS", varyCancel(
		8,                    // ideal pool size
		0,                    // delay in TCP
		200*time.Millisecond, // delay in TLS
		50*time.Millisecond,  // delay in HTTP
		[]time.Duration{
			0,                      // t=0 ms     create connection 1
			0,                      // t=0 ms     create connection 2
			0,                      // t=0 ms     create connection 3
			0,                      // t=0 ms     create connection 4
			400 * time.Millisecond, // t=400 ms   use warm connection 1 until t=450
			0,                      // t=400 ms   use warm connection 2 until t=450
			0,                      // t=400 ms   use warm connection 3 until t=450
			0,                      // t=400 ms   use warm connection 4 until t=450
			20 * time.Millisecond,  // t=420 ms   trigger new connection 5 (BUG: work may be discarded!)
			0,                      // t=420 ms   trigger new connection 6 (BUG: work may be discarded!)
			0,                      // t=420 ms   trigger new connection 7 (BUG: work may be discarded!)
			0,                      // t=420 ms   trigger new connection 8 (BUG: work may be discarded!)
			380 * time.Millisecond, // t=800 ms   use warm connection 1 until t=850
			0,                      // t=800 ms   use warm connection 2 until t=850
			0,                      // t=800 ms   use warm connection 3 until t=850
			0,                      // t=800 ms   use warm connection 4 until t=850
			20 * time.Millisecond,  // t=820 ms   use warm connection 5, or trigger new connection 9 (BUG: work may be discarded!)
			0,                      // t=820 ms   use warm connection 6, or trigger new connection 10 (BUG: work may be discarded!)
			0,                      // t=820 ms   use warm connection 7, or trigger new connection 11 (BUG: work may be discarded!)
			0,                      // t=820 ms   use warm connection 8, or trigger new connection 12 (BUG: work may be discarded!)
			380 * time.Millisecond, // t=1200 ms  use warm connection 1 until t=1250
			0,                      // t=1200 ms  use warm connection 2 until t=1250
			0,                      // t=1200 ms  use warm connection 3 until t=1250
			0,                      // t=1200 ms  use warm connection 4 until t=1250
			20 * time.Millisecond,  // t=1220 ms  use warm connection 5, or trigger new connection 13 (BUG: work may be discarded!)
			0,                      // t=1220 ms  use warm connection 6, or trigger new connection 14 (BUG: work may be discarded!)
			0,                      // t=1220 ms  use warm connection 7, or trigger new connection 15 (BUG: work may be discarded!)
			0,                      // t=1220 ms  use warm connection 8, or trigger new connection 16 (BUG: work may be discarded!)
			380 * time.Millisecond, // t=1600 ms  use warm connection 1 until t=1650
			0,                      // t=1600 ms  use warm connection 2 until t=1650
			0,                      // t=1600 ms  use warm connection 3 until t=1650
			0,                      // t=1600 ms  use warm connection 4 until t=1650
			20 * time.Millisecond,  // t=1620 ms  use warm connection 5, or trigger new connection 17 (BUG: work may be discarded!)
			0,                      // t=1620 ms  use warm connection 6, or trigger new connection 18 (BUG: work may be discarded!)
			0,                      // t=1620 ms  use warm connection 7, or trigger new connection 19 (BUG: work may be discarded!)
			0,                      // t=1620 ms  use warm connection 8, or trigger new connection 20 (BUG: work may be discarded!)
			380 * time.Millisecond, // t=2020 ms  observe pool size (use 1)
			0,                      // t=2020 ms  observe pool size (use 2)
			0,                      // t=2020 ms  observe pool size (use 3)
			0,                      // t=2020 ms  observe pool size (use 4)
			0,                      // t=2020 ms  observe pool size (use 5, or dial 21)
			0,                      // t=2020 ms  observe pool size (use 6, or dial 22)
			0,                      // t=2020 ms  observe pool size (use 7, or dial 23)
			0,                      // t=2020 ms  observe pool size (use 8, or dial 24)
		}))
}
komuw (Contributor) commented Feb 3, 2022

sounds related to #50657 somehow

toothrot added the NeedsInvestigation label on Feb 4, 2022
toothrot added this to the Backlog milestone on Feb 4, 2022
bcmills (Contributor) commented Feb 22, 2022

CC @neild

(Comments from @duanmengkk and @mark4z were marked as duplicates; comments from @bcmills and @krk were marked as off-topic.)

kassiansun commented May 11, 2023

I think we can't use a dedicated context for the TLS handshake (at least not with a simple code modification). If the user doesn't use the idle pool (DisableKeepAlives: true), this would make the TLS handshake (within queueForDial) un-cancellable.

Maybe we can check the connection pool for idle connections: if there are any, we can skip the dialing process; fasthttp does something similar. The caveat is that if the pool gets occupied after the check, the call will have to wait longer to get an idle connection (instead of trying to dial the remote in parallel).

Another solution is adding a DialDelay option: the user can choose whether to wait a bit longer to prioritize using an idle connection, instead of aggressively dialing the remote host. This might make the implementation a bit more efficient under some load scenarios.
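
For illustration only, here is a rough sketch of the DialDelay idea in a toy connection pool outside of net/http; the names (pool, get, dialDelay) are invented for this sketch and are not proposed API. The caller waits briefly for an in-flight request to return its connection before paying for a fresh dial and TLS handshake.

// Toy pool sketch (not net/http internals). Imports: context, net, time.
type pool struct {
	idle chan net.Conn // connections are returned here when a request finishes
	dial func(ctx context.Context) (net.Conn, error) // dials and handshakes a fresh connection
}

func (p *pool) get(ctx context.Context, dialDelay time.Duration) (net.Conn, error) {
	select {
	case c := <-p.idle:
		return c, nil // an idle connection is available right now
	default:
	}
	// The pool is empty: give a busy connection a chance to come back before
	// starting an expensive new dial and handshake.
	t := time.NewTimer(dialDelay)
	defer t.Stop()
	select {
	case c := <-p.idle:
		return c, nil
	case <-t.C:
		return p.dial(ctx) // nothing came back in time; dial fresh
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

The trade-off is the one described above: when no connection comes back, the request pays up to dialDelay of extra latency before the dial even starts.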
