net/http: TestTransportMaxConnsPerHost still flaky #31982

Closed
cuonglm opened this issue May 11, 2019 · 16 comments
Labels
FrozenDueToAge, NeedsInvestigation, Soon, Testing

Comments

@cuonglm
Member

cuonglm commented May 11, 2019

What version of Go are you using (go version)?

$ go version
go version devel +83f205fa88 Sun May 12 02:55:19 2019 +0000 linux/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/cuonglm/.cache/go-build"
GOENV="/home/cuonglm/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/cuonglm/go"
GOPROXY="direct"
GOROOT="/home/cuonglm/sources/go"
GOSUMDB="off"
GOTMPDIR=""
GOTOOLDIR="/home/cuonglm/sources/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build766721718=/tmp/go-build -gno-record-gcc-switches"

What did you do?

From the repository root:
$ cd src/net/http
$ go test -run TestTransportMaxConnsPerHost -count=100 -race

What did you expect to see?

Test passed

What did you see instead?

--- FAIL: TestTransportMaxConnsPerHost (0.07s)
    transport_test.go:661: Too many get connections (http2): 2
    transport_test.go:676: Too many get connections (http2): 3
--- FAIL: TestTransportMaxConnsPerHost (0.03s)
    transport_test.go:661: Too many get connections (http2): 2
    transport_test.go:676: Too many get connections (http2): 3
--- FAIL: TestTransportMaxConnsPerHost (0.08s)
    transport_test.go:661: Too many get connections (http2): 2
    transport_test.go:676: Too many get connections (http2): 3
--- FAIL: TestTransportMaxConnsPerHost (0.06s)
    transport_test.go:661: Too many get connections (http2): 2
    transport_test.go:676: Too many get connections (http2): 3
--- FAIL: TestTransportMaxConnsPerHost (0.03s)
    transport_test.go:661: Too many get connections (http2): 2
    transport_test.go:676: Too many get connections (http2): 3
FAIL
exit status 1
FAIL	net/http	4.433s
@fraenkel
Contributor

This is a bit interesting: we get the right number of dials and TLS handshakes, but the returned connection is not marked as reused when it is supposed to be.

@fraenkel
Contributor

The problem is that the GotConn for http2 (h2_bundle:6878) is called before we know whether we are creating a new stream (h2_bundle:7449), which is what determines reuse.

This is not a flaky test but a real bug.
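
To make the ordering problem concrete, here is a minimal, self-contained sketch (toy types, not the bundled x/net/http2 code): the reused flag is derived from nextStreamID, but nextStreamID only advances when the stream is actually created, after GotConn has already been reported, so two requests sharing one fresh connection can both be reported as not reused.

```go
package main

import (
	"fmt"
	"sync"
)

// clientConn is a toy stand-in for the http2 client connection. The
// "reused" heuristic looks at nextStreamID, but nextStreamID only
// advances when a stream is created, which happens after GotConn fires.
type clientConn struct {
	mu           sync.Mutex
	nextStreamID uint32 // starts at 1; advances by 2 per stream
}

func (cc *clientConn) reportGotConn() bool {
	cc.mu.Lock()
	defer cc.mu.Unlock()
	return cc.nextStreamID > 1 // the flawed "reused" heuristic
}

func (cc *clientConn) newStream() {
	cc.mu.Lock()
	defer cc.mu.Unlock()
	cc.nextStreamID += 2
}

func main() {
	cc := &clientConn{nextStreamID: 1}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Both requests may observe nextStreamID == 1 here and
			// report reused=false, even though they share one conn.
			fmt.Printf("request %d: reused=%v\n", i, cc.reportGotConn())
			cc.newStream() // the stream is only created afterwards
		}(i)
	}
	wg.Wait()
}
```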

@fraenkel
Contributor

Using the nextStreamID is flawed anyhow. If you allow HTTP, we start at 3, which means it's immediately marked as reused.

@fraenkel
Contributor

If we move the GotConn call into getClientConn(h2_bundle:749), we can distinguish between a new connection and a used connection.
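
A rough sketch of that idea (toy code under hypothetical names, not the actual getClientConn in the bundle): the connection pool is the one place that knows whether a connection came out of the cache or was freshly dialed, so the reused flag can be decided there and passed along to GotConn.

```go
package main

import "fmt"

// conn and pool are hypothetical stand-ins for the http2 ClientConn and
// its connection pool, used only to illustrate where reuse is known.
type conn struct{ addr string }

type pool struct {
	conns map[string]*conn
}

// getClientConn returns a connection for addr and reports whether it was
// already cached (reused) or freshly "dialed".
func (p *pool) getClientConn(addr string) (c *conn, reused bool) {
	if c, ok := p.conns[addr]; ok {
		return c, true // came from the pool: reused
	}
	c = &conn{addr: addr} // pretend to dial a new connection
	p.conns[addr] = c
	return c, false
}

func main() {
	p := &pool{conns: map[string]*conn{}}
	for i := 0; i < 2; i++ {
		_, reused := p.getClientConn("example.com:443")
		// GotConn would be reported here, with an accurate Reused flag.
		fmt.Printf("request %d: reused=%v\n", i, reused)
	}
}
```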

@cuonglm cuonglm changed the title net/http: TestTransportMaxConnsPerHost still flakes net/http: TestTransportMaxConnsPerHost still flaky May 12, 2019
@anakin

anakin commented May 12, 2019

➜ http go test -run TestTransportMaxConnsPerHost -count=100 -race
PASS
ok net/http 1.641s

@anakin

anakin commented May 12, 2019

by the way,
➜ http go version
go version go1.12.5 darwin/amd64

@gopherbot

Change https://golang.org/cl/176720 mentions this issue: http2: track reused connections

@fraenkel
Contributor

I submitted a fix which does highlight the issue with http2. I did some other testing and found that some of the other ClientTrace calls, like TLSHandshakeXXX, are missing.
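
For reference, this is roughly the kind of instrumentation involved: an httptrace.ClientTrace that counts non-reused connections via GotConn and handshakes via TLSHandshakeDone (a standalone example in the spirit of the test, not the test code itself).

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
	"sync/atomic"
)

func main() {
	// Count new (non-reused) connections and TLS handshakes via ClientTrace,
	// similar in spirit to what TestTransportMaxConnsPerHost checks.
	var newConns, handshakes int32

	srv := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {}))
	defer srv.Close()

	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) {
			if !info.Reused {
				atomic.AddInt32(&newConns, 1)
			}
		},
		TLSHandshakeDone: func(tls.ConnectionState, error) {
			atomic.AddInt32(&handshakes, 1)
		},
	}

	client := srv.Client()
	for i := 0; i < 3; i++ {
		req, _ := http.NewRequest("GET", srv.URL, nil)
		req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
		resp, err := client.Do(req)
		if err != nil {
			panic(err)
		}
		resp.Body.Close()
	}

	// With keep-alives working, we expect 1 new connection and 1 handshake.
	fmt.Printf("new connections: %d, TLS handshakes: %d\n", newConns, handshakes)
}
```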

@andybons andybons added the NeedsInvestigation label May 13, 2019
@andybons andybons added this to the Unplanned milestone May 13, 2019
@bcmills
Contributor

bcmills commented May 14, 2019

The darwin-amd64-race builder caught this too:
https://build.golang.org/log/06b60fdfdf4a88e04ae4181d80ddd39db9aaafa0

@bcmills
Contributor

bcmills commented May 14, 2019

I just got another TryBot flake on this:
https://storage.googleapis.com/go-build-log/06afa4b7/freebsd-amd64-12_0_63953b9e.log

@bcmills bcmills modified the milestones: Unplanned, Go1.12.6, Go1.13 May 14, 2019
@bcmills bcmills added the Soon and Testing labels May 14, 2019
@bcmills
Contributor

bcmills commented May 14, 2019

CC @bradfitz

gopherbot pushed a commit to golang/net that referenced this issue May 14, 2019
nextStreamID was used as a means to determine if the connection was
being reused. Multiple requests can see a new connection because the
nextStreamID is updated after a ClientTrace reports it is being reused.

Updates golang/go#31982

Change-Id: Iaa4b62b217f015423cddb99fd86de75a352f8320
Reviewed-on: https://go-review.googlesource.com/c/net/+/176720
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@gopherbot

Change https://golang.org/cl/177037 mentions this issue: net/http: update vendored, bundled x/net/http2

@bradfitz
Contributor

Whoops, I'd forgotten to submit 177037. Done. Hopefully it's better now.

gopherbot pushed a commit that referenced this issue May 16, 2019
For:

    http2: track reused connections
    https://golang.org/cl/176720 (updates #31982)

Some x/sys/unix updates come along for the ride too.

I filed #32031 for making the bundling process less difficult and
error-prone in the future.

Change-Id: Ic822080991ffa2d50352c5f613e45648a327cf16
Reviewed-on: https://go-review.googlesource.com/c/go/+/177037
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@bcmills
Contributor

bcmills commented Jun 14, 2019

In the other direction, it appears that TestTransportMaxConnsPerHost occasionally deadlocks: see #32336.

@rsc
Contributor

rsc commented Jun 28, 2019

Closing as duplicate of #32336 now that we understand how severe that one is.

@rsc rsc closed this as completed Jun 28, 2019
@gopherbot

Change https://golang.org/cl/184262 mentions this issue: net/http: fix Transport.MaxConnsPerHost limits & idle pool races

gopherbot pushed a commit that referenced this issue Jul 8, 2019
There were at least three races in the implementation of the pool of
idle HTTP connections before this CL.

The first race is that HTTP/2 connections can be shared for many
requests, but each requesting goroutine would take the connection out
of the pool and then immediately return it before using it; this
created unnecessary, tiny little race windows during which another
goroutine might dial a second connection instead of reusing the first.
This CL changes the idle pool to just leave the HTTP/2 connection in
the pool permanently (until there is reason to close it), instead of
doing the take-it-out-put-it-back dance race.

The second race is that “is there an idle connection?” and
“register to wait for an idle connection” were implemented as two
separate steps, in different critical sections. So a client could end
up registered to wait for an idle connection and be waiting or perhaps
dialing, not having noticed the idle connection sitting in the pool
that arrived between the two steps.

The third race is that t.getIdleConnCh assumes that the inability to
send on the channel means the client doesn't need the result, when it
could mean that the client has not yet entered the select.
That is, the main dial does:

	idleConnCh := t.getIdleConnCh(cm)
	select {
	case v := <-dialc:
		...
	case pc := <-idleConnCh:
		...
	...
	}

But then tryPutIdleConn does:

	waitingDialer := t.idleConnCh[key] // what getIdleConnCh(cm) returned
	select {
	case waitingDialer <- pconn:
		// We're done ...
		return nil
	default:
		if waitingDialer != nil {
			// They had populated this, but their dial won
			// first, so we can clean up this map entry.
			delete(t.idleConnCh, key)
		}
	}

If the client has returned from getIdleConnCh but not yet reached the
select, tryPutIdleConn will be unable to do the send, incorrectly
conclude that the client does not care anymore, and put the connection
in the idle pool instead, again leaving the client dialing unnecessarily
while a connection sits in the idle pool.

(It's also odd that the success case does not clean up the map entry,
and also that the map has room for only a single waiting goroutine for
a given host.)

None of these races mattered too much before Go 1.11: at most they
meant that connections were not reused quite as promptly as possible,
or a few more than necessary would be created. But Go 1.11 added
Transport.MaxConnsPerHost, which limited the number of connections
created for a given host. The default is 0 (unlimited), but if a user
did explicitly impose a low limit (2 is common), all these misplaced
conns could easily add up to the entire limit, causing a deadlock.
This was causing intermittent timeouts in TestTransportMaxConnsPerHost.

The addition of the MaxConnsPerHost support added its own races.

For example, here t.incHostConnCount could increment the count
and return a channel ready for receiving, and then the client would
not receive from it nor ever issue the decrement, because the select
need not evaluate these two cases in order:

	select {
	case <-t.incHostConnCount(cmKey):
		// count below conn per host limit; proceed
	case pc := <-t.getIdleConnCh(cm):
		if trace != nil && trace.GotConn != nil {
			trace.GotConn(httptrace.GotConnInfo{Conn: pc.conn, Reused: pc.isReused()})
		}
		return pc, nil
	...
	}

Obviously, unmatched increments are another way to get to a deadlock.
TestTransportMaxConnsPerHost deadlocked approximately 100% of
the time with a small random sleep added between incHostConnCount
and the select:

	ch := t.incHostConnCount(cmKey)
	time.Sleep(time.Duration(rand.Intn(10))*time.Millisecond)
	select {
	case <-ch:
		// count below conn per host limit; proceed
	case pc := <-t.getIdleConnCh(cm):
		...
	}

The limit also did not properly apply to HTTP/2, because of the
decrement being attached to the underlying net.Conn.Close
and net/http not having access to the underlying HTTP/2 conn.
The alternate decrements for HTTP/2 may also have introduced
spurious decrements (discussion in #29889). Perhaps those
spurious decrements or other races caused the other intermittent
non-deadlock failures in TestTransportMaxConnsPerHost,
in which the HTTP/2 phase created too many connections (#31982).

This CL replaces the buggy, racy code with new code that is hopefully
neither buggy nor racy.

Fixes #29889.
Fixes #31982.
Fixes #32336.

Change-Id: I0dfac3a6fe8a6cdf5f0853722781fe2ec071ac97
Reviewed-on: https://go-review.googlesource.com/c/go/+/184262
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
@golang golang locked and limited conversation to collaborators Jun 28, 2020