net/http: slight scheduling changes makes TestServerWriteTimeout/h2 flaky (primarily wasm) #65410

prattmic · 2024-01-31T17:18:41Z

@bcmills and I have been working on addressing some scheduling concerns with net on wasm (#65178, #64317), notably via https://go.dev/cl/557437 and https://go.dev/cl/558915.

Each of these CLs works fine in isolation, but when applied together, TestServerWriteTimeout/h2 started timing out: https://ci.chromium.org/ui/p/golang/builders/ci/gotip-wasip1-wasm_wasmtime/b8757522083839877441/overview (possibly others as well, masked behind the timeout).

This failure can be reproduced at tip by cherry-picking https://go.dev/cl/557437 and running GOARCH=wasm GOOS=wasip1 go test -v -run TestServerWriteTimeout/h2.

I dug into this failure a bit and believe it is due to bad retries within net/http and/or bad expectations of the test itself. My analysis below. The instrumentation I added can be found in https://go.dev/cl/559797.

First of all, without these CLs, this test passes on both linux-amd64 and on wasm, but the latter is more exciting:

linux-amd64:

=== RUN   TestServerWriteTimeout
=== RUN   TestServerWriteTimeout/h2
--- PASS: TestServerWriteTimeout (0.01s)
    --- PASS: TestServerWriteTimeout/h2 (0.01s)
PASS
ok      net/http        0.041s

wasipi1-wasm:

=== RUN   TestServerWriteTimeout
=== RUN   TestServerWriteTimeout/h2
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:2: read tcp 127.0.0.1:1->127.0.0.1:2: i/o timeout
    serve_test.go:783: Get error, retrying: Get "https://127.0.0.1:1": write tcp 127.0.0.1:2->127.0.0.1:1: send: Connection reset by peer
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:4: read tcp 127.0.0.1:3->127.0.0.1:4: i/o timeout
    serve_test.go:783: Get error, retrying: Get "https://127.0.0.1:3": write tcp 127.0.0.1:4->127.0.0.1:3: send: Connection reset by peer
--- PASS: TestServerWriteTimeout (0.10s)
    --- PASS: TestServerWriteTimeout/h2 (0.10s)
PASS
ok      net/http        0.318s

The linux-amd64 version simply passed, but the wasm version encountered 2 timeouts before passing. This test contains a top-level timeout that increases on retry. The initial 5ms timeout was apparently insufficient and it had to increase twice before the test succeeded.

If the fact that the wasm version of this test consistently has a few timeouts here before passing is fundamentally problematic, you can stop reading here and the focus should be to understand and fix that.

Assuming otherwise, here's what happens with both CLs:

=== RUN   TestServerWriteTimeout
=== RUN   TestServerWriteTimeout/h2
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:2: read tcp 127.0.0.1:1->127.0.0.1:2: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:3: read tcp 127.0.0.1:1->127.0.0.1:3: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:4: read tcp 127.0.0.1:1->127.0.0.1:4: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:5: read tcp 127.0.0.1:1->127.0.0.1:5: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:6: read tcp 127.0.0.1:1->127.0.0.1:6: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:7: read tcp 127.0.0.1:1->127.0.0.1:7: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:8: read tcp 127.0.0.1:1->127.0.0.1:8: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:9: read tcp 127.0.0.1:1->127.0.0.1:9: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:10: read tcp 127.0.0.1:1->127.0.0.1:10: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:11: read tcp 127.0.0.1:1->127.0.0.1:11: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:12: read tcp 127.0.0.1:1->127.0.0.1:12: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:13: read tcp 127.0.0.1:1->127.0.0.1:13: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:14: read tcp 127.0.0.1:1->127.0.0.1:14: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:15: read tcp 127.0.0.1:1->127.0.0.1:15: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:16: read tcp 127.0.0.1:1->127.0.0.1:16: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:17: read tcp 127.0.0.1:1->127.0.0.1:17: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:18: read tcp 127.0.0.1:1->127.0.0.1:18: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:19: read tcp 127.0.0.1:1->127.0.0.1:19: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:20: read tcp 127.0.0.1:1->127.0.0.1:20: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:21: read tcp 127.0.0.1:1->127.0.0.1:21: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:22: read tcp 127.0.0.1:1->127.0.0.1:22: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:23: read tcp 127.0.0.1:1->127.0.0.1:23: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:24: read tcp 127.0.0.1:1->127.0.0.1:24: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:25: read tcp 127.0.0.1:1->127.0.0.1:25: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:26: read tcp 127.0.0.1:1->127.0.0.1:26: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:27: read tcp 127.0.0.1:1->127.0.0.1:27: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:28: read tcp 127.0.0.1:1->127.0.0.1:28: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:29: read tcp 127.0.0.1:1->127.0.0.1:29: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:30: read tcp 127.0.0.1:1->127.0.0.1:30: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:31: read tcp 127.0.0.1:1->127.0.0.1:31: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:32: read tcp 127.0.0.1:1->127.0.0.1:32: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:33: read tcp 127.0.0.1:1->127.0.0.1:33: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:34: read tcp 127.0.0.1:1->127.0.0.1:34: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:35: read tcp 127.0.0.1:1->127.0.0.1:35: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:36: read tcp 127.0.0.1:1->127.0.0.1:36: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:37: read tcp 127.0.0.1:1->127.0.0.1:37: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:38: read tcp 127.0.0.1:1->127.0.0.1:38: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:39: read tcp 127.0.0.1:1->127.0.0.1:39: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:40: write tcp 127.0.0.1:1->127.0.0.1:40: i/o timeout
    serve_test.go:783: Get error, retrying: Get "https://127.0.0.1:1": EOF
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:42: read tcp 127.0.0.1:41->127.0.0.1:42: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:43: read tcp 127.0.0.1:41->127.0.0.1:43: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:44: read tcp 127.0.0.1:41->127.0.0.1:44: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:45: read tcp 127.0.0.1:41->127.0.0.1:45: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:46: read tcp 127.0.0.1:41->127.0.0.1:46: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:47: read tcp 127.0.0.1:41->127.0.0.1:47: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:48: read tcp 127.0.0.1:41->127.0.0.1:48: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:49: read tcp 127.0.0.1:41->127.0.0.1:49: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:50: read tcp 127.0.0.1:41->127.0.0.1:50: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:51: read tcp 127.0.0.1:41->127.0.0.1:51: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:52: read tcp 127.0.0.1:41->127.0.0.1:52: i/o timeout
    clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:53: read tcp 127.0.0.1:41->127.0.0.1:53: i/o timeout
... continues

We're getting seemingly similar timeouts, but now they are occurring over and over again. Based on logging from my instrumentation CL https://go.dev/cl/559797, what is happening here is:

In the passing case, the timeout error ends up coming out of Transport.getConn to Transport.roundTrip here. There is no retry logic here, so the error ultimately bubbles all the way up to the test, which increases the timeout and retries.
In the failing case, the timeout errors ends up coming out of the pconn RoundTrip call just a few lines later. This error has a conditional internal retry mechanism (L615-643), which decides that this error warrants a retry and internally loops and retries within roundTrip. Because it never went back to the test, the timeout is the same, and the same problem occurs over and over again.

It is unclear to me what the "correct" behavior is. Perhaps timeouts should not be retryable? Or perhaps it is unreasonable for the test to expect a request to complete with the same initial timeout? Either way, it seems strange to me that the same timeout error from getConn vs RoundTrip has different behavior, given that it seems to be just luck/timing which call gets the error.

cc @neild @bcmills

The text was updated successfully, but these errors were encountered:

prattmic · 2024-01-31T17:19:53Z

Oh, I forgot to mention, @bcmills managed to reproduce this repeated internal timeout in the test on linux-amd64 by tuning just the right timeouts and adding some strategic sleeps.

bcmills · 2024-01-31T17:27:40Z

I've added a comment with a bit more detail about this race in the test, in https://go-review.git.corp.google.com/c/go/+/559835/4/src/net/http/serve_test.go.

gopherbot · 2024-01-31T17:28:45Z

Change https://go.dev/cl/559835 mentions this issue: net/http: set a client timeout in TestServerWriteTimeout

bcmills · 2024-01-31T18:06:12Z

TestServerReadTimeout and TestServerCancelsReadTimeoutWhenIdle seem to fall into the same retry loop too. Not sure if there are others.

neild · 2024-01-31T18:37:59Z

I believe what's happening is:

Transport.roundTrip successfully makes a TLS connection.
The TLS connection negotiates the HTTP/2 protocol, so it's passed to the HTTP/2 layer.
The HTTP/2 transport adds the conn to its pool and provides a RoundTripper to net/http.
The HTTP/2 layer immediately reads an error from the connection and discards it.
Transport.roundTrip tries to use the HTTP/2 RoundTripper, which returns a "no cached connections available" error.
We go back to the start and do it all again.

We've got a mechanism that to avoid an infinite retry loop when HTTP/1 connections constantly fail (if we make a request using a new connection, and that request fails, then don't retry). We don't have one for HTTP/2, and this is complicated by the fact that Transport.roundTrip doesn't have any way to identify which HTTP/2 connection is being used for a request; HTTP/2 connections get their own pool, which is invisible to the top-level RoundTrip.

The right long-term fix is to unify the connection pools and simplify connection management, but that's a big project. (Which I'm in the middle of, but it's slow going.) There's probably a lower impact fix we can do here, though. Perhaps we can apply the "don't retry if this connection has never had a successful round trip" logic to HTTP/2 connections as well?

neild · 2024-01-31T18:41:50Z

https://go.dev/issue/16582 is going to be relevant here as a case where a HTTP/2 RoundTripper returns a no-cached-connections-error, but we actually do want to retry.

prattmic added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 31, 2024

bcmills added this to the Backlog milestone Jan 31, 2024

bcmills added the Testing An issue that has been verified to require only test changes, not just a test failure. label Jan 31, 2024

gopherbot closed this as completed in fc0d9a4 Feb 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

net/http: slight scheduling changes makes TestServerWriteTimeout/h2 flaky (primarily wasm) #65410

net/http: slight scheduling changes makes TestServerWriteTimeout/h2 flaky (primarily wasm) #65410

prattmic commented Jan 31, 2024

prattmic commented Jan 31, 2024

bcmills commented Jan 31, 2024

gopherbot commented Jan 31, 2024

bcmills commented Jan 31, 2024

neild commented Jan 31, 2024

neild commented Jan 31, 2024

net/http: slight scheduling changes makes TestServerWriteTimeout/h2 flaky (primarily wasm) #65410

net/http: slight scheduling changes makes TestServerWriteTimeout/h2 flaky (primarily wasm) #65410

Comments

prattmic commented Jan 31, 2024

prattmic commented Jan 31, 2024

bcmills commented Jan 31, 2024

gopherbot commented Jan 31, 2024

bcmills commented Jan 31, 2024

neild commented Jan 31, 2024

neild commented Jan 31, 2024