net/http: slight scheduling changes makes TestServerWriteTimeout/h2 flaky (primarily wasm) #65410
Labels
NeedsInvestigation
Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Testing
An issue that has been verified to require only test changes, not just a test failure.
Milestone
@bcmills and I have been working on addressing some scheduling concerns with net on wasm (#65178, #64317), notably via https://go.dev/cl/557437 and https://go.dev/cl/558915.
Each of these CLs works fine in isolation, but when applied together,
TestServerWriteTimeout/h2
started timing out: https://ci.chromium.org/ui/p/golang/builders/ci/gotip-wasip1-wasm_wasmtime/b8757522083839877441/overview (possibly others as well, masked behind the timeout).This failure can be reproduced at tip by cherry-picking https://go.dev/cl/557437 and running
GOARCH=wasm GOOS=wasip1 go test -v -run TestServerWriteTimeout/h2
.I dug into this failure a bit and believe it is due to bad retries within net/http and/or bad expectations of the test itself. My analysis below. The instrumentation I added can be found in https://go.dev/cl/559797.
First of all, without these CLs, this test passes on both linux-amd64 and on wasm, but the latter is more exciting:
linux-amd64:
wasipi1-wasm:
The linux-amd64 version simply passed, but the wasm version encountered 2 timeouts before passing. This test contains a top-level timeout that increases on retry. The initial 5ms timeout was apparently insufficient and it had to increase twice before the test succeeded.
If the fact that the wasm version of this test consistently has a few timeouts here before passing is fundamentally problematic, you can stop reading here and the focus should be to understand and fix that.
Assuming otherwise, here's what happens with both CLs:
We're getting seemingly similar timeouts, but now they are occurring over and over again. Based on logging from my instrumentation CL https://go.dev/cl/559797, what is happening here is:
In the passing case, the timeout error ends up coming out of
Transport.getConn
toTransport.roundTrip
here. There is no retry logic here, so the error ultimately bubbles all the way up to the test, which increases the timeout and retries.In the failing case, the timeout errors ends up coming out of the
pconn
RoundTrip call just a few lines later. This error has a conditional internal retry mechanism (L615-643), which decides that this error warrants a retry and internally loops and retries withinroundTrip
. Because it never went back to the test, the timeout is the same, and the same problem occurs over and over again.It is unclear to me what the "correct" behavior is. Perhaps timeouts should not be retryable? Or perhaps it is unreasonable for the test to expect a request to complete with the same initial timeout? Either way, it seems strange to me that the same timeout error from
getConn
vsRoundTrip
has different behavior, given that it seems to be just luck/timing which call gets the error.cc @neild @bcmills
The text was updated successfully, but these errors were encountered: