
net/http: frequent failures in TestNoSniffExpectRequestBody_h2 since Oct. 28 #42256

Closed
bcmills opened this issue Oct 28, 2020 · 15 comments
Labels: FrozenDueToAge, NeedsInvestigation, release-blocker
Milestone: Go1.16
bcmills commented Oct 28, 2020

--- FAIL: TestNoSniffExpectRequestBody_h2 (10.00s)
    clientserver_test.go:1316: unexpected Read call
2020/10/28 09:19:06 http: TLS handshake error from 127.0.0.1:33072: read tcp 127.0.0.1:33050->127.0.0.1:33072: use of closed network connection
FAIL
FAIL	net/http	24.279s

2020-10-28T14:25:56-b85c2dd/aix-ppc64
2020-10-28T13:46:11-a0a4439/solaris-amd64-oraclerel
2020-10-28T13:25:44-72dec90/aix-ppc64
2020-10-28T04:20:39-02335cf/aix-ppc64
2020-10-28T01:22:47-2414e1f/aix-ppc64
2020-10-28T01:18:38-7be9158/solaris-amd64-oraclerel

Marking as release-blocker because this appears to be a regression in 1.16.

The failures so far are on somewhat esoteric builders, but those are also some of our slowest builders — this could be a timing bug that just happens to reproduce more easily on them.

@bcmills added the NeedsInvestigation and release-blocker labels Oct 28, 2020
@bcmills added this to the Go1.16 milestone Oct 28, 2020

bcmills commented Oct 28, 2020

CC @neild @odeke-em @ianlancetaylor

@bcmills bcmills changed the title net/http: frequent failures in TestNoSnippExpectRequestBody_h2 since Oct. 28 net/http: frequent failures in TestNoSniffExpectRequestBody_h2 since Oct. 28 Oct 28, 2020
danp commented Nov 2, 2020

I spent some time digging into this, somewhat out of curiosity but also to be sure it wasn't related to the fix for #42156 since the timing was suspect. It doesn't seem to be related to #42156 since http.Server.Shutdown is not involved.

It would be nice to run go test -count 100 -run TestNoSniffExpectRequestBody_h2 or so on one of these builders. It's my understanding that gomote access is not given out much. If that is the case, would it be appropriate to debug by opening a DO NOT REVIEW/SUBMIT CL and having TryBot target one or more of these builders using TRY?

neild commented Nov 2, 2020

I'm pretty certain this has been fixed, but I haven't managed to bisect down to the exact point of the fix yet. It reproduces on these builders at the previously-failing commits with -count 100, but not at head.

danp commented Nov 2, 2020

The timer-related changes in 8fdc79e and b4b0144 (before the first failure, 7be9158) and in 89a6540 (after the last observed failure, 676ad56) seemed interesting, but I wasn't able to induce any failures locally in the questionable commit range to dig deeper.

neild commented Nov 2, 2020

Bisection results point at this being fixed by e02ab89:

https://go.googlesource.com/go/+/e02ab89eb8994fa6f2dfa2924cdadb097633fcc1 (passes)
https://go.googlesource.com/go/+/a313eec3869c609c0da2402a4cac2d32113d599c (fails)

The test is flaky. The failures take ~10s to run, which points at the ExpectContinueTimeout in the test expiring for some reason. I'm not sure at the moment why the test is failing or why that CL seems to fix it.

cc @ianlancetaylor

ianlancetaylor commented
https://golang.org/cl/266304 shouldn't affect any correct code. It's a minor optimization which will cause the runtime poller to be woken up less often to run timers. It will only affect programs that set an I/O deadline or a time.Timer, and then change the deadline or the Timer to run earlier than when it was created. Programs that do that will see slightly different, hopefully more efficient, scheduling behavior.

It's really puzzling that it fails before CL 266304 and passes after it. If anything I would expect CL 266304 to wake up fewer goroutines. So why would waking up more goroutines cause a timeout to expire?

I do not know what is happening here.

danp commented Nov 3, 2020

> It will only affect programs that set an I/O deadline or a time.Timer, and then change the deadline or the Timer to run earlier than when it was created.

It's worth noting that this is what the h2 bits do in the Expect: 100-continue case, using AfterFunc with a "huge duration" here:

go/src/net/http/h2_bundle.go

Lines 9105 to 9108 in da7aa86

const hugeDuration = 365 * 24 * time.Hour
s.timer = time.AfterFunc(hugeDuration, func() {
	s.fnonce.Do(s.fn)
})

And then using Reset to have the timer fire earlier:

s.timer.Reset(s.delay)

ianlancetaylor commented
Well, at least that explains why the CL has some effect, though I still can't explain why it has the effect that it does.

danp commented Nov 3, 2020

Actually, something else I wondered about when digging into this is the use of Stop and Reset together. Expanding the code around the Reset:

go/src/net/http/h2_bundle.go

Lines 9140 to 9142 in da7aa86

if s.timer.Stop() {
	s.timer.Reset(s.delay)
}

Since this timer was created with AfterFunc, is it questionable to use Stop and Reset in this way?

ianlancetaylor commented
That usage is fine. The Stop method will return true if the function has not yet run. After calling Stop, it's fine to call Reset to set a new time.

dmitshur commented
This test stops passing completely when the change in CL 253259 is bundled into net/http.
I filed #42498 about that, it may or may not be related to this issue.

fraenkel commented
I can explain how the error occurs, but it doesn't explain why.
The test has a client POST some data with a 10s delay on sending the body via an Expect: 100-continue. The server sends back a 403 (Forbidden) once it receives the headers. The client should receive the 403 and cancel the timer for the body write.

The body is being written after 10s but the timer should have been cancelled.

@dmitshur added the okay-after-beta1 label Nov 19, 2020
fraenkel commented
https://go-review.googlesource.com/c/net/+/269058/ should fix this as well

@toothrot removed the okay-after-beta1 label Dec 17, 2020
toothrot commented Jan 7, 2021

@bcmills Is this still an issue now that https://golang.org/cl/269058 is merged?

ianlancetaylor commented
I don't see any occurrences of this test failure since October 30, 2020, so closing.

@golang golang locked and limited conversation to collaborators Jan 7, 2022
@rsc rsc unassigned neild Jun 23, 2022