Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/http: TestInterruptWithPanic_h2 flakes on ppc64le builder #17243

Closed
mwhudson opened this issue Sep 26, 2016 · 18 comments
Closed

net/http: TestInterruptWithPanic_h2 flakes on ppc64le builder #17243

mwhudson opened this issue Sep 26, 2016 · 18 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@mwhudson
Copy link
Contributor

The failure looks like this:

--- FAIL: TestInterruptWithPanic_h2 (0.01s)
    clientserver_test.go:1158: Get https://127.0.0.1:51367: stream error: stream ID 1; INTERNAL_ERROR
FAIL
FAIL    net/http    13.570s

It's happened three times in the last three screens of build.golang.org:

https://build.golang.org/log/969151e16737eda8c8720b1c0c3cb6f0ddbe5025
https://build.golang.org/log/ae7a7294c3d83cc5584aabeec8bb4fe55b9e4832
https://build.golang.org/log/512a378cc11774b203aa0962c9818326eda0cd76

but I have the vague feeling it's been going on for a while.

@quentinmit
Copy link
Contributor

@aclements Can you run your flake finder tool on this?

@quentinmit quentinmit added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Oct 3, 2016
@quentinmit quentinmit added this to the Go1.8Maybe milestone Oct 3, 2016
@aclements
Copy link
Member

Looks like it's been flaking on more than just PPC:

$ greplogs -dashboard -e TestInterruptWithPanic_h2 -e INTERNAL_ERROR -l -md
2016-04-19T22:04:30-8b20fd0/freebsd-arm-paulzhol
2016-05-03T00:49:46-499cd33/freebsd-arm-paulzhol
2016-05-06T16:43:51-c81a353/darwin-386-10_10
2016-05-10T03:55:12-3d82432/linux-amd64-race
2016-05-18T14:18:48-b30fcbc/linux-ppc64le-buildlet
2016-06-24T02:21:02-f9d6b90/freebsd-arm-paulzhol
2016-08-09T00:43:25-392bf3a/darwin-amd64-10_9
2016-08-17T23:23:28-6937167/darwin-amd64-10_9
2016-08-23T14:53:47-03723c9/darwin-amd64-10_8
2016-09-14T16:39:47-2e67514/android-arm64-sdk21
2016-09-17T00:34:19-a1bf203/linux-ppc64le-buildlet
2016-09-25T02:38:11-890c09e/linux-ppc64le-buildlet
2016-09-26T22:01:32-d211c2d/linux-ppc64le-buildlet

Flake analysis isn't very helpful with this one, I'm afraid:

$ greplogs -dashboard -e TestInterruptWithPanic_h2 -e INTERNAL_ERROR -l | findflakes -paths
First observed 8b20fd0 19 Apr 22:04 2016 (1668 commits ago)
Last observed  d211c2d 26 Sep 22:01 2016 (88 commits ago)
51% chance failure is still happening
0.76% failure probability (13 of 1581 commits)
Likely culprits:
    1% 8b20fd0 cmd/compile: transform some Phis into Or8.
    1% 082f464 go/types: trailing semis are ok after valid fallthrough
    1% a4dd6ea runtime: add maxSliceCap
    1% 55ab07c cmd/compile: static composite literals are side-effect free
    1% 3c6e60c cmd/compile: fix isStaticCompositeLiteral
    1% 562d398 go/types: accept trailing empty stmts in terminating stmt lists
    1% 9b6bf20 crypto/aes: de-couple asm and go implementations
    1% 03e216f cmd/compile: re-enable in-place append optimization
    1% b5ddbb9 spec: refine rules about terminating statements
    1% a39950b crypto/aes: delete TestEncryptBlock and TestDecryptBlock
No known past failures

@bradfitz
Copy link
Contributor

I can't reproduce this.

I deflaked this test in 238247e but that commit message is misleading. It was only deflaking additions added in https://golang.org/cl/33099 and https://golang.org/cl/33103.

It looks like the original flake code remains unchanged.

In particular, the test handler does this:

        cst := newClientServerTest(t, h2, HandlerFunc(func(w ResponseWriter, r *Request) {
                io.WriteString(w, msg)
                w.(Flusher).Flush()
                panic(panicValue)
        }), func(ts *httptest.Server) {

That means the Handler's http2 HEADERS frame should be written on the wire before the RST_STREAM of error INTERNAL_ERROR.

@tombergan, did you fix this by accident when you redid the http2 write scheduler?

Or, can you imagine why we'd schedule a RST_STREAM before a HEADERS?

@tombergan
Copy link
Contributor

Here's one hypothesis. sc.resetStream constructs a FrameWriteRequest that doesn't have a stream field:
https://github.com/golang/net/blob/master/http2/server.go#L1081

Both the old and new write schedulers use the stream field to determine which queue gets the FrameWriteRequest:
https://github.com/golang/net/blob/c33d37840d93fc75251f49ab83c5fc8b2faea5e8/http2/writesched.go#L84
https://github.com/golang/net/blob/master/http2/writesched.go#L64

This means RST_STREAM frames will be added to the root queue (for control frames), which is processed before the stream queues. We could have this sequence of events:

  1. queue HEADERS (on stream queue)
  2. queue RST_STREAM (on root queue -- this is buggy)
  3. schedule RST_STREAM (since root has priority)
  4. schedule HEADERS

The test will pass if step 3 happens before step 2, but will fail in the above order. Both the old and new write schedulers have this bug -- more precisely, the bug is in sc.resetStream. This bug can explain the flake. However, I can't explain why the test has stopped flaking. Perhaps something changed in the buffering code which made the above order much less likely?

@bradfitz
Copy link
Contributor

To be clear, I've never been able to reproduce it.

And it hasn't happened many times I don't think.

@bradfitz
Copy link
Contributor

(In a meeting now, but your theory looks valid at first glance.)

Do you want to prepare a fix? Or I can send you something?

@tombergan
Copy link
Contributor

A bit busy today, but I can get to this tomorrow if you don't get there first.

@bradfitz
Copy link
Contributor

@gopherbot
Copy link

CL https://golang.org/cl/33238 mentions this issue.

@gopherbot
Copy link

CL https://golang.org/cl/33241 mentions this issue.

gopherbot pushed a commit that referenced this issue Nov 15, 2016
Updates x/net/http2 to x/net git rev 00ed5e9 for:

    http2: schedule RSTStream writes onto its stream's queue
    https://golang.org/cl/33238

Fixes #17243

Change-Id: I79cc5d15bf69ead28d549d4f798c12f4ee2a2201
Reviewed-on: https://go-review.googlesource.com/33241
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@bradfitz
Copy link
Contributor

Apparently golang/net@00ed5e9 wasn't enough.

I still see a flake:
https://build.golang.org/log/2211304f54d156c6eafdd1ab4e15da8c16a19c1c

--- FAIL: TestInterruptWithPanic_nil_h2 (0.02s)
    clientserver_test.go:1191: Get https://127.0.0.1:58409: stream error: stream ID 1; INTERNAL_ERROR
FAIL
FAIL    net/http    3.481s

@bradfitz bradfitz reopened this Nov 16, 2016
@bradfitz
Copy link
Contributor

I'm going to deflake this a different way, without fixing the http2 issue for now.

@gopherbot
Copy link

CL https://golang.org/cl/33332 mentions this issue.

gopherbot pushed a commit that referenced this issue Nov 17, 2016
Updates #17243

Change-Id: Iaa737874e75fdac73452f1fc13a5749e8df78ebe
Reviewed-on: https://go-review.googlesource.com/33332
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@bradfitz
Copy link
Contributor

TODO(bradfitz): file bug about the general problem and theory and then close this one.

@bradfitz
Copy link
Contributor

I'll repurpose this bug about the x/net/http2 behavior.

It seems that even with our fix in https://golang.org/cl/33238 (and bundling it into std), this flaky test persists. Perhaps the problem is now in the Transport instead of the Server.

Investigate for Go 1.9.

@bradfitz bradfitz modified the milestones: Go1.9, Go1.8Maybe Nov 21, 2016
@bradfitz
Copy link
Contributor

That is, if the Transport sees a HEADERS frame on the wire and then "immediately" sees a RST_STREAM, is it possible for the Transport to report an error instead of its RoundTrip returning a non-nil *Response and nil error?

@bradfitz
Copy link
Contributor

@tombergan, I think between the Transport and Server changes, we might've fixed this one?

@tombergan
Copy link
Contributor

If it hasn't been flaking, I say close it.

@golang golang locked and limited conversation to collaborators Dec 20, 2017
c3mb0 pushed a commit to c3mb0/net that referenced this issue Apr 2, 2018
Fixes golang/go#17243

Change-Id: I76f972f908757b103e2ab8d9b1701312308d66e5
Reviewed-on: https://go-review.googlesource.com/33238
Reviewed-by: Tom Bergan <tombergan@google.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

6 participants