Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/build/maintner/godata, x/build/maintner/maintnerd/maintapi: getGoData helper should handle failure without failing test, otherwise that causes occasional false positive test failures due to HTTP/2 internal errors while syncing segments #52048

Closed
bcmills opened this issue Mar 30, 2022 · 8 comments
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. Testing An issue that has been verified to require only test changes, not just a test failure.
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Mar 30, 2022

--- FAIL: TestCorpusCheck (100.40s)
    godata_test.go:52: getting corpus: syncing segment 130: stream error: stream ID 3; INTERNAL_ERROR; received from peer
2022/03/29 15:50:24 Loading data from log *maintner.netMutSource ...
2022/03/29 15:50:24 Downloading 2701 bytes of https://maintner.golang.org/logs/130 ...
2022/03/29 15:50:24 wrote C:\Users\gopher\.cache\golang-maintner\0130.growing.mutlog
2022/03/29 15:50:59 Reloaded data from log *maintner.netMutSource.
FAIL
FAIL	golang.org/x/build/maintner/godata	136.77

greplogs --dashboard -md -l -e 'getting corpus: syncing segment \d+: stream error: stream ID \d+; INTERNAL_ERROR'

2022-03-29T15:43:06-e96d8cf-ae9ce82/windows-amd64-longtest
2021-04-08T21:58:35-83a8520-d67e739/windows-amd64-longtest
2021-04-08T19:58:50-83a8520-bb76193/windows-amd64-longtest
2020-10-19T18:36:14-2476803-06839e3/linux-amd64-longtest

(CC @golang/release)

@bcmills bcmills added this to the Backlog milestone Mar 30, 2022
@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 30, 2022
@gopherbot gopherbot added the Builders x/build issues (builders, bots, dashboards) label Mar 30, 2022
@dmitshur
Copy link
Contributor

dmitshur commented Mar 30, 2022

This looks related to #51323, and a problem that needs to be addressed in net/http (possibly via a change to golang.org/x/net/http2).

It's not viable for maintner to stop depending on net/http and Google front end servers, so I'm not sure what actions are available in this package.

  • Disable HTTP/2? That's not great given as I understand this problem happens quite rarely and probably not worth losing benefits of HTTP/2 over.
  • Skipping the test? Similarly we'd lose test coverage that we get each time this error doesn't happen.

@bcmills
Copy link
Contributor Author

bcmills commented Mar 30, 2022

#51323 looks related, although the high rate of errors there appears to have started in Oct. 2021, whereas the first of these failures was observed Oct. 2020 (a year earlier). 🤔

Either way, if we suspect an error in an HTTP/2 implementation (either Go's or Google's), would it make sense for maintner to work around it by retrying some bounded number of INTERNAL_ERROR results during syncing?

(But this probably doesn't need to be a priority either way unless the failure rate increases. Mostly I've filed the issue in case it indicates some deeper systemic problem in conjunction with other maintner and/or HTTP/2 issues.)

@dmitshur
Copy link
Contributor

dmitshur commented Apr 5, 2022

Maintner already does retries at a higher level. I don't think we should add code to it to detect and handle an internal error coming from the net/http package, since that is out of scope. The real problem needs to be fixed in the HTTP layer, not in maintner which is one of many programs using it.

We can possibly add the retry to TestCorpusCheck as a test-only workaround sooner.

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Apr 5, 2022
@dmitshur dmitshur added this to Planned in Go Release Team Apr 5, 2022
@dmitshur dmitshur added the Testing An issue that has been verified to require only test changes, not just a test failure. label Apr 5, 2022
@bcmills
Copy link
Contributor Author

bcmills commented May 13, 2022

greplogs -l -e 'getting corpus: syncing segment \d+: stream error: stream ID \d+; INTERNAL_ERROR' --since=2022-03-30
2022-05-12T21:45:28-4aa4d2e-6365efb/linux-amd64-longtest
2022-04-12T22:05:26-2897e13-d85694a/windows-amd64-longtest
2022-03-31T14:57:35-27fe37a-378221b/linux-amd64-longtest
2022-03-31T14:51:17-27fe37a-2ea9376/linux-amd64-longtest

(But still nothing to be done here until we make progress on #51323, I think.)

@dmitshur
Copy link
Contributor

dmitshur commented Jun 24, 2022

Based on progress of investigation in #51323, the conclusion is that other than the error message not making it easy to see that the error is coming from a remote server for a possibly unavoidable reason (such as backend HTTP servers restarting, and graceful HTTP/2 stream shutdown being out of scope by intentional design), the net/http package and the HTTP/2 protocol are working as intended (i.e., there isn't an internal error to be fixed elsewhere).

Since maintner already does retries, the problem is only in the test, and adding a retry to the test will be a complete fix for this issue of a flaky test (rather than a workaround as we previously thought).

@dmitshur dmitshur changed the title x/build/maintner: HTTP/2 internal errors while syncing segments x/build/maintner/godata, x/build/maintner/maintnerd/maintapi: getGoData helper should handle failure without failing test, otherwise that causes occasional false positive test failures due to HTTP/2 internal errors while syncing segments Jun 24, 2022
@gopherbot
Copy link

Change https://go.dev/cl/414174 mentions this issue: maintner: retry network operations that may fail in getNewSegments

@dmitshur
Copy link
Contributor

dmitshur commented Jun 25, 2022

I initially wanted to update this issue with the latest status, and then saw that it'd be quick to send a trivial CL to close it. All that was needed was a trivial change to the following flaky-test-generating logic in the two copies of the getGoData helper:

 corpusCache, err = Get(context.Background())
 if err != nil {
-	tb.Fatalf("getting corpus: %v", err)
+	// Occasionally getting non-zero network errors
+	// while downloading 2+ GB of data from the internet
+	// is NOT unexpected!
+	//
+	// Doing t.Fatalf here means we're producing a non-actionable
+	// test failure, and while we haven't implemented go.dev/issue/19177 yet,
+	// any non-zero frequency false-positive flaky test failures
+	// may create additional manual triage work...
+	//
+	// So just handle the error from godata.Get by
+	// trying again or skipping the test—either way it
+	// would likely be better than t.Fatalf given the constraints above.
 }

While doing that, I tried to confirm my claim that our maintner-using programs handle retries (which I knew to be true without looking earlier, since they haven't needed manual intervention to keep running in a very long time).

It turns out they all retry not just by reloading the corpus, but by fatally exiting the program and having k8s restart the pod. That is functional but feels wasteful (computationally). I didn't want to add retry loops to the many godata.Get invocations in multiple x/build programs...

So that's the story of how I ended up with CL 414174. It felt easier to just detect possibly retryable network problems and retry them in netMutSource.getNewSegments, allowing the least amount of partial work to be thrown away unnecessarily due to an occasional network error. (I tried to split it into two smaller changes, but it didn't work out well due to changes in nesting intersecting with both logical changes.)

@dmitshur dmitshur modified the milestones: Backlog, Unreleased Jun 25, 2022
gopherbot pushed a commit to golang/build that referenced this issue Jun 29, 2022
On the internet, servers generally don't go out of their way to ensure
graceful stream shutdowns, preferring to instead rely on the clients to
receive a network error and decide to repeat the same request.

netMutSource implements the MutationSource interface, whose GetMutations
method should return a final MutationStreamEvent event with a non-nil
Err only when the mutation source encounters a fatal error. Most callers
correctly handle such fatal errors with a log.Fatalln or equivalent.

In practice, the netMutSource.getNewSegments method does a lot of work
to ensure it doesn't give up unnecessarily soon, so most errors are
really fatal. However, it was attempting to rely on HTTP requests to
a remote server to always succeed on the first try. That is a mistake.

This change augments the getServerSegments and syncSeg methods to
annotate errors that they believe might be due to transient network
problems, and adds non-zero number of retries to getNewSegments.
That is a good layer to retry since it allows throwing away the least
amount of successful progress made thus far-much less than completely
restarting a maintner-based program and having it reload the corpus.

While here, start relying on the maintnerd server having support for
long-polling since it was added in 2017 (CL 42871), which simplifies
code by allowing the top-level loop and its nesting to be deleted.

Updates golang/go#52048.

Change-Id: I8148109f97365697beab5575358069adfa095f0e
Reviewed-on: https://go-review.googlesource.com/c/build/+/414174
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Alex Rakoczy <jenny@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org>
Auto-Submit: Dmitri Shuralyov <dmitshur@golang.org>
@dmitshur
Copy link
Contributor

I believe CL 414174 is a complete fix and there's nothing more to do here, so tentatively closing but please reopen if there are further occurrences.

Go Release Team automation moved this from Planned to Done Jul 11, 2022
@golang golang locked and limited conversation to collaborators Jul 11, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. Testing An issue that has been verified to require only test changes, not just a test failure.
Projects
Archived in project
Development

No branches or pull requests

3 participants