Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/review/git-codereview: failures with "invalid pending output" due to missing Gerrit metadata #50576

Closed
bcmills opened this issue Jan 12, 2022 · 20 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Jan 12, 2022

The x/review/git-codereview tests for the pending command occasionally fail due to missing Gerrit metadata for some of the CLs, particularly on slower builders. Looks like at least a missed error check, since the error that causes the truncated metadata is not present in the logs.

greplogs --dashboard -md -l -e 'invalid pending output' --since=2020-01-01

2022-01-12T15:53:47-b04b288-6e8b7e4/netbsd-386-9_0-n2

    pending_test.go:396: invalid pending output:
        work REVHASH..REVHASH (current branch)
        + REVHASH v14
        + REVHASH v13
        + REVHASH v12
        + REVHASH v11
        + REVHASH v10
        + REVHASH v9
        + REVHASH v8
        + REVHASH v7
        + REVHASH v6
        + REVHASH v5
        + REVHASH v4
        + REVHASH v3
        + REVHASH v2
        + REVHASH v1
        + REVHASH msg
        
        
        want:
        work REVHASH..REVHASH (current branch, all mailed)
        + REVHASH v14 (CL 1234 -2 +1, mailed)
        + REVHASH v13 (CL 1234 -2 +1, mailed)
        + REVHASH v12 (CL 1234 -2 +1, mailed)
        + REVHASH v11 (CL 1234 -2 +1, mailed)
        + REVHASH v10 (CL 1234 -2 +1, mailed)
        + REVHASH v9 (CL 1234 -2 +1, mailed)
        + REVHASH v8 (CL 1234 -2 +1, mailed)
        + REVHASH v7 (CL 1234 -2 +1, mailed)
        + REVHASH v6 (CL 1234 -2 +1, mailed)
        + REVHASH v5 (CL 1234 -2 +1, mailed)
        + REVHASH v4 (CL 1234 -2 +1, mailed)
        + REVHASH v3 (CL 1234 -2 +1, mailed)
        + REVHASH v2 (CL 1234 -2 +1, mailed)
        + REVHASH v1 (CL 1234 -2 +1, mailed)
        + REVHASH msg (CL 1234 -2 +1, mailed, submitted)

2021-12-07T20:14:56-b8ead20-b37a539/solaris-amd64-oraclerel

    pending_test.go:396: invalid pending output:
        work REVHASH..REVHASH (current branch)
        + REVHASH v14 (CL 1234 -2 +1, mailed)
        + REVHASH v13 (CL 1234 -2 +1, mailed)
        + REVHASH v12 (CL 1234 -2 +1, mailed)
        + REVHASH v11 (CL 1234 -2 +1, mailed)
        + REVHASH v10 (CL 1234 -2 +1, mailed)
        + REVHASH v9 (CL 1234 -2 +1, mailed)
        + REVHASH v8 (CL 1234 -2 +1, mailed)
        + REVHASH v7 (CL 1234 -2 +1, mailed)
        + REVHASH v6 (CL 1234 -2 +1, mailed)
        + REVHASH v5 (CL 1234 -2 +1, mailed)
        + REVHASH v4
        + REVHASH v3
        + REVHASH v2
        + REVHASH v1
        + REVHASH msg (CL 1234 -2 +1, submitted)
        
        
        want:
        work REVHASH..REVHASH (current branch, all mailed)
        + REVHASH v14 (CL 1234 -2 +1, mailed)
        + REVHASH v13 (CL 1234 -2 +1, mailed)
        + REVHASH v12 (CL 1234 -2 +1, mailed)
        + REVHASH v11 (CL 1234 -2 +1, mailed)
        + REVHASH v10 (CL 1234 -2 +1, mailed)
        + REVHASH v9 (CL 1234 -2 +1, mailed)
        + REVHASH v8 (CL 1234 -2 +1, mailed)
        + REVHASH v7 (CL 1234 -2 +1, mailed)
        + REVHASH v6 (CL 1234 -2 +1, mailed)
        + REVHASH v5 (CL 1234 -2 +1, mailed)
        + REVHASH v4 (CL 1234 -2 +1, mailed)
        + REVHASH v3 (CL 1234 -2 +1, mailed)
        + REVHASH v2 (CL 1234 -2 +1, mailed)
        + REVHASH v1 (CL 1234 -2 +1, mailed)
        + REVHASH msg (CL 1234 -2 +1, mailed, submitted)

2021-11-05T21:18:28-39ade5b-4c7cafd/dragonfly-amd64

    pending_test.go:396: invalid pending output:
        work REVHASH..REVHASH (current branch)
        + REVHASH v14
        + REVHASH v13
        + REVHASH v12
        + REVHASH v11
        + REVHASH v10
        + REVHASH v9
        + REVHASH v8
        + REVHASH v7
        + REVHASH v6
        + REVHASH v5
        + REVHASH v4
        + REVHASH v3
        + REVHASH v2
        + REVHASH v1
        + REVHASH msg (CL 1234 -2 +1, mailed, submitted)
        
        
        want:
        work REVHASH..REVHASH (current branch, all mailed)
        + REVHASH v14 (CL 1234 -2 +1, mailed)
        + REVHASH v13 (CL 1234 -2 +1, mailed)
        + REVHASH v12 (CL 1234 -2 +1, mailed)
        + REVHASH v11 (CL 1234 -2 +1, mailed)
        + REVHASH v10 (CL 1234 -2 +1, mailed)
        + REVHASH v9 (CL 1234 -2 +1, mailed)
        + REVHASH v8 (CL 1234 -2 +1, mailed)
        + REVHASH v7 (CL 1234 -2 +1, mailed)
        + REVHASH v6 (CL 1234 -2 +1, mailed)
        + REVHASH v5 (CL 1234 -2 +1, mailed)
        + REVHASH v4 (CL 1234 -2 +1, mailed)
        + REVHASH v3 (CL 1234 -2 +1, mailed)
        + REVHASH v2 (CL 1234 -2 +1, mailed)
        + REVHASH v1 (CL 1234 -2 +1, mailed)
        + REVHASH msg (CL 1234 -2 +1, mailed, submitted)

2021-10-06T23:01:01-39ade5b-4ffa2f1/dragonfly-amd64

    pending_test.go:260: invalid pending output:
        work REVHASH..REVHASH (current branch, 1 behind)
        + REVHASH
        	msg
        
        	Change-Id: I123456789
        
        	Files in this change:
        		file
        
        
        want:
        work REVHASH..REVHASH (current branch, all mailed, all submitted, 1 behind)
        + REVHASH http://127.0.0.1:PORT/1234 (mailed, submitted)
        	msg
        
        	Change-Id: I123456789
        
        	Code-Review:
        		+1 Grace Emlin
        		-2 George Opher
        	Other-Label:
        		+2 The Owner
        	Files in this change:
        		file
@gopherbot gopherbot added this to the Unreleased milestone Jan 12, 2022
@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 12, 2022
@bcmills bcmills modified the milestones: Unreleased, Backlog Jan 12, 2022
@bcmills
Copy link
Contributor Author

bcmills commented Apr 7, 2022

greplogs --dashboard -md -l -e 'invalid pending output' --since=2022-01-13

2022-01-21T00:39:55-b04b288-32636cd/netbsd-386-9_0

@bcmills
Copy link
Contributor Author

bcmills commented Apr 7, 2022

Marking as release-blocker via #11811.

Although this hasn't failed on any first-class ports yet, the diversity of ports on which it has occurred (NetBSD, Dragonfly, and Solaris so far) suggests that it is not a platform-specific failure mode.

@bcmills bcmills modified the milestones: Backlog, Go1.19 Apr 7, 2022
@rsc
Copy link
Contributor

rsc commented Apr 11, 2022

That test is using a fake Gerrit server (type gerritServer).
The map it serves from is protected by a sync.Mutex.

The fake server does run on a real localhost network port.
I wonder if we are seeing the effect of flaky network dialing.
I can try dialing a second time if the first one fails, but is that a good fix?

@gopherbot
Copy link

Change https://go.dev/cl/399626 mentions this issue: git-codereview: try Gerrit HTTP requests three times

@gopherbot
Copy link

Change https://go.dev/cl/399116 mentions this issue: git-codereview: print about all Gerrit API errors

gopherbot pushed a commit to golang/review that referenced this issue Apr 14, 2022
Trying to chase down a problem with slow builders.
If this gets to be too chatty, we can limit it to builders.

For golang/go#50576.

Change-Id: I16c7818153d2444c897b4946ce7baebbd646ab23
Reviewed-on: https://go-review.googlesource.com/c/review/+/399116
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
@bcmills
Copy link
Contributor Author

bcmills commented Apr 22, 2022

#49899 may be related; one of those failures also triggered the new logging. (See #49899 (comment).)

@mknyszek
Copy link
Contributor

@rsc What are the next steps here? This is still a release blocker because it's causing builder failures.

@dmitshur dmitshur added the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label Jun 1, 2022
@rsc
Copy link
Contributor

rsc commented Jun 1, 2022

Does anyone have links to recent builder failures (since CL 399116 landed)? Thanks.

@bcmills
Copy link
Contributor Author

bcmills commented Jun 1, 2022

greplogs -l -e 'invalid pending output' --since=2022-04-15
2022-05-10T21:56:21-ae62a0b-bda9da8/netbsd-amd64-9_0
2022-05-06T21:07:37-ae62a0b-ac39dbd/dragonfly-amd64

@gopherbot
Copy link

Change https://go.dev/cl/409914 mentions this issue: git-codereview: print more about Gerrit API errors

gopherbot pushed a commit to golang/review that referenced this issue Jun 1, 2022
For golang/go#50576.

Change-Id: I3b67234098cbf9cd28bfa578c0b29744a4add2a3
Reviewed-on: https://go-review.googlesource.com/c/review/+/409914
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
@rsc
Copy link
Contributor

rsc commented Jun 1, 2022

I updated my fetchlogs and greplogs and did

fetchlogs -n 1000 -repo review
fetchlogs -n 1000 
greplogs -l -e 'invalid pending output' 

and am not getting any hits. I'm holding it wrong, but how?

@bcmills
Copy link
Contributor Author

bcmills commented Jun 1, 2022

Hmm, I'm not sure! Might need a larger -n, since the last failure was a few weeks ago?
(The semantics of -n are unfortunately kind of fuzzy, I think.)

There may also be a bug in cmd/coordinator where it cuts off all but the most recent k (= 20?) Go commits per subrepo commit, which could have pushed the failures off the end.

@rsc
Copy link
Contributor

rsc commented Jun 1, 2022

Found one from earlier today with a different pattern. (Too much earlier today unfortunately.) Fingers crossed.

% greplogs -l -e 'git-codereview: fetch'
`greplogs -l -e 'git-codereview: fetch'`
[2022-06-01T02:37:46-ae62a0b-8a56c77/netbsd-arm64-bsiegert](https://build.golang.org/log/f8365942ffcaa04564c4a72430f98bb4eb996a35)
% 

@bcmills
Copy link
Contributor Author

bcmills commented Jun 1, 2022

Yep, that one matches #49899, which may or may not share a root cause (not obvious to me either way).

@rsc
Copy link
Contributor

rsc commented Jun 1, 2022

Losing the connection would explain both 'malformed json response' and 'connection reset by peer', so probably these are the same root cause. Of course why a localhost connection would get cut is a mystery.

Looks like the affected systems are dragonfly, netbsd, openbsd, solaris. It's possible the minor BSDs are hitting some common problem that FreeBSD has since fixed (either on the Go side or in the OS). But it's weird to see Solaris in that list. My understanding is that Solaris has a SVR4-derived networking stack, which would be totally different code. Looking forward to seeing the malformed bodies.

@gopherbot gopherbot removed the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label Jun 10, 2022
@rsc
Copy link
Contributor

rsc commented Jun 15, 2022

FWIW I did a fetchlogs / greplogs 'git-codereview: fetch' and don't see any new failures in the last two weeks. Also nothing on https://build.golang.org/?repo=golang.org%2fx%2freview. Will have to keep waiting.

@dmitshur dmitshur added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Jun 15, 2022
@ianlancetaylor ianlancetaylor modified the milestones: Go1.19, Go1.20 Jun 24, 2022
@ianlancetaylor
Copy link
Contributor

Rolling forward to 1.20. Please comment if you disagree. Thanks.

@gopherbot
Copy link

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@dmitshur
Copy link
Contributor

I think we're willing to wait a bit longer here, reopening.

@dmitshur dmitshur reopened this Jul 15, 2022
@dmitshur dmitshur reopened this Jul 15, 2022
@rsc
Copy link
Contributor

rsc commented Oct 13, 2022

I think this stopped happening, but a variant just happened on #56221. Closing this as a duplicate of #56221.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Projects
None yet
Development

No branches or pull requests

7 participants