Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/build/maintner/maintnerd: Gerrit events are slow to pick up #35293

Open
bradfitz opened this issue Oct 31, 2019 · 6 comments
Open

x/build/maintner/maintnerd: Gerrit events are slow to pick up #35293

bradfitz opened this issue Oct 31, 2019 · 6 comments
Labels
Builders x/build issues (builders, bots, dashboards) NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@bradfitz
Copy link
Contributor

I've noticed a huge delay lately starting TryBots.

Something's stuck or being slow in the pubsubhelper and/or maintner Gerrit code.

@gopherbot gopherbot added this to the Unreleased milestone Oct 31, 2019
@gopherbot gopherbot added the Builders x/build issues (builders, bots, dashboards) label Oct 31, 2019
@bradfitz bradfitz changed the title x/build/cmd/maintner: Gerrit events are slow to pick up x/build/maintner/maintnerd: Gerrit events are slow to pick up Nov 1, 2019
@bradfitz
Copy link
Contributor Author

bradfitz commented Nov 1, 2019

I watched the logs and left a test comment on Gerrit:

2019/11/01 01:51:11 Got pubsubhelper event: {
        "Time": "2019-11-01T01:51:11.080801251Z",
        "Gerrit": {
                "URL": "https://go-review.googlesource.com/c/go/+/201203",
                "Project": "go",
                "CommitHash": "b4998a16ed8f74a4cca6eb8256b700b082f025b2",
                "ChangeNumber": 201203
        }
}
2019/11/01 01:51:11 Pubsub event on topic "gerrit:go.googlesource.com/go" discarded; already syncing?
2019/11/01 01:51:18 Got pubsubhelper event: {
        "Time": "2019-11-01T01:51:18.111203794Z",
        "Gerrit": {
                "URL": "https://go-review.googlesource.com/c/go/+/204617",
                "Project": "go",
                "CommitHash": "57417ba5b809f01d4a91f2efab01e69ce7ce755e",
                "ChangeNumber": 204617
        }
}
2019/11/01 01:51:18 Pubsub event on topic "gerrit:go.googlesource.com/go" discarded; already syncing?

The pubsub event arrived immediately. The questionable line is:

2019/11/01 01:51:18 Pubsub event on topic "gerrit:go.googlesource.com/go" discarded; already syncing?

What's it's supposed to do is poll regularly, but if it gets a pubsub event, it wakes up the poller. But if it's already polling when it gets a pubsub event, it ignores it.

It seems we're stuck in a poll (forever? usually?) and so when we get the pubsub events we just drop them.

Next up is figure out why we're stuck updating from Gerrit.

/cc @golang/osp-team

@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 1, 2019
@gopherbot
Copy link

Change https://golang.org/cl/205860 mentions this issue: maintner: add tracing on 'git fetch origin'

@bradfitz
Copy link
Contributor Author

bradfitz commented Nov 7, 2019

I deployed maintnerd with 46182bf412d81cd9eadccb30ca9f451a58edd4fc (from https://go-review.googlesource.com/c/build/+/205918/1) and I'm watching logs now.

It also updated to Debian Buster w/ a newer git, which might even fix the problem, but let's see...

$ kubectl logs -f $(kubectl get po -o name -l app=maintnerd) | grep " in " | grep gerrit | grep -v mutation:
2019/11/07 22:37:19 gerrit go.googlesource.com/arch: ran git ls-remote in 890ms
2019/11/07 22:37:19 gerrit go.googlesource.com/tour: ran git ls-remote in 892ms
2019/11/07 22:37:19 gerrit go.googlesource.com/dl: ran git ls-remote in 856ms
2019/11/07 22:37:19 gerrit go.googlesource.com/talks: ran git ls-remote in 920ms
2019/11/07 22:37:19 gerrit go.googlesource.com/perf: ran git ls-remote in 842ms 
2019/11/07 22:37:19 gerrit go.googlesource.com/sync: ran git ls-remote in 830ms 
2019/11/07 22:37:19 gerrit go.googlesource.com/sublime-config: ran git ls-remote in 797ms 
2019/11/07 22:37:19 gerrit go.googlesource.com/image: ran git ls-remote in 932ms 
2019/11/07 22:37:19 gerrit go.googlesource.com/mod: ran git ls-remote in 697ms
2019/11/07 22:37:19 gerrit go.googlesource.com/lint: ran git ls-remote in 724ms 
2019/11/07 22:37:19 gerrit go.googlesource.com/gofrontend: ran git ls-remote in 866ms 
2019/11/07 22:37:19 gerrit go.googlesource.com/sublime-build: ran git ls-remote in 616ms 
2019/11/07 22:37:19 gerrit go.googlesource.com/xerrors: ran git ls-remote in 770ms 
2019/11/07 22:37:20 gerrit go.googlesource.com/protobuf: ran git ls-remote in 769ms
2019/11/07 22:37:20 gerrit go.googlesource.com/scratch: ran git ls-remote in 571ms 
2019/11/07 22:37:20 gerrit go.googlesource.com/website: ran git ls-remote in 617ms 
2019/11/07 22:37:20 gerrit go.googlesource.com/vgo: ran git ls-remote in 882ms 
2019/11/07 22:37:20 gerrit go.googlesource.com/exp: ran git ls-remote in 724ms 
2019/11/07 22:37:20 gerrit go.googlesource.com/blog: ran git fetch origin in 2.217s 
2019/11/07 22:37:20 gerrit code.googlesource.com/gocloud: ran git ls-remote in 1.037s 
2019/11/07 22:37:20 gerrit go.googlesource.com/oauth2: ran git ls-remote in 669ms
2019/11/07 22:37:20 gerrit code.googlesource.com/google-api-go-client: ran git ls-remote in 699ms 
2019/11/07 22:37:20 gerrit go.googlesource.com/mobile: ran git ls-remote in 683ms 
2019/11/07 22:37:20 gerrit go.googlesource.com/net: ran git ls-remote in 648ms 
2019/11/07 22:37:20 gerrit go.googlesource.com/playground: ran git ls-remote in 1.159s 
2019/11/07 22:37:20 gerrit go.googlesource.com/build: ran git ls-remote in 615ms 
2019/11/07 22:37:20 gerrit go.googlesource.com/tools: ran git ls-remote in 487ms 
2019/11/07 22:37:20 gerrit go.googlesource.com/blog: ran git ls-remote in 253ms 
2019/11/07 22:37:20 gerrit go.googlesource.com/blog: fetched 5 hashes in 299ms 
2019/11/07 22:37:20 gerrit go.googlesource.com/build: fetched 2 hashes in 394ms
2019/11/07 22:37:21 gerrit go.googlesource.com/go: ran git fetch origin in 3.886s
2019/11/07 22:37:22 gerrit go.googlesource.com/go: ran git ls-remote in 1.114s 
2019/11/07 22:37:34 gerrit go.googlesource.com/go: fetched 16 hashes in 11.693s

gopherbot pushed a commit to golang/build that referenced this issue Nov 7, 2019
Also increase its timeout and terminate it with a friendlier signal
and add some logging of how long git operations took.

Updates golang/go#35293
Updates golang/go#35124

Change-Id: I1ed466d872a11f60751953ef5274be96cea0294b
Reviewed-on: https://go-review.googlesource.com/c/build/+/205860
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
@bradfitz
Copy link
Contributor Author

bradfitz commented Nov 7, 2019

The fetch operations seem the slowest:

$ kubectl logs -f $(kubectl get po -o name -l app=maintnerd) | grep "fetch" | grep hash | grep -v mutation:
2019/11/07 22:37:23 gerrit go.googlesource.com/go: fetching 16 hashes...
2019/11/07 22:37:34 gerrit go.googlesource.com/go: fetched 16 hashes in 11.693s
2019/11/07 22:41:46 gerrit go.googlesource.com/build: fetching 1 hashes...
2019/11/07 22:41:48 gerrit go.googlesource.com/build: fetched 1 hashes in 1.129s
2019/11/07 22:42:19 gerrit go.googlesource.com/go: fetching 1 hashes...
2019/11/07 22:42:22 gerrit code.googlesource.com/gocloud: fetching 1 hashes...
2019/11/07 22:42:23 gerrit code.googlesource.com/gocloud: fetched 1 hashes in 1.164s
2019/11/07 22:42:26 gerrit go.googlesource.com/go: fetched 1 hashes in 6.385s                                                                     2019/11/07 22:44:25 gerrit go.googlesource.com/go: fetching 2 hashes...
2019/11/07 22:44:29 gerrit go.googlesource.com/go: fetched 2 hashes in 4.601s
2019/11/07 22:44:44 gerrit go.googlesource.com/tools: fetching 1 hashes...
2019/11/07 22:44:45 gerrit go.googlesource.com/tools: fetched 1 hashes in 1.333s
2019/11/07 22:44:57 gerrit go.googlesource.com/go: fetching 4 hashes...
2019/11/07 22:45:02 gerrit go.googlesource.com/go: fetched 4 hashes in 5.027s
2019/11/07 22:45:09 gerrit go.googlesource.com/go: fetching 2 hashes...
2019/11/07 22:45:13 gerrit go.googlesource.com/go: fetched 2 hashes in 4.011s
2019/11/07 22:45:20 gerrit go.googlesource.com/go: fetching 2 hashes...
2019/11/07 22:45:24 gerrit go.googlesource.com/go: fetched 2 hashes in 3.518s
2019/11/07 22:48:24 gerrit go.googlesource.com/go: fetching 2 hashes...
2019/11/07 22:48:27 gerrit go.googlesource.com/go: fetched 2 hashes in 3.64s
2019/11/07 22:48:33 gerrit go.googlesource.com/go: fetching 1 hashes...
2019/11/07 22:48:37 gerrit go.googlesource.com/go: fetched 1 hashes in 3.605s
2019/11/07 22:52:41 gerrit go.googlesource.com/build: fetching 1 hashes...
2019/11/07 22:52:42 gerrit go.googlesource.com/build: fetched 1 hashes in 456ms
2019/11/07 22:53:40 gerrit go.googlesource.com/go: fetching 1 hashes...
2019/11/07 22:53:44 gerrit go.googlesource.com/go: fetched 1 hashes in 4.431s
2019/11/07 22:55:01 gerrit go.googlesource.com/build: fetching 1 hashes...
2019/11/07 22:55:01 gerrit go.googlesource.com/build: fetched 1 hashes in 553ms
2019/11/07 22:55:51 gerrit go.googlesource.com/go: fetching 1 hashes...
2019/11/07 22:55:55 gerrit go.googlesource.com/go: fetched 1 hashes in 4.028s
2019/11/07 22:57:49 gerrit go.googlesource.com/go: fetching 1 hashes...
2019/11/07 22:57:53 gerrit go.googlesource.com/go: fetched 1 hashes in 3.919s

I wonder if the problem is that because we drop pubsub wake-ups if a sync's already in progress, with the assumption that the current sync will likely pick up the change that we were just notified about, that assumption is rarely valid on these slow operations.

If we're fetching a ref for 11 seconds, a lot of comments can come in then and get dropped. Then we finish and resort to polling many minutes later.

We're on a pubsub wake-up we should set a bit saying "stuff happened" and when a sync's done and stuff happened meanwhile, sync immediately again until we get to a clean state where no pubsub events happened meanwhile and the ls-remote refs are unchanged.

That's my current theory, but I forget most of this code.

@bcmills
Copy link
Contributor

bcmills commented Nov 8, 2019

we drop pubsub wake-ups if a sync's already in progress, with the assumption that the current sync will likely pick up the change that we were just notified about

I was reading through the code and I thought we only drop the wakeup if the “wakeup“ channel's buffer is already full — that is, if we're going to wake up again at the end of the current poll anyway. Did I misread?

@bradfitz
Copy link
Contributor Author

bradfitz commented Nov 8, 2019

Did I misread?

You read, which is a step ahead of me, so you're probably right. I was looking at logs only and trying to remember the code.

codebien pushed a commit to codebien/build that referenced this issue Nov 13, 2019
Also increase its timeout and terminate it with a friendlier signal
and add some logging of how long git operations took.

Updates golang/go#35293
Updates golang/go#35124

Change-Id: I1ed466d872a11f60751953ef5274be96cea0294b
Reviewed-on: https://go-review.googlesource.com/c/build/+/205860
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Builders x/build issues (builders, bots, dashboards) NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

4 participants