all: Gerrit having availability issues #30690
Comments
Hi there, |
This has been happening since yesterday afternoon PST. We're also affected. Not to beat a dead horse, but I want to emphasize that this has the same impact as npmjs.com downtime would for nearly every nodejs project in the world. |
Same with us. Our presubmit tests are very flaky because of that.
|
I feel like this should be relatively high priority, right? |
The same here for Go 1.11.x in our travis CI pipelines
|
@andybons any updates? |
@bcmills @jayconrod - a drive-by thought, but given that tools is now a proper module, would it make sense to have |
This issue also affects go 1.11, where modules are experimental and we are not using them. |
We are hitting this issue as well since Friday. We have had to resort to building locally and then publishing to GCR manually, but this is not an acceptable workaround. |
Thanks for the reports. We are looking into the issue with Gerrit availability, and will post updates here. |
@myitcv We can't do that at this time because there aren't tagged releases of the modules yet. See this comment that describes the current strategy for adding go.mod files to subrepositories. |
@dmitshur unless I'm mistaken, this approach will work for pseudo-versions as well as non-pseudo-versions (i.e. the tagged releases you are referring to). All that's required here would be a bot of some sort to publish the pseudo-versions for each new commit. I previously did something similar for my domain, with the module versions published to https://github.com/myitcv/pubx, and served via https://raw.githubusercontent.com |
Gerrit does not seem reliable here, and this is affecting most of the Golang community. Perhaps an option is to return
Return:
|
@andybons Is this the place where we can go to get the latest updates on this issue? This is impacting us internally at Netflix, and others in the larger Go community. I'd like to make sure I'm linking people to the right place. As we're moving towards more critical components for the ecosystem being hosted by Google, I think it's worth calling out that the handling of issues like this is going to be what we use to make judgement calls regarding the trust we have in Google to operate future things such as the Notary. An issue not being handled well, such as through a lack of transparent communication, is one thing that community members will remember going forward. Experiencing intermittent issues for days, with minimal communication from the team, does not feel like the level of operational maturity we will expect from hosted Module infrastructure. How can I help us get into a position of better communication so that we can have that trust? |
I second @theckman's statement on trust here - as we move towards a centralized module repository, we want to be able to trust Google's infrastructure here to be reliable for critical parts of the Golang ecosystem. Gerrit being in the critical path, if unreliable, is part of this - it's my belief we need a postmortem here, but just my two cents. |
@peter-edge let's aim for an Incident Review; nobody died. 😉 |
Right now, as a hotfix, literally having the |
Yes, this is the issue tracking the Gerrit availability problem. It has a Soon label and we're actively working on resolving it; we'll post updates as they happen. |
@dmitshur Thank you for the update on the current progress, we all really appreciate it! Is it possible to add a separate label for "Actively Engaged" or something similar, as "Soon" doesn't really communicate that piece (at least it didn't to me)? If we intentionally use "Soon" to mean either, maybe we could enhance the description to make it clear that it includes imminent activity or current activity. I know you said that you'll post updates as they happen, but I wonder what we've been able to learn in 3 days. Are you able to share any impact estimates, impact start times, etc. to help people understand when/how it may have impacted them? Is this a transient thing, where it had stabilized and then became unstable again? Do you have an ETA for when we'd expect to see stability, or at least for when we'd expect to get an update from you around an ETA? Like with Google Cloud outages, can we get a commitment from the team to post regular updates (even if it's just to confirm it's still being worked on, and there's no ETA)? It helps us know that people are still engaged, and avoids hours of uncomfortable silence. Regular could be every 3, 6, or whatever hours. Sorry, got my SRE hat on a little bit here. 😄 |
Also just again to note - we can immediately and effectively mitigate this problem while Gerrit's overarching issues are investigated and fixed by switching the |
I don't know what is happening, but I believe that the Gerrit team fixed the problem on Friday, and now, on Monday, they are encountering the same problem or a different one with similar symptoms. |
An update from the Gerrit team: they believe today's outage ended at 8:50 PM EST (36 minutes before this comment was posted). If you're still experiencing problems after that time, please let us know so we can continue to look into it. |
Today's incident seems to be resolved by now, so I'm going to remove the Soon label now. I won't close the issue yet for visibility, and because there are follow-up actions from this that we'll want to consider. (For now, I've added "outages" to the description of the label Soon. We can see if more can be done to improve the label as part of follow-up steps.) |
@dmitshur @ianlancetaylor When should we anticipate seeing a review of the incident, including details like impact window timelines, estimated impacts (percentage of requests failing, for example), and what happened as well as what's being done to prevent it? |
I think @andybons is planning to write something here after he gets the information from the Gerrit team. We are also encouraging the Gerrit team to have a public dashboard for status and outages. |
Ian is correct. I’m working with the Gerrit team and will update this thread once we have something to share. |
Hi all, To give a broad-strokes update on what happened: an errant task run by another team within Google caused one of Gerrit’s backend services to become overwhelmed, resulting in elevated error rates, including 502s and “Repository not found” errors, for a subset of users. The underlying cause was that routing was not configured to take differences in request cost into account, so it concentrated expensive requests in a small number of tasks. The issue has been remediated and a thorough internal post-mortem has been written. We have asked the Gerrit team to provide a public dashboard to better communicate these issues and will leave it to them to provide more extensive details on timeline, remediation steps, and follow-up items surrounding this particular outage. /cc @jrn from the Gerrit team. |
What version of Go are you using (go version)?
Also using my browser, https://go.googlesource.com/text/+/master
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
go env Output
What did you do?
I visited https://go.googlesource.com/text/+/master and also tried to go get golang.org/x/text, and it failed:
go get golang.org/x/text
What did you expect to see?
A working call to go get.
What did you see instead?
My CI builds repeatedly fail.