godoc.org: various degrees of service degradation and unavailability on Sunday, January 19, 2020 #36642

Closed
heschi opened this issue Jan 19, 2020 · 21 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@heschi
Contributor

heschi commented Jan 19, 2020

All pages are returning Internal server error.

cc @dmitshur @andybons and maybe @bradfitz ?

@heschi heschi added the Soon This needs to be done soon. (regressions, serious bugs, outages) label Jan 19, 2020
@bradfitz
Contributor

Sorry, I've never touched godoc.org. Not even sure where it lives.

@iamtheyammer

iamtheyammer commented Jan 19, 2020

Appears resolved? No longer working for me.

@esote

esote commented Jan 19, 2020

(There has also been an issue created in golang/gddo#670.)

@hugelgupf
Contributor

cc @broady

@dmitshur
Contributor

dmitshur commented Jan 19, 2020

I can confirm, there is an excessively high frequency of 500 Internal Server Error responses on godoc.org at this time.

I'll look into whether there's something I can do now.

Until this issue is resolved, consider using pkg.go.dev, which also displays documentation for Go packages (see #33654), as a workaround. The pkg.go.dev server was designed to scale to a large number of packages and users better than godoc.org, and should be able to handle higher workloads. /cc @julieqiu
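
The workaround above amounts to swapping the host of a godoc.org link. A minimal Go sketch of that rewrite follows; the function name `toPkgGoDev` is made up for illustration, and it assumes the package path carries over unchanged between the two sites, which may not hold for godoc.org-specific pages or query parameters.

```go
package main

import (
	"fmt"
	"net/url"
)

// toPkgGoDev rewrites a godoc.org package URL to the same path on pkg.go.dev.
// Hypothetical helper: it assumes the path maps one-to-one between the sites.
func toPkgGoDev(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	if u.Host != "godoc.org" {
		return "", fmt.Errorf("not a godoc.org URL: %q", raw)
	}
	u.Host = "pkg.go.dev"
	return u.String(), nil
}

func main() {
	out, err := toPkgGoDev("https://godoc.org/golang.org/x/text/cases")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```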

@dmitshur dmitshur changed the title godoc.org: is down godoc.org: serving a very high frequency of 500 internal server errors Jan 19, 2020
@dmitshur
Contributor

I've taken some measures to address the problem, and the 500s are no longer happening at very high frequency as of 4:45 PM EST (12 minutes ago).

I'll remove the Soon label since this is mitigated. There's more to be done, but that can be done later.

Please let us know if you're still completely unable to reach godoc.org.

@dmitshur dmitshur added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. and removed Soon This needs to be done soon. (regressions, serious bugs, outages) labels Jan 19, 2020
@dmitshur dmitshur changed the title godoc.org: serving a very high frequency of 500 internal server errors godoc.org: serving occasional 500 internal server errors Jan 19, 2020
@iamtheyammer

iamtheyammer commented Jan 19, 2020

I’m still getting “Error: Server Error” on every page. Tested in clean Chrome and Safari.

Edit: just worked

@stevenh
Contributor

stevenh commented Jan 19, 2020

Thanks guys for the quick fix / workaround. I'd be interested to hear what the underlying cause was once it's identified.

@ghost

ghost commented Jan 20, 2020

@dmitshur suggestion worked for me. For example changing:

https://godoc.org/golang.org/x/text/cases

To:

https://pkg.go.dev/golang.org/x/text/cases

but to confirm, no, it's still not working:

$ curl -I https://godoc.org
HTTP/2 500
date: Mon, 20 Jan 2020 00:49:31 GMT
content-type: text/plain; charset=utf-8
content-length: 22
x-cloud-trace-context: 51bdb8438a2551de209987c3971d76af/4301414094728531385;o=1
via: 1.1 google

@dmitshur dmitshur changed the title godoc.org: serving occasional 500 internal server errors godoc.org: is down, serving 500 internal server errors Jan 20, 2020
@dmitshur dmitshur added the Soon This needs to be done soon. (regressions, serious bugs, outages) label Jan 20, 2020
@dmitshur
Contributor

The measures I applied in #36642 (comment) worked for a few hours, but by now the redis backend that godoc.org uses has become completely unresponsive, and we'll need to do more work to get it operational again.

We'll update this issue when the godoc.org service is operational again. Until then, please use pkg.go.dev as a workaround as mentioned above. Sorry for the trouble.

@dmitshur
Contributor

dmitshur commented Jan 20, 2020

Another update. We've restored the redis backend, and so godoc.org should be operating okay now. I'll watch it some more to make sure it continues to be stable.

Edit: The godoc.org server has been serving successfully without any 500s since this comment was originally posted, so I'll update this issue to be resolved.

@dmitshur dmitshur removed the Soon This needs to be done soon. (regressions, serious bugs, outages) label Jan 20, 2020
@dmitshur dmitshur added this to the Unreleased milestone Jan 20, 2020
@dmitshur dmitshur changed the title godoc.org: is down, serving 500 internal server errors godoc.org: various degrees of service degradation and unavailability on Sunday, January 19, 2020 Jan 20, 2020
@theckman
Contributor

theckman commented Jan 20, 2020

@dmitshur when can we expect a post-incident analysis to be completed with a publicly available write-up?

Edit: I should have clarified, I am assuming the issue has been addressed with the site behaving now. I could be wrong.

@ghost

ghost commented Jan 20, 2020

@theckman isn't that a little presumptuous? I would like one as well, but it's not owed to you or anyone.

What makes you think that it is?

@theckman
Contributor

theckman commented Jan 20, 2020

@cup I think it's worth noting that other organizations that support these sorts of systems often have write-ups for their failures, npmjs.org might be a good example. (I can't believe I'm looking to NPM as a good example...) That said, I think "well, others do it" isn't a great answer so let me try and explain why I think it's important.

Ultimately we are putting trust in this organization, as a community, to host and run all of the core infrastructure for our language. I feel there is a need for transparency into their handling of incidents so that we can be confident they are doing right by the community, and so that we can give them feedback where we feel we have misaligned expectations.

As a Site Reliability Engineer, I find it unacceptable when issues are discovered because customers reported them. There is an obligation to:

A. Have automatic discovery of issues.
B. Have a way to communicate the acknowledgement of those issues to the customers, so they are aware on-call engineers are engaged.

From the outside it looks like neither is in place, which would be surprising when you consider the SRE book came from Google. Such incident reports, detailing what contributed to the failure and how long it took to remediate, would shine sunlight on those issues. And if they aren't paging folks to bring Go infrastructure back up when it's failing, are they doing right by the community?

@ghost

ghost commented Jan 20, 2020

@theckman yeah, I get that, you think it's important, I do too.

But why do you feel entitled to demand one? For example, compare what you said:

@dmitshur when can we expect a post-incident analysis to be completed with a
publicly available write-up?

with another option:

Will a post-incident analysis with a publicly available write-up be made
available? If so, it would be much appreciated!

one is a demand, one is a (polite) ask.

There is an obligation to

no, there's not. Are you paying Google or Alphabet? This is open source software, friend. The Golang team and the larger Go community don't have any obligation to you in this regard.

@theckman
Contributor

@cup I'm not demanding one. I asked when we can expect one to be available. A valid answer to that question is "We aren't planning on writing one." In response to that, the only thing I can do is provide context on why I feel it's important to write one.

@theckman
Contributor

@cup Forgot to add: As a Site Reliability Engineer, those are the obligations of my role. I'm not saying they are obligations of Google as an entire company, but they are the expectations I have, given that Google has a strong SRE organization.

@heschi
Contributor Author

heschi commented Jan 20, 2020

To my knowledge, nobody has ever committed to an availability goal for godoc.org. It is operated as a best-effort service, and it is slated to be replaced by pkg.go.dev. @dmitshur took time out of his weekend to fix it, which as far as I'm concerned was beyond the call of duty. I think it would have been nice to at least thank him. I don't think it's fair to assume that a post mortem is owed.

This is quite different from other services, notably proxy/index/sum.golang.org, which are critical dependencies for many Go users. We don't have a published policy for post mortems, but I think if those experienced an outage anything like this scale it would be appropriate to publish one.

If anyone would like to discuss the level of support that these services receive, I think a thread on golang-dev or golang-nuts would be a better forum. Or, if there's a concrete proposal, (e.g. publicize a post mortem for any significant outage) perhaps file an issue. But I don't think this discussion is shaping up to be productive.

@theckman
Contributor

theckman commented Jan 20, 2020

@cup I don't think your comment was appropriate. I've not attacked you, or anyone else here. I am asking questions for the benefit of the community and I'm not going to tolerate being treated like that.

Edit: For posterity, since you edited the message:

Screen Shot 2020-01-20 at 2 16 10 PM

@ghost

ghost commented Jan 20, 2020

@theckman I am sorry for my comment. If you notice carefully I deleted it. If you desire to post screenshots, then that reflects poorly on you as it's off topic for this discussion.

I am asking questions for the benefit of the community

are you though? I hesitate to say it, but it seems from your presumptive wording that you're posting for your own benefit, or perhaps Netflix's benefit.

@theckman
Contributor

theckman commented Jan 20, 2020

@cup I raised the issue [1] because people were in the Gophers Slack trying to figure out what was going on with getting their package documentation. People were confused, and godoc.org is currently the entrypoint we point a lot of newbies to for things. It's pretty critical from the user support perspective. So being 100% honest with you, I am asking on behalf of the Go community and nobody else. If these issues can be acknowledged and made public sooner, it makes providing user support in the community much easier (and it looks better on us too).

Hah, I really wish it was for the benefit of Netflix. It would mean we're writing a lot more Go than Java or NodeJS. I'd selfishly love that, but until then I'll keep dreaming. 😄

I noticed your edit after I sent the message, and so I was in a bit of a tough spot. I'm morally against editing my own messages to change my stance on things, especially in cases where I feel I've put my own foot in my mouth. In this case my message no longer had the relevant context and felt I needed to add it.

Since we've not had a chance to collaborate before, it may be good to share my personal context. I have this stance because I feel my failures should be public/transparent to others, so they can form their own opinions and have an opportunity to not make mistakes I've made. The side effect here is that I try not to edit away things I've had second thoughts about. I may add an edit indicating I was wrong, but my original message stands.

To summarize my own bar for these things: edits for clarifications or typo fixes 👍, but edits to change my message completely / rewrite history 👎.

[1] golang/gddo#670

Edit: Providing it as an edit instead of a separate comment to avoid followers getting hit with another Issue Update Email. Since my motives are being attacked, here is the Slack conversation where I called out raising the issue (right after people were experiencing problems): https://gophers.slack.com/archives/C029RQSEE/p1579455636310300

@ghost

ghost commented Jan 20, 2020

I raised the issue [1] because people were in the Gophers Slack trying to
figure out what was going on with getting their package documentation.

fair enough, but those people are looking for Go package documentation, not a post mortem. You are looking for that.

So even if you were acting in good faith to help those people, getting a post mortem is only going to help you and others looking for that, not for people who just want the documentation.

@theckman
Contributor

theckman commented Jan 20, 2020

@cup Not looking for a post mortem since I'm pretty confident this didn't kill anybody. 😉 My desire for such a retrospective is around this one line:

If these issues can be acknowledged and made public sooner, it makes providing user support in the community much easier (and it looks better on us too).

If we do a retrospective and discover it's not clear who to communicate these issues to and how, that's an extremely valuable learning. If we're able to make changes from that, it would help those of us who are fielding user questions, and pointing them to different Go-related resources. I think a good analogy might be businesses with Customer Support organizations. If there is something going on, most try to provide that sort of context to their support agents so they can better communicate with the people contacting them.

Hey, so https://godoc.org is having issues right now but once it's back up you can get your documentation here https://godoc.org/?example. This status page should go green once the current issues are resolved: https://godoc.org/?exampleStatus. Sorry about that!

Being able to do this would be super nice.

@DamareYoh

DamareYoh commented Jan 20, 2020

It's a little disappointing to me that a simple request for a post-mortem for an outage on a system many people rely on is being met with this level of aggression. There's really no need to accuse people of being entitled or having ulterior motives for asking for one.

Moreover, the ball on this request is pretty much in the Go team's court, and speculating on their answer and talking down to community members for requesting one is not helpful. It's more constructive to see what the Go team says.

Isn't this why we have a CoC?

@ianlancetaylor
Contributor

Let's please everybody take a break from this discussion and see what @dmitshur has to say. Please try to be polite and respectful at all times on this issue tracker. Thanks.

@dmitshur
Contributor

dmitshur commented Jan 23, 2020

The godoc.org service continues to be stable and is not serving unexpected 500 errors anymore, so I'll close this.

The main cause of the 500 errors that were being served on Sunday was that the backing redis server that stores Go package documentation started to misbehave. Many of the redis operations were failing with an "ERR max number of clients reached" error. Almost every page on godoc.org needs to talk to the redis server, and when that operation wasn't successful, a 500 Internal Server Error was rendered.

In #36642 (comment), I made changes to scale down the number of frontend instances of gddo-server in order to reduce the total number of connections to the redis server. This bought some time, as the "ERR max number of clients reached" errors were greatly reduced, but after a few hours (see #36642 (comment)), they started to reoccur continuously. At that point, we needed to restart the redis server for it to become responsive again.
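
A toy model may help show why scaling down bought time without fixing the problem. The Go sketch below assumes every frontend instance starts with the same number of connections and leaks at a constant rate; the function name and all the numbers are invented for illustration and are not measurements from godoc.org.

```go
package main

import "fmt"

// hoursUntilLimit estimates how many hours pass before the fleet's total
// connection count reaches maxClients, assuming each instance starts with
// base connections and leaks leakPerHour more every hour. The result is
// approximate (integer division rounds down). It returns -1 if the model
// never reaches the limit.
func hoursUntilLimit(instances, base, leakPerHour, maxClients int) int {
	if instances <= 0 || leakPerHour <= 0 {
		return -1
	}
	if instances*base >= maxClients {
		return 0
	}
	perInstanceBudget := maxClients / instances
	return (perInstanceBudget - base) / leakPerHour
}

func main() {
	// Hypothetical numbers: a 1000-client limit, 10 connections per
	// instance at start, 2 leaked connections per instance per hour.
	fmt.Println("20 instances:", hoursUntilLimit(20, 10, 2, 1000), "hours")
	fmt.Println(" 5 instances:", hoursUntilLimit(5, 10, 2, 1000), "hours")
}
```

In this model fewer instances push the limit further out, but as long as the leak rate is above zero the limit is still reached eventually.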

The root cause was a redis connection leak that caused the number of redis client connections to grow gradually over the course of the last 6 weeks to an unsustainably high number. There were no measures in place to limit or detect a large number of redis client connections, so we did not notice it sooner. Now that it is under control, I plan to add an alert for that metric so we can detect and resolve any recurrence of this problem before it can cause a service disruption. Edit: The redis connection leak was resolved and an alert was added.

We want to provide a good experience for users viewing Go package documentation. However, I want to reiterate what @heschik said above in #36642 (comment). Even with additional alerting, the godoc.org support continues to be provided on a best-effort basis, because it is slated to be eventually replaced by pkg.go.dev. The pkg.go.dev website can show documentation for specific versions of a Go package, and it has been designed to better handle the number of Go packages that exist today. It is currently and will continue to be staffed by an active on-call rotation.

@golang golang locked and limited conversation to collaborators Jul 20, 2021