x/website: elevated rate of 502s #30619

Closed
dmitshur opened this issue Mar 6, 2019 · 6 comments
Labels
FrozenDueToAge, NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.)
Milestone

Comments

@dmitshur
Contributor

dmitshur commented Mar 6, 2019

I am seeing a relatively high frequency of 502s on the golang.org website:

$ time curl 'https://golang.org/_ah/health'
ok
real	0m0.209s
user	0m0.025s
sys	0m0.010s
$ time curl 'https://golang.org/_ah/health'
ok
real	0m0.142s
user	0m0.025s
sys	0m0.007s
$ time curl 'https://golang.org/_ah/health'
ok
real	0m0.141s
user	0m0.025s
sys	0m0.007s
$ time curl 'https://golang.org/_ah/health'

<html><head>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<title>502 Server Error</title>
</head>
<body text=#000000 bgcolor=#ffffff>
<h1>Error: Server Error</h1>
<h2>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.</h2>
<h2></h2>
</body></html>

real	0m9.192s
user	0m0.017s
sys	0m0.006s
$ time curl 'https://golang.org/_ah/health'
ok
real	0m0.438s
user	0m0.022s
sys	0m0.007s
$ time curl 'https://golang.org/_ah/health'
ok
real	0m0.142s
user	0m0.024s
sys	0m0.007s
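
The manual check above can be repeated in a loop to get a rough count of how often the 502 shows up. This is a minimal sketch, assuming a POSIX-style shell with curl and awk available; `probe_once` and `tally_codes` are hypothetical helper names, and the 30-second timeout is an arbitrary choice rather than anything taken from this report:

```shell
# Sketch only: probe_once and tally_codes are hypothetical helpers.

# Print "<status> <seconds>" for one request.
# curl reports status 000 when no HTTP response was received at all.
probe_once() {
  curl -s -o /dev/null -w '%{http_code} %{time_total}\n' --max-time 30 "$1"
}

# Read "<status> <seconds>" lines on stdin and print one
# "<status> <count>" line per distinct status code.
tally_codes() {
  awk '{ count[$1]++ } END { for (c in count) print c, count[c] }' | sort
}

# Usage (makes real network requests, so it is left commented out):
#   for i in $(seq 1 50); do
#     probe_once 'https://golang.org/_ah/health'
#   done | tally_codes
```

Keeping the tally separate from the probe means the same counting step can be run over saved output from an earlier session.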

I suspect it's what's causing golang/lint#440. /cc @agnivade

It seems to be happening on previously deployed versions of golang.org, not just the current one.

The last error that https://tip.golang.org/_tipstatus ran into is also 502-related, though it occurred when cloning from go.googlesource.com/go:

error=builder.Init: checkout of go: fetch: fatal: unable to access 'https://go.googlesource.com/go/': The requested URL returned error: 502

Maybe it's related to some higher-level component and not the website specifically? I don't have enough information to tell yet. /cc @bradfitz @andybons

@dmitshur dmitshur added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Soon This needs to be done soon. (regressions, serious bugs, outages) labels Mar 6, 2019
@gopherbot gopherbot added this to the Unreleased milestone Mar 6, 2019
@agnivade
Contributor

agnivade commented Mar 6, 2019

Thanks for the quick turnaround, @dmitshur!

@dmitshur
Contributor Author

dmitshur commented Mar 6, 2019

I can reliably get frequent 502s from golang.org and its previously deployed versions right now, but not other sites like tour.golang.org, so it seems contained to the main website from what I can tell so far.

I remember that the last time something like this happened, it was because the website was misconfigured regarding its use of the index, and it ended up pegging the CPU at 100%. That's just for reference; I don't know what's happening now, so this needs more investigation.

Edit: Another recent possibly related issue on my mind is from CL 141718.

@dmitshur
Contributor Author

dmitshur commented Mar 6, 2019

From the GCP console, I see there's now about one 5xx response out of every 50 (so 2%). The elevated 5xx rate began at roughly 9:06 PM Eastern and has stayed at that level since.

Edit: I can also see that CPU usage, memory usage, traffic, etc., are all normal. Only the 99th percentile latency has gone up, at the same time as the 502 rate.
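
The two figures quoted here (share of 5xx responses and 99th percentile latency) can also be estimated from client-side samples. This is a rough sketch, assuming one "<status> <seconds>" line per request on stdin, for example as printed by `curl -w '%{http_code} %{time_total}\n'`. `summarize` is a hypothetical helper, and the nearest-rank percentile is an assumption that may not match how the GCP console computes its number:

```shell
# Sketch only: summarize is a hypothetical helper.
# Reads "<status> <seconds>" lines and prints the request count,
# the 5xx count and rate, and a nearest-rank 99th percentile latency.
summarize() {
  LC_ALL=C sort -k2,2n | awk '
    { n++; t[n] = $2; if ($1 >= 500 && $1 <= 599) errs++ }
    END {
      if (n == 0) { print "no samples"; exit }
      rank = int((n * 99 + 99) / 100)   # ceil(0.99 * n) using integer math
      printf "requests=%d 5xx=%d rate=%.1f%% p99=%ss\n", n, errs, 100 * errs / n, t[rank]
    }'
}
```

With only a few dozen samples the p99 is simply the slowest or second-slowest request, so this is a sanity check against the console, not a replacement for it.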

@dmitshur dmitshur changed the title from "x/website: frequent 502s" to "x/website: elevated rate of 502s" Mar 6, 2019
@dmitshur
Contributor Author

dmitshur commented Mar 6, 2019

I've restarted the instance and it seems better now; I can't reproduce the 502s anymore.

It's still not as smooth as before 9:06 pm; occasional requests to https://golang.org/_ah/health take ~4.8 seconds instead of the usual ~0.150 seconds. So this needs more investigation, but at least golang/lint#440 shouldn't be happening now.

Edit: As of 12:36 am eastern, the rate of 502s and latency have returned to their nominal pre-9:06 pm levels and have stayed there since.

@dmitshur
Contributor Author

dmitshur commented Mar 6, 2019

The immediate issue is resolved, so I'll remove the Soon label.

@dmitshur dmitshur removed the Soon This needs to be done soon. (regressions, serious bugs, outages) label Mar 6, 2019
@dmitshur
Contributor Author

dmitshur commented Mar 8, 2019

We've done more investigation here, and it turned out the root cause was external to the golang.org server. It was a temporary issue affecting another networking component that has since been resolved.

I've watched the golang.org server, and other than the elevated 502 rate during the affected period on Tuesday night (9:06 pm to 12:36 am), the issue has not re-occurred:

image

Closing, since there's nothing more to do here. Huge thanks to @broady for helping investigate and uncovering the source of the problem!

@dmitshur dmitshur closed this as completed Mar 8, 2019
@golang golang locked and limited conversation to collaborators Mar 8, 2020

3 participants