Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/pkgsite: @master failed with "could not be found" for an hour #59464

Open
mvdan opened this issue Apr 6, 2023 · 20 comments
Open

x/pkgsite: @master failed with "could not be found" for an hour #59464

mvdan opened this issue Apr 6, 2023 · 20 comments
Assignees
Labels

Comments

@mvdan
Copy link
Member

mvdan commented Apr 6, 2023

What is the URL of the page with the issue?

https://pkg.go.dev/cuelang.org/go@master

What is your user agent?

Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0

Screenshot

image

(this screenshot was for https://pkg.go.dev/cuelang.org/go/cue/load@master, but the root of the module failed too)

What did you do?

Visit the page to view the docs at the latest master commit.

What did you expect to see?

It should work. I've used this many times in the past, and go get cuelang.org/go@master worked too.

What did you see instead?

The error above. This was today at around 10:19 London time, or 09:19 UTC time. We tried multiple times and it kept failing until about 11:00 London time.

I've looked at the pkgsite source code, and it's interesting to see that it fetches its own meta tag via HTTP from cuelang.org, unlike go get which fetches https://proxy.golang.org/cuelang.org/go/@v/master.info. Why is that?

Something else to note here is that our meta tag page redirects; curl -v https://cuelang.org/go?go-get=1 does not show the meta tags, but curl -L -v https://cuelang.org/go?go-get=1 does. This should be fine, as both cmd/go and proxy.golang.org seem to follow redirects. I would imagine and hope that pkgsite does as well, but I would also hope that it would simply talk to proxy.golang.org instead of reimplementing the "fetch meta tags" logic.

@mvdan mvdan added the pkgsite label Apr 6, 2023
@gopherbot gopherbot added this to the Unreleased milestone Apr 6, 2023
@mvdan
Copy link
Member Author

mvdan commented Apr 6, 2023

@peterhellberg mentioned that he requested v0.6.0-0.dev.0.20230406090537-f85172a7916b and that seemed to fix it - unclear if it was a coincidence in terms of timing or not.

@findleyr
Copy link
Contributor

findleyr commented Apr 6, 2023

CC @jamalc

I've looked at the pkgsite source code, and it's interesting to see that it fetches its own meta tag via HTTP from cuelang.org, unlike go get which fetches https://proxy.golang.org/cuelang.org/go/@v/master.info.

IIRC pkgsite does that in order to link to the source.

@myitcv
Copy link
Member

myitcv commented Apr 6, 2023

IIRC pkgsite does that in order to link to the source.

But the existence of the module in the first place should follow the same logic as cmd/go, right?

@findleyr
Copy link
Contributor

findleyr commented Apr 6, 2023

But the existence of the module in the first place should follow the same logic as cmd/go, right?

Premably, yes. I was just explaining why the meta tags were queried.

I looked into this a bit. It looks like the problem is that the proxy timed out around 5am london, and this timeout is not correctly treated as a retryable error:
https://cs.opensource.google/go/x/pkgsite/+/master:internal/frontend/fetch.go;l=455;drc=d37447d5241999cfc88746e21d57f2fdbcef6628

There's a missing return: the status is set to a magic value that causes re-fetch, and then subsequently overwritten to NotFound.

The reason @peterhellberg's request went through is that the pseudoversion did not exist in the version map.

I'll fix.

@myitcv
Copy link
Member

myitcv commented Apr 6, 2023

I'll fix.

Amazing, thanks.

@mvdan
Copy link
Member Author

mvdan commented Apr 10, 2023

FYI, happening again today, as of 09:13 UTC.

@gopherbot
Copy link

Change https://go.dev/cl/482162 mentions this issue: internal/frontend: add missing early return to retry fetch errors

@findleyr
Copy link
Contributor

Hmm, in this case it seems fetching the pseudoversion didn't work.

I manually fetched: https://pkg.go.dev/cuelang.org/go@master.

gopherbot pushed a commit to golang/pkgsite that referenced this issue Apr 12, 2023
A missing early return was causing retryable fetch errors not to be
retried by the frontend.

I spent a little while trying to test this, and it proved too difficult
and invasive. Unfortunately this may be a case where it is not
worthwhile to significantly refactor in order to improve testability.

Updates golang/go#59464

Change-Id: I0293a71d6be5046a4efe8b7ff490e285f912af3e
Reviewed-on: https://go-review.googlesource.com/c/pkgsite/+/482162
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
Reviewed-by: Hyang-Ah Hana Kim <hyangah@gmail.com>
TryBot-Result: kokoro <noreply+kokoro@google.com>
Run-TryBot: Robert Findley <rfindley@google.com>
@findleyr
Copy link
Contributor

This fix is deployed, so I'm going to tentatively marked this as fixed.

Please ping this issue if you are unable to fetch master in the future.

@myitcv
Copy link
Member

myitcv commented Apr 13, 2023

Thanks @findleyr for digging in to investigate, and for the quick fix!

@mvdan
Copy link
Member Author

mvdan commented Apr 14, 2023

Thanks for the fix, @findleyr!

I went to look at the docs again this morning, and once again I see as of 08:40 UTC:

image

I have requested the version a couple of times and refreshed after a few minutes, but I still see the same. It's true that I no longer see the exact same "could not be found" error page from before, so something has changed for sure, but the docs still aren't loading reliably.

Thinking outloud for a second. If the problem is that our vanity import URL or VCS sometimes take a while to respond or even report errors, shouldn't pkgsite continue to serve the last @master version it knew about? Surely that is better than "could not be found" or some other error along the lines of "could not resolve the latest master version".

@mvdan mvdan reopened this Apr 14, 2023
@mvdan
Copy link
Member Author

mvdan commented Apr 14, 2023

Still not loading for me thirty minutes later, so it's not a fluke :)

@findleyr
Copy link
Contributor

Thanks for following up.

Failure mode looks different: the module is being re-attempted now, but failing. I'll investigate.

@gopherbot
Copy link

Change https://go.dev/cl/484736 mentions this issue: internal/worker: proxy timeouts should be retryable

@findleyr
Copy link
Contributor

Oof, there is a lot going on here. Jotting down notes for my own recollection; Dan and Paul: don't feel obligated to follow along.

The previous CL fixed only one of approximately 4 "bugs". Here's the complete picture:

  1. pkgsite refreshes of "master" often timeout, when "master" is not fresh on the proxy.
  2. proxy timeouts were not retryable from the frontend <-- this is what I fixed in https://go.dev/cl/482162
  3. proxy timeouts are not retried on the fetch queue <-- this is what I am fixing in https://go.dev/cl/484736
  4. the master branch is automatically re-queued when users browse documentation@master <-- this is what I missed before
  5. GCP "de-dupes" against deleted/finished tasks for a few hours until they are garbage collected

So what happened is that master was successfully fetched yesterday at 1pm ET, and then at 6:30pm ET someone was browsing documentation@master, which caused master to be re-fetched asynchronously, and this fetch failed. At this point, master was in a broken state, and (1) subsequently re-fetching manually timed-out because the proxy needed to re-evaluate master (again), and then (2) couldn't be re-attempted for some amount of time due to the GCP task queue deduplication.

😮‍💨

To fix this properly, I think we should make proxy timeouts retryable on the queue (as done in CL 484736), and avoid updating the version map for master when a fetch fails. It is better to serve a potentially stale master version (and continue trying to fetch it in the background) than to serve a 404.

@jamalc @hyangah WDYT?

@mvdan
Copy link
Member Author

mvdan commented Apr 14, 2023

That is some investigation :) I wonder, are we really the only ones to have run into this set of bugs? I've used @master on pkgsite with dozens of other modules before, and I don't think I've ever hit it. Although with cuelang.org/go, we do push to master relatively frequently, so perhaps that makes pkgsite more likely to get stuck on fetch errors.

Also, is there any record about our HTML with meta tags or VCS repository taking too long to respond or failing? Because if those failure rates on our end are noticeable, then that's something we would want to fix as well.

@findleyr
Copy link
Contributor

@mvdan it could just be that the cue repo is larger than the typical modules you are browsing, and so times out more frequently.

I expect that most users just browse latest, not master, so it's not that surprising to me that you are the first to have this combination of (1) frequently browsing master, (2) large modules that timeout, and (3) willingness to file an issue.

@mvdan
Copy link
Member Author

mvdan commented Apr 14, 2023

That makes sense, thanks again for the quick and detailed responses.

gopherbot pushed a commit to golang/pkgsite that referenced this issue Apr 14, 2023
When fetching e.g. the master branch, it is common for the proxy client
to timeout as it waits for the branch to be resolved.

This should be a retryable error, as otherwise we end up in a state
where an asynchronous refresh of master actually breaks documentation at
master.

There is still a problem that documentation is broken after the first
failure, but this at least puts us in a recoverable state.

Again, as with CL 482162 this is unfortunately not possible to test in
the current setup without significant refactoring. Such refactoring may
be warranted, but I do not have time to do this now.

Updates golang/go#59464

Change-Id: I321cc2eaabac1d7e052d07efcaadefcac24208ac
Reviewed-on: https://go-review.googlesource.com/c/pkgsite/+/484736
Reviewed-by: Jamal Carvalho <jamal@golang.org>
Run-TryBot: Robert Findley <rfindley@google.com>
TryBot-Result: kokoro <noreply+kokoro@google.com>
@findleyr findleyr self-assigned this Apr 17, 2023
@jamalc jamalc modified the milestones: Unreleased, pkgsite/later Apr 28, 2023
@mvdan
Copy link
Member Author

mvdan commented May 12, 2023

Likely related: I just tried to directly access https://proxy.golang.org/cuelang.org/go/@v/master.info, and the first time (on a cold cache?) it took a solid 15-20s to respond, which I found quite slow. Subsequent requests were instantaneous. Surely it should always return the last master version it saw, even if its cache is a few hours old?

@findleyr
Copy link
Contributor

Yes, what you observed is the retry behavior added in https://go.dev/cl/484736. Before that CL, you would have been served a 404.

The last part of this issue is to do exactly what you suggest: don't update the pointer to master if the fetch failed. That missing piece is why this issue is still open; I just haven't had time to do it (IIRC I got stuck on the testing setup). Not likely to get done this month. I have this issue cued up for next month, but it is a relatively low priority.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants