proxy.golang.org: single intermittent failure of vanity URL causes stale cache #49916

Open
myitcv opened this issue Dec 2, 2021 · 9 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. proxy.golang.org
Milestone

Comments

@myitcv
Member

myitcv commented Dec 2, 2021

What version of Go are you using (go version)?

$ go version
go version devel go1.18-d34051bf16 Thu Dec 2 07:04:05 2021 +0000 linux/arm64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/home/myitcv/.cache/go-build"
GOENV="/home/myitcv/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/myitcv/gostuff/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/myitcv/gostuff"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/myitcv/gos"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/myitcv/gos/pkg/tool/linux_arm64"
GOVCS=""
GOVERSION="devel go1.18-d34051bf16 Thu Dec 2 07:04:05 2021 +0000"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3900600216=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Related to #34370.

Around 2021-12-01 17:54 UTC we cut a new release of CUE, v0.4.1-beta.4.

At the time, the new version could not be resolved via sum.golang.org (https://sum.golang.org/lookup/cuelang.org/go@v0.4.1-beta.4) because of:

not found: cuelang.org/go@v0.4.1-beta.4: unrecognized import path "cuelang.org/go": https fetch: Get "https://cuelang.org/go?go-get=1": dial tcp 68.183.23.220:443: connect: connection refused

i.e. it appears a transient error resolving https://cuelang.org/go?go-get=1 caused sum.golang.org to cache an error.
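
For context on the mechanism that failed: resolving a custom import path means fetching https://cuelang.org/go?go-get=1 and reading a go-import meta tag from the response. A minimal sketch of such a vanity handler (the backing repo URL and port below are illustrative, not cuelang.org's actual setup):

```go
// Minimal sketch of a vanity import server; the repo URL and port are
// illustrative only. If this endpoint refuses a connection even once,
// the error shown above is what cmd/go reports.
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/go", func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Query().Get("go-get") != "1" {
			http.Redirect(w, r, "https://cuelang.org", http.StatusFound)
			return
		}
		// The go-import meta tag tells cmd/go where the code lives.
		fmt.Fprint(w, `<meta name="go-import" content="cuelang.org/go git https://github.com/cue-lang/cue">`)
	})
	log.Fatal(http.ListenAndServe(":8080", nil)) // TLS termination assumed upstream
}
```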

However, sum.golang.org was not returning this error consistently; the response varied by server IP:

# 172.217.169.49
not found: cuelang.org/go@v0.4.1-beta.4: unrecognized import path "cuelang.org/go": https fetch: Get "https://cuelang.org/go?go-get=1": dial tcp 68.183.23.220:443: connect: connection refused

# 142.250.187.241
8200543
cuelang.org/go v0.4.1-beta.4 h1:OQwpzifUIivXoxQK68EukVKnATP90GeelTSHnOW1qfI=
cuelang.org/go v0.4.1-beta.4/go.mod h1:P09/R4UfAEzLkV9DXxwlxQnIZbkaT4uIhiEgs6Vsz2Q=

go.sum database tree
8211598
6KTVEmHHgorOjo5YAT18Gj8CznzxCPxir+IF+SDAZUo=

— sum.golang.org Az3grmjNeN768KqG22Zo7F4Gp6T4entYI8PeYDSiWCjBsITq7p9CiDkiA62ZLw6aaG97j5rHOTesLqWZGVS836655QU=
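
(For anyone wanting to reproduce this kind of per-IP comparison, here is a sketch of one way to pin the lookup request to a specific frontend IP; the helper below is illustrative, not the exact tooling we used:)

```go
// Illustrative helper: query sum.golang.org's lookup endpoint while
// pinning the TCP connection to one specific frontend IP (argv[1]).
// TLS verification still uses the sum.golang.org hostname.
package main

import (
	"context"
	"io"
	"net"
	"net/http"
	"os"
)

func main() {
	ip := os.Args[1] // e.g. 172.217.169.49 or 142.250.187.241
	client := &http.Client{Transport: &http.Transport{
		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
			// Ignore the DNS-resolved address; dial the pinned IP instead.
			return (&net.Dialer{}).DialContext(ctx, network, net.JoinHostPort(ip, "443"))
		},
	}}
	resp, err := client.Get("https://sum.golang.org/lookup/cuelang.org/go@v0.4.1-beta.4")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	io.Copy(os.Stdout, resp.Body) // prints either the error or the signed entry
}
```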

What did you expect to see?

  • my sense from GOSUMDB failing "410 Gone" for publicly-available tag #34370 (comment) was that the sum.golang.org servers should now be more consistent in their responses. Is there some way the cache could be synchronised or invalidated for error responses when it is known, as was the case here, that other instances of sum.golang.org are not in error? Hence I would expect either no cache discrepancies, or at least a much shorter "bad" cache window
  • cmd/go (which I believe is what {proxy,sum}.golang.org use behind the scenes?) failing after a single call to ?go-get=1 also seems a little unfortunate. In this case the effect was exacerbated by the subsequent caching issue, but I wonder if there is some sort of retry logic that should be applied here in the case of connection refused? Not least because the SLA we can generally expect from vanity servers is unlikely to match that of, say, {proxy,sum}.golang.org. Hence, I would have expected us not to get into this situation if a simple retry on connection refused had been attempted (see the sketch below).
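
To make the second point concrete, here is a hypothetical sketch of the kind of narrow retry I have in mind; cmd/go does not do this today, and the backoff numbers are made up:

```go
// Hypothetical retry helper: retry only on "connection refused"
// (a Unix errno check), fail fast on everything else. Backoff values
// are illustrative.
package fetch

import (
	"errors"
	"fmt"
	"net/http"
	"syscall"
	"time"
)

func getWithRetry(url string, attempts int) (*http.Response, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := http.Get(url)
		if err == nil {
			return resp, nil
		}
		lastErr = err
		// Only the narrow connection-refused case is worth retrying;
		// DNS, TLS, and HTTP-level failures return immediately.
		if !errors.Is(err, syscall.ECONNREFUSED) {
			return nil, err
		}
		time.Sleep(time.Second << i) // 1s, 2s, 4s, ...
	}
	return nil, fmt.Errorf("after %d attempts: %w", attempts, lastErr)
}
```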

What did you see instead?

A "long" bad cache on sum.golang.org which then prevented a successful release.

cc @bcmills for cmd/go and @katiehockman for {proxy,sum}.golang.org

Thanks to @mvdan and @seankhliao for helping to debug here.

@myitcv myitcv added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 2, 2021
@seankhliao
Member

cc @heschi

@bcmills
Contributor

bcmills commented Dec 2, 2021

I wonder if there is some sort of retry logic that should be applied here in the case of connection refused?

#28194 is probably relevant for that, but it's tricky: I would expect that an import-path server generally fails due to (say) temporary outages for machine or infrastructure updates more often than nondeterministic errors or short-term network drops. How do we balance retrying flakes vs. stalling on retries that won't succeed any time soon anyway (compare #40376)?

What's more, transparent retries can hide real problems — masking problems generally provides a short-term benefit at a long-term cost. (Compare #42699, and “Postel was wrong” is a good read too.)

@mvdan
Member

mvdan commented Dec 2, 2021

#42809 might also be relevant. I can understand it if we don't want to keep retrying with short intervals.

That said, I think we can do better without wasting resources. For instance, as Paul said, one of the servers saw the version while another didn't. This resulted in inconsistent yet reproducible behavior among users, which we could reproduce for a solid ten minutes. I think that kind of inconsistency could lead to significant confusion in practice; imagine a user reporting to upstream that go install always fails for them, while for upstream it always works.

At a conceptual level, I imagine that as soon as one of the servers sees a valid new version, it could invalidate cached errors from all other servers.
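
To sketch what I mean (all names and TTLs below are hypothetical; I have no knowledge of the mirror's actual design):

```go
// Conceptual sketch only: a per-frontend cache that gives error
// results a much shorter TTL than successes, and broadcasts an
// invalidation when a success supersedes a cached error.
package cache

import (
	"sync"
	"time"
)

type entry struct {
	body []byte
	err  error
	exp  time.Time
}

type frontendCache struct {
	mu      sync.Mutex
	entries map[string]entry
}

func newFrontendCache() *frontendCache {
	return &frontendCache{entries: make(map[string]entry)}
}

// put records a fetch result. broadcast is a hypothetical hook that
// tells other frontends to drop their stale error entry for key.
func (c *frontendCache) put(key string, body []byte, err error, broadcast func(key string)) {
	c.mu.Lock()
	defer c.mu.Unlock()
	ttl := 30 * time.Minute // illustrative positive-entry TTL
	if err != nil {
		ttl = time.Minute // made-up short TTL for negative entries
	}
	if old, ok := c.entries[key]; ok && old.err != nil && err == nil {
		broadcast(key) // a success anywhere invalidates cached errors
	}
	c.entries[key] = entry{body: body, err: err, exp: time.Now().Add(ttl)}
}
```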

@heschi
Contributor

heschi commented Dec 2, 2021

There is only one instance of the mirror, but it is a globally distributed system with multiple layers of caching. We're not trying to guarantee global consistency, so this is working as expected as long as the cache became consistent within half an hour. There is no way to synchronize the global cache.

@myitcv
Member Author

myitcv commented Dec 3, 2021

@bcmills thanks, #28194 and #40376 are indeed relevant, but I think this case is somewhat different: the caller is not in control of the side effect (the ~30 min, albeit temporary, "bad" cache state), yet they, and others, will likely suffer its cost. So for this specific error, one could argue that retrying, and the stalling that results, is worth it.

@myitcv
Member Author

myitcv commented Dec 3, 2021

@heschi thanks for the detail.

Is it always the case that the {proxy,sum} mirror reflects the first fetch of a module, or rather one of the fetches during the 30-minute window that follows the first fetch?

Regardless of whether cmd/go is changed in some way to retry, is there merit in having a shorter cache invalidation period for this specific error (apologies if I'm not using the correct terminology)? Again, the reasoning is that the caller, all else being equal, is very likely not in control of this category of error.

@heschi
Contributor

heschi commented Dec 3, 2021

The only guarantee we attempt to make is that users see data that's less than half an hour old. There is no concept of a "first" fetch in the system, so if the mirror does multiple fetches within a 30-minute interval, users may see any of the results in the interim.

The go command doesn't provide any kind of structured output for fetch errors, and the mirror has so far avoided parsing error text. I don't think this is a critical enough case to start doing that.

@myitcv
Member Author

myitcv commented Dec 6, 2021

There is no concept of a "first" fetch in the system, so if the mirror does multiple fetches within a 30-minute interval, users may see any of the results in the interim.

Not to question your explanation, but rather to share my understanding up to this point, which has been that the first fetch was the significant one. However, it sounds like there are some scenarios where {proxy,sum}.golang.org can settle on a later version that differs from the one seen at the first fetch, e.g. following a tag move?

The go command doesn't provide any kind of structured output for fetch errors, and the mirror has so far avoided parsing error text. I don't think this is a critical enough case to start doing that.

I'll defer on this point, because the only data points I am aware of come from my own issue report.

@heschi
Contributor

heschi commented Dec 6, 2021

Not all fetches are successful, as in this issue. And not all names stay the same forever, e.g. branches. But we are now way off topic for this issue.

@seankhliao seankhliao changed the title {proxy,sum}.golang.org: single intermittent failure of vanity URL causes stale cache proxy.golang.org: single intermittent failure of vanity URL causes stale cache Aug 20, 2022
@seankhliao seankhliao added this to the Unplanned milestone Aug 20, 2022