Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/go: test reported as cached, but not actually cached #41593

Open
nick-jones opened this issue Sep 23, 2020 · 5 comments
Open

cmd/go: test reported as cached, but not actually cached #41593

nick-jones opened this issue Sep 23, 2020 · 5 comments
Labels
GoCommand cmd/go NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. ToolSpeed
Milestone

Comments

@nick-jones
Copy link

nick-jones commented Sep 23, 2020

What version of Go are you using (go version)?

$ go version
go version go1.15.2 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/user/.cache/go-build"
GOENV="/home/user/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/user/.go_workspace/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/user/.go_workspace:/usr/local/go_workspace"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/user/project/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build013192427=/tmp/go-build -gno-record-gcc-switches"

What did you do?

We have a large(ish) cache built up from CI builds (approx 2GB in size). We've seen tests slow down fairly dramatically, and it seems that the underlying cause is (possibly partial) cache misses resulting in at least some of the test process to be run every single time (in particular I see the linker being invoked).

$ time GODEBUG=gocachetest=1 go test -race ./service/
testcache: github.com/foo/bar/service: test ID 5277766d4236333530556841686b797248445250 => 32e67a3c459a396178c3acafc87928bc9fc94b28d20fa9763dfc1fac9925e9a1
testcache: github.com/foo/bar/service: input list not found: cache entry not found: open /home/user/.cache/go-build/32/32e67a3c459a396178c3acafc87928bc9fc94b28d20fa9763dfc1fac9925e9a1-a: no such file or directory
testcache: github.com/foo/bar/service: test ID 484f364f314d6d6c647371774138555430696467 => ce7e1c46d67997e3d0b493d985ea92e199bad7b2fc4a97e7b405c87a92ac5d18
testcache: github.com/foo/bar/service: test ID ce7e1c46d67997e3d0b493d985ea92e199bad7b2fc4a97e7b405c87a92ac5d18 => input ID de8952dc6e704af61a533169fd1be1877a6c2778382c4f9a0828a177db9401ac => 74e40de3953e4347471748ef1e1ff4aacfa7380968c99f0baa14b0b7b1d24dae
ok      github.com/foo/bar/service  (cached)

real    0m2.185s
user    0m2.687s
sys     0m0.387s

$ time GODEBUG=gocachetest=1 go test -race ./service/
testcache: github.com/foo/bar/service: test ID 5277766d4236333530556841686b797248445250 => 32e67a3c459a396178c3acafc87928bc9fc94b28d20fa9763dfc1fac9925e9a1
testcache: github.com/foo/bar/service: input list not found: cache entry not found: open /home/user/.cache/go-build/32/32e67a3c459a396178c3acafc87928bc9fc94b28d20fa9763dfc1fac9925e9a1-a: no such file or directory
testcache: github.com/foo/bar/service: test ID 484f364f314d6d6c647371774138555430696467 => ce7e1c46d67997e3d0b493d985ea92e199bad7b2fc4a97e7b405c87a92ac5d18
testcache: github.com/foo/bar/service: test ID ce7e1c46d67997e3d0b493d985ea92e199bad7b2fc4a97e7b405c87a92ac5d18 => input ID de8952dc6e704af61a533169fd1be1877a6c2778382c4f9a0828a177db9401ac => 74e40de3953e4347471748ef1e1ff4aacfa7380968c99f0baa14b0b7b1d24dae
ok      github.com/foo/bar/service  (cached)

real    0m2.207s
user    0m2.786s
sys     0m0.397s

I'm not familiar with what it is caching, or indeed how the caching works, but it seems removing the cache entry that is being hit then resolves this issue:

$ rm ~/.cache/go-build/74/74e40de3953e4347471748ef1e1ff4aacfa7380968c99f0baa14b0b7b1d24dae-a

$ time GODEBUG=gocachetest=1 go test -race ./service/
testcache: github.com/foo/bar/service: test ID 5277766d4236333530556841686b797248445250 => 32e67a3c459a396178c3acafc87928bc9fc94b28d20fa9763dfc1fac9925e9a1
testcache: github.com/foo/bar/service: input list not found: cache entry not found: open /home/user/.cache/go-build/32/32e67a3c459a396178c3acafc87928bc9fc94b28d20fa9763dfc1fac9925e9a1-a: no such file or directory
testcache: github.com/foo/bar/service: test ID 484f364f314d6d6c647371774138555430696467 => ce7e1c46d67997e3d0b493d985ea92e199bad7b2fc4a97e7b405c87a92ac5d18
testcache: github.com/foo/bar/service: test ID ce7e1c46d67997e3d0b493d985ea92e199bad7b2fc4a97e7b405c87a92ac5d18 => input ID de8952dc6e704af61a533169fd1be1877a6c2778382c4f9a0828a177db9401ac => 74e40de3953e4347471748ef1e1ff4aacfa7380968c99f0baa14b0b7b1d24dae
testcache: github.com/foo/bar/service: test output not found: cache entry not found: open /home/user/.cache/go-build/74/74e40de3953e4347471748ef1e1ff4aacfa7380968c99f0baa14b0b7b1d24dae-a: no such file or directory
testcache: github.com/foo/bar/service: save test ID 32e67a3c459a396178c3acafc87928bc9fc94b28d20fa9763dfc1fac9925e9a1 => input ID de8952dc6e704af61a533169fd1be1877a6c2778382c4f9a0828a177db9401ac => c5d26a24287e143e5891845de941e3d6e14ccca91d2e881d31430334a09001a8
testcache: github.com/foo/bar/service: save test ID ce7e1c46d67997e3d0b493d985ea92e199bad7b2fc4a97e7b405c87a92ac5d18 => input ID de8952dc6e704af61a533169fd1be1877a6c2778382c4f9a0828a177db9401ac => 74e40de3953e4347471748ef1e1ff4aacfa7380968c99f0baa14b0b7b1d24dae
ok      github.com/foo/bar/service  0.052s

real    0m2.236s
user    0m2.863s
sys     0m0.370s

$ time GODEBUG=gocachetest=1 go test -race ./service/
testcache: github.com/foo/bar/service: test ID 5277766d4236333530556841686b797248445250 => 32e67a3c459a396178c3acafc87928bc9fc94b28d20fa9763dfc1fac9925e9a1
testcache: github.com/foo/bar/service: test ID 32e67a3c459a396178c3acafc87928bc9fc94b28d20fa9763dfc1fac9925e9a1 => input ID de8952dc6e704af61a533169fd1be1877a6c2778382c4f9a0828a177db9401ac => c5d26a24287e143e5891845de941e3d6e14ccca91d2e881d31430334a09001a8
ok      github.com/foo/bar/service  (cached)

real    0m0.728s
user    0m0.974s
sys     0m0.276s

$ time GODEBUG=gocachetest=1 go test -race ./service/
testcache: github.com/foo/bar/service: test ID 5277766d4236333530556841686b797248445250 => 32e67a3c459a396178c3acafc87928bc9fc94b28d20fa9763dfc1fac9925e9a1
testcache: github.com/foo/bar/service: test ID 32e67a3c459a396178c3acafc87928bc9fc94b28d20fa9763dfc1fac9925e9a1 => input ID de8952dc6e704af61a533169fd1be1877a6c2778382c4f9a0828a177db9401ac => c5d26a24287e143e5891845de941e3d6e14ccca91d2e881d31430334a09001a8
ok      github.com/foo/bar/service  (cached)

real    0m0.725s
user    0m1.034s
sys     0m0.234s

Note the save test ID lines and the the speed up.

We can of course clear out the cache to help resolve this, but it feels like there is perhaps an issue here: there is a cache miss yet in these circumstances no new cache entry is created, at least without intervention. I'm open to the possibility that our cache has gotten into a bad state, though I'd have a lot of difficulty trying to understand why.

@bcmills bcmills added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. ToolSpeed labels Sep 23, 2020
@bcmills bcmills added this to the Backlog milestone Sep 23, 2020
@bcmills bcmills added the GoCommand cmd/go label Sep 23, 2020
@bcmills bcmills changed the title go test perpetual cache misses cmd/go: test reported as cached, but not actually cached Sep 23, 2020
@bcmills
Copy link
Contributor

bcmills commented Sep 23, 2020

The ok github.com/foo/bar/service (cached) output in that first trace seems to indicate that the test result should have been loaded from the cache, but the timing information is not consistent with that output.

To me, that suggests that we are missing some error-handling somewhere when the cache gets into an inconsistent state.

CC @jayconrod @matloob

@nick-jones
Copy link
Author

nick-jones commented Sep 24, 2020

Just to add, this issue seems to be quite prevalent across our entire test suite (I only pasted an isolated example above).

$ # run 1
$ time go test -race ./...
<snip>

real    5m57.490s
user    10m44.469s
sys     0m56.386s
$ # run 2
$ time go test -race ./...
<snip>

real    5m56.693s
user    10m43.209s
sys     0m56.260s
$ # run 3 -- deleting the entire cache
$ rm -r ~/.cache/go-build/
$ time go test -race ./...
<snip>

real    33m14.534s
user    30m23.976s
sys     3m12.240s
$ # run 4
$ time go test -race ./...
<snip>

real    1m12.405s
user    1m39.922s
sys     0m24.057s
$ # run 5
$ time go test -race ./...
<snip>

real    1m8.562s
user    1m41.005s
sys     0m23.210s

I've not included the full output as it is quite large, but with runs 1, 2, 4 & 5 the output only contained lines with no test files or cached. So after clearing the cache out, the time to run the tests where all tests are apparently cached drops from ~6 minutes to ~1 minute

@nick-jones
Copy link
Author

nick-jones commented Nov 10, 2020

@bcmills - having taken another look at this, I'm wondering if we're hitting this condition here:

if c.buf == nil {
// We did not find a cached result using the link step action ID,
// so we ran the link step. Try again now with the link output
// content ID. The attempt using the action ID makes sure that
// if the link inputs don't change, we reuse the cached test
// result without even rerunning the linker. The attempt using
// the link output (test binary) content ID makes sure that if
// we have different link inputs but the same final binary,
// we still reuse the cached test result.
// c.saveOutput will store the result under both IDs.
c.tryCacheWithID(b, a, a.Deps[0].BuildContentID())
}

  • I can see the linker being invoked when watching the process list
  • the logs I pasted suggest that tryCache is invoked twice, with the first attempt getting a cache miss

Note that the comment suggests saveOutput will be invoked in this particular case, however from what I see it will only be called if this also results in a cache miss. Which is fine, however it means a relatively expensive link step is going to to run every single time?

Edit: I note that calling saveOutput in the case where the second cache lookup succeeds won't be an answer to this, as saveOutput relies on the tests actually being run (reads testlog.txt).

@nick-jones
Copy link
Author

nick-jones commented Nov 10, 2020

With the previous comment in mind, steps to force this undesirable behaviour are:

  1. Create main_test.go
package main

import (
	"fmt"
	"testing"
)

func TestFoo(t *testing.T) {
	fmt.Println("test")
}
  1. Run tests as normal
$ go test ./...                                                                                          
ok      test    1.408s
  1. Confirm cached
$ go test ./...
ok      test    (cached)
  1. Add a linker flag that has no impact on the compiled test binary, but results in cache misses for the first action ID
$ time go test -ldflags "-X main.foo=1" ./...            
ok      test    (cached)
go test -ldflags "-X main.foo=1" ./...  0.38s user 0.32s system 214% cpu 0.330 total
$ time go test -ldflags "-X main.foo=1" ./...
ok      test    (cached)
go test -ldflags "-X main.foo=1" ./...  0.38s user 0.33s system 217% cpu 0.327 total
$ time go test -ldflags "-X main.foo=1" ./...
ok      test    (cached)
go test -ldflags "-X main.foo=1" ./...  0.38s user 0.32s system 215% cpu 0.327 total
  1. Compare timings when flag isn't used
$ time go test ./...
ok      test    (cached)
go test ./...  0.23s user 0.29s system 280% cpu 0.187 total
$ time go test ./...
ok      test    (cached)
go test ./...  0.23s user 0.29s system 265% cpu 0.195 total
$ time go test ./...
ok      test    (cached)
go test ./...  0.22s user 0.28s system 263% cpu 0.192 total

We can confirm that using the linker flag is causing the first cache fetch to miss:

$ time GODEBUG=gocachetest=1 go test -ldflags "-X main.foo=1" ./...
testcache: test: test ID 34667a79325f4b5a5031745352646a364f334248 => 3ef2a119bdc3644ae990a022c528aaf151cfee0cdf4c9c709d47fb60f6c9c457
testcache: test: input list not found: cache entry not found: open /Users/***/Library/Caches/go-build/3e/3ef2a119bdc3644ae990a022c528aaf151cfee0cdf4c9c709d47fb60f6c9c457-a: no such file or directory
testcache: test: test ID 655748382d6e5536494c337a442d2d5735766c33 => 0d932d0481612b8bb78edaa50a1934585417fed642903bf1189315ddf344e4d7
testcache: test: test ID 0d932d0481612b8bb78edaa50a1934585417fed642903bf1189315ddf344e4d7 => input ID c5fc1f3633fe1da223946e105e6b9d870031ba00feb596125d273dfd099ec02a => 8071c7e4622894029c7df1341a0defb22002e4268d725b9d2ef3979d86c830aa
ok      test    (cached)
GODEBUG=gocachetest=1 go test -ldflags "-X main.foo=1" ./...  0.40s user 0.32s system 211% cpu 0.343 tota

The time loss seems minor, but with a large test suite with complex and large scale dependencies this very quickly adds up (as demonstrated in #41593 (comment)). Note that we aren't using any bogus -X flags AFAIK, I'm wondering if the cache misses start to occur after upgrading go versions.

@bcmills
Copy link
Contributor

bcmills commented Nov 27, 2023

This may be another symptom of #49267.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
GoCommand cmd/go NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. ToolSpeed
Projects
None yet
Development

No branches or pull requests

2 participants