cmd/compile: rangegen.go test killed on ppc64 #65725

Closed
prattmic opened this issue Feb 15, 2024 · 22 comments
Labels
arch-ppc64x compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

Comments

@prattmic
Member

prattmic commented Feb 15, 2024

#!watchflakes
post <- pkg == "cmd/internal/testdir" && test == "Test/rangegen.go" && goarch ~ `ppc64`

Test/rangegen.go is occasionally getting killed on ppc64 builders.

e.g., https://ci.chromium.org/ui/p/golang/builders/ci/gotip-linux-ppc64le/b8756201558164082785/test-results?sortby=&groupby=

=== RUN   Test/rangegen.go
=== PAUSE Test/rangegen.go
=== CONT  Test/rangegen.go
    testdir_test.go:142: exit status 1
        command-line-arguments: /home/swarming/.swarming/w/ir/x/w/goroot/pkg/tool/linux_ppc64le/compile: signal: killed
        
--- FAIL: Test/rangegen.go (33.68s)

From the history, this seems to be a recent regression: https://ci.chromium.org/ui/test/golang/cmd%2Finternal%2Ftestdir.Test%2Frangegen.go?q=V%3Ago_branch%3Dmaster+V%3Agoos%3Dlinux+V%3Ahost_goos%3Dlinux+

Note that this test is very large and has caused OOMs before (#64789).

cc @golang/ppc64

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Feb 15, 2024
@prattmic prattmic added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. arch-ppc64x compiler/runtime Issues related to the Go compiler and/or runtime. and removed compiler/runtime Issues related to the Go compiler and/or runtime. labels Feb 15, 2024
@pmur
Contributor

pmur commented Feb 15, 2024

I changed /home/swarming on the LUCI builders to a tmpfs mount on 2/13. After that, this test seemed to fail consistently because usage went over the container's 8G RAM limit. The LUCI builders are using quite a bit more space than the buildbot instances.

I bumped it later that day to 10G, and set up a 5G swap file on the ppc64le builder. Hopefully that is enough to keep things running, and ideally faster.

@prattmic
Member Author

Thanks for the update. For what it's worth, I got this failure on https://go.dev/cl/564137 at around Feb 14, 6pm EST. It sounds like that may be after your latest change? (This could be a bug in that WIP CL, but such a bug seems unlikely to fail just this one test.)

@pmur
Contributor

pmur commented Feb 15, 2024

Looking at the VM, it did trigger an OOM on a container. The syslog is claiming 7.5G of that is "file" usage.

Is there any way to make LUCI more conservative with its disk usage? This seems like a pretty big jump from the old CI, which also ran entirely on a tmpfs.
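
For context on the "file" figure: under cgroup v2, page cache and tmpfs pages are charged to the container's memory limit and are reported on the `file` line of `memory.stat`, so data written to a tmpfs-backed /home/swarming competes with the compiler for the same budget. Below is a minimal Go sketch for inspecting that split. It assumes a cgroup v2 host and the conventional `/sys/fs/cgroup/memory.stat` path, neither of which is confirmed in this thread, and `parseMemoryStat` is a hypothetical helper name.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseMemoryStat parses text in the cgroup v2 memory.stat format
// ("key value" per line) into a map of byte counts. Lines that do not
// have exactly two fields, or whose value is not an unsigned integer,
// are skipped.
func parseMemoryStat(text string) map[string]uint64 {
	stats := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			continue
		}
		stats[fields[0]] = v
	}
	return stats
}

func main() {
	// Assumed path: the cgroup v2 mount point as seen from inside the container.
	data, err := os.ReadFile("/sys/fs/cgroup/memory.stat")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read memory.stat:", err)
		return
	}
	stats := parseMemoryStat(string(data))
	const gib = 1 << 30
	// "anon" is process memory; "file" is page cache, including tmpfs.
	fmt.Printf("anon: %.2f GiB\n", float64(stats["anon"])/gib)
	fmt.Printf("file: %.2f GiB\n", float64(stats["file"])/gib)
}
```

A large `file` value next to a modest `anon` value would point at cached or tmpfs data rather than the compiler itself as the main consumer.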

@prattmic
Member Author

That's a good question. I wonder how much disk this test used on the old infra. I know this test does generate an absolutely massive source file.

cc @golang/release

@gopherbot

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "cmd/internal/testdir" && test == "Test/rangegen.go" && goarch ~ `ppc64`
2024-02-15 14:35 gotip-linux-ppc64-power10 go@cfe7f21d cmd/internal/testdir.Test/rangegen.go (log)
=== RUN   Test/rangegen.go
=== PAUSE Test/rangegen.go
=== CONT  Test/rangegen.go
    testdir_test.go:142: exit status 1
        command-line-arguments: /home/swarming/.swarming/w/ir/x/w/goroot/pkg/tool/linux_ppc64/compile: signal: killed
        
--- FAIL: Test/rangegen.go (45.74s)

watchflakes

@pmur
Contributor

pmur commented Feb 15, 2024

I didn't realize the ppc64 LUCI builder needed a bump too. It is now bumped to 10G.

If this doesn't crash again in the next week, this issue can be closed.

@pmur
Contributor

pmur commented Feb 15, 2024

Poking around at the idle ppc64le builders, there is a folder "/home/swarming/.cache/gopls" which is consuming between 2.5G and 3.3G on each instance.

That's a problem. Is it possible to update LUCI to clean up caches at the end of a test run?

@dmitshur
Contributor

I recall @adonovan used a GOPLSCACHE mechanism to help with that in the previous infrastructure (CL 494297). I'm not seeing that in the LUCI infrastructure—perhaps it needs to be ported over. There's some relevant discussion in this thread. CC @mknyszek.

@pmur
Contributor

pmur commented Feb 15, 2024

For reference, this is what has accrued on linux-ppc64le-power8--05 since the last container reboot:

832K	.cache/gopls/013c47f3
259M	.cache/gopls/01a3ac08
136M	.cache/gopls/0634c015
88M	.cache/gopls/0d9f1c10
64M	.cache/gopls/1045283d
258M	.cache/gopls/159d94d7
142M	.cache/gopls/17fdaa63
144M	.cache/gopls/18c46665
89M	.cache/gopls/1c4c60e1
64M	.cache/gopls/223a02a4
128K	.cache/gopls/273a5340
6.7M	.cache/gopls/299fee7f
128K	.cache/gopls/29a597a4
26M	.cache/gopls/387f9f03
138M	.cache/gopls/4e2cae36
32M	.cache/gopls/575df253
128K	.cache/gopls/65191266
1.7M	.cache/gopls/694e1ff5
1.4M	.cache/gopls/6fc45050
6.7M	.cache/gopls/70bf6d67
89M	.cache/gopls/78e17b72
61M	.cache/gopls/8a118055
27M	.cache/gopls/8a786b24
6.7M	.cache/gopls/a786fe67
1.4M	.cache/gopls/a806c003
1.4M	.cache/gopls/aa30a907
33M	.cache/gopls/ac020416
832K	.cache/gopls/b0ace309
257M	.cache/gopls/b0bcd7cf
832K	.cache/gopls/b6fa81e7
42M	.cache/gopls/bcc7bedc
27M	.cache/gopls/c8caba22
41M	.cache/gopls/c99017d5
128K	.cache/gopls/cd2810e5
144M	.cache/gopls/d4694109
41M	.cache/gopls/d558ebe0
43M	.cache/gopls/d56f4029
1.9M	.cache/gopls/d67d19d4
33M	.cache/gopls/d93c051c
41M	.cache/gopls/e3926c21
138M	.cache/gopls/eeb12931
41M	.cache/gopls/f224d393

@adonovan
Member

> I recall @adonovan used a GOPLSCACHE mechanism to help with that in the previous infrastructure (CL 494297). I'm not seeing that in the LUCI infrastructure—perhaps it needs to be ported over.

Yes, that was an effective fix for the problems of this kind we saw in the older builders. It should be as simple as setting GOPLSCACHE to a temp directory for the entire run.

@gopherbot

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "cmd/internal/testdir" && test == "Test/rangegen.go" && goarch ~ `ppc64`
2024-02-16 14:59 gotip-linux-ppc64-power10 go@5258d4ed cmd/internal/testdir.Test/rangegen.go (log)
=== RUN   Test/rangegen.go
=== PAUSE Test/rangegen.go
=== CONT  Test/rangegen.go
    testdir_test.go:142: exit status 1
        command-line-arguments: /home/swarming/.swarming/w/ir/x/w/goroot/pkg/tool/linux_ppc64/compile: signal: killed
        
--- FAIL: Test/rangegen.go (39.72s)

watchflakes

@gopherbot

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "cmd/internal/testdir" && test == "Test/rangegen.go" && goarch ~ `ppc64`
2024-02-16 15:12 gotip-linux-ppc64le go@3b515812 cmd/internal/testdir.Test/rangegen.go (log)
=== RUN   Test/rangegen.go
=== PAUSE Test/rangegen.go
=== CONT  Test/rangegen.go
    testdir_test.go:142: exit status 1
        command-line-arguments: /home/swarming/.swarming/w/ir/x/w/goroot/pkg/tool/linux_ppc64le/compile: signal: killed
        
--- FAIL: Test/rangegen.go (71.03s)
2024-02-16 15:51 go1.22-linux-ppc64le release-branch.go1.22@d6a27193 cmd/internal/testdir.Test/rangegen.go (log)
=== RUN   Test/rangegen.go
=== PAUSE Test/rangegen.go
=== CONT  Test/rangegen.go
    testdir_test.go:142: exit status 1
        command-line-arguments: /home/swarming/.swarming/w/ir/x/w/goroot/pkg/tool/linux_ppc64le/compile: signal: killed
        
--- FAIL: Test/rangegen.go (65.25s)
2024-02-16 16:53 gotip-linux-ppc64le go@7f799f33 cmd/internal/testdir.Test/rangegen.go (log)
=== RUN   Test/rangegen.go
=== PAUSE Test/rangegen.go
=== CONT  Test/rangegen.go
    testdir_test.go:142: exit status 1
        command-line-arguments: /home/swarming/.swarming/w/ir/x/w/goroot/pkg/tool/linux_ppc64le/compile: signal: killed
        
--- FAIL: Test/rangegen.go (55.99s)
2024-02-16 18:13 gotip-linux-ppc64le go@a0226c56 cmd/internal/testdir.Test/rangegen.go (log)
=== RUN   Test/rangegen.go
=== PAUSE Test/rangegen.go
=== CONT  Test/rangegen.go
    testdir_test.go:142: exit status 1
        command-line-arguments: /home/swarming/.swarming/w/ir/x/w/goroot/pkg/tool/linux_ppc64le/compile: signal: killed
        
--- FAIL: Test/rangegen.go (54.31s)
2024-02-16 20:25 gotip-linux-ppc64-power10 go@cdd0ddaf cmd/internal/testdir.Test/rangegen.go (log)
=== RUN   Test/rangegen.go
=== PAUSE Test/rangegen.go
=== CONT  Test/rangegen.go
    testdir_test.go:142: exit status 1
        command-line-arguments: /home/swarming/.swarming/w/ir/x/w/goroot/pkg/tool/linux_ppc64/compile: signal: killed
        
--- FAIL: Test/rangegen.go (28.72s)
2024-02-17 00:13 gotip-linux-ppc64-power10 go@e41fabd6 cmd/internal/testdir.Test/rangegen.go (log)
=== RUN   Test/rangegen.go
=== PAUSE Test/rangegen.go
=== CONT  Test/rangegen.go
    testdir_test.go:142: exit status 1
        command-line-arguments: /home/swarming/.swarming/w/ir/x/w/goroot/pkg/tool/linux_ppc64/compile: signal: killed
        
--- FAIL: Test/rangegen.go (29.01s)
2024-02-17 00:13 gotip-linux-ppc64le go@e41fabd6 cmd/internal/testdir.Test/rangegen.go (log)
=== RUN   Test/rangegen.go
=== PAUSE Test/rangegen.go
=== CONT  Test/rangegen.go
    testdir_test.go:142: exit status 1
        command-line-arguments: /home/swarming/.swarming/w/ir/x/w/goroot/pkg/tool/linux_ppc64le/compile: signal: killed
        
--- FAIL: Test/rangegen.go (81.16s)
2024-02-19 08:55 gotip-linux-ppc64-power10 go@5c92f43c cmd/internal/testdir.Test/rangegen.go (log)
=== RUN   Test/rangegen.go
=== PAUSE Test/rangegen.go
=== CONT  Test/rangegen.go
    testdir_test.go:142: exit status 1
        command-line-arguments: /home/swarming/.swarming/w/ir/x/w/goroot/pkg/tool/linux_ppc64/compile: signal: killed
        
--- FAIL: Test/rangegen.go (48.53s)
2024-02-19 20:44 gotip-linux-ppc64le go@0882ca7a cmd/internal/testdir.Test/rangegen.go (log)
=== RUN   Test/rangegen.go
=== PAUSE Test/rangegen.go
=== CONT  Test/rangegen.go
    testdir_test.go:142: exit status 1
        command-line-arguments: /home/swarming/.swarming/w/ir/x/w/goroot/pkg/tool/linux_ppc64le/compile: signal: killed
        
--- FAIL: Test/rangegen.go (48.65s)
2024-02-20 14:56 gotip-linux-ppc64-power10 go@098a87fb cmd/internal/testdir.Test/rangegen.go (log)
=== RUN   Test/rangegen.go
=== PAUSE Test/rangegen.go
=== CONT  Test/rangegen.go
    testdir_test.go:142: exit status 1
        command-line-arguments: /home/swarming/.swarming/w/ir/x/w/goroot/pkg/tool/linux_ppc64/compile: signal: killed
        
--- FAIL: Test/rangegen.go (54.67s)
2024-02-20 17:57 gotip-linux-ppc64le go@c1828fbc cmd/internal/testdir.Test/rangegen.go (log)
=== RUN   Test/rangegen.go
=== PAUSE Test/rangegen.go
=== CONT  Test/rangegen.go
    testdir_test.go:142: exit status 1
        command-line-arguments: /home/swarming/.swarming/w/ir/x/w/goroot/pkg/tool/linux_ppc64le/compile: signal: killed
        
--- FAIL: Test/rangegen.go (46.50s)
2024-02-20 18:06 gotip-linux-ppc64-power10 go@67361bf8 cmd/internal/testdir.Test/rangegen.go (log)
=== RUN   Test/rangegen.go
=== PAUSE Test/rangegen.go
=== CONT  Test/rangegen.go
    testdir_test.go:142: exit status 1
        command-line-arguments: /home/swarming/.swarming/w/ir/x/w/goroot/pkg/tool/linux_ppc64/compile: signal: killed
        
--- FAIL: Test/rangegen.go (48.95s)
2024-02-20 18:06 gotip-linux-ppc64le go@67361bf8 cmd/internal/testdir.Test/rangegen.go (log)
=== RUN   Test/rangegen.go
=== PAUSE Test/rangegen.go
=== CONT  Test/rangegen.go
    testdir_test.go:142: exit status 1
        command-line-arguments: /home/swarming/.swarming/w/ir/x/w/goroot/pkg/tool/linux_ppc64le/compile: signal: killed
        
--- FAIL: Test/rangegen.go (45.65s)

watchflakes

@gopherbot

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "cmd/internal/testdir" && test == "Test/rangegen.go" && goarch ~ `ppc64`
2024-02-20 20:44 gotip-linux-ppc64le go@4ce008d7 cmd/internal/testdir.Test/rangegen.go (log)
=== RUN   Test/rangegen.go
=== PAUSE Test/rangegen.go
=== CONT  Test/rangegen.go
    testdir_test.go:142: exit status 1
        command-line-arguments: /home/swarming/.swarming/w/ir/x/w/goroot/pkg/tool/linux_ppc64le/compile: signal: killed
        
--- FAIL: Test/rangegen.go (77.06s)
2024-02-20 21:02 gotip-linux-ppc64le go@de65aa41 cmd/internal/testdir.Test/rangegen.go (log)
=== RUN   Test/rangegen.go
=== PAUSE Test/rangegen.go
=== CONT  Test/rangegen.go
    testdir_test.go:142: exit status 1
        command-line-arguments: /home/swarming/.swarming/w/ir/x/w/goroot/pkg/tool/linux_ppc64le/compile: signal: killed
        
--- FAIL: Test/rangegen.go (50.24s)

watchflakes

@pmur
Contributor

pmur commented Feb 21, 2024

I've set GOPLSCACHE=/home/swarming/.swarming/w via the container image on all ppc64x LUCI builders while waiting for a more generic fix.

@mknyszek
Contributor

> It should be as simple as setting GOPLSCACHE to a temp directory for the entire run.

@adonovan @pmur Can you clarify what you mean by "a temp directory"? Is the TMPDIR set in the builds sufficient (https://logs.chromium.org/logs/golang/buildbucket/cr-buildbucket/8755540096649505841/+/u/step/9/log/1)? That gets cleared after every run.

@adonovan
Member

> Is the TMPDIR set in the builds sufficient? That gets cleared after every run.

Yes, a temp directory that lasts for a complete run of tests at a single CL is ideal. Thanks.

@mknyszek
Contributor

Sent crrev.com/c/5314212. I created an explicit subdirectory in the workdir for it, which will have the same effect; I thought it might be a bit clearer to have it next to the GOCACHE directory.

@mknyszek
Contributor

The change landed, so expect this to roll out over the next half hour or so.

@mknyszek
Contributor

mknyszek commented Feb 21, 2024

I confirmed that the environment variable is now set in new builds to a directory that will definitely get wiped on each run. Hopefully this is now resolved. Closing optimistically.

@gopherbot

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "cmd/internal/testdir" && test == "Test/rangegen.go" && goarch ~ `ppc64`
2024-02-21 17:22 gotip-linux-ppc64-power10 go@cd170327 cmd/internal/testdir.Test/rangegen.go (log)
=== RUN   Test/rangegen.go
=== PAUSE Test/rangegen.go
=== CONT  Test/rangegen.go
    testdir_test.go:142: exit status 1
        command-line-arguments: /home/swarming/.swarming/w/ir/x/w/goroot/pkg/tool/linux_ppc64/compile: signal: killed
        
--- FAIL: Test/rangegen.go (47.96s)

watchflakes

@gopherbot

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "cmd/internal/testdir" && test == "Test/rangegen.go" && goarch ~ `ppc64`
2024-04-09 04:07 gotip-linux-ppc64_power8 go@9f3f4c64 cmd/internal/testdir.Test/rangegen.go (log)
=== RUN   Test/rangegen.go
=== PAUSE Test/rangegen.go
=== CONT  Test/rangegen.go
    testdir_test.go:145: exit status 1
        command-line-arguments: /home/swarming/.swarming/w/ir/x/w/goroot/pkg/tool/linux_ppc64/compile: signal: killed
        
--- FAIL: Test/rangegen.go (54.26s)

watchflakes

@gopherbot gopherbot reopened this Apr 18, 2024
@pmur
Contributor

pmur commented Apr 22, 2024

I had set up some of the new builders with 8G instead of 10G memory limits, which can cause the rangegen test to OOM. That has been resolved for the last week or so.

@pmur pmur closed this as completed Apr 22, 2024