runtime: systemOOM occurs when running kubelet on ppc64le cluster #50048

Closed
laboger opened this issue Dec 8, 2021 · 1 comment
Labels
arch-ppc64x, FrozenDueToAge, NeedsInvestigation

Comments

@laboger (Contributor) commented Dec 8, 2021

What version of Go are you using (go version)?

$ go version
tip 1.18

Does this issue reproduce with the latest release?

It does not happen on Go 1.17.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
Linux local-cluster-setup 4.18.0-348.2.1.el8_5.ppc64le power9

What did you do?

When using Go from master (tip) to build Kubernetes, running kubelet hits a system OOM after 45 minutes or more.

What did you expect to see?

A successful run with no crash or OOM message.

What did you see instead?

kubelet crashes and the logs show a system OOM.

After experimentation, I found that this started happening at commit ae83301, where the register ABI was enabled for PPC64.
It does not fail on the commit immediately before it.

We also found that different values of GOGC affect whether this fails. Setting GOGC=off prevents it from failing. Setting GOGC=300 allows it to run much longer before failing. Setting GOGC=10 can cause it to fail within a few minutes rather than 40+ minutes.
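
For reference, the same knob can also be flipped programmatically while narrowing the failure window; this is a minimal sketch of my own (not part of the kubelet setup) using runtime/debug.SetGCPercent, the in-process equivalent of the GOGC environment variable:

package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent of GOGC=off: disable automatic collections entirely
	// (the failure did not reproduce in this mode).
	old := debug.SetGCPercent(-1)
	fmt.Println("previous GOGC percent:", old)

	// Equivalent of GOGC=10: collect far more often; this made the
	// failure show up within minutes instead of 40+ minutes.
	debug.SetGCPercent(10)

	// Equivalent of GOGC=300: collect less often; kubelet ran much
	// longer before failing.
	debug.SetGCPercent(300)
}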

When it fails, top shows this:

3799969 root      20   0 2223168  90304  61824 S   2.0   0.3   0:07.58 kubelet                                             
3799969 root      20   0 2223168  90304  61824 S   3.7   0.3   0:07.69 kubelet                                             
3799969 root      20   0 2223168  90304  61824 S   2.0   0.3   0:07.75 kubelet                                             
3799969 root      20   0 2223168  90304  61824 S   1.3   0.3   0:07.79 kubelet                                             
3799969 root      20   0 2223168  90304  61824 S   2.0   0.3   0:07.85 kubelet                                             
3799969 root      20   0   72.9t 100032  61824 S 154.8   0.3   0:12.51 kubelet                                             
3799969 root      20   0   74.0t 112640  61824 S 232.8   0.3   0:19.54 kubelet                                             
3799969 root      20   0   74.4t   2.5g  61824 R 241.3   8.2   0:26.85 kubelet                                             
3799969 root      20   0   74.4t   6.7g  61824 S 197.0  21.6   0:32.80 kubelet                                             
3799969 root      20   0   74.4t  17.9g  61824 S 123.3  58.1   0:36.51 kubelet                                            

An strace at this point shows mmap is called with a very large size.

[pid 3211132] mmap(NULL, 2165776, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7d9b07d60000
[pid 3211132] mmap(0xc001400000, 46137344, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xc001400000
[pid 3211132] mmap(0x1c000000000, 79165910941696, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x1c000000000
[pid 3211132] mmap(0x7fff91c30000, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0 <unfinished ...>

Attaching gdb and breaking at the bad call to mmap shows that npages is being passed as a very large value:

#1  0x000000001000e5e0 in runtime.sysReserve (v=<optimized out>, n=79166582030336) at /root/gotest/go/src/runtime/mem_linux.go:157
#2  runtime.(*mheap).sysAlloc (h=0x16c06260 <runtime.mheap_>, n=79166582030336) at /root/gotest/go/src/runtime/malloc.go:656
#3  0x000000001002aafc in runtime.(*mheap).grow (h=0x16c06260 <runtime.mheap_>, npage=<optimized out>)
    at /root/gotest/go/src/runtime/mheap.go:1347
#4  0x000000001002a568 in runtime.(*mheap).allocSpan (h=0x16c06260 <runtime.mheap_>, npages=9663885233, typ=0 '\000', 
    spanclass=0 '\000') at /root/gotest/go/src/runtime/mheap.go:1179
#5  0x0000000010029fc0 in runtime.(*mheap).alloc.func1 () at /root/gotest/go/src/runtime/mheap.go:913
#6  0x000000001006b4c8 in runtime.systemstack () at /root/gotest/go/src/runtime/asm_ppc64x.s:256
#7  0x000000001006b3ac in runtime.mstart () at /root/gotest/go/src/runtime/asm_ppc64x.s:128

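As a sanity check on the numbers (my own arithmetic, not from the gdb session): the runtime's page size is 8 KiB, so a corrupted npages of 9663885233 works out to roughly 72 TiB, which lines up with the n passed down to sysAlloc/sysReserve, the giant mmap in the strace, and the VIRT column in top:

package main

import "fmt"

func main() {
	const pageSize = 8192     // Go runtime page size (8 KiB)
	const npages = 9663885233 // corrupted value seen in the backtrace

	bytes := uint64(npages) * pageSize
	fmt.Printf("npages*pageSize = %d bytes (~%.1f TiB)\n",
		bytes, float64(bytes)/(1<<40))
	// Prints 79166547828736 bytes (~72.0 TiB), in line with
	// n=79166582030336 in sysAlloc and the ~72.9t VIRT in top.
}
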
Based on the objdump of kubelet, the arguments are all being passed in storage, so it seems the storage holding the arguments is getting corrupted.

We are trying to create a reproducer for this. If there is any information we can collect or experiments we should try, please let us know.

@seankhliao added the arch-ppc64x and NeedsInvestigation labels on Dec 8, 2021
@gopherbot

Change https://golang.org/cl/371034 mentions this issue: cmd/asm,cmd/compile: fix tail call in leaf functions on PPC64

@golang locked and limited conversation to collaborators on Dec 13, 2022