runtime: systemOOM occurs when running kubelet on ppc64le cluster #50048
Labels
arch-ppc64x
FrozenDueToAge
NeedsInvestigation
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Does not happen on Go 1.17
What operating system and processor architecture are you using (go env)?
What did you do?
When Kubernetes is built with Go master, kubelet hits a system OOM after running for 45 minutes or more.
What did you expect to see?
Valid run with no crash or OOM message.
What did you see instead?
kubelet crashes and logs show system OOM
After experimentation, I found that this started happening at commit ae83301, which enabled the register ABI for PPC64.
It does not fail on the commit before this.
We also found that the value of GOGC affects whether and how quickly this fails. Setting GOGC=off prevents it from failing. Setting GOGC=300 allows it to run much longer before failing. Setting GOGC=10 can cause it to fail within a few minutes rather than 40+ minutes.
When it fails, top shows this:
An strace at this point shows mmap is called with a very large size.
Attaching gdb and breaking at the bad call to mmap shows that npages is being passed as a very large value:
Based on the objdump of kubelet, the arguments are all being passed in storage so it seems like the storage holding the arguments is getting corrupted.
We are trying to create a reproducer for this. If there is any information we can collect or experiments we can try, let us know.