x/build: ppc64le builders broken on OSU machines only #44541
/cc @golang/release
We had a problem once where there was an update to the build files, but not all files were correctly updated and the buildlets didn't all get restarted on all systems. Is that possible here?
This log is odd: it does not show the header that initializes the buildlet run. https://build.golang.org/log/999916eb450b1e55c253f8c8cc4b76c150d0ee63
I just noticed this on the "pools" page, not sure if it is significant. Most others show buildlet version 25, but the ppc64le buildlets are still at 24. Also interesting that the ppc64 buildlets don't have a problem while the power8 and power9 ppc64le buildlets do.
Hello from the OSUOSL! It looks like the failing jobs are running out of memory and triggering the oom-killer.
$ journalctl --since '2021-02-25 00:08:20'
go-le-bionic-1 kernel: api invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0
go-le-bionic-1 kernel: api cpuset=6883dd2c43a4899f40205b9c3ab0bb732c10b93dabcb637516d7cb9738536afa mems_allowed=0
go-le-bionic-1 kernel: CPU: 2 PID: 11408 Comm: api Not tainted 4.15.0-65-generic #74-Ubuntu
go-le-bionic-1 kernel: Call Trace: <snip>
go-le-bionic-1 kernel: Task in /docker/6883dd2c43a4899f40205b9c3ab0bb732c10b93dabcb637516d7cb9738536afa killed as a result of limit of /docker/6883dd2c43a4899f40205b9c3ab0bb732c10b93dabcb637516d7cb9738536afa
go-le-bionic-1 kernel: memory: usage 4089408kB, limit 4089408kB, failcnt 645
go-le-bionic-1 kernel: memory+swap: usage 0kB, limit 9007199254740928kB, failcnt 0
go-le-bionic-1 kernel: kmem: usage 48896kB, limit 9007199254740928kB, failcnt 0
go-le-bionic-1 kernel: Memory cgroup stats for /docker/6883dd2c43a4899f40205b9c3ab0bb732c10b93dabcb637516d7cb9738536afa: cache:2932608KB rss:1107904KB rss_huge:927744KB shmem:2932608KB mapped_file:15424KB dirty:0KB writeb
go-le-bionic-1 kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
go-le-bionic-1 kernel: [ 2016] 0 2016 1694 137 36608 0 0 stage0
go-le-bionic-1 kernel: [ 2151] 0 2151 1767 153 40192 0 0 buildlet.exe
go-le-bionic-1 kernel: [11141] 0 11141 18084 172 48128 0 0 go
go-le-bionic-1 kernel: [11146] 0 11146 11009 89 35840 0 0 dist
go-le-bionic-1 kernel: [11203] 0 11203 19281 392 50176 0 0 go
go-le-bionic-1 kernel: [11243] 0 11243 10998 54 39680 0 0 run
go-le-bionic-1 kernel: [11249] 0 11249 18102 191 47872 0 0 go
go-le-bionic-1 kernel: [11254] 0 11254 27324 16324 171264 0 0 api
go-le-bionic-1 kernel: Memory cgroup out of memory: Kill process 11254 (api) score 244 or sacrifice child
go-le-bionic-1 kernel: Killed process 11254 (api) total-vm:1748736kB, anon-rss:1041280kB, file-rss:128kB, shmem-rss:3328kB
go-le-bionic-1 containerd[1593]: time="2021-02-25T00:08:26.989336625Z" level=info msg="shim reaped" id=6883dd2c43a4899f40205b9c3ab0bb732c10b93dabcb637516d7cb9738536afa
go-le-bionic-1 dockerd[4660]: time="2021-02-25T00:08:26.998084685Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
The 5 builder containers are restricted to 3.9G of memory and the current VM has 20G -- is this in line with the other builders? Would you like us to bump this?
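(For reference, a minimal sketch of how a per-container limit like this could be inspected and raised, assuming the builders run as plain Docker containers; the container name go-le-bionic-builder-1 is hypothetical:)
$ docker stats --no-stream                                           # current memory usage vs. limit per container
$ docker inspect -f '{{.HostConfig.Memory}}' go-le-bionic-builder-1  # configured limit in bytes, 0 means unlimited
$ docker update --memory 5g --memory-swap -1 go-le-bionic-builder-1  # raise the limit to 5 GiB, leave swap unlimited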
Hi @detjensrobert, would it be possible to allocate another 1GB for all the ppc containers? Hopefully that is enough room to grow for a while longer. I think they are straddling those limits during some of the tests.
@cagedmantis Do you know when the build environments are supposed to be updated on these systems and the buildlets restarted? According to the pool output I displayed above, the buildlets are still at version 24, and in golang.org/x/build the buildletVersion was updated to 25 in November of 2019. These have been failing for almost a week. How do we get these fixed?
I posted an update to our builder configuration. Who can approve/apply this to our containers?
Change https://golang.org/cl/296669 mentions this issue:
Will the above change automatically deploy to the runner nodes, or would you like me to deploy that change manually once merged?
The above change needs to be approved by @dmitshur or @cagedmantis. I don't know how code in x/build is usually deployed. I didn't see an explicit limit on the memory size for other builders like there is on ppc64le. FWIW, the api binary has an extremely large RSS compared to any other binary, not only on ppc64le but also on x86 in my experiments (although ppc64le was a bit larger).
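(A quick way to compare resident set sizes, assuming shell access to a runner host where the container processes are visible; rss is reported in KiB:)
$ ps -eo pid,rss,vsz,comm --sort=-rss | head   # heaviest processes first; the api test binary should stand out near the top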
@detjensrobert can you test the mentioned change prior to commit? I think we're deadlocked waiting for someone to test the fix, and OSU is waiting for us to commit a fix.
@pmur The extra 1G per container has been deployed to both P8 and P9 runners. Also, the P8BE runner did not have a memory limit set on the runner containers, which would explain why it was passing when the LE runners were not.
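(One way to confirm the new limits across every runner container, assuming plain Docker; a reported value of 0 means no limit, as was the case on the P8BE runner:)
$ docker ps -q | xargs docker inspect -f '{{.Name}} {{.HostConfig.Memory}}'   # container name and memory limit in bytes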
I've been keeping an eye on the nodes and the build dashboard, and it looks like there have been no further failures after bumping the memory limit. Would you like us (OSL) to do anything further?
The build dashboard has been showing failures for the ppc64le power8 and power9 builders since Saturday. We are unable to reproduce these failures on any of our local machines, inside or outside of a container.
Could I get remote access to one of these machines in order to debug it? I can open a ticket at OSU, but the log doesn't provide enough information to know what the problem might be.
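(In cases like this, the host kernel log is usually enough to tell whether a builder container is being OOM-killed, even without remote debugging; a sketch assuming journald is running on the runner host:)
$ journalctl -k --since '7 days ago' | grep -iE 'oom|killed process'   # kernel messages only, last week
$ dmesg -T | grep -iE 'oom|killed process'                             # alternative without journald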