x/build: ppc64le builders broken on OSU machines only #44541

Closed
laboger opened this issue Feb 23, 2021 · 15 comments
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@laboger
Contributor

laboger commented Feb 23, 2021

The build dashboard has been showing failures for ppc64le power8 and power9 builders since Saturday. We are unable to reproduce these failures on any of our local machines in or outside of a container.

Could I get remote access on one of these machines in order to debug it? I can open a ticket at OSU but the log doesn't provide enough information to know what the problem might be.

@cagedmantis cagedmantis changed the title build: ppc64le builders broken on OSU machines only x/build: ppc64le builders broken on OSU machines only Feb 23, 2021
@gopherbot gopherbot added the Builders x/build issues (builders, bots, dashboards) label Feb 23, 2021
@gopherbot gopherbot added this to the Unreleased milestone Feb 23, 2021
@cagedmantis cagedmantis added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. and removed Builders x/build issues (builders, bots, dashboards) labels Feb 23, 2021
@cagedmantis
Contributor

/cc @golang/release

@cagedmantis cagedmantis added the Builders x/build issues (builders, bots, dashboards) label Feb 23, 2021
@laboger
Contributor Author

laboger commented Feb 23, 2021

We had a problem once where the build files were updated, but not all of the files were updated correctly and the buildlets weren't restarted on all systems. Is that possible here?

@laboger
Contributor Author

laboger commented Feb 24, 2021

This log is odd; it does not show the header that initializes the buildlet run. https://build.golang.org/log/999916eb450b1e55c253f8c8cc4b76c150d0ee63.
I created a ticket at OSU, but they don't know of anything that might cause this.
@dmitshur Can we get on these systems to try and figure out what's wrong?

@laboger
Contributor Author

laboger commented Feb 24, 2021

I just noticed this on the "pools" page; I'm not sure if it is significant. Most others show buildlet version 25, but the ppc64le buildlets are still at 24. It is also interesting that the ppc64 buildlets don't have a problem while the power8 and power9 ppc64le buildlets do.

ppc64_01 (140.211.169.164:43664) version 25, host-linux-ppc64-osu: connected 48m30.5s, idle for 10.4s
ppc64_02 (140.211.169.164:56686) version 25, host-linux-ppc64-osu: connected 47m53.6s, idle for 9.85s
ppc64_03 (140.211.169.164:51012) version 25, host-linux-ppc64-osu: connected 44m29.1s, idle for 3.32s
ppc64_04 (140.211.169.164:36978) version 25, host-linux-ppc64-osu: connected 48m49.4s, idle for 1.47s
ppc64_05 (140.211.169.164:45226) version 25, host-linux-ppc64-osu: connected 37m45.8s, idle for 4.53s
power_01 (140.211.169.160:43028) version 24, host-linux-ppc64le-osu: connected 37m38.9s, idle for 11.5s
power_02 (140.211.169.160:41714) version 24, host-linux-ppc64le-osu: connected 48m58s, idle for 1.02s
power_03 (140.211.169.160:39590) version 24, host-linux-ppc64le-osu: connected 48m36.4s, idle for 10.3s
power_04 (140.211.169.160:49384) version 24, host-linux-ppc64le-osu: connected 48m19.5s, idle for 1.17s
power_05 (140.211.169.160:48534) version 24, host-linux-ppc64le-osu: connected 48m4.7s, idle for 2.05s
power_01 (140.211.169.171:52062) version 24, host-linux-ppc64le-power9-osu: connected 42m38.4s, idle for 9.96s
power_02 (140.211.169.171:41270) version 24, host-linux-ppc64le-power9-osu: connected 28m26.2s, idle for 7.71s
power_03 (140.211.169.171:41362) version 24, host-linux-ppc64le-power9-osu: connected 49m8s, idle for 4.41s
power_04 (140.211.169.171:41220) version 24, host-linux-ppc64le-power9-osu: connected 46m27.7s, idle for 2.14s
power_05 (140.211.169.171:49086) version 24, host-linux-ppc64le-power9-osu: connected 32m13.2s, idle for 9.67s

@laboger
Contributor Author

laboger commented Feb 24, 2021

@pmur

@cagedmantis cagedmantis self-assigned this Feb 24, 2021
@detjensrobert

detjensrobert commented Feb 25, 2021

Hello from the OSUOSL!

It looks like the failing jobs are running out of memory and triggering the oom-killer.

$ journalctl --since '2020-02-25 00:08:20'
go-le-bionic-1 kernel: api invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0
go-le-bionic-1 kernel: api cpuset=6883dd2c43a4899f40205b9c3ab0bb732c10b93dabcb637516d7cb9738536afa mems_allowed=0
go-le-bionic-1 kernel: CPU: 2 PID: 11408 Comm: api Not tainted 4.15.0-65-generic #74-Ubuntu
go-le-bionic-1 kernel: Call Trace: <snip>
go-le-bionic-1 kernel: Task in /docker/6883dd2c43a4899f40205b9c3ab0bb732c10b93dabcb637516d7cb9738536afa killed as a result of limit of /docker/6883dd2c43a4899f40205b9c3ab0bb732c10b93dabcb637516d7cb9738536afa
go-le-bionic-1 kernel: memory: usage 4089408kB, limit 4089408kB, failcnt 645
go-le-bionic-1 kernel: memory+swap: usage 0kB, limit 9007199254740928kB, failcnt 0
go-le-bionic-1 kernel: kmem: usage 48896kB, limit 9007199254740928kB, failcnt 0
go-le-bionic-1 kernel: Memory cgroup stats for /docker/6883dd2c43a4899f40205b9c3ab0bb732c10b93dabcb637516d7cb9738536afa: cache:2932608KB rss:1107904KB rss_huge:927744KB shmem:2932608KB mapped_file:15424KB dirty:0KB writeb
go-le-bionic-1 kernel: [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
go-le-bionic-1 kernel: [ 2016]     0  2016     1694      137    36608        0             0 stage0
go-le-bionic-1 kernel: [ 2151]     0  2151     1767      153    40192        0             0 buildlet.exe
go-le-bionic-1 kernel: [11141]     0 11141    18084      172    48128        0             0 go
go-le-bionic-1 kernel: [11146]     0 11146    11009       89    35840        0             0 dist
go-le-bionic-1 kernel: [11203]     0 11203    19281      392    50176        0             0 go
go-le-bionic-1 kernel: [11243]     0 11243    10998       54    39680        0             0 run
go-le-bionic-1 kernel: [11249]     0 11249    18102      191    47872        0             0 go
go-le-bionic-1 kernel: [11254]     0 11254    27324    16324   171264        0             0 api
go-le-bionic-1 kernel: Memory cgroup out of memory: Kill process 11254 (api) score 244 or sacrifice child
go-le-bionic-1 kernel: Killed process 11254 (api) total-vm:1748736kB, anon-rss:1041280kB, file-rss:128kB, shmem-rss:3328kB
go-le-bionic-1 containerd[1593]: time="2021-02-25T00:08:26.989336625Z" level=info msg="shim reaped" id=6883dd2c43a4899f40205b9c3ab0bb732c10b93dabcb637516d7cb9738536afa
go-le-bionic-1 dockerd[4660]: time="2021-02-25T00:08:26.998084685Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"

The 5 builder containers are restricted to 3.9G of memory and the current VM has 20G -- is this in line with the other builders? Would you like us to bump this?
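
For context, the "memory: usage 4089408kB, limit 4089408kB" line in the log means the container ran into its cgroup memory cap of roughly 3.9 GiB, at which point the kernel killed the largest process (api). A per-container cap like that is normally applied when the container is started; the sketch below is only an illustration of how such a limit is typically set with Docker, and the container name, image name, and exact flags used on the OSU hosts are assumptions.

# hypothetical example only; the real OSU builder start scripts may differ
# --memory sets the hard cgroup cap; 4089408k matches the limit seen in the oom-killer log above
docker run -d --name go-le-builder-1 --memory=4089408k example/buildlet-image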

@pmur
Contributor

pmur commented Feb 25, 2021

Hi @detjensrobert, would it be possible to allocate another 1GB for all the ppc containers? Hopefully that is enough room to grow for a while longer. I think they are straddling those limits during some of the testing.
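
For illustration, the requested change amounts to raising that per-container cap by 1GB wherever the containers are started. The before/after sketch below is hypothetical (the container name, image, and current value are placeholders) and is not the actual change in env/linux-ppc64/osuosl:

# hypothetical sketch of the requested bump; not the actual CL
-docker run -d --name go-le-builder-1 --memory=4g example/buildlet-image
+docker run -d --name go-le-builder-1 --memory=5g example/buildlet-image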

@laboger
Contributor Author

laboger commented Feb 26, 2021

@cagedmantis Do you know when the build environments are supposed to be updated on these systems and the buildlets restarted? According to the pool output I displayed above, the buildlets are still at version 24, and in golang.org/x/build the buildletVersion was updated to 25 in November of 2019.

These have been failing for almost a week. How do we get these fixed?

@pmur
Contributor

pmur commented Feb 26, 2021

I posted an update to our builder configuration. Who can approve/apply this to our containers?

@gopherbot

Change https://golang.org/cl/296669 mentions this issue: env/linux-ppc64/osuosl: increase container ram limits

@detjensrobert

Will the above change automatically deploy to the runner nodes, or would you like me to deploy that change manually once merged?

@laboger
Contributor Author

laboger commented Mar 1, 2021

Will the above change automatically deploy to the runner nodes, or would you like me to deploy that change manually once merged?

The above change needs to be approved by @dmitshur or @cagedmantis. I don't know how code in x/build is usually deployed.

I didn't see an explicit limit on memory size for other builders like there is on ppc64le. FWIW the api binary has an extremely large RSS compared to any other binary, not only on ppc64le but also on x86 in my experiments (although ppc64le was a bit larger).
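
In case it helps anyone compare, per-process RSS during a test run can be sampled on the host with standard tools; the commands below are just one way to do it and assume the process of interest is named api.

# current and peak resident set size of the running api process (values in kB)
ps -o pid,rss,vsz,comm -C api
grep -E 'VmRSS|VmHWM' /proc/$(pgrep -n api)/status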

@dmitshur dmitshur added this to Planned in Go Release Team Mar 2, 2021
@pmur
Contributor

pmur commented Mar 4, 2021

@detjensrobert can you test the mentioned change prior to commit? I think we're deadlocked waiting for someone to test the fix, and OSU is waiting for us to commit a fix.

@detjensrobert

@pmur The extra 1G per container has been deployed to both P8 and P9 runners.

Also, the P8BE runner did not have a memory limit set on the runner containers, which would explain why it was passing when the LE runners were not.
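
For anyone verifying this later, whether a container has a memory cap at all can be read back from Docker; a value of 0 means no limit is set. The container name below is a placeholder.

# prints the container's memory limit in bytes; 0 means unlimited
docker inspect -f '{{.HostConfig.Memory}}' go-le-builder-1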

@detjensrobert

I've been keeping an eye on the nodes and the build dashboard, and it looks like there have been no further failures after bumping the memory limit. Would you like us (OSL) to do anything further?

@dmitshur dmitshur moved this from Planned to In Progress in Go Release Team Mar 15, 2021
@dmitshur dmitshur self-assigned this Mar 16, 2021
Go Release Team automation moved this from In Progress to Done Mar 16, 2021
@golang golang locked and limited conversation to collaborators Mar 16, 2022