Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/build/env/linux-arm64/packet: restrict CPUs to a reasonable number of cores #36170

Closed
toothrot opened this issue Dec 16, 2019 · 30 comments
Closed
Assignees
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@toothrot
Copy link
Contributor

During the Go 1.14 beta1 release, we had a number of failures due to issues on our arm64 builders #36166.

Inspecting the build box, we discovered that our 96 core machine was very overloaded: a 1 minute load of over 300. This appeared to be related to the failures we encountered.

An initial fix of a maximum of 2 containers with 48 cores each seemed to resolve the issue, but there's likely a more efficient configuration.

@toothrot toothrot added the Builders x/build issues (builders, bots, dashboards) label Dec 16, 2019
@toothrot toothrot changed the title env/linux-arm64/packet: restrict CPUs to a reasonable number of cores x/build/env/linux-arm64/packet: restrict CPUs to a reasonable number of cores Dec 16, 2019
@gopherbot gopherbot added this to the Unreleased milestone Dec 16, 2019
@toothrot toothrot self-assigned this Dec 16, 2019
@gopherbot
Copy link

Change https://golang.org/cl/211579 mentions this issue: dashboard: restrict CPU usage on packet builders

@bcmills
Copy link
Contributor

bcmills commented Dec 17, 2019

I like that the packet builder is so effective in sussing out fragile tests.

Users sometimes overload their own machines, and sometimes run on machines that are much bigger than most of our builders — if we scale down the packet builders too much, we will fail to detect both real, timing-sensitive bugs, and fragile tests that manifest as flakiness for our users.

@bcmills
Copy link
Contributor

bcmills commented Dec 17, 2019

The specific failures observed in #36166 may be due to memory constraints, and in particular the CPU∶RAM ratio.

If so, then perhaps we need to re-tune the go and/or dist commands to pay more attention to RAM limits instead of (just) CPU limits.

@toothrot
Copy link
Contributor Author

toothrot commented Dec 17, 2019

I agree with that sentiment and benefit. We do still need to figure out a way to be able to build a release for this platform without wasting a day in re-running tests (which is what happened for the latest release, see #36166).

I'm not sure whether we should be adding more complexity to the scheduler for this, if we have releasebot change the configuration during a release, or if we should instead use a dedicated builder to suss out timing or memory sensitive test failures.

@bcmills
Copy link
Contributor

bcmills commented Dec 17, 2019

I agree that we should get to a state where we can build a release for linux/arm64 as quickly and reliably as for other platforms.

Dedicated builders for unusually high and unusually low CPU∶RAM ratios seem like a nice solution if we can do that. I'm not sure that the scheduler needs to be particularly more complex — if we have some reasonable idea of what a build costs, CPU and RAM limits are pretty easy to handle with semaphores. (I think the hard part is getting a reliable indicator of how much CPU and RAM the machine or container actually has available.)

@toothrot
Copy link
Contributor Author

Sure, I understand conceptually how to keep track of such things in a scheduler, but I don't think that the act of accurately accounting for all of the CPU and RAM across our builders is low complexity. It's not impossible, but I don't know if it's at the top of my list.

I appreciate your concern here, and I don't want to lose any coverage by making this change. That being said, I don't know if relying on punishing a single platform is the best way to do that. It is convenient that the packet (and ARM servers in general) have a large number of cores.

Maybe there's another way to move forward with this that doesn't involve breaking the releases for one of our platforms, but still achieves part of your goal? A dedicated highcore builder might suffice to cover what we currently have.

@bcmills
Copy link
Contributor

bcmills commented Dec 17, 2019

The scheduler doesn't have to be “accurate”. (Like most compiler-y problems, “conservative” often suffices. 😉)

At any rate: I think it's very important that our hardware be representative of what we expect real users to run. Do real users deploy to many-core linux/arm64 servers? If so, what kind of resource ratios are they likely to have? I'm happy to turn it down to “realistic”, but preferably not below that point.

@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 18, 2019
@bcmills
Copy link
Contributor

bcmills commented Dec 18, 2019

Thinking some more about this, there are actually two separate bits of value that we're currently getting from the packet builders:

  1. Due to the very high core count, the packet builder tends to be more sensitive to concurrency bugs, especially in the runtime.
  2. Due to the CPU overload, the packet builder tends to be more sensitive to timing-sensitivity bugs in tests.

So if we replace our large, overloaded packet builders with dedicated builders for unusual conditions, that suggests two new builders:

  1. Probably just one “very high CPU count” builder to suss out concurrency bugs, analogous to a longtest builder, or perhaps actually a longtest builder because the more CPU-intensive tests will tend to be long tests anyway). An ARM64 builder seems ideal for this, because we can get lots of independent cores for cheap.
  2. One or more “very low CPU per builder” builders to suss out timing bugs, but not necessarily an ARM64 builder (generally any arch will do).

@cherrymui
Copy link
Member

We almost never see such a OOM failure on the builder dashboard, but we do see this when building the release. I also sometimes saw the failure when running all.bash manually on gomote. Do you know what the difference is?

As Bryan said, it is helpful to have this builder with very high core count. I think it would be much better if we could limit the number of parallel processes in building the release or running all.bash, but not limiting the number of accessible cores (for each individual process). But I'm not sure how to do this.

@bcmills
Copy link
Contributor

bcmills commented Dec 18, 2019

Note that https://farmer.golang.org/ is currently showing 18 of 20 packet builders down. Is that related to the release errors?

@toothrot
Copy link
Contributor Author

@bcmills The 18 of 20 being down is related to the pending CL. We manually configured it to 2 builders in order to get the release out. I was expecting to be able to move forward with that CL to update the dashboard configuration.

@toothrot
Copy link
Contributor Author

@cherrymui We could limit the number of parallel builds without restricting the number of accessible cores. With an unrestricted number of cores, I found that if there were more than two trybot runs happening in addition to a release, the release generally failed.

I believe packet has a newer hardware configuration that we could use, as @bcmills mentioned, that has a similar number of cores but significantly greater RAM.

@toothrot
Copy link
Contributor Author

As long as we all agree on the following thoughts, I think we can move forward:

  • The linux-arm64 builder should be able to reliably create releases. We'll restrict CPU, RAM, and parallel builds on the current builder in order to achieve this.
  • A second linux-arm64 machine can be provisioned as a highcpu-longtest builder to capture our loss of a builder that covers this scenario. Ideally, it also runs releases at some cadence in order to capture the OOM failure @cherrymui described.
  • A yet-to-be-determined low-CPU (single CPU?) builder should be provisioned somewhere

gopherbot pushed a commit to golang/build that referenced this issue Dec 19, 2019
The packet arm64 build box has 96 cores, and easily gets overloaded with
multiple builds running simultaneously. Restrict each container to fewer
cores, and decrease the number of workers.

In order to not lose some of the coverage we got from packet as a
high-CPU builder, we will need to create new builder types going forward
to ensure we don't lose coverage. See the issue for details.

Updates golang/go#36170

Change-Id: I16478fa05258bc14ac59a1e708133ffcc4c5a664
Reviewed-on: https://go-review.googlesource.com/c/build/+/211579
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
@vielmetti
Copy link

vielmetti commented Apr 29, 2020

@toothrot - Packet has also available a 2d generation Ampere eMag server, a c2.large.arm - happy to allocate one more of those systems to the build pool there. It's 32 cores with appreciably faster per-core performance than the older ThunderX 96-core system.

@toothrot
Copy link
Contributor Author

/cc @cagedmantis, who is currently working on our ARM builders.

@toothrot toothrot removed their assignment Apr 30, 2020
@cagedmantis cagedmantis self-assigned this Apr 30, 2020
@vielmetti
Copy link

Good to be acquainted @cagedmantis - looking forward to adding a system to the collection.

@cagedmantis
Copy link
Contributor

@vielmetti Would it be possible to continue this conversation over email? You can reach me at carlos@golang.org.

@vielmetti
Copy link

@cagedmantis email sent!

@gopherbot
Copy link

Change https://golang.org/cl/303232 mentions this issue: dashboard: add linux-arm64-aws to set of trybots

@vielmetti
Copy link

vielmetti commented Mar 19, 2021

@cagedmantis Do you still need the linux-arm64-packet trybot for anything? We've been planning to transition that system to much faster Ampere Altra hardware.

@cagedmantis
Copy link
Contributor

@vielmetti Yes, we still think it's useful to have the linux-arm64-packet builder. As @bcmills noted above, one reason it is valuable is in sussing out fragile tests. Thank you for the contribution.

gopherbot pushed a commit to golang/build that referenced this issue Apr 2, 2021
This change adds linux-arm64-aws to the set of trybots. It also
removes linux-arm and linux-arm64 from the list of builds attempted in
the misc-compile-linuxarm builder.

Updates golang/go#36170
Fixes golang/go#45065

Change-Id: If25fed08e35b2a91c9c9dbbf31701ff5464dc913
Reviewed-on: https://go-review.googlesource.com/c/build/+/303232
Trust: Carlos Amedee <carlos@golang.org>
Run-TryBot: Carlos Amedee <carlos@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
@vielmetti
Copy link

The hardware behind the linux-arm64-packet builder, an Equinix Metal (ex-Packet) c1.large.arm system, is being decommissioned at the end of the month of May 2022. @cagedmantis you will get an email to that effect. Questions can go here or to the contact directed in the message.

The Cavium ThunderX had a good run - I have a poster of it in my office - but it's no longer commercially relevant and no longer provisionable at Equinix Metal. We now have a c3.large.arm (Ampere Altra) which is better in every single way except that it only has 80 cores instead of 96, but with Neoverse N1 cores (same as Gravit2) rather than the wimpy cores in ThunderX.

@cagedmantis
Copy link
Contributor

@vielmetti I've been meaning to address the email with the @golang/release team. I will respond via email. Thanks for the update.

@vielmetti
Copy link

c1.large.arm deprecation notice here:

https://metal.equinix.com/developers/docs/hardware/legacy-servers/#the-first-generation

"Equinix Metal maintains a number of servers that pre-date its acquisition by Equinix. Due to their age and their concentration in legacy Packet facilities, they are no longer provisionable on demand. Currently deployed servers and reserved servers are unaffected and remain the same. We strongly encourage you to migrate to newer generation systems in Equinix IBX locations. If you have questions about first generation server deployments, please contact support.

@vielmetti
Copy link

I realize that there are not a lot of ThunderX servers in the wild (I think Scaleway turned theirs off too).

@pgmwoa
Copy link

pgmwoa commented Jun 15, 2022

@vielmetti I've been meaning to address the email with the @golang/release team. I will respond via email. Thanks for the update.

@cagedmantis, I would like to inform you that we plan take down the old servers by 27th Jun. It was planned earlier but as we were waiting to hear from you, we held this specific machine for some more time. It is not possible for us to delay further. Request you to please plan accordingly.

@cagedmantis
Copy link
Contributor

cagedmantis commented Jun 17, 2022

@pgmwoa I apologize for the delayed response. The team is incredibly appreciative of your group for providing this machine to the community over the years. At this point we think we have enough linux-arm64 builders to help us sufficiently test that port. Please feel free to run down the machine. I will create an issue (#53436) to sunset things on our side.

@pgmwoa
Copy link

pgmwoa commented Jun 28, 2022

@cagedmantis , heads up... we are taking the machine down. Though it is still showing active, I assume you have taken care of things at your end and should be ok.

@cagedmantis
Copy link
Contributor

@pgmwoa Yes, you are free to take it down. Thank you for confirming.

@dmitshur dmitshur added this to In Progress in Go Release Team Jun 28, 2022
@cagedmantis
Copy link
Contributor

I am closing this issue since we no longer have the packet builder.

Go Release Team automation moved this from In Progress to Done Jun 28, 2022
@golang golang locked and limited conversation to collaborators Jun 28, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
Archived in project
Development

No branches or pull requests

8 participants