
x/build: migrate off Kubernetes buildlets back to VMs? #25108

Closed
bradfitz opened this issue Apr 26, 2018 · 10 comments
Labels: Builders (x/build issues: builders, bots, dashboards), FrozenDueToAge
Milestone: Unreleased

Comments

@bradfitz (Contributor)

We've had a lot of flakiness with our Kubernetes-based buildlets. This seems to happen most during periods of high load, suggesting that we're still hitting isolation problems with Kubernetes, despite our various pod configuration knobs (requested CPU/mem + limit CPU/mem) being pretty paranoid and reviewed by various Kubernetes people.

Further, Kubernetes seems to get into a state where things are bad, and then a cluster upgrade or a nuke-and-recreate makes things good again... for a bit.

I think it might be time to stop using Kubernetes for our buildlets (we'll still use it for all our misc services) and switch back to using VMs.

In the past, our Linux VMs were tedious to maintain because we had to prepare a VM image for each config. We did this using a "docker2boot" tool I wrote that converted a container image (built from a Dockerfile) into a bootable GCE VM image. But the whole process & testing was still slow & painful to iterate on.

When we moved to Kubernetes, we moved to more vanilla Dockerfiles with pushes & pulls to gcr.io. This was much less painful.

I don't propose we move back to custom VM images. I don't want to use docker2boot again (as cool of a hack as it was).

Instead, I think we should use GCE's container OS image (https://cloud.google.com/container-optimized-os/docs/) and use our existing buildlet containers.

The pros of moving to VMs:

  • perfect isolation
  • better scalability without fighting the managed instance group resizing
  • faster scalability (new Kubelet nodes don't start up quickly).

The cons of moving to VMs:

  • slower instance creation time. We'll have to wait 30-60 seconds for VMs to boot, which will be on par with all our other operating systems (all the BSDs, Mac, Windows, Plan 9, etc). But this will improve in the future as GCE continues to make VM boot-up faster. And this also wouldn't matter as much if we did x/build: coordinator should keep a small pool of ready-to-go GCE buildlets #11107 (sketched after this list), which would benefit all builder types.
  • more pulls from gcr.io, once per boot. But the network is ridiculously fast and we also aren't paying for it. And we can reuse buildlets in the future to mitigate this.
  • we have to write some code, but I think it's minimal.
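
A hypothetical sketch of the #11107 idea referenced above (none of these names are from x/build): keep a few VMs pre-booted so handing out a buildlet is instant, and boot a replacement in the background.

package main

import "log"

// Buildlet stands in for a booted, ready-to-use VM buildlet.
type Buildlet struct{ Name string }

// createVM stands in for the 30-60 second GCE boot path.
func createVM() *Buildlet { return &Buildlet{Name: "buildlet-vm"} }

// Pool keeps a target number of VMs pre-booted so Get returns immediately.
type Pool struct {
	ready chan *Buildlet
}

func NewPool(target int) *Pool {
	p := &Pool{ready: make(chan *Buildlet, target)}
	for i := 0; i < target; i++ {
		go func() { p.ready <- createVM() }()
	}
	return p
}

// Get hands out a ready buildlet and starts booting a replacement in the
// background, so the pool stays at its target size.
func (p *Pool) Get() *Buildlet {
	b := <-p.ready
	go func() { p.ready <- createVM() }()
	return b
}

func main() {
	pool := NewPool(3)
	b := pool.Get()
	log.Printf("got %s immediately; replacement booting in background", b.Name)
}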

I think this is worth trying. We can make it a flag and be able to revert easily. We won't delete the Kubernetes-based building code, in case we want to switch back to it in the future.

/cc @andybons @FiloSottile

@gopherbot gopherbot added this to the Unreleased milestone Apr 26, 2018
@gopherbot gopherbot added the Builders x/build issues (builders, bots, dashboards) label Apr 26, 2018
@bradfitz (Contributor, Author)

/cc @bcmills

@andybons (Member)

+1 for this. The flakiness of the builders under load was palpable and painful during the 1.10.1 release.

@bradfitz (Contributor, Author) commented May 3, 2018

Just started on this. So far so good. I created a VM running the cos-stable image (projects/cos-cloud/global/images/cos-stable-66-10452-81-0), with GCE metadata:

$ curl -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/attributes/gce-container-declaration
spec:
  containers:
    - name: buildlet
      image: 'gcr.io/symbolic-datum-552/linux-x86-std-kube:latest'
      securityContext:
        privileged: true
      stdin: false
      tty: true
  restartPolicy: Always

(and the normal buildlet-binary-url)

... and then the buildlet came right up, and pretty quickly. (didn't measure)

The "privileged: true" part was a test. I think we'll be able to run a few more tests than we did before with it. And I think the 'tty: true' part is unnecessary. I thought we needed it for something, but I think I'm misremembering.

@gopherbot

Change https://golang.org/cl/111267 mentions this issue: all: use Container-Optimized VMs instead of Kubernetes for buildlet containers

@bradfitz (Contributor, Author) commented May 5, 2018

Note: after this is live, we'll probably want to remove the IN_KUBERNETES env var from the image, and let the tests that check for it then test more.
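
A hypothetical example of the kind of guard such tests use; the test name, package, and env value here are illustrative, not from x/build.

package build_test

import (
	"os"
	"testing"
)

func TestSomethingNeedingFullIsolation(t *testing.T) {
	// Hypothetical guard: skip work the Kubernetes-based buildlets
	// couldn't do. Once IN_KUBERNETES is gone from the image, this
	// branch stops firing and the test exercises more.
	if os.Getenv("IN_KUBERNETES") != "" {
		t.Skip("skipping; running under a Kubernetes-based buildlet")
	}
	// ... the parts of the test that need a real VM ...
}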

@gopherbot

Change https://golang.org/cl/111639 mentions this issue: all: rename Kube to Container and IsGCE to IsVM

gopherbot pushed a commit to golang/build that referenced this issue May 5, 2018
Once containers run on COS instead of Kubernetes, one name (Kube*) is
wrong and the other (GCE) is ambiguous. So rename them now to be more
specific.

No behavior changes. Just renaming in this step, to reduce size of
next CL.

Updates golang/go#25108

Change-Id: Ib09eb682ef74acbbf6ed50b46074f834ef5e0c0b
Reviewed-on: https://go-review.googlesource.com/111639
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@gopherbot

Change https://golang.org/cl/111640 mentions this issue: buildenv: add our GCP project numbers in addition to their named IDs

gopherbot pushed a commit to golang/build that referenced this issue May 5, 2018
Updates golang/go#25108

Change-Id: I5a82a4b26407158cf24d770a887759f8335d6441
Reviewed-on: https://go-review.googlesource.com/111640
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
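
A hedged sketch of the shape of that buildenv change; the field names and types are assumptions, not copied from the CL. Some GCP APIs want the numeric project number rather than the named project ID, which is why carrying both is useful.

package buildenv

// Environment describes a Go build environment (sketch; two fields only).
type Environment struct {
	// ProjectName is the named GCP project ID, e.g. "symbolic-datum-552".
	ProjectName string
	// ProjectNumber is the numeric GCP project number, which some GCP
	// APIs require instead of the named ID.
	ProjectNumber int64
}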
@gopherbot

Change https://golang.org/cl/112715 mentions this issue: cmd/buildlet: detect a noexec tmpfs /workdir on Linux and remount it exec

gopherbot pushed a commit to golang/build that referenced this issue May 10, 2018
cmd/buildlet: detect a noexec tmpfs /workdir on Linux and remount it exec

Google's Container-Optimized Linux's konlet container start-up program
creates any requested tmpfs mounts as noexec. That doesn't work for
running builds, so remount the workdir as executable.

This is required to run builds on COS instead of GKE.

Updates golang/go#25108

Change-Id: I9b719caf9180a03bafefa5b3b4b47ee43b9e5c1c
Reviewed-on: https://go-review.googlesource.com/112715
Reviewed-by: Andrew Bonventre <andybons@golang.org>
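
A minimal sketch of the detect-and-remount approach this CL describes; the real implementation lives in cmd/buildlet, and the /proc/mounts parsing and error handling here are illustrative.

package main

import (
	"bufio"
	"log"
	"os"
	"strings"

	"golang.org/x/sys/unix"
)

// workdirIsNoexec reports whether /workdir is mounted with the noexec
// option, according to /proc/mounts.
func workdirIsNoexec() bool {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		return false
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// /proc/mounts fields: device mountpoint fstype options dump pass
		fields := strings.Fields(sc.Text())
		if len(fields) >= 4 && fields[1] == "/workdir" {
			for _, opt := range strings.Split(fields[3], ",") {
				if opt == "noexec" {
					return true
				}
			}
		}
	}
	return false
}

func main() {
	if !workdirIsNoexec() {
		return
	}
	// Remount the tmpfs in place without the noexec flag. A real
	// implementation may need to preserve other mount options too.
	if err := unix.Mount("tmpfs", "/workdir", "tmpfs", unix.MS_REMOUNT, ""); err != nil {
		log.Fatalf("remounting /workdir exec: %v", err)
	}
	log.Println("remounted /workdir as exec")
}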
@gopherbot

Change https://golang.org/cl/112735 mentions this issue: env/linux-nacl: update builders to unreleased pepper_67

@gopherbot

Change https://golang.org/cl/112735 mentions this issue: env/linux-x86-nacl: update builders to unreleased pepper_67

gopherbot pushed a commit to golang/build that referenced this issue Jun 12, 2018
The nacl image hadn't been updated in 2+ years and it needed to be
updated as part of rolling out the new COS-based builders.

But no released version works for us yet; we were getting the same
errors as in golang/go#23836 ("Signal 11 from untrusted code").

We were getting lucky that it was working with an ancient (pepper_34?)
version, but I was unable to get those working again either.

Rolling forward is better anyway, as we haven't had a Dockerfile
reflecting reality for this builder for 2+ years.

This is the same version used in playground in CL 101735, which said:

> playground: update NaCl to trunk.544461
>
> This pulls in https://crrev.com/c/962675, which fixes the
> underlying issue of NaCl mishandling signals during a SIGSEGV.

Updates golang/go#23836
Updates golang/go#25108

Change-Id: I187042af71a1249e84ce2070aa8039a88d2c02c2
Reviewed-on: https://go-review.googlesource.com/112735
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Andrew Bonventre <andybons@golang.org>
golang locked and limited conversation to collaborators May 11, 2019