Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/playground: use gVisor on linux/amd64 binaries instead of NaCl for sandboxing #25224

Closed
andybons opened this issue May 2, 2018 · 57 comments
Closed
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@andybons
Copy link
Member

andybons commented May 2, 2018

Native Client deprecation has been announced in favor of Web Assembly. It's unclear what that means in terms of NaCl's future development and support.

I'd like us to investigate using gVisor as our sandbox mechanism for the playground. This is not a statement that we should definitively switch over to gVisor.

https://github.com/google/gvisor

@ysmolsky

@gopherbot gopherbot added this to the Unreleased milestone May 2, 2018
@andybons andybons added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label May 2, 2018
@andybons
Copy link
Member Author

andybons commented Aug 25, 2018

This has come up again as attempting to upgrade the playground to 1.11 results in the following breakage during the NaCl time patching phase:

 ---> ed8ffc3e1836
Step 20/63 : RUN patch -p1 -d /usr/local/go </usr/local/playground/strict-time.patch
 ---> Running in 1f97414872b7
patching file src/runtime/os_nacl.go
Reversed (or previously applied) patch detected!  Assume -R? [n] 
Apply anyway? [n] 
Skipping patch.
1 out of 1 hunk ignored -- saving rejects to file src/runtime/os_nacl.go.rej
patching file src/runtime/sys_nacl_amd64p32.s
Reversed (or previously applied) patch detected!  Assume -R? [n] 
Apply anyway? [n] 
Skipping patch.
1 out of 1 hunk ignored -- saving rejects to file src/runtime/sys_nacl_amd64p32.s.rej
The command '/bin/sh -c patch -p1 -d /usr/local/go </usr/local/playground/strict-time.patch' returned a non-zero code: 1

/cc @bradfitz

@bradfitz
Copy link
Contributor

That's @bcmills's patch. The old patch (enable-fake-time.patch) was designed to basically never bit rot but the new patch seems to have more context that makes it fragile. We should really upstream this and have it behind a bool or build tag so it's less likely to rot.

@bradfitz
Copy link
Contributor

I've been playing around with this. I'm deciding between:

  1. continuing to use App Engine Flex and launching gvisor's runsc "directly" (if one could consider the OCI interface direct). Unfortunately runsc still requires root today (for now: runsc doesn't work with rootless podman google/gvisor#311) and App Engine Flex doesn't give you root at serving time (only for debugging). So doing this would require modifying runsc to not require root.

  2. migrating off App Engine Flex and just using Docker with vanilla runsc. We'd need to change how we serve a bit. Probably an HTTP load balancer to a GCE VM instance group (perhaps running Container-Optimized OS, which already runs docker, and supports declarative yaml configs of which containers to run, including I believe privileged ones)

I think I'm leaning towards (2).

As part of this, we'd migrate to (2) well before the Go 1.14 release (which drops nacl), so the release process doesn't involve changing playground machinery during a release. We'd still support Go 1.13 nacl mode with the new infrastructure, and then the Go 1.14 release would just be changing a single line in its Dockerfile.

/cc @dmitshur @andybons @toothrot @prattmic

@toothrot
Copy link
Contributor

@bradfitz Have you investigated using Cloud Run? The documentation seems to suggest that the runtime user is root: https://cloud.google.com/run/docs/tips#container-security

@bradfitz
Copy link
Contributor

@toothrot, Cloud Run is already under gVisor itself. (and it doesn't let us tweak its sandbox config to be more restrictive, as we'd need). Cloud Run gives you root, but fake-ish constrained root. We definitely couldn't use gVisor's KVM mode ("platform") under it. We could probably just the ptrace mode (ptrace syscall is at least partially supported: https://gvisor.dev/docs/user_guide/compatibility/linux/amd64/). But once runsc went to do the other root-requiring stuff (setting up namespaces, etc), I suspect we'd trip over unimplemented system calls. e.g. I see on that system call list:

Mount namespace (CLONE_NEWNS) not supported. Options CLONE_PARENT, CLONE_SYSVSEM not supported.

So basically, to use Cloud Run we'd need to do the same work hacking up runsc to run rootless that we'd have to do to run it on its current App Engine Flex environment.

That said, we should move many of our things to Cloud Run regardless, but I don't think it'd help us here.

@prattmic can correct me if indeed the ptrace mode of runsc would work unmodified under gvisor. That would be a happy surprise.

@bradfitz
Copy link
Contributor

@prattmic confirms that nested gVisor doesn't work yet.

@bradfitz bradfitz changed the title x/playground: investigate using gVisor instead of NaCl for sandboxing x/playground: use gVisor on linux/amd64 binaries instead of NaCl for sandboxing Sep 16, 2019
@bradfitz
Copy link
Contributor

In case we do end up going the path of using some modified root-less runsc directly, without docker, this syzkaller code drives runsc directly:

https://github.com/google/syzkaller/blob/master/vm/gvisor/gvisor.go

@gopherbot
Copy link

Change https://golang.org/cl/195983 mentions this issue: sandbox: add gvisor runsc-based sandbox

@tv42
Copy link

tv42 commented Jan 9, 2020

For what it's worth, this doesn't seem to work right: gVisor runs see a lot of "Error communicating with remote server." and when it does work, stdout is not visible. https://play.golang.org/p/5_6CpdUjYY1

@bradfitz
Copy link
Contributor

bradfitz commented Jan 9, 2020

@tv42, it's not fully deployed yet.

@bradfitz
Copy link
Contributor

bradfitz commented Jan 9, 2020

Reopening until it's all deployed, though.

@bradfitz bradfitz reopened this Jan 9, 2020
@bradfitz bradfitz added the NeedsFix The path to resolution is known, but the work has not been done. label Jan 9, 2020
@gopherbot gopherbot removed the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 9, 2020
@bradfitz
Copy link
Contributor

bradfitz commented Jan 9, 2020

@toothrot, next up we need to run the frontend in a different network:

https://cloud.google.com/appengine/docs/flexible/go/using-shared-vpc

Have you done this before? (I haven't.)

Because right now it can't contact the internal load balancer.

We should test it first at a play-test.golang.org version and verify that it works (and doesn't break, say, memcache connections).

@bradfitz bradfitz modified the milestones: Backlog, Go1.14 Jan 9, 2020
@gopherbot
Copy link

Change https://golang.org/cl/229418 mentions this issue: sandbox: update configuration to match production

@gopherbot
Copy link

Change https://golang.org/cl/229677 mentions this issue: playground: only build binaries for health check

gopherbot pushed a commit to golang/playground that referenced this issue Apr 23, 2020
This change avoids cascading sandbox backend failures to the frontend
instances.

The timeout TODO is safe to remove as we now pass in the request context
for cancellation, and sandboxBuild has its own build timeout.

Updates golang/go#25224
Updates golang/go#38530

Change-Id: If892f86bad08c55429b6ebab768b83c5d7621cf1
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229677
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Carlos Amedee <carlos@golang.org>
@gopherbot
Copy link

Change https://golang.org/cl/229679 mentions this issue: playground: instrument HTTP handlers with StackDriver

@gopherbot
Copy link

Change https://golang.org/cl/229680 mentions this issue: playground: instrument number of running containers

gopherbot pushed a commit to golang/playground that referenced this issue Apr 23, 2020
- Specify the correct image in konlet.yaml. gvisor-playground-sandbox is
the child-process container.
- Correct interpolation in config identifiers, which is deprecated.
- Set min_ready_sec for update policy to not cause an outage when
updating
- Use name_prefix for instance_template instead of name, which allows
updates. Templates are immutable, so previously this was not possible to
update.

Updates golang/go#38530
Updates golang/go#25224

Change-Id: I3f7618b8e378eaa9714e571b90390b7052bf2855
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229418
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
@gopherbot
Copy link

Change https://golang.org/cl/229681 mentions this issue: sandbox: add stackdriver scope

@gopherbot
Copy link

Change https://golang.org/cl/229957 mentions this issue: sandbox: change from n1 to e2 instances

gopherbot pushed a commit to golang/playground that referenced this issue Apr 24, 2020
This conforms with a policy change, and should save money with no
measurable performance impact.

Removes bradfitz's SSH key from instances (that are inaccessible
anyway).

Updates golang/go#25224

Change-Id: I1733192c98deee1deabf2237ae5fe19edd29ab93
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229957
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Carlos Amedee <carlos@golang.org>
gopherbot pushed a commit to golang/playground that referenced this issue Apr 24, 2020
This changes the sandbox VMs to allow them to write to StackDriver
metrics.

Updates golang/go#25224
Updates golang/go#38530

Change-Id: I82954b8eed3664289f5c69c0f5301a72206f0948
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229681
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
gopherbot pushed a commit to golang/playground that referenced this issue Apr 24, 2020
This change adds OpenCensus's HTTP instrumentation to the sandbox. In
development mode, it exposes a prometheus metrics interface on /statusz.
This is the first in a series of CLs to add instrumentation to different
parts of the sandbox to help investigate instability issues. For now,
reporting metrics around our responses per handler to StackDriver will
be helpful.

OpenTelemetry would be preferable, as it is the successor of OpenCensus,
however the StackDriver integration is not quite done.

Updates golang/go#25224
Updates golang/go#38530

Change-Id: I600fd695bb66c8bee16bc0b778d51930f4cdd476
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229679
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Carlos Amedee <carlos@golang.org>
gopherbot pushed a commit to golang/playground that referenced this issue Apr 24, 2020
This change adds a metric to count the number of running containers,
according to Docker.

Updates golang/go#25224
Updates golang/go#38530

Change-Id: Id989986928dff594cb1de0903a56dcffed8220c4
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229680
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Carlos Amedee <carlos@golang.org>
@gopherbot
Copy link

Change https://golang.org/cl/229958 mentions this issue: sandbox: correct stackdriver labels

gopherbot pushed a commit to golang/playground that referenced this issue Apr 24, 2020
The generic_task target has a required label format that was not being
met. Moves metadata fetching into a helper function.

Removes call to view.RegisterExporter for StackDriver exporter, which
was unncessary.

Updates golang/go#25224
Updates golang/go#38530

Change-Id: Ib009f5ce906f5b9479cdda8c7e8322d06e3036e4
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229958
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Carlos Amedee <carlos@golang.org>
@gopherbot
Copy link

Change https://golang.org/cl/229980 mentions this issue: sandbox: add container creation metrics

@gopherbot
Copy link

Change https://golang.org/cl/229981 mentions this issue: sandbox: reduce container starvation

gopherbot pushed a commit to golang/playground that referenced this issue Apr 27, 2020
This change measures the latency and success of container creation.
These metrics will help capacity planning and investigating production
issues.

Updates golang/go#25224
Updates golang/go#38530

Change-Id: Id7f373acb8741d4465c6e632badb188b6e855787
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229980
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
gopherbot pushed a commit to golang/playground that referenced this issue Apr 27, 2020
Creating a container in the sandbox takes 500ms to 1s. The sandbox was
creating containers serially, but serving requests in parallel. This
means that we can be starved for workers with a trivial number of
requests.

In addition, the sandbox in production is not CPU bound, meaning we
probably have room to do some extra work while serving a request.

This CL introduces a worker pool to create containers. It also changes
the readyContainer chan to unbuffered to avoid having twice as many
containers as we expect while idle (the container waiting to be sent
plus those already in the channel's buffer).

Updates golang/go#25224
Updates golang/go#38530

Change-Id: I0e535cf65409c3dbf32329577a1c0687c2614a0d
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229981
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Carlos Amedee <carlos@golang.org>
@toothrot
Copy link
Contributor

Cross-linking this comment mentioning our new sandbox metrics for posterity: #38530 (comment)

@toothrot
Copy link
Contributor

I misunderstood in the referenced issue #38530: you can use custom metrics for autoscaling: https://cloud.google.com/compute/docs/autoscaler/scaling-stackdriver-monitoring-metrics#custom_metrics

QPS or container_count could be good candidates. We'd have to update our target to be a GCE instance rather than a generic_task for the metric we choose.

@toothrot
Copy link
Contributor

I think this is effectively finished.

@golang golang locked and limited conversation to collaborators Jan 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

9 participants