
x/playground: serving 502 errors #38530

Closed
bradfitz opened this issue Apr 19, 2020 · 25 comments
Labels: FrozenDueToAge, NeedsInvestigation
Milestone: Unreleased

@bradfitz (Contributor)

I've been seeing a number of 502s on the playground just now.

[screenshot: Screen Shot 2020-04-19 at 1 16 14 PM]

(Hope it wasn't my fault!)

/cc @toothrot

@gopherbot added this to the Unreleased milestone Apr 19, 2020
@toothrot (Contributor)

I'll take a look this afternoon, but it just worked for me now on mobile. There are a couple more stability issues to resolve.

@toothrot self-assigned this Apr 20, 2020
@gopherbot

Change https://golang.org/cl/228438 mentions this issue: playground: show a meaningful error on build timeout

@toothrot added the NeedsInvestigation label Apr 20, 2020
gopherbot pushed a commit to golang/playground that referenced this issue Apr 21, 2020
Add support for returning a meaningful error message to the user, along
with partial build output if present.

This change also improves graceful command termination throughout the
playground. It introduces internal.WaitOrStop, which will wait for a
command to finish, or gracefully stop the command if it has not finished
by the time the context is cancelled. This also resolves comments from
CL 227351.

Updates golang/go#38052
Updates golang/go#38530
Updates golang/go#25224

Change-Id: I3a36ad2c5f3626c9cd5b3c1139543ccde3e17ba1
Reviewed-on: https://go-review.googlesource.com/c/playground/+/228438
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
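The internal.WaitOrStop helper described in the commit message above waits for a command to finish, or stops it gracefully once the context is cancelled. A minimal sketch of that behavior, assuming a signature that takes an interrupt signal and a kill delay (an illustration, not the playground's actual implementation):

package internal

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"time"
)

// WaitOrStop waits for the already-started command cmd to exit. If ctx is
// cancelled first, it sends cmd the interrupt signal and, after killDelay,
// kills the process outright.
func WaitOrStop(ctx context.Context, cmd *exec.Cmd, interrupt os.Signal, killDelay time.Duration) error {
	errc := make(chan error, 1)
	go func() { errc <- cmd.Wait() }()

	select {
	case err := <-errc:
		return err // command exited on its own
	case <-ctx.Done():
	}

	// Context cancelled: ask the command to stop, then escalate to a kill.
	if err := cmd.Process.Signal(interrupt); err != nil {
		cmd.Process.Kill()
		return fmt.Errorf("sending %v: %v (process killed)", interrupt, err)
	}
	select {
	case err := <-errc:
		return err
	case <-time.After(killDelay):
		cmd.Process.Kill()
		return <-errc
	}
}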
@toothrot (Contributor)

It looks like the sandbox instances are running out of memory. I'm investigating.

@toothrot (Contributor)

Never mind, that may be a red herring: Docker doing its job.

@bradfitz (Contributor, Author)

I was getting a 502 on a GET to /. (See the screenshot.) That wouldn't make its way back to the gVisor world.

@toothrot (Contributor)

I understand that, but when too many requests return 500s, the App Engine LB will return a 502 for all requests from a particular Flex instance.

@gopherbot

Change https://golang.org/cl/229418 mentions this issue: sandbox: update configuration to match production

@odeke-em (Member)

I've also been getting these consistently for the past 30 minutes.
[screenshot: Screen Shot 2020-04-22 at 4 57 38 PM]
[screenshot: Screen Shot 2020-04-22 at 5 22 57 PM]

@toothrot (Contributor)

Thanks @odeke-em. We're aware of an ongoing issue and are still working to figure out the cause.

@toothrot (Contributor)

Sorry, closed the wrong issue while flipping through tabs.

@toothrot reopened this Apr 23, 2020
@toothrot (Contributor)

For some context, the new architecture from #25224 introduces sandbox hosts that are separate from our GAE instances (which I have been calling frontends, in the Google parlance).

The frontend healthcheck looks like this:

func (s *server) healthCheck() error {
	ctx := context.Background() // TODO: cap it to some reasonable timeout
	resp, err := compileAndRun(ctx, &request{Body: healthProg})
	if err != nil {
		return err
	}
	if resp.Errors != "" {
		return fmt.Errorf("compile error: %v", resp.Errors)
	}
	if len(resp.Events) != 1 || resp.Events[0].Message != "ok" {
		return fmt.Errorf("unexpected output: %v", resp.Events)
	}
	return nil
}

With the change to add new sandbox backends, this health check fails if all of the backends are down. AppEngine will mark our app as dead, and will try restarting it.

Until we understand the instability issues in our backend, I think we should change this health check to only do the "compile" phase, which happens on the frontend. It would still be helpful to mark them unhealthy if binary compilation is failing entirely.

This should help us stop serving 502s, but it doesn't get us out of the woods of our sandbox hosts being unstable. I have other CLs planned for investigating that, and it starts with measurements.
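A rough sketch of what a build-only health check could look like, mirroring the quoted healthCheck above but stopping after the build step; the sandboxBuild helper and its result fields here are hypothetical placeholders, not the playground's actual API:

func (s *server) healthCheck(ctx context.Context) error {
	// Only exercise the build phase, which runs on the frontend. Skipping
	// the run phase means a sandbox backend outage no longer causes App
	// Engine to mark the frontend unhealthy and restart it.
	br, err := sandboxBuild(ctx, &request{Body: healthProg}) // hypothetical build-only helper
	if err != nil {
		return err
	}
	if br.errorMessage != "" {
		return fmt.Errorf("build error: %v", br.errorMessage)
	}
	return nil
}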

@toothrot (Contributor)

/cc @dmitshur @cagedmantis @andybons

@dmitshur (Contributor)

I've discussed this with @toothrot and agree that this is a good next step.

@toothrot (Contributor)

It's also worth mentioning that we should move to split health checks: https://cloud.google.com/appengine/docs/flexible/python/reference/app-yaml#updated_health_checks

@gopherbot

Change https://golang.org/cl/229677 mentions this issue: playground: only build binaries for health check

gopherbot pushed a commit to golang/playground that referenced this issue Apr 23, 2020
This change avoids cascading sandbox backend failures to the frontend
instances.

The timeout TODO is safe to remove as we now pass in the request context
for cancellation, and sandboxBuild has its own build timeout.

Updates golang/go#25224
Updates golang/go#38530

Change-Id: If892f86bad08c55429b6ebab768b83c5d7621cf1
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229677
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Carlos Amedee <carlos@golang.org>
@gopherbot

Change https://golang.org/cl/229679 mentions this issue: playground: instrument HTTP handlers with StackDriver

@gopherbot

Change https://golang.org/cl/229680 mentions this issue: playground: instrument number of running containers

gopherbot pushed a commit to golang/playground that referenced this issue Apr 23, 2020
- Specify the correct image in konlet.yaml. gvisor-playground-sandbox is
the child-process container.
- Correct interpolation in config identifiers, which is deprecated.
- Set min_ready_sec for update policy to not cause an outage when
updating
- Use name_prefix for instance_template instead of name, which allows
updates. Templates are immutable, so previously this was not possible to
update.

Updates golang/go#38530
Updates golang/go#25224

Change-Id: I3f7618b8e378eaa9714e571b90390b7052bf2855
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229418
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
@gopherbot

Change https://golang.org/cl/229681 mentions this issue: sandbox: add stackdriver scope

gopherbot pushed a commit to golang/playground that referenced this issue Apr 24, 2020
This changes the sandbox VMs to allow them to write to StackDriver
metrics.

Updates golang/go#25224
Updates golang/go#38530

Change-Id: I82954b8eed3664289f5c69c0f5301a72206f0948
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229681
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
gopherbot pushed a commit to golang/playground that referenced this issue Apr 24, 2020
This change adds OpenCensus's HTTP instrumentation to the sandbox. In
development mode, it exposes a prometheus metrics interface on /statusz.
This is the first in a series of CLs to add instrumentation to different
parts of the sandbox to help investigate instability issues. For now,
reporting metrics around our responses per handler to StackDriver will
be helpful.

OpenTelemetry would be preferable, as it is the successor of OpenCensus;
however, the StackDriver integration is not quite done.

Updates golang/go#25224
Updates golang/go#38530

Change-Id: I600fd695bb66c8bee16bc0b778d51930f4cdd476
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229679
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Carlos Amedee <carlos@golang.org>
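For reference, wiring OpenCensus HTTP instrumentation with a Prometheus endpoint on /statusz generally looks something like the sketch below; the handler, namespace, and port are placeholders rather than the sandbox's actual code:

package main

import (
	"log"
	"net/http"

	"contrib.go.opencensus.io/exporter/prometheus"
	"go.opencensus.io/plugin/ochttp"
	"go.opencensus.io/stats/view"
)

func main() {
	// Register the default server-side HTTP views (request count, latency, ...).
	if err := view.Register(ochttp.DefaultServerViews...); err != nil {
		log.Fatal(err)
	}

	// In development mode, expose collected metrics for Prometheus scraping.
	pe, err := prometheus.NewExporter(prometheus.Options{Namespace: "playground_sandbox"})
	if err != nil {
		log.Fatal(err)
	}

	mux := http.NewServeMux()
	mux.HandleFunc("/run", func(w http.ResponseWriter, r *http.Request) { /* handle a run request */ })
	mux.Handle("/statusz", pe) // Prometheus metrics endpoint

	// Wrap the mux so every handler is instrumented.
	log.Fatal(http.ListenAndServe(":8080", &ochttp.Handler{Handler: mux}))
}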
gopherbot pushed a commit to golang/playground that referenced this issue Apr 24, 2020
This change adds a metric to count the number of running containers,
according to Docker.

Updates golang/go#25224
Updates golang/go#38530

Change-Id: Id989986928dff594cb1de0903a56dcffed8220c4
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229680
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Carlos Amedee <carlos@golang.org>
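Recording a running-container count like that with OpenCensus is typically a measure plus a LastValue view, roughly as sketched below; the metric name and helper are illustrative, not the sandbox's actual code:

package sandbox

import (
	"context"
	"log"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

// containerCount tracks the number of running containers reported by Docker.
var containerCount = stats.Int64(
	"go-playground/sandbox/container_count",
	"Number of running containers reported by Docker",
	stats.UnitDimensionless)

func init() {
	// A LastValue aggregation exports the most recent recorded value, i.e. a gauge.
	if err := view.Register(&view.View{
		Name:        "go-playground/sandbox/container_count",
		Description: "Number of running containers reported by Docker",
		Measure:     containerCount,
		Aggregation: view.LastValue(),
	}); err != nil {
		log.Fatal(err)
	}
}

// recordContainerCount records the current count, e.g. after asking the
// Docker daemon how many containers are running.
func recordContainerCount(ctx context.Context, running int) {
	stats.Record(ctx, containerCount.M(int64(running)))
}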
@gopherbot

Change https://golang.org/cl/229958 mentions this issue: sandbox: correct stackdriver labels

gopherbot pushed a commit to golang/playground that referenced this issue Apr 24, 2020
The generic_task target has a required label format that was not being
met. Moves metadata fetching into a helper function.

Removes call to view.RegisterExporter for StackDriver exporter, which
was unnecessary.

Updates golang/go#25224
Updates golang/go#38530

Change-Id: Ib009f5ce906f5b9479cdda8c7e8322d06e3036e4
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229958
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Carlos Amedee <carlos@golang.org>
@gopherbot

Change https://golang.org/cl/229980 mentions this issue: sandbox: add container creation metrics

@gopherbot

Change https://golang.org/cl/229981 mentions this issue: sandbox: reduce container starvation

@toothrot (Contributor)

After clearing up the concurrency issues, and modifying the frontend health check, the latest series of metrics and load testing provided us the results we were looking for:

[image]

I performed a load test of a trivial program to figure out how the sandbox scaled.

  • The container pool can be exhausted without using very much CPU
    • CPU intensive programs would have a different footprint, but forget that for a moment
  • /healthz is a proxy for how long we have to wait for a container. Seeing that histogram go up, and eventually cause an instance to be unhealthy, tells us that we can do a better job of CPU utilization here.
    • As expected, the vast majority of our load (6 QPS) in steady state is from healthz requests
  • The instance group manager is set to scale up instances after 50% CPU utilization, which is hard to hit on the sandboxes from high usage of trivial programs (which I expect most of our traffic is).

https://golang.org/cl/229980 adds container creation time metrics. Locally, it takes about 600ms to 1s to create a container. Since containers are currently created serially, that means it could take as long as 8 seconds to recover from a burst of 8 requests.

https://golang.org/cl/229981 addresses this issue by creating containers in parallel, allowing us to keep up with the request load. I'm expecting this to improve CPU utilization, hopefully triggering auto-scaling under load.
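A minimal sketch of that parallel-creation idea, using a worker pool feeding an unbuffered channel of ready containers; the container type, makeContainer helper, and names below are illustrative stand-ins, not the actual sandbox code:

package sandbox

import (
	"context"
	"log"
	"time"
)

// container and makeContainer stand in for the sandbox's real Docker-backed
// container type and creation helper.
type container struct{}

func (c *container) Close() { /* release the underlying Docker container */ }

func makeContainer(ctx context.Context) (*container, error) {
	// Creating a real container takes roughly 500ms to 1s.
	return &container{}, nil
}

var readyContainer = make(chan *container) // unbuffered, so idle workers block on the send

// startContainerWorkers launches n workers that create containers in
// parallel and hand them off to waiting requests.
func startContainerWorkers(ctx context.Context, n int) {
	for i := 0; i < n; i++ {
		go func() {
			for {
				c, err := makeContainer(ctx)
				if err != nil {
					log.Printf("makeContainer: %v", err)
					time.Sleep(time.Second) // avoid a hot loop on repeated failures
					continue
				}
				select {
				case readyContainer <- c: // handed to the next request that asks
				case <-ctx.Done():
					c.Close()
					return
				}
			}
		}()
	}
}

// getContainer blocks until a worker has a container ready, or the request
// context is cancelled.
func getContainer(ctx context.Context) (*container, error) {
	select {
	case c := <-readyContainer:
		return c, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}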

gopherbot pushed a commit to golang/playground that referenced this issue Apr 27, 2020
This change measures the latency and success of container creation.
These metrics will help capacity planning and investigating production
issues.

Updates golang/go#25224
Updates golang/go#38530

Change-Id: Id7f373acb8741d4465c6e632badb188b6e855787
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229980
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
gopherbot pushed a commit to golang/playground that referenced this issue Apr 27, 2020
Creating a container in the sandbox takes 500ms to 1s. The sandbox was
creating containers serially, but serving requests in parallel. This
means that we can be starved for workers with a trivial number of
requests.

In addition, the sandbox in production is not CPU bound, meaning we
probably have room to do some extra work while serving a request.

This CL introduces a worker pool to create containers. It also changes
the readyContainer chan to unbuffered to avoid having twice as many
containers as we expect while idle (the container waiting to be sent
plus those already in the channel's buffer).

Updates golang/go#25224
Updates golang/go#38530

Change-Id: I0e535cf65409c3dbf32329577a1c0687c2614a0d
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229981
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Carlos Amedee <carlos@golang.org>
@toothrot
Copy link
Contributor

The results of those changes are similar, but I was unable to cause the instances to become unhealthy, nor could I make the instances autoscale with trivial programs:

[image]

Many CPU-bound snippets would still trigger horizontal scaling successfully.

I honestly think this is good enough for this issue. For scaling further, we may want to consider:

  • Increasing worker count to greater than runtime.NumCPU().
    • For a traffic pattern of many small requests, we are currently more bound on IO wait than CPU
  • Removing the run semaphore. The blocking call to getContainer is likely sufficient.
    • To size the maximum number of concurrent processes by binary memory, we should do some math comparing our Docker RAM limit against our max binary size * number of workers (see the sketch at the end of this comment).
  • Increasing our number of containers to something more than max-workers.
    • It takes about 500ms to create a container (see below). A flood of tiny requests could overwhelm our ability to create them in the current configuration, without triggering autoscaling.
  • Finally, if we ever get our load balancer, we can instead trigger autoscaling by QPS.
    • You can't configure managed instance groups to scale by custom metrics, otherwise we could use our own metrics to do this. 😞

[image]
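On the memory point in the list above, the comparison is simple arithmetic: the worst case is every worker holding a maximum-size binary at once, and that total has to fit in the Docker RAM budget. A back-of-the-envelope sketch, with entirely assumed numbers:

package sandbox

// All figures below are illustrative assumptions, not production values.
const (
	maxBinarySize  = 100 << 20 // assumed cap per compiled snippet: 100 MiB
	workers        = 8         // e.g. runtime.NumCPU() on the sandbox host
	dockerRAMLimit = 1 << 30   // assumed RAM budget for sandbox containers: 1 GiB
)

// fitsInRAM reports whether every worker could hold a maximum-size binary at
// once without exceeding the Docker RAM limit.
// With the numbers above: 8 * 100 MiB = 800 MiB <= 1 GiB, so it fits.
func fitsInRAM() bool {
	return workers*maxBinarySize <= dockerRAMLimit
}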

@toothrot (Contributor)

I'm going to close this issue, as I think we're now in a good place when it comes to our daily usage. We can re-open it, or open a new one, if something specific comes up again.

I'm very happy about our new metrics, and hopefully we'll have way more insight into what's going on the next time something like this comes up for the playground.
