
x/playground: serving 502 errors #38530

Closed
bradfitz opened this issue Apr 19, 2020 · 25 comments
Labels: FrozenDueToAge, NeedsInvestigation
Milestone: Unreleased

@bradfitz (Contributor)

I've been seeing a number of 502s on the playground just now.

[screenshot: Screen Shot 2020-04-19 at 1 16 14 PM]

(Hope it wasn't my fault!)

/cc @toothrot

@gopherbot added this to the Unreleased milestone Apr 19, 2020
@toothrot (Contributor)

I'll take a look this afternoon, but it just worked for me now on mobile. There are a couple more stability issues to resolve.

@toothrot self-assigned this Apr 20, 2020
@gopherbot

Change https://golang.org/cl/228438 mentions this issue: playground: show a meaningful error on build timeout

@toothrot added the NeedsInvestigation label Apr 20, 2020
gopherbot pushed a commit to golang/playground that referenced this issue Apr 21, 2020
Add support for returning a meaningful error message to the user, along
with partial build output if present.

This change also improves graceful command termination throughout the
playground. It introduces internal.WaitOrStop, which will wait for a
command to finish, or gracefully stop the command if it has not finished
by the time the context is cancelled. This also resolves comments from
CL 227351.

Updates golang/go#38052
Updates golang/go#38530
Updates golang/go#25224

Change-Id: I3a36ad2c5f3626c9cd5b3c1139543ccde3e17ba1
Reviewed-on: https://go-review.googlesource.com/c/playground/+/228438
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
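The internal.WaitOrStop helper described in the commit message above waits for a command to finish, or stops it gracefully once the context is cancelled. A minimal sketch of that behavior, assuming a signature that takes an interrupt signal and a kill delay (an illustration, not the playground's actual implementation):

package internal

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"time"
)

// WaitOrStop waits for the already-started command cmd to exit. If ctx is
// cancelled first, it sends cmd the interrupt signal and, after killDelay,
// kills the process outright.
func WaitOrStop(ctx context.Context, cmd *exec.Cmd, interrupt os.Signal, killDelay time.Duration) error {
	errc := make(chan error, 1)
	go func() { errc <- cmd.Wait() }()

	select {
	case err := <-errc:
		return err // command exited on its own
	case <-ctx.Done():
	}

	// Context cancelled: ask the command to stop, then escalate to a kill.
	if err := cmd.Process.Signal(interrupt); err != nil {
		cmd.Process.Kill()
		return fmt.Errorf("sending %v: %v (process killed)", interrupt, err)
	}
	select {
	case err := <-errc:
		return err
	case <-time.After(killDelay):
		cmd.Process.Kill()
		return <-errc
	}
}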
@toothrot (Contributor)

It looks like the sandbox instances are running out of memory. I'm investigating.

@toothrot (Contributor)

Never mind, that may be a red herring: Docker doing its job.

@bradfitz (Contributor, Author)

I was getting a 502 on a GET to /. (See the screenshot.) That wouldn't make its way back to the gVisor world.

@toothrot (Contributor)

I understand that, but when too many requests return 500s, the App Engine LB will return a 502 for all requests from a particular Flex instance.

@gopherbot

Change https://golang.org/cl/229418 mentions this issue: sandbox: update configuration to match production

@odeke-em (Member)

I've also been getting these consistently for the past 30 minutes.
[screenshot: Screen Shot 2020-04-22 at 4 57 38 PM]
[screenshot: Screen Shot 2020-04-22 at 5 22 57 PM]

@toothrot (Contributor)

Thanks @odeke-em. We're aware of an ongoing issue and are still working to figure out the cause.

@toothrot (Contributor)

Sorry, closed the wrong issue while flipping through tabs.

@toothrot reopened this Apr 23, 2020
@toothrot (Contributor)

For some context, the new architecture from #25224 introduces sandbox hosts that are separate from our GAE instances (which I have been calling frontends, in the Google parlance).

The frontend healthcheck looks like this:

func (s *server) healthCheck() error {
	ctx := context.Background() // TODO: cap it to some reasonable timeout
	resp, err := compileAndRun(ctx, &request{Body: healthProg})
	if err != nil {
		return err
	}
	if resp.Errors != "" {
		return fmt.Errorf("compile error: %v", resp.Errors)
	}
	if len(resp.Events) != 1 || resp.Events[0].Message != "ok" {
		return fmt.Errorf("unexpected output: %v", resp.Events)
	}
	return nil
}

With the change to add new sandbox backends, this health check fails if all of the backends are down. AppEngine will mark our app as dead, and will try restarting it.

Until we understand the instability issues in our backend, I think we should change this health check to only do the "compile" phase, which happens on the frontend. It would still be helpful to mark them unhealthy if binary compilation is failing entirely.

This should help us stop serving 502s, but it doesn't get us out of the woods of our sandbox hosts being unstable. I have other CLs planned for investigating that, and it starts with measurements.
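A rough sketch of what a build-only health check could look like, mirroring the quoted healthCheck above but stopping after the build step; the sandboxBuild helper and its result fields here are hypothetical placeholders, not the playground's actual API:

func (s *server) healthCheck(ctx context.Context) error {
	// Only exercise the build phase, which runs on the frontend. Skipping
	// the run phase means a sandbox backend outage no longer causes App
	// Engine to mark the frontend unhealthy and restart it.
	br, err := sandboxBuild(ctx, &request{Body: healthProg}) // hypothetical build-only helper
	if err != nil {
		return err
	}
	if br.errorMessage != "" {
		return fmt.Errorf("build error: %v", br.errorMessage)
	}
	return nil
}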

@toothrot (Contributor)

/cc @dmitshur @cagedmantis @andybons

@dmitshur (Contributor)

I've discussed this with @toothrot and agree that this is a good next step.

@toothrot (Contributor)

It's also worth mentioning that we should move to split health checks: https://cloud.google.com/appengine/docs/flexible/python/reference/app-yaml#updated_health_checks

@gopherbot

Change https://golang.org/cl/229677 mentions this issue: playground: only build binaries for health check

gopherbot pushed a commit to golang/playground that referenced this issue Apr 23, 2020
This change avoids cascading sandbox backend failures to the frontend
instances.

The timeout TODO is safe to remove as we now pass in the request context
for cancellation, and sandboxBuild has its own build timeout.

Updates golang/go#25224
Updates golang/go#38530

Change-Id: If892f86bad08c55429b6ebab768b83c5d7621cf1
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229677
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Carlos Amedee <carlos@golang.org>
@gopherbot

Change https://golang.org/cl/229679 mentions this issue: playground: instrument HTTP handlers with StackDriver

@gopherbot

Change https://golang.org/cl/229680 mentions this issue: playground: instrument number of running containers

gopherbot pushed a commit to golang/playground that referenced this issue Apr 23, 2020
- Specify the correct image in konlet.yaml. gvisor-playground-sandbox is
the child-process container.
- Correct interpolation in config identifiers, which is deprecated.
- Set min_ready_sec for update policy to not cause an outage when
updating
- Use name_prefix for instance_template instead of name, which allows
updates. Templates are immutable, so previously this was not possible to
update.

Updates golang/go#38530
Updates golang/go#25224

Change-Id: I3f7618b8e378eaa9714e571b90390b7052bf2855
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229418
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
@gopherbot

Change https://golang.org/cl/229681 mentions this issue: sandbox: add stackdriver scope

gopherbot pushed a commit to golang/playground that referenced this issue Apr 24, 2020
This changes the sandbox VMs to allow them to write to StackDriver
metrics.

Updates golang/go#25224
Updates golang/go#38530

Change-Id: I82954b8eed3664289f5c69c0f5301a72206f0948
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229681
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
gopherbot pushed a commit to golang/playground that referenced this issue Apr 24, 2020
This change adds OpenCensus's HTTP instrumentation to the sandbox. In
development mode, it exposes a prometheus metrics interface on /statusz.
This is the first in a series of CLs to add instrumentation to different
parts of the sandbox to help investigate instability issues. For now,
reporting metrics around our responses per handler to StackDriver will
be helpful.

OpenTelemetry would be preferable, as it is the successor of OpenCensus;
however, the StackDriver integration is not quite done.

Updates golang/go#25224
Updates golang/go#38530

Change-Id: I600fd695bb66c8bee16bc0b778d51930f4cdd476
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229679
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Carlos Amedee <carlos@golang.org>
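For reference, wiring OpenCensus HTTP instrumentation with a Prometheus endpoint on /statusz generally looks something like the sketch below; the handler, namespace, and port are placeholders rather than the sandbox's actual code:

package main

import (
	"log"
	"net/http"

	"contrib.go.opencensus.io/exporter/prometheus"
	"go.opencensus.io/plugin/ochttp"
	"go.opencensus.io/stats/view"
)

func main() {
	// Register the default server-side HTTP views (request count, latency, ...).
	if err := view.Register(ochttp.DefaultServerViews...); err != nil {
		log.Fatal(err)
	}

	// In development mode, expose collected metrics for Prometheus scraping.
	pe, err := prometheus.NewExporter(prometheus.Options{Namespace: "playground_sandbox"})
	if err != nil {
		log.Fatal(err)
	}

	mux := http.NewServeMux()
	mux.HandleFunc("/run", func(w http.ResponseWriter, r *http.Request) { /* handle a run request */ })
	mux.Handle("/statusz", pe) // Prometheus metrics endpoint

	// Wrap the mux so every handler is instrumented.
	log.Fatal(http.ListenAndServe(":8080", &ochttp.Handler{Handler: mux}))
}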
gopherbot pushed a commit to golang/playground that referenced this issue Apr 24, 2020
This change adds a metric to count the number of running containers,
according to Docker.

Updates golang/go#25224
Updates golang/go#38530

Change-Id: Id989986928dff594cb1de0903a56dcffed8220c4
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229680
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Carlos Amedee <carlos@golang.org>
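Recording a running-container count like that with OpenCensus is typically a measure plus a LastValue view, roughly as sketched below; the metric name and helper are illustrative, not the sandbox's actual code:

package sandbox

import (
	"context"
	"log"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

// containerCount tracks the number of running containers reported by Docker.
var containerCount = stats.Int64(
	"go-playground/sandbox/container_count",
	"Number of running containers reported by Docker",
	stats.UnitDimensionless)

func init() {
	// A LastValue aggregation exports the most recent recorded value, i.e. a gauge.
	if err := view.Register(&view.View{
		Name:        "go-playground/sandbox/container_count",
		Description: "Number of running containers reported by Docker",
		Measure:     containerCount,
		Aggregation: view.LastValue(),
	}); err != nil {
		log.Fatal(err)
	}
}

// recordContainerCount records the current count, e.g. after asking the
// Docker daemon how many containers are running.
func recordContainerCount(ctx context.Context, running int) {
	stats.Record(ctx, containerCount.M(int64(running)))
}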
@gopherbot

Change https://golang.org/cl/229958 mentions this issue: sandbox: correct stackdriver labels

gopherbot pushed a commit to golang/playground that referenced this issue Apr 24, 2020
The generic_task target has a required label format that was not being
met. Moves metadata fetching into a helper function.

Removes call to view.RegisterExporter for StackDriver exporter, which
was unnecessary.

Updates golang/go#25224
Updates golang/go#38530

Change-Id: Ib009f5ce906f5b9479cdda8c7e8322d06e3036e4
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229958
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Carlos Amedee <carlos@golang.org>
@gopherbot

Change https://golang.org/cl/229980 mentions this issue: sandbox: add container creation metrics

@gopherbot

Change https://golang.org/cl/229981 mentions this issue: sandbox: reduce container starvation

@toothrot (Contributor)

After clearing up the concurrency issues, and modifying the frontend health check, the latest series of metrics and load testing provided us the results we were looking for:

[image]

I performed a load test of a trivial program to figure out how the sandbox scaled.

  • The container pool can be exhausted without using very much CPU
    • CPU intensive programs would have a different footprint, but forget that for a moment
  • /healthz is a proxy for how long we have to wait for a container. Seeing that histogram go up, and eventually cause an instance to be unhealthy, tells us that we can do a better job of CPU utilization here.
    • As expected, the vast majority of our load (6 QPS) in steady state is from healthz requests
  • The instance group manager is set to scale up instances after 50% CPU utilization, which is hard to hit on the sandboxes from high usage of trivial programs (which I expect most of our traffic is).

https://golang.org/cl/229980 adds container creation time metrics. Locally, it takes about 600ms to 1s to create a container. Since containers are currently created serially, that means it could take as long as 8 seconds to recover from a burst of 8 requests.

https://golang.org/cl/229981 addresses this issue by creating containers in parallel, allowing us to keep up with the request load. I'm expecting this to improve CPU utilization, hopefully triggering auto-scaling under load.
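A minimal sketch of that parallel-creation idea, using a worker pool feeding an unbuffered channel of ready containers; the container type, makeContainer helper, and names below are illustrative stand-ins, not the actual sandbox code:

package sandbox

import (
	"context"
	"log"
	"time"
)

// container and makeContainer stand in for the sandbox's real Docker-backed
// container type and creation helper.
type container struct{}

func (c *container) Close() { /* release the underlying Docker container */ }

func makeContainer(ctx context.Context) (*container, error) {
	// Creating a real container takes roughly 500ms to 1s.
	return &container{}, nil
}

var readyContainer = make(chan *container) // unbuffered, so idle workers block on the send

// startContainerWorkers launches n workers that create containers in
// parallel and hand them off to waiting requests.
func startContainerWorkers(ctx context.Context, n int) {
	for i := 0; i < n; i++ {
		go func() {
			for {
				c, err := makeContainer(ctx)
				if err != nil {
					log.Printf("makeContainer: %v", err)
					time.Sleep(time.Second) // avoid a hot loop on repeated failures
					continue
				}
				select {
				case readyContainer <- c: // handed to the next request that asks
				case <-ctx.Done():
					c.Close()
					return
				}
			}
		}()
	}
}

// getContainer blocks until a worker has a container ready, or the request
// context is cancelled.
func getContainer(ctx context.Context) (*container, error) {
	select {
	case c := <-readyContainer:
		return c, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}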

gopherbot pushed a commit to golang/playground that referenced this issue Apr 27, 2020
This change measures the latency and success of container creation.
These metrics will help capacity planning and investigating production
issues.

Updates golang/go#25224
Updates golang/go#38530

Change-Id: Id7f373acb8741d4465c6e632badb188b6e855787
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229980
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
gopherbot pushed a commit to golang/playground that referenced this issue Apr 27, 2020
Creating a container in the sandbox takes 500ms to 1s. The sandbox was
creating containers serially, but serving requests in parallel. This
means that we can be starved for workers with a trivial number of
requests.

In addition, the sandbox in production is not CPU bound, meaning we
probably have room to do some extra work while serving a request.

This CL introduces a worker pool to create containers. It also changes
the readyContainer chan to unbuffered to avoid having twice as many
containers as we expect while idle (the container waiting to be sent
plus those already in the channel's buffer).

Updates golang/go#25224
Updates golang/go#38530

Change-Id: I0e535cf65409c3dbf32329577a1c0687c2614a0d
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229981
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Carlos Amedee <carlos@golang.org>
@toothrot
Copy link
Contributor

The results of those changes are similar, but I was unable to cause the instances to become unhealthy, nor could I make the instances autoscale with trivial programs:

[image]

Many CPU-bound snippets would still trigger horizontal scaling successfully.

I honestly think this is good enough for this issue. For scaling further, we may want to consider:

  • Increasing worker count to greater than runtime.NumCPU().
    • For a traffic pattern of many small requests, we are currently more bound on IO wait than CPU
  • Removing the run semaphore. The blocking call to getContainer is likely sufficient.
    • To size the maximum number of concurrent processes by binary memory, we should do some math comparing our Docker RAM limit against our max binary size * number of workers (see the sketch at the end of this comment).
  • Increasing our number of containers to something more than max-workers.
    • It takes about 500ms to create a container (see below). A flood of tiny requests could overwhelm our ability to create them in the current configuration, without triggering autoscaling.
  • Finally, if we ever get our load balancer, we can instead trigger autoscaling by QPS.
    • You can't configure managed instance groups to scale by custom metrics, otherwise we could use our own metrics to do this. 😞

[image]
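On the memory point in the list above, the comparison is simple arithmetic: the worst case is every worker holding a maximum-size binary at once, and that total has to fit in the Docker RAM budget. A back-of-the-envelope sketch, with entirely assumed numbers:

package sandbox

// All figures below are illustrative assumptions, not production values.
const (
	maxBinarySize  = 100 << 20 // assumed cap per compiled snippet: 100 MiB
	workers        = 8         // e.g. runtime.NumCPU() on the sandbox host
	dockerRAMLimit = 1 << 30   // assumed RAM budget for sandbox containers: 1 GiB
)

// fitsInRAM reports whether every worker could hold a maximum-size binary at
// once without exceeding the Docker RAM limit.
// With the numbers above: 8 * 100 MiB = 800 MiB <= 1 GiB, so it fits.
func fitsInRAM() bool {
	return workers*maxBinarySize <= dockerRAMLimit
}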

@toothrot (Contributor)

I'm going to close this issue, as I think we're now in a good place when it comes to our daily usage. We can re-open it, or open a new one, if something specific comes up again.

I'm very happy about our new metrics, and hopefully we'll have way more insight into what's going on the next time something like this comes up for the playground.
