runtime: GC causes latency spikes #14812

Closed
Dieterbe opened this issue Mar 14, 2016 · 120 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. early-in-cycle A change that should be done early in the 3 month dev cycle. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Comments

@Dieterbe
Contributor

Hello,
while running the program at https://github.com/raintank/raintank-metric/tree/master/metric_tank
I'm seeing mark-and-scan times of 15s CPU time, 2000~2500 ms clock time (8-core system) for a heap of about 6.5GB.
(STW pauses are fine and ~1ms)
I used https://circleci.com/gh/raintank/raintank-metric/507 to obtain the data below.

$ metric_tank --version
metrics_tank (built with go1.6, git hash 8897ef4f8f8f1a2585ee88ecadee501bfc1a4139)
$ go version
go version go1.6 linux/amd64
$ uname -a #on the host where the app runs
Linux metric-tank-3-qa 3.19.0-43-generic #49~14.04.1-Ubuntu SMP Thu Dec 31 15:44:49 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

I know the app is currently not optimized for GC workload: while I've gotten allocations down in various parts of the program, there are currently probably a million or more live pointers referencing pieces of data. I was going to work on optimizing this when Dave Cheney suggested there's a problem with the runtime and that I should file a bug (https://groups.google.com/forum/#!topic/golang-nuts/Q0rXKYjy1cg)
Here's the log with gctrace and schedtrace enabled: https://gist.githubusercontent.com/Dieterbe/18453451c5af0cdececa/raw/9c4f2abd85bb7a815c6cda5c1828334d3d29817d/log.txt
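(For reference: gctrace and schedtrace are enabled through the GODEBUG environment variable when starting the process, e.g. GODEBUG=gctrace=1,schedtrace=1000 ./metric_tank. The exact schedtrace interval used for the log above isn't stated; the 1000 ms here is just an illustrative value.)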

at http://dieter.plaetinck.be/files/go/mt3-qa-gc-vs-no-gc.zip you'll find a zip containing this log, the binary, a cpu profile taken during gc run 1482, and a cpu and heap profile in between run 1482 and 1483

I also have these two dashboards that seem useful. (they both end just after the spike induced by GC run 1482)
https://snapshot.raintank.io/dashboard/snapshot/MtLqvc4F6015zbs4iMQSPzfizvG7OQjC
shows memory usage, GC runs and STW pause times. It also shows that the incoming load (requests) of the app is constant, which tells me that any extra load is caused by GC, not by a changing workload.
https://snapshot.raintank.io/dashboard/snapshot/c2zwTZCF7BmfyzEuGF6cHN9GX9aM1V99
this shows the system stats. note the cpu spikes corresponding to the GC workload.

let me know if there's anything else I can provide,
thanks,
Dieter.

@RLH
Contributor

RLH commented Mar 14, 2016

The GC will use as much CPU as is available. If your program is basically
idle, which it appears to be, the GC will use the idle CPU and CPU load
will naturally go up. If your application is active and load is already
high then the GC will limit its load to 25% of GOMAXPROCS. The mark and
scan phase is concurrent; it is unclear how it is adversely affecting your
idle application.

@ianlancetaylor ianlancetaylor changed the title mark and scan needs excessive amount of time (15s for 6.5GB heap) runtime: mark and scan needs excessive amount of time (15s for 6.5GB heap) Mar 14, 2016
@ianlancetaylor ianlancetaylor added this to the Go1.7 milestone Mar 14, 2016
@Dieterbe
Contributor Author

The mark and scan phase is concurrent, it is unclear how it is adversely affecting your idle application.

Just a guess, but perhaps the cause is the extra workload induced by the write barrier? (I watched your GopherCon talk again today :) Interestingly, when I use top, I haven't ever been able to catch a core running at 100%.

But you're right that there are essentially two things going on, which may or may not be related:

  • mark and scan takes too long
  • app slows down during GC runs, while cpu cores don't saturate.

Let me know how I can help.

@aclements
Member

Hi @Dieterbe, could you clarify what the issue is? 15s for 6.5GB is actually pretty good (I get ~5s/GB of CPU time on some benchmarks locally, but this can vary a lot based on heap layout and hardware).

If it's that the CPU utilization goes up during GC, please clarify why this is a problem (the GC has to do its work somehow, and FPGA accelerators for GC are still an open area of research :)

If it's that response time goes up during GC, could you try the CL in #15022? (And, if you're feeling adventurous, there's also https://go-review.googlesource.com/21036 and https://go-review.googlesource.com/21282)

@Dieterbe
Contributor Author

Dieterbe commented Apr 4, 2016

Hey @aclements!

15s for 6.5GB is actually pretty good (I get ~5s/GB of CPU time on some benchmarks locally, but this can vary a lot based on heap layout and hardware).

OK, fair enough. I just reported this here because @davecheney mentioned in
https://groups.google.com/forum/#!topic/golang-nuts/Q0rXKYjy1cg
that 1.5s for 5GB was unexpected and that I should open a ticket for it, hence this ticket.

If it's that the CPU utilization goes up during GC, please clarify why this is a problem (the GC has to do its work somehow, and FPGA accelerators for GC are still an open area of research :)

of course, this is by itself not a problem.

If it's that response time goes up during GC, could you try the CL in #15022?

Initially the ticket wasn't about this, but it was brought up and is definitely a problem for us, so from now on we may as well consider this the issue at hand.
I recompiled my app with a Go toolchain rebuilt with your patch, and did a test run before and after.
Unfortunately I see no change and the latency spikes are still there (details at grafana/metrictank#172).
Note that I can verify this problem quite early on; e.g. in this case I've seen spikes as early as GC run 270. The issue is probably there much earlier, but my app needs to load a lot of data before I can test. The bug mentioned in #15022 looks like it only activates after a sufficient number of GC runs.

@aclements aclements changed the title runtime: mark and scan needs excessive amount of time (15s for 6.5GB heap) runtime: GC causes latency spikes Apr 22, 2016
@aclements
Member

@Dieterbe, would it be possible for you to collect a runtime trace (https://godoc.org/runtime/trace) around one of the periods of increased latency? If you do this with current master, the trace file will be entirely self-contained (otherwise, I'll also need the binary to read the trace file).

I have a hunch about what could be causing this. GC shrinks the stacks, so if many of your goroutines are constantly growing and shrinking the amount of stack they're using by at least a factor of 4, you would see a spike as many goroutines re-grew their stacks after the shrink. This should be more smeared out on master than with Go 1.6 since f11e4eb made shrinking happen concurrently at the beginning of the GC cycle, but if this is the problem I don't think that would have completely mitigated it. (Unfortunately, the trace doesn't say when stack growth happens, so it wouldn't be a smoking gun, but if many distinct goroutines have high latency right after GC that will be some evidence for this theory.)

@Dieterbe
Contributor Author

Hey @aclements
I did curl 'http://localhost:6063/debug/pprof/trace?seconds=20' > trace.bin
About 5~7 seconds in, I think (it's a bit hard to tell), is where the GC kicks in and a latency spike was observed.
Files: http://dieter.plaetinck.be/files/go-gc-team-is-awesome/trace.bin and http://dieter.plaetinck.be/files/go-gc-team-is-awesome/metric_tank for the binary, compiled with official 1.6.2. Hopefully this helps to diagnose; if not, let me know and maybe I can get a better trace.
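(For anyone reproducing this: the /debug/pprof/trace endpoint used above comes from the standard net/http/pprof package. A minimal sketch of wiring it up, assuming an app that doesn't already register these handlers; the port matches the curl command above:)

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers, including /debug/pprof/trace
)

func main() {
	// Serve the pprof endpoints on the side, next to the real application work.
	go func() {
		log.Println(http.ListenAndServe("localhost:6063", nil))
	}()
	select {} // stand-in for the real application work
}

The resulting trace.bin is then viewed with go tool trace trace.bin (older releases such as 1.6 also need the binary, e.g. go tool trace metric_tank trace.bin).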

@aclements aclements modified the milestones: Go1.7Maybe, Go1.7 May 24, 2016
@Dieterbe
Contributor Author

I read through #9477 and #10345 and wonder if this issue is another similar case. Note that this app is centered around a map (https://github.com/raintank/raintank-metric/blob/master/metric_tank/mdata/aggmetrics.go#L13) that has just over 1M values (and each value in turn has a bunch of pointers to things that have more pointers, and lots of strings involved too). Optimizing this is on my todo list, but in the meantime I wonder if maybe a GC thread blocks the map, leaving other application threads (mutators) unable to interact with it. Since everything in the app needs this map, that could explain the slowdowns.

@aclements
Member

@Dieterbe, it's possible. Could you try the fix I posted for #10345? (https://golang.org/cl/23540)

Note that it's not that the GC thread blocks the map. Mutators are free to read and write the map while GC is scanning it; there's no synchronization on the map itself. The issue is that whatever thread gets picked to scan the buckets array of the map is stuck not being able to do anything else until it's scanned the whole bucket array. If there's other mutator work queued up on that thread, it's blocked during this time.

(Sorry I haven't had a chance to dig into the trace you sent.)
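(To make the scan-granularity point above concrete: one way applications sometimes work around a single huge scan unit is to shard one big map into many smaller ones, so no single bucket array takes long to scan. This is only an illustrative sketch with hypothetical names, not something suggested by the patches in this thread, and a real version would also need locking for concurrent use:)

package sharded

import "hash/fnv"

const numShards = 256

// AggMetric is a stand-in for whatever value type the application stores.
type AggMetric struct{ /* ... */ }

// Metrics splits one large map into numShards smaller maps keyed by a hash of
// the key, so each underlying bucket array is a much smaller GC scan unit.
type Metrics struct {
	shards [numShards]map[string]*AggMetric
}

func New() *Metrics {
	s := &Metrics{}
	for i := range s.shards {
		s.shards[i] = make(map[string]*AggMetric)
	}
	return s
}

func (s *Metrics) shard(key string) map[string]*AggMetric {
	h := fnv.New32a()
	h.Write([]byte(key))
	return s.shards[h.Sum32()%numShards]
}

func (s *Metrics) Get(key string) (*AggMetric, bool) {
	m, ok := s.shard(key)[key]
	return m, ok
}

func (s *Metrics) Set(key string, m *AggMetric) {
	s.shard(key)[key] = m
}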

@aclements
Member

I dug into the trace you posted (thanks!) and took another look at your gctrace. This is pretty clearly an over-assist problem. When the GC is running, the expected 25% of the CPU goes to dedicated GC workers, but nearly all of the rest of the CPU time goes to GC assists triggered by allocation. Hence, when GC is running, even though it's technically concurrent, nearly all of the CPU is being soaked up by GC. You can see this really clearly in the trace:

[execution trace screenshot: trace-14812]

The GC kicks in in the middle. When the GC isn't running, all of the user goroutine executions are really short and the system keeps up. As soon as GC kicks in, two Ps go to dedicated gcBgMarkWorkers (expected) and everything else turns to a checkerboard of 10–20 ms long user goroutine time slices (not expected). Because of this, the run queue lengths climb rapidly (the second row in the group of three at the top), which is almost certainly what's causing the latency spikes. Right after GC finishes there's a period where the user goroutines are running quickly again but they're packed much more densely as it drains the long run queues and returns to normal.

I don't yet understand why this is happening. The heap is in steady state, so I would expect it to compute an assist ratio that causes it to finish roughly half way between the trigger and the goal, but it seems to be computing a much higher assist ratio and finishing much earlier.

@Dieterbe, could you re-run your application with GODEBUG=gctrace=1,gcpacertrace=1 and grab stderr? It would be awesome if you can run with master, since there's more gcpacertrace output than there was in 1.6, but even the 1.6 output would be helpful.

@Dieterbe
Contributor Author

Dieterbe commented Jun 8, 2016

hey @aclements

I recompiled the latest master Go, then compiled my app with it, and ran with that GODEBUG variable.
This time I just ran it in a local dev environment with fake traffic, but the symptoms are visible again (some of the fake requests got slow responses at GC times).
Here's the stderr output: https://gist.github.com/Dieterbe/6a57caf1e6fae129f375146ef95e6276
Note that at line 459 I pressed return twice.
That's where my fake traffic started and where the symptoms become clear.
The 1st and 2nd GC runs after that didn't show it clearly, but the 3rd, 4th, 5th and basically every GC run after that showed response-time spikes again.

if it helps, here's a dashboard of the entire run:
https://snapshot.raintank.io/dashboard/snapshot/gNKMeTpufl77H9gK1JCZn5TYvo88uLju
and zoomed in on the region where fake traffic kicked in:
https://snapshot.raintank.io/dashboard/snapshot/QgUlChTZGR4Bg4bw95mHxmN3vozfb334

the "golang GC" graph 2nd row middle and "request handle duration (all steps)" 3rd row on right will probably be most helpful.

@aclements
Member

@Dieterbe, is there a way I can run the load test with fake traffic myself? I built metric_tank (I had to switch github.com/grafana/grafana/pkg/log to origin/metrics_reporting, but otherwise go get just worked), but I don't know how to run it.

@Dieterbe
Contributor Author

I had to switch github.com/grafana/grafana/pkg/log to origin/metrics_reporting

I'm not sure what metrics_reporting is, but the log problem is caused by grafana recently changing its logging mechanism; you can just check out the last commit before that (grafana/grafana@22cda19) and metric-tank will build fine against it.

As for running it, MT (metric tank) relies on cassandra, elasticsearch and nsqd (from the nsq.io project) at runtime. The easiest (hopefully) way to get it all up and running is by using https://github.com/raintank/raintank-docker, so you can follow that readme, with a few changes:

  • you need to update screens/metricTank with
-./metric_tank \
+GODEBUG=gctrace=1,gcpacertrace=1 ./metric_tank \
  • you would also need to create a file screens/benchmark in raintank-docker with contents like the below:
#backfill an hour at 100k, wait for it, and then do realtime at 10k in background, and start http load.
fake_metrics -nsqd-tcp-address nsqd:4150 -statsd-addr statsdaemon:8125 -orgs 100 -keys-per-org 100 -offset 1h -speedup 10 -stop-at-now
fake_metrics -nsqd-tcp-address nsqd:4150 -statsd-addr statsdaemon:8125 -orgs 100 -keys-per-org 100  &
go get github.com/raintank/raintank-metric/inspect-es
go get github.com/tsenart/vegeta
inspect-es --es-addr elasticsearch:9200 -format vegeta-mt | sed 's#localhost:6063#metricTank:6063#' | vegeta attack -rate 500 -duration=120m > attack.out

These tools set up the "contextual circumstances": fake_metrics generates an input stream, and inspect-es piped into vegeta generates a consistent HTTP request load.

You might hit some snags in the process of getting this all up and running, though, and I would hate to have you lose much time on this; I could hop on a call or whatever to help out.

thanks.

@aclements
Member

I think I got it up and running (with many tweaks), but now I'm a little stuck.

  1. What's the easiest way to get a custom-built Go into the metric_tank container? I'd like to add some debugging to the runtime.
  2. How do I monitor the end-to-end latency so I can test any changes?

@Dieterbe
Contributor Author

Dieterbe commented Jun 15, 2016

Hey @aclements, sorry you had to go through that. What I personally do is compile metric-tank manually on the host, so that the binary is $GOPATH/src/github.com/raintank/raintank-metric/metric_tank/metric_tank, and then make raintank-docker/raintank_code/raintank-metric a symlink to $GOPATH/src/github.com/raintank/raintank-metric/metric_tank/. You can see in fig.yaml, in the metricTank entry, that ./raintank_code/raintank-metric from the host gets mapped to /go/src/github.com/raintank/raintank-metric within the container using a docker directory mapping. In other words, the directory in the container is synchronized with the host, so whatever you compile on the host will be available in the container.

  2. There are two methods for monitoring end-to-end latency:

A) Using vegeta: in the vegeta attack command you already have, change the duration from 120m to about 5m or so to make the benchmark shorter. Then it should be as simple as cat attack.out | vegeta report. Vegeta has a few other interesting reporting modes as well, and I think you can even run reports "streaming", i.e. get reports while the file is still being written to.

B) If grafana is running properly in the container (the easiest way to verify is to attach to the screen session and go to the grafana tab; in fact you may want to check out all tabs just to verify nothing's obviously broken. Usually the last tab I switch to is the benchmark one; there you'll have the console where vegeta is executing and where you can run more vegeta commands). Anyway, if grafana's running you can go to http://localhost:3000, log in as admin:admin and load the "metric tank" dashboard. That's the dashboard I've been using to provide the snapshots above; it should show real-time stats coming in (which it loads from graphite, which also runs in a container).

@aclements
Member

I spent much of this week working on this. I still have ideas on my list, but wanted to post an update.

First, I was able to bring up the whole system with a custom runtime and observe the behavior through Grafana and vegeta report. I don't think I was able to quite reproduce the problem locally, though. My laptop simply didn't have the resources, so it was swapping and over-saturating the CPU and causing all sorts of horrible latency. My workstation had the resources (though I discovered if I left it running it would chew through my not-very-large root partition; thanks Docker!), but I'm not sure it exhibited the problem. According to vegeta, max latency was ~350ms, but P99 was just 2ms in several runs. I did see small jumps in max "request handle duration" correlated with GC in Grafana, though they were only around 15ms.

However, I was still able to dig into the execution traces and glean some insight. I fixed some problems with high goroutine churn potentially losing track of assist credit, and with the way we build up spare assist credit causing long assists when the assist ratio is high (which it is in Metric Tank). I also did several more experimental changes, which did get rid of the bad checkerboard pattern in the execution trace and break it up into a much more finely interleaved workload. This is certainly what we want to see in the trace, but it didn't seem to move the overall stats. I have a few more ideas on why that may be, but I'll be out next week, so it may be a little bit before I can try them.

I'm going to mail a bunch of my experimental commits, which should get linked here shortly. @Dieterbe, if you have time, it would be good try running with these, though I'm not sure you'll see much difference.

I did notice one thing you may be able to change on the application side to mitigate some of the problem. I think Metric Tank is allocating very small objects at a very high rate. This is causing the GC scan rate to be much lower than typical, since the memory access pattern is even more random than typical. This is stretching out the GC assists and interacting poorly with some of the tuning that assumes a more typical scan rate. It would be worth doing a heap profile with -alloc_objects to see if there are any sources of large numbers of small allocations that would be easy to address.
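(For reference, assuming the app exposes the standard net/http/pprof endpoints as it does above, such a profile can be inspected with something like:

go tool pprof -alloc_objects http://localhost:6063/debug/pprof/heap

and then top at the interactive prompt to list the heaviest allocation sites by object count.)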

@Dieterbe
Contributor Author

Dieterbe commented Jul 4, 2016

I don't think I was able to quite reproduce the problem locally, though. My laptop simply didn't have the resources

Yeah, it's not trivial to consistently trigger clear symptoms (large spikes), but more load always seems to help.

though I discovered if I left it running it would chew through my not-very-large root partition; thanks Docker!)

I run https://github.com/meltwater/docker-cleanup/blob/master/docker-cleanup-volumes.sh periodically in between runs of the stack; this frees up space.

It would be worth doing a heap profile with -alloc_objects to see if there are any sources of large numbers of small allocations that would be easy to address.

Oh boy. I already went through weeks of optimizing memory allocations, and what's left now is a bit harder. IIRC the remaining inefficiencies are about allocating a whole bunch of strings (several for each incoming metric). Optimizing that (splitting the incoming stream into separate data and metadata streams, string interning, using enums, etc.) is on the agenda, but will require a broader architectural overhaul.
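(A minimal sketch of the string-interning idea mentioned above, with hypothetical names and no concurrency handling; repeated metric names then share one canonical allocation instead of one per incoming message:)

// interner returns a canonical copy of each distinct string it has seen.
// Not safe for concurrent use; a real version would shard or lock.
type interner map[string]string

func (in interner) intern(s string) string {
	if canonical, ok := in[s]; ok {
		return canonical
	}
	in[s] = s
	return s
}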

I'll be out next week

Enjoy your time off!

@uluyol
Contributor

uluyol commented Jul 4, 2016

(though I discovered if I left it running it would chew through my not-very-large root partition; thanks Docker!)

Mounting a directory on another partition into the container might help with this.

@ianlancetaylor
Contributor

Postponing to 1.8.

@gopherbot

CL https://golang.org/cl/25155 mentions this issue.

@CAFxX
Contributor

CAFxX commented Aug 7, 2019

The CPU spike at GC is working as intended. If there are idle Ps then the GC will use them, so you should expect full CPU utilization.

I know this goes way beyond the scope of the Go runtime, but as it's arguably a common configuration I think it warrants mentioning: such a behavior (CPU spike during GC) is not desirable when using CFS bandwidth control cfs_quota_us/cfs_period_us (i.e. when running in basically any linux container with a CPU limitation set), as the default CFS period is 100ms and a GC-induced spike in CPU usage that burns through the quota will make the kernel stop scheduling the process until the end of the period, thereby turning what would be a harmless concurrent GC phase into a pretty long (in the order of the cfs_period_us, ~100ms) stop-the-world pause.

I am not saying that we need to necessarily do something about this in Go, just that this may not be the most desirable behavior in a pretty common configuration (running a go program in a linux container with cpu limits), and therefore stretching a bit the definition of "working as intended" if the intention is having consistent sub-ms stop-the-world pauses.
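(A common mitigation for the container case described above, offered only as a hedged sketch rather than something endorsed in this thread: keep GOMAXPROCS at or below the CFS quota, so GC work on otherwise-idle Ps cannot burst past the quota within one period. The quota of 2 CPUs below is an assumed example value:)

package main

import "runtime"

func main() {
	// Assumed example: the container's CPU quota (cfs_quota_us / cfs_period_us) is 2 CPUs.
	// Capping GOMAXPROCS at that value keeps GC work on otherwise-idle Ps from
	// burning through the quota in a single CFS period and getting the whole
	// process throttled.
	runtime.GOMAXPROCS(2)

	// ... rest of the program ...
}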

@larytet

larytet commented Aug 7, 2019

In my experience keeping a large map (10M entries) seriously impacts the system latency. Under heavy loads the typical response time is under 100 microseconds. When GC kicks in, the system stops responding altogether for 3-4 seconds. It happens every ~5 minutes. The system has 4 cores. In the gctrace I see that GC frees ~8GB of memory and it takes most of 4s. In this scenario GC consumes all cores.

The flow is: default HTTP server gets a short HTTP GET (under 200 bytes), handler looks in the map (syncmap), generates a short response (under 200 bytes). The map gets a new entry or two every few hours.

One approach would be for the GC to avoid scanning objects which appear to be long-lived. A low-level malloc/free API in the unsafe package could help me as well.

I switched to BigCache which solved the problem.
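(For context on why a cache like that helps: if a map's key and value types contain no pointers, the GC does not need to scan its buckets, and the payload can live in a few large byte slices instead of millions of small objects. A rough sketch of the idea with hypothetical names, collision handling omitted; this is not the BigCache API:)

package nogccache

import "encoding/binary"

// offsetCache stores length-prefixed payloads in one growing byte slice and
// indexes them by key hash. Neither the map nor the slice contains pointers,
// so the GC has almost nothing to traverse here.
type offsetCache struct {
	index map[uint64]uint32 // key hash -> offset into data
	data  []byte
}

func newOffsetCache() *offsetCache {
	return &offsetCache{index: make(map[uint64]uint32)}
}

func (c *offsetCache) add(keyHash uint64, payload []byte) {
	off := uint32(len(c.data))
	var lenBuf [4]byte
	binary.LittleEndian.PutUint32(lenBuf[:], uint32(len(payload)))
	c.data = append(c.data, lenBuf[:]...)
	c.data = append(c.data, payload...)
	c.index[keyHash] = off
}

func (c *offsetCache) get(keyHash uint64) ([]byte, bool) {
	off, ok := c.index[keyHash]
	if !ok {
		return nil, false
	}
	n := binary.LittleEndian.Uint32(c.data[off : off+4])
	return c.data[off+4 : off+4+n], true
}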

@dr2chase
Contributor

dr2chase commented Aug 7, 2019

If you could turn that large map problem into some sort of a test or benchmark, that would be really useful. There's a short list of few-millisecond hiccup causes, but I don't know of any related to maps.

@larytet

larytet commented Aug 7, 2019

@dr2chase This is not quite the kind of "benchmark" you probably have in mind, but it demonstrates the latency problem: https://github.com/larytet-go/map-benchmark
It is a simple HTTP server. When allocated memory reaches ~3GB the server stops responding for ~5s. I keep the system at around 0% idle time. Arguably the demo, an in-memory database, is not a typical Go microservice; the service is CPU and memory (DDR) bound.

The issue is in no way a new one. I assume that the behavior is by design. If the GC sees that the system remains busy it will stay put (the GC "marker" lags allocations) until the amount of allocated memory crosses a threshold (GCPercent?). The threshold forces the GC's hand: the GC will consume all available cores, release the allocated memory, and return to hibernation (background). I think that I understand some of the trade-offs. The GC's main goal is to let the system serve occasional short spikes with minimal latency impact (?). Devops is expected to add nodes until the latency distribution hits the target.

I do have a strong opinion about this. The GC is excellent when the machine has CPUs to spare and/or the service waits for I/O a lot. I personally prefer predictable latency with a well-defined and reasonable worst case. I wonder why there is no separate list for long-lived objects, no const pointers, no access to runtime/unsafe malloc/free, no buddy allocator in the standard library, no further GC tuning options, no GC feedback and so on. Is it possible to avoid scanning 10M pointers in a map each and every time the GC wakes up? Should I use a rate limiter in all services? I agree that my opinion is not a very educated one.

In the trace on a 4 core system I see

SCHED 119544ms: gomaxprocs=4 idleprocs=0 threads=11 spinningthreads=0 idlethreads=2 runqueue=0 [0 0 0 0]
GC forced
gc 15 @120.185s 0%: 0.011+1.8+0.006 ms clock, 0.045+0/1.8/1.5+0.027 ms cpu, 2->2->1 MB, 4 MB goal, 4 P
SCHED 120557ms: gomaxprocs=4 idleprocs=4 threads=12 spinningthreads=0 idlethreads=5 runqueue=0 [0 0 0 0]
SCHED 120050ms: gomaxprocs=4 idleprocs=0 threads=11 spinningthreads=0 idlethreads=2 runqueue=0 [0 0 0 0]
gc 15 @113.396s 6%: 0.010+7082+0.005 ms clock, 0.043+8.9/7071/18652+0.020 ms cpu, 2977->3054->1663 MB, 3245 MB goal, 4 P
SCHED 121066ms: gomaxprocs=4 idleprocs=4 threads=12 spinningthreads=0 idlethreads=5 runqueue=0 [0 0 0 0]

Links
#9477
Probably relevant #27410
GC performance http://big-elephants.com/2018-09/unexpected-gc-pauses/

@dr2chase
Contributor

Sorry for the huge delay in replying. The short answer to your problem is "don't call memstat that often, and don't put the call someplace you would mind waiting for an answer". There was an old problem with memstat blocking until the GC finished; we thought we had fixed it, but apparently you have found a new problem with memstat blocking. When I remove the calls to memstat from your test program, all is well.

Even when we fix this new bug, frequent calls to memstat are not a great idea, because it actually stops the world. Not for long, but it stops everything, and it has overheads.
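(Side note for later readers, as a hedged sketch: since Go 1.16 the runtime/metrics package exposes many of the same numbers without the brief stop-the-world that runtime.ReadMemStats performs, so polling it frequently is much cheaper:)

package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// Read the bytes occupied by heap objects without stopping the world.
	samples := []metrics.Sample{{Name: "/memory/classes/heap/objects:bytes"}}
	metrics.Read(samples)
	if samples[0].Value.Kind() == metrics.KindUint64 {
		fmt.Println("heap object bytes:", samples[0].Value.Uint64())
	}
}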

@genez

genez commented Jan 14, 2020

We have software in production that shows similar behavior.
Even without calling memstat, we experience large delays (+100ms for heap sizes in the range of 7 or 8 GB).
Unfortunately the software is not open-source.

The in-memory model is composed of a relatively small number (10) of btrees (IIRC https://github.com/emirpasic/gods#redblacktree) that represent partitions of a large dataset (loaded at startup) of key/value pairs, where the value component is a Go structure and the key is a string.

Once loaded, the dataset is 95% read and 5% updated.

From traces it is clearly evident that large delays are caused by GC sweeping starving the mutators (more frequent) or by the GC STW phase (rare; please see this example):

[trace screenshots]

@dr2chase
Contributor

Which version of Go? There are several bugs that cause few-millisecond pauses for one or more threads; fixes for some are planned for 1.15, and one -- failure to preempt call-free long-running loops -- got fixed in 1.14, so if you test with tip and the latency gets better, that was your problem. We also fixed one of the others in 1.14, but it's limited to just a few milliseconds. See #27732 for a discussion of those bugs, and traces showing what they look like -- it would be nice to know if what's causing your problems is new or known.

A quick summary of the known bugs (from 27732):

  1. Stack allocations are not divisible work; can take a long time to process. (Workaround for 1.14 added https://go-review.googlesource.com/c/go/+/180817/ -- use go build -gcflags 'all=-smallframes' hello.go)
  2. Mark assist applied to global and stack allocations (generally, roots) doesn't generate any credit.
  3. Sweeping problem (runtime: latency in sweep assists when there's no garbage #18155 - experimental fix here: https://go-review.googlesource.com/c/go/+/187817/ )
  4. Reschedule (i.e., quantum expires) that discovers a need for GC does not wake other Ps (or, does not check the global run queue) (i.e., the work to do expanded by one; if any Ps are idle, one of them should be woken to take over doing "real work"). (Fixed in 1.14: https://go-review.googlesource.com/c/go/+/146817/ )
  5. (Related to 4?) Dedicated worker doesn't kick goroutines out of its local run queue.

The workaround for bug 1 will also address bug 2 -- if there are no oversized frames to mark, the size of the accounting error is also small.

Bug 3 is rare, usually only occurs during program startup or if your code is very very good at not generating garbage; if the heap is completely live, scanning for a free slot can run very long. The workaround is somewhat nutty -- add a counter to the allocator for the problem, and every 64 allocations (counter&63 == 0) drop the allocation on the floor and do it again. This leaves holes in your heap that prevent long stalls for scanning.

If you try these workarounds and one of them works, that's really useful information about which bugs is the bigger problem and needs attention first.

It's also helpful to exclude startup from delay stats for long-running servers; those are a separate category of bug, and (our opinion) not nearly as important as steady-state glitches. They do matter; minor changes in startup heap growth heuristics have a noticeable effect on build speed, so we care, just not as much.

@genez

genez commented Jan 14, 2020

Thanks for the detailed follow-up
I am using Go 1.13.5 on windows x64

I have a reduced/modified version of the sample program that I can publish if needed.
Unfortunately the simplified version exhibits a less evident issue, but it can still be useful for investigation.

Behavior is the same with 1.13 and tip (go version devel +563287ae06 Mon Jan 13 23:35:18 2020 +0000 windows/amd64)

Based on your explanation it looks like we are in scenario (3).
We have a relatively large allocation at startup, then we execute a large number of reads/updates on the data set. Only the second phase is considered in the trace.

I expect that the real live heap is about 95% of the total heap

This is a sample fragment of the output of the trace tool:
[trace tool output screenshot]

I will provide source code and other details as soon as possible

@genez

genez commented Feb 18, 2020

Please check the repro code here: https://github.com/genez/collectiontest
Please note this is extremely simplified; production code does more computation (JSON and string handling, logging, etc.) that leads to >50ms latency.

In production code we are not using stdlib maps because they do not shrink and can cause OOM panics; we use a custom data structure (an rb+ tree where we serialize the items as a byte slice).

On my test bench (Intel i9 16-core, 32 GB of RAM) I can repro >10ms latency.
go version go1.13 windows/amd64

It looks like the platform (Windows, Linux) does not matter.

[latency trace screenshot]

@dr2chase
Contributor

Short answer: it is definitely sweeping, and the patch for sweep delays works nicely.
Unfortunately the next mole to whack is mark assist; though I think it's technically working as intended, it's sometimes not good.

You can see these in the MMU graph; here's a particularly terrible one (some look great, this one is awful even with an experimental "fix"). By turning the mark-assist toggle on and off you can see where most of the problem comes from:

[MMU graph screenshots, with and without mark assist counted as theft of mutator time]

@genez

genez commented Feb 19, 2020

Thank you for this insight!
I am not sure I fully grasp what "turning mark assist off" means, or how to correctly read those two charts.

I'm sorry but I am not that confident in runtime internals

@dr2chase
Contributor

"Turning mark assist off" refers to the theft-of-time toggles for the graph; if you don't count mark-assist "theft" of mutator time, the worst-case delay is below 10ms, and the longest window in which only 50% of the cpu is available is about 25-30ms.

Turning mark assist "on" (in the accounting), the 50% point is pushed out into seconds.

The chart shows "minimum mutator utilization" (MMU) and is available from "go tool trace". The time on the bottom (x-axis) is the size of a window that is slid across the trace, and the fraction (y-axis) is the minimum for all windows (of that size) fraction of time in which the mutator wanted to run but could not. The toggle buttons ("Include") allow you to specify things that are not mutation -- if it's one of those, then the mutator is "not running". "View" allows you to specify either the entire system or worst-case per-goroutine ("system" runs the window across the threads in a stack, takes the fraction for the entire stack, "Per-goroutine" is instead thread-by-thread).

By default the graph shows worst case, and toggling "Percentile" is supposed to get you more nuanced numbers, but I think there is some sort of a bug because I'm not seeing that no matter how hard I fiddle.

One thing to be careful of is that this is a distillation of your app's memory allocation, and it is quite possible that it sees mark assist problems that your app will not. Spending a lot of time in mark assist happens when the mutator (your app) is allocating quickly during GC -- when that happens a tax is applied to allocations to prevent the GC from falling too far behind. If your app does more work that is not allocation -- or even if it does a lot more allocations but they are often smaller than in this benchmark -- then you won't see the mark assist problem.

Spending a lot of time in sweep (your original problem) is a property of a heap that is very well-filled; it's large, and it's 100% live objects. Right after GC completes, a "background sweeper" thread starts to figure out where free space is, but if the mutator allocates before that thread has processed very much memory, then the mutator has to go looking (sweeping) instead. When a lot of the heap is 100% filled, it can look for a while. The patch for that simply says "after you've looked at 100 spans with no luck, give up and allocate a new span".

So, if you're up for patching a Go release, you might try
https://go-review.googlesource.com/c/go/+/187817/
and see if your latency problem goes away (meaning your real app does other stuff), or if it only gets a little better (your app allocates a lot anyway and we need to improve mark assist).

Thanks very much for this benchmark. This gives us a nice reproducible case (taken from a real-world app) of sweep being a problem, which will probably get that fix into 1.15, and it provides a nice motivating case for improving mark assist.

@genez

genez commented Feb 24, 2020

I just repeated the test on a real production-grade linux machine with the following Go versions:

  • Go 1.13.8 (the one we are using at the moment in prod)
    [latency screenshot]

  • Go tip origin/master 218f457
    [latency screenshot]

It looks like there's a nice improvement.
I can upload and share trace output files if needed.

I did try to patch the tip sources, but I get compilation errors:

Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.

/opt/go_MASTER/src/runtime/mcentral.go:262:17: too many arguments in call to mheap_.freeSpan
        have (*mspan, bool)
        want (*mspan)

/opt/go_MASTER/src/runtime/mcentral.go:272:19: too many arguments in call to mheap_.alloc
        have (uintptr, spanClass, bool, bool)
        want (uintptr, spanClass, bool)

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 7, 2022
nvanbenschoten added a commit to nvanbenschoten/go that referenced this issue Sep 19, 2022
At a high level, the runtime garbage collector can impact user goroutine
latency in two ways. The first is that it pauses all goroutines during
its stop-the-world sweep termination and mark termination phases. The
second is that it backpressures memory allocations by instructing user
goroutines to assist with scanning and marking in response to a high
allocation rate.

There is plenty of observability into the first of these sources of
user-visible latency. There is significantly less observability into the
second. As a result, it is often more difficult to diagnose latency
problems due to over-assist (e.g. golang#14812, golang#27732, golang#40225). To this
point, the ways to determine that GC assist was a problem were to use
execution tracing or to use GODEBUG=gctrace=1 tracing, neither of which
is easy to access programmatically in a running system. CPU profiles
also give some insight, but are rarely as instructive as one might
expect because heavy GC assist time is scattered across a profile.
Notice even in https://tip.golang.org/doc/gc-guide, the guidance on
recognizing and remedying performance problems due to GC assist is
sparse.

This commit adds a counter to the MemStats and GCStats structs called
AssistTotalNs, which tracks the cumulative nanoseconds in GC assist
since the program started. This provides a new form of observability
into GC assist delays, and one that can be manipulated programmatically.

There's more work to be done in this area. This feels like a reasonable
first step.
@gopherbot

Change https://go.dev/cl/431877 mentions this issue: runtime: export total GC Assist ns in MemStats and GCStats

@mknyszek
Contributor

mknyszek commented Sep 19, 2022

Go 1.18 introduced a new GC pacer (that reduces the amount of assists) and 1.19 introduced GOMEMLIMIT (I saw GOMAXHEAP mentioned somewhere earlier). We've also bounded sweeping on the allocation path back in Go 1.15, I believe. Skimming over the issue, I get the impression that some or all of the issues that remained here, beyond what was already fixed earlier, may since have been addressed. It's possible that's not the case, but many of the sub-threads are fairly old.
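(For later readers, a hedged sketch of the knobs mentioned above: GOMEMLIMIT is normally set as an environment variable, and the same limit can also be set from code; the 7 GiB value is just an example:)

package main

import "runtime/debug"

func main() {
	// Equivalent to GOMEMLIMIT=7GiB: a soft limit on the runtime's total memory,
	// which makes the pacer start GC cycles earlier as usage approaches the limit.
	debug.SetMemoryLimit(7 << 30)
	// GOGC (debug.SetGCPercent) still sets the proportional heap-growth target.
	debug.SetGCPercent(100)

	// ... rest of the program ...
}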

I'm inclined to put this into WaitingForInfo unless anyone here wants to chime in with an update. We can always file new issues if it turns out something remains (and that will probably be clearer and easier to manage than continuing a conversation that started halfway through this issue :)). EDIT: It's already in WaitingForInfo. In that case, this is just an update.

@seankhliao seankhliao added WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Dec 26, 2022
@Salamandastron1

I have seen others succeed in speeding up their applications by reducing the number of allocations they make. If Go accommodated this way of working, it would be easier than optimizing the compiler further. Off the bat, it would be ideal to support functions taking multiple values directly from multi-value returns, to reduce the need to allocate for an error.

@gopherbot

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@golang golang locked and limited conversation to collaborators Jan 26, 2024