runtime: long pauses STW (sweep termination) on massive block allocation #31222

Closed
un000 opened this issue Apr 3, 2019 · 28 comments

Comments

@un000

un000 commented Apr 3, 2019

What version of Go are you using (go version)?

~ ᐅ go version
go version go1.12.1 darwin/amd64

What operating system and processor architecture are you using (go env)?

go env Output
~ ᐅ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/un0/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/un0/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/Cellar/go/1.12.1/libexec"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.12.1/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/bf/sm9v_xtn4tj6xz_rmmwp5tzm0000gn/T/go-build942797956=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

We have a big entity cache inside an application, which is loaded in a background goroutine.
When the application allocates the slice in the background, the latency inside all response handlers increases.

Here is a reproducer:

package main

import (
	"fmt"
	"math/rand"
	"os"
	"runtime/trace"
	"time"
)

type Rates struct {
	a, b, c, d float64
}

var maxWait time.Duration

func main() {
	const size = 100e6

	out, err := os.OpenFile("prof", os.O_WRONLY|os.O_CREATE|os.O_TRUNC, os.ModePerm)
	if err != nil {
		panic(err)
	}
	defer func() {
		_ = out.Sync()
		_ = out.Close()
	}()
	err = trace.Start(out)
	if err != nil {
		panic(err)
	}

	// simulate some work
	go loop()

	time.Sleep(3 * time.Second)
	// make a huge allocation in parallel
	go func() {
		_ = make([]Rates, size)		// << huge alloc
	}()
	time.Sleep(3 * time.Second)

	trace.Stop()

	fmt.Println("maxWait =", maxWait.String())
}

func loop() {
	for {
		now := time.Now()
		r := make([]Rates, 0)
		for i := 0; i < 100; i++ {
			r = append(r, Rates{
				rand.Float64(), rand.Float64(), rand.Float64(), rand.Float64(),
			})
		}
		d := time.Since(now)
		if maxWait < d {
			maxWait = d
		}
	}
}

[screenshot]

What did you expect to see?

The worker goroutine (loop in the example) should not pause for more than 1-5 ms.

What did you see instead?

43ms pauses

@un000 un000 changed the title runtime GC: long pauses STW (sweep termination) on massive allocation runtime GC: long pauses STW (sweep termination) on massive block allocation Apr 3, 2019
@agnivade agnivade changed the title runtime GC: long pauses STW (sweep termination) on massive block allocation runtime: long pauses STW (sweep termination) on massive block allocation Apr 3, 2019
@agnivade agnivade added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Apr 3, 2019
@agnivade
Contributor

agnivade commented Oct 8, 2019

@aclements @mknyszek

@mknyszek
Contributor

To be clear, the reproducer is making a 3.2 GiB allocation which is going to take a while for the OS to fulfill.

However, I dug into this a bit and I'm seeing ~1s pauses, sometimes with only ~50ms pauses.

The 50ms pauses represent a lower bound for how fast the OS can give us back pages. The ~1s pauses actually come from the fact that we're zeroing the entire 3.2 GiB allocation, even though it just came from the OS.

We have an optimization which avoids re-zeroing fresh pages, but unfortunately this optimization is conservative. For example, if a needzero span at the end of the heap is contiguous with the new 3.2 GiB allocation, then those two are going to coalesce, and the 3.2 GiB free space will be marked needzero, when really only the first N KiB of it actually needs to be zeroed.

One way we could fix this is by making needzero a number, and if we can guarantee that only the first N KiB need zeroing, then we only zero that. If it's some weird swiss cheese of non-zeroed memory then we just assume the whole thing needs zeroing, rather than trying to keep track of which parts need zeroing.

I'm not sure if this fix is worth it though, since future allocations would still have to go back and zero this memory anyway, causing the 1s delay to return.
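
For intuition, here is a purely illustrative sketch of the "needzero as a number" idea above; this is not the runtime's actual span code, just a toy free region that records how many leading bytes may still be dirty and zeroes only that prefix on allocation (all names are made up):

// Toy illustration only; the real runtime would track this per mspan.
type freeRegion struct {
	mem       []byte
	dirtyPref int // only the first dirtyPref bytes may be non-zero
}

// alloc carves n bytes off the front of the region, zeroing only the
// prefix that may actually be dirty instead of all n bytes.
func (r *freeRegion) alloc(n int) []byte {
	b := r.mem[:n:n]
	toZero := n
	if r.dirtyPref < toZero {
		toZero = r.dirtyPref
	}
	for i := 0; i < toZero; i++ {
		b[i] = 0
	}
	r.mem = r.mem[n:]
	r.dirtyPref -= toZero
	return b
}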

@aclements
Member

For 1.15, one of my planned follow-up tasks on non-cooperative preemption is to make memclrNoHeapPointers preemptible. For large pointer-free allocations, this should make the long zeroing operation preemptible.

@aclements aclements added this to the Backlog milestone Oct 29, 2019
@aclements aclements added NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Oct 29, 2019
@evanj
Contributor

evanj commented Apr 16, 2020

In case others are wondering if they are running into this: memclrNoHeapPointers shows up in pprof CPU profiles, so you should be able to find code paths calling it. You can also look for really large allocations in the alloc_space profile. (For the record, we are seeing ~second long GC pauses due to this issue in a production application using Go 1.14; thanks for the excellent reproduction which helped me understand this)
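
For anyone else digging into this, a minimal sketch of dumping an in-process allocation profile so the large allocation sites show up (the file can then be opened with go tool pprof; the function and file names here are just for illustration):

package main

import (
	"os"
	"runtime/pprof"
)

// writeAllocProfile writes the "allocs" profile, which includes the
// alloc_space sample index, for offline inspection with go tool pprof.
func writeAllocProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	// debug=0 writes the compressed protobuf format that pprof expects.
	return pprof.Lookup("allocs").WriteTo(f, 0)
}

func main() {
	// ... run the workload that makes the suspect allocations ...
	if err := writeAllocProfile("allocs.pb.gz"); err != nil {
		panic(err)
	}
}

Then go tool pprof -sample_index=alloc_space allocs.pb.gz and look for the largest cumulative allocation sites; memclrNoHeapPointers itself only shows up in CPU profiles.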

@changkun
Member

changkun commented Aug 30, 2021

I just encountered a >1s GC pause scenario, any suggestions on how to avoid the long pause?

[screenshot]

@mknyszek
Contributor

@changkun Is that with Go 1.17? Zeroing some large allocations is now preemptible, if you suspect that's the problem here.

1 second is quite a while. Is this reproducible? Can you get a GC trace where this GC happens? That would help in narrowing down the source of the pause (that is, which part of the cycle it happened in). Even if it doesn't show up in the trace (for instance, if the time is spent trying to stop the world rather than actually having the world stopped), that's still useful information.

@changkun
Member

changkun commented Aug 30, 2021

@mknyszek

I am using a tip version go version devel go1.18-527609d47b Wed Aug 25 17:07:58 2021 +0200 darwin/arm64. The profile is collected from an app running on iOS.

The GC pauses reported by the runtime/metrics package are as follows:

1.2032e-05:	*
2.2528e-05:	*
3.2768e-05:	**
3.3792e-05:	***
3.4816e-05:	***
3.584e-05:	**
3.6864e-05:	**
3.7888e-05:	**
3.8912e-05:	*
3.9936e-05:	*
4.1984e-05:	*
4.608e-05:	*
4.7104e-05:	*
8.3968e-05:	*
0.000139264:	*
0.010747904:	*
0.012320768:	*
0.017301504:	*
0.01835008:	**
0.019398656:	**
0.019922944:	*
0.020447232:	*
0.02097152:	*
0.021495808:	**
0.022020096:	**
0.02359296:	*
0.037748736:	*
0.040894464:	*
0.042991616:	*
0.044040192:	*
0.04718592:	*
0.049283072:	*
0.071303168:	**
0.075497472:	***
0.081788928:	*
0.08388608:	*
0.090177536:	*
0.096468992:	**
0.142606336:	*
0.14680064:	**
1.476395008:	*
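
(For reference, a histogram like this can be collected with a small snippet like the following; this is a minimal sketch assuming the /gc/pauses:seconds metric from the runtime/metrics package:)

package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// Read the distribution of GC stop-the-world pause latencies.
	samples := []metrics.Sample{{Name: "/gc/pauses:seconds"}}
	metrics.Read(samples)

	hist := samples[0].Value.Float64Histogram()
	for i, count := range hist.Counts {
		if count == 0 {
			continue
		}
		// Buckets has len(Counts)+1 boundaries; print each bucket's upper bound.
		fmt.Printf("%v:\t%d\n", hist.Buckets[i+1], count)
	}
}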

Examples:

GC idle took 0.02 second:

[screenshot]

GC fractional took 1.2 second and 1.8 second:

[screenshot]
[screenshot]

Sweep termination took 1.8 second:

[screenshot]

I hope I found good examples; if not, there are more examples in the trace profile:

https://drive.google.com/file/d/1OaMruMiu8RWBXclQDRuf7MGDkzjkR4PP/view?usp=sharing


I had a few running cases where the max observation was 4 seconds (didn't collect trace profile for this case, but the reproducer should be able to observe that again):

[screenshot]


The reproducer (depends on the develop branch, commit f52dd5f2461b4b6396df83a6a723037f74a75a72):

package main

import (
	"image/color"
	"time"

	"fyne.io/fyne/v2"
	"fyne.io/fyne/v2/app"
	"fyne.io/fyne/v2/canvas"
	"fyne.io/fyne/v2/container"
	"fyne.io/fyne/v2/layout"
	"fyne.io/fyne/v2/test"
	"fyne.io/fyne/v2/widget"
)

func main() {
	app := app.NewWithID("iosrend")
	win := app.NewWindow("Some Window")

	cells := make([]fyne.CanvasObject, 0, 80*25)
	for i := 0; i < 80; i++ {
		for j := 0; j < 25; j++ {
			cells = append(cells, canvas.NewRectangle(color.RGBA{R: 255, G: 255, B: 255, A: 128}))
		}
	}
	container := container.New(layout.NewGridLayout(25), cells...)

	c := 0
	b := widget.NewButton("START", func() {
		for i := 0; i < 80; i++ {
			for j := 0; j < 25; j++ {
				index := i*25 + j
				obj, _ := container.Objects[index].(*canvas.Rectangle)
				if c == 0 {
					obj.FillColor = color.RGBA{R: 20, G: 30, B: 100, A: 128}
				} else if c == 1 {
					obj.FillColor = color.RGBA{R: 100, G: 20, B: 30, A: 128}
				} else if c == 2 {
					obj.FillColor = color.RGBA{R: 30, G: 100, B: 20, A: 128}
				}
				obj.Refresh()
			}
		}
		c = (c + 1) % 3
	})
	go func() {
		tk := time.NewTicker(50 * time.Millisecond)
		for range tk.C {
			test.Tap(b)
		}
	}()

	win.SetContent(fyne.NewContainerWithLayout(layout.NewBorderLayout(nil, b, nil, nil), container, b))
	win.ShowAndRun()
}

@aclements
Member

Thanks for the examples. The GC idle and fractional traces aren't STW. Something odd appears to be happening in the GC idle example you gave, because only G37 is running for a long time; that may or may not be a runtime issue, but it wouldn't show up as pause time. The sweep termination trace, however, is STW. Do you know which of the traces you uploaded is that one? We'd need to look at the stack trace of G21 at the end of the STW to get a sense of where this is. From the screenshot, G21 is clearly doing some extremely long (~35 second?!) operation.

@changkun
Member

changkun commented Sep 9, 2021

Thanks for the examples. The GC idle and fractional traces aren't STW. Something odd appears to be happening in the GC idle example you gave, because only G37 is running for a long time; that may or may not be a runtime issue, but it wouldn't show up as pause time.

@aclements, thanks for the follow-up. I wasn't able (or patient enough) to open the huge, slowly loading trace file, but the collected runtime/metrics data shows the largest STW was ~4s. From the traces, it seems that those regions are not doing more rendering work (issuing GPU draw calls). Such a phenomenon is always surrounded by GC events, hence my assumption that the application has some trouble cooperating with the GC.

The sweep termination trace, however, is STW. Do you know which of the traces you uploaded is that one? We'd need to look at the stack trace of G21 at the end of the STW to get a sense of where this is. From the screenshot, G21 is clearly doing some extremely long (~35 second?!) operation.

Yes, you can download the trace file here:
https://drive.google.com/file/d/1OaMruMiu8RWBXclQDRuf7MGDkzjkR4PP/view?usp=sharing
and the G21 is in http://[::]:8080/trace?start=46979199&end=48050088

G21 is supposed to do work for the application's lifetime, except for a clear request. As the name of G21 reveals, it is a goroutine for an unbounded channel, where the implementation is just a for loop that takes input from the channel, appends it to a slice buffer, and then sends it out if there is a receiver waiting. Here is the code:

type UnboundedCanvasObjectChan struct { in, out chan fyne.CanvasObject }

func NewUnboundedCanvasObjectChan() *UnboundedCanvasObjectChan {
	ch := &UnboundedCanvasObjectChan{
		in:  make(chan fyne.CanvasObject, 128),
		out: make(chan fyne.CanvasObject, 128),
	}
	go func() {
		q := make([]fyne.CanvasObject, 0, 1<<10)
		for {
			e, ok := <-ch.in
			if !ok {
				close(ch.out)
				return
			}
			q = append(q, e)
			for len(q) > 0 {
				select {
				case ch.out <- q[0]:
					q = q[1:]
				case e, ok := <-ch.in:
					if ok {
						q = append(q, e)
						break
					}
					for _, e := range q {
						ch.out <- e
					}
					close(ch.out)
					return
				}
			}
			if cap(q) < 1<<5 {
				q = make([]fyne.CanvasObject, 0, 1<<10)
			}
		}
	}()
	return ch
}

func (ch *UnboundedCanvasObjectChan) In() chan<- fyne.CanvasObject { return ch.in }
func (ch *UnboundedCanvasObjectChan) Out() <-chan fyne.CanvasObject { return ch.out }

@aclements
Member

Thanks. From the trace, I can confirm that this is a problem with mallocgc blocking preemption. Specifically, the exit stack trace is:

runtime.mallocgc:1165
runtime.growslice:269
fyne.io/fyne/v2/internal/async.NewUnboundedCanvasObjectChan.func1:47

This reflects the first preemption point after the non-preemptible operation. So it's a little hard to say exactly where this time is going in mallocgc, but we can tell it's from an append in NewUnboundedCanvasObjectChan (the one on line 47). fyne.CanvasObject contains pointers, so the zeroing of that memory is currently non-preemptible.

Unfortunately, I think all this tells us is that this is the known issue with allocating large blocks of pointerful memory causing long non-preemptible regions.

I am surprised it's a whole second, and that makes me wonder if there's something beyond zeroing the memory at play here. @changkun , do you know how large these slices get? What hardware are you running this on?

@changkun
Member

changkun commented Sep 9, 2021

@aclements Thanks again for the confirmation. If it is a known issue, is there any known workaround?

The unbounded channel code is for delivering draw operations. It is hard to know how large the slice is, but I can do a rough estimation:

The code runs for about 20 seconds without GC running in the background. Assume the slice was cleared by the last GC and starts from zero size. At least 2,000 pointer objects are appended to the slice every 50 milliseconds, so about 800,000 objects have been cached in the slice before sweep termination. If each pointer object is 16 bytes, then the entire slice should be roughly at least 12 MB.

The reproducer (in previous comments) was running on iOS 15, iPhone 12 Pro (without Max).

@aclements
Member

Hmm. It definitely shouldn't take anywhere near that long just to zero 12MB. For lack of a better hypothesis, my best guess is that there's some unfortunate OS interaction involved, like it has to page in more memory for this process but to do that it has to shut down some other cached process (that's a total guess).

In terms of workarounds, unfortunately the answer is to make smaller allocations. That might be a better code structure here anyway: you're effectively creating a deque, which is often implemented as a linked list of fixed-size blocks. That would limit each individual allocation and allow GC to more effectively reclaim memory.
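
For concreteness, a minimal sketch of that shape (hypothetical names, not fyne's actual code): a FIFO built from a linked list of fixed-size chunks, so no single allocation grows with the queue length, and dequeued slots are nil-ed out for the GC:

const chunkSize = 256 // bounds the size of any single allocation

type chunk struct {
	buf  [chunkSize]interface{}
	next *chunk
}

// chunkedQueue is a FIFO backed by a linked list of fixed-size chunks.
type chunkedQueue struct {
	head, tail       *chunk
	headIdx, tailIdx int
}

func (q *chunkedQueue) push(v interface{}) {
	if q.tail == nil || q.tailIdx == chunkSize {
		c := &chunk{}
		if q.tail == nil {
			q.head = c
		} else {
			q.tail.next = c
		}
		q.tail = c
		q.tailIdx = 0
	}
	q.tail.buf[q.tailIdx] = v
	q.tailIdx++
}

func (q *chunkedQueue) pop() (interface{}, bool) {
	if q.head == nil || (q.head == q.tail && q.headIdx == q.tailIdx) {
		return nil, false
	}
	v := q.head.buf[q.headIdx]
	q.head.buf[q.headIdx] = nil // drop the reference so the GC can reclaim it
	q.headIdx++
	if q.headIdx == chunkSize {
		q.head = q.head.next
		q.headIdx = 0
		if q.head == nil {
			q.tail, q.tailIdx = nil, 0
		}
	}
	return v, true
}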

@changkun
Member

Fixed-size blocks are a nice structure, but they add a bit more complexity to the resizing implementation. Before doing that, I still wonder why append causes a huge allocation. Especially:

		q := make([]*int, 0, 1<<10) // slice
		for {
			e, ok := <-ch.in
			...
			q = append(q, e)                         // (1) large alloc (?)
			for len(q) > 0 {
				select {
				case ch.out <- q[0]:
					q = q[1:]                // (2) here
				case e, ok := <-ch.in:
					if ok {
						q = append(q, e) // (3) large alloc (?)
						break
					}

The previous estimate (~12MB) is really a worst-case scenario, meaning no receiver is available for 20 seconds, which is entirely unrealistic; a real case should be better than that. The slice operation in (2) moves the head of the slice, which removes the reference to previous elements. A large allocation in append means it doubles an "already-huge" slice capacity. Hence it seems the slice caching the elements can only be GCed very late. Is there a particular reason the GC does not collect those unreferenced elements in the slice?

@aclements
Member

Is there a particular reason the GC does not collect those unreferenced elements in the slice?

The GC doesn't work on partial objects. That is, it doesn't track the "earliest" reference into a slice, just that there was some reference to the slice. And once the entire slice is retained by the GC, it also retains everything it points to.

One could certainly imagine GC tracking the earliest pointer into a slice and only scanning the slice past that point (in fact, you could do this for all objects). This gets really tricky when there are multiple references to an object. For example, the GC could have already followed one reference and scanned the object past that point, only to later find a reference that points earlier in the object and either have to scan it again or track the point in each object it last scanned from. It's all theoretically doable, but it wouldn't pay for its own costs in most cases.

If you want GC to collect the unreferenced elements, set q[0] to nil before slicing it. But that's not going to help with the large allocations.
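
Concretely, that would be a one-line change in the select branch of the snippet above (illustrative; it only helps the GC reclaim the elements, not the backing array):

case ch.out <- q[0]:
	q[0] = nil // drop the reference so the GC can collect the element
	q = q[1:]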

@NeetishPathak

@aclements does runtime.mallocgc block preemption in Go 1.18? We are observing a similar issue with a long sweep termination phase. The allocation size is 32K-64K.

@felixge
Contributor

felixge commented Mar 30, 2023

Just commenting since we're also seeing this issue on some services internally. The worst outliers can be up to 15s:

[screenshot]

Here is some detail on a 2s pause we were able to catch while running an execution trace.

[screenshot]

This isn't critical for us and we think we know how we could mitigate the problem by changing our code. I'm just commenting since it came up in a discussion with @prattmic today. My colleague @pmbauer did the investigation on this.

@mknyszek
Contributor

I did some thinking about breaking up the zeroing of large pointerful objects. I think it's doable, but it's pretty subtle, and I might be missing something (there may be some GC interaction I'm not seeing, specifically).

It would look something like this:

  1. Have spanAlloc set the initial class for a large scan span to noscan.
  2. spanAlloc does an atomic span state write, then publishes the span to the GC, as usual.
  3. After returning from spanAlloc (via largeAlloc), mallocgc becomes preemptible.
  4. mallocgc zeroes the span.
  5. mallocgc becomes non-preemptible again.
  6. mallocgc writes out the pointer bitmap.
  7. mallocgc sets the span as scannable.

There's an additional detail here: we don't update freeIndex until the second non-preemptible section, I think.

Setting the span as scannable late and non-atomically works because, if it happens during a GC cycle (or if we transition into a GC cycle while zeroing), we'll mark the object before returning, so it'll never need to get scanned (and thus its class doesn't matter). STWs are also effectively global memory barriers, so GC transitions are safe.

I think the only downside is how subtle this all is.

@mknyszek
Contributor

One way to make the approach in my previous comment less complex would be to push more of the responsibilities of initializing heap spans up closer to mallocgc. (For example, make it the mcache.largeAlloc's and mcentral.grow's problem.) In that case all new spans could be allocated initially in such a way that the GC ignores them, or treats them as not having pointers. It would then be allocSpan caller's responsibility to get it into the right state, publish it, etc.

The main benefit of this is it would significantly reduce the distance between where the span class is set for good and where the subtleties of it actually matter. Large object span setup (including zeroing) could then be pushed down into largeAlloc, and then all the subtleties could live in one function instead of being spread all over the allocator.

@evanj
Contributor

evanj commented May 12, 2023

Just to clarify, because I misunderstood some of the comments here: at least as of Go 1.20, large allocations of chunks that do not contain pointers also still block GC, so this problem currently does not depend on whether the allocation contains pointers or not (see memclrNoHeapPointers in the trace that @felixge posted as an example). (Edited: this is incorrect; @mknyszek's next comment #31222 (comment) clarifies that large blocks that do not contain pointers have had preemption points since approximately Go 1.17, if I read the commit history correctly.)

@mknyszek
Contributor

mknyszek commented May 12, 2023

memclrNoHeapPointers doesn't mean that the allocation isn't for a type that contains pointers; it just means that there are no valid pointers the GC has to execute write barriers for. It's a bit of a misnomer, but we did actually fix the problem for allocations of types without pointers in them, in that we clear in chunks with explicit preemption points in between. If there were a problem with that, I would expect to see memclrNoHeapPointersChunked in the trace.
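
(As a user-level illustration of the chunked-clearing idea only; this is not the runtime's memclrNoHeapPointersChunked, just a sketch of zeroing a large buffer in bounded pieces with explicit yield points in between:)

import "runtime"

// clearChunked zeroes buf in fixed-size pieces so that no single
// non-preemptible stretch is proportional to len(buf).
func clearChunked(buf []byte) {
	const chunk = 256 << 10 // 256 KiB per piece
	for len(buf) > 0 {
		n := chunk
		if n > len(buf) {
			n = len(buf)
		}
		b := buf[:n]
		for i := range b {
			b[i] = 0
		}
		buf = buf[n:]
		runtime.Gosched() // explicit yield point between pieces
	}
}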

@mknyszek
Contributor

Also, I think it's too late in this cycle to do anything here for Go 1.21, unfortunately (clearing memory that will soon contain pointers is subtle), but I will do something here for 1.22 early-in-cycle.

We've brainstormed a few ideas, and I feel pretty OK about the idea of faking a span as noscan until its allocation is ready to be returned. We already do this kind of faking elsewhere, so we're at least not adding too much new complexity.

@mknyszek mknyszek modified the milestones: Backlog, Go1.22 May 12, 2023
@mknyszek mknyszek self-assigned this May 12, 2023
@mknyszek mknyszek added NeedsFix The path to resolution is known, but the work has not been done. compiler/runtime Issues related to the Go compiler and/or runtime. labels May 12, 2023
@gopherbot gopherbot removed the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label May 12, 2023
@mknyszek mknyszek added the early-in-cycle A change that should be done early in the 3 month dev cycle. label May 12, 2023
@evanj
Contributor

evanj commented May 12, 2023

Oh interesting; thank you for the rapid clarification! Now I need to go back to debugging the application a bit more, since there is something happening that I don't understand...

@RLH
Contributor

RLH commented May 18, 2023

It is my understanding that when the kernel returns a block of zeroed memory, all the virtual addresses point to a single physical, write-protected page of zeros. The first write to a page ends up allocating a physical page, zeroing it, and associating the virtual memory with the physical memory. This copy-on-write (or zero-on-write) greatly reduces the latency of asking for zeroed memory and then does the zeroing incrementally, on demand, when the page is written to.

Can we mimic this approach for large allocations? I suspect it can’t be done in user space but perhaps we can round trip the memory to the kernel so when it returns it works like fresh memory.

If this works then the problem with allocating, deallocating, and reallocating large objects will also see reduced latency. I have no idea how large of an allocation would be needed for this to be a win.

@mknyszek
Contributor

mknyszek commented May 19, 2023

That's an interesting idea. You can force this via madvise(MADV_DONTNEED) on Linux: the next access will always return zeroed memory, on-demand. I suspect it's more expensive than just zeroing for O(10s of pages) or less since they get recycled often (previous experiments with MADV_DONTNEED indicated that calling it on ~every span allocation leads to a bad time).

But, for larger allocations that need to be zeroed anyway, it might not actually matter (or, the latency win far outweighs any throughput loss). This could be a much simpler and safer way to achieve the latency improvement vs. the chunking approach. Thanks for the suggestion Rick!

The one downside is I don't think a similar approach will work on non-Linux platforms. For instance, the comparable mechanism on Windows does actually cause a fault. On the BSDs there's only madvise(MADV_FREE) which I think is much lazier and doesn't make the "zeroed on access" guarantee.
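
For reference, a Linux-only sketch of the round trip Rick describes, using golang.org/x/sys/unix on plain mmap'd memory (illustrative; the runtime would do this on its own spans rather than via Mmap like this):

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	const size = 1 << 30 // 1 GiB

	// Anonymous mapping: untouched pages read as zero via the kernel's zero page.
	mem, err := unix.Mmap(-1, 0, size,
		unix.PROT_READ|unix.PROT_WRITE,
		unix.MAP_PRIVATE|unix.MAP_ANONYMOUS)
	if err != nil {
		panic(err)
	}
	mem[0] = 1 // dirty a page

	// Hand the pages back: the next access is guaranteed to read as zero,
	// with the zeroing done lazily by the kernel on fault.
	if err := unix.Madvise(mem, unix.MADV_DONTNEED); err != nil {
		panic(err)
	}
	fmt.Println(mem[0]) // prints 0 without any explicit zeroing by us
}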

@gopherbot

This issue is currently labeled as early-in-cycle for Go 1.22.
That time is now, so a friendly reminder to look at it again.

@mknyszek
Contributor

This is going to be a good bit easier to do with the new allocation headers landing in Go 1.22. Punting to Go 1.23, but this time for sure. :)

@mknyszek mknyszek modified the milestones: Go1.22, Go1.23 Nov 28, 2023
@gopherbot

This issue is currently labeled as early-in-cycle for Go 1.23.
That time is now, so a friendly reminder to look at it again.

@gopherbot

Change https://go.dev/cl/577495 mentions this issue: runtime: make zeroing of large objects containing pointers preemptible
