
runtime: contention in runtime.(*mheap).freeSpan.func1 on mheap_.lock #61427

Open
rhysh opened this issue Jul 18, 2023 · 2 comments
Assignees: mknyszek
Labels: compiler/runtime, NeedsInvestigation, Performance
Milestone: Backlog

rhysh commented Jul 18, 2023

This is the second in what I think will be a series of three issue reports about contention I've seen on various runtime-internal singleton locks in an app that typically runs on dual-socket linux/amd64 machines with 96 (or sometimes 64) hardware threads.

This issue is about contention in runtime.(*mheap).freeSpan.func1 on mheap_.lock.

I hope it'll be of interest to @mknyszek. Also CC @golang/runtime.

Sorry about the somewhat outdated version. I haven't seen related issues go by, so I hope the info in this one is still of some use.

What version of Go are you using (go version)?

$ go version
go version go1.19.6 linux/amd64

Does this issue reproduce with the latest release?

I don't know yet if the issue is present in Go 1.20 series or in the release candidates of the Go 1.21 series.

What operating system and processor architecture are you using (go env)?

The app runs on Linux / x86_64, typically on machines with two processor sockets (NUMA). I don't have the output of go env from the app's runtime environment.

What did you do?

A data-processing app received a higher volume of data than usual. Here are some more of its dimensions:

The average allocation size is about 128 bytes. Across the whole app, allocations average around 4 GiB per second, or 16 KiB of allocations every 4 µs. Split across the 96 threads the runtime uses for the app (GOMAXPROCS == number of hardware threads), that's about 40 MiB per second per thread: an average of 3 µs between allocations, or 16 KiB of allocations every 400 µs.
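
For concreteness, here's a quick back-of-envelope check of those rates. The constants are just the approximate figures above, not measurements:

```go
// Rough check of the allocation-rate arithmetic above.
package main

import "fmt"

func main() {
	const (
		appBytesPerSec = 4 << 30 // ~4 GiB/s across the whole app
		avgAllocBytes  = 128     // average allocation size
		threads        = 96      // GOMAXPROCS == hardware threads
	)
	perThread := float64(appBytesPerSec) / threads
	fmt.Printf("per-thread rate: ~%.0f MiB/s\n", perThread/(1<<20))            // ~43 MiB/s
	fmt.Printf("between allocations: ~%.1f µs\n", 1e6*avgAllocBytes/perThread) // ~2.9 µs
	fmt.Printf("per 16 KiB allocated: ~%.0f µs\n", 1e6*(16<<10)/perThread)     // ~370 µs
}
```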

What did you expect to see?

I expected to see very little contention on runtime-internal singleton locks.

What did you see instead?

Contention in runtime.(*mheap).freeSpan.func1 makes it hard for callers of runtime.mallocgc to do their required sweep assist work. The contention is on these two lines:

https://github.com/golang/go/blob/go1.19.6/src/runtime/mheap.go#L1486
https://github.com/golang/go/blob/go1.19.6/src/runtime/mheap.go#L1500

The code in runtime.(*sweepLocked).sweep closer to the root of the call stack looks interesting. It's the same at development tip now as it was in Go 1.15. From my read of it, when a swept span is completely or partially full, it gets pushed onto a lock-free stack. But if the swept span is completely empty, the resulting call to mheap_.freeSpan requires taking the global mheap_.lock.

@mknyszek, would it be possible for mheap_.freeSpan to use a lock-free structure (maybe lfstack) too? A rough sketch of one possible shape follows the quoted code below.

go/src/runtime/mgcsweep.go, lines 751 to 768 at 5d481ab:

```go
	if !preserve {
		// The caller may not have removed this span from whatever
		// unswept set its on but taken ownership of the span for
		// sweeping by updating sweepgen. If this span still is in
		// an unswept set, then the mcentral will pop it off the
		// set, check its sweepgen, and ignore it.
		if nalloc == 0 {
			// Free totally free span directly back to the heap.
			mheap_.freeSpan(s)
			return true
		}
		// Return span back to the right mcentral list.
		if uintptr(nalloc) == s.nelems {
			mheap_.central[spc].mcentral.fullSwept(sweepgen).push(s)
		} else {
			mheap_.central[spc].mcentral.partialSwept(sweepgen).push(s)
		}
	}
```
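
To make the question concrete, here's a minimal sketch of what I have in mind. None of this is real runtime code: a Treiber stack stands in for lfstack, a plain sync.Mutex stands in for mheap_.lock, and all of the names are invented. The idea is just that sweepers could push completely empty spans without taking a lock, and some later flush could return a whole batch to the page allocator under a single lock acquisition:

```go
// Hypothetical sketch only: defer fully-empty spans onto a lock-free
// stack, then return them to the heap in one batch.
package sketch

import (
	"sync"
	"sync/atomic"
)

// span stands in for runtime.mspan; next links it on the deferred stack.
type span struct {
	next *span
	// ... base address, page count, etc.
}

// freeQueue is a Treiber stack of swept-empty spans awaiting release.
type freeQueue struct {
	head atomic.Pointer[span]
}

// push adds a span without taking any lock.
func (q *freeQueue) push(s *span) {
	for {
		old := q.head.Load()
		s.next = old
		if q.head.CompareAndSwap(old, s) {
			return
		}
	}
}

// drain detaches the entire stack with a single CAS.
func (q *freeQueue) drain() *span {
	for {
		old := q.head.Load()
		if old == nil || q.head.CompareAndSwap(old, nil) {
			return old
		}
	}
}

type heap struct {
	mu      sync.Mutex // stands in for mheap_.lock
	pending freeQueue
}

// freeSpanDeferred is what sweep would call instead of mheap_.freeSpan.
func (h *heap) freeSpanDeferred(s *span) { h.pending.push(s) }

// flush returns every deferred span under one lock acquisition. It could
// run periodically, or when the allocator next needs pages.
func (h *heap) flush() {
	batch := h.pending.drain()
	if batch == nil {
		return
	}
	h.mu.Lock()
	defer h.mu.Unlock()
	for s := batch; s != nil; s = s.next {
		// ... mark s's pages free in the page allocator ...
	}
}
```

The open question with any deferral like this is when to flush, since deferred spans are unavailable for reuse until then.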


Data set 39f2fa9279db66ec9e1b0a19b63bb1f479d827b39abfc67dbd715a4e18f7680b is from an instance of the app running on a 96-thread machine. The CPU profile ran for 5.16 seconds, during which it collected samples corresponding to a total of 392 seconds of on-CPU time (an average of 76 on-CPU threads), of which 98.16 + 73.70 = 171.86 seconds (an average of 33 on-CPU threads) were in calls from runtime.(*mheap).freeSpan.func1 to lock and unlock mheap_.lock.

$ ln -s /usr/local/go/bin/go /tmp/redacted
$ (cd 39f2fa9279db66ec9e1b0a19b63bb1f479d827b39abfc67dbd715a4e18f7680b && go tool pprof -focus=futex '-show_from=^runtime\.' '-prune_from=runtime.(lock|unlock)' '-peek=runtime\.(lock|unlock)$' /tmp/redacted pprof/profile)                  
File: redacted
Type: cpu
Time: Jul 1, 2023 at 1:23pm (PDT)
Duration: 5.16s, Total samples = 391.71s (7591.56%)
Active filters:
   focus=futex
   show_from=^runtime\.
Showing nodes accounting for 181.79s, 46.41% of 391.71s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context 	 	 
----------------------------------------------------------+-------------
                                            98.16s 94.27% |   runtime.(*mheap).freeSpan.func1 (inline)
                                             2.81s  2.70% |   runtime.newMarkBits (inline)
                                             1.64s  1.57% |   runtime.(*mheap).allocSpan (inline)
                                             0.66s  0.63% |   runtime.chansend (inline)
                                             0.31s   0.3% |   runtime.(*spanSet).push (inline)
                                             0.27s  0.26% |   runtime.sellock (inline)
                                             0.10s 0.096% |   runtime.findRunnable (inline)
                                             0.06s 0.058% |   runtime.goschedImpl (inline)
                                             0.04s 0.038% |   runtime.stopm (inline)
                                             0.02s 0.019% |   runtime.checkRunqsNoP (inline)
                                             0.02s 0.019% |   runtime.newArenaMayUnlock (inline)
                                             0.02s 0.019% |   runtime.runOneTimer (inline)
                                             0.01s 0.0096% |   runtime.checkTimers (inline)
                                             0.01s 0.0096% |   runtime.setprofilebucket (inline)
     0.01s 0.0026% 0.0026%    104.13s 26.58%                | runtime.lock
                                           104.12s   100% |   runtime.lockWithRank (inline)
----------------------------------------------------------+-------------
                                            73.70s 95.05% |   runtime.(*mheap).freeSpan.func1 (inline)
                                             1.57s  2.02% |   runtime.(*mheap).allocSpan (inline)
                                             1.41s  1.82% |   runtime.newMarkBits (inline)
                                             0.32s  0.41% |   runtime.chansend (inline)
                                             0.22s  0.28% |   runtime.(*spanSet).push (inline)
                                             0.11s  0.14% |   runtime.selunlock (inline)
                                             0.08s   0.1% |   runtime.findRunnable (inline)
                                             0.06s 0.077% |   runtime.goschedImpl (inline)
                                             0.02s 0.026% |   runtime.chansend.func1 (inline)
                                             0.01s 0.013% |   runtime.freeSpecial (inline)
                                             0.01s 0.013% |   runtime.mProf_Free (inline)
                                             0.01s 0.013% |   runtime.runOneTimer (inline)
                                             0.01s 0.013% |   runtime.stopm (inline)
         0     0% 0.0026%     77.54s 19.80%                | runtime.unlock
                                            77.54s   100% |   runtime.unlockWithRank (inline)
----------------------------------------------------------+-------------

[Image attachment: freeSpan-pprof]

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 18, 2023
mknyszek commented

I'm generally aware of contention on this path, but the path to resolution isn't obvious with the current data structure. There is potential for some low-hanging fruit here, though. For instance, frees to the same 4 MiB-aligned region (pallocChunk) can be batched together: the page reclaimer could notice when it's freeing into the same region (possibly across goroutines) and batch those frees. The mpagealloc.go code already knows how to do this; its update function takes a bool indicating whether to treat the alloc/dealloc as contiguous. So this would look like: (1) lock, (2) set all the bits for the freed memory in the chunk, (3) do one non-contiguous update for those bits.
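
Roughly, as a sketch of that three-step shape (with made-up stand-ins for the real mpagealloc.go types; the actual bitmap and summary code look different):

```go
// Hypothetical sketch of batching frees within one palloc chunk:
// one lock acquisition, many bit updates, one summary update.
package sketch

import "sync"

const pagesPerChunk = 512 // 4 MiB chunk / 8 KiB pages

type chunkIndex uint64

type pageAlloc struct {
	mu     sync.Mutex                                  // stands in for mheap_.lock
	chunks map[chunkIndex]*[pagesPerChunk / 64]uint64 // 1 bit per page
}

// freeBatch frees all the given pages of one chunk together:
// (1) lock, (2) clear the allocated bit for each page,
// (3) one non-contiguous summary update for the whole chunk.
func (p *pageAlloc) freeBatch(ci chunkIndex, pages []uint) {
	p.mu.Lock()
	defer p.mu.Unlock()
	bits := p.chunks[ci]
	for _, pg := range pages {
		bits[pg/64] &^= 1 << (pg % 64)
	}
	p.updateSummary(ci, false /* frees need not be contiguous */)
}

// updateSummary stands in for the radix-tree summary maintenance
// that mpagealloc.go performs after an alloc or free.
func (p *pageAlloc) updateSummary(ci chunkIndex, contig bool) {
	// ... recompute and propagate this chunk's summary ...
}
```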

However, your profile above doesn't even mention the page reclaimer, and the case it does show is much harder to batch frees in. I briefly considered some kind of lock-free (or just much faster, but still locked) "free queue" for each chunk that gets flushed in a batch every once in a while, but that trades off against making freed memory unavailable in the meantime, and it's not obvious to me when the flush should happen.

But something I hadn't considered until now is that the runtime can fairly efficiently check whether a span is going to be empty (the page reclaimer needs this information anyway). Such spans could be placed on a separate sweep queue, so that deductSweepCredit prioritizes partially in-use spans instead. That would leave the sweeping of "freeable" spans mostly to the page reclaimer, which has a much stronger opportunity to batch frees.
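
A sketch of that split, assuming a cheap up-front mark-bit check (all names here are illustrative, not the runtime's real API):

```go
// Hypothetical sketch: route spans that sweeping will leave empty to a
// separate queue for the page reclaimer, so mutator sweep assists
// (deductSweepCredit) work on spans that become reusable immediately.
package sketch

import "sync"

type span struct{ next *span }

// queue stands in for a spanSet-like structure.
type queue struct {
	mu   sync.Mutex
	head *span
}

func (q *queue) push(s *span) {
	q.mu.Lock()
	s.next, q.head = q.head, s
	q.mu.Unlock()
}

type sweepWork struct {
	partial queue // live objects remain: swept by mallocgc assists
	empty   queue // no live objects: swept, and freed in batches, by the reclaimer
}

// route classifies a span before sweeping. hasMarked would come from a
// cheap scan of the span's mark bits, which the page reclaimer already
// knows how to do.
func (w *sweepWork) route(s *span, hasMarked bool) {
	if hasMarked {
		w.partial.push(s)
	} else {
		w.empty.push(s)
	}
}
```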

(As an aside, one could even imagine the page reclaimer reserving 4 MiB chunks to observe one at a time, with a single low-contention CAS loop per chunk à la allocNeedsZero, so that the atomic loads on the bitmap aren't necessary. That might be nice for arm64; it makes no difference on amd64.)
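
In sketch form, under the same caveat that these names are invented for illustration:

```go
// Hypothetical: a reclaimer claims a chunk with one CAS (cf. the
// allocNeedsZero pattern), after which it may read that chunk's mark
// bitmap with plain, non-atomic loads until it releases the chunk.
package sketch

import "sync/atomic"

var reserved [1 << 10]atomic.Bool // one flag per chunk; sized arbitrarily

func tryReserve(ci int) bool {
	return reserved[ci].CompareAndSwap(false, true)
}

func release(ci int) { reserved[ci].Store(false) }
```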

> would it be possible for mheap_.freeSpan to use a lock-free structure

Possible, yeah. :) But it's not easy or obvious. The expensive part of that is calling into mheap_.pages.free, and most of that cost is probably in the summary update code (in mpagealloc.go). There's a recent OSDI paper that implements a lock-free page-frame allocator for the Linux kernel, which at face value looks quite similar to Go's current page allocator. It almost certainly doesn't meet our requirements. (In fact, they point out that one of the reasons page-frame allocation is slow is that it's also used as part of a more general allocation strategy in the Linux kernel. The paper recognizes that they can do a lot better by focusing on just one problem.)

@heschi heschi added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 19, 2023

heschi commented Jul 19, 2023

cc @golang/runtime

@mknyszek mknyszek added this to the Backlog milestone Jul 25, 2023
@mknyszek mknyszek self-assigned this Jul 25, 2023