arena: possible performance improvements: huge pages, free approach #51667
Comments
cc @mknyszek
Thanks for the detailed experiments and data. There's a more direct way to force huge pages that I'd been considering: we already have

Also I've got patches on the prototype out to:
Your experiments on partial reuse are also great. I feel like at that point we should just use smaller arenas instead. My patch for (2) causes
Hi @mknyszek
FWIW, I did try that as well in a couple of different spots, and it did not seem to make a difference, but perhaps I did not call it in the right spot or made some other mistake.
Interesting. THP has always been a little specific about its requirements. When I get back to this (arenas in general, I mean, got a bunch of other stuff on my plate right now) I'll experiment with that some more.
Also, it looked like sysHugePage might already be called in the original CL 387975 via reflect_unsafe_newUserArenaChunk -> allocSpan -> sysUsed -> sysHugePage, but I'm not 100% confident in that.
@thepudds Based on your suggestions I implemented https://go.dev/cl/423361. I think I included all your suggestions, though rather than changing the reuse policy I just made arenas 8 MiB in size to begin with. Checking against the same benchmark, I got roughly the same performance difference you got, so yay reproducibility! Thanks again for the time you took to analyze this more closely. :)
@mknyszek, that's great to hear, and glad it was helpful! FWIW, I think there are likely some implications from these improvements... I had a few related comments in #51317, including this snippet:
This is more possible with https://go.dev/cl/423361 since the arena chunk sizes are now decoupled from the heap arena size. You can't have a chunk size larger than a heap arena at the moment, but any power-of-two smaller works. As for having lots of different chunk sizes, that becomes straightforward if the chunks are just taken from the page allocator, since differently-sized chunks can be returned to the page heap instead of sitting on a list. It wouldn't be too hard to do, I don't think. Something worth considering in the future perhaps!
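To make the power-of-two constraint concrete, here is a minimal self-contained sketch. The helper names (`validChunkSize`, `roundChunkSize`) are mine, not the runtime's, and the 64 MB `heapArenaBytes` value matches linux/amd64; other platforms differ.

```go
package main

import (
	"fmt"
	"math/bits"
)

const heapArenaBytes = 64 << 20 // heap arena size on linux/amd64

// validChunkSize reports whether a user arena chunk size satisfies the
// constraint described above: a power of two no larger than a heap arena.
func validChunkSize(n uintptr) bool {
	return n > 0 && n&(n-1) == 0 && n <= heapArenaBytes
}

// roundChunkSize rounds n up to the next power of two, capped at the
// heap arena size.
func roundChunkSize(n uintptr) uintptr {
	if n == 0 {
		return 0
	}
	p := uintptr(1) << bits.Len(uint(n-1))
	if p > heapArenaBytes {
		p = heapArenaBytes
	}
	return p
}

func main() {
	fmt.Println(validChunkSize(8<<20))       // true: 8 MB is a power of two
	fmt.Println(roundChunkSize(5<<20) >> 20) // 8: 5 MB rounds up to 8 MB
}
```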
Hi, thanks so much for working on the arena GOEXPERIMENT! I took it for a test drive today and found it very usable. However, I quickly ran into the limitations discussed above around many concurrently live arenas. I'm interested to see whether arenas could be applicable to CockroachDB's use cases. In particular, during the query planning phase, CockroachDB performs a large number of allocations that have identical or very similar lifetimes, and it would be impactful to be able to arena-allocate them. However, CockroachDB typically hosts a large number of concurrent SQL sessions, in the 1000s. Because of this, the use case I'm describing would really need configurable arena chunk sizes, like you've mentioned above. So, +1 to that idea!
The other thing that I wanted to mention is that it would be impactful to be able to inspect an arena to understand the amount of memory reserved for it. Without that, it would be tough for the CockroachDB use case I mentioned to use arenas as implemented, because CockroachDB needs to account for the memory it allocates to ensure that when running out of memory the server can react appropriately (e.g. queuing or rejecting incoming queries, and causing queries to spill to disk).
What version of Go are you using (`go version`)?

CL 387975 (patch set 5)

Does this issue reproduce with the latest release?

n/a

What operating system and processor architecture are you using (`go env`)?

`go env` Output

What did you do?
I was interested in understanding the performance described in the arena proposal (#51317), so ran a few benchmarks using the prototype arena implementation in CL 387975 (patch set 5).
I didn't look very much at the `interface{}`/reflect-based API, and instead started by adding a simple generic API for `arena.NewOf[T any](a *Arena) *T`.

Looking at the initial performance on this particular benchmark (links below), some things jumped out:
I poked at it a few different ways, and concluded the arenas in the benchmark were not getting huge pages. I checked some OS-level settings that didn't seem to help (e.g., `/sys/kernel/mm/transparent_hugepage/enabled` was defaulted to `always`; `/sys/kernel/mm/transparent_hugepage/defrag` defaulted to `madvise`, but changing to `always` didn't help).

I then built up a simple C program that does a similar series of `mmap(... PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, ...)`, `mmap(... PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, ...)`, and `madvise(... MADV_HUGEPAGE)` on 64 MB chunks at a time, in a light attempt to emulate the syscalls done by the unmodified arena code. The C program seemed to get "correct" behavior of huge pages. I also used strace to contrast syscalls by the Go runtime vs. glibc malloc (which also does mmaps under the covers and ends up "correctly" with huge pages on this machine).

Based on that, I tried a few modifications to the Go runtime, with the main results below. The largest improvement was forcing huge pages by doing a memclr within the runtime on each new 2MB piece of the 64 MB arena chunks.
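The original C program isn't included above, but a rough Linux-only Go approximation of the same syscall sequence is below (names are mine). One caveat: the `syscall` package can't request a `MAP_FIXED` remap at a chosen address the way the runtime does, so this uses `mprotect` to commit the reservation, which reaches the same end state for an anonymous private mapping.

```go
package main

import (
	"fmt"
	"syscall"
)

// reserveCommitHuge approximates the sequence described above: reserve
// address space with PROT_NONE, commit it read/write, then request
// transparent huge pages with MADV_HUGEPAGE.
func reserveCommitHuge(size int) ([]byte, error) {
	// Reserve: anonymous, private, no access yet (like the runtime's sysReserve).
	mem, err := syscall.Mmap(-1, 0, size,
		syscall.PROT_NONE,
		syscall.MAP_PRIVATE|syscall.MAP_ANONYMOUS)
	if err != nil {
		return nil, err
	}
	// Commit: make the range read/write (stand-in for the MAP_FIXED remap).
	if err := syscall.Mprotect(mem, syscall.PROT_READ|syscall.PROT_WRITE); err != nil {
		return nil, err
	}
	// Hint THP. This can fail on kernels built without THP, so it is
	// reported rather than treated as fatal.
	if err := syscall.Madvise(mem, syscall.MADV_HUGEPAGE); err != nil {
		fmt.Println("madvise(MADV_HUGEPAGE):", err)
	}
	return mem, nil
}

func main() {
	mem, err := reserveCommitHuge(64 << 20) // one 64 MB arena-sized chunk
	if err != nil {
		panic(err)
	}
	// Touch each 2MB piece so the fault path can install huge pages.
	for off := 0; off < len(mem); off += 2 << 20 {
		mem[off] = 1
	}
	fmt.Println("mapped", len(mem)>>20, "MB")
}
```

Whether the kernel actually installs huge pages can be checked in `/proc/self/smaps` (the `AnonHugePages` field for the mapping).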
Some heavy caveats: these were quick YOLO changes to poke at the performance I was observing, and probably not the actual changes you would want 😅. And of course, all of this might be a red herring, an OS config issue, user error, or something else entirely...
What did you see?
Here is a summary of the main performance results, all with `GOMAXPROCS=8`:

Baseline: no arenas

With unmodified go (no arenas):
Runtime patch 1: add arenas using `arena.NewOf[T any](a *Arena) *T`. No other changes.

With modified go:
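The patch itself isn't shown above. As a self-contained illustration of the shape of the generic API, here is a toy bump-allocator version; the `Arena`/`NewArena` types are mine, and this deliberately ignores the chunk management and GC pointer-scanning concerns the real runtime-backed implementation must handle.

```go
package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

// Arena is a toy bump allocator standing in for the runtime-backed arena in
// CL 387975. It exists only to illustrate the generic API's shape.
type Arena struct {
	buf []byte
	off uintptr
}

func NewArena(size int) *Arena { return &Arena{buf: make([]byte, size)} }

// NewOf returns a zeroed *T allocated from the arena, mirroring the
// arena.NewOf[T any](a *Arena) *T signature used in the benchmark.
func NewOf[T any](a *Arena) *T {
	t := reflect.TypeOf((*T)(nil)).Elem()
	size, align := t.Size(), uintptr(t.Align())
	off := (a.off + align - 1) &^ (align - 1) // align the bump pointer
	if off+size > uintptr(len(a.buf)) {
		panic("arena full")
	}
	a.off = off + size
	return (*T)(unsafe.Pointer(&a.buf[off]))
}

type node struct {
	left, right *node
	count       int
}

func main() {
	a := NewArena(1 << 20)
	n := NewOf[node](a)
	n.count = 42
	fmt.Println(n.count) // 42
}
```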
Runtime patch 2: memclr 2MB pieces of arena chunks prior to allowing use
With modified go:
Runtime patch 3: unmap chunk once >8 MB is used
With modified go:
Sample benchmark output
The benchmark creates a small number of large binary trees and a large number of small binary trees, and also walks each tree to count its nodes. Here is sample output from `binarytree-arena 21`. Larger values will create more and larger trees.
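The benchmark source isn't included above; as a rough illustration of the workload it describes, here is the classic binary-trees shape (build a perfect tree, then walk it to count nodes) without arenas. Names are mine.

```go
package main

import "fmt"

type node struct{ left, right *node }

// bottomUpTree builds a perfect binary tree of the given depth.
func bottomUpTree(depth int) *node {
	if depth <= 0 {
		return &node{}
	}
	return &node{bottomUpTree(depth - 1), bottomUpTree(depth - 1)}
}

// countNodes walks the tree and returns the number of nodes, mirroring the
// "walks each tree to count its nodes" step above.
func countNodes(n *node) int {
	if n == nil {
		return 0
	}
	return 1 + countNodes(n.left) + countNodes(n.right)
}

func main() {
	t := bottomUpTree(10)
	fmt.Println(countNodes(t)) // a depth-10 perfect tree has 2^11 - 1 = 2047 nodes
}
```

In the arena variant, each tree's nodes would come from one arena (e.g. via `arena.NewOf[node]`) that is freed after the tree is counted, which is what makes the chunk sizing and huge-page behavior above matter.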