
arena: possible performance improvements: huge pages, free approach #51667

Open
thepudds opened this issue Mar 14, 2022 · 10 comments

@thepudds
Contributor

thepudds commented Mar 14, 2022

What version of Go are you using (go version)?

CL 387975 (patch set 5)

$ go version
go version devel go1.18-c217a17823 Tue Mar 1 10:12:02 2022 -0800 linux/amd64

Does this issue reproduce with the latest release?

n/a

What operating system and processor architecture are you using (go env)?

  • amd64 (AMD EPYC)
  • Debian 11 bullseye (stock GCE image; some light checking of similar behavior on Ubuntu 20.04 LTS)
go env Output
$ go env

GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/thepudds1460/.cache/go-build"
GOENV="/home/thepudds1460/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/thepudds1460/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/thepudds1460/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/thepudds1460/sdk/gotip"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/thepudds1460/sdk/gotip/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="devel go1.18-c217a17823 Tue Mar 1 10:12:02 2022 -0800"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/thepudds1460/arena-benchmark-wip/go.mod"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1871014333=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I was interested in understanding the performance described in the arena proposal (#51317), so ran a few benchmarks using the prototype arena implementation in CL 387975 (patch set 5).

I didn't look very much at the interface{}/reflect based API, and instead started by adding a simple generic API for arena.NewOf[T any](a *Arena) *T.
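
In use it looks roughly like this (an illustrative fragment; the NewArena constructor, the Free method, and the myNode type are placeholders assumed for the example, only NewOf's signature comes from the patch):

    a := arena.NewArena()       // assumed constructor in the prototype
    n := arena.NewOf[myNode](a) // *myNode allocated from arena a rather than the GC heap
    // ... use n ...
    a.Free()                    // releases everything allocated from a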

Looking at the initial performance on this particular benchmark (links below), some things that jumped out:

  • the cost of first writing to newly allocated memory in the user code seemed to dominate the overall time in these benchmarks.
  • high system time.
  • the kernel was spending time in places like handle_mm_fault, flush_tlb_mm_range, and friends.
  • many minor page faults.

I poked at it a few different ways and concluded that the arenas in the benchmark were not getting huge pages. I checked some OS-level settings, which didn't seem to help (e.g., /sys/kernel/mm/transparent_hugepage/enabled defaulted to always; /sys/kernel/mm/transparent_hugepage/defrag defaulted to madvise, and changing it to always didn't help).

I then built up a simple C program that does a similar series of mmap(... PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, ...), mmap(... PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, ...), madvise(... MADV_HUGEPAGE) on 64 MB chunks at a time in a light attempt to emulate syscalls done by the unmodified arena code. The C program seemed to get "correct" behavior of huge pages. I also used strace to contrast syscalls by the Go runtime vs. glibc malloc (which also does mmaps under the covers and ends up "correctly" with huge pages on this machine).
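
For reference, here is a rough Go approximation of that syscall sequence (the actual experiment was a standalone C program; this sketch uses raw syscalls, hard-codes MADV_HUGEPAGE's value, and is only meant to illustrate the reserve / commit / madvise pattern on a 64 MB chunk):

    // Rough approximation of the C experiment described above: reserve a
    // 64 MB region PROT_NONE, commit it read/write in place, ask for
    // transparent huge pages, then touch every page. Illustration only.
    package main

    import (
        "syscall"
        "unsafe"
    )

    const chunkSize = 64 << 20 // 64 MB, matching an arena chunk

    func main() {
        // mmap(NULL, 64MB, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
        addr, _, errno := syscall.Syscall6(syscall.SYS_MMAP,
            0, chunkSize, syscall.PROT_NONE,
            syscall.MAP_PRIVATE|syscall.MAP_ANONYMOUS, ^uintptr(0), 0)
        if errno != 0 {
            panic(errno)
        }
        // mmap(addr, 64MB, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0)
        if _, _, errno := syscall.Syscall6(syscall.SYS_MMAP,
            addr, chunkSize, syscall.PROT_READ|syscall.PROT_WRITE,
            syscall.MAP_PRIVATE|syscall.MAP_FIXED|syscall.MAP_ANONYMOUS, ^uintptr(0), 0); errno != 0 {
            panic(errno)
        }
        // madvise(addr, 64MB, MADV_HUGEPAGE); 14 is MADV_HUGEPAGE on Linux
        if _, _, errno := syscall.Syscall(syscall.SYS_MADVISE, addr, chunkSize, 14); errno != 0 {
            panic(errno)
        }
        // Touch every 4 KB page so the fault and THP behavior can be observed
        // (e.g. via /proc/self/smaps, perf, or minor fault counts).
        for off := uintptr(0); off < chunkSize; off += 4096 {
            *(*byte)(unsafe.Pointer(addr + off)) = 1
        }
    }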

Based on that, I tried a few modifications to the Go runtime, with the main results below. The largest improvement was forcing huge pages by doing memclr within the runtime on each new 2MB piece of the 64 MB arena chunks.

Some heavy caveats: these were quick YOLO changes to poke at the performance I was observing, and probably not the actual changes you would want 😅. And of course, all of this might be a red herring, an OS config issue, user error, or something else entirely...

What did you see?

Here is a summary of main performance results, all with GOMAXPROCS=8:

Baseline: no arenas

  • Benchmark -- this is the baseline benchmark, which is currently the fastest Go entry for the binarytree benchmark from the Benchmarks Game site.

With unmodified go (no arenas):

$ go install github.com/thepudds/arena-performance/cmd/binarytree-original@f595de77
$ binarytree-original 21    # 21 controls the size of the trees
     5.12 sec wall clock
    34.53 sec user
     5.50 sec system
    40.03 sec total cpu 
      537 MB RSS
   76,886 minor page faults

Runtime patch 1: add arenas using arena.NewOf[T any]

  • Go patch -- add simple implementation of arena.NewOf[T any](a *Arena) *T. No other changes.
  • Benchmark -- change original benchmark to use arenas via the generic API.

With modified go:

$ go install github.com/thepudds/arena-performance/cmd/binarytree-arena@f595de77
$ binarytree-arena 21
    12.50 sec wall clock   (144% increase from baseline)
    20.80 sec user
    77.92 sec system
    98.72 sec total cpu    (147% increase from baseline)
      750 MB RSS           (40% increase from baseline)
2,334,682 minor page faults

Runtime patch 2: memclr 2MB pieces of arena chunks prior to allowing use

  • Go patch -- also add calls to memclrNoHeapPointers within runtime/arena.go and reflect/arena.go to help commit memory in 2MB pieces, which seems to result in huge pages from the kernel (see the sketch after this list). There is very likely a better change than this, but it seemed to help substantially and at least suggests huge pages have an impact.
  • Benchmark -- same as prior
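
To illustrate the idea behind this patch (the actual change calls memclrNoHeapPointers inside the runtime and is not reproduced here): zeroing each fresh 2 MB piece touches every page of that piece up front, so the kernel can satisfy the faults with a single transparent huge page instead of 512 separate 4 KB pages. A user-space analogue, with a hypothetical commitPiece helper:

    // Hypothetical user-space analogue of runtime patch 2; names are
    // invented for illustration.
    const pieceSize = 2 << 20 // 2 MB, the transparent huge page size on amd64

    // commitPiece clears a fresh 2 MB piece of an arena chunk before it is
    // handed out, faulting the whole piece in at once rather than page by
    // page as user code first writes to it.
    func commitPiece(piece []byte) {
        for i := range piece {
            piece[i] = 0
        }
    }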

With modified go:

$ go install github.com/thepudds/arena-performance/cmd/binarytree-arena@f595de77
$ binarytree-arena 21
     2.55 sec wall clock   (50% reduction from baseline)
    19.34 sec user
     0.89 sec system
    20.23 sec total cpu    (49% reduction from baseline)
      765 MB RSS           (42% increase from baseline)
   14,492 minor page faults

Runtime patch 3: unmap chunk once >8 MB is used

  • Go patch -- also change reflect/arena.go so that arena Free unmaps a chunk once >8 MB is used (rather than waiting for the full 64 MB)
  • Benchmark -- same as prior

With modified go:

$ go install github.com/thepudds/arena-performance/cmd/binarytree-arena@f595de77
$ binarytree-arena 21
     2.71 sec wall clock   (47% reduction from baseline)
    20.11 sec user
     0.97 sec system
    21.08 sec total cpu    (47% reduction from baseline)
      312 MB RSS           (42% reduction from baseline)
   13,980 minor page faults

Sample benchmark output

The benchmark creates a small number of large binary trees and a large number of small binary trees, and also walks each tree to count its nodes. Here is sample output from binarytree-arena 21. Larger values will create more and larger trees.
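
For context, the shape of the per-tree work is roughly as follows (an illustrative sketch, not the actual benchmark source; it assumes the arena.NewOf helper from the patch above plus an arena.NewArena constructor and Free method):

    // Illustrative sketch of the benchmark's per-tree work: build a binary
    // tree of arena-allocated nodes, count its nodes, then free the arena.
    package main

    import (
        "arena" // assumed import path in the prototype toolchain
        "fmt"
    )

    type node struct{ left, right *node }

    func buildTree(a *arena.Arena, depth int) *node {
        n := arena.NewOf[node](a) // generic helper added in the patch
        if depth > 0 {
            n.left = buildTree(a, depth-1)
            n.right = buildTree(a, depth-1)
        }
        return n
    }

    func count(n *node) int {
        if n == nil {
            return 0
        }
        return 1 + count(n.left) + count(n.right)
    }

    func main() {
        a := arena.NewArena() // assumed constructor
        t := buildTree(a, 4)  // one "tree of depth 4" as in the output below
        fmt.Println(count(t)) // 31 nodes for depth 4
        a.Free()
    }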

$ binarytree-arena 21

   stretch tree of depth 22       arenas: 1      nodes: 8388607    MB: 128.0
  2097152 trees of depth 4        arenas: 992    nodes: 65011712   MB: 992.0
   524288 trees of depth 6        arenas: 1015   nodes: 66584576   MB: 1016.0
   131072 trees of depth 8        arenas: 1017   nodes: 66977792   MB: 1022.0
    32768 trees of depth 10       arenas: 993    nodes: 67076096   MB: 1023.5
     8192 trees of depth 12       arenas: 911    nodes: 67100672   MB: 1023.9
     2048 trees of depth 14       arenas: 683    nodes: 67106816   MB: 1024.0
      512 trees of depth 16       arenas: 512    nodes: 67108352   MB: 1024.0
      128 trees of depth 18       arenas: 128    nodes: 67108736   MB: 1024.0
       32 trees of depth 20       arenas: 32     nodes: 67108832   MB: 1024.0
long lived tree of depth 21       arenas: 1      nodes: 4194303    MB: 64.0
@heschi heschi added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 14, 2022
@heschi heschi added this to the Backlog milestone Mar 14, 2022
@heschi
Contributor

heschi commented Mar 14, 2022

cc @mknyszek

@mknyszek
Contributor

Thanks for the detailed experiments and data.

There's a more direct way to force huge pages that I'd been considering: we already have sysHugePage in the runtime, which is just madvise(MADV_HUGEPAGE). I appreciate your experiments there; they show that this is worthwhile to make work. (This should probably be better than zeroing, because in theory the OS is already giving us zeroed pages, so the zeroing is redundant? Unless I'm missing something.)

Also I've got patches on the prototype out to:

  1. Don't actually unmap the region on free, just re-map it as PROT_NONE; unmapping isn't actually safe since something else could map there.
  2. Make the 64 MiB chunk eligible for reuse once the GC finds no pointers into the region.

Your experiments on partial reuse are also great. I feel like at that point we should just use smaller arenas instead. My patch for (2) causes heapArena metadata to retain a 1:1 mapping with address space (as opposed to now, where the heapArena might be used for something else), so it'll be really quite feasible to handle smaller chunks, or partial chunks, in general.

@thepudds
Contributor Author

thepudds commented Mar 14, 2022

Hi @mknyszek

There's a more direct way to force huge pages that I'd been considering: we already have sysHugePage in the runtime, which is just madvise(MADV_HUGEPAGE).

FWIW, I did try that as well in a couple of different spots, and it did not seem to make a difference, but perhaps I did not call it in the right spot or made some other mistake.

@mknyszek
Contributor

Interesting. THP has always been a little specific about its requirements. When I get back to this (arenas in general, I mean, got a bunch of other stuff on my plate right now) I'll experiment with that some more.

@thepudds
Contributor Author

Also, it looked like sysHugePage might already be called in the original CL 387975 via reflect_unsafe_newUserArenaChunk -> allocSpan -> sysUsed -> sysHugePage, but not 100% confident in that.

@mknyszek
Contributor

@thepudds Based on your suggestions I implemented https://go.dev/cl/423361. I think I included all your suggestions, though rather than changing the reuse policy I just made arenas 8 MiB in size to begin with. Checking against the same benchmark, I got roughly the same performance difference you got, so yay reproducibility!

Thanks again for the time you took to analyze this more closely. :)

@thepudds
Contributor Author

@mknyszek, that's great to hear, and glad it was helpful!

FWIW, I think there are likely some implications from these improvements... I had a few related comments in #51317, including this snippet:

My guess (which might be wrong! ;-) based on what I saw in #51667 is that if arenas are adopted and there is a full implementation, it might be plausible to use 100s or perhaps even 1000s of them concurrently rather than "few to 10s" [from original proposal], though of course it would likely depend on how much RAM you have, how much benefit you are getting, and especially at the upper end it might come down to CPU vs. RAM tradeoffs.

If it ends up being 100s or 1000s concurrently, that would of course have some implications for how widely they are used, how viral they are in terms of APIs, and so on.

One API implication of wanting to use more arenas concurrently is that there might be some benefit to allowing a user to control or hint the batching size or chunk size. That said, that might not be worthwhile if the chunk size and/or batch size is "small enough" or perhaps auto-tuned. Also, the API could possibly allow getting a snapshot of how many allocations have been done by an arena instance. That could be helpful for things like testing and monitoring, and could allow a user to more conveniently decide whether to re-use an arena instance, although a user could in theory track that themselves.
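
Purely as an illustration of the kind of knobs being discussed (nothing like this exists in the prototype; every name here is invented):

    // Hypothetical API surface for the ideas above; invented for
    // illustration, not proposed or implemented anywhere.
    package arena

    // Options could let a caller hint how an arena batches its memory.
    type Options struct {
        // ChunkSize hints the size of the chunks backing the arena,
        // e.g. 8 << 20 for 8 MiB chunks.
        ChunkSize int
    }

    // Stats could expose a snapshot of an arena's usage, e.g. to help
    // decide whether to reuse the arena or free it.
    type Stats struct {
        Allocs   int64 // number of allocations made from this arena
        Reserved int64 // bytes of address space currently mapped for it
    }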

@mknyszek
Contributor

One API implication of wanting to use more arenas concurrently is that there might be some benefit to allowing a user to control or hint the batching size or chunk size.

This is more possible with https://go.dev/cl/423361 since the arena chunk sizes are now decoupled from the heap arena size. You can't have a chunk size larger than a heap arena at the moment, but any power-of-two smaller works.

As for having lots of different chunk sizes, that becomes straightforward if the chunks are just taken from the page allocator, since differently-sized chunks can be returned to the page heap instead of sitting on a list. It wouldn't be too hard to do, I don't think. Something worth considering in the future perhaps!

@jordanlewis
Copy link

Hi, thanks so much for working on the arena GOEXPERIMENT! I took it for a test drive today, and found it very usable. However, I quickly ran into the limitations discussed above around many concurrently live arenas.

I'm interested to see if arenas could be applicable for CockroachDB's use cases. In particular, during the query planning phase, CockroachDB performs a large number of allocations that have identical or very similar lifetimes. It would be impactful to be able to arena-allocate them.

However, CockroachDB typically hosts a large number of concurrent SQL sessions - in the 1000s. Because of this, the use case I'm describing would really need configurable arena chunk sizes, like you've mentioned above.

So, +1 to that idea!

@jordanlewis

The other thing I wanted to mention is that it would be impactful to be able to inspect an arena to understand the amount of memory reserved for it. Without that, it would be tough for the CockroachDB use case I mentioned to use arenas as implemented, because CockroachDB needs to account for the memory it allocates to ensure that, when running out of memory, the server can react appropriately (e.g., queuing or rejecting incoming queries, and causing queries to spill to disk).
