cmd/compile,runtime: speed up heap allocations in loops #60879

Open
CAFxX opened this issue Jun 20, 2023 · 1 comment
Labels: compiler/runtime, FeatureRequest, NeedsInvestigation, Performance
Milestone: Unplanned

CAFxX (Contributor) commented Jun 20, 2023

There is currently a large performance gap between performing N individual heap allocations of type T and performing a single allocation of a slice of N elements of type T (even when T contains no pointers and the size of T exactly matches the size class used for T).

Just to give an idea, the following benchmark¹ shows the performance gap between the two for a varying number of elements N (old is N individual allocations, new is a single allocation of a slice of N elements):

name          old time/op  new time/op  delta
Allocs/1      65.4ns ± 0%  65.2ns ± 0%     ~     (p=0.079 n=4+5)
Allocs/2       100ns ± 3%    74ns ±11%  -25.60%  (p=0.008 n=5+5)
Allocs/4       184ns ± 1%    84ns ± 4%  -54.26%  (p=0.016 n=4+5)
Allocs/8       315ns ± 1%   103ns ±11%  -67.41%  (p=0.008 n=5+5)
Allocs/16      580ns ± 2%   132ns ±12%  -77.17%  (p=0.008 n=5+5)
Allocs/32     1.18µs ±16%  0.19µs ±24%  -83.93%  (p=0.008 n=5+5)
Allocs/64     2.21µs ± 0%  0.29µs ±25%  -86.93%  (p=0.008 n=5+5)
Allocs/128    4.75µs ±22%  0.51µs ± 9%  -89.32%  (p=0.008 n=5+5)
Allocs/256    12.6µs ±17%   0.9µs ± 4%  -92.50%  (p=0.008 n=5+5)
Allocs/512    22.4µs ±12%   1.8µs ± 5%  -91.94%  (p=0.008 n=5+5)
Allocs/1024   42.5µs ± 6%   3.5µs ± 5%  -91.81%  (p=0.008 n=5+5)
Allocs/2048   85.2µs ±12%   7.1µs ±13%  -91.72%  (p=0.008 n=5+5)
Allocs/4096    169µs ±10%    14µs ± 8%  -91.82%  (p=0.008 n=5+5)
Allocs/8192    298µs ±15%    25µs ± 1%  -91.63%  (p=0.008 n=5+5)
Allocs/16384   542µs ± 1%    49µs ± 3%  -91.02%  (p=0.008 n=5+5)
Allocs/32768  1.07ms ± 1%  0.09ms ± 2%  -91.11%  (p=0.008 n=5+5)
Allocs/65536  2.13ms ± 2%  0.19ms ± 3%  -91.20%  (p=0.008 n=5+5)
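
For reference, the two sides of the comparison look roughly like the sketch below. This is an illustrative reconstruction, since the benchmark code itself is not included here: the element type T and the benchmark structure are assumptions.

package allocs_test

import (
  "fmt"
  "testing"
)

type T [16]byte // assumed element type: pointer-free, matching a size class

var sink *T
var sinkSlice []T

func BenchmarkAllocs(b *testing.B) {
  for n := 1; n <= 65536; n *= 2 {
    b.Run(fmt.Sprint(n), func(b *testing.B) {
      for i := 0; i < b.N; i++ {
        // "old": n individual allocations...
        for j := 0; j < n; j++ {
          sink = new(T)
        }
        // ..."new": replace the inner loop above with a single
        // allocation of a slice of n elements:
        //   sinkSlice = make([]T, n)
      }
    })
  }
}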

To help close the gap, it may be worthwhile to investigate performing batch allocations for heap allocations that occur inside loops. This pattern shows up frequently in unmarshaling code and, more generally, in data processing code.
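
As an illustration (the Item type and decodeAll are made up for this example; encoding/json stands in for any unmarshaler), this kind of code performs one heap allocation per loop iteration:

import "encoding/json"

type Item struct{ A, B int } // made-up payload type

func decodeAll(dec *json.Decoder) ([]*Item, error) {
  var items []*Item
  for dec.More() {
    e := new(Item) // one call into the runtime allocator per iteration
    if err := dec.Decode(e); err != nil {
      return nil, err
    }
    items = append(items, e)
  }
  return items, nil
}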

To clarify, I am not suggesting turning N allocations of type T into a single allocation of a vector of N elements of type T, as that would prevent garbage collection of the individual elements. Instead, by "batch allocation" I mean performing the allocation of N individual elements of type T in a single call into the runtime. This way, each of the N elements can still be individually garbage collected.

This could mean, for a loop with a trip count known at compile time, turning this:

for ... /* trip count = C (known at compile time) */ {
  e := new(T)
  // use e
}

into this:

a := make([]*T, 0, C) // stack allocated
for ... {
  e := new_batched(&a)
  // use e
}
batch_deallocate(a) // optional

and for loops with a trip count known only at runtime, turning this:

for ... /* trip count = c (not known at compile time) */ {
  e := new(T)
  // use e
}

into:

a := make([]*T, 0, 64) // stack allocated
for ... {
  if len(a) == 0 && remaining_trip_count < cap(a) { // optional
    a = a[:0:remaining_trip_count]
  }
  e := new_batched(&a)
  // use e
}
batch_deallocate(a) // optional

and for loops with unknown trip count, turning this:

for ... {
  e := new(T)
  // use e
}

into:

a := make([]*T, 0, 64) // stack allocated
for ... {
  e := new_batched(&a)
  // use e
}
batch_deallocate(a) // optional

where the runtime function calls inserted above by the compiler would be equivalent to:

func new_batched[T any](a *[]*T) *T {
  if a == nil || cap(*a) <= 1 {
    return new(T) // batching disabled or pointless: plain allocation
  }
  if len(*a) == 0 {
    // refill: extend the batch to its full capacity and allocate
    // all of its elements in a single runtime call
    *a = (*a)[:cap(*a)]
    batch_allocate(*a)
  }
  // pop the next preallocated element off the batch
  e := (*a)[len(*a)-1]
  *a = (*a)[:len(*a)-1]
  return e
}

func batch_allocate[T any](a []*T) {
  // this would be a runtime call that allocates len(a) elements of type T
  for i := range a {
    a[i] = new(T)
  }
}

func batch_deallocate[T any](a []*T) {
  // this would be a runtime call that deallocates the len(a) unused
  // elements of type T
  for _, e := range a {
    _ = e // mark e as unused in the corresponding bitmap
  }
}

In a sense, this would entail having loop-scoped bump allocators to avoid entering the main heap allocator for each individual element.
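
For intuition, a purely userspace version of such a bump allocator could look like the sketch below (names are made up; note that, unlike the runtime-assisted scheme proposed above, this naive version ties the lifetimes of all elements in a chunk together, which is exactly the drawback batch allocation is meant to avoid):

// chunkAlloc hands out *T values carved out of chunks of 64
// elements, so only 1 in 64 requests enters the heap allocator.
type chunkAlloc[T any] struct {
  chunk []T
}

func (c *chunkAlloc[T]) new() *T {
  if len(c.chunk) == 0 {
    c.chunk = make([]T, 64) // one heap allocation per 64 elements
  }
  e := &c.chunk[0]
  c.chunk = c.chunk[1:]
  return e
}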

Surely some heuristics would be needed to decide when to use batch allocation and when not to: for example, we definitely would not want to use batch allocation for loops with very low trip counts. Nevertheless, I think this would be a worthwhile optimization to consider.

An alternative to the above would be to specialize the heap allocator per {size class, has pointers} pair, or otherwise to add generic, fast per-P bump allocators (which would have the benefit of being effective for all code, not just loops). These seem like more invasive changes to the runtime, though.

Other alternatives include the (over)use of sync.Pool but, as has been mentioned in the past, almost any use of sync.Pool is an indication that the memory management subsystem is lacking in one way or another.
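
For completeness, the pattern in question is the familiar one below (a standard sync.Pool usage sketch with a placeholder type T, not anything specific to this proposal):

import "sync"

type T struct{ /* ... */ }

var pool = sync.Pool{
  New: func() any { return new(T) },
}

func process() {
  e := pool.Get().(*T)
  // use e, then reset it before handing it back to the pool
  *e = T{}
  pool.Put(e)
}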

Footnotes

  1. go version go1.20.5 darwin/amd64, Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz (turbo boost disabled)

gopherbot added the compiler/runtime label Jun 20, 2023
CAFxX changed the title from "runtime: speed up heap allocations in loops" to "cmd/compile,runtime: speed up heap allocations in loops" Jun 20, 2023
randall77 added this to the Unplanned milestone Jun 20, 2023
@rip-create-your-account

Opportunities for batch allocation are something that I stumble upon quite regularly and usually there's more than one allocation going on in the loop. Most recently in real code here.

I really don't like doing batch := make([]T, n); obj1 := &batch[0]; obj2 := &batch[1]; ... for the reasons already mentioned in the OP, but it improves performance on all metrics too much to not do it. Please guide us away from the temptation of one-at-a-time processing.
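
Spelled out with a placeholder type T, the pattern and its drawback:

batch := make([]T, 2) // one allocation backing both objects
obj1 := &batch[0]
obj2 := &batch[1]
// cheap to allocate, but the whole backing array stays reachable
// until both obj1 and obj2 are dead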

dmitshur added the NeedsInvestigation label Jun 22, 2023