cmd/compile,runtime: speed up heap allocations in loops #60879

Open
CAFxX opened this issue Jun 20, 2023 · 1 comment
Labels: compiler/runtime, FeatureRequest, NeedsInvestigation, Performance
Milestone: Unplanned

CAFxX (Contributor) commented Jun 20, 2023

There is currently a large performance gap between performing N individual heap allocations of type T and performing a single allocation of a slice of N elements of type T (even when T contains no pointers and the size of T exactly matches the size class used for T).

Just to give an idea, the following benchmark¹ shows the performance gap between the two for a varying number of elements N (old is N individual allocations, new is a single allocation of a slice of N elements):

name          old time/op  new time/op  delta
Allocs/1      65.4ns ± 0%  65.2ns ± 0%     ~     (p=0.079 n=4+5)
Allocs/2       100ns ± 3%    74ns ±11%  -25.60%  (p=0.008 n=5+5)
Allocs/4       184ns ± 1%    84ns ± 4%  -54.26%  (p=0.016 n=4+5)
Allocs/8       315ns ± 1%   103ns ±11%  -67.41%  (p=0.008 n=5+5)
Allocs/16      580ns ± 2%   132ns ±12%  -77.17%  (p=0.008 n=5+5)
Allocs/32     1.18µs ±16%  0.19µs ±24%  -83.93%  (p=0.008 n=5+5)
Allocs/64     2.21µs ± 0%  0.29µs ±25%  -86.93%  (p=0.008 n=5+5)
Allocs/128    4.75µs ±22%  0.51µs ± 9%  -89.32%  (p=0.008 n=5+5)
Allocs/256    12.6µs ±17%   0.9µs ± 4%  -92.50%  (p=0.008 n=5+5)
Allocs/512    22.4µs ±12%   1.8µs ± 5%  -91.94%  (p=0.008 n=5+5)
Allocs/1024   42.5µs ± 6%   3.5µs ± 5%  -91.81%  (p=0.008 n=5+5)
Allocs/2048   85.2µs ±12%   7.1µs ±13%  -91.72%  (p=0.008 n=5+5)
Allocs/4096    169µs ±10%    14µs ± 8%  -91.82%  (p=0.008 n=5+5)
Allocs/8192    298µs ±15%    25µs ± 1%  -91.63%  (p=0.008 n=5+5)
Allocs/16384   542µs ± 1%    49µs ± 3%  -91.02%  (p=0.008 n=5+5)
Allocs/32768  1.07ms ± 1%  0.09ms ± 2%  -91.11%  (p=0.008 n=5+5)
Allocs/65536  2.13ms ± 2%  0.19ms ± 3%  -91.20%  (p=0.008 n=5+5)
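
For reference, the two sides of the comparison look roughly like the sketch below. This is an illustrative reconstruction, since the benchmark code itself is not included here: the element type T and the benchmark structure are assumptions.

package allocs_test

import (
  "fmt"
  "testing"
)

type T [16]byte // assumed element type: pointer-free, matching a size class

var sink *T
var sinkSlice []T

func BenchmarkAllocs(b *testing.B) {
  for n := 1; n <= 65536; n *= 2 {
    b.Run(fmt.Sprint(n), func(b *testing.B) {
      for i := 0; i < b.N; i++ {
        // "old": n individual allocations...
        for j := 0; j < n; j++ {
          sink = new(T)
        }
        // ..."new": replace the inner loop above with a single
        // allocation of a slice of n elements:
        //   sinkSlice = make([]T, n)
      }
    })
  }
}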

To help close the gap, it may be worthwhile to investigate performing batch allocations for heap allocations that occur inside loops. This pattern shows up frequently in unmarshaling code and, more generally, in data processing code.
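
As an illustration (the Item type and decodeAll are made up for this example; encoding/json stands in for any unmarshaler), this kind of code performs one heap allocation per loop iteration:

import "encoding/json"

type Item struct{ A, B int } // made-up payload type

func decodeAll(dec *json.Decoder) ([]*Item, error) {
  var items []*Item
  for dec.More() {
    e := new(Item) // one call into the runtime allocator per iteration
    if err := dec.Decode(e); err != nil {
      return nil, err
    }
    items = append(items, e)
  }
  return items, nil
}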

To clarify, I am not suggesting turning N allocations of type T into a single allocation of a vector of N elements of type T, as that would prevent garbage collection of the individual elements. Instead, by "batch allocation" I mean performing the allocation of N individual elements of type T in a single call into the runtime. This way, each of the N elements can still be individually garbage collected.

This could mean, for a loop with a trip count known at compile time, turning this:

for ... /* trip count = C (known at compile time) */ {
  e := new(T)
  // use e
}

into this:

a := make([]*T, 0, C) // stack allocated
for ... {
  e := new_batched(&a)
  // use e
}
batch_deallocate(a) // optional

and for loops with a trip count known only at runtime, turning this:

for ... /* trip count = c (not known at compile time) */ {
  e := new(T)
  // use e
}

into:

a := make([]*T, 0, 64) // stack allocated
for ... {
  if len(a) == 0 && remaining_trip_count < cap(a) { // optional
    a = a[:0:remaining_trip_count]
  }
  e := new_batched(&a)
  // use e
}
batch_deallocate(a) // optional

and for loops with unknown trip count, turning this:

for ... {
  e := new(T)
  // use e
}

into:

a := make([]*T, 0, 64) // stack allocated
for ... {
  e := new_batched(&a)
  // use e
}
batch_deallocate(a) // optional

where the runtime function calls inserted above by the compiler would be equivalent to:

func new_batched[T any](a *[]*T) *T {
  if a == nil || cap(*a) <= 1 {
    return new(T) // batching disabled or pointless: plain allocation
  }
  if len(*a) == 0 {
    // refill: extend the batch to its full capacity and allocate
    // all of its elements in a single runtime call
    *a = (*a)[:cap(*a)]
    batch_allocate(*a)
  }
  // pop the next preallocated element off the batch
  e := (*a)[len(*a)-1]
  *a = (*a)[:len(*a)-1]
  return e
}

func batch_allocate[T any](a []*T) {
  // this would be a runtime call that allocates len(a) elements of type T
  for i := range a {
    a[i] = new(T)
  }
}

func batch_deallocate[T any](a []*T) {
  // this would be a runtime call that deallocates the len(a) unused
  // elements of type T
  for _, e := range a {
    _ = e // mark e as unused in the corresponding bitmap
  }
}

In a sense, this would entail having loop-scoped bump allocators to avoid entering the main heap allocator for each individual element.
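
For intuition, a purely userspace version of such a bump allocator could look like the sketch below (names are made up; note that, unlike the runtime-assisted scheme proposed above, this naive version ties the lifetimes of all elements in a chunk together, which is exactly the drawback batch allocation is meant to avoid):

// chunkAlloc hands out *T values carved out of chunks of 64
// elements, so only 1 in 64 requests enters the heap allocator.
type chunkAlloc[T any] struct {
  chunk []T
}

func (c *chunkAlloc[T]) new() *T {
  if len(c.chunk) == 0 {
    c.chunk = make([]T, 64) // one heap allocation per 64 elements
  }
  e := &c.chunk[0]
  c.chunk = c.chunk[1:]
  return e
}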

Surely some heuristics would be needed to decide when to use batch allocation and when not to: for example, we definitely would not want to use batch allocation for loops with very low trip counts. Nevertheless, I think this would be a worthwhile optimization to consider.

An alternative to the above would be to specialize the heap allocator per {size class, has pointers} pair, or otherwise to add generic, fast per-P bump allocators (which would have the benefit of being effective for all code, not just loops). These seem like more invasive changes to the runtime, though.

Other alternatives include the (over)use of sync.Pool but, as has been mentioned in the past, almost any use of sync.Pool is an indication that the memory management subsystem is lacking in one way or another.
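
For completeness, the pattern in question is the familiar one below (a standard sync.Pool usage sketch with a placeholder type T, not anything specific to this proposal):

import "sync"

type T struct{ /* ... */ }

var pool = sync.Pool{
  New: func() any { return new(T) },
}

func process() {
  e := pool.Get().(*T)
  // use e, then reset it before handing it back to the pool
  *e = T{}
  pool.Put(e)
}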

Footnotes

  1. go version go1.20.5 darwin/amd64, Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz (turbo boost disabled)

gopherbot added the compiler/runtime label Jun 20, 2023
CAFxX changed the title from "runtime: speed up heap allocations in loops" to "cmd/compile,runtime: speed up heap allocations in loops" Jun 20, 2023
randall77 added this to the Unplanned milestone Jun 20, 2023
@rip-create-your-account

Opportunities for batch allocation are something that I stumble upon quite regularly and usually there's more than one allocation going on in the loop. Most recently in real code here.

I really don't like doing batch := make([]T, n); obj1 := &batch[0]; obj2 := &batch[1]; ... for the reasons already mentioned in the OP, but it improves performance on all metrics too much to not do it. Please guide us away from the temptation of one-at-a-time processing.
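
Spelled out with a placeholder type T, the pattern and its drawback:

batch := make([]T, 2) // one allocation backing both objects
obj1 := &batch[0]
obj2 := &batch[1]
// cheap to allocate, but the whole backing array stays reachable
// until both obj1 and obj2 are dead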

dmitshur added the NeedsInvestigation label Jun 22, 2023