
runtime: smoother growth rate transition yields great results #64877

Open
shoham-b opened this issue Dec 27, 2023 · 6 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@shoham-b

shoham-b commented Dec 27, 2023

Proposal Details

Proposal

In the growslice function, smoothly decrease the growth rate of newcap from x2 down to x1.125.
This is the benchmarked code change (edit: fixed a bug in the formula) -

		// Growth rate descent range.
		// Use a slope that slowly decreases the growth rate from x2 down to x1.125.
		// The slope would reach x1 at 1<<26, but is clamped at x1.125 from 1<<24 on, in steps of 1/16.
		const transitionStart = 1 << 10
		const transitionEnd = 1 << 24
		if oldCap < transitionStart {
			newcap = doublecap
		} else {
			var growthRateMul int
			if oldCap <= transitionEnd {
				growthRateMul = 27 - bits.Len64(uint64(oldCap)) // 27 - log2(oldCap)
			} else {
				growthRateMul = 2
			}
			// Check 0 < newcap to detect overflow
			// and prevent an infinite loop.
			for 0 < newcap && newcap < newLen {
				newcap += growthRateMul * (newcap >> 4)
			}

Benchmark

bash-3.2$ /usr/local/go/bin/go version
go version go1.21.5 darwin/arm64
bash-3.2$  GOROOT=/usr/local/go  /usr/local/go/bin/go test go-playground/pkg -bench .
goos: darwin
goarch: arm64
pkg: go-playground/pkg
BenchmarkGrowSlice-8   	       1	9178279667 ns/op	78178716512 B/op	   38481 allocs/op
PASS
ok  	go-playground/pkg	9.701s
bash-3.2$  GOROOT=~/git/go GOPATH=~/go ~/git/go/bin/go test go-playground/pkg -bench .
goos: darwin
goarch: arm64
pkg: go-playground/pkg
BenchmarkGrowSlice-8   	       1	8682857792 ns/op	72003354976 B/op	   25826 allocs/op
PASS
ok  	go-playground/pkg	8.951s

So about x1.5 fewer allocations, x1.08 less memory, and x1.05 less time.
The results are consistent with small variability, and also work with different capacity ranges.

This is the benchmark code -

package pkg

import (
	"runtime/debug"
	"testing"
)

// From runtime/sizeclasses.go.
// We only test allocations above this size, since below it the per-allocation "price" is too small to matter.
const _MaxSmallSize = 32768

func BenchmarkGrowSlice(b *testing.B) {
	baseSlice := make([]byte, _MaxSmallSize)
	debug.SetMemoryLimit(1 << 32)
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		// Test the average across a range of target sizes.
		for k := _MaxSmallSize * 10; k < 1<<24; k += 10001 {
			debug.FreeOSMemory()

			testSlice := baseSlice
			for j := 0; j < k; j += 10 {
				// Unroll the loop to reduce dependency on the loop time.
				testSlice = append(testSlice, 1)
				testSlice = append(testSlice, 1)
				testSlice = append(testSlice, 1)
				testSlice = append(testSlice, 1)
				testSlice = append(testSlice, 1)
				testSlice = append(testSlice, 1)
				testSlice = append(testSlice, 1)
				testSlice = append(testSlice, 1)
				testSlice = append(testSlice, 1)
				testSlice = append(testSlice, 1)
			}
		}
	}
}

Background

Up until go1.17, if the capacity was below 1024 the growth rate was x2, and if it was above 1024 the growth rate was x1.25.
In go1.18 this was smoothed somewhat: https://go-review.googlesource.com/c/go/+/347917
I was not able to find documentation of the rationale and reasoning behind that change.

Rationale

The rationale behind growing the capacity exponentially is to reduce the number of reallocations required.
With a constant growth factor, the number of reallocations grows as O(log(n)), where n is the final length.

But this has a trade-off: the greater the exponent (growth factor), the more capacity is allocated and never used.
That unused tail is proportional to the size of the last reallocation, hence O(n).
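To make the O(log(n)) claim concrete, here is a small sketch (illustrative only, not part of the proposal) that counts how many reallocations a constant growth factor needs to reach a target capacity:

```go
package main

import (
	"fmt"
	"math"
)

// reallocs counts the reallocations needed to grow from startCap to at
// least target when each step multiplies the capacity by f.
func reallocs(startCap, target int, f float64) int {
	count := 0
	for c := startCap; c < target; count++ {
		c = int(math.Ceil(float64(c) * f))
	}
	return count
}

func main() {
	// Doubling from 8 to 1<<20 takes log2(1<<20 / 8) = 17 steps;
	// a factor of 1.25 needs roughly three times as many.
	fmt.Println("x2.00:", reallocs(8, 1<<20, 2.0))
	fmt.Println("x1.25:", reallocs(8, 1<<20, 1.25))
}
```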

To mitigate this, go reduced the growth rate of large slices where memory starts to matter.

This change is too abrupt, both before the go1.18 fix and after, causing two issues -

  1. For small slices the growth is too small, so too many reallocations are needed.
  2. For big slices the growth rate is too big, so too much memory is wasted.

The proposal is to gradually decrease the growth factor from x2 to x1.125 across the typical size range of 1<<10 up to 1<<24, after which the growth rate stays at x1.125.

To keep using integer arithmetic, I quantized the factor to steps of 1/16.

Since it's a difference in the exponent, it leads to substantial performance differences.

Further investigation

The golden ratio

Some claim that a growth rate below the golden ratio (~1.618) is optimal for memory reuse:
https://stackoverflow.com/questions/1100311/what-is-the-ideal-growth-rate-for-a-dynamically-allocated-array
So it might be beneficial to decrease quickly to something below it, and only then start the gradual descent.
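The argument behind that claim can be sketched numerically. With a constant factor f, when block k (of size f^k) is allocated, block k-1 is still live (it is being copied from) while blocks 0..k-2 have already been freed; the freed region can be reused only once its total size reaches f^k. A toy model, assuming an idealized allocator that coalesces adjacent freed blocks:

```go
package main

import (
	"fmt"
	"math"
)

// reusePossible reports whether, within maxSteps reallocations at growth
// factor f, the total size of already-freed blocks ever reaches the size
// of the block being allocated.
func reusePossible(f float64, maxSteps int) bool {
	freed := 0.0
	for k := 1; k <= maxSteps; k++ {
		// Allocating block k of size f^k; blocks 0..k-2 are free,
		// block k-1 is still live while its contents are copied.
		if freed >= math.Pow(f, float64(k)) {
			return true
		}
		// Once the copy finishes, block k-1 is freed as well.
		freed += math.Pow(f, float64(k-1))
	}
	return false
}

func main() {
	fmt.Println("f=1.5:", reusePossible(1.5, 64)) // below phi: reuse eventually possible
	fmt.Println("f=2.0:", reusePossible(2.0, 64)) // freed blocks always sum to less than the next block
}
```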

Decrease rate

The growth factor decreases linearly with the number of reallocations, and therefore logarithmically with the capacity.
To achieve this I stepped the factor down at every power of 2 of the capacity, even as the growth slows.
There might be a better strategy, but it might not be worth the complexity.

Keep it high

Keeping the growth rate high at the beginning gave tremendously better results, but I don't know the exact form it should take.

@gopherbot gopherbot added this to the Proposal milestone Dec 27, 2023
@shoham-b
Author

I know I wrote all of the explanation above, but actually just keeping the growth rate at x2 gives even better results, so why not? Why lower it in the first place?
If the space complexity is only O(n), then the growth rate shouldn't matter.

Keeping it at x2 beat the golden ratio in every parameter for anything with capacity above 1024, so something here just wants a high growth rate.

golden ratio benchmark

bash-3.2$  GOROOT=~/git/go GOPATH=~/go ~/git/go/bin/go test go-playground/pkg -bench .
goos: darwin
goarch: arm64
pkg: go-playground/pkg
BenchmarkGrowSlice-8   	       1	8139819000 ns/op	48032196000 B/op	   18695 allocs/op
PASS
ok  	go-playground/pkg	8.511s

Keeping it at 2 benchmark

bash-3.2$  GOROOT=~/git/go GOPATH=~/go ~/git/go/bin/go test go-playground/pkg -bench .
goos: darwin
goarch: arm64
pkg: go-playground/pkg
BenchmarkGrowSlice-8   	       1	7253913416 ns/op	37604270040 B/op	   13737 allocs/op
PASS
ok  	go-playground/pkg	7.465s

It seems the allocated bytes decrease more slowly for a bigger growth rate. This could be because with a higher growth rate, more of the values from the previously grown slice are used, so the complexity I wrote above has to be amended somehow.

@randall77
Contributor

I was not able to find documentation of the rational and reasoning behind that.

Documentation is in CL 347917 and the linked discussions (the groups discussion linked to in the CL description, and the graphs linked to in the last comment).

So if I understand correctly, your change is just keeping the growth factor a bit larger for a bit longer?
Perhaps a graph like the one linked to in CL 347917 would make it clearer what this change is proposing.

So about x1.5 less allocations, x1.08 less memory and x1.05 less time.

Sounds like fewer total reallocations, which of course leads to less total memory and work. We'd get an even better effect measuring this way by just doing 2x reallocations always. The trick here is that "total memory" means total allocated memory. It does not reflect total live but unused memory, which will rise as we have fewer larger reallocations and thus more tail waste. How do we know what that effect will be?
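One direct way to observe the tail waste described here: grow a slice by repeated append and compare the final length with the capacity the runtime chose; the difference is live-but-unused memory that total-allocation counters never show. A minimal sketch:

```go
package main

import "fmt"

// grow appends target bytes one at a time and returns the resulting
// length and capacity.
func grow(target int) (length, capacity int) {
	var s []byte
	for i := 0; i < target; i++ {
		s = append(s, 1)
	}
	return len(s), cap(s)
}

func main() {
	l, c := grow(1 << 20)
	// cap-len is the tail: it stays live as long as the slice does,
	// but holds no data. A larger growth factor makes this tail bigger
	// on average, even as it reduces total allocated bytes.
	fmt.Printf("len=%d cap=%d tail=%d bytes\n", l, c, c-l)
}
```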

@randall77
Contributor

I don't think this needs to be in the proposal process, as it doesn't involve any API. We can just discuss it on this issue and decide.

@randall77 randall77 changed the title proposal: runtime/slice: smoother growth rate transition yields great results runtime: smoother growth rate transition yields great results Dec 28, 2023
@shoham-b
Author

shoham-b commented Dec 28, 2023

Added some benchmarks from the metrics package, to measure the tail capacity.

package pkg

import (
	"runtime/debug"
	"runtime/metrics"
	"testing"
)

func GetMetricsForLog() any {
	// Get descriptions for all supported metrics.
	m := []metrics.Sample{
		{Name: "/gc/heap/frees:bytes"},
		{Name: "/gc/heap/allocs:bytes"},
		{Name: "/gc/heap/live:bytes"},
		{Name: "/cpu/classes/gc/total:cpu-seconds"},
	}
	metrics.Read(m)
	return map[string]any{
		"gc_heap_frees":  m[0].Value.Uint64(),
		"gc_heap_allocs": m[1].Value.Uint64(),
		"gc_heap_live":   m[2].Value.Uint64(),
		"cpu_gc_total":   m[3].Value.Float64(),
	}
}

// From runtime/sizeclasses.go:
// We only test allocations above this size, since below it the per-allocation price is too small to matter.
const _MaxSmallSize = 32768

func BenchmarkGrowSlice(b *testing.B) {
	baseSlice := make([]byte, _MaxSmallSize)
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		allSlices := make([][]byte, 0, 1000)
		// Test the average across a range of target sizes.
		for k := _MaxSmallSize * 10; k < 1<<24; k += 10001 {
			debug.FreeOSMemory()
			testSlice := baseSlice
			for j := 0; j < k; j += 10 {
				// Unroll the loop to reduce dependency on the loop time.
				testSlice = append(testSlice, 1)
				testSlice = append(testSlice, 1)
				testSlice = append(testSlice, 1)
				testSlice = append(testSlice, 1)
				testSlice = append(testSlice, 1)
				testSlice = append(testSlice, 1)
				testSlice = append(testSlice, 1)
				testSlice = append(testSlice, 1)
				testSlice = append(testSlice, 1)
				testSlice = append(testSlice, 1)
			}
			// keep a pointer to prevent garbage collection
			allSlices = append(allSlices, testSlice)
		}
	}
	b.StopTimer()
	debug.FreeOSMemory()
	b.Log(GetMetricsForLog())
}

bash-3.2$  GOROOT=/usr/local/go  /usr/local/go/bin/go test go-playground/pkg -bench .
goos: darwin
goarch: arm64
pkg: go-playground/pkg
BenchmarkGrowSlice-8   	       1	13316692167 ns/op	78178604488 B/op	   36868 allocs/op
--- BENCH: BenchmarkGrowSlice-8
    growSlice_test.go:59: map[cpu_gc_total:1.841684158 gc_heap_allocs:78178862360 gc_heap_frees:78178704800 gc_heap_live:157872]
PASS
ok  	go-playground/pkg	13.951s
bash-3.2$  GOROOT=~/git/go GOPATH=~/go ~/git/go/bin/go test go-playground/pkg -bench .
goos: darwin
goarch: arm64
pkg: go-playground/pkg
BenchmarkGrowSlice-8   	       1	12797637958 ns/op	72003326144 B/op	   24958 allocs/op
--- BENCH: BenchmarkGrowSlice-8
    growSlice_test.go:59: map[cpu_gc_total:1.736739948 gc_heap_allocs:72003585408 gc_heap_frees:72003421600 gc_heap_live:164096]
PASS
ok  	go-playground/pkg	13.300s

So it does seem to use a bit more space that is not freed, but not much more.

Oh, I missed that groups discussion, oops.
It actually emphasizes the need for this change, since it does not seem to give an exact reason for the current constants.

Also, it suggests why the golden-ratio reasoning would not apply as of now: other slices may still point to the previously allocated backing array, so it cannot be reused.

In general, it seems more aggressive slice growth leads to better time and total memory consumption at the cost of a larger live heap, so we would have to assume both the typical range and distribution of slice sizes, and the desired memory-to-performance trade-off.

@shoham-b
Author

shoham-b commented Dec 28, 2023

So if I understand correctly, your change is just keeping the growth factor a bit larger for a bit longer?

It's a bit more than that - it keeps the factor higher at the start and lower at the end, to get the best of both worlds, rather than using one average for the whole range.
For some slice reallocation patterns it achieves both a smaller memory tail and fewer reallocations, but I'd like to hear some feedback before tuning the exact constants.

Perhaps a graph like the one linked to in CL 347917 would make it clearer what this change is proposing.

[graph: growth factor as a function of old capacity]
As the graph shows, the factor shrinks linearly with the number of resizes, and hence logarithmically with the old capacity.

It also has to be taken into account that the higher the capacity, the more "expensive" a reallocation becomes, since more pages are needed and the copy takes longer; so to some extent it makes sense to keep a higher growth rate for bigger slices, even though the tail would be significant.

@bcmills bcmills added Performance compiler/runtime Issues related to the Go compiler and/or runtime. labels Jan 2, 2024
@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 9, 2024
@randall77
Contributor

From your benchmark results, it looks like a ~4% cpu improvement at the cost of a ~4% live memory increase.
That doesn't sound particularly compelling. Especially because we care a bit more about space. Using a bit more space might cause your program to completely cease functioning. Using a bit more cpu is seldom as dire.

The benchmark doesn't have much live data itself, so it doesn't also measure the effect that a larger live set has. A larger live set implies a higher GC scanning cost (for slices with element types that have pointers, anyway) and possibly more frequent GCs.
