
bytes: bytes.Clone(), using copy is faster than append. #55905

Open
imkos opened this issue Sep 28, 2022 · 6 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@imkos

imkos commented Sep 28, 2022

```go
// default
func clone(b []byte) []byte {
	if b == nil {
		return nil
	}
	return append([]byte{}, b...)
}

// fast
func cloneV2(b []byte) []byte {
	if b == nil {
		return nil
	}
	nb := make([]byte, len(b))
	copy(nb, b)
	return nb
}

// normal
func cloneV3(b []byte) []byte {
	if b == nil {
		return nil
	}
	nb := make([]byte, 0, len(b))
	return append(nb, b...)
}

var toTestB = []byte("JFIOSJFOISJFOIAJFKSDFSDlkadjfskfdskdfj31231111111111111111111111111111111111111111111")

func Benchmark_clone(b *testing.B) {
	for i := 0; i < b.N; i++ {
		clone(toTestB)
	}
}

func Benchmark_cloneV2(b *testing.B) {
	for i := 0; i < b.N; i++ {
		cloneV2(toTestB)
	}
}

func Benchmark_cloneV3(b *testing.B) {
	for i := 0; i < b.N; i++ {
		cloneV3(toTestB)
	}
}
```

```
Benchmark_clone-6      35159740    74.59 ns/op    96 B/op    1 allocs/op
Benchmark_cloneV2-6    43229692    59.33 ns/op    96 B/op    1 allocs/op
Benchmark_cloneV3-6    39171712    71.48 ns/op    96 B/op    1 allocs/op
```

@dmitshur dmitshur added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Sep 28, 2022
@dmitshur dmitshur added this to the Backlog milestone Sep 28, 2022
@dmitshur
Contributor

CC @icholy, @martisch, @ianlancetaylor.

@dsnet
Member

dsnet commented Sep 28, 2022

The title seems to suggest that bytes should use copy rather than append. Perhaps we can view this instead as figuring out why append is slower than copy in this case and optimize that instead?

@martisch
Contributor

martisch commented Sep 28, 2022

Generally, append needs to find the smallest allocation size class that fits the slice, may then allocate a larger slice with more capacity than the original, and may need to zero the additional space. So it will often do more work than make+copy with the same length, as it provides extra bonus capacity that would otherwise be unused memory.

In addition, make+copy is optimized by the compiler: https://go-review.git.corp.google.com/c/go/+/146719
A comparable fast path for append that would recognize append([]byte{}, b...) does not exist yet.

While we can make append([]byte{}, b...) faster in general, especially for bytes, it cannot be as fast as make+copy.
The decision to prefer append, which provides additional capacity (at the cost of being slower initially), has already been made for exp/slices.Clone and was copied to bytes.Clone.
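The capacity difference described above can be observed directly. A minimal sketch (the exact rounded-up capacity is an implementation detail of the gc toolchain's size classes, so only a hedged expectation is noted in the comments):

```go
package main

import "fmt"

func main() {
	b := make([]byte, 86) // same length as the benchmarked slice

	viaAppend := append([]byte{}, b...) // bytes.Clone-style
	viaMake := make([]byte, len(b))     // make+copy-style
	copy(viaMake, b)

	// On the gc toolchain, append typically rounds the capacity up to the
	// allocation size class (96 bytes here), while make keeps cap == len.
	fmt.Println(len(viaAppend), cap(viaAppend))
	fmt.Println(len(viaMake), cap(viaMake))
}
```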

Same issue just for slices.Clone: #53643

@go101

go101 commented Sep 29, 2022

"make+copy" will also find the smallest allocation size class. So the actual reason why "append" is slower is that it zeroes the additional space.

Currently, the "make+copy" optimization has many restrictions. For example, calling make with three arguments will not trigger the optimization.
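A minimal sketch of that restriction, assuming the gc compiler's pattern matching as described in the CL linked earlier: both functions below behave identically, and only whether the make+copy optimization can apply differs.

```go
package main

import "fmt"

// cloneTwoArg uses the two-argument form of make, which the compiler's
// make+copy optimization can pattern-match.
func cloneTwoArg(b []byte) []byte {
	nb := make([]byte, len(b))
	copy(nb, b)
	return nb
}

// cloneThreeArg passes an explicit capacity; per the comment above, this
// three-argument form does not trigger the optimization.
func cloneThreeArg(b []byte) []byte {
	nb := make([]byte, len(b), len(b))
	copy(nb, b)
	return nb
}

func main() {
	b := []byte("abc")
	fmt.Printf("%s %s\n", cloneTwoArg(b), cloneThreeArg(b)) // prints "abc abc"
}
```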

@martisch
Contributor

martisch commented Sep 29, 2022

For both "make+copy" and append, the allocator will find the smallest allocation size class, but "append" additionally does this explicitly before calling the allocator, because the allocator has no mode to allocate as many elements of a given type as fit the size class and report back how much was allocated.

"make+copy" just uses the length as is to compute the value for mallocgc:

```go
tomem = et.size * uintptr(tolen)
```

"append" first calculates the size class explicitly and the length that fits in it before calling mallocgc:

```go
capmem = roundupsize(uintptr(newcap))
```

Using roundupsize, "append" explicitly searches for the size class before calling the allocator, which "make+copy" does not.
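The rounding described above can be sketched in user code. This is not the runtime's actual roundupsize; the class table below lists just a few of the runtime's small-object size classes for illustration:

```go
package main

import "fmt"

// A few of the runtime's small-object size classes, for illustration only.
var sizeClasses = []uintptr{8, 16, 24, 32, 48, 64, 80, 96, 112, 128}

// roundupSize mimics what append's growslice does conceptually: round a
// requested byte size up to the next allocation size class.
func roundupSize(size uintptr) uintptr {
	for _, c := range sizeClasses {
		if size <= c {
			return c
		}
	}
	return size // larger objects are rounded differently by the real allocator
}

func main() {
	// The 86-byte slice from the benchmark lands in the 96-byte class,
	// matching the 96 B/op reported by all three benchmarks.
	fmt.Println(roundupSize(86)) // prints 96
}
```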

In general, growslice, which is used by append, is less specialized than "make+copy" (in the general use where it is optimized): it computes more parameters and has to handle different cases (no additional capacity vs. additional capacity, the existing slice being appended to already fitting the added items) that "make+copy" does not. This is both because the compiler specializes "make+copy" more for different cases and because "make+copy" generally needs to do less.

If this is important enough and does not degrade performance in the general case the following can be evaluated:

  • specialize append more by having the compiler select a specialized implementation based on constraints on the length/capacity parameters (e.g. append to a slice known to have length 0)
  • specialize append more by having the compiler select a specialized implementation decided at compile time by element type (e.g. append for 0- or 1-byte elements, or elements with or without pointers), so that the runtime overhead inside append is reduced
  • add a mode to mallocgc such that it can decide to fill a size class and report the length back (I expect this not to make a great difference and not to be a net win, as it will add overhead to mallocgc calls that don't need this feature; it also has only one use case so far, and that's growslice)

On the high level I think we first need consensus if the issue is to be:

  1. use make+copy in Clone, as it will always be faster if the same amount of optimization is applied, giving up append's advantage of extra capacity. But then exp/slices.Clone should do the same; #53643 (cmd/compile: suboptimal cloning/optimization in slices.Clone) seems to suggest we won't do that but should, if anything, optimize more in the compiler.
  2. a general call to optimize append more (see above), so we do not need to keep one issue open per instance where the performance difference is observed.

I would think we can deduplicate this into issue #53643 and comment there that this would also benefit bytes.Clone.

@imkos
Author

imkos commented Sep 29, 2022

My idea is that since the internal implementation of Clone() uses append, it should be optimized as much as possible, because this function will be called a lot.

No branches or pull requests

5 participants