cmd/compile: avoid memclr for provably overwritten slices #57153

Open
dsnet opened this issue Dec 8, 2022 · 18 comments
Labels: compiler/runtime (Issues related to the Go compiler and/or runtime), NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one), Performance
Milestone: Backlog

Comments

@dsnet
Member

dsnet commented Dec 8, 2022

Consider the following pattern:

var n int
for _, src := range source {
	n += len(src)
}
b := make([]byte, 0, n)
for _, src := range source {
	b = append(b, src...)
}

which often occurs when merging segmented buffers.

Running this benchmark shows that the CPU time is dominated by runtime.memclrNoHeapPointers, which occupies 44% of the time, while runtime.memmove occupies 36% of the time (this is possibly faster than memclr due to caching effects from the former function).
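The benchmark itself is not reproduced here; a minimal sketch of the kind of benchmark that shows this profile (segment count and sizes are illustrative, not the original numbers) could look like:

package concat_test

import "testing"

// 64 segments of 64 KiB each: large enough that memclr and memmove
// dominate the profile rather than loop overhead.
var source = func() [][]byte {
	segs := make([][]byte, 64)
	for i := range segs {
		segs[i] = make([]byte, 1<<16)
	}
	return segs
}()

func BenchmarkMergeSegments(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var n int
		for _, src := range source {
			n += len(src)
		}
		buf := make([]byte, 0, n) // zeroes the whole capacity, which the loop below immediately overwrites
		for _, src := range source {
			buf = append(buf, src...)
		}
		_ = buf
	}
}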

The compiler should be able to prove that every byte of this slice is overwritten, and thus avoid the call to memclr entirely.

If detecting this pattern is too difficult, we could instead special-case bytes.Join such that the internal call to make avoids the memclr (perhaps by calling a runtime-internal variant of make that avoids zeroing). Thus, the pattern above could be replaced with bytes.Join(source, nil).

\cc @randall77 @martisch @lavalamp

@gopherbot added the compiler/runtime label Dec 8, 2022
@go101

go101 commented Dec 8, 2022

It is hard to implement this for all cases. But it might be a good idea to do some special optimizations for the upcoming slices std package.

@martisch
Contributor

martisch commented Dec 8, 2022

Generally I would not want to open the Pandora's box of giving std lib functions like bytes.Join special runtime access. Let's make the pattern simple enough to detect, and then bytes.Join can use the exact pattern so programmers don't need to remember it.

I suggest

var b []byte
for _, src := range source {
	b = append(b, src...)
}

We can teach the compiler to detect that pattern and avoid the unnecessary memclr (but only for non-pointer element types), as well as do only one allocation for the full b slice with enough room for all appends. This avoids the explicit length computation for b while still achieving a single allocation.

To make it work for bytes.Join, we can then expand this to also allow appending an unchanged variable or constant in the loop:

for _, src := range source {
	b = append(b, src...)
	b = append(b, var|const...)
}

Then join would be:

for _, src := range source {
	b = append(b, src...)
	b = append(b, var|const...)
}
b = b[:len(b)-len(var|const)]

Yes, that over-allocates by len(var|const), but if that matters we can always fine-tune the detection later to account for it. If we start this, let's have simple pattern detection first and expand to more variants once there is sufficient evidence that a pattern is widely used and has a high impact on performance.
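For illustration, a Join written to match that expanded pattern exactly might look like this (a sketch of the proposed shape, not the current bytes.Join implementation):

// join concatenates the elements of source, placing sep between them.
// The body deliberately has the shape described above: a fresh slice,
// a loop that appends src and the unchanged sep, then a trim of the
// trailing separator, over-allocating by len(sep).
func join(source [][]byte, sep []byte) []byte {
	if len(source) == 0 {
		return nil
	}
	var b []byte
	for _, src := range source {
		b = append(b, src...)
		b = append(b, sep...)
	}
	return b[:len(b)-len(sep)]
}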

@go101

go101 commented Dec 8, 2022

Another solution is to add a builtin concat (or merge) function to do the job (the need is common, and the built-in slice operations are currently incomplete).
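For reference, the helper that such a builtin would subsume can already be written as an ordinary generic function (a sketch, not a proposed API); the point of this issue is that the make call below zeroes the whole backing array even though the appends immediately overwrite every element:

// concat merges the given slices into one newly allocated slice.
func concat[T any](slices ...[]T) []T {
	var n int
	for _, s := range slices {
		n += len(s)
	}
	out := make([]T, 0, n) // zeroed by the runtime, then fully overwritten below
	for _, s := range slices {
		out = append(out, s...)
	}
	return out
}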

Some related issues: #23905, #29186, #24926

@lavalamp

lavalamp commented Dec 8, 2022

Re:

var b []byte
for _, src := range source {
	b = append(b, src...)
}

Requiring that to be the canonical pattern has the weird property that someone pre-allocating with the "right" size will get worse performance.

When I encountered this my first wish was for append to solve the problem via a double definition:

func append[T any](base []T, args ...T) []T
func append[T any](base []T, args ...[]T) []T

(I'm not claiming that's a great solution, just a data point about what one go-user would have found intuitive)

@martisch
Contributor

martisch commented Dec 8, 2022

Requiring that to be the canonical pattern has the weird property that someone pre-allocating with the "right" size will get worse performance.

After a first iteration that only accepts newly created slices, we can change the detection to just match:

for _, src := range source {
	b = append(b, src...)
}

and have some checks around existing versus needed capacity that determine how the allocation is made. However, it also complicates things, for example when b has free capacity but not enough of it: we would then still need to fill b with as many slice copies as the append loop would have made before allocating a new backing array, which requires more care and machinery.

I would still suggest we start with something more restricted and then iterate toward being less restrictive, rather than reaching for something more versatile but more complex right away.

func append[T any](base []T, args ...[]T) []T

I agree that a dedicated builtin, instead of a pattern, makes detection easier and the intention clearer. A language change, on the other hand, has a much higher bar to get accepted. There have been discussions in the past about allowing a variadic list of slices in append, but AFAIK they got negative feedback on the grounds that append shouldn't become an even more special language construct.

@lavalamp

lavalamp commented Dec 8, 2022

It seems like an alternative approach might be "lazy clearing" -- track an index i such that everything prior to that index doesn't need clearing and everything after it is uninitialized.

After an allocation, just avoid clearing the remaining part of the array until one of these conditions is met (a rough sketch of the bookkeeping follows the list):

  • A read from the uncleared part of the array
  • A write that doesn't start at or before i (and in this case i is incremented to become the end of the written data)
  • The end of the function
  • A reference to the array potentially escapes
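A user-level model of that invariant, purely for illustration (the real mechanism would live in the compiler and runtime, and the names here are made up):

// lazyBuf models an allocation whose tail has not been cleared yet:
// data[:cleared] is initialized, data[cleared:] must never be read.
type lazyBuf struct {
	data    []byte
	cleared int
}

// write copies p into data starting at off. If the write begins at or
// before the frontier, the frontier simply advances; otherwise the gap
// has to be zeroed first so no uninitialized bytes become readable.
func (b *lazyBuf) write(off int, p []byte) {
	if off > b.cleared {
		clear(b.data[b.cleared:off]) // the work the optimization tries to avoid
	}
	copy(b.data[off:], p)
	if end := off + len(p); end > b.cleared {
		b.cleared = end
	}
}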

@gopherbot

Change https://go.dev/cl/456336 mentions this issue: bytes, strings: avoid unnecessary zero initialization

@dsnet
Member Author

dsnet commented Dec 8, 2022

Generally I would not want to open the Pandora's box of giving std lib functions like bytes.Join special runtime access

Heh... that's exactly the approach I took in https://go.dev/cl/456336. I personally think it's fairly palatable since bytes and strings already depend on unsafe operations (especially through the internal/bytealg package).

As part of that CL, I also optimized strings.Builder, which will have many downstream benefits. Given the way strings.Builder uses its []byte, it would be very difficult for the compiler to prove that the unused capacity never leaks to the end user.

In general, this provides 2x performance gains for sufficiently large buffers. This makes sense since we're removing an entire write pass through the memory.

@go101

go101 commented Dec 8, 2022

func append[T any](base []T, args ...[]T) []T

is ambiguous if the type argument for T is an interface type.

@dsnet
Member Author

dsnet commented Dec 8, 2022

good idea to do some special optimizations for the upcoming slices std package.

I don't think this belongs in the slices package since this optimization only works for element types that do not contain pointers. There would need to be non-trivial runtime magic in the slices package to make this work (even more so than the internal/bytealg approach I took in https://go.dev/cl/456336).

@prattmic added the NeedsInvestigation label Dec 8, 2022
@prattmic
Member

prattmic commented Dec 8, 2022

cc @golang/compiler

@martisch
Contributor

martisch commented Dec 8, 2022

I personally think it's fairly palatable since bytes and strings already depend on unsafe operations (especially through the internal/bytealg package).

Using unsafe and assembly is not the same as giving special access into the runtime by linknaming. Yes, there are other packages that do that, like reflect, but I don't think bytes and strings are in the same category of being tied to internals. Additionally, bytes.MakeNoZero solves this for one specific type, which is also not ideal and will invite unsafe casts to other types.

Using internal/bytealg will also mean the functions are not copyable elsewhere, as the dependency cannot be met by anything but the std lib. This gives these functions special access that can't even be gained with unsafe, and therefore the std lib will win in performance without others being able to just copy the code and keep the same performance.

What I mean with Pandora's box is that this invites adding linknames from std packages everywhere into the runtime for special optimized operations. "More efficient map copy" -> create a map package and linkname it to a special function in the runtime. That's not to say we should not optimize these functions, but I think we should do so through patterns detected in the compiler and replaced by runtime calls, or through builtins, and not through linknaming, which is not portable or maintainable within the Go language because others cannot just add their own linknames into the runtime.

@dsnet
Member Author

dsnet commented Dec 8, 2022

@martisch: In principle, I agree with you. However, I'm not sure compiler optimization is going to fix all cases. For example, it seems like a hard (though solvable) problem to statically detect that the make([]byte, ...) in strings.Builder.Grow never leaks uninitialized bytes to the end user. I fear that we're letting the perfect become the enemy of the good.
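To illustrate why that proof is non-local, here is a heavily simplified sketch of the grow-and-write shape inside strings.Builder (not the real code): the bytes between len and cap of the new buffer stay uninitialized until later writes fill them, and String only ever exposes the initialized prefix, so the compiler would have to reason across several methods and all call sites.

// A deliberately simplified builder, not the real strings.Builder.
type builder struct{ buf []byte }

func (b *builder) grow(n int) {
	// Proving that buf2[len:cap] is always written before it is read
	// requires reasoning across grow, WriteString, String, and every caller.
	buf2 := make([]byte, len(b.buf), 2*cap(b.buf)+n)
	copy(buf2, b.buf)
	b.buf = buf2
}

func (b *builder) WriteString(s string) {
	if cap(b.buf)-len(b.buf) < len(s) {
		b.grow(len(s))
	}
	b.buf = append(b.buf, s...)
}

func (b *builder) String() string {
	return string(b.buf) // only the initialized prefix is ever exposed
}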

@martisch
Contributor

martisch commented Dec 9, 2022

@dsnet My view of "perfect is the enemy of good" comes from another angle. We are using strings.Builder, a case that is too complex to catch by pattern matching, to argue that patterns are not good enough to be used, and thereby to apply the heavy-cannon approach to all use cases. I think we should first cover the cases we can with pattern matching, without using linknames.

All those cases then just rely on compiler correctness and don't need to use unsafe.

The case of strings.Builder could be handled as a separate topic. We could discuss exposing, in the unsafe package, a function that allocates memory without zeroing. I'm not saying we should or will, but if we give strings.Builder that power without additional safeguards (compiler proof that this can never leak data), then why not give it to everyone, but make it explicitly unsafe? Then strings.Builder, which already uses unsafe, could use a public but unsafe function.

I would like to point out another issue. If we now go for milking every drop of performance out of the bytes and strings packages, then it will be hard to take them back later to something fast and good enough that doesn't rely on unsafe and linknaming, once we have pattern matching (which seems preferable). So with the in-between solution we may lock ourselves into a different trade-off of flexibility and maintainability than the actual target we may agree on.

@dsnet
Member Author

dsnet commented Dec 13, 2022

We are using strings.Builder, a case that is too complex to catch by pattern matching, to argue that patterns are not good enough to be used, and thereby to apply the heavy-cannon approach to all use cases

I'm using strings.Builder to make the case that we only use runtime-specific optimizations for just "bytes" and "strings", not all use cases. In fact, strings.Builder exists because the community got tired of waiting on promised compiler optimizations in #6714. It's been nearly a decade, and we still don't have that optimization. IMO, strings.Builder is a great addition to stdlib and I've used it to good effect for performance sensitive code.

If we now go for milking every drop of performance out of bytes and strings packages...

If the maintenance cost is light, we should milk every drop of performance out of "bytes" and "strings". https://go.dev/cl/456336 adds a dozen lines of code to the runtime in the same repo. This is not a situation where removing this in the future requires a complicated dance across multiple repos.

"strings" and "bytes" consistently come up as bottlenecks in logic that I write and I've sent various optimizations over the years to make it faster. By having these be faster, I don't have to use "unsafe" in my code. That's good as I don't like using "unsafe".

...then it will be hard to take it back to something fast and good enough later that doesnt rely on unsafe

This seems like a good litmus test then for the compiler to have the right optimization. Once the compiler optimization lands, it seems that the "bytes" changes in https://go.dev/cl/456336 could at least be reverted. It's doubtful the compiler could ever satisfy the use case in strings.Builder, but I've argued in the first paragraph why I think this optimization is appropriate.

(BTW, most of the changed lines in https://go.dev/cl/456336 are unrelated to the optimization itself, but rather to fix an existing integer overflow bug in the packages)

@martisch
Contributor

martisch commented Dec 13, 2022

TL;DR: Why not introduce a new function unsafe.AllocateMemoryWithNoPointersWithoutMemClr (name to be bikeshedded) so that linknaming is not needed and all developers get the possibility to optimize their code the way the bytes and strings packages will, similar to how strings.Builder builds on unsafe, which is available to everyone and not restricted to the std library?

I'm using strings.Builder to make the case that we only use runtime-specific optimizations for just "bytes" and "strings", not all use cases.

I meant "all" as in other use cases in bytes and strings that can be more locally optimized, dont need strings builder and are easier matches by the patterns as discussed in the beginning.

In fact, strings.Builder exists because the community got tired of waiting on promised compiler optimizations in #6714. It's been nearly a decade, and we still don't have that optimization. IMO, strings.Builder is a great addition to stdlib and I've used it to good effect for performance sensitive code.

One difference with strings.Builder is that anyone can create, copy, and modify it with the same base performance. https://go-review.googlesource.com/c/go/+/456336 does not allow this for the optimized functions without changing and redistributing a new runtime.

We do not disagree that a complex case such as strings.Builder does not look like it is going to be supported by the compiler anytime soon. However, that does not mean we need to abandon adding and using pattern matching for other bytes and strings use cases, and giving all developers the option to write code that copies into a slice without memclr.

If the maintenance cost is light, we should milk every drop of performance out of "bytes" and "strings".

Then why not add an unsafe function to allocate without memclr, giving all developers the possibility to optimize performance to the max and apply it to their own "bytes"- and "strings"-like functions? This could then also be applied in new generic libraries (e.g. in /exp/) that don't have the luxury of being shipped with std and thereby being linknamed into the runtime.

"strings" and "bytes" consistently come up as bottlenecks in logic that I write and I've sent various optimizations over the years to make it faster. By having these be faster, I don't have to use "unsafe" in my code. That's good as I don't like using "unsafe".

If we implement "bytes" and "strings" on top of an exposed allocate-without-memclr function in unsafe, then "your" code won't use "unsafe" either, similar to strings.Builder. And if we argue that unsafe applies transitively through called std lib functions, then the approach in https://go-review.googlesource.com/c/go/+/456336 is also using unsafe, just pushed down into a function in the runtime.

Once the compiler optimization lands, it seems that the "bytes" changes in https://go.dev/cl/456336 could at least be reverted.

I think if we go down the route of linknaming into the runtime, we should agree beforehand that it is OK to remove it if the compiler catches up, and that small performance differences should then not be a blocker to reverting the linknaming.

@andig
Contributor

andig commented Dec 13, 2022

I may misunderstand this. @martisch seems to imply that users will use "linknaming into runtime" once the bytealg function is available. After more than 5 years of Go, I'm aware of unsafe (though I've never needed to use it), but I have only a faint idea of linknaming. I seem to recall a Go issue where the Go team made clear that using it is "your problem". I may of course be wrong, but isn't bytealg, with its potential for misuse, a much smaller foot gun (given it's not promoted as Go API) than an unsafe function that actually invites you to use it?

@martisch
Contributor

martisch commented Dec 13, 2022

I may misunderstand this. @martisch seems to imply that users will use "linknaming into runtime" once the bytealg function is available.

The users of the bytes and strings packages don't need to do anything new. However, someone who wanted to copy the code into another package outside the std lib would not be able to do so (and have it working), as the function used to allocate without zeroing is referenced from a Go file in the runtime package. So making this usable elsewhere outside the std lib is not possible without adding another linkname and recompiling the runtime.

See the addition made to "src/runtime/slice.go" here: https://go-review.googlesource.com/c/go/+/456336/1/src/runtime/slice.go
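For readers unfamiliar with the mechanism, the runtime side looks roughly like this (a simplified sketch of the go:linkname pattern, not the exact code in the CL):

// src/runtime/slice.go (sketch): the implementation lives in the runtime
// because it needs mallocgc, and its symbol is pushed to internal/bytealg
// under a different name via the go:linkname directive.
//
//go:linkname bytealg_MakeNoZero internal/bytealg.MakeNoZero
func bytealg_MakeNoZero(len int) []byte {
	if uintptr(len) > maxAlloc {
		panicmakeslicelen()
	}
	// needzero=false: skip the memclr that an ordinary make would perform.
	return unsafe.Slice((*byte)(mallocgc(uintptr(len), nil, false)), len)
}

On the internal/bytealg side there is only a bodyless declaration of MakeNoZero; the linker resolves it to the runtime symbol above, which is exactly why a copy of this code outside the standard library fails to link.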

@prattmic added this to the Backlog milestone Dec 14, 2022
gopherbot pushed a commit that referenced this issue Feb 27, 2023
Add bytealg.MakeNoZero that specially allocates a []byte
without zeroing it. It assumes the caller will populate every byte.
From within the bytes and strings packages, we can use
bytealg.MakeNoZero in a way where our logic ensures that
the entire slice is overwritten such that uninitialized bytes
are never leaked to the end user.

We use bytealg.MakeNoZero from within the following functions:

* bytes.Join
* bytes.Repeat
* bytes.ToUpper
* bytes.ToLower
* strings.Builder.Grow

The optimization in strings.Builder transitively benefits the following:

* strings.Join
* strings.Map
* strings.Repeat
* strings.ToUpper
* strings.ToLower
* strings.ToValidUTF8
* strings.Replace
* any user logic that depends on strings.Builder

This optimization is especially notable on large buffers that
do not fit in the CPU cache, such that the cost of
runtime.memclr and runtime.memmove are non-trivial since they are
both limited by the relatively slow speed of physical RAM.

Performance:

	RepeatLarge/256/1             66.0ns ± 3%    64.5ns ± 1%      ~     (p=0.095 n=5+5)
	RepeatLarge/256/16            55.4ns ± 5%    53.1ns ± 3%    -4.17%  (p=0.016 n=5+5)
	RepeatLarge/512/1             95.5ns ± 7%    87.1ns ± 2%    -8.78%  (p=0.008 n=5+5)
	RepeatLarge/512/16            84.4ns ± 9%    76.2ns ± 5%    -9.73%  (p=0.016 n=5+5)
	RepeatLarge/1024/1             161ns ± 4%     144ns ± 7%   -10.45%  (p=0.016 n=5+5)
	RepeatLarge/1024/16            148ns ± 3%     141ns ± 5%      ~     (p=0.095 n=5+5)
	RepeatLarge/2048/1             296ns ± 7%     288ns ± 5%      ~     (p=0.841 n=5+5)
	RepeatLarge/2048/16            298ns ± 8%     281ns ± 5%      ~     (p=0.151 n=5+5)
	RepeatLarge/4096/1             593ns ± 8%     539ns ± 8%    -8.99%  (p=0.032 n=5+5)
	RepeatLarge/4096/16            568ns ±12%     526ns ± 7%      ~     (p=0.056 n=5+5)
	RepeatLarge/8192/1            1.15µs ± 8%    1.08µs ±12%      ~     (p=0.095 n=5+5)
	RepeatLarge/8192/16           1.12µs ± 4%    1.07µs ± 7%      ~     (p=0.310 n=5+5)
	RepeatLarge/8192/4097         1.77ns ± 1%    1.76ns ± 2%      ~     (p=0.310 n=5+5)
	RepeatLarge/16384/1           2.06µs ± 7%    1.94µs ± 5%      ~     (p=0.222 n=5+5)
	RepeatLarge/16384/16          2.02µs ± 4%    1.92µs ± 6%      ~     (p=0.095 n=5+5)
	RepeatLarge/16384/4097        1.50µs ±15%    1.44µs ±11%      ~     (p=0.802 n=5+5)
	RepeatLarge/32768/1           3.90µs ± 8%    3.65µs ±11%      ~     (p=0.151 n=5+5)
	RepeatLarge/32768/16          3.92µs ±14%    3.68µs ±12%      ~     (p=0.222 n=5+5)
	RepeatLarge/32768/4097        3.71µs ± 5%    3.43µs ± 4%    -7.54%  (p=0.032 n=5+5)
	RepeatLarge/65536/1           7.47µs ± 8%    6.88µs ± 9%      ~     (p=0.056 n=5+5)
	RepeatLarge/65536/16          7.29µs ± 4%    6.74µs ± 6%    -7.60%  (p=0.016 n=5+5)
	RepeatLarge/65536/4097        7.90µs ±11%    6.34µs ± 5%   -19.81%  (p=0.008 n=5+5)
	RepeatLarge/131072/1          17.0µs ±18%    14.1µs ± 6%   -17.32%  (p=0.008 n=5+5)
	RepeatLarge/131072/16         15.2µs ± 2%    16.2µs ±17%      ~     (p=0.151 n=5+5)
	RepeatLarge/131072/4097       15.7µs ± 6%    14.8µs ±11%      ~     (p=0.095 n=5+5)
	RepeatLarge/262144/1          30.4µs ± 5%    31.4µs ±13%      ~     (p=0.548 n=5+5)
	RepeatLarge/262144/16         30.1µs ± 4%    30.7µs ±11%      ~     (p=1.000 n=5+5)
	RepeatLarge/262144/4097       31.2µs ± 7%    32.7µs ±13%      ~     (p=0.310 n=5+5)
	RepeatLarge/524288/1          67.5µs ± 9%    63.7µs ± 3%      ~     (p=0.095 n=5+5)
	RepeatLarge/524288/16         67.2µs ± 5%    62.9µs ± 6%      ~     (p=0.151 n=5+5)
	RepeatLarge/524288/4097       65.5µs ± 4%    65.2µs ±18%      ~     (p=0.548 n=5+5)
	RepeatLarge/1048576/1          141µs ± 6%     137µs ±14%      ~     (p=0.421 n=5+5)
	RepeatLarge/1048576/16         140µs ± 2%     134µs ±11%      ~     (p=0.222 n=5+5)
	RepeatLarge/1048576/4097       141µs ± 3%     134µs ±10%      ~     (p=0.151 n=5+5)
	RepeatLarge/2097152/1          258µs ± 2%     271µs ±10%      ~     (p=0.222 n=5+5)
	RepeatLarge/2097152/16         263µs ± 6%     273µs ± 9%      ~     (p=0.151 n=5+5)
	RepeatLarge/2097152/4097       270µs ± 2%     277µs ± 6%      ~     (p=0.690 n=5+5)
	RepeatLarge/4194304/1          684µs ± 3%     467µs ± 6%   -31.69%  (p=0.008 n=5+5)
	RepeatLarge/4194304/16         682µs ± 1%     471µs ± 7%   -30.91%  (p=0.008 n=5+5)
	RepeatLarge/4194304/4097       685µs ± 2%     465µs ±20%   -32.12%  (p=0.008 n=5+5)
	RepeatLarge/8388608/1         1.50ms ± 1%    1.16ms ± 8%   -22.63%  (p=0.008 n=5+5)
	RepeatLarge/8388608/16        1.50ms ± 2%    1.22ms ±17%   -18.49%  (p=0.008 n=5+5)
	RepeatLarge/8388608/4097      1.51ms ± 7%    1.33ms ±11%   -11.56%  (p=0.008 n=5+5)
	RepeatLarge/16777216/1        3.48ms ± 4%    2.66ms ±13%   -23.76%  (p=0.008 n=5+5)
	RepeatLarge/16777216/16       3.37ms ± 3%    2.57ms ±13%   -23.72%  (p=0.008 n=5+5)
	RepeatLarge/16777216/4097     3.38ms ± 9%    2.50ms ±11%   -26.16%  (p=0.008 n=5+5)
	RepeatLarge/33554432/1        7.74ms ± 1%    4.70ms ±19%   -39.31%  (p=0.016 n=4+5)
	RepeatLarge/33554432/16       7.90ms ± 4%    4.78ms ± 9%   -39.50%  (p=0.008 n=5+5)
	RepeatLarge/33554432/4097     7.80ms ± 2%    4.86ms ±11%   -37.60%  (p=0.008 n=5+5)
	RepeatLarge/67108864/1        16.4ms ± 3%     9.7ms ±15%   -41.29%  (p=0.008 n=5+5)
	RepeatLarge/67108864/16       16.5ms ± 1%     9.9ms ±15%   -39.83%  (p=0.008 n=5+5)
	RepeatLarge/67108864/4097     16.5ms ± 1%    11.0ms ±18%   -32.95%  (p=0.008 n=5+5)
	RepeatLarge/134217728/1       35.2ms ±12%    19.2ms ±10%   -45.58%  (p=0.008 n=5+5)
	RepeatLarge/134217728/16      34.6ms ± 6%    19.3ms ± 7%   -44.07%  (p=0.008 n=5+5)
	RepeatLarge/134217728/4097    33.2ms ± 2%    19.3ms ±14%   -41.79%  (p=0.008 n=5+5)
	RepeatLarge/268435456/1       70.9ms ± 2%    36.2ms ± 5%   -48.87%  (p=0.008 n=5+5)
	RepeatLarge/268435456/16      77.4ms ± 7%    36.1ms ± 8%   -53.33%  (p=0.008 n=5+5)
	RepeatLarge/268435456/4097    75.8ms ± 4%    37.0ms ± 4%   -51.15%  (p=0.008 n=5+5)
	RepeatLarge/536870912/1        163ms ±14%      77ms ± 9%   -52.94%  (p=0.008 n=5+5)
	RepeatLarge/536870912/16       156ms ± 4%      76ms ± 6%   -51.42%  (p=0.008 n=5+5)
	RepeatLarge/536870912/4097     151ms ± 2%      76ms ± 6%   -49.64%  (p=0.008 n=5+5)
	RepeatLarge/1073741824/1       293ms ± 5%     149ms ± 8%   -49.18%  (p=0.008 n=5+5)
	RepeatLarge/1073741824/16      308ms ± 9%     150ms ± 8%   -51.19%  (p=0.008 n=5+5)
	RepeatLarge/1073741824/4097    299ms ± 5%     151ms ± 6%   -49.51%  (p=0.008 n=5+5)

Updates #57153

Change-Id: I024553b7e676d6da6408278109ac1fa8def0a802
Reviewed-on: https://go-review.googlesource.com/c/go/+/456336
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Joseph Tsai <joetsai@digital-static.net>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>