
encoding/binary: fast-paths in Read and Write allocate despite attempted optimization #27403

Closed
nvanbenschoten opened this issue Aug 31, 2018 · 12 comments
Labels
FrozenDueToAge help wanted NeedsFix The path to resolution is known, but the work has not been done. Performance Suggested Issues that may be good for new contributors looking for work to do.
Milestone

Comments

@nvanbenschoten (Contributor)

encoding/binary.Read and encoding/binary.Write both include an attempt to avoid heap allocations by pointing a slice at a stack-allocated array with sufficient capacity to fit most basic types.

```go
func Read(r io.Reader, order ByteOrder, data interface{}) error {
	// Fast path for basic types and slices.
	if n := intDataSize(data); n != 0 {
		var b [8]byte
		var bs []byte
		if n > len(b) {
			bs = make([]byte, n)
		} else {
			bs = b[:n]
		}
		// ...
```

```go
func Write(w io.Writer, order ByteOrder, data interface{}) error {
	// Fast path for basic types and slices.
	if n := intDataSize(data); n != 0 {
		var b [8]byte
		var bs []byte
		if n > len(b) {
			bs = make([]byte, n)
		} else {
			bs = b[:n]
		}
		// ...
```

The "optimization" is fooling itself because the array escapes from the stack and is moved onto the heap anyway. This means that at best the optimization does nothing and at worst the optimization actually results in either an oversized allocation or an entirely unnecessary second allocation.

```
binary$ go build -gcflags='-m' 2>&1 | grep 'moved to heap'
./binary.go:164:7: moved to heap: b
./binary.go:263:7: moved to heap: b
```
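
The cost is easy to confirm empirically. A minimal benchmark along these lines (hypothetical, not part of the original report; `io.Discard` postdates this thread, which would have used `ioutil.Discard`) should report at least one allocation per call even for a fixed 8-byte value:

```go
package binary_test

import (
	"encoding/binary"
	"io"
	"testing"
)

// BenchmarkWriteUint64 makes the allocation visible: binary.Write of a
// fixed-size value allocates on every call despite the intended fast path.
func BenchmarkWriteUint64(b *testing.B) {
	b.ReportAllocs()
	var v uint64 = 42
	for i := 0; i < b.N; i++ {
		_ = binary.Write(io.Discard, binary.LittleEndian, v)
	}
}
```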

The array is moved to the heap because escape analysis cannot prove that the slice pointing at it doesn't itself escape. It can't prove that because the slice is passed to the provided ByteOrder interface:

```go
type ByteOrder interface {
	Uint16([]byte) uint16
	Uint32([]byte) uint32
	Uint64([]byte) uint64
	PutUint16([]byte, uint16)
	PutUint32([]byte, uint32)
	PutUint64([]byte, uint64)
	String() string
}
```

Escape analysis can't prove that the slice won't escape when passed to an arbitrary interface implementation, even though it can and does know that the slice won't escape if passed to a littleEndian or bigEndian:

```
binary$ go build -gcflags='-m' 2>&1 | grep 'does not escape'
./binary.go:51:38: littleEndian.Uint16 b does not escape
./binary.go:56:43: littleEndian.PutUint16 b does not escape
./binary.go:62:38: littleEndian.Uint32 b does not escape
./binary.go:67:43: littleEndian.PutUint32 b does not escape
./binary.go:75:38: littleEndian.Uint64 b does not escape
./binary.go:81:43: littleEndian.PutUint64 b does not escape
./binary.go:99:35: bigEndian.Uint16 b does not escape
./binary.go:104:40: bigEndian.PutUint16 b does not escape
./binary.go:110:35: bigEndian.Uint32 b does not escape
./binary.go:115:40: bigEndian.PutUint32 b does not escape
./binary.go:123:35: bigEndian.Uint64 b does not escape
./binary.go:129:40: bigEndian.PutUint64 b does not escape
...
```
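
For illustration, here is a minimal standalone sketch of the same effect, using hypothetical names rather than the real package internals. Building it with `go build -gcflags='-m'` shows the buffer escaping through the interface call but staying on the stack through the concrete one:

```go
package main

// putter stands in for binary.ByteOrder in this illustration.
type putter interface {
	PutUint16([]byte, uint16)
}

type lePutter struct{}

func (lePutter) PutUint16(b []byte, v uint16) {
	b[0] = byte(v)
	b[1] = byte(v >> 8)
}

//go:noinline
func viaInterface(p putter, v uint16) {
	var b [2]byte // -m reports: moved to heap: b
	p.PutUint16(b[:], v)
}

//go:noinline
func viaConcrete(v uint16) {
	var b [2]byte // stays on the stack: the concrete method is known not to retain b
	lePutter{}.PutUint16(b[:], v)
}

func main() {
	viaInterface(lePutter{}, 42)
	viaConcrete(42)
}
```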

We could get tricky here and specialize the code for the LittleEndian and BigEndian ByteOrders, so that when one of them is provided as the desired order the allocation is avoided, as the optimization intended. However, this would require a significant amount of code duplication to convince escape analysis that the array is safe to leave on the stack (see the sketch below). In lieu of that, we should remove the "optimization" entirely, because it's not doing any good.
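
To make the trade-off concrete, here is a rough sketch of what that specialization could look like for a single fixed-size case, written from outside the package against the exported API (hypothetical code, not a proposal):

```go
package main

import (
	"encoding/binary"
	"io"
	"os"
)

// writeUint64 sketches per-order specialization: each branch declares its
// own array and calls a concrete implementation, so the slice never passes
// through the ByteOrder interface.
func writeUint64(w io.Writer, order binary.ByteOrder, v uint64) error {
	switch {
	case order == binary.LittleEndian:
		var b [8]byte
		binary.LittleEndian.PutUint64(b[:], v)
		// As discovered later in this thread, the w.Write interface call
		// alone still forces b to the heap, which ultimately sinks the idea.
		_, err := w.Write(b[:])
		return err
	case order == binary.BigEndian:
		var b [8]byte
		binary.BigEndian.PutUint64(b[:], v)
		_, err := w.Write(b[:])
		return err
	default:
		// Arbitrary ByteOrder: the allocation is unavoidable.
		bs := make([]byte, 8)
		order.PutUint64(bs, v)
		_, err := w.Write(bs)
		return err
	}
}

func main() {
	_ = writeUint64(os.Stdout, binary.LittleEndian, 42)
}
```

Note how every branch has to duplicate the buffer handling, and this covers just one width of one operation.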

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 31, 2018
I noticed in an `alloc_objects` heap profile on a 3-node cluster
restoring tpcc that more than 46% of all allocations were made in
`computeChecksumPostApply`. Specifically, these allocations were all
made in `Replica.sha512`. 28% of allocations were due to protobuf
marshaling of `hlc.LegacyTimestamp`. The other 18% was in `encoding/binary.Write`.

This removes both of these sources of per-key allocations. The first
allocation was avoided by sharing a byte buffer across protobuf marshals.
The second was avoided by removing the call to `binary.Write` (see
golang/go#27403). I confirmed that this is no
longer an issue by looking at heap profiles from before and after in a test
that performed a consistency check.

I plan to follow up on golang/go#27403 and
search for any other offenders in our codebase. I already see a few.

Release note: None
@agnivade (Contributor)

@nvanbenschoten - For the record, it would be good to have your Go version and `go env` output.

/cc @dr2chase @rasky

@agnivade added the Performance and NeedsInvestigation labels Aug 31, 2018
@agnivade added this to the Unplanned milestone Aug 31, 2018
@rasky (Member) commented Aug 31, 2018

AFAICT, Go interfaces as designed in Go 1 don't play well with escape analysis. Any object passed through them always escapes, period, and there is no way for the Go compiler to fix this without whole-program compilation, which is unlikely to ever happen.

In the specific case of binary.Read, we already noticed in another bug that the function should be force-inlined most of the time (at least any time it is called with data being a concrete type), because most of its reflection logic could be scraped away by optimizers; it's really meta-programming, happening at runtime for no good reason.

So I think this is a good use case for seeing if Go 2 generics can help design a new byteorder package that uses less runtime constructs (reflections and interfaces) and more compile time constructs, to avoid allocations and achieve more speed.

/cc @ianlancetaylor @rsc

craig bot pushed a commit to cockroachdb/cockroach that referenced this issue Aug 31, 2018
29419: storage: avoid per-kv allocations during consistency checks r=nvanbenschoten a=nvanbenschoten

@nvanbenschoten (Contributor, Author)

Thanks for the responses. This was with Go 1.10, but as @rasky mentioned, it's a general problem with the interaction between interfaces and escape analysis in Go 1. With the current design of binary.ByteOrder, this optimization never worked and never will.

With that in mind, would you be open to me sending a pull request to rip out this attempted optimization?

So I think this is a good use case for seeing if Go 2 generics can help design a new byteorder package that uses less runtime constructs (reflections and interfaces) and more compile time constructs, to avoid allocations and achieve more speed.

I completely agree, and the most recent draft of the generics design is looking promising. Note, though, that this will require generics to be implemented at least in part via compile-time specialization, which it sounds like is still an open question.
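
For concreteness, here is a speculative sketch of such a package, written with the generics syntax that eventually shipped in Go 1.18, long after this thread. All names here are hypothetical, and whether a given compiler turns the type-parameter call into a direct, escape-analysis-friendly call is exactly the open question above:

```go
// Package byteorder is a speculative sketch of a generics-based byte-order API.
package byteorder

// LittleEndian and BigEndian are zero-size marker types with concrete
// methods, replacing encoding/binary's interface values.
type LittleEndian struct{}

func (LittleEndian) PutUint32(b []byte, v uint32) {
	_ = b[3] // bounds check hint
	b[0] = byte(v)
	b[1] = byte(v >> 8)
	b[2] = byte(v >> 16)
	b[3] = byte(v >> 24)
}

type BigEndian struct{}

func (BigEndian) PutUint32(b []byte, v uint32) {
	_ = b[3] // bounds check hint
	b[0] = byte(v >> 24)
	b[1] = byte(v >> 16)
	b[2] = byte(v >> 8)
	b[3] = byte(v)
}

// Order restricts callers to the known implementations.
type Order interface {
	LittleEndian | BigEndian
	PutUint32([]byte, uint32)
}

// PutUint32 resolves the byte order through a type parameter rather than
// an interface value, so no interface value is ever formed.
func PutUint32[O Order](b []byte, v uint32) {
	var o O
	o.PutUint32(b, v)
}
```

A caller would instantiate it as `byteorder.PutUint32[byteorder.LittleEndian](buf, v)`, pushing the order choice to compile time.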

@rasky (Member) commented Aug 31, 2018

Note though that this will require that generics be implemented at least in part using compile-time specialization, which it sounds like is still an open question.

Either that, or we must make sure the compiler is smart enough when instantiating a generic whose type parameter is used in a type switch (is that possible at all?) or in a good subset of reflect calls that could get intrinsified.

@josharian (Contributor)

include an attempt to avoid heap allocations

git blame suggests to me that this was not an optimization attempt but merely an oversight while the code grew: https://golang.org/cl/12694048

If simplifying the code doesn’t cause a performance regression (I don’t see why it would, but I am AFK), that would be a welcome change.

@josharian added the Suggested, help wanted, and NeedsFix labels and removed the NeedsInvestigation label Sep 1, 2018
@josharian (Contributor)

Marking as suggested: the simplification appears straightforward, and the only other thing to do is confirm that there is no performance regression.

@iand (Contributor) commented Sep 3, 2018

I did a little testing on this and found that the use of io.ReadFull(r, bs) in Read and w.Write(bs) in Write moves b to the heap, so the change may not be as simple as expected.
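
A tiny standalone check (hypothetical, not from the thread) shows the same thing with no ByteOrder involved at all; the interface call to `w.Write` is enough to move the array to the heap:

```go
package main

import (
	"io"
	"os"
)

//go:noinline
func writeFixed(w io.Writer) {
	var b [8]byte // go build -gcflags='-m' reports: moved to heap: b
	_, _ = w.Write(b[:])
}

func main() {
	writeFixed(os.Stdout)
}
```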

@gopherbot

Change https://golang.org/cl/133135 mentions this issue: encoding/binary: simplify Read and Write

@josharian (Contributor)

@iand I'm not sure I follow. I tried it (CL 133135) and it appeared to work nicely. What am I missing?

@nvanbenschoten (Contributor, Author)

@josharian CL 133135 is what I was expecting to be the outcome of this issue. Thanks for beating me to it 😃.

I assume @iand was attempting to specialize Read and Write for the bigEndian and littleEndian ByteOrders. In that case, he's correct that passing the byte slice to the reader or writer forces it onto the heap, regardless of what tricks we pull with the order argument. Because of this, there's not much else we can do here.

@iand (Contributor) commented Sep 4, 2018

@josharian I was looking into the original suggestion of specialising by Endian to reduce allocations. I didn't look at the fallback of simplifying and accepting that it allocates.

@josharian (Contributor)

Ah, I see. Thanks, @iand. And yes, if there’s any performance value in specialization it is more likely to come from Writers (particularly Writers that buffer).

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 11, 2018
@golang locked and limited conversation to collaborators Sep 4, 2019