
encoding/binary: fast-paths in Read and Write allocate despite attempted optimization #27403

Closed
nvanbenschoten opened this issue Aug 31, 2018 · 12 comments
Labels
FrozenDueToAge help wanted NeedsFix The path to resolution is known, but the work has not been done. Performance Suggested Issues that may be good for new contributors looking for work to do.
Milestone

Comments

@nvanbenschoten (Contributor)

encoding/binary.Read and encoding/binary.Write both include an attempt to avoid heap allocations by pointing a slice at a stack-allocated array with sufficient capacity to fit most basic types.

```go
func Read(r io.Reader, order ByteOrder, data interface{}) error {
	// Fast path for basic types and slices.
	if n := intDataSize(data); n != 0 {
		var b [8]byte
		var bs []byte
		if n > len(b) {
			bs = make([]byte, n)
		} else {
			bs = b[:n]
		}
		// ...
```

```go
func Write(w io.Writer, order ByteOrder, data interface{}) error {
	// Fast path for basic types and slices.
	if n := intDataSize(data); n != 0 {
		var b [8]byte
		var bs []byte
		if n > len(b) {
			bs = make([]byte, n)
		} else {
			bs = b[:n]
		}
		// ...
```

The "optimization" is fooling itself because the array escapes from the stack and is moved onto the heap anyway. This means that at best the optimization does nothing and at worst the optimization actually results in either an oversized allocation or an entirely unnecessary second allocation.

```
binary$ go build -gcflags='-m' 2>&1 | grep 'moved to heap'
./binary.go:164:7: moved to heap: b
./binary.go:263:7: moved to heap: b
```
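
The cost is easy to confirm empirically. A minimal benchmark along these lines (hypothetical, not part of the original report; `io.Discard` postdates this thread, which would have used `ioutil.Discard`) should report at least one allocation per call even for a fixed 8-byte value:

```go
package binary_test

import (
	"encoding/binary"
	"io"
	"testing"
)

// BenchmarkWriteUint64 makes the allocation visible: binary.Write of a
// fixed-size value allocates on every call despite the intended fast path.
func BenchmarkWriteUint64(b *testing.B) {
	b.ReportAllocs()
	var v uint64 = 42
	for i := 0; i < b.N; i++ {
		_ = binary.Write(io.Discard, binary.LittleEndian, v)
	}
}
```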

The array is moved to the heap because escape analysis cannot prove that the slice pointing at it doesn't itself escape. It can't prove that because the slice is passed to the provided ByteOrder interface:

```go
type ByteOrder interface {
	Uint16([]byte) uint16
	Uint32([]byte) uint32
	Uint64([]byte) uint64
	PutUint16([]byte, uint16)
	PutUint32([]byte, uint32)
	PutUint64([]byte, uint64)
	String() string
}
```

Escape analysis can't prove that the slice won't escape when passed to an arbitrary interface implementation, even though it can and does know that the slice won't escape if passed to a littleEndian or bigEndian:

```
binary$ go build -gcflags='-m' 2>&1 | grep 'does not escape'
./binary.go:51:38: littleEndian.Uint16 b does not escape
./binary.go:56:43: littleEndian.PutUint16 b does not escape
./binary.go:62:38: littleEndian.Uint32 b does not escape
./binary.go:67:43: littleEndian.PutUint32 b does not escape
./binary.go:75:38: littleEndian.Uint64 b does not escape
./binary.go:81:43: littleEndian.PutUint64 b does not escape
./binary.go:99:35: bigEndian.Uint16 b does not escape
./binary.go:104:40: bigEndian.PutUint16 b does not escape
./binary.go:110:35: bigEndian.Uint32 b does not escape
./binary.go:115:40: bigEndian.PutUint32 b does not escape
./binary.go:123:35: bigEndian.Uint64 b does not escape
./binary.go:129:40: bigEndian.PutUint64 b does not escape
...
```
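
For illustration, here is a minimal standalone sketch of the same effect, using hypothetical names rather than the real package internals. Building it with `go build -gcflags='-m'` shows the buffer escaping through the interface call but staying on the stack through the concrete one:

```go
package main

// putter stands in for binary.ByteOrder in this illustration.
type putter interface {
	PutUint16([]byte, uint16)
}

type lePutter struct{}

func (lePutter) PutUint16(b []byte, v uint16) {
	b[0] = byte(v)
	b[1] = byte(v >> 8)
}

//go:noinline
func viaInterface(p putter, v uint16) {
	var b [2]byte // -m reports: moved to heap: b
	p.PutUint16(b[:], v)
}

//go:noinline
func viaConcrete(v uint16) {
	var b [2]byte // stays on the stack: the concrete method is known not to retain b
	lePutter{}.PutUint16(b[:], v)
}

func main() {
	viaInterface(lePutter{}, 42)
	viaConcrete(42)
}
```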

We could get tricky here and specialize the code for the LittleEndian and BigEndian ByteOrders, so that when one of them is provided as the desired order the allocation is avoided, as the optimization intended. However, this would require a significant amount of code duplication to convince escape analysis that the array is safe to leave on the stack (see the sketch below). In lieu of that, we should remove the "optimization" entirely, because it's not doing any good.
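
To make the trade-off concrete, here is a rough sketch of what that specialization could look like for a single fixed-size case, written from outside the package against the exported API (hypothetical code, not a proposal):

```go
package main

import (
	"encoding/binary"
	"io"
	"os"
)

// writeUint64 sketches per-order specialization: each branch declares its
// own array and calls a concrete implementation, so the slice never passes
// through the ByteOrder interface.
func writeUint64(w io.Writer, order binary.ByteOrder, v uint64) error {
	switch {
	case order == binary.LittleEndian:
		var b [8]byte
		binary.LittleEndian.PutUint64(b[:], v)
		// As discovered later in this thread, the w.Write interface call
		// alone still forces b to the heap, which ultimately sinks the idea.
		_, err := w.Write(b[:])
		return err
	case order == binary.BigEndian:
		var b [8]byte
		binary.BigEndian.PutUint64(b[:], v)
		_, err := w.Write(b[:])
		return err
	default:
		// Arbitrary ByteOrder: the allocation is unavoidable.
		bs := make([]byte, 8)
		order.PutUint64(bs, v)
		_, err := w.Write(bs)
		return err
	}
}

func main() {
	_ = writeUint64(os.Stdout, binary.LittleEndian, 42)
}
```

Note how every branch has to duplicate the buffer handling, and this covers just one width of one operation.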

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 31, 2018
I noticed in an `alloc_objects` heap profile on a 3-node cluster
restoring tpcc that more than 46% of all allocations were made in
`computeChecksumPostApply`. Specifically, these allocations were all
made in `Replica.sha512`. 28% of allocations were due to protobuf
marshaling of `hlc.LegacyTimestamp`. The other 18% was in `encoding/binary.Write`.

This removes both of these sources of per-key allocations. The first
allocation was avoided by sharing a byte buffer across protobuf marshals.
The second was avoided by removing the call to `binary.Write` (see
golang/go#27403). I confirmed that this is no
longer an issue by looking at heap profiles from before and after in a test
that performed a consistency check.

I plan to follow up on golang/go#27403 and
search for any other offenders in our codebase. I already see a few.

Release note: None
@agnivade (Contributor)

@nvanbenschoten - For the record, it would be good to have your Go version and `go env` output.

/cc @dr2chase @rasky

@agnivade added the Performance and NeedsInvestigation labels Aug 31, 2018
@agnivade added this to the Unplanned milestone Aug 31, 2018
@rasky (Member) commented Aug 31, 2018

AFAICT, Go interfaces as designed in Go 1 don't play well with escape analysis. Any object passed through them always escapes, period, and there is no way for the Go compiler to fix this without whole-program compilation, which is unlikely to ever happen.

In the specific case of binary.Read, we already noticed in another bug that the function should be force-inlined most of the time (at least any time it is called with data being a concrete type), because most of its reflection logic could be scraped away by optimizers; it's really meta-programming, happening at runtime for no good reason.

So I think this is a good use case for seeing if Go 2 generics can help design a new byteorder package that uses less runtime constructs (reflections and interfaces) and more compile time constructs, to avoid allocations and achieve more speed.

/cc @ianlancetaylor @rsc

craig bot pushed a commit to cockroachdb/cockroach that referenced this issue Aug 31, 2018
29419: storage: avoid per-kv allocations during consistency checks r=nvanbenschoten a=nvanbenschoten

@nvanbenschoten (Contributor, Author)

Thanks for the responses. This was with Go 1.10, but as @rasky mentioned, it's a general problem with the interaction between interfaces and escape analysis in Go 1. With the current design of binary.ByteOrder, this optimization never worked and never will.

With that in mind, would you be open to me sending a pull request to rip out this attempted optimization?

So I think this is a good use case for seeing if Go 2 generics can help design a new byteorder package that uses less runtime constructs (reflections and interfaces) and more compile time constructs, to avoid allocations and achieve more speed.

I completely agree, and the most recent draft of the generics design is looking promising. Note, though, that this will require generics to be implemented at least in part via compile-time specialization, which it sounds like is still an open question.
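
For concreteness, here is a speculative sketch of such a package, written with the generics syntax that eventually shipped in Go 1.18, long after this thread. All names here are hypothetical, and whether a given compiler turns the type-parameter call into a direct, escape-analysis-friendly call is exactly the open question above:

```go
// Package byteorder is a speculative sketch of a generics-based byte-order API.
package byteorder

// LittleEndian and BigEndian are zero-size marker types with concrete
// methods, replacing encoding/binary's interface values.
type LittleEndian struct{}

func (LittleEndian) PutUint32(b []byte, v uint32) {
	_ = b[3] // bounds check hint
	b[0] = byte(v)
	b[1] = byte(v >> 8)
	b[2] = byte(v >> 16)
	b[3] = byte(v >> 24)
}

type BigEndian struct{}

func (BigEndian) PutUint32(b []byte, v uint32) {
	_ = b[3] // bounds check hint
	b[0] = byte(v >> 24)
	b[1] = byte(v >> 16)
	b[2] = byte(v >> 8)
	b[3] = byte(v)
}

// Order restricts callers to the known implementations.
type Order interface {
	LittleEndian | BigEndian
	PutUint32([]byte, uint32)
}

// PutUint32 resolves the byte order through a type parameter rather than
// an interface value, so no interface value is ever formed.
func PutUint32[O Order](b []byte, v uint32) {
	var o O
	o.PutUint32(b, v)
}
```

A caller would instantiate it as `byteorder.PutUint32[byteorder.LittleEndian](buf, v)`, pushing the order choice to compile time.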

@rasky (Member) commented Aug 31, 2018

Note though that this will require that generics be implemented at least in part using compile-time specialization, which it sounds like is still an open question.

Either that, or we must make sure the compiler is smart enough when instantiating a generic whose type parameter is used in a type switch (is that possible at all?) or in a good subset of reflect calls that could get intrinsified.

@josharian (Contributor)

include an attempt to avoid heap allocations

git blame suggests to me that this was not an optimization attempt but merely an oversight while the code grew: https://golang.org/cl/12694048

If simplifying the code doesn’t cause a performance regression (I don’t see why it would, but I am AFK), that would be a welcome change.

@josharian added the Suggested, help wanted, and NeedsFix labels and removed the NeedsInvestigation label Sep 1, 2018
@josharian (Contributor)

Marking as suggested: the simplification appears straightforward, and the only other thing to do is confirm that there is no performance regression.

@iand (Contributor) commented Sep 3, 2018

I did a little testing on this and found that the use of io.ReadFull(r, bs) in Read and w.Write(bs) in Write moves b to the heap, so the change may not be as simple as expected.
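
A tiny standalone check (hypothetical, not from the thread) shows the same thing with no ByteOrder involved at all; the interface call to `w.Write` is enough to move the array to the heap:

```go
package main

import (
	"io"
	"os"
)

//go:noinline
func writeFixed(w io.Writer) {
	var b [8]byte // go build -gcflags='-m' reports: moved to heap: b
	_, _ = w.Write(b[:])
}

func main() {
	writeFixed(os.Stdout)
}
```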

@gopherbot

Change https://golang.org/cl/133135 mentions this issue: encoding/binary: simplify Read and Write

@josharian (Contributor)

@iand I'm not sure I follow. I tried it (CL 133135) and it appeared to work nicely. What am I missing?

@nvanbenschoten (Contributor, Author)

@josharian CL 133135 is what I was expecting to be the outcome of this issue. Thanks for beating me to it 😃.

I assume @iand was attempting to specialize Read and Write for the bigEndian and littleEndian ByteOrders. In that case, he's correct that passing the byte slice to the reader or writer forces it onto the heap, regardless of what tricks we pull with the order argument. Because of this, there's not much else we can do here.

@iand (Contributor) commented Sep 4, 2018

@josharian I was looking into the original suggestion of specialising by Endian to reduce allocations. I didn't look at the fallback of simplifying and accepting that it allocates.

@josharian (Contributor)

Ah, I see. Thanks, @iand. And yes, if there’s any performance value in specialization it is more likely to come from Writers (particularly Writers that buffer).

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 11, 2018
@golang locked and limited conversation to collaborators Sep 4, 2019