cmd/compile: make encoding/binary.LittleEndian.* have negligible inlining cost on amd64 #42958

josharian · 2020-12-02T22:25:59Z

encoding/binary.LittleEndian.* typically compile down to a single instruction on amd64. Their inlining cost is very high. I suggest we add a hack to the inlining budget to consider them to be an intrinsic, for the purposes of cost. Probably the same for other platforms as well.

cc @mdempsky

mvdan · 2020-12-03T08:52:33Z

Are there any other such std APIs that we should teach the inliner about? For example, I imagine many math/bits APIs also compile to one or two amd64 instructions.

cespare · 2020-12-03T09:25:29Z

Unsafe conversions (see #42739).

egonelbre · 2020-12-03T12:39:49Z

Related issue (#42788)

randall77 · 2020-12-03T16:18:58Z

Most of the math/bits APIs are semantically inlined, so they aren't really in the same bucket.

"semantically inlined" = the compiler treats these APIs specially. It throws away their Go implementation and uses a compiler-provided one. The good part is that it always optimizes (unless you give -N, I think?). The bad part is that if you copy the implementation into your own function, you don't get the optimizations.

encoding/binary is not semantically inlined. They just reduce to good code using standard optimizations. The good part of that is if you copy an encoding/binary body into your own function, the optimization still happens.

Not particularly relevant to this issue, other than to note that math/bits probably doesn't need to be included here.

bcmills · 2020-12-03T18:26:28Z

encoding/binary is not semantically inlined. They just reduce to good code using standard optimizations.

My understanding is that inlining cost is currently based on the function AST. Would it make sense to compute a “compiled inlining cost” and stash that somewhere (maybe in the export data) for more accurate cost estimates? (Is there already a separate issue for that, beyond #17566?)

randall77 · 2020-12-03T18:48:48Z

I don't think there is a separate issue.
Yes, cost estimates based on AST size certainly suck.
We could record with each function the count or bytes of instructions that it compiles to, which would be better.
Within a package this can get tricky because we inline everything before we compile anything, but cross-package that works.

mdempsky · 2020-12-03T18:50:47Z

Would it make sense to compute a “compiled inlining cost” and stash that somewhere (maybe in the export data) for more accurate cost estimates?

The long term aspiration is to do inlining after we're in SSA form and have had a chance to apply some initial optimizations. I think that's a ways off still though.

In the mean time, I think some ad-hoc special cases for notable standard library functions seems fine. Or maybe a std-only //go:inlinecost directive that allows use to override the inlining cost.

We could record with each function the count or bytes of instructions that it compiles to, which would be better.

That's an interesting idea.

josharian · 2020-12-03T19:24:17Z

We could record with each function the count or bytes of instructions that it compiles to, which would be better.

I’ve experimented with this.

It yields fairly drastically different inlining results—appends are suddenly crazy expensive, as is code that can panic, because the cold code paths count too. It’s probably still better, but I didn’t pursue it because of the inconsistency vs intra-package inlining.

dsnet · 2020-12-04T00:22:50Z

Whatever happens, I think we should avoid special casing just encoding/binary.LittleEndian.*. There are a number of equivalent functions that avoid the encoding/binary package as a dependency that may want to benefit from this as well (e.g., "compress/flate".load64)

josharian · 2020-12-25T05:07:25Z

@dsnet I hear you but special casing is easy and can happen ~immediately. Recognizing equivalent forms is definitely not easy in anything resembling the current inliner. These routines are used heavily in high performance code because it is one way to do something like vectorization. I'd rather not make the perfect to the enemy of the good here.

This allows IP manipulation operations to operate on IPs as two registers, which empirically leads to significant speedups, in particular on OOO superscalars where the two halves can be processed in parallel. You might expect that we could keep the representation as [16]byte, and do a cycle of BigEndian.Uint64+tweak+BigEndian.PutUint64, and that would compile down to efficient code. Unfortunately, due to a variety of missing optimizations in the Go compiler, that is not the case and that code turns into byte-wise operations. On the other hand, converting to a uint64 pair at construction results in efficient construction (a pair of MOVQ+BSWAPQ) and efficient twiddling operations (single-cycle arithmetic on 64-bit pairs). See also: - golang/go#41663 - golang/go#41684 - golang/go#42958 - The discussion in #63 name old time/op new time/op delta StdIPv4-8 146ns ± 2% 141ns ± 2% -3.42% (p=0.016 n=5+5) IPv4-8 120ns ± 1% 107ns ± 2% -10.65% (p=0.008 n=5+5) IPv4_inline-8 120ns ± 0% 118ns ± 1% -1.67% (p=0.016 n=4+5) StdIPv6-8 211ns ± 2% 215ns ± 1% +2.18% (p=0.008 n=5+5) IPv6-8 281ns ± 1% 252ns ± 1% -10.19% (p=0.008 n=5+5) IPv4Contains-8 11.8ns ± 4% 4.7ns ± 2% -60.00% (p=0.008 n=5+5) ParseIPv4-8 68.1ns ± 4% 78.8ns ± 1% +15.74% (p=0.008 n=5+5) ParseIPv6-8 419ns ± 1% 409ns ± 0% -2.40% (p=0.016 n=4+5) StdParseIPv4-8 73.7ns ± 1% 88.8ns ± 2% +20.50% (p=0.008 n=5+5) StdParseIPv6-8 132ns ± 2% 134ns ± 1% ~ (p=0.079 n=5+5) IPPrefixMasking/IPv4_/32-8 36.3ns ± 3% 4.8ns ± 4% -86.72% (p=0.008 n=5+5) IPPrefixMasking/IPv4_/17-8 39.0ns ± 0% 4.8ns ± 3% -87.78% (p=0.008 n=5+5) IPPrefixMasking/IPv4_/0-8 36.9ns ± 2% 4.8ns ± 4% -87.07% (p=0.008 n=5+5) IPPrefixMasking/IPv6_/128-8 32.7ns ± 1% 4.7ns ± 2% -85.47% (p=0.008 n=5+5) IPPrefixMasking/IPv6_/65-8 39.8ns ± 1% 4.7ns ± 1% -88.13% (p=0.008 n=5+5) IPPrefixMasking/IPv6_/0-8 40.7ns ± 1% 4.7ns ± 2% -88.41% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/128-8 136ns ± 3% 5ns ± 2% -96.53% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/65-8 142ns ± 2% 5ns ± 1% -96.65% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/0-8 143ns ± 2% 5ns ± 3% -96.67% (p=0.008 n=5+5) IPSetFuzz-8 22.7µs ± 2% 16.4µs ± 2% -27.84% (p=0.008 n=5+5) name old alloc/op new alloc/op delta StdIPv4-8 16.0B ± 0% 16.0B ± 0% ~ (all equal) IPv4-8 0.00B 0.00B ~ (all equal) IPv4_inline-8 0.00B 0.00B ~ (all equal) StdIPv6-8 16.0B ± 0% 16.0B ± 0% ~ (all equal) IPv6-8 16.0B ± 0% 16.0B ± 0% ~ (all equal) IPv4Contains-8 0.00B 0.00B ~ (all equal) ParseIPv4-8 0.00B 0.00B ~ (all equal) ParseIPv6-8 48.0B ± 0% 48.0B ± 0% ~ (all equal) StdParseIPv4-8 16.0B ± 0% 16.0B ± 0% ~ (all equal) StdParseIPv6-8 16.0B ± 0% 16.0B ± 0% ~ (all equal) IPPrefixMasking/IPv4_/32-8 0.00B 0.00B ~ (all equal) IPPrefixMasking/IPv4_/17-8 0.00B 0.00B ~ (all equal) IPPrefixMasking/IPv4_/0-8 0.00B 0.00B ~ (all equal) IPPrefixMasking/IPv6_/128-8 0.00B 0.00B ~ (all equal) IPPrefixMasking/IPv6_/65-8 0.00B 0.00B ~ (all equal) IPPrefixMasking/IPv6_/0-8 0.00B 0.00B ~ (all equal) IPPrefixMasking/IPv6_zone_/128-8 16.0B ± 0% 0.0B -100.00% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/65-8 16.0B ± 0% 0.0B -100.00% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/0-8 16.0B ± 0% 0.0B -100.00% (p=0.008 n=5+5) IPSetFuzz-8 2.60kB ± 0% 2.60kB ± 0% ~ (p=1.000 n=5+5) name old allocs/op new allocs/op delta StdIPv4-8 1.00 ± 0% 1.00 ± 0% ~ (all equal) IPv4-8 0.00 0.00 ~ (all equal) IPv4_inline-8 0.00 0.00 ~ (all equal) StdIPv6-8 1.00 ± 0% 1.00 ± 0% ~ (all equal) IPv6-8 1.00 ± 0% 1.00 ± 0% ~ (all equal) IPv4Contains-8 0.00 0.00 ~ (all equal) ParseIPv4-8 0.00 0.00 ~ (all equal) ParseIPv6-8 3.00 ± 0% 3.00 ± 0% ~ (all equal) StdParseIPv4-8 1.00 ± 0% 1.00 ± 0% ~ (all equal) StdParseIPv6-8 1.00 ± 0% 1.00 ± 0% ~ (all equal) IPPrefixMasking/IPv4_/32-8 0.00 0.00 ~ (all equal) IPPrefixMasking/IPv4_/17-8 0.00 0.00 ~ (all equal) IPPrefixMasking/IPv4_/0-8 0.00 0.00 ~ (all equal) IPPrefixMasking/IPv6_/128-8 0.00 0.00 ~ (all equal) IPPrefixMasking/IPv6_/65-8 0.00 0.00 ~ (all equal) IPPrefixMasking/IPv6_/0-8 0.00 0.00 ~ (all equal) IPPrefixMasking/IPv6_zone_/128-8 1.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/65-8 1.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/0-8 1.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) IPSetFuzz-8 33.0 ± 0% 33.0 ± 0% ~ (all equal) Signed-off-by: David Anderson <dave@natulte.net>

This allows IP manipulation operations to operate on IPs as two registers, which empirically leads to significant speedups, in particular on OOO superscalars where the two halves can be processed in parallel. You might expect that we could keep the representation as [16]byte, and do a cycle of BigEndian.Uint64+tweak+BigEndian.PutUint64, and that would compile down to efficient code. Unfortunately, due to a variety of missing optimizations in the Go compiler, that is not the case and that code turns into byte-wise operations. On the other hand, converting to a uint64 pair at construction results in efficient construction (a pair of MOVQ+BSWAPQ) and efficient twiddling operations (single-cycle arithmetic on 64-bit pairs). See also: - golang/go#41663 - golang/go#41684 - golang/go#42958 - The discussion in inetaf#63 name old time/op new time/op delta StdIPv4-8 146ns ± 2% 141ns ± 2% -3.42% (p=0.016 n=5+5) IPv4-8 120ns ± 1% 107ns ± 2% -10.65% (p=0.008 n=5+5) IPv4_inline-8 120ns ± 0% 118ns ± 1% -1.67% (p=0.016 n=4+5) StdIPv6-8 211ns ± 2% 215ns ± 1% +2.18% (p=0.008 n=5+5) IPv6-8 281ns ± 1% 252ns ± 1% -10.19% (p=0.008 n=5+5) IPv4Contains-8 11.8ns ± 4% 4.7ns ± 2% -60.00% (p=0.008 n=5+5) ParseIPv4-8 68.1ns ± 4% 78.8ns ± 1% +15.74% (p=0.008 n=5+5) ParseIPv6-8 419ns ± 1% 409ns ± 0% -2.40% (p=0.016 n=4+5) StdParseIPv4-8 73.7ns ± 1% 88.8ns ± 2% +20.50% (p=0.008 n=5+5) StdParseIPv6-8 132ns ± 2% 134ns ± 1% ~ (p=0.079 n=5+5) IPPrefixMasking/IPv4_/32-8 36.3ns ± 3% 4.8ns ± 4% -86.72% (p=0.008 n=5+5) IPPrefixMasking/IPv4_/17-8 39.0ns ± 0% 4.8ns ± 3% -87.78% (p=0.008 n=5+5) IPPrefixMasking/IPv4_/0-8 36.9ns ± 2% 4.8ns ± 4% -87.07% (p=0.008 n=5+5) IPPrefixMasking/IPv6_/128-8 32.7ns ± 1% 4.7ns ± 2% -85.47% (p=0.008 n=5+5) IPPrefixMasking/IPv6_/65-8 39.8ns ± 1% 4.7ns ± 1% -88.13% (p=0.008 n=5+5) IPPrefixMasking/IPv6_/0-8 40.7ns ± 1% 4.7ns ± 2% -88.41% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/128-8 136ns ± 3% 5ns ± 2% -96.53% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/65-8 142ns ± 2% 5ns ± 1% -96.65% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/0-8 143ns ± 2% 5ns ± 3% -96.67% (p=0.008 n=5+5) IPSetFuzz-8 22.7µs ± 2% 16.4µs ± 2% -27.84% (p=0.008 n=5+5) name old alloc/op new alloc/op delta StdIPv4-8 16.0B ± 0% 16.0B ± 0% ~ (all equal) IPv4-8 0.00B 0.00B ~ (all equal) IPv4_inline-8 0.00B 0.00B ~ (all equal) StdIPv6-8 16.0B ± 0% 16.0B ± 0% ~ (all equal) IPv6-8 16.0B ± 0% 16.0B ± 0% ~ (all equal) IPv4Contains-8 0.00B 0.00B ~ (all equal) ParseIPv4-8 0.00B 0.00B ~ (all equal) ParseIPv6-8 48.0B ± 0% 48.0B ± 0% ~ (all equal) StdParseIPv4-8 16.0B ± 0% 16.0B ± 0% ~ (all equal) StdParseIPv6-8 16.0B ± 0% 16.0B ± 0% ~ (all equal) IPPrefixMasking/IPv4_/32-8 0.00B 0.00B ~ (all equal) IPPrefixMasking/IPv4_/17-8 0.00B 0.00B ~ (all equal) IPPrefixMasking/IPv4_/0-8 0.00B 0.00B ~ (all equal) IPPrefixMasking/IPv6_/128-8 0.00B 0.00B ~ (all equal) IPPrefixMasking/IPv6_/65-8 0.00B 0.00B ~ (all equal) IPPrefixMasking/IPv6_/0-8 0.00B 0.00B ~ (all equal) IPPrefixMasking/IPv6_zone_/128-8 16.0B ± 0% 0.0B -100.00% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/65-8 16.0B ± 0% 0.0B -100.00% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/0-8 16.0B ± 0% 0.0B -100.00% (p=0.008 n=5+5) IPSetFuzz-8 2.60kB ± 0% 2.60kB ± 0% ~ (p=1.000 n=5+5) name old allocs/op new allocs/op delta StdIPv4-8 1.00 ± 0% 1.00 ± 0% ~ (all equal) IPv4-8 0.00 0.00 ~ (all equal) IPv4_inline-8 0.00 0.00 ~ (all equal) StdIPv6-8 1.00 ± 0% 1.00 ± 0% ~ (all equal) IPv6-8 1.00 ± 0% 1.00 ± 0% ~ (all equal) IPv4Contains-8 0.00 0.00 ~ (all equal) ParseIPv4-8 0.00 0.00 ~ (all equal) ParseIPv6-8 3.00 ± 0% 3.00 ± 0% ~ (all equal) StdParseIPv4-8 1.00 ± 0% 1.00 ± 0% ~ (all equal) StdParseIPv6-8 1.00 ± 0% 1.00 ± 0% ~ (all equal) IPPrefixMasking/IPv4_/32-8 0.00 0.00 ~ (all equal) IPPrefixMasking/IPv4_/17-8 0.00 0.00 ~ (all equal) IPPrefixMasking/IPv4_/0-8 0.00 0.00 ~ (all equal) IPPrefixMasking/IPv6_/128-8 0.00 0.00 ~ (all equal) IPPrefixMasking/IPv6_/65-8 0.00 0.00 ~ (all equal) IPPrefixMasking/IPv6_/0-8 0.00 0.00 ~ (all equal) IPPrefixMasking/IPv6_zone_/128-8 1.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/65-8 1.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) IPPrefixMasking/IPv6_zone_/0-8 1.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) IPSetFuzz-8 33.0 ± 0% 33.0 ± 0% ~ (all equal) Signed-off-by: David Anderson <dave@natulte.net> Signed-off-by: Stefan Majer <stefan.majer@gmail.com>

dsnet · 2020-12-27T05:26:54Z

Random thought: Should little-endian and big-endian functionality be moved to math/bits?

With the acceptance of #395, you could imagine an API that operates on fixed-width arrays rather than pointers:

func LoadUint32LE(v [4]byte) uint32
func StoreUint32LE(v uint32) [4]byte

gopherbot · 2021-09-14T17:20:01Z

Change https://golang.org/cl/349931 mentions this issue: cmd/compile: make encoding/binary loads/stores cheaper to inline

gopherbot · 2022-09-15T07:37:32Z

Change https://go.dev/cl/431015 mentions this issue: cmd/compile: make encoding/binary appends cheaper to inline

Go 1.19 introduce new append-like APIs in package encoding/binary, this change teaches the inliner to treat calls to these methods as cheap, so that code using them will be more inlineable. Updates #42958 Change-Id: Ie3dd4906e285430f435bdedbf8a11fdffce9302d Reviewed-on: https://go-review.googlesource.com/c/go/+/431015 Auto-Submit: Keith Randall <khr@golang.org> Run-TryBot: Keith Randall <khr@golang.org> TryBot-Result: Gopher Robot <gobot@golang.org> Auto-Submit: Jenny Rakoczy <jenny@golang.org> Reviewed-by: Jenny Rakoczy <jenny@golang.org> Reviewed-by: Keith Randall <khr@golang.org> Run-TryBot: Jenny Rakoczy <jenny@golang.org> Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org> Reviewed-by: Keith Randall <khr@google.com>

josharian added the Performance label Dec 2, 2020

ALTree added the NeedsDecision label Dec 3, 2020

josharian mentioned this issue Dec 25, 2020

Change the representation of IP to a pair of uint64s. inetaf/netaddr#63

Merged

dsnet mentioned this issue Aug 24, 2021

Access to uint64 pair? inetaf/netaddr#214

Open

gopherbot closed this as completed in 17c513e Oct 6, 2021

subtle-byte mentioned this issue Aug 12, 2022

cmd/compile: low function inline cost #54407

Closed

dsnet mentioned this issue Sep 19, 2022

encoding/binary: should have minimal dependency on reflect #54097

Open

golang locked and limited conversation to collaborators Sep 15, 2023

gopherbot added the FrozenDueToAge label Sep 15, 2023

dmitshur added this to the Go1.18 milestone May 13, 2024

dmitshur added NeedsFix and removed NeedsDecision labels May 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: make encoding/binary.LittleEndian.* have negligible inlining cost on amd64 #42958

cmd/compile: make encoding/binary.LittleEndian.* have negligible inlining cost on amd64 #42958

josharian commented Dec 2, 2020

mvdan commented Dec 3, 2020

cespare commented Dec 3, 2020

egonelbre commented Dec 3, 2020

randall77 commented Dec 3, 2020

bcmills commented Dec 3, 2020

randall77 commented Dec 3, 2020

mdempsky commented Dec 3, 2020

josharian commented Dec 3, 2020

dsnet commented Dec 4, 2020

josharian commented Dec 25, 2020

dsnet commented Dec 27, 2020

gopherbot commented Sep 14, 2021

gopherbot commented Sep 15, 2022

cmd/compile: make encoding/binary.LittleEndian.* have negligible inlining cost on amd64 #42958

cmd/compile: make encoding/binary.LittleEndian.* have negligible inlining cost on amd64 #42958

Comments

josharian commented Dec 2, 2020

mvdan commented Dec 3, 2020

cespare commented Dec 3, 2020

egonelbre commented Dec 3, 2020

randall77 commented Dec 3, 2020

bcmills commented Dec 3, 2020

randall77 commented Dec 3, 2020

mdempsky commented Dec 3, 2020

josharian commented Dec 3, 2020

dsnet commented Dec 4, 2020

josharian commented Dec 25, 2020

dsnet commented Dec 27, 2020

gopherbot commented Sep 14, 2021

gopherbot commented Sep 15, 2022