Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: optimize large structs #24416

Open
randall77 opened this issue Mar 15, 2018 · 9 comments
Open

cmd/compile: optimize large structs #24416

randall77 opened this issue Mar 15, 2018 · 9 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done. Performance
Milestone

Comments

@randall77
Copy link
Contributor

The compiler currently compiles large structs conservatively. If a struct type has more than 4 fields (or a few other conditions), we treat that type as unSSAable. All operations on variables of that type go to the stack, as if their address was taken.

This is suboptimal in various ways. For example:

type T struct {
	a, b, c, d int
}

func f(x *T) {
	t := T{}
	*x = t
}

type U struct {
	a, b, c, d, e int
}

func g(x *U) {
	u := U{}
	*x = u
}

f is compiled optimally, to:

	XORPS	X0, X0
	MOVQ	"".x+8(SP), AX
	MOVUPS	X0, (AX)
	MOVUPS	X0, 16(AX)
	RET

g is quite a bit worse:

	MOVQ	BP, 40(SP)
	LEAQ	40(SP), BP
	MOVQ	$0, "".u(SP)
	XORPS	X0, X0
	MOVUPS	X0, "".u+8(SP)
	MOVUPS	X0, "".u+24(SP)
	MOVQ	"".u(SP), AX
	MOVQ	"".x+56(SP), CX
	MOVQ	AX, (CX)
	LEAQ	8(CX), DI
	LEAQ	"".u+8(SP), SI
	DUFFCOPY	$868
	MOVQ	40(SP), BP
	ADDQ	$48, SP
	RET

We zero a temporary variable on the stack, then copy it to the destination.

We should process large structs through SSA as well. This will require a fair amount of work in the SSA backend to introduce struct builders, selectors of arbitrary width, stack allocation of large types, maybe heap allocation if they are really huge, etc.

Arrays of size > 1 are in a similar state, but they are somewhat harder because non-constant indexes add an additional complication.

@josharian
Copy link
Contributor

Just to ask an obvious question, is the problem that we want to handle all n, or just that n<=4 is too small? If the latter, and a reasonable cap is say 16, it might be easier to just extend the current approach, maybe using a bit of codegen magic. Probably we should Do It Right, but worth asking...

@randall77
Copy link
Contributor Author

My intent is to handle all n.

@TocarIP
Copy link
Contributor

TocarIP commented Mar 21, 2018

Looks like this should also help #20859

@andybons andybons added this to the Go1.11 milestone Mar 26, 2018
@andybons andybons added the NeedsFix The path to resolution is known, but the work has not been done. label Mar 26, 2018
@andybons andybons modified the milestones: Go1.11, Unplanned Mar 26, 2018
@gopherbot
Copy link

Change https://golang.org/cl/106495 mentions this issue: cmd/compile: add some generic composite type optimizations

gopherbot pushed a commit that referenced this issue May 8, 2018
Propagate values through some wide Zero/Move operations. Among
other things this allows us to optimize some kinds of array
initialization. For example, the following code no longer
requires a temporary be allocated on the stack. Instead it
writes the values directly into the return value.

func f(i uint32) [4]uint32 {
    return [4]uint32{i, i+1, i+2, i+3}
}

The return value is unnecessarily cleared but removing that is
probably a task for dead store analysis (I think it needs to
be able to match multiple Store ops to wide Zero ops).

In order to reliably remove stack variables that are rendered
unnecessary by these new rules I've added a new generic version
of the unread autos elimination pass.

These rules are triggered more than 5000 times when building and
testing the standard library.

Updates #15925 (fixes for arrays of up to 4 elements).
Updates #24386 (fixes for up to 4 kept elements).
Updates #24416.

compilebench results:

name       old time/op       new time/op       delta
Template         353ms ± 5%        359ms ± 3%    ~     (p=0.143 n=10+10)
Unicode          219ms ± 1%        217ms ± 4%    ~     (p=0.740 n=7+10)
GoTypes          1.26s ± 1%        1.26s ± 2%    ~     (p=0.549 n=9+10)
Compiler         6.00s ± 1%        6.08s ± 1%  +1.42%  (p=0.000 n=9+8)
SSA              15.3s ± 2%        15.6s ± 1%  +2.43%  (p=0.000 n=10+10)
Flate            237ms ± 2%        240ms ± 2%  +1.31%  (p=0.015 n=10+10)
GoParser         285ms ± 1%        285ms ± 1%    ~     (p=0.878 n=8+8)
Reflect          797ms ± 3%        807ms ± 2%    ~     (p=0.065 n=9+10)
Tar              334ms ± 0%        335ms ± 4%    ~     (p=0.460 n=8+10)
XML              419ms ± 0%        423ms ± 1%  +0.91%  (p=0.001 n=7+9)
StdCmd           46.0s ± 0%        46.4s ± 0%  +0.85%  (p=0.000 n=9+9)

name       old user-time/op  new user-time/op  delta
Template         337ms ± 3%        346ms ± 5%    ~     (p=0.053 n=9+10)
Unicode          205ms ±10%        205ms ± 8%    ~     (p=1.000 n=10+10)
GoTypes          1.22s ± 2%        1.21s ± 3%    ~     (p=0.436 n=10+10)
Compiler         5.85s ± 1%        5.93s ± 0%  +1.46%  (p=0.000 n=10+8)
SSA              14.9s ± 1%        15.3s ± 1%  +2.62%  (p=0.000 n=10+10)
Flate            229ms ± 4%        228ms ± 6%    ~     (p=0.796 n=10+10)
GoParser         271ms ± 3%        275ms ± 4%    ~     (p=0.165 n=10+10)
Reflect          779ms ± 5%        775ms ± 2%    ~     (p=0.971 n=10+10)
Tar              317ms ± 4%        319ms ± 5%    ~     (p=0.853 n=10+10)
XML              404ms ± 4%        409ms ± 5%    ~     (p=0.436 n=10+10)

name       old alloc/op      new alloc/op      delta
Template        34.9MB ± 0%       35.0MB ± 0%  +0.26%  (p=0.000 n=10+10)
Unicode         29.3MB ± 0%       29.3MB ± 0%  +0.02%  (p=0.000 n=10+10)
GoTypes          115MB ± 0%        115MB ± 0%  +0.30%  (p=0.000 n=10+10)
Compiler         519MB ± 0%        521MB ± 0%  +0.30%  (p=0.000 n=10+10)
SSA             1.55GB ± 0%       1.57GB ± 0%  +1.34%  (p=0.000 n=10+9)
Flate           24.1MB ± 0%       24.2MB ± 0%  +0.10%  (p=0.000 n=10+10)
GoParser        28.1MB ± 0%       28.1MB ± 0%  +0.07%  (p=0.000 n=10+10)
Reflect         78.7MB ± 0%       78.7MB ± 0%  +0.03%  (p=0.000 n=8+10)
Tar             34.4MB ± 0%       34.5MB ± 0%  +0.12%  (p=0.000 n=10+10)
XML             43.2MB ± 0%       43.2MB ± 0%  +0.13%  (p=0.000 n=10+10)

name       old allocs/op     new allocs/op     delta
Template          330k ± 0%         330k ± 0%  -0.01%  (p=0.017 n=10+10)
Unicode           337k ± 0%         337k ± 0%  +0.01%  (p=0.000 n=9+10)
GoTypes          1.15M ± 0%        1.15M ± 0%  +0.03%  (p=0.000 n=10+10)
Compiler         4.77M ± 0%        4.77M ± 0%  +0.03%  (p=0.000 n=9+10)
SSA              12.5M ± 0%        12.6M ± 0%  +1.16%  (p=0.000 n=10+10)
Flate             221k ± 0%         221k ± 0%  +0.05%  (p=0.000 n=9+10)
GoParser          275k ± 0%         275k ± 0%  +0.01%  (p=0.014 n=10+9)
Reflect           944k ± 0%         944k ± 0%  -0.02%  (p=0.000 n=10+10)
Tar               324k ± 0%         323k ± 0%  -0.12%  (p=0.000 n=10+10)
XML               384k ± 0%         384k ± 0%  -0.01%  (p=0.001 n=10+10)

name       old object-bytes  new object-bytes  delta
Template         476kB ± 0%        476kB ± 0%  -0.04%  (p=0.000 n=10+10)
Unicode          218kB ± 0%        218kB ± 0%    ~     (all equal)
GoTypes         1.58MB ± 0%       1.58MB ± 0%  -0.04%  (p=0.000 n=10+10)
Compiler        6.25MB ± 0%       6.24MB ± 0%  -0.09%  (p=0.000 n=10+10)
SSA             15.9MB ± 0%       16.1MB ± 0%  +1.22%  (p=0.000 n=10+10)
Flate            304kB ± 0%        304kB ± 0%  -0.13%  (p=0.000 n=10+10)
GoParser         370kB ± 0%        370kB ± 0%  -0.00%  (p=0.000 n=10+10)
Reflect         1.27MB ± 0%       1.27MB ± 0%  -0.12%  (p=0.000 n=10+10)
Tar              421kB ± 0%        419kB ± 0%  -0.64%  (p=0.000 n=10+10)
XML              518kB ± 0%        517kB ± 0%  -0.12%  (p=0.000 n=10+10)

name       old export-bytes  new export-bytes  delta
Template        16.7kB ± 0%       16.7kB ± 0%    ~     (all equal)
Unicode         6.52kB ± 0%       6.52kB ± 0%    ~     (all equal)
GoTypes         29.2kB ± 0%       29.2kB ± 0%    ~     (all equal)
Compiler        88.0kB ± 0%       88.0kB ± 0%    ~     (all equal)
SSA              109kB ± 0%        109kB ± 0%    ~     (all equal)
Flate           4.49kB ± 0%       4.49kB ± 0%    ~     (all equal)
GoParser        8.10kB ± 0%       8.10kB ± 0%    ~     (all equal)
Reflect         7.71kB ± 0%       7.71kB ± 0%    ~     (all equal)
Tar             9.15kB ± 0%       9.15kB ± 0%    ~     (all equal)
XML             12.3kB ± 0%       12.3kB ± 0%    ~     (all equal)

name       old text-bytes    new text-bytes    delta
HelloSize        676kB ± 0%        672kB ± 0%  -0.59%  (p=0.000 n=10+10)
CmdGoSize       7.26MB ± 0%       7.24MB ± 0%  -0.18%  (p=0.000 n=10+10)

name       old data-bytes    new data-bytes    delta
HelloSize       10.2kB ± 0%       10.2kB ± 0%    ~     (all equal)
CmdGoSize        248kB ± 0%        248kB ± 0%    ~     (all equal)

name       old bss-bytes     new bss-bytes     delta
HelloSize        125kB ± 0%        125kB ± 0%    ~     (all equal)
CmdGoSize        145kB ± 0%        145kB ± 0%    ~     (all equal)

name       old exe-bytes     new exe-bytes     delta
HelloSize       1.46MB ± 0%       1.45MB ± 0%  -0.31%  (p=0.000 n=10+10)
CmdGoSize       14.7MB ± 0%       14.7MB ± 0%  -0.17%  (p=0.000 n=10+10)

Change-Id: Ic72b0c189dd542f391e1c9ab88a76e9148dc4285
Reviewed-on: https://go-review.googlesource.com/106495
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@gopherbot
Copy link

Change https://golang.org/cl/206937 mentions this issue: cmd/compile: optimize big structs

bradfitz added a commit to inetaf/netaddr that referenced this issue Dec 31, 2020
If golang/go#24416 gets fixed we could revert this.

name                              old time/op    new time/op    delta
StdIPv4-4                            361ns ± 9%     386ns ± 8%     ~     (p=0.167 n=5+5)
IPv4-4                               273ns ±12%     301ns ±14%     ~     (p=0.238 n=5+5)
IPv4_inline-4                        251ns ± 5%     262ns ±10%     ~     (p=0.730 n=5+5)
StdIPv6-4                            507ns ±18%     490ns ±13%     ~     (p=0.651 n=5+5)
IPv6-4                               414ns ± 8%     416ns ± 6%     ~     (p=1.000 n=5+5)
IPv4Contains-4                      30.1ns ± 4%    13.1ns ±12%  -56.41%  (p=0.008 n=5+5)
ParseIP/v4-4                        55.9ns ± 5%    49.5ns ±10%  -11.35%  (p=0.032 n=5+5)
ParseIP/v6-4                         244ns ± 9%     217ns ± 8%  -10.76%  (p=0.040 n=5+5)
ParseIP/v6_ellipsis-4                157ns ±12%     157ns ± 7%     ~     (p=0.730 n=5+5)
ParseIP/v6_v4-4                      191ns ±12%     168ns ±19%     ~     (p=0.310 n=5+5)
ParseIP/v6_zone-4                    464ns ±21%     448ns ±12%     ~     (p=0.841 n=5+5)
StdParseIP/v4-4                      186ns ± 5%     186ns ±13%     ~     (p=1.000 n=5+5)
StdParseIP/v6-4                      362ns ± 7%     330ns ± 5%   -8.68%  (p=0.016 n=5+5)
StdParseIP/v6_ellipsis-4             258ns ±17%     280ns ± 6%     ~     (p=0.254 n=5+5)
StdParseIP/v6_v4-4                   427ns ±28%     457ns ±15%     ~     (p=0.548 n=5+5)
StdParseIP/v6_zone-4                 285ns ±17%     305ns ±13%     ~     (p=0.421 n=5+5)
IPString/v4-4                        153ns ±21%     122ns ±17%  -19.79%  (p=0.048 n=5+5)
IPString/v6-4                        356ns ± 3%     345ns ±22%     ~     (p=0.151 n=5+5)
IPString/v6_ellipsis-4               346ns ±13%     329ns ±20%     ~     (p=0.548 n=5+5)
IPString/v6_v4-4                     309ns ±15%     266ns ±11%  -13.90%  (p=0.032 n=5+5)
IPString/v6_zone-4                   362ns ±18%     330ns ±21%     ~     (p=0.421 n=5+5)
IPPrefixMasking/IPv4_/32-4          54.8ns ± 4%    15.1ns ± 4%  -72.44%  (p=0.008 n=5+5)
IPPrefixMasking/IPv4_/17-4          54.9ns ± 4%    14.6ns ± 5%  -73.38%  (p=0.008 n=5+5)
IPPrefixMasking/IPv4_/0-4           55.0ns ± 4%    13.5ns ± 6%  -75.50%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_/128-4         55.0ns ± 1%    15.5ns ± 7%  -71.79%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_/65-4          54.2ns ± 2%    14.6ns ± 5%  -73.04%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_/0-4           55.4ns ± 2%    14.9ns ± 9%  -73.06%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/128-4    54.4ns ± 1%    15.0ns ±13%  -72.44%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/65-4     55.5ns ± 2%    14.8ns ±16%  -73.31%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/0-4      54.8ns ± 4%    14.7ns ± 8%  -73.23%  (p=0.008 n=5+5)
ParseIPPort/v4-4                     118ns ± 8%     103ns ± 9%  -12.88%  (p=0.008 n=5+5)
ParseIPPort/v6-4                     318ns ±12%     262ns ± 6%  -17.38%  (p=0.008 n=5+5)
ParseIPPort/v6_ellipsis-4            243ns ±10%     240ns ±19%     ~     (p=0.841 n=5+5)
ParseIPPort/v6_v4-4                  279ns ±14%     237ns ±16%  -14.91%  (p=0.040 n=5+5)
ParseIPPort/v6_zone-4                578ns ± 9%     546ns ± 7%     ~     (p=0.222 n=5+5)
IPRangePrefix-4                      142ns ± 5%      61ns ±11%  -56.74%  (p=0.008 n=5+5)
IPSetFuzz-4                         41.0µs ±16%    33.3µs ±15%  -18.87%  (p=0.016 n=5+5)

Fixes #102

Signed-off-by: Brad Fitzpatrick <brad@danga.com>
danderson pushed a commit to inetaf/netaddr that referenced this issue Dec 31, 2020
If golang/go#24416 gets fixed we could revert this.

name                              old time/op    new time/op    delta
StdIPv4-4                            361ns ± 9%     386ns ± 8%     ~     (p=0.167 n=5+5)
IPv4-4                               273ns ±12%     301ns ±14%     ~     (p=0.238 n=5+5)
IPv4_inline-4                        251ns ± 5%     262ns ±10%     ~     (p=0.730 n=5+5)
StdIPv6-4                            507ns ±18%     490ns ±13%     ~     (p=0.651 n=5+5)
IPv6-4                               414ns ± 8%     416ns ± 6%     ~     (p=1.000 n=5+5)
IPv4Contains-4                      30.1ns ± 4%    13.1ns ±12%  -56.41%  (p=0.008 n=5+5)
ParseIP/v4-4                        55.9ns ± 5%    49.5ns ±10%  -11.35%  (p=0.032 n=5+5)
ParseIP/v6-4                         244ns ± 9%     217ns ± 8%  -10.76%  (p=0.040 n=5+5)
ParseIP/v6_ellipsis-4                157ns ±12%     157ns ± 7%     ~     (p=0.730 n=5+5)
ParseIP/v6_v4-4                      191ns ±12%     168ns ±19%     ~     (p=0.310 n=5+5)
ParseIP/v6_zone-4                    464ns ±21%     448ns ±12%     ~     (p=0.841 n=5+5)
StdParseIP/v4-4                      186ns ± 5%     186ns ±13%     ~     (p=1.000 n=5+5)
StdParseIP/v6-4                      362ns ± 7%     330ns ± 5%   -8.68%  (p=0.016 n=5+5)
StdParseIP/v6_ellipsis-4             258ns ±17%     280ns ± 6%     ~     (p=0.254 n=5+5)
StdParseIP/v6_v4-4                   427ns ±28%     457ns ±15%     ~     (p=0.548 n=5+5)
StdParseIP/v6_zone-4                 285ns ±17%     305ns ±13%     ~     (p=0.421 n=5+5)
IPString/v4-4                        153ns ±21%     122ns ±17%  -19.79%  (p=0.048 n=5+5)
IPString/v6-4                        356ns ± 3%     345ns ±22%     ~     (p=0.151 n=5+5)
IPString/v6_ellipsis-4               346ns ±13%     329ns ±20%     ~     (p=0.548 n=5+5)
IPString/v6_v4-4                     309ns ±15%     266ns ±11%  -13.90%  (p=0.032 n=5+5)
IPString/v6_zone-4                   362ns ±18%     330ns ±21%     ~     (p=0.421 n=5+5)
IPPrefixMasking/IPv4_/32-4          54.8ns ± 4%    15.1ns ± 4%  -72.44%  (p=0.008 n=5+5)
IPPrefixMasking/IPv4_/17-4          54.9ns ± 4%    14.6ns ± 5%  -73.38%  (p=0.008 n=5+5)
IPPrefixMasking/IPv4_/0-4           55.0ns ± 4%    13.5ns ± 6%  -75.50%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_/128-4         55.0ns ± 1%    15.5ns ± 7%  -71.79%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_/65-4          54.2ns ± 2%    14.6ns ± 5%  -73.04%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_/0-4           55.4ns ± 2%    14.9ns ± 9%  -73.06%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/128-4    54.4ns ± 1%    15.0ns ±13%  -72.44%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/65-4     55.5ns ± 2%    14.8ns ±16%  -73.31%  (p=0.008 n=5+5)
IPPrefixMasking/IPv6_zone_/0-4      54.8ns ± 4%    14.7ns ± 8%  -73.23%  (p=0.008 n=5+5)
ParseIPPort/v4-4                     118ns ± 8%     103ns ± 9%  -12.88%  (p=0.008 n=5+5)
ParseIPPort/v6-4                     318ns ±12%     262ns ± 6%  -17.38%  (p=0.008 n=5+5)
ParseIPPort/v6_ellipsis-4            243ns ±10%     240ns ±19%     ~     (p=0.841 n=5+5)
ParseIPPort/v6_v4-4                  279ns ±14%     237ns ±16%  -14.91%  (p=0.040 n=5+5)
ParseIPPort/v6_zone-4                578ns ± 9%     546ns ± 7%     ~     (p=0.222 n=5+5)
IPRangePrefix-4                      142ns ± 5%      61ns ±11%  -56.74%  (p=0.008 n=5+5)
IPSetFuzz-4                         41.0µs ±16%    33.3µs ±15%  -18.87%  (p=0.016 n=5+5)

Fixes #102

Signed-off-by: Brad Fitzpatrick <brad@danga.com>
@go101
Copy link

go101 commented May 12, 2021

It looks, for Go toolchain v1.16, structs with 8 int fields and [8]int values have already been viewed as small-size values.

@randall77
Copy link
Contributor Author

It definitely compiles to better code now, but I think the underlying issue is still there. I think it just needs a better test case.

package p

type T struct {
	a, b, c, d int
}

type U struct {
	a, b, c, d, e int
}

func f() int {
	t := T{1, 2, 3, 4}
	return t.c
}
func g() int {
	u := U{1, 2, 3, 4, 5}
	return u.c
}

f is compiles to MOVL $3, AX; RET. g compiles to

SUBQ	$48, SP
MOVQ	BP, 40(SP)
LEAQ	40(SP), BP
MOVQ	$1, "".u(SP)
MOVQ	$2, "".u+8(SP)
MOVQ	$3, "".u+16(SP)
MOVQ	$4, "".u+24(SP)
MOVQ	$5, "".u+32(SP)
MOVQ	"".u+16(SP), AX
MOVQ	40(SP), BP
ADDQ	$48, SP
RET

@quasilyte
Copy link
Contributor

quasilyte commented Nov 9, 2021

I'm interested to know what's the status on this?

The CL206937 was abandoned but it's unclear why.

Are there any extra insights that came from that experiment?

@randall77
Copy link
Contributor Author

It's not abandoned, it just isn't finished. It had issues. Performance wasn't good, and there were pathological cases, like large fields of even larger structs weren't handled well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done. Performance
Projects
None yet
Development

No branches or pull requests

8 participants