cmd/compile/internal/ssa: perform safe code de-duplication #24936

Closed
quasilyte opened this issue Apr 18, 2018 · 13 comments

@quasilyte
Contributor

quasilyte commented Apr 18, 2018

There are cases where duplicated code sequences can be merged without performance loss or invalidating panic line info.

The most obvious candidates are BlockRet blocks that do not contain anything that may cause a run-time panic.

Given this code:

package foo

func f1(v *int, y int) int {
	x := 214
	if *v == 0 {
		return x + y
	}
	if *v == 1 {
		return x + y
	}
	if *v == 2 {
		return x - y
	}
	if *v == 3 {
		return x - y
	}
	return 0
}

We currently get this output:

"".f1 STEXT nosplit size=115 args=0x18 locals=0x0
00000 (foo.go:3)	TEXT	"".f1(SB), NOSPLIT, $0-24
00000 (foo.go:3)	MOVQ	"".v+8(SP), AX
00005 (foo.go:5)	MOVQ	(AX), AX
00008 (foo.go:5)	TESTQ	AX, AX
00011 (foo.go:5)	JEQ	98
00013 (foo.go:8)	CMPQ	AX, $1
00017 (foo.go:8)	JEQ	81
00019 (foo.go:11)	CMPQ	AX, $2
00023 (foo.go:11)	JEQ	61
00025 (foo.go:14)	CMPQ	AX, $3
00029 (foo.go:14)	JNE	51
00031 (foo.go:15)	MOVQ	"".y+16(SP), AX
00036 (foo.go:15)	ADDQ	$-214, AX
00042 (foo.go:15)	NEGQ	AX
00045 (foo.go:15)	MOVQ	AX, "".~r2+24(SP)
00050 (foo.go:15)	RET
00051 (foo.go:17)	MOVQ	$0, "".~r2+24(SP)
00060 (foo.go:17)	RET
00061 (foo.go:12)	MOVQ	"".y+16(SP), AX
00066 (foo.go:12)	ADDQ	$-214, AX
00072 (foo.go:12)	NEGQ	AX
00075 (foo.go:12)	MOVQ	AX, "".~r2+24(SP)
00080 (foo.go:12)	RET
00081 (foo.go:9)	MOVQ	"".y+16(SP), AX
00086 (foo.go:9)	ADDQ	$214, AX
00092 (foo.go:9)	MOVQ	AX, "".~r2+24(SP)
00097 (foo.go:9)	RET
00098 (foo.go:6)	MOVQ	"".y+16(SP), AX
00103 (foo.go:6)	ADDQ	$214, AX
00109 (foo.go:6)	MOVQ	AX, "".~r2+24(SP)
00114 (foo.go:6)	RET

If we were merging epilogue sequences, we could get this:

"".f1 STEXT nosplit size=78 args=0x18 locals=0x0
00000 (foo.go:3)	TEXT	"".f1(SB), NOSPLIT, $0-24
00000 (foo.go:3)	MOVQ	"".v+8(SP), AX
00005 (foo.go:5)	MOVQ	(AX), AX
00008 (foo.go:5)	TESTQ	AX, AX
00011 (foo.go:5)	JEQ	61
00013 (foo.go:8)	CMPQ	AX, $1
00017 (foo.go:8)	JEQ	61
00019 (foo.go:11)	CMPQ	AX, $2
00023 (foo.go:11)	JEQ	41
00025 (foo.go:14)	CMPQ	AX, $3
00029 (foo.go:14)	JEQ	41
00031 (foo.go:17)	MOVQ	$0, "".~r2+24(SP)
00040 (foo.go:17)	RET
00041 (foo.go:12)	MOVQ	"".y+16(SP), AX
00046 (foo.go:12)	ADDQ	$-214, AX
00052 (foo.go:12)	NEGQ	AX
00055 (foo.go:12)	MOVQ	AX, "".~r2+24(SP)
00060 (foo.go:12)	RET
00061 (foo.go:6)	MOVQ	"".y+16(SP), AX
00066 (foo.go:6)	ADDQ	$214, AX
00072 (foo.go:6)	MOVQ	AX, "".~r2+24(SP)
00077 (foo.go:6)	RET

A proposed implementation will follow soon in the form of a CL.

Detailed bincmp info for the go binary: bincmp.txt.

Main objectives/restrictions:

  • Do not introduce additional branches; only change existing jump targets to the merged return.
  • Do not change the program's observable behavior (in particular, stack traces and source locations should remain the same).

If the implementation violates either of these, it's a bug.
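Roughly, the idea is: for every panic-free BlockRet, build a key from its values, then redirect existing edges that point at duplicate blocks to a single canonical block. Below is a self-contained sketch of that shape (simplified stand-in types; not the ssa package's actual API or the CL's code, and a real pass must also compare value arguments, not just ops and constants):

// Simplified stand-ins for the compiler's block/value types; the real
// pass would operate on cmd/compile/internal/ssa types.
package main

import "fmt"

type Value struct {
	Op  string      // operation, e.g. "Add64", "Const64"
	Aux interface{} // operation-specific constant
}

type Block struct {
	ID     int
	IsRet  bool // stands in for b.Kind == BlockRet
	Values []Value
	Succs  []*Block
}

// mayPanic reports whether the block contains an op that can fault
// (nil check, bounds check, division). Such blocks keep their own
// line info and are never merged.
func mayPanic(b *Block) bool {
	for _, v := range b.Values {
		switch v.Op {
		case "NilCheck", "IsInBounds", "Div64":
			return true
		}
	}
	return false
}

// key builds a coarse comparison key for a return block. In this
// sketch it only looks at ops and constants; a real implementation
// must also compare value arguments and auxiliary info.
func key(b *Block) string {
	s := ""
	for _, v := range b.Values {
		s += fmt.Sprintf("%s/%v;", v.Op, v.Aux)
	}
	return s
}

// dedup finds panic-free return blocks with identical keys and
// redirects existing edges so that only one copy remains. No new
// branches are introduced: only jump targets change.
func dedup(blocks []*Block) {
	canon := map[string]*Block{}   // key -> canonical block
	replace := map[*Block]*Block{} // duplicate -> canonical block
	for _, b := range blocks {
		if !b.IsRet || mayPanic(b) {
			continue
		}
		k := key(b)
		if c, ok := canon[k]; ok {
			replace[b] = c
		} else {
			canon[k] = b
		}
	}
	for _, b := range blocks {
		for i, s := range b.Succs {
			if c, ok := replace[s]; ok {
				b.Succs[i] = c
			}
		}
	}
}

func main() {
	// Two identical return blocks; after dedup the second branch
	// target is redirected to the first block.
	ret1 := &Block{ID: 1, IsRet: true, Values: []Value{{Op: "Add64"}}}
	ret2 := &Block{ID: 2, IsRet: true, Values: []Value{{Op: "Add64"}}}
	entry := &Block{ID: 0, Succs: []*Block{ret1, ret2}}
	dedup([]*Block{entry, ret1, ret2})
	fmt.Println(entry.Succs[0].ID, entry.Succs[1].ID) // prints: 1 1
}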

@gopherbot

Change https://golang.org/cl/107895 mentions this issue: cmd/compile/internal/ssa: add dedup pass

@bradfitz
Contributor

/cc @randall77 @aclements

@bradfitz bradfitz added this to the Go1.11 milestone Apr 18, 2018
@quasilyte
Contributor Author

/cc @josharian @TocarIP

@aclements
Member

This is a really interesting optimization. My main concern (other than build speed impact, which you mentioned on the CL was in the noise) is the effect on debuggability. For example, if a user tries to set a breakpoint on a line that's been merged with another block, I'm not sure what happens, but I can't imagine how that would still work "correctly".

/cc @dr2chase

@aclements
Member

Could you give some examples of where this helps in std or cmd (or elsewhere)? I find the synthetic example you gave above unconvincing as motivation because that could be written as a switch statement with the identical cases merged, which would be both clearer code and compile to the better output.
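For reference, such a rewrite of f1 might look like this (a sketch, not code from the thread):

package foo

// Same behavior as f1 above, but with the identical cases merged by
// hand; the duplicated return sequences disappear at the source level.
func f1(v *int, y int) int {
	x := 214
	switch *v {
	case 0, 1:
		return x + y
	case 2, 3:
		return x - y
	}
	return 0
}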

@quasilyte
Contributor Author

quasilyte commented Apr 19, 2018

@aclements, see the attached file in the original message (bincmp.txt).

For example, the runtime.atoi function:

// atoi parses an int from a string s.
// The bool result reports whether s is a number
// representable by a value of type int.
func atoi(s string) (int, bool) {
	if s == "" {
		return 0, false
	}

	neg := false
	if s[0] == '-' {
		neg = true
		s = s[1:]
	}

	un := uint(0)
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c < '0' || c > '9' {
			return 0, false
		}
		if un > maxUint/10 {
			// overflow
			return 0, false
		}
		un *= 10
		un1 := un + uint(c) - '0'
		if un1 < un {
			// overflow
			return 0, false
		}
		un = un1
	}

	if !neg && un > uint(maxInt) {
		return 0, false
	}
	if neg && un > uint(maxInt)+1 {
		return 0, false
	}

	n := int(un)
	if neg {
		n = -n
	}

	return n, true
}

The return 0, false is repeated multiple times.
Machine code for that return is reused 6 times.
The size impact is:

runtime.atoi  delta=-73 old=295 new=222 diff=-24.75%

Five epilogue sequences of this form are removed:

MOVQ  $0, "".~r0+8(SP)
MOVB  $0, "".~r1+16(SP)
RET

@bradfitz
Contributor

Also, where are profiling samples attributed?

@quasilyte
Contributor Author

@bradfitz, I've seen a couple of regressions in go1 and tried to find an explanation for them before posting the results. I also want to re-run the benchmarks on a different machine with a better configuration to get less noise.
Working on it. The same goes for compilation speed impact info.

My main concern (other than build speed impact, which you mentioned on the CL was in the noise) is the effect on debuggability. For example, if a user tries to set a breakpoint on a line that's been merged with another block, I'm not sure what happens, but I can't imagine how that would still work "correctly".

If this required additional debug mappings, that could cancel out the binary size win, but the code that actually executes would still be smaller and, hopefully, more I-cache friendly.
I'll try to find a better answer on this topic in the near future.

@aclements
Member

Thanks for the std example.

I don't think additional debug mappings can solve this. My concern is that the information is gone. If I try to set a breakpoint on the last return 0, false in atoi to see when that happens, it's going to trigger no matter which return 0, false actually gets hit, and there's no way the debugger can disambiguate that [1] or tell me which one was actually hit.

[1] Okay, not strictly true: one could imagine synthesizing earlier breakpoints to determine which path was taken (or using the branch target buffer on x86), but no debugger does that and there's no way to express this in DWARF.

@quasilyte
Contributor Author

go1 benchmarks

$ benchstat old_go1.txt new_go1.txt

name                     old time/op    new time/op    delta
BinaryTree17-4              4.34s ± 1%     4.36s ± 1%    ~     (p=0.095 n=5+5)
Fannkuch11-4                4.06s ± 0%     4.32s ± 0%  +6.46%  (p=0.008 n=5+5)
FmtFprintfEmpty-4          66.6ns ± 2%    67.3ns ± 3%    ~     (p=0.254 n=5+5)
FmtFprintfString-4          125ns ± 3%     117ns ± 2%  -6.10%  (p=0.008 n=5+5)
FmtFprintfInt-4             120ns ± 0%     123ns ± 2%  +2.83%  (p=0.008 n=5+5)
FmtFprintfIntInt-4          185ns ± 0%     185ns ± 1%    ~     (p=0.968 n=4+5)
FmtFprintfPrefixedInt-4     229ns ± 0%     224ns ± 1%  -2.18%  (p=0.016 n=4+5)
FmtFprintfFloat-4           375ns ± 0%     376ns ± 0%  +0.27%  (p=0.029 n=4+4)
FmtManyArgs-4               776ns ± 0%     760ns ± 0%  -2.09%  (p=0.008 n=5+5)
GobDecode-4                10.6ms ± 0%    10.6ms ± 0%    ~     (p=0.841 n=5+5)
GobEncode-4                8.77ms ± 0%    8.82ms ± 0%  +0.54%  (p=0.016 n=5+5)
Gzip-4                      371ms ± 0%     370ms ± 0%  -0.34%  (p=0.008 n=5+5)
Gunzip-4                   60.2ms ± 0%    59.9ms ± 0%  -0.42%  (p=0.032 n=5+5)
HTTPClientServer-4          127µs ± 2%     128µs ± 1%    ~     (p=0.690 n=5+5)
JSONEncode-4               18.9ms ± 0%    19.1ms ± 0%  +1.27%  (p=0.008 n=5+5)
JSONDecode-4               83.1ms ± 0%    84.9ms ± 1%  +2.19%  (p=0.008 n=5+5)
Mandelbrot200-4            6.77ms ± 0%    6.76ms ± 0%    ~     (p=0.690 n=5+5)
GoParse-4                  5.29ms ± 0%    5.32ms ± 0%  +0.45%  (p=0.008 n=5+5)
RegexpMatchEasy0_32-4       128ns ± 2%     127ns ± 1%    ~     (p=0.238 n=5+5)
RegexpMatchEasy0_1K-4       317ns ± 0%     307ns ± 0%  -3.03%  (p=0.000 n=5+4)
RegexpMatchEasy1_32-4       118ns ± 0%     117ns ± 0%  -0.85%  (p=0.029 n=4+4)
RegexpMatchEasy1_1K-4       540ns ± 1%     527ns ± 0%  -2.48%  (p=0.008 n=5+5)
RegexpMatchMedium_32-4      182ns ± 1%     174ns ± 0%  -4.19%  (p=0.008 n=5+5)
RegexpMatchMedium_1K-4     57.5µs ± 0%    55.7µs ± 0%  -3.13%  (p=0.008 n=5+5)
RegexpMatchHard_32-4       2.62µs ± 0%    2.60µs ± 0%  -0.85%  (p=0.008 n=5+5)
RegexpMatchHard_1K-4       78.7µs ± 0%    78.4µs ± 0%  -0.41%  (p=0.016 n=5+4)
Revcomp-4                   591ms ± 2%     599ms ± 2%    ~     (p=0.095 n=5+5)
Template-4                 90.9ms ± 0%    90.5ms ± 0%  -0.40%  (p=0.016 n=5+5)
TimeParse-4                 462ns ± 0%     460ns ± 0%  -0.43%  (p=0.008 n=5+5)
TimeFormat-4                513ns ± 0%     518ns ± 0%  +0.97%  (p=0.016 n=4+5)
[Geo mean]                 75.6µs         75.3µs       -0.35%

name                     old speed      new speed      delta
GobDecode-4              72.7MB/s ± 0%  72.7MB/s ± 0%    ~     (p=0.889 n=5+5)
GobEncode-4              87.5MB/s ± 0%  87.0MB/s ± 0%  -0.54%  (p=0.016 n=5+5)
Gzip-4                   52.3MB/s ± 0%  52.5MB/s ± 0%  +0.34%  (p=0.008 n=5+5)
Gunzip-4                  322MB/s ± 0%   324MB/s ± 0%  +0.43%  (p=0.032 n=5+5)
JSONEncode-4              103MB/s ± 0%   102MB/s ± 0%  -1.25%  (p=0.008 n=5+5)
JSONDecode-4             23.3MB/s ± 0%  22.9MB/s ± 1%  -2.13%  (p=0.008 n=5+5)
GoParse-4                10.9MB/s ± 0%  10.9MB/s ± 0%  -0.46%  (p=0.008 n=5+5)
RegexpMatchEasy0_32-4     250MB/s ± 1%   251MB/s ± 1%    ~     (p=0.548 n=5+5)
RegexpMatchEasy0_1K-4    3.23GB/s ± 0%  3.33GB/s ± 0%  +2.97%  (p=0.008 n=5+5)
RegexpMatchEasy1_32-4     271MB/s ± 1%   272MB/s ± 1%    ~     (p=0.095 n=5+5)
RegexpMatchEasy1_1K-4    1.90GB/s ± 1%  1.94GB/s ± 0%  +2.53%  (p=0.008 n=5+5)
RegexpMatchMedium_32-4   5.49MB/s ± 1%  5.73MB/s ± 0%  +4.37%  (p=0.008 n=5+5)
RegexpMatchMedium_1K-4   17.8MB/s ± 0%  18.4MB/s ± 0%  +3.23%  (p=0.008 n=5+5)
RegexpMatchHard_32-4     12.2MB/s ± 0%  12.3MB/s ± 0%  +0.87%  (p=0.008 n=5+5)
RegexpMatchHard_1K-4     13.0MB/s ± 0%  13.1MB/s ± 0%  +0.41%  (p=0.016 n=5+4)
Revcomp-4                 430MB/s ± 2%   425MB/s ± 2%    ~     (p=0.095 n=5+5)
Template-4               21.3MB/s ± 0%  21.4MB/s ± 0%  +0.40%  (p=0.016 n=5+5)
[Geo mean]               78.6MB/s       79.1MB/s       +0.63%

compilebench

$ benchstat ssa_old.txt ssa_new.txt

name       old time/op       new time/op       delta
Template         304ms ± 1%        310ms ± 2%  +2.05%  (p=0.016 n=5+5)
Unicode          142ms ± 3%        142ms ± 2%    ~     (p=1.000 n=5+5)
GoTypes          1.07s ± 5%        1.06s ± 1%    ~     (p=0.222 n=5+5)
Compiler         4.85s ± 2%        4.83s ± 0%    ~     (p=1.000 n=5+5)
SSA              11.9s ± 1%        11.9s ± 2%    ~     (p=1.000 n=5+5)
Flate            207ms ± 1%        206ms ± 1%    ~     (p=0.690 n=5+5)
GoParser         253ms ± 2%        252ms ± 1%    ~     (p=1.000 n=5+5)
Reflect          683ms ± 1%        681ms ± 0%    ~     (p=0.310 n=5+5)
Tar              293ms ± 1%        291ms ± 1%    ~     (p=0.095 n=5+5)
XML              364ms ± 1%        366ms ± 1%    ~     (p=0.310 n=5+5)
StdCmd           31.8s ± 4%        31.8s ± 4%    ~     (p=0.841 n=5+5)

name       old user-time/op  new user-time/op  delta
Template         384ms ± 6%        399ms ± 5%    ~     (p=0.198 n=5+5)
Unicode          193ms ± 7%        192ms ± 6%    ~     (p=0.810 n=5+5)
GoTypes          1.35s ± 2%        1.34s ± 2%    ~     (p=0.357 n=5+5)
Compiler         5.96s ± 2%        6.08s ± 0%  +2.05%  (p=0.032 n=5+5)
SSA              14.8s ± 1%        14.9s ± 1%    ~     (p=0.310 n=5+5)
Flate            266ms ± 4%        264ms ± 6%    ~     (p=0.952 n=5+5)
GoParser         324ms ± 2%        322ms ± 4%    ~     (p=0.841 n=5+5)
Reflect          856ms ± 1%        858ms ± 3%    ~     (p=0.508 n=5+5)
Tar              374ms ± 5%        372ms ± 2%    ~     (p=0.460 n=5+5)
XML              462ms ± 1%        461ms ± 5%    ~     (p=0.841 n=5+5)

name       old text-bytes    new text-bytes    delta
HelloSize        677kB ± 0%        674kB ± 0%  -0.48%  (p=0.008 n=5+5)
CmdGoSize       7.25MB ± 0%       7.22MB ± 0%  -0.44%  (p=0.008 n=5+5)

name       old data-bytes    new data-bytes    delta
HelloSize       9.91kB ± 0%       9.91kB ± 0%    ~     (all equal)
CmdGoSize        248kB ± 0%        248kB ± 0%    ~     (all equal)

name       old bss-bytes     new bss-bytes     delta
HelloSize        125kB ± 0%        125kB ± 0%    ~     (all equal)
CmdGoSize        145kB ± 0%        145kB ± 0%    ~     (all equal)

name       old exe-bytes     new exe-bytes     delta
HelloSize       1.46MB ± 0%       1.44MB ± 0%  -0.84%  (p=0.008 n=5+5)
CmdGoSize       14.7MB ± 0%       14.6MB ± 0%  -0.31%  (p=0.008 n=5+5)

@randall77
Contributor

Would it be enough if we share all but the first instruction in the duplicate basic blocks? That would leave one instruction on which to hang a line number.

I think it would be useful to share epilogues, especially now that they are longer because we use frame pointers everywhere. Sharing epilogues doesn't cause any adverse debugging effects. I even wrote a CL for sharing epilogues once upon a time, but never completed it: https://go-review.googlesource.com/c/go/+/33634

@dr2chase
Contributor

When do we decide on that "first instruction"?

It would be complicated if, for example, you found common tails and then ran a dead code elimination or CSE phase that eliminated those "first instructions". Perhaps we should decide just before register allocation? Or do we prefer after, to give the register allocator greater freedom (which would probably make common tails much rarer, just because of the vagaries of register allocation)?

@aclements
Member

Given the small impact on binary size, and the small (and varied) performance improvement, I don't think this is worth the harm it does to debuggability and profile fidelity. Debuggability is one of our top priorities (partly because we did significant harm to it in recent releases and discovered how much of a problem that was). And reducing profile fidelity makes it harder for users to make more fundamental improvements to their code's performance.

It may be worth revisiting epilogue deduplication, though, as @randall77 pointed out. I wouldn't expect nearly as much win from it as, say, in C, since Go doesn't have callee-save registers that need to be restored other than the frame pointer. But it could still be a win without any downsides.

@golang golang locked and limited conversation to collaborators Apr 24, 2019