cmd/compile/internal/ssa: perform safe code de-duplication #24936

Closed
quasilyte opened this issue Apr 18, 2018 · 13 comments

@quasilyte
Contributor

quasilyte commented Apr 18, 2018

There are cases where duplicated code sequences can be merged without performance loss or invalidating panic line info.

The most obvious candidates are BlockRet blocks that do not contain anything that may cause a run-time panic.

Given this code:

package foo

func f1(v *int, y int) int {
	x := 214
	if *v == 0 {
		return x + y
	}
	if *v == 1 {
		return x + y
	}
	if *v == 2 {
		return x - y
	}
	if *v == 3 {
		return x - y
	}
	return 0
}

We currently get this output:

"".f1 STEXT nosplit size=115 args=0x18 locals=0x0
00000 (foo.go:3)	TEXT	"".f1(SB), NOSPLIT, $0-24
00000 (foo.go:3)	MOVQ	"".v+8(SP), AX
00005 (foo.go:5)	MOVQ	(AX), AX
00008 (foo.go:5)	TESTQ	AX, AX
00011 (foo.go:5)	JEQ	98
00013 (foo.go:8)	CMPQ	AX, $1
00017 (foo.go:8)	JEQ	81
00019 (foo.go:11)	CMPQ	AX, $2
00023 (foo.go:11)	JEQ	61
00025 (foo.go:14)	CMPQ	AX, $3
00029 (foo.go:14)	JNE	51
00031 (foo.go:15)	MOVQ	"".y+16(SP), AX
00036 (foo.go:15)	ADDQ	$-214, AX
00042 (foo.go:15)	NEGQ	AX
00045 (foo.go:15)	MOVQ	AX, "".~r2+24(SP)
00050 (foo.go:15)	RET
00051 (foo.go:17)	MOVQ	$0, "".~r2+24(SP)
00060 (foo.go:17)	RET
00061 (foo.go:12)	MOVQ	"".y+16(SP), AX
00066 (foo.go:12)	ADDQ	$-214, AX
00072 (foo.go:12)	NEGQ	AX
00075 (foo.go:12)	MOVQ	AX, "".~r2+24(SP)
00080 (foo.go:12)	RET
00081 (foo.go:9)	MOVQ	"".y+16(SP), AX
00086 (foo.go:9)	ADDQ	$214, AX
00092 (foo.go:9)	MOVQ	AX, "".~r2+24(SP)
00097 (foo.go:9)	RET
00098 (foo.go:6)	MOVQ	"".y+16(SP), AX
00103 (foo.go:6)	ADDQ	$214, AX
00109 (foo.go:6)	MOVQ	AX, "".~r2+24(SP)
00114 (foo.go:6)	RET

If we were merging epilogue sequences, we could get this:

"".f1 STEXT nosplit size=78 args=0x18 locals=0x0
00000 (foo.go:3)	TEXT	"".f1(SB), NOSPLIT, $0-24
00000 (foo.go:3)	MOVQ	"".v+8(SP), AX
00005 (foo.go:5)	MOVQ	(AX), AX
00008 (foo.go:5)	TESTQ	AX, AX
00011 (foo.go:5)	JEQ	61
00013 (foo.go:8)	CMPQ	AX, $1
00017 (foo.go:8)	JEQ	61
00019 (foo.go:11)	CMPQ	AX, $2
00023 (foo.go:11)	JEQ	41
00025 (foo.go:14)	CMPQ	AX, $3
00029 (foo.go:14)	JEQ	41
00031 (foo.go:17)	MOVQ	$0, "".~r2+24(SP)
00040 (foo.go:17)	RET
00041 (foo.go:12)	MOVQ	"".y+16(SP), AX
00046 (foo.go:12)	ADDQ	$-214, AX
00052 (foo.go:12)	NEGQ	AX
00055 (foo.go:12)	MOVQ	AX, "".~r2+24(SP)
00060 (foo.go:12)	RET
00061 (foo.go:6)	MOVQ	"".y+16(SP), AX
00066 (foo.go:6)	ADDQ	$214, AX
00072 (foo.go:6)	MOVQ	AX, "".~r2+24(SP)
00077 (foo.go:6)	RET

A proposed implementation will follow soon in the form of a CL.

Detailed bincmp info for the go binary: bincmp.txt.

Main objectives/restrictions:

  • Do not introduce additional branches; only change existing jump targets to the merged return.
  • Do not change the program's observable behavior (in particular, stack traces and source locations should remain the same).

If the implementation violates either of these, it's a bug.
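Roughly, the idea is: for every panic-free BlockRet, build a key from its values, then redirect existing edges that point at duplicate blocks to a single canonical block. Below is a self-contained sketch of that shape (simplified stand-in types; not the ssa package's actual API or the CL's code, and a real pass must also compare value arguments, not just ops and constants):

// Simplified stand-ins for the compiler's block/value types; the real
// pass would operate on cmd/compile/internal/ssa types.
package main

import "fmt"

type Value struct {
	Op  string      // operation, e.g. "Add64", "Const64"
	Aux interface{} // operation-specific constant
}

type Block struct {
	ID     int
	IsRet  bool // stands in for b.Kind == BlockRet
	Values []Value
	Succs  []*Block
}

// mayPanic reports whether the block contains an op that can fault
// (nil check, bounds check, division). Such blocks keep their own
// line info and are never merged.
func mayPanic(b *Block) bool {
	for _, v := range b.Values {
		switch v.Op {
		case "NilCheck", "IsInBounds", "Div64":
			return true
		}
	}
	return false
}

// key builds a coarse comparison key for a return block. In this
// sketch it only looks at ops and constants; a real implementation
// must also compare value arguments and auxiliary info.
func key(b *Block) string {
	s := ""
	for _, v := range b.Values {
		s += fmt.Sprintf("%s/%v;", v.Op, v.Aux)
	}
	return s
}

// dedup finds panic-free return blocks with identical keys and
// redirects existing edges so that only one copy remains. No new
// branches are introduced: only jump targets change.
func dedup(blocks []*Block) {
	canon := map[string]*Block{}   // key -> canonical block
	replace := map[*Block]*Block{} // duplicate -> canonical block
	for _, b := range blocks {
		if !b.IsRet || mayPanic(b) {
			continue
		}
		k := key(b)
		if c, ok := canon[k]; ok {
			replace[b] = c
		} else {
			canon[k] = b
		}
	}
	for _, b := range blocks {
		for i, s := range b.Succs {
			if c, ok := replace[s]; ok {
				b.Succs[i] = c
			}
		}
	}
}

func main() {
	// Two identical return blocks; after dedup the second branch
	// target is redirected to the first block.
	ret1 := &Block{ID: 1, IsRet: true, Values: []Value{{Op: "Add64"}}}
	ret2 := &Block{ID: 2, IsRet: true, Values: []Value{{Op: "Add64"}}}
	entry := &Block{ID: 0, Succs: []*Block{ret1, ret2}}
	dedup([]*Block{entry, ret1, ret2})
	fmt.Println(entry.Succs[0].ID, entry.Succs[1].ID) // prints: 1 1
}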

@gopherbot

Change https://golang.org/cl/107895 mentions this issue: cmd/compile/internal/ssa: add dedup pass

@bradfitz
Contributor

/cc @randall77 @aclements

@bradfitz bradfitz added this to the Go1.11 milestone Apr 18, 2018
@quasilyte
Contributor Author

/cc @josharian @TocarIP

@aclements
Member

This is a really interesting optimization. My main concern (other than build speed impact, which you mentioned on the CL was in the noise) is the effect on debuggability. For example, if a user tries to set a breakpoint on a line that's been merged with another block, I'm not sure what happens, but I can't imagine how that would still work "correctly".

/cc @dr2chase

@aclements
Member

Could you give some examples of where this helps in std or cmd (or elsewhere)? I find the synthetic example you gave above unconvincing as motivation because that could be written as a switch statement with the identical cases merged, which would be both clearer code and compile to the better output.
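For reference, such a rewrite of f1 might look like this (a sketch, not code from the thread):

package foo

// Same behavior as f1 above, but with the identical cases merged by
// hand; the duplicated return sequences disappear at the source level.
func f1(v *int, y int) int {
	x := 214
	switch *v {
	case 0, 1:
		return x + y
	case 2, 3:
		return x - y
	}
	return 0
}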

@quasilyte
Contributor Author

quasilyte commented Apr 19, 2018

@aclements, see the attached file in the original message (bincmp.txt).

For example, the runtime.atoi function:

// atoi parses an int from a string s.
// The bool result reports whether s is a number
// representable by a value of type int.
func atoi(s string) (int, bool) {
	if s == "" {
		return 0, false
	}

	neg := false
	if s[0] == '-' {
		neg = true
		s = s[1:]
	}

	un := uint(0)
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c < '0' || c > '9' {
			return 0, false
		}
		if un > maxUint/10 {
			// overflow
			return 0, false
		}
		un *= 10
		un1 := un + uint(c) - '0'
		if un1 < un {
			// overflow
			return 0, false
		}
		un = un1
	}

	if !neg && un > uint(maxInt) {
		return 0, false
	}
	if neg && un > uint(maxInt)+1 {
		return 0, false
	}

	n := int(un)
	if neg {
		n = -n
	}

	return n, true
}

The return 0, false is repeated multiple times.
Machine code for that return is reused 6 times.
The size impact is:

runtime.atoi  delta=-73 old=295 new=222 diff=-24.75%

Five epilogue sequences of this form are removed:

MOVQ  $0, "".~r0+8(SP)
MOVB  $0, "".~r1+16(SP)
RET

@bradfitz
Contributor

Also, where are profiling samples attributed?

@quasilyte
Contributor Author

@bradfitz, I've seen a couple of regressions in go1 and tried to find an explanation for them before posting the results. I also want to re-run the benchmarks on a different machine with a better configuration to get less noise.
Working on it. The same goes for compilation speed impact info.

My main concern (other than build speed impact, which you mentioned on the CL was in the noise) is the effect on debuggability. For example, if a user tries to set a breakpoint on a line that's been merged with another block, I'm not sure what happens, but I can't imagine how that would still work "correctly".

If this required additional debug mappings, that could cancel out the binary size win, but the code that actually executes would still be smaller and, hopefully, more I-cache friendly.
I'll try to find a better answer on this topic in the near future.

@aclements
Member

Thanks for the std example.

I don't think additional debug mappings can solve this. My concern is that the information is gone. If I try to set a breakpoint on the last return 0, false in atoi to see when that happens, it's going to trigger no matter which return 0, false actually gets hit, and there's no way the debugger can disambiguate that [1] or tell me which one was actually hit.

[1] Okay, not strictly true: one could imagine synthesizing earlier breakpoints to determine which path was taken (or using the branch target buffer on x86), but no debugger does that and there's no way to express this in DWARF.

@quasilyte
Contributor Author

go1 benchmarks

$ benchstat old_go1.txt new_go1.txt

name                     old time/op    new time/op    delta
BinaryTree17-4              4.34s ± 1%     4.36s ± 1%    ~     (p=0.095 n=5+5)
Fannkuch11-4                4.06s ± 0%     4.32s ± 0%  +6.46%  (p=0.008 n=5+5)
FmtFprintfEmpty-4          66.6ns ± 2%    67.3ns ± 3%    ~     (p=0.254 n=5+5)
FmtFprintfString-4          125ns ± 3%     117ns ± 2%  -6.10%  (p=0.008 n=5+5)
FmtFprintfInt-4             120ns ± 0%     123ns ± 2%  +2.83%  (p=0.008 n=5+5)
FmtFprintfIntInt-4          185ns ± 0%     185ns ± 1%    ~     (p=0.968 n=4+5)
FmtFprintfPrefixedInt-4     229ns ± 0%     224ns ± 1%  -2.18%  (p=0.016 n=4+5)
FmtFprintfFloat-4           375ns ± 0%     376ns ± 0%  +0.27%  (p=0.029 n=4+4)
FmtManyArgs-4               776ns ± 0%     760ns ± 0%  -2.09%  (p=0.008 n=5+5)
GobDecode-4                10.6ms ± 0%    10.6ms ± 0%    ~     (p=0.841 n=5+5)
GobEncode-4                8.77ms ± 0%    8.82ms ± 0%  +0.54%  (p=0.016 n=5+5)
Gzip-4                      371ms ± 0%     370ms ± 0%  -0.34%  (p=0.008 n=5+5)
Gunzip-4                   60.2ms ± 0%    59.9ms ± 0%  -0.42%  (p=0.032 n=5+5)
HTTPClientServer-4          127µs ± 2%     128µs ± 1%    ~     (p=0.690 n=5+5)
JSONEncode-4               18.9ms ± 0%    19.1ms ± 0%  +1.27%  (p=0.008 n=5+5)
JSONDecode-4               83.1ms ± 0%    84.9ms ± 1%  +2.19%  (p=0.008 n=5+5)
Mandelbrot200-4            6.77ms ± 0%    6.76ms ± 0%    ~     (p=0.690 n=5+5)
GoParse-4                  5.29ms ± 0%    5.32ms ± 0%  +0.45%  (p=0.008 n=5+5)
RegexpMatchEasy0_32-4       128ns ± 2%     127ns ± 1%    ~     (p=0.238 n=5+5)
RegexpMatchEasy0_1K-4       317ns ± 0%     307ns ± 0%  -3.03%  (p=0.000 n=5+4)
RegexpMatchEasy1_32-4       118ns ± 0%     117ns ± 0%  -0.85%  (p=0.029 n=4+4)
RegexpMatchEasy1_1K-4       540ns ± 1%     527ns ± 0%  -2.48%  (p=0.008 n=5+5)
RegexpMatchMedium_32-4      182ns ± 1%     174ns ± 0%  -4.19%  (p=0.008 n=5+5)
RegexpMatchMedium_1K-4     57.5µs ± 0%    55.7µs ± 0%  -3.13%  (p=0.008 n=5+5)
RegexpMatchHard_32-4       2.62µs ± 0%    2.60µs ± 0%  -0.85%  (p=0.008 n=5+5)
RegexpMatchHard_1K-4       78.7µs ± 0%    78.4µs ± 0%  -0.41%  (p=0.016 n=5+4)
Revcomp-4                   591ms ± 2%     599ms ± 2%    ~     (p=0.095 n=5+5)
Template-4                 90.9ms ± 0%    90.5ms ± 0%  -0.40%  (p=0.016 n=5+5)
TimeParse-4                 462ns ± 0%     460ns ± 0%  -0.43%  (p=0.008 n=5+5)
TimeFormat-4                513ns ± 0%     518ns ± 0%  +0.97%  (p=0.016 n=4+5)
[Geo mean]                 75.6µs         75.3µs       -0.35%

name                     old speed      new speed      delta
GobDecode-4              72.7MB/s ± 0%  72.7MB/s ± 0%    ~     (p=0.889 n=5+5)
GobEncode-4              87.5MB/s ± 0%  87.0MB/s ± 0%  -0.54%  (p=0.016 n=5+5)
Gzip-4                   52.3MB/s ± 0%  52.5MB/s ± 0%  +0.34%  (p=0.008 n=5+5)
Gunzip-4                  322MB/s ± 0%   324MB/s ± 0%  +0.43%  (p=0.032 n=5+5)
JSONEncode-4              103MB/s ± 0%   102MB/s ± 0%  -1.25%  (p=0.008 n=5+5)
JSONDecode-4             23.3MB/s ± 0%  22.9MB/s ± 1%  -2.13%  (p=0.008 n=5+5)
GoParse-4                10.9MB/s ± 0%  10.9MB/s ± 0%  -0.46%  (p=0.008 n=5+5)
RegexpMatchEasy0_32-4     250MB/s ± 1%   251MB/s ± 1%    ~     (p=0.548 n=5+5)
RegexpMatchEasy0_1K-4    3.23GB/s ± 0%  3.33GB/s ± 0%  +2.97%  (p=0.008 n=5+5)
RegexpMatchEasy1_32-4     271MB/s ± 1%   272MB/s ± 1%    ~     (p=0.095 n=5+5)
RegexpMatchEasy1_1K-4    1.90GB/s ± 1%  1.94GB/s ± 0%  +2.53%  (p=0.008 n=5+5)
RegexpMatchMedium_32-4   5.49MB/s ± 1%  5.73MB/s ± 0%  +4.37%  (p=0.008 n=5+5)
RegexpMatchMedium_1K-4   17.8MB/s ± 0%  18.4MB/s ± 0%  +3.23%  (p=0.008 n=5+5)
RegexpMatchHard_32-4     12.2MB/s ± 0%  12.3MB/s ± 0%  +0.87%  (p=0.008 n=5+5)
RegexpMatchHard_1K-4     13.0MB/s ± 0%  13.1MB/s ± 0%  +0.41%  (p=0.016 n=5+4)
Revcomp-4                 430MB/s ± 2%   425MB/s ± 2%    ~     (p=0.095 n=5+5)
Template-4               21.3MB/s ± 0%  21.4MB/s ± 0%  +0.40%  (p=0.016 n=5+5)
[Geo mean]               78.6MB/s       79.1MB/s       +0.63%

compilebench

$ benchstat ssa_old.txt ssa_new.txt

name       old time/op       new time/op       delta
Template         304ms ± 1%        310ms ± 2%  +2.05%  (p=0.016 n=5+5)
Unicode          142ms ± 3%        142ms ± 2%    ~     (p=1.000 n=5+5)
GoTypes          1.07s ± 5%        1.06s ± 1%    ~     (p=0.222 n=5+5)
Compiler         4.85s ± 2%        4.83s ± 0%    ~     (p=1.000 n=5+5)
SSA              11.9s ± 1%        11.9s ± 2%    ~     (p=1.000 n=5+5)
Flate            207ms ± 1%        206ms ± 1%    ~     (p=0.690 n=5+5)
GoParser         253ms ± 2%        252ms ± 1%    ~     (p=1.000 n=5+5)
Reflect          683ms ± 1%        681ms ± 0%    ~     (p=0.310 n=5+5)
Tar              293ms ± 1%        291ms ± 1%    ~     (p=0.095 n=5+5)
XML              364ms ± 1%        366ms ± 1%    ~     (p=0.310 n=5+5)
StdCmd           31.8s ± 4%        31.8s ± 4%    ~     (p=0.841 n=5+5)

name       old user-time/op  new user-time/op  delta
Template         384ms ± 6%        399ms ± 5%    ~     (p=0.198 n=5+5)
Unicode          193ms ± 7%        192ms ± 6%    ~     (p=0.810 n=5+5)
GoTypes          1.35s ± 2%        1.34s ± 2%    ~     (p=0.357 n=5+5)
Compiler         5.96s ± 2%        6.08s ± 0%  +2.05%  (p=0.032 n=5+5)
SSA              14.8s ± 1%        14.9s ± 1%    ~     (p=0.310 n=5+5)
Flate            266ms ± 4%        264ms ± 6%    ~     (p=0.952 n=5+5)
GoParser         324ms ± 2%        322ms ± 4%    ~     (p=0.841 n=5+5)
Reflect          856ms ± 1%        858ms ± 3%    ~     (p=0.508 n=5+5)
Tar              374ms ± 5%        372ms ± 2%    ~     (p=0.460 n=5+5)
XML              462ms ± 1%        461ms ± 5%    ~     (p=0.841 n=5+5)

name       old text-bytes    new text-bytes    delta
HelloSize        677kB ± 0%        674kB ± 0%  -0.48%  (p=0.008 n=5+5)
CmdGoSize       7.25MB ± 0%       7.22MB ± 0%  -0.44%  (p=0.008 n=5+5)

name       old data-bytes    new data-bytes    delta
HelloSize       9.91kB ± 0%       9.91kB ± 0%    ~     (all equal)
CmdGoSize        248kB ± 0%        248kB ± 0%    ~     (all equal)

name       old bss-bytes     new bss-bytes     delta
HelloSize        125kB ± 0%        125kB ± 0%    ~     (all equal)
CmdGoSize        145kB ± 0%        145kB ± 0%    ~     (all equal)

name       old exe-bytes     new exe-bytes     delta
HelloSize       1.46MB ± 0%       1.44MB ± 0%  -0.84%  (p=0.008 n=5+5)
CmdGoSize       14.7MB ± 0%       14.6MB ± 0%  -0.31%  (p=0.008 n=5+5)

@randall77
Contributor

Would it be enough if we share all but the first instruction in the duplicate basic blocks? That would leave one instruction on which to hang a line number.

I think it would be useful to share epilogues, especially now that they are longer because we use frame pointers everywhere. Sharing epilogues doesn't cause any adverse debugging effects. I even wrote a CL for sharing epilogues once upon a time, but never completed it: https://go-review.googlesource.com/c/go/+/33634

@dr2chase
Contributor

When do we decide on that "first instruction"?

It would be complicated if, for example, you found common tails and then ran a dead code elimination or CSE phase that eliminated those "first instructions". Perhaps we should decide just before register allocation? Or do we prefer after, to give the register allocator greater freedom (which would probably make common tails much rarer, just because of the vagaries of register allocation)?

@aclements
Member

Given the small impact on binary size, and the small (and varied) performance improvement, I don't think this is worth the harm it does to debuggability and profile fidelity. Debuggability is one of our top priorities (partly because we did significant harm to it in recent releases and discovered how much of a problem that was). And reducing profile fidelity makes it harder for users to make more fundamental improvements to their code's performance.

It may be worth revisiting epilogue deduplication, though, as @randall77 pointed out. I wouldn't expect nearly as much win from it as, say, in C, since Go doesn't have callee-save registers that need to be restored other than the frame pointer. But it could still be a win without any downsides.

@golang golang locked and limited conversation to collaborators Apr 24, 2019