-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/compile: needlessly early return value loads in runtime.scanobject #19195
Comments
CL https://golang.org/cl/37300 mentions this issue. |
The tighten pass moves values closer to the blocks in which they are used. Prior to this CL, values that took a memory arg were never moved, because that could cause multiple memory values to be live across blocks. However, there are in fact cases in which such a value can safely be moved across blocks. The most common such case is loading the return values from a function call when only one control flow path actually uses those return values. In that case, we were unnecessarily eagerly loading all return values from the stack. This CL delays the decision about whether a value with a memory arg can be moved until we know where we'd like to move it to. Once we know the target block, we can check whether moving the value will cause problems. This CL uses a simple check: If an existing value in the target block already uses the same memory arg, and the value cannot fault, then it must be ok to add another use of it. This is enough to improve the situation in scanobject. Before: 0x025d 00605 (mgcmark.go:1180) CALL "".heapBitsForObject(SB) 0x0262 00610 (mgcmark.go:1180) MOVL 40(SP), AX 0x0266 00614 (mgcmark.go:1180) MOVQ 24(SP), CX 0x026b 00619 (mgcmark.go:1180) MOVQ 48(SP), DX 0x0270 00624 (mgcmark.go:1180) MOVQ 56(SP), BX 0x0275 00629 (mgcmark.go:1180) MOVQ 32(SP), SI 0x027a 00634 (mgcmark.go:1180) TESTQ CX, CX 0x027d 00637 (mgcmark.go:1180) JNE $0, 687 0x027f 00639 (mgcmark.go:1178) MOVQ "".arena_used+96(SP), AX 0x0284 00644 (mgcmark.go:1160) MOVL "".h.shift+68(SP), CX 0x0288 00648 (mgcmark.go:1178) MOVQ ""..autotmp_2642+112(SP), DX 0x028d 00653 (mgcmark.go:1174) MOVQ "".b+168(FP), BX 0x0295 00661 (mgcmark.go:1160) MOVQ "".h.bitp+128(SP), SI 0x029d 00669 (mgcmark.go:1153) MOVQ "".n+80(SP), DI 0x02a2 00674 (mgcmark.go:1153) MOVQ "".i+88(SP), R8 0x02a7 00679 (mgcmark.go:1160) MOVL CX, R9 0x02aa 00682 (mgcmark.go:1178) JMP 552 0x02af 00687 (mgcmark.go:1181) MOVQ CX, (SP) 0x02b3 00691 (mgcmark.go:1181) MOVQ "".b+168(FP), CX 0x02bb 00699 (mgcmark.go:1181) MOVQ CX, 8(SP) 0x02c0 00704 (mgcmark.go:1181) MOVQ "".i+88(SP), DI 0x02c5 00709 (mgcmark.go:1181) MOVQ DI, 16(SP) 0x02ca 00714 (mgcmark.go:1181) MOVQ SI, 24(SP) 0x02cf 00719 (mgcmark.go:1181) MOVL AX, 32(SP) 0x02d3 00723 (mgcmark.go:1181) MOVQ DX, 40(SP) 0x02d8 00728 (mgcmark.go:1181) MOVQ "".gcw+176(FP), AX 0x02e0 00736 (mgcmark.go:1181) MOVQ AX, 48(SP) 0x02e5 00741 (mgcmark.go:1181) MOVQ BX, 56(SP) 0x02ea 00746 (mgcmark.go:1181) PCDATA $0, $3 0x02ea 00746 (mgcmark.go:1181) CALL "".greyobject(SB) After: 0x025d 00605 (mgcmark.go:1180) CALL "".heapBitsForObject(SB) 0x0262 00610 (mgcmark.go:1180) MOVQ 24(SP), AX 0x0267 00615 (mgcmark.go:1180) TESTQ AX, AX 0x026a 00618 (mgcmark.go:1180) JNE $0, 665 0x026c 00620 (mgcmark.go:1178) MOVQ "".arena_used+96(SP), AX 0x0271 00625 (mgcmark.go:1160) MOVL "".h.shift+68(SP), CX 0x0275 00629 (mgcmark.go:1178) MOVQ ""..autotmp_2642+112(SP), DX 0x027a 00634 (mgcmark.go:1174) MOVQ "".b+168(FP), BX 0x0282 00642 (mgcmark.go:1160) MOVQ "".h.bitp+128(SP), SI 0x028a 00650 (mgcmark.go:1153) MOVQ "".n+80(SP), DI 0x028f 00655 (mgcmark.go:1153) MOVQ "".i+88(SP), R8 0x0294 00660 (mgcmark.go:1160) MOVL CX, R9 0x0297 00663 (mgcmark.go:1178) JMP 552 0x0299 00665 (mgcmark.go:1180) MOVL 40(SP), CX 0x029d 00669 (mgcmark.go:1180) MOVQ 48(SP), DX 0x02a2 00674 (mgcmark.go:1180) MOVQ 56(SP), BX 0x02a7 00679 (mgcmark.go:1180) MOVQ 32(SP), SI 0x02ac 00684 (mgcmark.go:1181) MOVQ AX, (SP) 0x02b0 00688 (mgcmark.go:1181) MOVQ "".b+168(FP), AX 0x02b8 00696 (mgcmark.go:1181) MOVQ AX, 8(SP) 0x02bd 00701 (mgcmark.go:1181) MOVQ "".i+88(SP), DI 0x02c2 00706 (mgcmark.go:1181) MOVQ DI, 16(SP) 0x02c7 00711 (mgcmark.go:1181) MOVQ SI, 24(SP) 0x02cc 00716 (mgcmark.go:1181) MOVL CX, 32(SP) 0x02d0 00720 (mgcmark.go:1181) MOVQ DX, 40(SP) 0x02d5 00725 (mgcmark.go:1181) MOVQ "".gcw+176(FP), CX 0x02dd 00733 (mgcmark.go:1181) MOVQ CX, 48(SP) 0x02e2 00738 (mgcmark.go:1181) MOVQ BX, 56(SP) 0x02e7 00743 (mgcmark.go:1181) PCDATA $0, $3 0x02e7 00743 (mgcmark.go:1181) CALL "".greyobject(SB) Observe that after this CL, the return values from heapBitsForObject are only loaded in a single control flow path. Updates golang#19195 The optimization appears to pay for itself, compile-time-wise: name old time/op new time/op delta Template 199ms ± 4% 198ms ± 3% -0.71% (p=0.036 n=49+45) Unicode 90.0ms ± 5% 89.3ms ± 5% -0.76% (p=0.022 n=50+49) GoTypes 558ms ± 4% 560ms ± 4% ~ (p=0.204 n=49+48) Compiler 2.50s ± 3% 2.51s ± 3% ~ (p=0.087 n=50+49) SSA 3.97s ± 3% 3.98s ± 3% ~ (p=0.843 n=49+50) Flate 126ms ± 3% 126ms ± 4% ~ (p=0.134 n=48+48) GoParser 148ms ± 3% 149ms ± 5% ~ (p=0.128 n=46+50) Reflect 365ms ± 4% 366ms ± 4% ~ (p=0.237 n=50+49) Tar 110ms ± 4% 109ms ± 5% -1.22% (p=0.006 n=50+47) XML 206ms ± 5% 207ms ± 3% ~ (p=0.145 n=49+50) compress/lzw benchmarks: name old time/op new time/op delta Decoder/1e4-8 85.0µs ± 3% 85.6µs ± 2% +0.74% (p=0.000 n=36+36) Decoder/1e5-8 811µs ± 2% 810µs ± 2% ~ (p=0.134 n=37+37) Decoder/1e6-8 8.00ms ± 2% 7.99ms ± 1% ~ (p=0.648 n=39+39) Encoder/1e4-8 174µs ± 1% 170µs ± 1% -2.41% (p=0.000 n=38+37) Encoder/1e5-8 1.57ms ± 2% 1.54ms ± 2% -2.21% (p=0.000 n=37+37) Encoder/1e6-8 15.7ms ± 1% 15.4ms ± 2% -2.10% (p=0.000 n=38+39) Change-Id: I486e26f096e55e6bb1cc794026664b1bc46a9391
I've been doing some work on this front. Branch is here: https://github.com/philhofer/go/tree/store-forward On that branch, that particular bit of code compiles to:
Alias analysis plus tighten plus a load-sinking pass is clever enough to push the loads down until they absolutely need to be loaded. In this particular example the shuffling within the destination basic block doesn't help much, but it appears to help most go programs. Geomean improvement in go1bench is about 6% right now. |
This change introduces alias analysis into the ssa backend, and uses it to tighten some loads with their uses. Since stack loads are non-faulting, tighten is now often able to postpone loading function return values until their uses. Given code like res, ok := fn() if !ok { return } // use res The 'res' return value won't be loaded until after examining 'ok.' Fixed golang#19195 Change-Id: Ibe35216e1bc3b0ff937cee9b3bac64f40f087b61
This change introduces alias analysis into the ssa backend, and uses it to tighten some loads with their uses. Since stack loads are non-faulting, tighten is now often able to postpone loading function return values until their uses. Given code like res, ok := fn() if !ok { return } // use res The 'res' return value won't be loaded until after examining 'ok.' Fixed golang#19195 Benchmarks on linux/arm: name old time/op new time/op delta BinaryTree17-4 31.1s ± 0% 30.9s ± 0% -0.39% (p=0.001 n=9+8) Fannkuch11-4 13.9s ± 0% 14.1s ± 0% +0.90% (p=0.000 n=8+9) FmtFprintfEmpty-4 666ns ± 5% 674ns ± 4% ~ (p=0.424 n=10+10) FmtFprintfString-4 1.09µs ± 4% 1.12µs ± 4% +2.62% (p=0.027 n=10+10) FmtFprintfInt-4 1.22µs ± 3% 1.16µs ± 3% -4.37% (p=0.000 n=10+10) FmtFprintfIntInt-4 1.74µs ± 2% 1.64µs ± 3% -6.00% (p=0.000 n=10+10) FmtFprintfPrefixedInt-4 1.75µs ± 1% 1.69µs ± 3% -3.80% (p=0.000 n=9+10) FmtFprintfFloat-4 3.27µs ± 1% 3.41µs ± 1% +4.33% (p=0.000 n=10+9) FmtManyArgs-4 6.49µs ± 1% 6.23µs ± 2% -4.10% (p=0.000 n=10+10) GobDecode-4 87.1ms ± 1% 75.5ms ± 1% -13.37% (p=0.000 n=10+10) GobEncode-4 69.4ms ± 1% 69.7ms ± 1% ~ (p=0.190 n=10+10) Gzip-4 3.56s ± 1% 3.57s ± 1% ~ (p=0.053 n=10+9) Gunzip-4 446ms ± 2% 442ms ± 2% ~ (p=0.123 n=10+10) HTTPClientServer-4 1.51ms ± 1% 1.55ms ± 3% +2.26% (p=0.001 n=8+9) JSONEncode-4 191ms ± 1% 175ms ± 2% -8.33% (p=0.000 n=10+10) JSONDecode-4 798ms ± 1% 835ms ± 1% +4.65% (p=0.000 n=10+10) Mandelbrot200-4 33.6ms ± 0% 33.6ms ± 0% ~ (p=0.068 n=8+10) GoParse-4 42.4ms ± 1% 42.5ms ± 1% ~ (p=0.190 n=10+10) RegexpMatchEasy0_32-4 829ns ± 1% 853ns ± 1% +2.98% (p=0.000 n=9+8) RegexpMatchEasy0_1K-4 4.04µs ± 1% 4.03µs ± 1% ~ (p=0.986 n=10+10) RegexpMatchEasy1_32-4 889ns ± 2% 900ns ± 5% ~ (p=0.566 n=10+10) RegexpMatchEasy1_1K-4 6.01µs ± 2% 6.15µs ± 2% +2.29% (p=0.000 n=9+9) RegexpMatchMedium_32-4 1.35µs ± 3% 1.39µs ± 4% +2.26% (p=0.018 n=9+10) RegexpMatchMedium_1K-4 357µs ± 9% 352µs ± 2% ~ (p=0.968 n=10+9) RegexpMatchHard_32-4 22.2µs ± 6% 22.6µs ± 6% ~ (p=0.161 n=9+9) RegexpMatchHard_1K-4 652µs ± 4% 664µs ± 4% +1.91% (p=0.028 n=9+10) Revcomp-4 51.4ms ± 1% 51.3ms ± 2% ~ (p=0.353 n=10+10) Template-4 1.17s ± 2% 1.06s ± 2% -9.39% (p=0.000 n=10+10) TimeParse-4 4.44µs ± 1% 4.46µs ± 1% +0.50% (p=0.003 n=9+10) TimeFormat-4 9.30µs ± 1% 9.33µs ± 1% ~ (p=0.197 n=10+10) [Geo mean] 557µs 553µs -0.82% Change-Id: Ibe35216e1bc3b0ff937cee9b3bac64f40f087b61
This change introduces alias analysis into the ssa backend, and uses it to tighten some loads with their uses. Since stack loads are non-faulting, tighten is now often able to postpone loading function return values until their uses. Given code like res, ok := fn() if !ok { return } // use res The 'res' return value won't be loaded until after examining 'ok.' Fixed golang#19195 Benchmarks on linux/arm: name old time/op new time/op delta BinaryTree17-4 31.1s ± 0% 30.9s ± 0% -0.39% (p=0.001 n=9+8) Fannkuch11-4 13.9s ± 0% 14.1s ± 0% +0.90% (p=0.000 n=8+9) FmtFprintfEmpty-4 666ns ± 5% 674ns ± 4% ~ (p=0.424 n=10+10) FmtFprintfString-4 1.09µs ± 4% 1.12µs ± 4% +2.62% (p=0.027 n=10+10) FmtFprintfInt-4 1.22µs ± 3% 1.16µs ± 3% -4.37% (p=0.000 n=10+10) FmtFprintfIntInt-4 1.74µs ± 2% 1.64µs ± 3% -6.00% (p=0.000 n=10+10) FmtFprintfPrefixedInt-4 1.75µs ± 1% 1.69µs ± 3% -3.80% (p=0.000 n=9+10) FmtFprintfFloat-4 3.27µs ± 1% 3.41µs ± 1% +4.33% (p=0.000 n=10+9) FmtManyArgs-4 6.49µs ± 1% 6.23µs ± 2% -4.10% (p=0.000 n=10+10) GobDecode-4 87.1ms ± 1% 75.5ms ± 1% -13.37% (p=0.000 n=10+10) GobEncode-4 69.4ms ± 1% 69.7ms ± 1% ~ (p=0.190 n=10+10) Gzip-4 3.56s ± 1% 3.57s ± 1% ~ (p=0.053 n=10+9) Gunzip-4 446ms ± 2% 442ms ± 2% ~ (p=0.123 n=10+10) HTTPClientServer-4 1.51ms ± 1% 1.55ms ± 3% +2.26% (p=0.001 n=8+9) JSONEncode-4 191ms ± 1% 175ms ± 2% -8.33% (p=0.000 n=10+10) JSONDecode-4 798ms ± 1% 835ms ± 1% +4.65% (p=0.000 n=10+10) Mandelbrot200-4 33.6ms ± 0% 33.6ms ± 0% ~ (p=0.068 n=8+10) GoParse-4 42.4ms ± 1% 42.5ms ± 1% ~ (p=0.190 n=10+10) RegexpMatchEasy0_32-4 829ns ± 1% 853ns ± 1% +2.98% (p=0.000 n=9+8) RegexpMatchEasy0_1K-4 4.04µs ± 1% 4.03µs ± 1% ~ (p=0.986 n=10+10) RegexpMatchEasy1_32-4 889ns ± 2% 900ns ± 5% ~ (p=0.566 n=10+10) RegexpMatchEasy1_1K-4 6.01µs ± 2% 6.15µs ± 2% +2.29% (p=0.000 n=9+9) RegexpMatchMedium_32-4 1.35µs ± 3% 1.39µs ± 4% +2.26% (p=0.018 n=9+10) RegexpMatchMedium_1K-4 357µs ± 9% 352µs ± 2% ~ (p=0.968 n=10+9) RegexpMatchHard_32-4 22.2µs ± 6% 22.6µs ± 6% ~ (p=0.161 n=9+9) RegexpMatchHard_1K-4 652µs ± 4% 664µs ± 4% +1.91% (p=0.028 n=9+10) Revcomp-4 51.4ms ± 1% 51.3ms ± 2% ~ (p=0.353 n=10+10) Template-4 1.17s ± 2% 1.06s ± 2% -9.39% (p=0.000 n=10+10) TimeParse-4 4.44µs ± 1% 4.46µs ± 1% +0.50% (p=0.003 n=9+10) TimeFormat-4 9.30µs ± 1% 9.33µs ± 1% ~ (p=0.197 n=10+10) [Geo mean] 557µs 553µs -0.82% Change-Id: Ibe35216e1bc3b0ff937cee9b3bac64f40f087b61
This change introduces alias analysis into the ssa backend, and uses it to tighten some loads with their uses. Since stack loads are non-faulting, tighten is now often able to postpone loading function return values until their uses. Given code like res, ok := fn() if !ok { return } // use res The 'res' return value won't be loaded until after examining 'ok.' Fixed golang#19195 Benchmarks on linux/arm: name old time/op new time/op delta BinaryTree17-4 31.1s ± 0% 30.9s ± 0% -0.39% (p=0.001 n=9+8) Fannkuch11-4 13.9s ± 0% 14.1s ± 0% +0.90% (p=0.000 n=8+9) FmtFprintfEmpty-4 666ns ± 5% 674ns ± 4% ~ (p=0.424 n=10+10) FmtFprintfString-4 1.09µs ± 4% 1.12µs ± 4% +2.62% (p=0.027 n=10+10) FmtFprintfInt-4 1.22µs ± 3% 1.16µs ± 3% -4.37% (p=0.000 n=10+10) FmtFprintfIntInt-4 1.74µs ± 2% 1.64µs ± 3% -6.00% (p=0.000 n=10+10) FmtFprintfPrefixedInt-4 1.75µs ± 1% 1.69µs ± 3% -3.80% (p=0.000 n=9+10) FmtFprintfFloat-4 3.27µs ± 1% 3.41µs ± 1% +4.33% (p=0.000 n=10+9) FmtManyArgs-4 6.49µs ± 1% 6.23µs ± 2% -4.10% (p=0.000 n=10+10) GobDecode-4 87.1ms ± 1% 75.5ms ± 1% -13.37% (p=0.000 n=10+10) GobEncode-4 69.4ms ± 1% 69.7ms ± 1% ~ (p=0.190 n=10+10) Gzip-4 3.56s ± 1% 3.57s ± 1% ~ (p=0.053 n=10+9) Gunzip-4 446ms ± 2% 442ms ± 2% ~ (p=0.123 n=10+10) HTTPClientServer-4 1.51ms ± 1% 1.55ms ± 3% +2.26% (p=0.001 n=8+9) JSONEncode-4 191ms ± 1% 175ms ± 2% -8.33% (p=0.000 n=10+10) JSONDecode-4 798ms ± 1% 835ms ± 1% +4.65% (p=0.000 n=10+10) Mandelbrot200-4 33.6ms ± 0% 33.6ms ± 0% ~ (p=0.068 n=8+10) GoParse-4 42.4ms ± 1% 42.5ms ± 1% ~ (p=0.190 n=10+10) RegexpMatchEasy0_32-4 829ns ± 1% 853ns ± 1% +2.98% (p=0.000 n=9+8) RegexpMatchEasy0_1K-4 4.04µs ± 1% 4.03µs ± 1% ~ (p=0.986 n=10+10) RegexpMatchEasy1_32-4 889ns ± 2% 900ns ± 5% ~ (p=0.566 n=10+10) RegexpMatchEasy1_1K-4 6.01µs ± 2% 6.15µs ± 2% +2.29% (p=0.000 n=9+9) RegexpMatchMedium_32-4 1.35µs ± 3% 1.39µs ± 4% +2.26% (p=0.018 n=9+10) RegexpMatchMedium_1K-4 357µs ± 9% 352µs ± 2% ~ (p=0.968 n=10+9) RegexpMatchHard_32-4 22.2µs ± 6% 22.6µs ± 6% ~ (p=0.161 n=9+9) RegexpMatchHard_1K-4 652µs ± 4% 664µs ± 4% +1.91% (p=0.028 n=9+10) Revcomp-4 51.4ms ± 1% 51.3ms ± 2% ~ (p=0.353 n=10+10) Template-4 1.17s ± 2% 1.06s ± 2% -9.39% (p=0.000 n=10+10) TimeParse-4 4.44µs ± 1% 4.46µs ± 1% +0.50% (p=0.003 n=9+10) TimeFormat-4 9.30µs ± 1% 9.33µs ± 1% ~ (p=0.197 n=10+10) [Geo mean] 557µs 553µs -0.82% Change-Id: Ibe35216e1bc3b0ff937cee9b3bac64f40f087b61
CL https://golang.org/cl/38448 mentions this issue. |
This change introduces alias analysis into the ssa backend, and uses it to tighten some loads with their uses. Given code like res, ok := fn() if !ok { return } // use res The 'res' return value won't be loaded until after examining 'ok.' Alias analysis also enables tightening of loads from constant memory (strings) and certain autotmp stack slots. A follow-up CL will introduce alias analysis to dead-store elimination. Another will introduce load elimination. Fixed golang#19195 Benchmarks on linux/arm: name old time/op new time/op delta BinaryTree17-4 31.1s ± 0% 30.9s ± 0% -0.39% (p=0.001 n=9+8) Fannkuch11-4 13.9s ± 0% 14.1s ± 0% +0.90% (p=0.000 n=8+9) FmtFprintfEmpty-4 666ns ± 5% 674ns ± 4% ~ (p=0.424 n=10+10) FmtFprintfString-4 1.09µs ± 4% 1.12µs ± 4% +2.62% (p=0.027 n=10+10) FmtFprintfInt-4 1.22µs ± 3% 1.16µs ± 3% -4.37% (p=0.000 n=10+10) FmtFprintfIntInt-4 1.74µs ± 2% 1.64µs ± 3% -6.00% (p=0.000 n=10+10) FmtFprintfPrefixedInt-4 1.75µs ± 1% 1.69µs ± 3% -3.80% (p=0.000 n=9+10) FmtFprintfFloat-4 3.27µs ± 1% 3.41µs ± 1% +4.33% (p=0.000 n=10+9) FmtManyArgs-4 6.49µs ± 1% 6.23µs ± 2% -4.10% (p=0.000 n=10+10) GobDecode-4 87.1ms ± 1% 75.5ms ± 1% -13.37% (p=0.000 n=10+10) GobEncode-4 69.4ms ± 1% 69.7ms ± 1% ~ (p=0.190 n=10+10) Gzip-4 3.56s ± 1% 3.57s ± 1% ~ (p=0.053 n=10+9) Gunzip-4 446ms ± 2% 442ms ± 2% ~ (p=0.123 n=10+10) HTTPClientServer-4 1.51ms ± 1% 1.55ms ± 3% +2.26% (p=0.001 n=8+9) JSONEncode-4 191ms ± 1% 175ms ± 2% -8.33% (p=0.000 n=10+10) JSONDecode-4 798ms ± 1% 835ms ± 1% +4.65% (p=0.000 n=10+10) Mandelbrot200-4 33.6ms ± 0% 33.6ms ± 0% ~ (p=0.068 n=8+10) GoParse-4 42.4ms ± 1% 42.5ms ± 1% ~ (p=0.190 n=10+10) RegexpMatchEasy0_32-4 829ns ± 1% 853ns ± 1% +2.98% (p=0.000 n=9+8) RegexpMatchEasy0_1K-4 4.04µs ± 1% 4.03µs ± 1% ~ (p=0.986 n=10+10) RegexpMatchEasy1_32-4 889ns ± 2% 900ns ± 5% ~ (p=0.566 n=10+10) RegexpMatchEasy1_1K-4 6.01µs ± 2% 6.15µs ± 2% +2.29% (p=0.000 n=9+9) RegexpMatchMedium_32-4 1.35µs ± 3% 1.39µs ± 4% +2.26% (p=0.018 n=9+10) RegexpMatchMedium_1K-4 357µs ± 9% 352µs ± 2% ~ (p=0.968 n=10+9) RegexpMatchHard_32-4 22.2µs ± 6% 22.6µs ± 6% ~ (p=0.161 n=9+9) RegexpMatchHard_1K-4 652µs ± 4% 664µs ± 4% +1.91% (p=0.028 n=9+10) Revcomp-4 51.4ms ± 1% 51.3ms ± 2% ~ (p=0.353 n=10+10) Template-4 1.17s ± 2% 1.06s ± 2% -9.39% (p=0.000 n=10+10) TimeParse-4 4.44µs ± 1% 4.46µs ± 1% +0.50% (p=0.003 n=9+10) TimeFormat-4 9.30µs ± 1% 9.33µs ± 1% ~ (p=0.197 n=10+10) [Geo mean] 557µs 553µs -0.82% Change-Id: Ibe35216e1bc3b0ff937cee9b3bac64f40f087b61
This change introduces alias analysis into the ssa backend, and uses it to tighten some loads with their uses. Given code like res, ok := fn() if !ok { return } // use res The 'res' return value won't be loaded until after examining 'ok.' Alias analysis also enables tightening of loads from constant memory (strings) and certain autotmp stack slots. A follow-up CL will introduce alias analysis to dead-store elimination. Another will introduce load elimination. Fixed golang#19195 Benchmarks on linux/arm: name old time/op new time/op delta BinaryTree17-4 31.1s ± 0% 30.9s ± 0% -0.39% (p=0.001 n=9+8) Fannkuch11-4 13.9s ± 0% 14.1s ± 0% +0.90% (p=0.000 n=8+9) FmtFprintfEmpty-4 666ns ± 5% 674ns ± 4% ~ (p=0.424 n=10+10) FmtFprintfString-4 1.09µs ± 4% 1.12µs ± 4% +2.62% (p=0.027 n=10+10) FmtFprintfInt-4 1.22µs ± 3% 1.16µs ± 3% -4.37% (p=0.000 n=10+10) FmtFprintfIntInt-4 1.74µs ± 2% 1.64µs ± 3% -6.00% (p=0.000 n=10+10) FmtFprintfPrefixedInt-4 1.75µs ± 1% 1.69µs ± 3% -3.80% (p=0.000 n=9+10) FmtFprintfFloat-4 3.27µs ± 1% 3.41µs ± 1% +4.33% (p=0.000 n=10+9) FmtManyArgs-4 6.49µs ± 1% 6.23µs ± 2% -4.10% (p=0.000 n=10+10) GobDecode-4 87.1ms ± 1% 75.5ms ± 1% -13.37% (p=0.000 n=10+10) GobEncode-4 69.4ms ± 1% 69.7ms ± 1% ~ (p=0.190 n=10+10) Gzip-4 3.56s ± 1% 3.57s ± 1% ~ (p=0.053 n=10+9) Gunzip-4 446ms ± 2% 442ms ± 2% ~ (p=0.123 n=10+10) HTTPClientServer-4 1.51ms ± 1% 1.55ms ± 3% +2.26% (p=0.001 n=8+9) JSONEncode-4 191ms ± 1% 175ms ± 2% -8.33% (p=0.000 n=10+10) JSONDecode-4 798ms ± 1% 835ms ± 1% +4.65% (p=0.000 n=10+10) Mandelbrot200-4 33.6ms ± 0% 33.6ms ± 0% ~ (p=0.068 n=8+10) GoParse-4 42.4ms ± 1% 42.5ms ± 1% ~ (p=0.190 n=10+10) RegexpMatchEasy0_32-4 829ns ± 1% 853ns ± 1% +2.98% (p=0.000 n=9+8) RegexpMatchEasy0_1K-4 4.04µs ± 1% 4.03µs ± 1% ~ (p=0.986 n=10+10) RegexpMatchEasy1_32-4 889ns ± 2% 900ns ± 5% ~ (p=0.566 n=10+10) RegexpMatchEasy1_1K-4 6.01µs ± 2% 6.15µs ± 2% +2.29% (p=0.000 n=9+9) RegexpMatchMedium_32-4 1.35µs ± 3% 1.39µs ± 4% +2.26% (p=0.018 n=9+10) RegexpMatchMedium_1K-4 357µs ± 9% 352µs ± 2% ~ (p=0.968 n=10+9) RegexpMatchHard_32-4 22.2µs ± 6% 22.6µs ± 6% ~ (p=0.161 n=9+9) RegexpMatchHard_1K-4 652µs ± 4% 664µs ± 4% +1.91% (p=0.028 n=9+10) Revcomp-4 51.4ms ± 1% 51.3ms ± 2% ~ (p=0.353 n=10+10) Template-4 1.17s ± 2% 1.06s ± 2% -9.39% (p=0.000 n=10+10) TimeParse-4 4.44µs ± 1% 4.46µs ± 1% +0.50% (p=0.003 n=9+10) TimeFormat-4 9.30µs ± 1% 9.33µs ± 1% ~ (p=0.197 n=10+10) [Geo mean] 557µs 553µs -0.82% Change-Id: Ibe35216e1bc3b0ff937cee9b3bac64f40f087b61
This change introduces alias analysis into the ssa backend, and uses it to tighten some loads with their uses. Given code like res, ok := fn() if !ok { return } // use res The 'res' return value won't be loaded until after examining 'ok.' Alias analysis also enables tightening of loads from constant memory (strings) and certain autotmp stack slots. A follow-up CL will introduce alias analysis to dead-store elimination. Another will introduce load elimination. Fixed golang#19195 Benchmarks on linux/arm: name old time/op new time/op delta BinaryTree17-4 31.1s ± 0% 30.9s ± 0% -0.39% (p=0.001 n=9+8) Fannkuch11-4 13.9s ± 0% 14.1s ± 0% +0.90% (p=0.000 n=8+9) FmtFprintfEmpty-4 666ns ± 5% 674ns ± 4% ~ (p=0.424 n=10+10) FmtFprintfString-4 1.09µs ± 4% 1.12µs ± 4% +2.62% (p=0.027 n=10+10) FmtFprintfInt-4 1.22µs ± 3% 1.16µs ± 3% -4.37% (p=0.000 n=10+10) FmtFprintfIntInt-4 1.74µs ± 2% 1.64µs ± 3% -6.00% (p=0.000 n=10+10) FmtFprintfPrefixedInt-4 1.75µs ± 1% 1.69µs ± 3% -3.80% (p=0.000 n=9+10) FmtFprintfFloat-4 3.27µs ± 1% 3.41µs ± 1% +4.33% (p=0.000 n=10+9) FmtManyArgs-4 6.49µs ± 1% 6.23µs ± 2% -4.10% (p=0.000 n=10+10) GobDecode-4 87.1ms ± 1% 75.5ms ± 1% -13.37% (p=0.000 n=10+10) GobEncode-4 69.4ms ± 1% 69.7ms ± 1% ~ (p=0.190 n=10+10) Gzip-4 3.56s ± 1% 3.57s ± 1% ~ (p=0.053 n=10+9) Gunzip-4 446ms ± 2% 442ms ± 2% ~ (p=0.123 n=10+10) HTTPClientServer-4 1.51ms ± 1% 1.55ms ± 3% +2.26% (p=0.001 n=8+9) JSONEncode-4 191ms ± 1% 175ms ± 2% -8.33% (p=0.000 n=10+10) JSONDecode-4 798ms ± 1% 835ms ± 1% +4.65% (p=0.000 n=10+10) Mandelbrot200-4 33.6ms ± 0% 33.6ms ± 0% ~ (p=0.068 n=8+10) GoParse-4 42.4ms ± 1% 42.5ms ± 1% ~ (p=0.190 n=10+10) RegexpMatchEasy0_32-4 829ns ± 1% 853ns ± 1% +2.98% (p=0.000 n=9+8) RegexpMatchEasy0_1K-4 4.04µs ± 1% 4.03µs ± 1% ~ (p=0.986 n=10+10) RegexpMatchEasy1_32-4 889ns ± 2% 900ns ± 5% ~ (p=0.566 n=10+10) RegexpMatchEasy1_1K-4 6.01µs ± 2% 6.15µs ± 2% +2.29% (p=0.000 n=9+9) RegexpMatchMedium_32-4 1.35µs ± 3% 1.39µs ± 4% +2.26% (p=0.018 n=9+10) RegexpMatchMedium_1K-4 357µs ± 9% 352µs ± 2% ~ (p=0.968 n=10+9) RegexpMatchHard_32-4 22.2µs ± 6% 22.6µs ± 6% ~ (p=0.161 n=9+9) RegexpMatchHard_1K-4 652µs ± 4% 664µs ± 4% +1.91% (p=0.028 n=9+10) Revcomp-4 51.4ms ± 1% 51.3ms ± 2% ~ (p=0.353 n=10+10) Template-4 1.17s ± 2% 1.06s ± 2% -9.39% (p=0.000 n=10+10) TimeParse-4 4.44µs ± 1% 4.46µs ± 1% +0.50% (p=0.003 n=9+10) TimeFormat-4 9.30µs ± 1% 9.33µs ± 1% ~ (p=0.197 n=10+10) [Geo mean] 557µs 553µs -0.82% Change-Id: Ibe35216e1bc3b0ff937cee9b3bac64f40f087b61
This change introduces alias analysis into the ssa backend, and uses it to tighten some loads with their uses. Given code like res, ok := fn() if !ok { return } // use res The 'res' return value won't be loaded until after examining 'ok.' Alias analysis also enables tightening of loads from constant memory (strings) and certain autotmp stack slots. A follow-up CL will introduce alias analysis to dead-store elimination. Another will introduce load elimination. Fixed golang#19195 Benchmarks on linux/arm: name old time/op new time/op delta BinaryTree17-4 31.1s ± 0% 30.9s ± 0% -0.39% (p=0.001 n=9+8) Fannkuch11-4 13.9s ± 0% 14.1s ± 0% +0.90% (p=0.000 n=8+9) FmtFprintfEmpty-4 666ns ± 5% 674ns ± 4% ~ (p=0.424 n=10+10) FmtFprintfString-4 1.09µs ± 4% 1.12µs ± 4% +2.62% (p=0.027 n=10+10) FmtFprintfInt-4 1.22µs ± 3% 1.16µs ± 3% -4.37% (p=0.000 n=10+10) FmtFprintfIntInt-4 1.74µs ± 2% 1.64µs ± 3% -6.00% (p=0.000 n=10+10) FmtFprintfPrefixedInt-4 1.75µs ± 1% 1.69µs ± 3% -3.80% (p=0.000 n=9+10) FmtFprintfFloat-4 3.27µs ± 1% 3.41µs ± 1% +4.33% (p=0.000 n=10+9) FmtManyArgs-4 6.49µs ± 1% 6.23µs ± 2% -4.10% (p=0.000 n=10+10) GobDecode-4 87.1ms ± 1% 75.5ms ± 1% -13.37% (p=0.000 n=10+10) GobEncode-4 69.4ms ± 1% 69.7ms ± 1% ~ (p=0.190 n=10+10) Gzip-4 3.56s ± 1% 3.57s ± 1% ~ (p=0.053 n=10+9) Gunzip-4 446ms ± 2% 442ms ± 2% ~ (p=0.123 n=10+10) HTTPClientServer-4 1.51ms ± 1% 1.55ms ± 3% +2.26% (p=0.001 n=8+9) JSONEncode-4 191ms ± 1% 175ms ± 2% -8.33% (p=0.000 n=10+10) JSONDecode-4 798ms ± 1% 835ms ± 1% +4.65% (p=0.000 n=10+10) Mandelbrot200-4 33.6ms ± 0% 33.6ms ± 0% ~ (p=0.068 n=8+10) GoParse-4 42.4ms ± 1% 42.5ms ± 1% ~ (p=0.190 n=10+10) RegexpMatchEasy0_32-4 829ns ± 1% 853ns ± 1% +2.98% (p=0.000 n=9+8) RegexpMatchEasy0_1K-4 4.04µs ± 1% 4.03µs ± 1% ~ (p=0.986 n=10+10) RegexpMatchEasy1_32-4 889ns ± 2% 900ns ± 5% ~ (p=0.566 n=10+10) RegexpMatchEasy1_1K-4 6.01µs ± 2% 6.15µs ± 2% +2.29% (p=0.000 n=9+9) RegexpMatchMedium_32-4 1.35µs ± 3% 1.39µs ± 4% +2.26% (p=0.018 n=9+10) RegexpMatchMedium_1K-4 357µs ± 9% 352µs ± 2% ~ (p=0.968 n=10+9) RegexpMatchHard_32-4 22.2µs ± 6% 22.6µs ± 6% ~ (p=0.161 n=9+9) RegexpMatchHard_1K-4 652µs ± 4% 664µs ± 4% +1.91% (p=0.028 n=9+10) Revcomp-4 51.4ms ± 1% 51.3ms ± 2% ~ (p=0.353 n=10+10) Template-4 1.17s ± 2% 1.06s ± 2% -9.39% (p=0.000 n=10+10) TimeParse-4 4.44µs ± 1% 4.46µs ± 1% +0.50% (p=0.003 n=9+10) TimeFormat-4 9.30µs ± 1% 9.33µs ± 1% ~ (p=0.197 n=10+10) [Geo mean] 557µs 553µs -0.82% Change-Id: Ibe35216e1bc3b0ff937cee9b3bac64f40f087b61
The tighten pass moves values closer to the blocks in which they are used. Prior to this CL, values that took a memory arg were never moved, because that could cause multiple memory values to be live across blocks. However, there are in fact cases in which such a value can safely be moved across blocks. The most common such case is loading the return values from a function call when only one control flow path actually uses those return values. In that case, we were unnecessarily eagerly loading all return values from the stack. This CL delays the decision about whether a value with a memory arg can be moved until we know where we'd like to move it to. Once we know the target block, we can check whether moving the value will cause problems. This CL uses a simple check: If an existing value in the target block already uses the same memory arg, and the value cannot fault, then it must be ok to add another use of it. This is enough to improve the situation in scanobject. Before: 0x025d 00605 (mgcmark.go:1180) CALL "".heapBitsForObject(SB) 0x0262 00610 (mgcmark.go:1180) MOVL 40(SP), AX 0x0266 00614 (mgcmark.go:1180) MOVQ 24(SP), CX 0x026b 00619 (mgcmark.go:1180) MOVQ 48(SP), DX 0x0270 00624 (mgcmark.go:1180) MOVQ 56(SP), BX 0x0275 00629 (mgcmark.go:1180) MOVQ 32(SP), SI 0x027a 00634 (mgcmark.go:1180) TESTQ CX, CX 0x027d 00637 (mgcmark.go:1180) JNE $0, 687 0x027f 00639 (mgcmark.go:1178) MOVQ "".arena_used+96(SP), AX 0x0284 00644 (mgcmark.go:1160) MOVL "".h.shift+68(SP), CX 0x0288 00648 (mgcmark.go:1178) MOVQ ""..autotmp_2642+112(SP), DX 0x028d 00653 (mgcmark.go:1174) MOVQ "".b+168(FP), BX 0x0295 00661 (mgcmark.go:1160) MOVQ "".h.bitp+128(SP), SI 0x029d 00669 (mgcmark.go:1153) MOVQ "".n+80(SP), DI 0x02a2 00674 (mgcmark.go:1153) MOVQ "".i+88(SP), R8 0x02a7 00679 (mgcmark.go:1160) MOVL CX, R9 0x02aa 00682 (mgcmark.go:1178) JMP 552 0x02af 00687 (mgcmark.go:1181) MOVQ CX, (SP) 0x02b3 00691 (mgcmark.go:1181) MOVQ "".b+168(FP), CX 0x02bb 00699 (mgcmark.go:1181) MOVQ CX, 8(SP) 0x02c0 00704 (mgcmark.go:1181) MOVQ "".i+88(SP), DI 0x02c5 00709 (mgcmark.go:1181) MOVQ DI, 16(SP) 0x02ca 00714 (mgcmark.go:1181) MOVQ SI, 24(SP) 0x02cf 00719 (mgcmark.go:1181) MOVL AX, 32(SP) 0x02d3 00723 (mgcmark.go:1181) MOVQ DX, 40(SP) 0x02d8 00728 (mgcmark.go:1181) MOVQ "".gcw+176(FP), AX 0x02e0 00736 (mgcmark.go:1181) MOVQ AX, 48(SP) 0x02e5 00741 (mgcmark.go:1181) MOVQ BX, 56(SP) 0x02ea 00746 (mgcmark.go:1181) PCDATA $0, $3 0x02ea 00746 (mgcmark.go:1181) CALL "".greyobject(SB) After: 0x025d 00605 (mgcmark.go:1180) CALL "".heapBitsForObject(SB) 0x0262 00610 (mgcmark.go:1180) MOVQ 24(SP), AX 0x0267 00615 (mgcmark.go:1180) TESTQ AX, AX 0x026a 00618 (mgcmark.go:1180) JNE $0, 665 0x026c 00620 (mgcmark.go:1178) MOVQ "".arena_used+96(SP), AX 0x0271 00625 (mgcmark.go:1160) MOVL "".h.shift+68(SP), CX 0x0275 00629 (mgcmark.go:1178) MOVQ ""..autotmp_2642+112(SP), DX 0x027a 00634 (mgcmark.go:1174) MOVQ "".b+168(FP), BX 0x0282 00642 (mgcmark.go:1160) MOVQ "".h.bitp+128(SP), SI 0x028a 00650 (mgcmark.go:1153) MOVQ "".n+80(SP), DI 0x028f 00655 (mgcmark.go:1153) MOVQ "".i+88(SP), R8 0x0294 00660 (mgcmark.go:1160) MOVL CX, R9 0x0297 00663 (mgcmark.go:1178) JMP 552 0x0299 00665 (mgcmark.go:1180) MOVL 40(SP), CX 0x029d 00669 (mgcmark.go:1180) MOVQ 48(SP), DX 0x02a2 00674 (mgcmark.go:1180) MOVQ 56(SP), BX 0x02a7 00679 (mgcmark.go:1180) MOVQ 32(SP), SI 0x02ac 00684 (mgcmark.go:1181) MOVQ AX, (SP) 0x02b0 00688 (mgcmark.go:1181) MOVQ "".b+168(FP), AX 0x02b8 00696 (mgcmark.go:1181) MOVQ AX, 8(SP) 0x02bd 00701 (mgcmark.go:1181) MOVQ "".i+88(SP), DI 0x02c2 00706 (mgcmark.go:1181) MOVQ DI, 16(SP) 0x02c7 00711 (mgcmark.go:1181) MOVQ SI, 24(SP) 0x02cc 00716 (mgcmark.go:1181) MOVL CX, 32(SP) 0x02d0 00720 (mgcmark.go:1181) MOVQ DX, 40(SP) 0x02d5 00725 (mgcmark.go:1181) MOVQ "".gcw+176(FP), CX 0x02dd 00733 (mgcmark.go:1181) MOVQ CX, 48(SP) 0x02e2 00738 (mgcmark.go:1181) MOVQ BX, 56(SP) 0x02e7 00743 (mgcmark.go:1181) PCDATA $0, $3 0x02e7 00743 (mgcmark.go:1181) CALL "".greyobject(SB) Observe that after this CL, the return values from heapBitsForObject are only loaded in a single control flow path. Updates golang#19195 The optimization appears to pay for itself, compile-time-wise: name old time/op new time/op delta Template 199ms ± 4% 198ms ± 3% -0.71% (p=0.036 n=49+45) Unicode 90.0ms ± 5% 89.3ms ± 5% -0.76% (p=0.022 n=50+49) GoTypes 558ms ± 4% 560ms ± 4% ~ (p=0.204 n=49+48) Compiler 2.50s ± 3% 2.51s ± 3% ~ (p=0.087 n=50+49) SSA 3.97s ± 3% 3.98s ± 3% ~ (p=0.843 n=49+50) Flate 126ms ± 3% 126ms ± 4% ~ (p=0.134 n=48+48) GoParser 148ms ± 3% 149ms ± 5% ~ (p=0.128 n=46+50) Reflect 365ms ± 4% 366ms ± 4% ~ (p=0.237 n=50+49) Tar 110ms ± 4% 109ms ± 5% -1.22% (p=0.006 n=50+47) XML 206ms ± 5% 207ms ± 3% ~ (p=0.145 n=49+50) compress/lzw benchmarks: name old time/op new time/op delta Decoder/1e4-8 85.0µs ± 3% 85.6µs ± 2% +0.74% (p=0.000 n=36+36) Decoder/1e5-8 811µs ± 2% 810µs ± 2% ~ (p=0.134 n=37+37) Decoder/1e6-8 8.00ms ± 2% 7.99ms ± 1% ~ (p=0.648 n=39+39) Encoder/1e4-8 174µs ± 1% 170µs ± 1% -2.41% (p=0.000 n=38+37) Encoder/1e5-8 1.57ms ± 2% 1.54ms ± 2% -2.21% (p=0.000 n=37+37) Encoder/1e6-8 15.7ms ± 1% 15.4ms ± 2% -2.10% (p=0.000 n=38+39) Change-Id: I486e26f096e55e6bb1cc794026664b1bc46a9391
runtime.scanobject contains this bit of code:
This compiles to (excerpt):
The loads at instructions 0x025e, 0x0267, 0x026c, and 0x0271 are too early. They're loading return values from heapBitsForObject, but if the call to greyobject isn't necessary (if the jump at 0x0279 isn't taken), their values are unnecessary and will be overwritten. This is easy enough to see in the original code as well.
The loads are scheduled where they are due to memory ordering. But maybe there's something we can do to improve the situation, perhaps using the knowledge that stack slots and function return values are always disjoint in memory?
cc @randall77 @dr2chase @cherrymui
The text was updated successfully, but these errors were encountered: