cmd/compile: restore walkinrange optimization (by moving to SSA) #30645
(This is also an indicator that this optimization needs an explicit asm test.)
It appears that's the case. Given
on 178a2c4 (a little before that CL):
on a2ace8e (after):
And the generated code contains two
Not sure if this is exactly what you're after, but I've been tinkering with some simple logical optimizations for control flow graphs (it's very rough code at the moment, I was just experimenting, but it does work): https://go-review.googlesource.com/c/go/+/165998/ It can do optimizations such as the following for disjunctions (and similar optimizations for conjunctions):
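A hand-written Go illustration of the effect (the pass itself operates on SSA form, so this is only a sketch of the before/after shapes, not the CL's actual rewrite):

    // Two signed comparisons joined by ||, i.e. an extra block and branch in the CFG.
    func outOfRange(x int) bool {
        return x < 10 || x > 20
    }

    // The combined form: one subtraction and one unsigned comparison.
    // If x < 10, x-10 converts to a huge unsigned value, so the test still succeeds.
    func outOfRangeCombined(x int) bool {
        return uint(x-10) > 20-10
    }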
It also has some optimizations to clean up some uses of
I scheduled it after prove since I was a bit worried it might obscure some facts. Unfortunately it is quite unpleasant adding new optimizations, especially if you want to handle multiple bit widths. So I was thinking of extending the rewrite rules a bit so you could write something like this and have a new
Yes, indeed, something like this.
Makes sense. The right place to document that is an entry in passOrder.
Indeed. That’s why I wrote this optimization during walk the first time around. :)
I’m AFK for a while (which is also why I haven’t replied to your comments on the unsafe.Unreachable issue), and it’s not quite obvious to me whether this would work in practice. It’d be lovely if it did. Perhaps we could move the CMOV optimizations into such a form as well. In a pinch we could also create a new kind of rewrite rule + rulegen + runner geared specifically to this kind of optimization.
An acceptRange is used to check whether a byte falls in a range. The existing code used the naive way: lo <= x <= hi. Because bytes are unsigned, this can be rewritten as x-lo <= hi-lo. The compiler does this optimization (modulo golang#30645), but only when lo and hi are constants. In this case, lo and hi are effectively constants, but the compiler can't see that.

Change-Id: I9a4a87393915168bd59e28c8048e85e4c688aca5
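A minimal, self-contained Go sketch of the equivalence this commit relies on (the function names are illustrative, not the actual unicode/utf8 code):

    package main

    import "fmt"

    // Naive form: two comparisons. The compiler can fuse them into one
    // unsigned compare only when it can see that lo and hi are constants.
    func inRangeNaive(b, lo, hi byte) bool {
        return lo <= b && b <= hi
    }

    // Rewritten form: a single unsigned comparison. Valid because byte
    // arithmetic wraps, so b-lo becomes a large value whenever b < lo.
    func inRangeSub(b, lo, hi byte) bool {
        return b-lo <= hi-lo
    }

    func main() {
        // Spot-check that the two forms agree for one range (requires lo <= hi).
        lo, hi := byte(0x80), byte(0xBF)
        for i := 0; i < 256; i++ {
            if b := byte(i); inRangeNaive(b, lo, hi) != inRangeSub(b, lo, hi) {
                fmt.Println("mismatch at", b)
            }
        }
        fmt.Println("check complete")
    }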
@josharian would you still like to do this for 1.13?
I think this is important to do for 1.13, yes. It looked to me like maybe @mundaym was going to do it. I guess I'll take a look at it.
Change https://golang.org/cl/178197 mentions this issue:
I'm going to punt this to 1.14 after all. I finally have a working version, based very loosely on @mundaym's work. But it is definitely non-trivial, and Keith would (very reasonably) prefer to wait for 1.14 for CL 178197, on which my CL depends. I'll see about cleaning up and mailing my CL sooner rather than later, so it can be ready when the tree re-opens or in case of need.
While working on #30645, I noticed that many instances in which the walkinrange optimization could apply were not even being considered. This was because of extraneous blocks in the CFG, of the type that shortcircuit normally removes. The change improves the shortcircuit pass to handle most of those cases. (There are a few that can only be reasonably detected later in compilation, after other optimizations have been run, but not enough to be worth chasing.)

Notable changes:

* Instead of calculating live-across-blocks values, use v.Uses == 1. This is cheaper and more straightforward. v.Uses did not exist when this pass was initially written.
* Incorporate a fusePlain and loop until stable. This is necessary to find many of the instances.
* Allow Copy and Not wrappers around Phi values. This significantly increases effectiveness.
* Allow removal of all preds, creating a dead block. The previous pass stopped unnecessarily at one pred.
* Use phielimValue during cleanup instead of manually setting the op to OpCopy.

The result is marginally faster compilation and smaller code.

name        old time/op       new time/op       delta
Template    213ms ± 2%        212ms ± 2%        -0.63%  (p=0.002 n=49+48)
Unicode     90.0ms ± 2%       89.8ms ± 2%       ~       (p=0.122 n=48+48)
GoTypes     710ms ± 3%        711ms ± 2%        ~       (p=0.433 n=45+49)
Compiler    3.23s ± 2%        3.22s ± 2%        ~       (p=0.124 n=47+49)
SSA         10.0s ± 1%        10.0s ± 1%        -0.43%  (p=0.000 n=48+50)
Flate       135ms ± 3%        135ms ± 2%        ~       (p=0.311 n=49+49)
GoParser    158ms ± 2%        158ms ± 2%        ~       (p=0.757 n=48+48)
Reflect     447ms ± 2%        447ms ± 2%        ~       (p=0.815 n=49+48)
Tar         189ms ± 2%        189ms ± 3%        ~       (p=0.530 n=47+49)
XML         251ms ± 3%        250ms ± 1%        -0.75%  (p=0.002 n=49+48)
[Geo mean]  427ms             426ms             -0.25%

name        old user-time/op  new user-time/op  delta
Template    265ms ± 2%        265ms ± 2%        ~       (p=0.969 n=48+50)
Unicode     119ms ± 6%        119ms ± 6%        ~       (p=0.738 n=50+50)
GoTypes     923ms ± 2%        925ms ± 2%        ~       (p=0.057 n=43+47)
Compiler    4.37s ± 2%        4.37s ± 2%        ~       (p=0.691 n=50+46)
SSA         13.4s ± 1%        13.4s ± 1%        ~       (p=0.282 n=42+49)
Flate       162ms ± 2%        162ms ± 2%        ~       (p=0.774 n=48+50)
GoParser    186ms ± 2%        186ms ± 3%        ~       (p=0.213 n=47+47)
Reflect     572ms ± 2%        573ms ± 3%        ~       (p=0.303 n=50+49)
Tar         240ms ± 3%        240ms ± 2%        ~       (p=0.939 n=46+44)
XML         302ms ± 2%        302ms ± 2%        ~       (p=0.399 n=47+47)
[Geo mean]  540ms             541ms             +0.07%

name        old alloc/op      new alloc/op      delta
Template    36.8MB ± 0%       36.7MB ± 0%       -0.42%  (p=0.008 n=5+5)
Unicode     28.1MB ± 0%       28.1MB ± 0%       ~       (p=0.151 n=5+5)
GoTypes     124MB ± 0%        124MB ± 0%        -0.26%  (p=0.008 n=5+5)
Compiler    571MB ± 0%        566MB ± 0%        -0.84%  (p=0.008 n=5+5)
SSA         1.86GB ± 0%       1.85GB ± 0%       -0.58%  (p=0.008 n=5+5)
Flate       22.8MB ± 0%       22.8MB ± 0%       -0.17%  (p=0.008 n=5+5)
GoParser    27.3MB ± 0%       27.3MB ± 0%       -0.20%  (p=0.008 n=5+5)
Reflect     79.5MB ± 0%       79.3MB ± 0%       -0.20%  (p=0.008 n=5+5)
Tar         34.7MB ± 0%       34.6MB ± 0%       -0.42%  (p=0.008 n=5+5)
XML         45.4MB ± 0%       45.3MB ± 0%       -0.29%  (p=0.008 n=5+5)
[Geo mean]  80.0MB            79.7MB            -0.34%

name        old allocs/op     new allocs/op     delta
Template    378k ± 0%         377k ± 0%         -0.22%  (p=0.008 n=5+5)
Unicode     339k ± 0%         339k ± 0%         ~       (p=0.643 n=5+5)
GoTypes     1.36M ± 0%        1.36M ± 0%        -0.10%  (p=0.008 n=5+5)
Compiler    5.51M ± 0%        5.50M ± 0%        -0.13%  (p=0.008 n=5+5)
SSA         17.5M ± 0%        17.5M ± 0%        -0.14%  (p=0.008 n=5+5)
Flate       234k ± 0%         234k ± 0%         -0.04%  (p=0.008 n=5+5)
GoParser    299k ± 0%         299k ± 0%         -0.05%  (p=0.008 n=5+5)
Reflect     978k ± 0%         979k ± 0%         +0.02%  (p=0.016 n=5+5)
Tar         351k ± 0%         351k ± 0%         -0.04%  (p=0.008 n=5+5)
XML         435k ± 0%         435k ± 0%         -0.11%  (p=0.008 n=5+5)
[Geo mean]  840k              840k              -0.08%

file       before     after      Δ        %
go         14794788   14770212   -24576   -0.166%
addr2line  4203688    4199592    -4096    -0.097%
api        5954056    5941768    -12288   -0.206%
asm        4862704    4846320    -16384   -0.337%
cgo        4778920    4770728    -8192    -0.171%
compile    24001568   23923792   -77776   -0.324%
cover      5198440    5190248    -8192    -0.158%
dist       3595248    3587056    -8192    -0.228%
doc        4618504    4610312    -8192    -0.177%
fix        3337416    3333320    -4096    -0.123%
link       6120408    6116312    -4096    -0.067%
nm         4149064    4140872    -8192    -0.197%
objdump    4555608    4547416    -8192    -0.180%
pprof      14616324   14595844   -20480   -0.140%
test2json  2766328    2762232    -4096    -0.148%
trace      11638844   11622460   -16384   -0.141%
vet        8274936    8258552    -16384   -0.198%
total      132520780  132270972  -249808  -0.189%

Change-Id: Ifcd235a2a6e5f13ed5c93e62523e2ef61321fccf
Reviewed-on: https://go-review.googlesource.com/c/go/+/178197
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
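For context on the "Copy and Not wrappers around Phi values" item above, a rough Go-level sketch (the function is illustrative, not taken from the CL): a short-circuit condition used as a value compiles to extra blocks merged by a boolean Phi, and negating it can leave a Not wrapped around that Phi, which the old pass would not look through.

    // The && chain introduces an extra block joined by a Phi of booleans;
    // the ! may then sit as a Not on top of that Phi. After the change,
    // shortcircuit can still collapse the extra block in shapes like this.
    func outsideRange(x int) bool {
        return !(10 <= x && x <= 20)
    }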
@mundaym @josharian Were either of you guys planning on submitting something here?
I'd like to make block rewrite rules more powerful, but I don't think I'll be able to get it done before the freeze because it is quite fiddly (happy to help if someone else wants to take a look). I'm happy to write a pass to do this specific optimization, though, if that would be useful. Most of the tricky stuff I worked through in the CL I posted previously.
Just finishing your CL for this specific optimization would be good. |
I will also dig up my work in progress and mail it, in case it is helpful. Probably this afternoon. |
Change https://golang.org/cl/201206 mentions this issue: |
Mailed CL 201206 in the hopes that it is helpful. I only vaguely remember where I left it, but I'm pretty sure it's not done. Sorry. |
Change https://golang.org/cl/165998 mentions this issue: |
I'm going to punt this to 1.15. There is a CL, but it could use some more work, it's late, and it is just a performance improvement. |
Change https://golang.org/cl/221802 mentions this issue: |
The walkinrange optimization has been superseded by CL 165998. Has a small positive impact on binary sizes:

compilecmp master -> HEAD
master (e37cc29): cmd/compile: optimize integer-in-range checks
HEAD (1a70680a34): cmd/compile: remove walkinrange optimization
platform: linux/amd64

file       before     after      Δ        %
addr2line  4329144    4325048    -4096    -0.095%
api        6060970    6056874    -4096    -0.068%
asm        5196905    5192809    -4096    -0.079%
cgo        4898769    4890577    -8192    -0.167%
compile    20222193   20209713   -12480   -0.062%
cover      5331580    5323388    -8192    -0.154%
dist       3732778    3728682    -4096    -0.110%
doc        4748488    4740296    -8192    -0.173%
link       6707380    6695092    -12288   -0.183%
nm         4278685    4274589    -4096    -0.096%
pack       2305038    2300942    -4096    -0.178%
pprof      14874834   14870738   -4096    -0.028%
test2json  2849221    2845125    -4096    -0.144%
vet        8393173    8384981    -8192    -0.098%
go         15205572   15193284   -12288   -0.081%
total      131812292  131709700  -102592  -0.078%

Updates #30645.

Change-Id: I42d74481652c90fef1a9bc58c70836e42c9b1c4b
Reviewed-on: https://go-review.googlesource.com/c/go/+/221802
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
I’m on my phone, but I suspect that CL 165617 disabled the walkinrange optimization. This makes sense—order now rewrites away OANDAND and OOROR nodes, but those are what walkinrange operates on.
One symptom (I believe) is this:
https://go-review.googlesource.com/c/go/+/165617/1/test/checkbce.go#36
We have long wanted to move walkinrange to the SSA backend; this is a good motivation.
I’m marking this as a 1.13 release blocker, since IIRC walkinrange is a significant optimization for some hot code (usually parsing/formatting code looking for bytes in a range).