cmd/internal/obj: armv7 (and arm64) should not use constant pools #18293
Comments
It's OK to generate MOVT/MOVW when GOARM=7
(just make sure not to use these for addresses that need dynamic relocations.)
But what about floating point constants?
|
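To make the floating-point question concrete, here is a rough sketch in GNU assembly syntax (not the Go assembler's spelling or output; the constant, registers, and labels are my own illustration). A double currently comes out of the literal pool with one load, while materializing it from immediates takes two movw/movt pairs plus a core-to-VFP transfer:

@ current style: load the double from a literal pool in .text
        vldr    d0, .Lpi            @ PC-relative load, goes through the D-cache
        bx      lr
.Lpi:   .double 3.141592653589793   @ 8 bytes of pool data

@ pool-free style: build the two words in core registers, then transfer
        movw    r0, #0x2d18         @ low 16 bits of the low word
        movt    r0, #0x5444         @ high 16 bits of the low word
        movw    r1, #0x21fb         @ low 16 bits of the high word
        movt    r1, #0x4009         @ high 16 bits of the high word
        vmov    d0, r0, r1          @ r0 -> low 32 bits, r1 -> high 32 bits
        bx      lr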
For ARM32, I am not sure we want very different mechanisms for this depending on whether GOARM=7 or not. |
@minux Right. The linker would have to support |
@benshi001 : you might want to tackle this. |
I am interested in it and will take a look after finishing my previous CLs.
|
Besides "MOVT $imme16, R", there are also "MOV $imme16, R". both encode the 16-bit immediate into the instruction than constant pool. Maybe we can all attribute them to "MOVW $imme, R" in the front-end, then auto pick a right form in the back-end among |
I roughly have a solution for dynamically selecting among the above 3 forms of MOVW. But it depends on
Can they be reviewed and accepted first? |
I did some benchmarking:
where F uses the constant pool while G uses MOVW+MOVT.
I also tried a couple of others. Using the constant pool always seems faster, so I'm not sure this is worth doing. If the constant fits in 16 bits, using a single MOVW seems better, though. |
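The benchmark source did not survive in this thread; a minimal sketch of the kind of pair being compared might look like the following (GNU syntax; the names F and G and the constant are assumptions, not the actual benchmark code):

F:      ldr     r0, =0xdeadbeef     @ the assembler materializes this via a literal pool
        bx      lr
        .ltorg                      @ literal pool emitted here

G:      movw    r0, #0xbeef         @ low 16 bits, zero-extended
        movt    r0, #0xdead         @ high 16 bits
        bx      lr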
Yes, I agree. It is worth to
But it is not worth substituting |
Thanks for benchmarking. However, I think it's worth evaluating the |
GCC will compile void ss(void) to:
Disassembly of section .text:
00000000 <ss>: |
I can understand why a pair of
Since the constant pool is in the text segment, the CPU has to fill a data cache line to load the constant, which can cost hundreds of ticks. But for MOVW/MOVT, the immediate is already in the instruction cache, and only a few ticks are spent. As a trade-off, |
There is already "VMOV R, # immediate" implemented for encoding short floating-point constants into the instruction, at lines 2222-2231 in cmd/internal/obj/arm/asm5.go.
We only need to consider MOVW/MOVT on ARMv7. |
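For reference, this is roughly the class of constants the existing VMOV path can already encode without any pool entry (GNU syntax here; the Go assembler spells these operands differently). VFPv3 immediates are limited to values of the form +/- m * 2^e with a small mantissa:

        vmov.f32        s0, #0.5    @ encodable single-precision immediate
        vmov.f64        d0, #1.0    @ encodable double-precision immediate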
This issue becomes more complex: all sub-forms of MOVW assume the output has a fixed length, but "MOVW $0x12345678, R1" is assembled to 4 bytes of output on ARMv6 versus 8 bytes on ARMv7. |
How about substituting MOVW.cond $0x00001234, R0 with MOVW.cond $0x1234, R0? Though it costs 3 bytes, it is still more efficient than filling a new data cache line. |
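A sketch of that substitution in GNU syntax (the condition and constant are chosen by me for illustration):

@ before: the constant comes from the literal pool even though it fits in 16 bits
        cmp     r2, #0
        ldrne   r0, .Lc             @ conditional PC-relative load, D-cache traffic
        bx      lr
.Lc:    .word   0x00001234

@ after: a single conditional movw, the immediate comes straight from the I-stream
        cmp     r2, #0
        movwne  r0, #0x1234
        bx      lr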
Yes, my benchmark might not be very representative. But my point is that if we do this, we would need to base it on evidence of actual speedup, instead of what appears to be fast. Note that the complexity of the toolchain is also a cost. |
You can do this by adding another column, GOARM, in Optab. See cmd/internal/obj/mips/asm0.go, where MIPS32 vs MIPS64 is used for instruction selection. |
I recall that my previous employer had a test suite with one test case that frequently produces cache misses; maybe it can be adapted to compare ldr and movt.
But it is in C and GNU asm.
Ben Shi
|
I cannot find a way to measure exactly how much the cache misses matter. Shall we implement MOVW $imme-16bit, Rn for ARMv7 first? At least that is certain to be an improvement. |
I have implemented the movw/movt pair in patch set 2 of CL 41190, https://go-review.googlesource.com/?polygerrit=0#/c/41190/ Unfortunately, the go1 benchmark test shows no improvement. Please check the attachment for details. |
I will later try whether "MOVW $imm-16, Reg" alone can optimize ARM code. |
CL https://golang.org/cl/41190 mentions this issue. |
With patch set 4 of CL 41190, I ran the go1 benchmark test for three rounds, each with count=50. All three runs showed improvements with "MOVW $imm16, Reg". |
I tried to attach a zip file containing all the log files, but that always failed, so I have to attach the screenshots. Really sorry for that. I will update CL 41190's commit message with the new test results, which look more reasonable. |
Please don't post screenshots of text. Just copy/paste the text. But I don't see why this bug needs 3 sets of benchmarks pasted in a row. Shouldn't it just be in the commit message of your CLs? |
OK, I will attach the text directly next time. I did more rounds of testing because the results differ somewhat between rounds, and more results make the improvement more convincing. |
As discussed in issue #18293, "MOVW $Imm-16, Reg" was introduced in ARMv7. It directly encodes the 16-bit immediate into the instruction instead of putting it in the constant pool. This patch makes the ARM assembler choose this form of MOVW when it is available. Besides saving 4 bytes in the constant pool, the go1 benchmark test also shows a slight improvement.

name old time/op new time/op delta
BinaryTree17-4 42.7s ± 1% 42.7s ± 1% ~ (p=0.304 n=50+50)
Fannkuch11-4 24.8s ± 1% 24.8s ± 0% ~ (p=0.757 n=50+49)
FmtFprintfEmpty-4 875ns ± 1% 873ns ± 2% ~ (p=0.066 n=44+46)
FmtFprintfString-4 1.43µs ± 1% 1.45µs ± 1% +1.68% (p=0.000 n=44+44)
FmtFprintfInt-4 1.52µs ± 1% 1.52µs ± 1% +0.26% (p=0.009 n=41+45)
FmtFprintfIntInt-4 2.19µs ± 1% 2.20µs ± 1% +0.76% (p=0.000 n=43+46)
FmtFprintfPrefixedInt-4 2.56µs ± 2% 2.53µs ± 1% -1.03% (p=0.000 n=45+44)
FmtFprintfFloat-4 4.41µs ± 1% 4.39µs ± 1% -0.52% (p=0.000 n=44+44)
FmtManyArgs-4 9.02µs ± 2% 9.04µs ± 1% +0.27% (p=0.000 n=46+44)
GobDecode-4 106ms ± 1% 106ms ± 1% ~ (p=0.310 n=45+43)
GobEncode-4 88.1ms ± 2% 88.0ms ± 2% ~ (p=0.648 n=49+50)
Gzip-4 4.31s ± 1% 4.27s ± 1% -1.01% (p=0.000 n=50+50)
Gunzip-4 618ms ± 1% 608ms ± 1% -1.65% (p=0.000 n=45+47)
HTTPClientServer-4 689µs ± 6% 692µs ± 4% +0.52% (p=0.038 n=50+47)
JSONEncode-4 282ms ± 2% 280ms ± 1% -0.75% (p=0.000 n=46+43)
JSONDecode-4 945ms ± 2% 940ms ± 1% -0.47% (p=0.000 n=47+47)
Mandelbrot200-4 49.4ms ± 1% 49.3ms ± 1% ~ (p=0.163 n=45+45)
GoParse-4 46.0ms ± 3% 45.5ms ± 2% -0.95% (p=0.000 n=49+40)
RegexpMatchEasy0_32-4 1.29µs ± 1% 1.28µs ± 1% -0.14% (p=0.005 n=38+45)
RegexpMatchEasy0_1K-4 7.92µs ± 8% 7.75µs ± 6% -2.12% (p=0.000 n=47+50)
RegexpMatchEasy1_32-4 1.31µs ± 1% 1.31µs ± 0% ~ (p=0.282 n=45+48)
RegexpMatchEasy1_1K-4 10.4µs ± 5% 10.4µs ± 3% ~ (p=0.771 n=50+49)
RegexpMatchMedium_32-4 2.06µs ± 1% 2.07µs ± 1% +0.35% (p=0.001 n=44+49)
RegexpMatchMedium_1K-4 533µs ± 1% 532µs ± 1% ~ (p=0.710 n=43+47)
RegexpMatchHard_32-4 29.7µs ± 1% 29.6µs ± 1% -0.34% (p=0.002 n=43+46)
RegexpMatchHard_1K-4 893µs ± 2% 885µs ± 1% -0.85% (p=0.000 n=50+45)
Revcomp-4 85.6ms ± 4% 85.5ms ± 2% ~ (p=0.683 n=50+50)
Template-4 1.05s ± 3% 1.04s ± 1% -1.06% (p=0.000 n=50+44)
TimeParse-4 7.19µs ± 2% 7.11µs ± 2% -1.10% (p=0.000 n=48+46)
TimeFormat-4 13.4µs ± 1% 13.5µs ± 1% ~ (p=0.056 n=46+49)
[Geo mean] 747µs 745µs -0.28%

name old speed new speed delta
GobDecode-4 7.23MB/s ± 1% 7.22MB/s ± 1% ~ (p=0.062 n=45+39)
GobEncode-4 8.71MB/s ± 2% 8.72MB/s ± 2% ~ (p=0.656 n=49+50)
Gzip-4 4.50MB/s ± 1% 4.55MB/s ± 1% +1.03% (p=0.000 n=50+50)
Gunzip-4 31.4MB/s ± 1% 31.9MB/s ± 1% +1.67% (p=0.000 n=45+47)
JSONEncode-4 6.89MB/s ± 2% 6.94MB/s ± 1% +0.76% (p=0.000 n=46+43)
JSONDecode-4 2.05MB/s ± 2% 2.06MB/s ± 2% +0.32% (p=0.017 n=47+50)
GoParse-4 1.26MB/s ± 3% 1.27MB/s ± 1% +0.68% (p=0.000 n=50+48)
RegexpMatchEasy0_32-4 24.9MB/s ± 1% 24.9MB/s ± 1% +0.13% (p=0.004 n=38+45)
RegexpMatchEasy0_1K-4 129MB/s ± 7% 132MB/s ± 6% +2.34% (p=0.000 n=46+50)
RegexpMatchEasy1_32-4 24.5MB/s ± 1% 24.4MB/s ± 1% ~ (p=0.252 n=45+48)
RegexpMatchEasy1_1K-4 98.8MB/s ± 4% 98.7MB/s ± 3% ~ (p=0.771 n=50+49)
RegexpMatchMedium_32-4 485kB/s ± 3% 480kB/s ± 0% -0.95% (p=0.000 n=50+38)
RegexpMatchMedium_1K-4 1.92MB/s ± 1% 1.92MB/s ± 1% ~ (p=0.129 n=43+47)
RegexpMatchHard_32-4 1.08MB/s ± 2% 1.08MB/s ± 1% +0.38% (p=0.017 n=46+46)
RegexpMatchHard_1K-4 1.15MB/s ± 2% 1.16MB/s ± 1% +0.67% (p=0.001 n=50+49)
Revcomp-4 29.7MB/s ± 4% 29.7MB/s ± 2% ~ (p=0.682 n=50+50)
Template-4 1.85MB/s ± 3% 1.87MB/s ± 1% +1.04% (p=0.000 n=50+44)
[Geo mean] 6.56MB/s 6.60MB/s +0.47%

Change-Id: Ic2cca90133c27a08d9f1a23c65b0eed5fbd02684
Reviewed-on: https://go-review.googlesource.com/41190
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Current status:
Can we do any more for this issue? |
@benshi001 Thank you for making the CL and investigating other possibilities. Closing for now (fixed in CL https://go-review.googlesource.com/c/41190/). Feel free to reopen if you find there is more to do. |
The movw and movt instructions were introduced in ARMv7 to improve the performance of immediate generation. An instruction sequence like a movw/movt pair (sketched below) can be retired in a single cycle by some of the more modern Cortex-A chips (in both the 32- and 64-bit variety).
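The code block from the original report did not survive extraction; a representative movw/movt pair for a 32-bit constant (the value here is my own illustration, GNU syntax) would be:

        movw    r0, #0x5678         @ writes the low 16 bits, zero-extends the rest
        movt    r0, #0x1234         @ writes the high 16 bits; r0 = 0x12345678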
The ARMv8 equivalent (for 0xb01dfacedebac1e0) would be a movz/movk sequence (sketched below) and can be retired in as little as one or two cycles as well (but this time at a small code size penalty). Keep in mind that ARM loads, particularly in ARMv8, have more-than-single-cycle result latencies, so this sequence of movs is likely faster even without the hardware optimization if the immediate is used... immediately.

I'd also expect there to be a small but significant improvement in L1 dcache misses, since this would get the dcache out of the equation entirely.
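The original ARMv8 snippet is also missing; the usual movz/movk sequence for 0xb01dfacedebac1e0 looks roughly like this (register choice is mine):

        movz    x0, #0xc1e0                 // bits 0-15
        movk    x0, #0xdeba, lsl #16        // bits 16-31
        movk    x0, #0xface, lsl #32        // bits 32-47
        movk    x0, #0xb01d, lsl #48        // bits 48-63; x0 = 0xb01dfacedebac1e0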
I'm not familiar enough with the toolchain to know whether or not obj would get grumpy if MOVW compiled to a different number of instructions depending on the immediate.