
cmd/internal/obj: armv7 (and arm64) should not use constant pools #18293

Closed
philhofer opened this issue Dec 12, 2016 · 33 comments

@philhofer
Contributor

philhofer commented Dec 12, 2016

The movw and movt instructions were introduced in ARMv7 to improve the performance of immediate generation.

An instruction sequence like

movw r0, #0xabcd
movt r0, #0xefff

can be retired in a single cycle by some of the more modern Cortex-A chips (in both the 32- and 64-bit variety.)

The ARMv8 equivalent (for 0xb01dfacedebac1e0) would be

movz x0, #0xc1e0
movk x0, #0xdeba, lsl #16
movk x0, #0xface, lsl #32
movk x0, #0xb01d, lsl #48

and can be retired in as little as one or two cycles as well (but this time at a small code-size penalty). Keep in mind that ARM loads, particularly in ARMv8, have more than single-cycle result latencies, so this sequence of movs is likely faster even without the hardware optimization if the immediate is used... immediately.
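For illustration, the halfword chunks such a MOVZ/MOVK sequence materializes can be computed mechanically. `immChunks` below is a hypothetical helper, not part of the toolchain:

```go
package main

import "fmt"

// immChunks splits a 64-bit immediate into the four 16-bit halfwords
// that a MOVZ/MOVK sequence loads, least-significant first. An
// assembler could skip chunks equal to zero (after the initial MOVZ)
// to shorten the sequence.
func immChunks(v uint64) [4]uint16 {
	var c [4]uint16
	for i := uint(0); i < 4; i++ {
		c[i] = uint16(v >> (16 * i))
	}
	return c
}

func main() {
	fmt.Printf("%#x\n", immChunks(0xb01dfacedebac1e0))
}
```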

I'd also expect there to be a small but significant improvement in L1 dcache misses, since this would get the dcache out of the equation entirely.

I'm not familiar enough with the toolchain to know whether or not obj would get grumpy if MOVW compiled to a different number of instructions depending on the immediate.

@philhofer philhofer changed the title obj: armv7 (and arm64) should not use immediate pools obj: armv7 (and arm64) should not use constant pools Dec 12, 2016
@bradfitz bradfitz added this to the Go1.9Maybe milestone Dec 12, 2016
@randall77
Contributor

@cherrymui

@minux
Member

minux commented Dec 13, 2016 via email

@cherrymui
Member

For ARM32, I am not sure we want a very different mechanism for this depending on whether GOARM=7.
For ARM64, it looks worth an experiment in Go 1.9.

@cherrymui cherrymui changed the title obj: armv7 (and arm64) should not use constant pools cmd/internal/obj: armv7 (and arm64) should not use constant pools Dec 13, 2016
@philhofer
Contributor Author

@minux Right. The linker would have to support R_ARM_MOVW_ABS_NC and R_ARM_MOVT_ABS if we wanted to support armv7 in its completeness. As for floating-point, vmov (vfpv3) supports a reasonable range of constants.

@randall77
Contributor

@benshi001 : you might want to tackle this.

@benshi001
Member

benshi001 commented Mar 25, 2017 via email

@benshi001
Member

benshi001 commented Mar 28, 2017

Besides "MOVT $imm16, R", there is also "MOV $imm16, R". Both encode the 16-bit immediate into the instruction rather than the constant pool.

Maybe we can express them all as "MOVW $imm, R" in the front end, then automatically pick the right form in the back end among:
MOVT $imm16, R
MOV $imm16, R
MOV $imm8<<$imm4, R
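The last form is ARM's classic Operand2 immediate: an 8-bit value rotated right by an even amount. A sketch of the encodability check follows; `encodesAsARMImm` is an illustrative helper, not the actual asm5.go logic:

```go
package main

import "fmt"

// encodesAsARMImm reports whether v fits ARM's Operand2 immediate
// encoding: an 8-bit value rotated right by an even amount 0..30.
func encodesAsARMImm(v uint32) bool {
	for r := uint(0); r < 32; r += 2 {
		// Rotating v left by r undoes a rotate-right-by-r encoding.
		rotated := v<<r | v>>(32-r)
		if rotated <= 0xFF {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(encodesAsARMImm(0xFF000000)) // true: 0xFF ROR 8
	fmt.Println(encodesAsARMImm(0x12345678)) // false: needs MOVW/MOVT or a pool load
}
```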

@benshi001
Member

I roughly have a solution for dynamically selecting among the above three forms of MOVW, but it depends on
https://go-review.googlesource.com/#/c/35565/ and https://go-review.googlesource.com/#/c/35610/.

Can they be reviewed and accepted first?

@cherrymui
Member

I did some benchmarking:

BenchmarkF-4   	 3000000	       508 ns/op
BenchmarkG-4   	 2000000	       769 ns/op

where F uses the constant pool while G uses MOVW/MOVT.

TEXT ·f(SB),0,$0-4
        MOVW    $100, R0
        MOVW    $0, R5
loop:
        MOVW    $0x12345678, R1 // use constant pool
        ADD     R1, R5
        MOVW    $0x87654321, R2
        ADD     R2, R5
        MOVW    $0xa0b0c0d0, R3
        ADD     R3, R5
        MOVW    $0xd0c0b0a0, R4
        ADD     R4, R5
        SUB     $1, R0
        CMP     $0, R0
        BNE     loop
        MOVW    R5, ret+0(FP)
        RET

TEXT ·g(SB),0,$0-4
        MOVW    $100, R0
        MOVW    $0, R5
loop:
        WORD    $0xe3051678 // MOVW $0x5678, R1
        WORD    $0xe3411234 // MOVT $0x1234, R1
        ADD     R1, R5
        WORD    $0xe3042321 // MOVW $0x4321, R2
        WORD    $0xe3482765 // MOVT $0x8765, R2
        ADD     R2, R5
        WORD    $0xe30c30d0 // MOVW $0xc0d0, R3
        WORD    $0xe34a30b0 // MOVT $0xa0b0, R3
        ADD     R3, R5
        WORD    $0xe30b40a0 // MOVW $0xb0a0, R4
        WORD    $0xe34d40c0 // MOVT $0xd0c0, R4
        ADD     R4, R5
        SUB     $1, R0
        CMP     $0, R0
        BNE     loop
        MOVW    R5, ret+0(FP)
        RET

Also tried a couple of others. Using the constant pool always seems faster, so I'm not sure this is worth doing.

If the constant fits in 16 bits, using a single MOVW seems better, though.

@benshi001
Member

Yes, I agree. It is worthwhile to:

  1. substitute
    MOVW $0x1234, R1 // use constant pool
    with
    MOVW $0x1234, R1 // encode 16-bit 0x1234 into the instruction

  2. substitute
    MOVW $0x12340000, R1 // use constant pool
    with
    MOVT $0x1234, R1 // encode 16-bit 0x1234 into the instruction

But it is not worth substituting
MOVW $0x12345678, R1 // use constant pool
with
MOVW $0x5678, R1
MOVT $0x1234, R1
(MOVW must come first, since it zeroes the upper half of the register.)
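The decision rules above can be summarized in a small selector. This is a hypothetical sketch, not the toolchain's actual logic:

```go
package main

import "fmt"

// movwStrategy picks an encoding for MOVW $v, R: a single MOVW if v
// fits in 16 bits, a single MOVT if only the top 16 bits are set,
// otherwise fall back to a constant-pool load.
func movwStrategy(v uint32) string {
	switch {
	case v <= 0xFFFF:
		return "MOVW imm16"
	case v&0xFFFF == 0:
		return "MOVT imm16"
	default:
		return "constant pool"
	}
}

func main() {
	fmt.Println(movwStrategy(0x1234))     // MOVW imm16
	fmt.Println(movwStrategy(0x12340000)) // MOVT imm16
	fmt.Println(movwStrategy(0x12345678)) // constant pool
}
```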

@philhofer
Contributor Author

Thanks for benchmarking.

However, I think it's worth evaluating the movw; movt pair on whole programs. Micro-benchmarks won't show the cache effects that these instructions can potentially mitigate. Moreover, the movw; movt pair is used by both GCC and clang on ARMv7 as the preferred method of 'large' immediate generation, and it's quite likely that those instruction selection decisions were even implemented by ARM employees. It would be at least a little surprising to discover that the conventional wisdom here is wrong.

@benshi001
Member

benshi001 commented Mar 30, 2017

GCC will compile
extern int q;

void ss(void)
{
q = 0x12345678;
}

to

Disassembly of section .text:

00000000 <ss>:
   0:	e3003000 	movw	r3, #0
   4:	e3403000 	movt	r3, #0
   8:	e3052678 	movw	r2, #22136	; 0x5678
   c:	e3412234 	movt	r2, #4660	; 0x1234
  10:	e5832000 	str	r2, [r3]
  14:	e12fff1e 	bx	lr

@benshi001
Member

I can understand why a MOVW/MOVT pair can be more efficient.

Since the constant pool is in the text segment, the CPU has to fill a data cache line to load the constant, which can cost hundreds of cycles.

But with MOVW/MOVT, the immediate is already in the instruction cache, and costs only a few cycles.

By comparison, the slowdown in the microbenchmark,
BenchmarkF-4 3000000 508 ns/op
BenchmarkG-4 2000000 769 ns/op
is tiny next to the cost of filling a new data cache line.

@benshi001
Member

There is already "VMOV #imm, Freg" implemented for encoding short floating-point constants into the instruction, at lines 2222-2231 in cmd/internal/obj/arm/asm5.go:

     case 81: /* fmov sfcon,freg */
            o1 = 0x0eb00a00 // VMOV imm 32
            if p.As == AMOVD {
                    o1 = 0xeeb00b00 // VMOV imm 64
            }
            o1 |= ((uint32(p.Scond) & C_SCOND) ^ C_SCOND_XOR) << 28
            o1 |= (uint32(p.To.Reg) & 15) << 12
            v := int32(chipfloat5(ctxt, p.From.Val.(float64)))
            o1 |= (uint32(v) & 0xf) << 0
            o1 |= (uint32(v) & 0xf0) << 12

We need only consider MOVW/MOVT on ARMv7.
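For reference, VFPv3's 8-bit modified immediate covers exactly the values ±(n/16)·2^r with n in 16..31 and r in -3..4. A brute-force membership check can sketch what chipfloat5 decides (illustrative only, not the real implementation):

```go
package main

import (
	"fmt"
	"math"
)

// vfpImmEncodable reports whether f is one of the values a VFPv3
// VMOV immediate can encode: ±(n/16)·2^r, n in 16..31, r in -3..4.
// Note that 0.0 is NOT encodable and needs another strategy.
func vfpImmEncodable(f float64) bool {
	f = math.Abs(f)
	for n := 16; n <= 31; n++ {
		for r := -3; r <= 4; r++ {
			// All candidates are dyadic rationals, so exact
			// float64 comparison is safe here.
			if f == float64(n)/16*math.Pow(2, float64(r)) {
				return true
			}
		}
	}
	return false
}

func main() {
	fmt.Println(vfpImmEncodable(0.5), vfpImmEncodable(31.0), vfpImmEncodable(0.1))
}
```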

@benshi001
Member

This issue turns out to be more complex.

All sub-forms of MOVW assume a fixed-length output.

But "MOVW $0x12345678, R1" assembles to 4 bytes on ARMv6, while 8 bytes on ARMv7.

@benshi001
Member

How about substituting MOVW.cond $0x00001234, R0 with

MOVW.cond $0x1234, R0
MOVT.cond $0x0000, R0
MOVW.cond R0, R0 (for setting the condition flag bits)

Though extra code bytes are spent, it is still more efficient than filling a new data cache line.

@cherrymui
Member

Yes, my benchmark might not be very representative. But my point is that if we do this, we would need to base it on evidence of actual speedup, instead of what appears to be fast. Note that the complexity of the toolchain is also a cost.

@cherrymui
Member

But "MOVW $0x12345678, R1" assembles to 4 bytes on ARMv6, while 8 bytes on ARMv7.

You can do this by adding another column, GOARM, in Optab. See cmd/internal/obj/mips/asm0.go, where MIPS32 vs MIPS64 is used for instruction selection.
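The GOARM-column idea can be sketched as a lookup table keyed on a minimum GOARM. The field names below are illustrative, not the real Optab layout in asm5.go:

```go
package main

import "fmt"

// optab mirrors the idea of adding a GOARM column: the same mnemonic
// maps to different sizes/encodings depending on the minimum GOARM.
type optab struct {
	op    string
	goarm int // minimum GOARM this entry applies to
	size  int // output size in bytes
}

// Entries are ordered newest-GOARM first so pick finds the best form.
var tab = []optab{
	{"MOVW $imm32", 7, 8}, // MOVW+MOVT pair on ARMv7
	{"MOVW $imm32", 5, 4}, // constant-pool load on older cores
}

// pick returns the first entry whose GOARM requirement is satisfied.
func pick(op string, goarm int) (optab, bool) {
	for _, o := range tab {
		if o.op == op && goarm >= o.goarm {
			return o, true
		}
	}
	return optab{}, false
}

func main() {
	o, _ := pick("MOVW $imm32", 7)
	fmt.Println(o.size) // 8: MOVW+MOVT on ARMv7
	o, _ = pick("MOVW $imm32", 6)
	fmt.Println(o.size) // 4: constant pool on ARMv6
}
```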

@benshi001
Member

benshi001 commented Apr 1, 2017 via email

@benshi001
Member

I cannot find a way to measure exactly how cache misses affect performance.

Shall we implement "MOVW $imm16, Rn" for ARMv7 first? At least that is a sure optimization.

@benshi001
Member

I have implemented the MOVW/MOVT pair in patch set 2 of CL 41190, https://go-review.googlesource.com/?polygerrit=0#/c/41190/

Unfortunately, the go1 benchmark shows no improvement. Please check the attachment for details.
[screenshot from 2017-04-21 11-07-55: go1 benchmark results; image not preserved]

@benshi001
Member

I will later try whether "MOVW $imm16, Reg" alone can optimize ARM code.

@benshi001
Member

There is a slight improvement with only "movw $imm16, Reg" in go1 benchmark test.

[screenshot from 2017-04-21 18-43-12: go1 benchmark results; image not preserved]

@gopherbot

CL https://golang.org/cl/41190 mentions this issue.

@benshi001
Member

With patch set 4 of CL 41190, I ran the go1 benchmark test for three rounds, each with count=50. I got:
old50_1 vs new50_1
[Geo mean] 747µs 744µs -0.34%
[Geo mean] 6.57MB/s 6.59MB/s +0.33%
old50_2 vs new50_2
[Geo mean] 749µs 747µs -0.31%
[Geo mean] 6.56MB/s 6.58MB/s +0.30%
old50_3 vs new50_3
[Geo mean] 747µs 745µs -0.28%
[Geo mean] 6.56MB/s 6.60MB/s +0.47%

All showed improvements with "MOVW $imm16, Reg".

@benshi001
Member

[three consecutive comments attached screenshots of the full benchmark logs; images not preserved]

@benshi001
Member

I tried to attach a zip file containing all the log files, but always got:
"We don't support that file type. Try again with a PNG, GIF, JPG, DOCX, PPTX, XLSX, TXT, PDF, or ZIP."

So I had to attach screenshots; sorry for that. I will update CL 41190's commit message with the new test results, which look more reasonable.

@bradfitz
Contributor

Please don't post screenshots of text. Just copy/paste the text. But I don't see why this bug needs 3 sets of benchmarks pasted in a row. Shouldn't it just be in the commit message of your CLs?

@benshi001
Member

OK, I will attach the text directly next time. I ran multiple rounds of tests because the results vary from round to round, and more results make the measured improvement more reliable.

gopherbot pushed a commit that referenced this issue Apr 25, 2017
As discussed in issue #18293, "MOVW $imm16, Reg" was introduced in
ARMv7. It directly encodes the 16-bit immediate into the instruction
instead of putting it in the constant pool.

This patch makes the ARM assembler choose this form of MOVW when available.

Besides saving 4 bytes in the constant pool, the go1 benchmark
also shows a slight improvement.

name                     old time/op    new time/op    delta
BinaryTree17-4              42.7s ± 1%     42.7s ± 1%    ~     (p=0.304 n=50+50)
Fannkuch11-4                24.8s ± 1%     24.8s ± 0%    ~     (p=0.757 n=50+49)
FmtFprintfEmpty-4           875ns ± 1%     873ns ± 2%    ~     (p=0.066 n=44+46)
FmtFprintfString-4         1.43µs ± 1%    1.45µs ± 1%  +1.68%  (p=0.000 n=44+44)
FmtFprintfInt-4            1.52µs ± 1%    1.52µs ± 1%  +0.26%  (p=0.009 n=41+45)
FmtFprintfIntInt-4         2.19µs ± 1%    2.20µs ± 1%  +0.76%  (p=0.000 n=43+46)
FmtFprintfPrefixedInt-4    2.56µs ± 2%    2.53µs ± 1%  -1.03%  (p=0.000 n=45+44)
FmtFprintfFloat-4          4.41µs ± 1%    4.39µs ± 1%  -0.52%  (p=0.000 n=44+44)
FmtManyArgs-4              9.02µs ± 2%    9.04µs ± 1%  +0.27%  (p=0.000 n=46+44)
GobDecode-4                 106ms ± 1%     106ms ± 1%    ~     (p=0.310 n=45+43)
GobEncode-4                88.1ms ± 2%    88.0ms ± 2%    ~     (p=0.648 n=49+50)
Gzip-4                      4.31s ± 1%     4.27s ± 1%  -1.01%  (p=0.000 n=50+50)
Gunzip-4                    618ms ± 1%     608ms ± 1%  -1.65%  (p=0.000 n=45+47)
HTTPClientServer-4          689µs ± 6%     692µs ± 4%  +0.52%  (p=0.038 n=50+47)
JSONEncode-4                282ms ± 2%     280ms ± 1%  -0.75%  (p=0.000 n=46+43)
JSONDecode-4                945ms ± 2%     940ms ± 1%  -0.47%  (p=0.000 n=47+47)
Mandelbrot200-4            49.4ms ± 1%    49.3ms ± 1%    ~     (p=0.163 n=45+45)
GoParse-4                  46.0ms ± 3%    45.5ms ± 2%  -0.95%  (p=0.000 n=49+40)
RegexpMatchEasy0_32-4      1.29µs ± 1%    1.28µs ± 1%  -0.14%  (p=0.005 n=38+45)
RegexpMatchEasy0_1K-4      7.92µs ± 8%    7.75µs ± 6%  -2.12%  (p=0.000 n=47+50)
RegexpMatchEasy1_32-4      1.31µs ± 1%    1.31µs ± 0%    ~     (p=0.282 n=45+48)
RegexpMatchEasy1_1K-4      10.4µs ± 5%    10.4µs ± 3%    ~     (p=0.771 n=50+49)
RegexpMatchMedium_32-4     2.06µs ± 1%    2.07µs ± 1%  +0.35%  (p=0.001 n=44+49)
RegexpMatchMedium_1K-4      533µs ± 1%     532µs ± 1%    ~     (p=0.710 n=43+47)
RegexpMatchHard_32-4       29.7µs ± 1%    29.6µs ± 1%  -0.34%  (p=0.002 n=43+46)
RegexpMatchHard_1K-4        893µs ± 2%     885µs ± 1%  -0.85%  (p=0.000 n=50+45)
Revcomp-4                  85.6ms ± 4%    85.5ms ± 2%    ~     (p=0.683 n=50+50)
Template-4                  1.05s ± 3%     1.04s ± 1%  -1.06%  (p=0.000 n=50+44)
TimeParse-4                7.19µs ± 2%    7.11µs ± 2%  -1.10%  (p=0.000 n=48+46)
TimeFormat-4               13.4µs ± 1%    13.5µs ± 1%    ~     (p=0.056 n=46+49)
[Geo mean]                  747µs          745µs       -0.28%

name                     old speed      new speed      delta
GobDecode-4              7.23MB/s ± 1%  7.22MB/s ± 1%    ~     (p=0.062 n=45+39)
GobEncode-4              8.71MB/s ± 2%  8.72MB/s ± 2%    ~     (p=0.656 n=49+50)
Gzip-4                   4.50MB/s ± 1%  4.55MB/s ± 1%  +1.03%  (p=0.000 n=50+50)
Gunzip-4                 31.4MB/s ± 1%  31.9MB/s ± 1%  +1.67%  (p=0.000 n=45+47)
JSONEncode-4             6.89MB/s ± 2%  6.94MB/s ± 1%  +0.76%  (p=0.000 n=46+43)
JSONDecode-4             2.05MB/s ± 2%  2.06MB/s ± 2%  +0.32%  (p=0.017 n=47+50)
GoParse-4                1.26MB/s ± 3%  1.27MB/s ± 1%  +0.68%  (p=0.000 n=50+48)
RegexpMatchEasy0_32-4    24.9MB/s ± 1%  24.9MB/s ± 1%  +0.13%  (p=0.004 n=38+45)
RegexpMatchEasy0_1K-4     129MB/s ± 7%   132MB/s ± 6%  +2.34%  (p=0.000 n=46+50)
RegexpMatchEasy1_32-4    24.5MB/s ± 1%  24.4MB/s ± 1%    ~     (p=0.252 n=45+48)
RegexpMatchEasy1_1K-4    98.8MB/s ± 4%  98.7MB/s ± 3%    ~     (p=0.771 n=50+49)
RegexpMatchMedium_32-4    485kB/s ± 3%   480kB/s ± 0%  -0.95%  (p=0.000 n=50+38)
RegexpMatchMedium_1K-4   1.92MB/s ± 1%  1.92MB/s ± 1%    ~     (p=0.129 n=43+47)
RegexpMatchHard_32-4     1.08MB/s ± 2%  1.08MB/s ± 1%  +0.38%  (p=0.017 n=46+46)
RegexpMatchHard_1K-4     1.15MB/s ± 2%  1.16MB/s ± 1%  +0.67%  (p=0.001 n=50+49)
Revcomp-4                29.7MB/s ± 4%  29.7MB/s ± 2%    ~     (p=0.682 n=50+50)
Template-4               1.85MB/s ± 3%  1.87MB/s ± 1%  +1.04%  (p=0.000 n=50+44)
[Geo mean]               6.56MB/s       6.60MB/s       +0.47%


Change-Id: Ic2cca90133c27a08d9f1a23c65b0eed5fbd02684
Reviewed-on: https://go-review.googlesource.com/41190
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
@benshi001
Member

Current status:

  1. "VMOV $imm, Freg" already exists
  2. the MOVW/MOVT pair is not as efficient as expected
  3. "MOVW $imm16, Reg" is merged

Can we do anything more for this issue?

@cherrymui
Member

@benshi001 Thank you for making the CL and investigating other possibilities.

Closing for now (fixed in CL https://go-review.googlesource.com/c/41190/). Feel free to reopen if you find there is more to do.

@golang golang locked and limited conversation to collaborators Apr 26, 2018