
cmd/internal/obj: armv7 (and arm64) should not use constant pools #18293

Closed
philhofer opened this issue Dec 12, 2016 · 33 comments

@philhofer
Contributor

philhofer commented Dec 12, 2016

The movw and movt instructions were introduced in ARMv7 to improve the performance of immediate generation.

An instruction sequence like

movw r0, #0xabcd
movt r0, #0xefff

can be retired in a single cycle by some of the more modern Cortex-A chips (in both the 32- and 64-bit variety.)

The ARMv8 equivalent (for 0xb01dfacedebac1e0) would be

movz x0, #0xc1e0
movk x0, #0xdeba, lsl #16
movk x0, #0xface, lsl #32
movk x0, #0xb01d, lsl #48

and can be retired in as little as one or two cycles as well (but this time at a small code-size penalty). Keep in mind that ARM loads, particularly in ARMv8, have more than single-cycle result latencies, so this sequence of movs is likely faster even without the hardware optimization if the immediate is used... immediately.
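For illustration, the halfword chunks such a MOVZ/MOVK sequence materializes can be computed mechanically. `immChunks` below is a hypothetical helper, not part of the toolchain:

```go
package main

import "fmt"

// immChunks splits a 64-bit immediate into the four 16-bit halfwords
// that a MOVZ/MOVK sequence loads, least-significant first. An
// assembler could skip chunks equal to zero (after the initial MOVZ)
// to shorten the sequence.
func immChunks(v uint64) [4]uint16 {
	var c [4]uint16
	for i := uint(0); i < 4; i++ {
		c[i] = uint16(v >> (16 * i))
	}
	return c
}

func main() {
	fmt.Printf("%#x\n", immChunks(0xb01dfacedebac1e0))
}
```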

I'd also expect there to be a small but significant improvement in L1 dcache misses, since this would get the dcache out of the equation entirely.

I'm not familiar enough with the toolchain to know whether or not obj would get grumpy if MOVW compiled to a different number of instructions depending on the immediate.

@philhofer philhofer changed the title obj: armv7 (and arm64) should not use immediate pools obj: armv7 (and arm64) should not use constant pools Dec 12, 2016
@bradfitz bradfitz added this to the Go1.9Maybe milestone Dec 12, 2016
@randall77
Contributor

@cherrymui

@minux
Member

minux commented Dec 13, 2016 via email

@cherrymui
Member

For ARM32, I am not sure we want a very different mechanism for this depending on whether GOARM=7.
For ARM64, it looks worth an experiment in Go 1.9.

@cherrymui cherrymui changed the title obj: armv7 (and arm64) should not use constant pools cmd/internal/obj: armv7 (and arm64) should not use constant pools Dec 13, 2016
@philhofer
Contributor Author

@minux Right. The linker would have to support R_ARM_MOVW_ABS_NC and R_ARM_MOVT_ABS if we wanted to support armv7 in its completeness. As for floating-point, vmov (vfpv3) supports a reasonable range of constants.

@randall77
Contributor

@benshi001 : you might want to tackle this.

@benshi001
Member

benshi001 commented Mar 25, 2017 via email

@benshi001
Member

benshi001 commented Mar 28, 2017

Besides "MOVT $imm16, R", there is also "MOV $imm16, R". Both encode the 16-bit immediate into the instruction rather than the constant pool.

Maybe we can express them all as "MOVW $imm, R" in the front end, then automatically pick the right form in the back end among:
MOVT $imm16, R
MOV $imm16, R
MOV $imm8<<$imm4, R
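The last form is ARM's classic Operand2 immediate: an 8-bit value rotated right by an even amount. A sketch of the encodability check follows; `encodesAsARMImm` is an illustrative helper, not the actual asm5.go logic:

```go
package main

import "fmt"

// encodesAsARMImm reports whether v fits ARM's Operand2 immediate
// encoding: an 8-bit value rotated right by an even amount 0..30.
func encodesAsARMImm(v uint32) bool {
	for r := uint(0); r < 32; r += 2 {
		// Rotating v left by r undoes a rotate-right-by-r encoding.
		rotated := v<<r | v>>(32-r)
		if rotated <= 0xFF {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(encodesAsARMImm(0xFF000000)) // true: 0xFF ROR 8
	fmt.Println(encodesAsARMImm(0x12345678)) // false: needs MOVW/MOVT or a pool load
}
```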

@benshi001
Member

I roughly have a solution for dynamically selecting among the above three forms of MOVW, but it depends on
https://go-review.googlesource.com/#/c/35565/ and https://go-review.googlesource.com/#/c/35610/.

Can they be reviewed and accepted first?

@cherrymui
Member

I did some benchmarking:

BenchmarkF-4   	 3000000	       508 ns/op
BenchmarkG-4   	 2000000	       769 ns/op

where F uses the constant pool while G uses MOVW/MOVT.

TEXT ·f(SB),0,$0-4
        MOVW    $100, R0
        MOVW    $0, R5
loop:
        MOVW    $0x12345678, R1 // use constant pool
        ADD     R1, R5
        MOVW    $0x87654321, R2
        ADD     R2, R5
        MOVW    $0xa0b0c0d0, R3
        ADD     R3, R5
        MOVW    $0xd0c0b0a0, R4
        ADD     R4, R5
        SUB     $1, R0
        CMP     $0, R0
        BNE     loop
        MOVW    R5, ret+0(FP)
        RET

TEXT ·g(SB),0,$0-4
        MOVW    $100, R0
        MOVW    $0, R5
loop:
        WORD    $0xe3051678 // MOVW $0x5678, R1
        WORD    $0xe3411234 // MOVT $0x1234, R1
        ADD     R1, R5
        WORD    $0xe3042321 // MOVW $0x4321, R2
        WORD    $0xe3482765 // MOVT $0x8765, R2
        ADD     R2, R5
        WORD    $0xe30c30d0 // MOVW $0xc0d0, R3
        WORD    $0xe34a30b0 // MOVT $0xa0b0, R3
        ADD     R3, R5
        WORD    $0xe30b40a0 // MOVW $0xb0a0, R4
        WORD    $0xe34d40c0 // MOVT $0xd0c0, R4
        ADD     R4, R5
        SUB     $1, R0
        CMP     $0, R0
        BNE     loop
        MOVW    R5, ret+0(FP)
        RET

Also tried a couple of others. Using the constant pool always seems faster, so I'm not sure this is worth doing.

If the constant fits in 16 bits, using a single MOVW seems better, though.

@benshi001
Member

Yes, I agree. It is worthwhile to:

  1. substitute
    MOVW $0x1234, R1 // use constant pool
    with
    MOVW $0x1234, R1 // encode 16-bit 0x1234 into the instruction

  2. substitute
    MOVW $0x12340000, R1 // use constant pool
    with
    MOVT $0x1234, R1 // encode 16-bit 0x1234 into the instruction

But it is not worth substituting
MOVW $0x12345678, R1 // use constant pool
with
MOVW $0x5678, R1
MOVT $0x1234, R1
(MOVW must come first, since it zeroes the upper half of the register.)
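The decision rules above can be summarized in a small selector. This is a hypothetical sketch, not the toolchain's actual logic:

```go
package main

import "fmt"

// movwStrategy picks an encoding for MOVW $v, R: a single MOVW if v
// fits in 16 bits, a single MOVT if only the top 16 bits are set,
// otherwise fall back to a constant-pool load.
func movwStrategy(v uint32) string {
	switch {
	case v <= 0xFFFF:
		return "MOVW imm16"
	case v&0xFFFF == 0:
		return "MOVT imm16"
	default:
		return "constant pool"
	}
}

func main() {
	fmt.Println(movwStrategy(0x1234))     // MOVW imm16
	fmt.Println(movwStrategy(0x12340000)) // MOVT imm16
	fmt.Println(movwStrategy(0x12345678)) // constant pool
}
```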

@philhofer
Contributor Author

Thanks for benchmarking.

However, I think it's worth evaluating the movw; movt pair on whole programs. Micro-benchmarks won't show the cache effects that these instructions can potentially mitigate. Moreover, the movw; movt pair is used by both GCC and clang on ARMv7 as the preferred method of 'large' immediate generation, and it's quite likely that those instruction selection decisions were even implemented by ARM employees. It would be at least a little surprising to discover that the conventional wisdom here is wrong.

@benshi001
Member

benshi001 commented Mar 30, 2017

GCC will compile
extern int q;

void ss(void)
{
q = 0x12345678;
}

to

Disassembly of section .text:

00000000 <ss>:
   0:	e3003000 	movw	r3, #0
   4:	e3403000 	movt	r3, #0
   8:	e3052678 	movw	r2, #22136	; 0x5678
   c:	e3412234 	movt	r2, #4660	; 0x1234
  10:	e5832000 	str	r2, [r3]
  14:	e12fff1e 	bx	lr

@benshi001
Member

I can understand why a MOVW/MOVT pair can be more efficient.

Since the constant pool is in the text segment, the CPU has to fill a data cache line to load the constant, which can cost hundreds of cycles.

But with MOVW/MOVT, the immediate is already in the instruction cache, and costs only a few cycles.

By comparison, the slowdown in the microbenchmark,
BenchmarkF-4 3000000 508 ns/op
BenchmarkG-4 2000000 769 ns/op
is tiny next to the cost of filling a new data cache line.

@benshi001
Member

There is already "VMOV #imm, Freg" implemented for encoding short floating-point constants into the instruction, at lines 2222-2231 in cmd/internal/obj/arm/asm5.go:

     case 81: /* fmov sfcon,freg */
            o1 = 0x0eb00a00 // VMOV imm 32
            if p.As == AMOVD {
                    o1 = 0xeeb00b00 // VMOV imm 64
            }
            o1 |= ((uint32(p.Scond) & C_SCOND) ^ C_SCOND_XOR) << 28
            o1 |= (uint32(p.To.Reg) & 15) << 12
            v := int32(chipfloat5(ctxt, p.From.Val.(float64)))
            o1 |= (uint32(v) & 0xf) << 0
            o1 |= (uint32(v) & 0xf0) << 12

We need only consider MOVW/MOVT on ARMv7.
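For reference, VFPv3's 8-bit modified immediate covers exactly the values ±(n/16)·2^r with n in 16..31 and r in -3..4. A brute-force membership check can sketch what chipfloat5 decides (illustrative only, not the real implementation):

```go
package main

import (
	"fmt"
	"math"
)

// vfpImmEncodable reports whether f is one of the values a VFPv3
// VMOV immediate can encode: ±(n/16)·2^r, n in 16..31, r in -3..4.
// Note that 0.0 is NOT encodable and needs another strategy.
func vfpImmEncodable(f float64) bool {
	f = math.Abs(f)
	for n := 16; n <= 31; n++ {
		for r := -3; r <= 4; r++ {
			// All candidates are dyadic rationals, so exact
			// float64 comparison is safe here.
			if f == float64(n)/16*math.Pow(2, float64(r)) {
				return true
			}
		}
	}
	return false
}

func main() {
	fmt.Println(vfpImmEncodable(0.5), vfpImmEncodable(31.0), vfpImmEncodable(0.1))
}
```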

@benshi001
Member

This issue turns out to be more complex.

All sub-forms of MOVW assume a fixed-length output.

But "MOVW $0x12345678, R1" assembles to 4 bytes on ARMv6, while 8 bytes on ARMv7.

@benshi001
Member

How about substituting MOVW.cond $0x00001234, R0 with

MOVW.cond $0x1234, R0
MOVT.cond $0x0000, R0
MOVW.cond R0, R0 (for setting the condition flag bits)

Though extra code bytes are spent, it is still more efficient than filling a new data cache line.

@cherrymui
Member

Yes, my benchmark might not be very representative. But my point is that if we do this, we would need to base it on evidence of actual speedup, instead of what appears to be fast. Note that the complexity of the toolchain is also a cost.

@cherrymui
Member

But "MOVW $0x12345678, R1" assembles to 4 bytes on ARMv6, while 8 bytes on ARMv7.

You can do this by adding another column, GOARM, in Optab. See cmd/internal/obj/mips/asm0.go, where MIPS32 vs MIPS64 is used for instruction selection.
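The GOARM-column idea can be sketched as a lookup table keyed on a minimum GOARM. The field names below are illustrative, not the real Optab layout in asm5.go:

```go
package main

import "fmt"

// optab mirrors the idea of adding a GOARM column: the same mnemonic
// maps to different sizes/encodings depending on the minimum GOARM.
type optab struct {
	op    string
	goarm int // minimum GOARM this entry applies to
	size  int // output size in bytes
}

// Entries are ordered newest-GOARM first so pick finds the best form.
var tab = []optab{
	{"MOVW $imm32", 7, 8}, // MOVW+MOVT pair on ARMv7
	{"MOVW $imm32", 5, 4}, // constant-pool load on older cores
}

// pick returns the first entry whose GOARM requirement is satisfied.
func pick(op string, goarm int) (optab, bool) {
	for _, o := range tab {
		if o.op == op && goarm >= o.goarm {
			return o, true
		}
	}
	return optab{}, false
}

func main() {
	o, _ := pick("MOVW $imm32", 7)
	fmt.Println(o.size) // 8: MOVW+MOVT on ARMv7
	o, _ = pick("MOVW $imm32", 6)
	fmt.Println(o.size) // 4: constant pool on ARMv6
}
```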

@benshi001
Member

benshi001 commented Apr 1, 2017 via email

@benshi001
Member

I cannot find a way to measure exactly how cache misses affect performance.

Shall we implement "MOVW $imm16, Rn" for ARMv7 first? At least that is a sure optimization.

@benshi001
Member

I have implemented the MOVW/MOVT pair in patch set 2 of CL 41190, https://go-review.googlesource.com/?polygerrit=0#/c/41190/

Unfortunately, the go1 benchmark shows no improvement. Please check the attachment for details.
[screenshot from 2017-04-21 11-07-55: go1 benchmark results; image not preserved]

@benshi001
Member

I will later try whether "MOVW $imm16, Reg" alone can optimize ARM code.

@benshi001
Member

There is a slight improvement with only "movw $imm16, Reg" in go1 benchmark test.

[screenshot from 2017-04-21 18-43-12: go1 benchmark results; image not preserved]

@gopherbot

CL https://golang.org/cl/41190 mentions this issue.

@benshi001
Member

With patch set 4 of CL 41190, I ran the go1 benchmark test for three rounds, each with count=50. I got:
old50_1 vs new50_1
[Geo mean] 747µs 744µs -0.34%
[Geo mean] 6.57MB/s 6.59MB/s +0.33%
old50_2 vs new50_2
[Geo mean] 749µs 747µs -0.31%
[Geo mean] 6.56MB/s 6.58MB/s +0.30%
old50_3 vs new50_3
[Geo mean] 747µs 745µs -0.28%
[Geo mean] 6.56MB/s 6.60MB/s +0.47%

All showed improvements with "MOVW $imm16, Reg".

@benshi001
Member

[three consecutive comments attached screenshots of the full benchmark logs; images not preserved]

@benshi001
Member

I tried to attach a zip file containing all the log files, but always got:
"We don't support that file type. Try again with a PNG, GIF, JPG, DOCX, PPTX, XLSX, TXT, PDF, or ZIP."

So I had to attach screenshots; sorry for that. I will update CL 41190's commit message with the new test results, which look more reasonable.

@bradfitz
Contributor

Please don't post screenshots of text. Just copy/paste the text. But I don't see why this bug needs 3 sets of benchmarks pasted in a row. Shouldn't it just be in the commit message of your CLs?

@benshi001
Member

OK, I will attach the text directly next time. I ran multiple rounds of tests because the results vary from round to round, and more results make the measured improvement more reliable.

gopherbot pushed a commit that referenced this issue Apr 25, 2017
As discussed in issue #18293, "MOVW $imm16, Reg" was introduced in
ARMv7. It directly encodes the 16-bit immediate into the instruction
instead of putting it in the constant pool.

This patch makes the ARM assembler choose this form of MOVW when available.

Besides saving 4 bytes in the constant pool, the go1 benchmark
also shows a slight improvement.

name                     old time/op    new time/op    delta
BinaryTree17-4              42.7s ± 1%     42.7s ± 1%    ~     (p=0.304 n=50+50)
Fannkuch11-4                24.8s ± 1%     24.8s ± 0%    ~     (p=0.757 n=50+49)
FmtFprintfEmpty-4           875ns ± 1%     873ns ± 2%    ~     (p=0.066 n=44+46)
FmtFprintfString-4         1.43µs ± 1%    1.45µs ± 1%  +1.68%  (p=0.000 n=44+44)
FmtFprintfInt-4            1.52µs ± 1%    1.52µs ± 1%  +0.26%  (p=0.009 n=41+45)
FmtFprintfIntInt-4         2.19µs ± 1%    2.20µs ± 1%  +0.76%  (p=0.000 n=43+46)
FmtFprintfPrefixedInt-4    2.56µs ± 2%    2.53µs ± 1%  -1.03%  (p=0.000 n=45+44)
FmtFprintfFloat-4          4.41µs ± 1%    4.39µs ± 1%  -0.52%  (p=0.000 n=44+44)
FmtManyArgs-4              9.02µs ± 2%    9.04µs ± 1%  +0.27%  (p=0.000 n=46+44)
GobDecode-4                 106ms ± 1%     106ms ± 1%    ~     (p=0.310 n=45+43)
GobEncode-4                88.1ms ± 2%    88.0ms ± 2%    ~     (p=0.648 n=49+50)
Gzip-4                      4.31s ± 1%     4.27s ± 1%  -1.01%  (p=0.000 n=50+50)
Gunzip-4                    618ms ± 1%     608ms ± 1%  -1.65%  (p=0.000 n=45+47)
HTTPClientServer-4          689µs ± 6%     692µs ± 4%  +0.52%  (p=0.038 n=50+47)
JSONEncode-4                282ms ± 2%     280ms ± 1%  -0.75%  (p=0.000 n=46+43)
JSONDecode-4                945ms ± 2%     940ms ± 1%  -0.47%  (p=0.000 n=47+47)
Mandelbrot200-4            49.4ms ± 1%    49.3ms ± 1%    ~     (p=0.163 n=45+45)
GoParse-4                  46.0ms ± 3%    45.5ms ± 2%  -0.95%  (p=0.000 n=49+40)
RegexpMatchEasy0_32-4      1.29µs ± 1%    1.28µs ± 1%  -0.14%  (p=0.005 n=38+45)
RegexpMatchEasy0_1K-4      7.92µs ± 8%    7.75µs ± 6%  -2.12%  (p=0.000 n=47+50)
RegexpMatchEasy1_32-4      1.31µs ± 1%    1.31µs ± 0%    ~     (p=0.282 n=45+48)
RegexpMatchEasy1_1K-4      10.4µs ± 5%    10.4µs ± 3%    ~     (p=0.771 n=50+49)
RegexpMatchMedium_32-4     2.06µs ± 1%    2.07µs ± 1%  +0.35%  (p=0.001 n=44+49)
RegexpMatchMedium_1K-4      533µs ± 1%     532µs ± 1%    ~     (p=0.710 n=43+47)
RegexpMatchHard_32-4       29.7µs ± 1%    29.6µs ± 1%  -0.34%  (p=0.002 n=43+46)
RegexpMatchHard_1K-4        893µs ± 2%     885µs ± 1%  -0.85%  (p=0.000 n=50+45)
Revcomp-4                  85.6ms ± 4%    85.5ms ± 2%    ~     (p=0.683 n=50+50)
Template-4                  1.05s ± 3%     1.04s ± 1%  -1.06%  (p=0.000 n=50+44)
TimeParse-4                7.19µs ± 2%    7.11µs ± 2%  -1.10%  (p=0.000 n=48+46)
TimeFormat-4               13.4µs ± 1%    13.5µs ± 1%    ~     (p=0.056 n=46+49)
[Geo mean]                  747µs          745µs       -0.28%

name                     old speed      new speed      delta
GobDecode-4              7.23MB/s ± 1%  7.22MB/s ± 1%    ~     (p=0.062 n=45+39)
GobEncode-4              8.71MB/s ± 2%  8.72MB/s ± 2%    ~     (p=0.656 n=49+50)
Gzip-4                   4.50MB/s ± 1%  4.55MB/s ± 1%  +1.03%  (p=0.000 n=50+50)
Gunzip-4                 31.4MB/s ± 1%  31.9MB/s ± 1%  +1.67%  (p=0.000 n=45+47)
JSONEncode-4             6.89MB/s ± 2%  6.94MB/s ± 1%  +0.76%  (p=0.000 n=46+43)
JSONDecode-4             2.05MB/s ± 2%  2.06MB/s ± 2%  +0.32%  (p=0.017 n=47+50)
GoParse-4                1.26MB/s ± 3%  1.27MB/s ± 1%  +0.68%  (p=0.000 n=50+48)
RegexpMatchEasy0_32-4    24.9MB/s ± 1%  24.9MB/s ± 1%  +0.13%  (p=0.004 n=38+45)
RegexpMatchEasy0_1K-4     129MB/s ± 7%   132MB/s ± 6%  +2.34%  (p=0.000 n=46+50)
RegexpMatchEasy1_32-4    24.5MB/s ± 1%  24.4MB/s ± 1%    ~     (p=0.252 n=45+48)
RegexpMatchEasy1_1K-4    98.8MB/s ± 4%  98.7MB/s ± 3%    ~     (p=0.771 n=50+49)
RegexpMatchMedium_32-4    485kB/s ± 3%   480kB/s ± 0%  -0.95%  (p=0.000 n=50+38)
RegexpMatchMedium_1K-4   1.92MB/s ± 1%  1.92MB/s ± 1%    ~     (p=0.129 n=43+47)
RegexpMatchHard_32-4     1.08MB/s ± 2%  1.08MB/s ± 1%  +0.38%  (p=0.017 n=46+46)
RegexpMatchHard_1K-4     1.15MB/s ± 2%  1.16MB/s ± 1%  +0.67%  (p=0.001 n=50+49)
Revcomp-4                29.7MB/s ± 4%  29.7MB/s ± 2%    ~     (p=0.682 n=50+50)
Template-4               1.85MB/s ± 3%  1.87MB/s ± 1%  +1.04%  (p=0.000 n=50+44)
[Geo mean]               6.56MB/s       6.60MB/s       +0.47%


Change-Id: Ic2cca90133c27a08d9f1a23c65b0eed5fbd02684
Reviewed-on: https://go-review.googlesource.com/41190
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
@benshi001
Member

Current status:

  1. "VMOV $imm, Freg" already exists
  2. the MOVW/MOVT pair is not as efficient as expected
  3. "MOVW $imm16, Reg" is merged

Can we do anything more for this issue?

@cherrymui
Member

@benshi001 Thank you for making the CL and investigating other possibilities.

Closing for now (fixed in CL https://go-review.googlesource.com/c/41190/). Feel free to reopen if you find there is more to do.

@golang golang locked and limited conversation to collaborators Apr 26, 2018