Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: cmd/go: add GOARM=8 for further optimization on armv7/aarch32 #29373

Open
benshi001 opened this issue Dec 21, 2018 · 9 comments
Open

Comments

@benshi001
Copy link
Member

Currently an ARM program built by go will call runtime.udiv() for a division, in which it detect if a hardware divider is available, or use software division.

The main reason is that a hardware divider is an optional component for an ARMv7 machine. But in the real world, most ARMv7 SOC has it, such as RaspberryPi2.

GOARM=8 implies that the program will run in the aarch32 mode (ARMv7 compatible) of an arm64 machine, on which a hardware divider is a must. So

  1. the compiler will directly generate a SDIV/UDIV for div and mod operations.
  2. the program built with GOARM=8 can also run on ARMv7 with a HW Dividor, like RP2.

The go1 benchmark does show some improvement for directly generation of SDIV/UDIV against runtime detection.

name                     old time/op    new time/op    delta
BinaryTree17-4              20.6s ± 0%     20.5s ± 1%   -0.30%  (p=0.000 n=40+40)
Fannkuch11-4                9.31s ± 0%     9.27s ± 0%   -0.42%  (p=0.000 n=40+39)
FmtFprintfEmpty-4           297ns ± 0%     298ns ± 0%   +0.34%  (p=0.000 n=38+40)
FmtFprintfString-4          588ns ± 0%     599ns ± 0%   +1.81%  (p=0.000 n=36+40)
FmtFprintfInt-4             633ns ± 0%     637ns ± 0%   +0.56%  (p=0.000 n=40+29)
FmtFprintfIntInt-4          960ns ± 0%     953ns ± 0%   -0.71%  (p=0.000 n=40+40)
FmtFprintfPrefixedInt-4    1.05µs ± 0%    1.05µs ± 0%     ~     (p=0.194 n=35+38)
FmtFprintfFloat-4          1.95µs ± 0%    1.75µs ± 0%  -10.12%  (p=0.000 n=38+40)
FmtManyArgs-4              3.55µs ± 0%    3.42µs ± 0%   -3.68%  (p=0.000 n=40+40)
GobDecode-4                37.4ms ± 1%    37.4ms ± 1%     ~     (p=0.320 n=37+39)
GobEncode-4                34.7ms ± 1%    34.4ms ± 1%   -0.80%  (p=0.000 n=40+40)
Gzip-4                      2.06s ± 1%     2.07s ± 1%   +0.44%  (p=0.000 n=39+38)
Gunzip-4                    254ms ± 0%     254ms ± 0%   +0.16%  (p=0.000 n=40+38)
HTTPClientServer-4          823µs ± 2%     817µs ± 2%   -0.70%  (p=0.008 n=37+37)
JSONEncode-4               79.4ms ± 0%    76.0ms ± 1%   -4.23%  (p=0.000 n=32+40)
JSONDecode-4                308ms ± 0%     304ms ± 0%   -1.06%  (p=0.000 n=40+39)
Mandelbrot200-4            17.6ms ± 0%    17.6ms ± 0%     ~     (p=0.210 n=34+38)
GoParse-4                  18.9ms ± 1%    18.7ms ± 1%   -1.10%  (p=0.000 n=39+40)
RegexpMatchEasy0_32-4       500ns ± 0%     502ns ± 2%   +0.35%  (p=0.014 n=39+40)
RegexpMatchEasy0_1K-4      3.82µs ± 0%    3.82µs ± 0%   +0.15%  (p=0.000 n=40+40)
RegexpMatchEasy1_32-4       546ns ± 0%     548ns ± 1%   +0.44%  (p=0.000 n=40+40)
RegexpMatchEasy1_1K-4      4.78µs ± 0%    4.79µs ± 0%   +0.16%  (p=0.000 n=38+37)
RegexpMatchMedium_32-4      737ns ± 2%     741ns ± 3%   +0.53%  (p=0.026 n=40+40)
RegexpMatchMedium_1K-4      164µs ± 0%     162µs ± 0%   -0.72%  (p=0.000 n=39+35)
RegexpMatchHard_32-4       10.6µs ± 0%    10.5µs ± 0%   -0.86%  (p=0.000 n=40+40)
RegexpMatchHard_1K-4        316µs ± 0%     312µs ± 0%   -1.13%  (p=0.000 n=38+40)
Revcomp-4                  40.5ms ± 3%    40.8ms ± 2%   +0.85%  (p=0.001 n=40+39)
Template-4                  395ms ± 0%     387ms ± 0%   -2.07%  (p=0.000 n=40+39)
TimeParse-4                2.68µs ± 0%    2.65µs ± 0%   -1.12%  (p=0.000 n=40+40)
TimeFormat-4               5.42µs ± 0%    5.29µs ± 0%   -2.30%  (p=0.000 n=38+37)
[Geo mean]                  304µs          302µs        -0.88%

name                     old speed      new speed      delta
GobDecode-4              20.5MB/s ± 1%  20.5MB/s ± 1%     ~     (p=0.284 n=37+39)
GobEncode-4              22.1MB/s ± 1%  22.3MB/s ± 1%   +0.81%  (p=0.000 n=40+38)
Gzip-4                   9.41MB/s ± 1%  9.37MB/s ± 1%   -0.41%  (p=0.000 n=38+38)
Gunzip-4                 76.5MB/s ± 0%  76.4MB/s ± 0%   -0.16%  (p=0.000 n=40+38)
JSONEncode-4             24.4MB/s ± 0%  25.5MB/s ± 1%   +4.42%  (p=0.000 n=32+40)
JSONDecode-4             6.31MB/s ± 0%  6.37MB/s ± 0%   +1.02%  (p=0.000 n=40+35)
GoParse-4                3.06MB/s ± 1%  3.10MB/s ± 1%   +1.13%  (p=0.000 n=39+40)
RegexpMatchEasy0_32-4    63.9MB/s ± 0%  63.7MB/s ± 2%     ~     (p=0.070 n=39+40)
RegexpMatchEasy0_1K-4     268MB/s ± 0%   268MB/s ± 0%   -0.15%  (p=0.000 n=40+40)
RegexpMatchEasy1_32-4    58.6MB/s ± 0%  58.3MB/s ± 1%   -0.44%  (p=0.000 n=40+40)
RegexpMatchEasy1_1K-4     214MB/s ± 0%   214MB/s ± 0%   -0.16%  (p=0.000 n=38+37)
RegexpMatchMedium_32-4   1.36MB/s ± 3%  1.35MB/s ± 3%   -0.70%  (p=0.022 n=40+40)
RegexpMatchMedium_1K-4   6.26MB/s ± 0%  6.31MB/s ± 0%   +0.73%  (p=0.000 n=37+40)
RegexpMatchHard_32-4     3.01MB/s ± 1%  3.04MB/s ± 0%   +1.03%  (p=0.000 n=40+40)
RegexpMatchHard_1K-4     3.25MB/s ± 0%  3.28MB/s ± 0%   +1.05%  (p=0.000 n=40+40)
Revcomp-4                62.8MB/s ± 4%  62.3MB/s ± 2%   -0.86%  (p=0.001 n=40+39)
Template-4               4.91MB/s ± 0%  5.01MB/s ± 0%   +2.11%  (p=0.000 n=40+39)
[Geo mean]               17.0MB/s       17.1MB/s        +0.53%
@benshi001 benshi001 added this to the Go1.13 milestone Dec 21, 2018
@benshi001 benshi001 changed the title Add GOARM=8 Add GOARM=8 for further optimization on armv7/aarch32 Dec 21, 2018
@ALTree ALTree changed the title Add GOARM=8 for further optimization on armv7/aarch32 proposal: cmd/go: add GOARM=8 for further optimization on armv7/aarch32 Dec 21, 2018
@benshi001
Copy link
Member Author

@cherrymui @bradfitz @randall77 @griesemer
What is your opinion?

@agnivade agnivade modified the milestones: Go1.13, Proposal Dec 25, 2018
@cherrymui
Copy link
Member

I don't think it is a good idea to introduce another GOARM value for only two instructions and a performance gain less than 1% of geomean. I think dynamic feature detection is still better in this case. If the overhead of a runtime call is larger than we thought, we could generate the feature test and conditional branch inlined. Compared to a division operation, I don't think the overhead of a conditional branch is not acceptable.

@rsc
Copy link
Contributor

rsc commented Jan 9, 2019

It sounds like the proposal is GOARM=8 means ARMv7+hwdivide.
And ARMv8 was 64-bit ARM, right?
So it's a little weird to call 32-bit ARMv7 with hw divide GOARM=8.

It looks like this usually doesn't matter, performance-wise. Maybe the compiler should emit code to do the check + branch like we already do for write barriers? That would at least allow a bit more optimization of the hw code path (not doing unnecessary spills/reloads/etc).

And a very division heavy function could just have two copies with the compiler having hoisted the check out of the loop or other body to amortize it, all without a GOARM=.

Another option is to follow the GOMIPS and have GOARM=7+divide. But the faster branch check a la write barriers seems better to try first.

@rsc
Copy link
Contributor

rsc commented Jan 23, 2019

Can someone please try the faster branch check in the previous comment? We should have that data before making any decision to expose this detail to users.

@rsc rsc added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Jan 23, 2019
@gopherbot
Copy link

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@ALTree ALTree reopened this Feb 23, 2019
@benshi001 benshi001 self-assigned this Aug 23, 2019
@benshi001
Copy link
Member Author

unfortunantely, the go1 benchmark shows little improvement for a runtime check of a hardware dividor.

name                     old time/op    new time/op    delta
BinaryTree17-4              21.1s ± 1%     21.3s ± 1%  +1.30%  (p=0.000 n=40+40)
Fannkuch11-4                8.34s ± 0%     8.08s ± 0%  -3.14%  (p=0.000 n=40+40)
FmtFprintfEmpty-4           285ns ± 0%     284ns ± 0%  -0.24%  (p=0.000 n=40+34)
FmtFprintfString-4          578ns ± 0%     570ns ± 0%  -1.30%  (p=0.000 n=38+40)
FmtFprintfInt-4             613ns ± 0%     597ns ± 0%  -2.56%  (p=0.000 n=33+40)
FmtFprintfIntInt-4          915ns ± 0%     898ns ± 0%  -1.87%  (p=0.000 n=39+40)
FmtFprintfPrefixedInt-4    1.00µs ± 0%    0.98µs ± 0%  -1.16%  (p=0.000 n=38+37)
FmtFprintfFloat-4          1.86µs ± 0%    1.88µs ± 0%  +0.84%  (p=0.000 n=37+37)
FmtManyArgs-4              3.39µs ± 0%    3.36µs ± 0%  -0.84%  (p=0.000 n=40+40)
GobDecode-4                32.5ms ± 1%    33.3ms ± 1%  +2.45%  (p=0.000 n=39+39)
GobEncode-4                29.0ms ± 1%    29.1ms ± 1%  +0.48%  (p=0.001 n=36+40)
Gzip-4                      2.00s ± 2%     1.99s ± 2%  -0.47%  (p=0.019 n=40+39)
Gunzip-4                    239ms ± 0%     240ms ± 1%  +0.50%  (p=0.000 n=40+40)
HTTPClientServer-4          705µs ± 2%     729µs ± 4%  +3.52%  (p=0.000 n=35+36)
JSONEncode-4               75.4ms ± 1%    75.4ms ± 1%    ~     (p=0.845 n=36+39)
JSONDecode-4                307ms ± 5%     307ms ± 5%    ~     (p=0.407 n=39+36)
Mandelbrot200-4            16.5ms ± 0%    16.5ms ± 0%    ~     (p=0.206 n=37+40)
GoParse-4                  21.6ms ± 6%    21.7ms ± 6%    ~     (p=0.819 n=39+39)
RegexpMatchEasy0_32-4       475ns ± 0%     479ns ± 0%  +0.84%  (p=0.000 n=34+32)
RegexpMatchEasy0_1K-4      3.58µs ± 0%    3.58µs ± 0%    ~     (p=0.090 n=39+40)
RegexpMatchEasy1_32-4       515ns ± 1%     513ns ± 0%  -0.33%  (p=0.000 n=39+37)
RegexpMatchEasy1_1K-4      4.52µs ± 1%    4.47µs ± 0%  -1.10%  (p=0.000 n=40+40)
RegexpMatchMedium_32-4     32.6ns ± 0%    32.6ns ± 0%    ~     (p=0.265 n=40+40)
RegexpMatchMedium_1K-4      154µs ± 0%     153µs ± 0%  -0.68%  (p=0.000 n=37+40)
RegexpMatchHard_32-4       10.0µs ± 0%     9.9µs ± 0%  -1.30%  (p=0.000 n=32+40)
RegexpMatchHard_1K-4        299µs ± 0%     295µs ± 0%  -1.35%  (p=0.000 n=39+38)
Revcomp-4                  34.6ms ± 2%    34.5ms ± 2%    ~     (p=0.054 n=39+38)
Template-4                  518ms ± 4%     519ms ± 3%    ~     (p=0.702 n=40+38)
TimeParse-4                2.47µs ± 0%    2.49µs ± 0%  +0.77%  (p=0.000 n=39+40)
TimeFormat-4               5.06µs ± 0%    5.08µs ± 1%  +0.34%  (p=0.000 n=40+39)
[Geo mean]                  262µs          262µs       -0.17%

name                     old speed      new speed      delta
GobDecode-4              23.6MB/s ± 1%  23.1MB/s ± 1%  -2.39%  (p=0.000 n=39+39)
GobEncode-4              26.5MB/s ± 1%  26.4MB/s ± 1%  -0.48%  (p=0.001 n=36+40)
Gzip-4                   9.71MB/s ± 2%  9.76MB/s ± 2%  +0.48%  (p=0.022 n=40+39)
Gunzip-4                 81.1MB/s ± 0%  80.7MB/s ± 1%  -0.50%  (p=0.000 n=40+40)
JSONEncode-4             25.7MB/s ± 1%  25.7MB/s ± 1%    ~     (p=0.886 n=36+39)
JSONDecode-4             6.33MB/s ± 5%  6.31MB/s ± 5%    ~     (p=0.313 n=39+37)
GoParse-4                2.68MB/s ± 7%  2.67MB/s ± 6%    ~     (p=0.802 n=39+39)
RegexpMatchEasy0_32-4    67.4MB/s ± 0%  66.8MB/s ± 0%  -0.85%  (p=0.000 n=40+34)
RegexpMatchEasy0_1K-4     286MB/s ± 0%   286MB/s ± 0%    ~     (p=0.120 n=39+40)
RegexpMatchEasy1_32-4    62.2MB/s ± 1%  62.4MB/s ± 0%  +0.36%  (p=0.000 n=40+37)
RegexpMatchEasy1_1K-4     226MB/s ± 1%   229MB/s ± 0%  +1.12%  (p=0.000 n=40+40)
RegexpMatchMedium_32-4   30.6MB/s ± 0%  30.6MB/s ± 0%    ~     (p=0.126 n=40+40)
RegexpMatchMedium_1K-4   6.65MB/s ± 0%  6.70MB/s ± 0%  +0.70%  (p=0.000 n=38+32)
RegexpMatchHard_32-4     3.19MB/s ± 0%  3.23MB/s ± 0%  +1.32%  (p=0.000 n=32+37)
RegexpMatchHard_1K-4     3.42MB/s ± 0%  3.47MB/s ± 0%  +1.46%  (p=0.000 n=40+38)
Revcomp-4                73.4MB/s ± 2%  73.6MB/s ± 2%    ~     (p=0.061 n=39+38)
Template-4               3.75MB/s ± 5%  3.74MB/s ± 3%    ~     (p=0.634 n=40+38)
[Geo mean]               21.2MB/s       21.2MB/s       +0.04%

@benshi001
Copy link
Member Author

The reason why "GOARM=8" works well, is due to

  1. the software division routine clobbers 2 extra registers.
  2. the software division routine clobbers the input registers with the results filled.

With the "GOARM=8", the compile can fill more spilt values to the saved registers, and so the go1 benchmark improves.

So I strongly suggest adding GOARM=8, as ARM's official definition:

  1. ARMv8 == Aarch64 + Aarch32
  2. Aarch64 (ARM's term) == ARM64 (Go's term)
  3. Aarch32 = ARM32 + HWDIV + some other features

GOARM=8 does not only improve division speed, but also has potential benefits (running arm32 program on a armv8 machine)

michaelbaudino added a commit to michaelbaudino/buildroot that referenced this issue Dec 18, 2020
Go doesn't support ARMv8 optimizations yet (see this GitHub
issue: golang/go#29373) but can still
benefit from ARMv7 optimizations.

A comment is left in `go.mk` to mention this and avoid any
confusion when reading "ifeq ARMv8 → GOARM = 7".

Signed-off-by: Michael Baudino <michael@baudi.no>
buildroot-auto-update pushed a commit to buildroot/buildroot that referenced this issue Dec 19, 2020
When building for an ARMv8 in 32-bit, Go does not yet support ARMv8
optimizations (see issue: golang/go#29373)
but can still benefit from ARMv7 optimizations.

Signed-off-by: Michael Baudino <michael@baudi.no>
[yann.morin.1998@free.fr:
  - move the comment to its own line, expand and reword it a bit
  - reword the commit log
]
Signed-off-by: Yann E. MORIN <yann.morin.1998@free.fr>
woodsts pushed a commit to woodsts/buildroot that referenced this issue Dec 22, 2020
When building for an ARMv8 in 32-bit, Go does not yet support ARMv8
optimizations (see issue: golang/go#29373)
but can still benefit from ARMv7 optimizations.

Signed-off-by: Michael Baudino <michael@baudi.no>
[yann.morin.1998@free.fr:
  - move the comment to its own line, expand and reword it a bit
  - reword the commit log
]
Signed-off-by: Yann E. MORIN <yann.morin.1998@free.fr>
(cherry picked from commit c59409a)
Signed-off-by: Peter Korsgaard <peter@korsgaard.com>
paralin pushed a commit to skiffos/buildroot that referenced this issue Jan 27, 2021
When building for an ARMv8 in 32-bit, Go does not yet support ARMv8
optimizations (see issue: golang/go#29373)
but can still benefit from ARMv7 optimizations.

Signed-off-by: Michael Baudino <michael@baudi.no>
[yann.morin.1998@free.fr:
  - move the comment to its own line, expand and reword it a bit
  - reword the commit log
]
Signed-off-by: Yann E. MORIN <yann.morin.1998@free.fr>
(cherry picked from commit c59409a)
Signed-off-by: Peter Korsgaard <peter@korsgaard.com>
paralin pushed a commit to skiffos/buildroot that referenced this issue Feb 10, 2021
When building for an ARMv8 in 32-bit, Go does not yet support ARMv8
optimizations (see issue: golang/go#29373)
but can still benefit from ARMv7 optimizations.

Signed-off-by: Michael Baudino <michael@baudi.no>
[yann.morin.1998@free.fr:
  - move the comment to its own line, expand and reword it a bit
  - reword the commit log
]
Signed-off-by: Yann E. MORIN <yann.morin.1998@free.fr>
(cherry picked from commit c59409a)
Signed-off-by: Peter Korsgaard <peter@korsgaard.com>
paralin pushed a commit to skiffos/buildroot that referenced this issue Feb 21, 2021
When building for an ARMv8 in 32-bit, Go does not yet support ARMv8
optimizations (see issue: golang/go#29373)
but can still benefit from ARMv7 optimizations.

Signed-off-by: Michael Baudino <michael@baudi.no>
[yann.morin.1998@free.fr:
  - move the comment to its own line, expand and reword it a bit
  - reword the commit log
]
Signed-off-by: Yann E. MORIN <yann.morin.1998@free.fr>
(cherry picked from commit c59409a)
Signed-off-by: Peter Korsgaard <peter@korsgaard.com>
paralin pushed a commit to skiffos/buildroot that referenced this issue Mar 24, 2021
When building for an ARMv8 in 32-bit, Go does not yet support ARMv8
optimizations (see issue: golang/go#29373)
but can still benefit from ARMv7 optimizations.

Signed-off-by: Michael Baudino <michael@baudi.no>
[yann.morin.1998@free.fr:
  - move the comment to its own line, expand and reword it a bit
  - reword the commit log
]
Signed-off-by: Yann E. MORIN <yann.morin.1998@free.fr>
(cherry picked from commit c59409a)
Signed-off-by: Peter Korsgaard <peter@korsgaard.com>
@cristaloleg
Copy link

Should the results from the last comment be more impactful after the #40724 ? 🤔

@andig
Copy link
Contributor

andig commented May 13, 2021

I don't think it is a good idea to introduce another GOARM value for only two instructions and a performance gain less than 1% of geomean. I think dynamic feature detection is still better in this case.

I'm not sure if that's applicable here (or even possible with modern memory protection mechanisms), but in the past dynamic feature detection was sometimes implemented by patching the executable in-memory at runtime, one invocation at a time.

@seankhliao seankhliao removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Dec 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants