Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: mod operator is very slow on intel hardware with int64 #59089

Open
rcsaquino opened this issue Mar 17, 2023 · 10 comments
Open

cmd/compile: mod operator is very slow on intel hardware with int64 #59089

rcsaquino opened this issue Mar 17, 2023 · 10 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. help wanted NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance

Comments

@rcsaquino
Copy link

rcsaquino commented Mar 17, 2023

What version of Go are you using (go version)?

$ go version
go1.20.2 windows/amd64

Does this issue reproduce with the latest release?

Yes

What did you do?

Mod operator is very slow on intel hardware when used on int type.

What did you expect to see?

Mod operator optimized to handle int types on intel hardwares

What did you see instead?

The mod operator is very slow when used on int types on intel hardware.

@merykitty
Copy link

This is expected, int64 division on Skylake results in 57 micro-ops and has a throughput of 24 cycles/operation, while the figures for int32 are only 10 micro-ops and 6 cycles/operation, change int64 to uint64 also helps reduce these to 36 micro-ops and 21 cycles/operation. As a result, on 64-bit hardware where int is equivalent to int64 the division throughput is a quarter of that of int32 division. Which explains the execution speed differences.

@merykitty
Copy link

merykitty commented Mar 17, 2023

I have also taken a look at the Rust compiled code, it seems they cleverly check if the upper 32-bits of both the dividend and the divisor are all zero, and fall to faster uint32 division if that is the case. This explains why the Rust implementation is as fast as the JS and V implementations.

.LBB3_10:
    mov     rax, rbx
    xor     edx, edx
    div     rdi.     // slow path, long division
.LBB3_11:
    inc     rdi
    test    rdx, rdx
    je      .LBB3_12
.LBB3_3:
    lea     rax, [r12 + rdi]
    cmp     rax, 2
    je      .LBB3_4
    mov     rax, rbx
    or      rax, rdi // or the dividend and the divisor
    shr     rax, 32  // get the upper 32-bit of the or
    jne     .LBB3_10 // shr set ZF so this is taken if any of the aforementioned 32 bits is non-zero
    mov     eax, ebx
    xor     edx, edx
    div     edi.     // fast path, integer division
    jmp     .LBB3_11

@seankhliao seankhliao changed the title Mod operator is very slow on intel hardware when used on int type. cmd/compile: mod operator is very slow on intel hardware with int64 Mar 17, 2023
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Mar 17, 2023
@seankhliao seankhliao added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Mar 17, 2023
@ianlancetaylor
Copy link
Contributor

CC @randall77

@randall77
Copy link
Contributor

Seems fine to put in a similar test + 32-bit path in our code.

My only question is how universal is this? Skylake was mentioned - how does this fare on other processor vendors / families?

@merykitty
Copy link

It is generally true on x86 that an uint32 division is cheaper than int64 and uint64 ones, but to add a branch to fastpath the latters into the former I think would need more consideration. I think LLVM does so on all intel chips while doing the division normally on AMD ones, which seems reasonable and reflects the throughput and latency differences on architectures. Thanks.

@gopherbot
Copy link

Change https://go.dev/cl/482658 mentions this issue: cmd/compile/internal/amd64: improve fix up code for signed division

@gopherbot
Copy link

Change https://go.dev/cl/482656 mentions this issue: cmd/compile: add math benchmarks

@gopherbot
Copy link

Change https://go.dev/cl/482657 mentions this issue: cmd/compile/internal/amd64: simplify code generation for signed division

@gopherbot
Copy link

Change https://go.dev/cl/482659 mentions this issue: cmd/compile: prefer 32 bit unsigned division on amd64

gopherbot pushed a commit that referenced this issue Apr 8, 2023
The same switch statement handles code generation for signed division of
words, double words and quad words. Rather than using multiple switch
statements to select the appropriate instructions, determine all of the
correctly sized operands up front, then use them as needed.

Updates #59089

Change-Id: I2b7567c8e0ecb9904c37607332538c95b0521dca
Reviewed-on: https://go-review.googlesource.com/c/go/+/482657
Run-TryBot: Joel Sing <joel@sing.id.au>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Apr 8, 2023
In order to avoid a CPU exception resulting from signed overflow, the signed
division code tests if the divisor is -1 and if it is, runs fix up code to
manually compute the quotient and remainder (thus avoiding IDIV and potential
signed overflow).

However, the way that this is currently structured means that the normal code
path for the case where the divisor is not -1 results in five instructions
and two branches (CMP, JEQ, followed by sign extension, IDIV and another JMP
to skip over the fix up code).

Rework the fix up code such that the final JMP is incurred by the less likely
divisor is -1 code path, rather than more likely code path (which is already
more expensive due to IDIV). This result in a four instruction sequence
(CMP, JNE, sign extension, IDIV), with only a single branch.

Updates #59089

Change-Id: Ie8d065750a178518d7397e194920b201afeb0530
Reviewed-on: https://go-review.googlesource.com/c/go/+/482658
Run-TryBot: Joel Sing <joel@sing.id.au>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
@gopherbot
Copy link

Change https://go.dev/cl/484438 mentions this issue: cmd/compile: prefer 32 bit signed division on amd64

gopherbot pushed a commit that referenced this issue Apr 14, 2023
This adds benchmarks for division and modulus of 64 bit signed and unsigned
integers.

Updates #59089

Change-Id: Ie757c6d74a1f355873e79619eae26ece21a8f23e
Reviewed-on: https://go-review.googlesource.com/c/go/+/482656
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: Joel Sing <joel@sing.id.au>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. help wanted NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
Development

No branches or pull requests

7 participants