Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: optimize overhead from CPU feature detection #36351

Open
smasher164 opened this issue Jan 1, 2020 · 7 comments
Open

cmd/compile: optimize overhead from CPU feature detection #36351

smasher164 opened this issue Jan 1, 2020 · 7 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@smasher164
Copy link
Member

As investigated in #36196, the overhead of checking for hardware FMA on every iteration of a loop causes it to slow down. @josharian's CL 212360, which introduces a HasCPUFeature intrinsic, somewhat alleviates this overhead, but it is still non-negligble. We should look into lowering or in some cases eliminating the overhead for operations that require CPU feature detection, like population count, FMA, rounding, SSE3, etc...

One method is to hoist the check outside the loop. To quote #15808 (comment)

we’d need an optimization that rewrote:

var b bool = ...
for ... {
  if b {
    ...
  } else {
    ...
  }
}

into

var b bool = ...
if b {
  for ... {
    ...
  }
} else {
  for ... {
    ... 
  }
}

Which is not always going to be an improvement.

For large loops and operations that permit > 2 implementations, the above optimization could result in inflated binaries, but it works well for smaller loops.

Another method is to set a function pointer to the preferred implementation on program initialization, so that all invocations incur an indirect function call overhead, with the benefit that the implementation wouldn't change at runtime. This would be akin to the dispatcher in GCC's function multi-versioning.

It is worth further investigating opportunities for optimization in this space.

@smasher164 smasher164 added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jan 1, 2020
@smasher164 smasher164 added this to the Unplanned milestone Jan 1, 2020
@egonelbre
Copy link
Contributor

One option would be to allow people to hoist the check manually and then elide the checks inside the loops.

func OnesCount(xs []uint64) int {
    var total int
    if cpu.HasSSE2 {
        for _, x := range xs {
            total += bits.OnesCount64(x)    
        }
    } else {
        for _, x := range xs {
            total += bits.OnesCount64(x)    
        }
    }
    return total
}

@seebs
Copy link
Contributor

seebs commented Jan 2, 2020

This becomes even more noticeable when the loop is unrolled at all, because then you don't even have another branch between the things.

I did some testing to find out what the impact was, and because I was not awake enough at the time, I ended up with an implementation that had the untaken branches to call the library functions, but not the popcount instructions, and discovered that the cost of that, plus the cost of the popcount operation without the branches, is much smaller than the cost of the branches and the popcount instructions. I'm not sure why. But the net impact is that the cost of the branch before every popcount, even though it's obviously a completely predictable branch (and perf confirms that it's predicted essentially 100% of the time), ends up being to increase the cost of popcount something like 3x overall.

@smasher164
Copy link
Member Author

@egonelbre One option would be to allow people to hoist the check manually

One issue with this is that internal/cpu is not a public API, and I am not sure it is something we want to surface to the user. /cc @martisch

@ianlancetaylor
Copy link
Contributor

The externally usable version of internal/cpu is golang.org/x/sys/cpu.

But, regardless, we should not expect users to manually hoist a CPU-specific flag.

@rasky
Copy link
Member

rasky commented Apr 4, 2020

This is fixed by fff7509

@rasky rasky closed this as completed Apr 4, 2020
@gopherbot
Copy link

Change https://golang.org/cl/227238 mentions this issue: cmd/compile: use MOVBQZX for OpAMD64LoweredHasCPUFeature

@smasher164 smasher164 reopened this Apr 5, 2020
@mattn
Copy link
Member

mattn commented Apr 5, 2020

I tried CL 227238. I'm using AMD Ryzen 5 3500U.

name      old time/op  new time/op  delta
FMA-8     1.16ns ± 0%  0.95ns ± 0%   ~     (p=1.000 n=1+1)
NonFMA-8  0.89ns ± 0%  0.98ns ± 0%   ~     (p=1.000 n=1+1)
name      old time/op  new time/op  delta
FMA-8     1.16ns ± 0%  0.73ns ± 0%   ~     (p=1.000 n=1+1)
NonFMA-8  0.89ns ± 0%  0.96ns ± 0%   ~     (p=1.000 n=1+1)
name      old time/op  new time/op  delta
FMA-8     1.16ns ± 0%  1.17ns ± 0%   ~     (p=1.000 n=1+1)
NonFMA-8  0.89ns ± 0%  1.21ns ± 0%   ~     (p=1.000 n=1+1)

gopherbot pushed a commit that referenced this issue Apr 7, 2020
In the commit message of CL 212360, I wrote:

> This new intrinsic ... generates MOVB+TESTB+NE.
> (It is possible that MOVBQZX+TESTQ+NE would be better.)

I should have tested. MOVBQZX+TESTQ+NE does in fact appear to be better.

For the benchmark in #36196, on my machine:

name      old time/op  new time/op  delta
FMA-8     0.86ns ± 6%  0.70ns ± 5%  -18.79%  (p=0.000 n=98+97)
NonFMA-8  0.61ns ± 5%  0.60ns ± 4%   -0.74%  (p=0.001 n=100+97)

Interestingly, these are both considerably faster than
the measurements I took a couple of months ago (1.4ns/2ns).
It appears that CL 219131 (clearing VZEROUPPER in asyncPreempt) helped a lot.
And FMA is now once again slower than NonFMA, although this change
helps it regain some ground.

Updates #15808
Updates #36351
Updates #36196

Change-Id: I8a326289a963b1939aaa7eaa2fab2ec536467c7d
Reviewed-on: https://go-review.googlesource.com/c/go/+/227238
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
None yet
Development

No branches or pull requests

7 participants