Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: Darwin slow when using signals + int->float instructions #37174

Closed
randall77 opened this issue Feb 11, 2020 · 28 comments
Closed

runtime: Darwin slow when using signals + int->float instructions #37174

randall77 opened this issue Feb 11, 2020 · 28 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin Performance
Milestone

Comments

@randall77
Copy link
Contributor

Split off from #37121

package bench1

import (
	"math"
	"testing"
)

const N = 64

func BenchmarkFast(b *testing.B) {
	var x, y, z [N]float32

	for i := 0; i < b.N; i++ {
		mulFast(&x, &y, &z)
	}
}

func mulFast(x, y, z *[N]float32) {
	for i := 0; i < N; i++ {
		z[i] = x[i] * y[i]
	}
}

func BenchmarkSlow(b *testing.B) {
	var z [N]float32
	var x, y [N]uint32

	for i := 0; i < b.N; i++ {
		mulSlow(&x, &y, &z)
	}
}

func mulSlow(x, y *[N]uint32, z *[N]float32) {
	for i := 0; i < N; i++ {
		z[i] = math.Float32frombits(x[i]) * math.Float32frombits(y[i])
	}
}
% ~/go1.12.9/bin/go test bench1_test.go -test.bench .\* -test.benchtime=10000000x 
goos: darwin
goarch: amd64
BenchmarkFast-16    	10000000	        55.9 ns/op
BenchmarkSlow-16    	10000000	        61.1 ns/op
PASS
% ~/go1.12.9/bin/go test bench1_test.go -test.bench .\* -test.benchtime=10000000x -test.cpuprofile=cpu.prof
goos: darwin
goarch: amd64
BenchmarkFast-16    	10000000	        89.7 ns/op
BenchmarkSlow-16    	10000000	       223 ns/op
PASS

For some strange reason, code that includes int->float instructions runs a lot slower when profiling is on.

This bug is reproducible from at least 1.11.

@randall77 randall77 added this to the Go1.14 milestone Feb 11, 2020
@dmitshur dmitshur added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Feb 11, 2020
@gopherbot
Copy link

Change https://golang.org/cl/219127 mentions this issue: cmd/compile: add rewrite to avoid MOVLi2f

@dr2chase
Copy link
Contributor

@jyknight suggests lack of a vzeroupper, perhaps in the Darwin signal handler.

@dr2chase
Copy link
Contributor

@cherrymui reports that gratuitously inserting the VZEROUPPER before the conversion instruction also works. Wheeee!

@randall77
Copy link
Contributor Author

Can we put the VZEROUPPER in our signal handler return path, when the code that was interrupted was Go code? That would be a more efficient way to ensure avx-cleanliness of float registers for Go.

@dr2chase
Copy link
Contributor

We have an office full of people playing musical VZEROUPPERs trying to figure out what the problem is. A whole bunch of things don't help, we think the bug is in Darwin....

@cherrymui
Copy link
Member

@dr2chase tried that, it didn't help. I also tried adding VZEROUPPER to asyncPreempt, it didn't help either. I also tried to change sigtramp to immediately return, it also didn't help...

@randall77
Copy link
Contributor Author

Here's a C/assembly reproducer:

main.c

#include <stdio.h>
#include <stdint.h>
#include <signal.h>

#define N 64
#define SIG SIGFPE

// Does z = y+float64(x) for 64 entries. This is the main benchmark.
void add(uint64_t *x, double *y, double *z);

uint64_t x[N];
double y[N];
double z[N];

void handler(int sig, siginfo_t *info, void *arg) {
}

int main(int argc, char *argv[]) {
  // Install handler for a signal.
  // sa_sigaction is required; sa_handler doesn't trigger the bug.
  struct sigaction act;
  act.sa_sigaction = handler;
  act.sa_flags |= SA_SIGINFO;
  sigaction(SIG, &act, NULL);
  // Raise the signal, which dirties the upper bits of YMM registers.
  raise(SIG);

  // Run the benchmark.
  for (int i = 0; i < 40000000; i++) {
    add(&x[0], &y[0], &z[0]);
  }
  return 0;
}

add.s:

	.globl	_add
_add:
	// Uncomment the vzeroupper to "fix" the bug.
	//vzeroupper
	movq	$0, %rax
loop:
	cmpq	$64, %rax
	je	done
	movq	(%rdi, %rax, 8), %rcx
	movq	%rcx, %xmm0
	addsd	(%rsi, %rax, 8), %xmm0
	movsd	%xmm0, (%rdx, %rax, 8)
	incq	%rax
	jmp	loop

done:
	ret

@cherrymui
Copy link
Member

I also tried adding VZEROUPPER to asyncPreempt, it didn't help either.

That was not true. I had nosplit in mulFast/mulSlow left for previous experiments, which caused it not to actually inject the call. Removing the nosplit makes it work, and it actually faster.

Before:

$ go-tmp test -bench=. .
goos: darwin
goarch: amd64
BenchmarkFast-8   	14685382	        90.9 ns/op
BenchmarkSlow-8   	 8348937	       184 ns/op
PASS
ok  	_/tmp/pp	3.121s

After:

$ go-tmp test -bench=. .
goos: darwin
goarch: amd64
BenchmarkFast-8   	24309422	        48.6 ns/op
BenchmarkSlow-8   	25514097	        49.0 ns/op
PASS
ok  	_/tmp/pp	2.553s

@gopherbot
Copy link

Change https://golang.org/cl/219131 mentions this issue: runtime: zero upper bit of Y registers in asyncPreempt on darwin/amd64

@mknyszek
Copy link
Contributor

mknyszek commented Feb 12, 2020

@randall77 @jyknight and I tried also commenting out the "raise" line and leaving "vzeroupper" uncommented and the performance was somewhere in between "vzeroupper" being used and not (with raise still in there), which doesn't really make much sense to me. Might be a useful data point.

@randall77
Copy link
Contributor Author

On our 1.11 builders, my C repro is ~14% slower with the VZEROUPPER commented out.
On 1.12, ~13% slower.
On 1.14, ~11% faster.
On 1.15, ~11% faster.

So for some reason this bug doesn't appear on the builders like it does on my desktop. I could imagine that some virtualization layer hides the bug, and the VZEROUPPER instruction has a cost.

@randall77
Copy link
Contributor Author

I reported a bug to apple here.
We'll see if we get any traction on that.

gopherbot pushed a commit that referenced this issue Feb 13, 2020
Apparently, the signal handling code path in darwin kernel leaves
the upper bits of Y registers in a dirty state, which causes many
SSE operations (128-bit and narrower) become much slower. Clear
the upper bits to get to a clean state.

We do it at the entry of asyncPreempt, which is immediately
following exiting from the kernel's signal handling code, if we
actually injected a call. It does not cover other exits where we
don't inject a call, e.g. failed preemption, profiling signal, or
other async signals. But it does cover an important use case of
async signals, preempting a tight numerical loop, which we
introduced in this cycle.

Running the benchmark in issue #37174:

name    old time/op  new time/op  delta
Fast-8  90.0ns ± 1%  46.8ns ± 3%  -47.97%  (p=0.000 n=10+10)
Slow-8   188ns ± 5%    49ns ± 1%  -73.82%  (p=0.000 n=10+9)

There is no more slowdown due to preemption signals.

For #37174.

Change-Id: I8b83d083fade1cabbda09b4bc25ccbadafaf7605
Reviewed-on: https://go-review.googlesource.com/c/go/+/219131
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@toothrot toothrot modified the milestones: Go1.14, Go1.15 Feb 25, 2020
@odeke-em
Copy link
Member

@randall77, unfortunately your bug report to Apple seems to return a 404.

@randall77
Copy link
Contributor Author

The link works fine for me. It is behind a login, though. Not sure if that would give you a 404 or not.

@odeke-em
Copy link
Member

I did login and tried again but in vain
3F389383-7881-4363-BA4B-10EB186F360F

Anyways it is not a problem, was just checking up on this issue as I was re-triaging bugs that are in the Go1.15 milestone. Sorry for the noise :)

@ianlancetaylor
Copy link
Contributor

Do we think that there is anything to fix in Go here?

@randall77
Copy link
Contributor Author

No, I think there's nothing left on our end.

The bug I linked to above now says:

Recent Similar Reports:Less than 10
Resolution:Potential fix identified - For a future OS update

Pretty vague, but at least something is happening.

Maybe when something is released, we can circle back and check the fix works, and then maybe remove our workaround (CL 219131) when the buggy versions of OSX become unsupported.
No need to keep the issue open, though.

@randall77
Copy link
Contributor Author

Apple claims this was fixed in 10.15.6. Anyone want to verify? I'm still on 10.15.2.
(Just build and run the C code and see what happens.)

@cuonglm
Copy link
Member

cuonglm commented Aug 31, 2020

Apple claims this was fixed in 10.15.6. Anyone want to verify? I'm still on 10.15.2.
(Just build and run the C code and see what happens.)

Nothing happens:

$ time ./a.out
=====
JOB ./a.out
99%    cpu
1.97s real
1.96s user
0.00s sys

@randall77
Copy link
Contributor Author

randall77 commented Aug 31, 2020

Try it with and without the VZEROUPPER in add.s commented out. If it's fixed, there shouldn't be any performance difference between the two.

@cuonglm
Copy link
Member

cuonglm commented Aug 31, 2020

Try it with and without the VZEROUPPER in add.s commented out. If it's fixed, there shouldn't be any performance difference between the two.

Here's with VZEROUPPER commented out:

=====
JOB ./a.out
99%    cpu
1.88s real
1.88s user
0.00s sys

So it's still slightly faster than with VZEROUPPER commented (I posted above).

@randall77
Copy link
Contributor Author

Ok, sounds fixed.
So once we stop supporting releases <10.15.6, we can get rid of the VZEROUPPER patch.
We need a way to ask googlebot to reopen issues triggered on a minimum supported OS level.

@dmitshur
Copy link
Contributor

We need a way to ask googlebot to reopen issues triggered on a minimum supported OS level.

We don't have the ability to track minimum OS support yet, so the best alternative I can suggest for now is to open a new tracking issue.

@randall77
Copy link
Contributor Author

@HenkPoley thinks this might have regressed in 10.15.7. See his comment in #41152.

@randall77 randall77 reopened this Nov 17, 2020
@HenkPoley
Copy link

HenkPoley commented Nov 17, 2020

Just for the record. I have no system to test this on. I don't run Go. Just gave a headsup that the issue in macOS that Apple forgets to use VZEROUPPER in their code (causing issues here) seems to have re-appeared. Which was the original issue, isn't it?

I don't think Go need to do anything, except shift that 'remove VZEROUPPER patch once Darwin <10.15.6 is not supported #41152' to =<10.15.7

@randall77
Copy link
Contributor Author

Right, sorry, it's not that there's any fix to do for this bug. We just need to re-evaluate the point at which the currently checked-in fix will be superfluous.

@dmitshur
Copy link
Contributor

dmitshur commented Dec 3, 2020

@randall77 Does this issue still need to be open, or is it enough that #41152 is open?

It's currently milestoned for Go 1.15, so if it needs to stays open, we should move it to a milestone that isn't in the past. If there's something different to do, it might be more clear to open a new issue and reference this one.

@randall77
Copy link
Contributor Author

Yes, #41152 is all we need now.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin Performance
Projects
None yet
Development

No branches or pull requests