-
Notifications
You must be signed in to change notification settings - Fork 18k
runtime: Darwin slow when using signals + int->float instructions #37174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Change https://golang.org/cl/219127 mentions this issue: |
@jyknight suggests lack of a vzeroupper, perhaps in the Darwin signal handler. |
@cherrymui reports that gratuitously inserting the VZEROUPPER before the conversion instruction also works. Wheeee! |
Can we put the VZEROUPPER in our signal handler return path, when the code that was interrupted was Go code? That would be a more efficient way to ensure avx-cleanliness of float registers for Go. |
We have an office full of people playing musical VZEROUPPERs trying to figure out what the problem is. A whole bunch of things don't help, we think the bug is in Darwin.... |
@dr2chase tried that, it didn't help. I also tried adding VZEROUPPER to asyncPreempt, it didn't help either. I also tried to change sigtramp to immediately return, it also didn't help... |
Here's a C/assembly reproducer: main.c
add.s:
|
That was not true. I had nosplit in mulFast/mulSlow left for previous experiments, which caused it not to actually inject the call. Removing the nosplit makes it work, and it actually faster. Before:
After:
|
Change https://golang.org/cl/219131 mentions this issue: |
@randall77 @jyknight and I tried also commenting out the "raise" line and leaving "vzeroupper" uncommented and the performance was somewhere in between "vzeroupper" being used and not (with raise still in there), which doesn't really make much sense to me. Might be a useful data point. |
On our 1.11 builders, my C repro is ~14% slower with the VZEROUPPER commented out. So for some reason this bug doesn't appear on the builders like it does on my desktop. I could imagine that some virtualization layer hides the bug, and the VZEROUPPER instruction has a cost. |
I reported a bug to apple here. |
Apparently, the signal handling code path in darwin kernel leaves the upper bits of Y registers in a dirty state, which causes many SSE operations (128-bit and narrower) become much slower. Clear the upper bits to get to a clean state. We do it at the entry of asyncPreempt, which is immediately following exiting from the kernel's signal handling code, if we actually injected a call. It does not cover other exits where we don't inject a call, e.g. failed preemption, profiling signal, or other async signals. But it does cover an important use case of async signals, preempting a tight numerical loop, which we introduced in this cycle. Running the benchmark in issue #37174: name old time/op new time/op delta Fast-8 90.0ns ± 1% 46.8ns ± 3% -47.97% (p=0.000 n=10+10) Slow-8 188ns ± 5% 49ns ± 1% -73.82% (p=0.000 n=10+9) There is no more slowdown due to preemption signals. For #37174. Change-Id: I8b83d083fade1cabbda09b4bc25ccbadafaf7605 Reviewed-on: https://go-review.googlesource.com/c/go/+/219131 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>
@randall77, unfortunately your bug report to Apple seems to return a 404. |
The link works fine for me. It is behind a login, though. Not sure if that would give you a 404 or not. |
Do we think that there is anything to fix in Go here? |
No, I think there's nothing left on our end. The bug I linked to above now says:
Pretty vague, but at least something is happening. Maybe when something is released, we can circle back and check the fix works, and then maybe remove our workaround (CL 219131) when the buggy versions of OSX become unsupported. |
Apple claims this was fixed in 10.15.6. Anyone want to verify? I'm still on 10.15.2. |
Nothing happens:
|
Try it with and without the VZEROUPPER in add.s commented out. If it's fixed, there shouldn't be any performance difference between the two. |
Here's with VZEROUPPER commented out:
So it's still slightly faster than with VZEROUPPER commented (I posted above). |
Ok, sounds fixed. |
We don't have the ability to track minimum OS support yet, so the best alternative I can suggest for now is to open a new tracking issue. |
@HenkPoley thinks this might have regressed in 10.15.7. See his comment in #41152. |
Just for the record. I have no system to test this on. I don't run Go. Just gave a headsup that the issue in macOS that Apple forgets to use VZEROUPPER in their code (causing issues here) seems to have re-appeared. Which was the original issue, isn't it? I don't think Go need to do anything, except shift that 'remove VZEROUPPER patch once Darwin <10.15.6 is not supported #41152' to =<10.15.7 |
Right, sorry, it's not that there's any fix to do for this bug. We just need to re-evaluate the point at which the currently checked-in fix will be superfluous. |
@randall77 Does this issue still need to be open, or is it enough that #41152 is open? It's currently milestoned for Go 1.15, so if it needs to stays open, we should move it to a milestone that isn't in the past. If there's something different to do, it might be more clear to open a new issue and reference this one. |
Yes, #41152 is all we need now. |
Split off from #37121
For some strange reason, code that includes int->float instructions runs a lot slower when profiling is on.
This bug is reproducible from at least 1.11.
The text was updated successfully, but these errors were encountered: