Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/sys/cpu: cpu.X86.HasAVX512 is incorrectly always false on darwin #43089

Closed
vsivsi opened this issue Dec 9, 2020 · 20 comments
Closed

x/sys/cpu: cpu.X86.HasAVX512 is incorrectly always false on darwin #43089

vsivsi opened this issue Dec 9, 2020 · 20 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin
Milestone

Comments

@vsivsi
Copy link

vsivsi commented Dec 9, 2020

What version of Go are you using (go version)?

$ go version
go version go1.15.5 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/vsi/Library/Caches/go-build"
GOENV="/Users/vsi/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/vsi/go/pkg/mod"
GONOPROXY="github.com/vsivsi"
GONOSUMDB="github.com/vsivsi"
GOOS="darwin"
GOPATH="/Users/vsi/go"
GOPRIVATE="github.com/vsivsi"
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.15.5/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.15.5/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/kp/kjdr0ytx5z9djnq4ysl15x0h0000gn/T/go-build186752670=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

Test for AVX512 support using cpu.X86.HasAVX512

main.go

package main

import (
	"fmt"

	"golang.org/x/sys/cpu"
)

func main() {
	fmt.Println(cpu.X86.HasAVX512)
}

What did you expect to see?

The program above should print true on any OS/hardware combination that is capable of running AVX512 instructions.

What did you see instead?

This program prints false on all Macs that are perfectly capable of running AVX512 instructions generated by the Go assembler.

The reason is complicated, and appears to have to do with how recent versions of the darwin kernel (those since AVX512 enabled processors began appearing in Mac hardware) choose to support the greatly expanded AVX512 thread state.

In summary, darwin implements a two-tier "promotion" based scheme to economize on saving thread state when AVX512 specific registers are not in use. It implements this by initially disabling AVX512 support for new threads, and then trapping undefined instruction faults for AVX512 instructions in the kernel, enabling AVX512 support for the thread, and then restarting execution at the faulted instruction. This scheme has the advantage of maintaining pre-AVX512 efficiency when preempting threads that haven't used any AVX512 extensions. But the cost appears to be that testing for AVX512 support is more complex.

Specifically, this code assumes that disabled AVX512 OS support is permanent:

https://github.com/golang/sys/blob/master/cpu/cpu_x86.go#L90

The test in the code above is performed at init time before any AVX512 instructions have been run, and hence the bits inspected from xgetbv() reflect at that point that AVX512 support is disabled by the OS. Upon failing that test (cpu.X86.HasAVX512 != true), the CPUID bits indicating that the hardware is AVX512 capable are simply ignored.

Given darwin's two-tier thread state scheme, clearly something more sophisticated is needed here to properly detect whether AVX512 instructions can be run.

Here is a reference to the darwin code implementing these checks:
https://github.com/apple/darwin-xnu/blob/0a798f6738bc1db01281fc08ae024145e84df927/osfmk/i386/fpu.c#L176

And here is an issue on an Intel compiler project raising the same problem:
ispc/ispc#1854

There is also a known issue with darwin where threads executing unsupported AVX512 instructions get stuck in a tight loop of some kind, so properly detecting AVX512 support and the CPUID flags for specific extensions is critical. See:

#42649

@vsivsi
Copy link
Author

vsivsi commented Dec 9, 2020

Note: All of the above results were produced using MacOS Catalina (10.15.7). I haven't yet been able to test any of this on Big Sur (11.x).

I've reproduced this behavior on two different Macs supporting AVX512:

  • Mac Pro (2019) with 2.7 GHz 24-Core Intel Xeon W
  • MacBook Pro (13-inch, 2020, Four Thunderbolt 3 ports) with 2.3 GHz Quad-Core Intel Core i7

Both of these processors support AVX512 (with varying extensions) and I have successfully run golang programs using AVX512 instructions assembled by the go assembler on each of them. Yet cpu.X86.HasAVX512 is always false.

@randall77
Copy link
Contributor

I'm not sure how we would possibly rectify this situation. We ask the hardware if it supports AVX512, and it says no.
Then what? Possibly

  1. Install a SIGILL signal handler
  2. Issue an AVX512 instruction
    3a. If we end up in the signal handler, set cpu.X86.HasAVX512 to false and disable step 3b.
    3b. If don't end up in the signal handler, set cpu.X86.HasAVX512 to true (or maybe recheck the hardware support bitvector?).
  3. Uninstall signal handler

Seems awfully complicated. We'd have to do this on startup of pretty much every Go binary. We're already parsimonious about how many system calls we make on startup; this goes in decidedly the wrong direction.

Why can't OSX set the hardware enabled bits? Is it because it can't trap on instructions unless the support bits are 0? It would be nice if the interrupt-on-these-instructions bits were distinct from the these-instructions-are-supported bits.

@martisch martisch changed the title internal/cpu: cpu.X86.HasAVX512 incorrectly always returns false on darwin x/sys/cpu: cpu.X86.HasAVX512 incorrectly always returns false on darwin Dec 9, 2020
@gopherbot gopherbot added this to the Unreleased milestone Dec 9, 2020
@martisch
Copy link
Contributor

martisch commented Dec 9, 2020

I agree that claiming in CPUID the instruction is not supported and then still using it is not how it should be. Some hypervisors allow to explicitly disable AVX or AVX512 to allow migration to non AVX/AVX512 machines. So just trying the instruction might not work for those scenarios.

MacBook Pro (13-inch, 2020, Four Thunderbolt 3 ports):

 % sw_vers
ProductName:	Mac OS X
ProductVersion:	10.15.7
BuildVersion:	19H15
% sysctl machdep.cpu.brand_string 
machdep.cpu.brand_string: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz

These are 1 even when the program from the first comment in the issue returns false:

sysctl -a | grep avx512
hw.optional.avx512f: 1
hw.optional.avx512cd: 1
hw.optional.avx512dq: 1
hw.optional.avx512bw: 1
hw.optional.avx512vl: 1
hw.optional.avx512ifma: 1
hw.optional.avx512vbmi: 1
% sysctl machdep.cpu.features
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C

@vsivsi
Copy link
Author

vsivsi commented Dec 9, 2020

Quoting from the Darwin source linked in the OP:

On machines with AVX512 support, by default, threads are created with
AVX512 masked off in XCR0 and an AVX-sized savearea is used. However, AVX512
capabilities are advertised in the commpage and via sysctl.

So the true answer is to probably use one of those kernel supplied mechanisms. I’m not a kernel hacker so the best way to do that is outside my wheelhouse.

It does seem to be the case that the CPUID bits for AVX512 always correctly reflect what the processor itself can do. But XCR0 bits (controlled by the kernel) take precedence of course, because it would be unsafe to execute instructions against register state that might not survive a thread preemption.

So for Darwin, when might it be unsafe to just go by the CPUID bits (AVX512F specifically) ignoring XCR0, and just assume that the thread will be promoted by the kernel?

As near as I can tell, AVX512 has only shipped in 3 lines of Macs, the 2017 iMac Pro (released with MacOS 10.13 High Sierra) and the 2019 Mac Pro and 2020 MacBook Pro 13” (both released with MacOS 10.15 Catalina). I have access to a 2017 iMac Pro that is still running 10.14/Mojave, and can test what happens there tomorrow.

Basically, for Apple Mac hardware running MacOS, I think the only issue is whether there is a version of MacOS/Darwin that supports one of those three machines that doesn’t support AVX512 at all. I’d be surprised if Apple shipped that 2017 iMac Pro with a kernel that couldn’t use the full instruction set, but I can’t currently test it. The public change date for the AVX512 support in Darwin is 9/26/2017, which is exactly one day after MacOS 10.13 released. So it’s fair bet that whatever version of MacOS 10.13 shipped on the 2017 iMac Pro contained those changes.

Which is all a long winded way of saying that for that part of the Darwin universe that consists of Apple hardware running MacOS that supports that hardware, if the processor CPUID bit for AVX512F is set, then you can probably assume that the Darwin kernel will promote a thread that uses AVX512 instructions.

So what other cases are there to care about? Darwin is an open source kernel, so it’s conceivable that there are people running old non-MacOS Darwin, that predates the AVX512 support, on processors that do have AVX512 per the CPUID bits. In that case, respecting the XCR0 bits is critical to identifying that the kernel does not support those instructions, and if you try to run them you will get UD faults regardless of what CPUID says. But how many people are actually doing this? And trying to run AVX512 on this setup? Probably tiny compared to the MacOS installed base, but it’s probably non-zero.

And of course there are people running Hackintoshes, who might be running pre-10.13 MacOS on hardware supporting AVX512, and that will be the same situation as above, although probably somewhat larger numbers.

Anyway, I’ve convinced myself that it’s safe enough for me to just go by the CPUID bits when running on MacOS and ignore the XCR0 bits and assume that the darwin kernel will promote.

But for the golang cpu package, IMO the code needs to actually consult sysctl or the “commpage” (whatever that is) and not make any assumptions.

@vsivsi
Copy link
Author

vsivsi commented Dec 9, 2020

To golang’s credit, it currently appears to do precisely what Intel recommends:

Prior to using AVX-512 Foundation instructions, the application must identify that the operating system supports the XGETBV instruction and the ZMM register state, in addition to confirming the processor’s support for ZMM state management using XSAVE/XRSTOR and AVX-512 Foundation instructions. The following simplified sequence accomplishes both and is strongly recommended.

  1. Detect CPUID.1:ECX.OSXSAVE[bit 27] = 1 (XGETBV enabled for application use1).
  2. Execute XGETBV and verify that XCR0[7:5] = ‘111b’ (OPMASK state, upper 256-bit of ZMM0-ZMM15 and ZMM16-ZMM31 state are enabled by OS) and that XCR0[2:1] = ‘11b’ (XMM state and YMM state are enabled by OS).
  3. Detect CPUID.0x7.0:EBX.AVX512F[bit 16] = 1.

From section 15.2 of volume 1 of the Intel software developers manual.

Unfortunately the darwin kernel has other ideas...

@martisch
Copy link
Contributor

martisch commented Dec 9, 2020

Yes Go has been following what Intel recommends to check and that has worked on other Operating Systems so far.

Two other categories of valid usage of darwin/amd64 that may need to be considered for interpreting CPUID:

So far if this needs fixing because darwin is set to do it differently then I would likely recommend we check sysctls to detect if AVX512 (maybe zmm if there is a ctl for it) is supported. We can make one check if AVX512F is supported in sysctls and only if that is enabled check the rest.

@vsivsi
Copy link
Author

vsivsi commented Dec 9, 2020

@martisch Thanks for those additional cases. Questions:

On point 1) AWS, I’m fairly certain the AWS MacOS support is not virtualized. They are renting bare metal MacMini machines as I recall from the announcements. e.g.: Amazon Web Services adds macOS on bare metal to EC2. That doesn’t change the fact that this might happen in the future however!

On point 2) I haven’t heard anything at all about AVX512 on Rosetta2, I would be more than a little surprised if they went to the Great trouble of emulating it given the relatively small installed base of machines/sw that use it, and the fact that any such software will almost certainly provide non-AVX512 code paths for when it is not supported (and cannot blame Apple if they don’t properly check these things, which brings us right back to why this issue is important!)

Thanks again!

@martisch
Copy link
Contributor

martisch commented Dec 9, 2020

  1. You are right AWS is not actually offering virtualized OS X instances it seems. I misremembered that. VMware and other products still let you run OS X virtualized as far as I have seen. This may also allow the migration of VMs from Apple hardware with AVX512 to Apple hardware without AVX512.

  2. I dont think any AVX is supported by Rosetta2. (https://developer.apple.com/documentation/apple_silicon/about_the_rosetta_translation_environment: "What Can't Be Translated?") I was just thinking about the case where CPUID claims AVX512 is not supported and then it actually isnt on modern OS X versions.

I have not tested this but if I understand your later replies correctly OS X isnt actually claiming in CPUID that AVX512 isnt supported on CPUs that support it but rather that XCR0 claims ZMM registers are not supported (saved and restored). So there isnt actually a case of OSX "lying" about the instruction set extension support but it is rather that OSX changes its stance whether it saves and restores ZMM (and other) registers as reported in XCR0 based on when an AVX512 instruction was used.

If thats the case Go might just check AVX512F in sysctl and if that is "1" Go should assume ZMM registers are supported and rely on CPUID for determining all the instructions extensions that may or may not be supported.

On the other hand in "Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2
2.6.11.1 State Dependent #UD" Intel EVEX-encoded 512-bit instructions require XCR0 Bit Vector [7:0] to be 111xx111b to not cause an invalid opcode exception.

@randall77
Copy link
Contributor

OSX changes its stance whether it saves and restores ZMM (and other) registers as reported in XCR0 based on when an AVX512 instruction was used.

This sounds like an OSX bug. Saving/restoring ZMM registers after the first AVX512 instruction is indistinguishable from saving/restoring ZMM registers from process start. So they should say they do, even if they don't until the first instruction.
Or is there another reason they would lie to us? Helping out user-level thread schedulers, perhaps? Not sure.

@martisch
Copy link
Contributor

martisch commented Dec 9, 2020

Tested this on my AVX512 supporting MacBook Pro:

When no AVX512 instruction has been executed:
Bits 5,6,7 of eax from xgetbv() are false and bit 16 of cpuid(7,0) is true ( AVX-512 Foundation).

When I add a VPXORD Z1, Z1, Z1 before xgetbv then:
Bits 5,6,7 of eax from xgetbv() are true and bit 16 of cpuid(7,0) is true ( AVX-512 Foundation).

My hunch why it might be setup this way is that setting ZMM register support to false in XCR0 will trigger an exception by the CPU when AVX512 is used which OS X can catch and then enable ZMM register state saving. So for programs that do not use AVX512 OS X can avoid the extra save and restore overhead for ZMM. Once AVX512 is used the OS will get notified with the exception then enable ZMM support, change XCR0 so that AVX512 doesnt trigger exceptions anymore and resume program execution now with the additional overhead for saves and restores of the register state for ZMM.

@vsivsi
Copy link
Author

vsivsi commented Dec 9, 2020

@martisch Yes! Everything you wrote above agrees exactly with both my testing and my understanding of the situation.

So I think there are two reasonable possible ways for golang to handle this:

  1. update the code in the cpu package with a special case for darwin to check for AVX512 support using a call to sysctl instead of trusting the XCR0 bits (as has already been discussed above).

OR

  1. setup every thread spawned by the golang runtime on darwin to enable AVX512 by default using thread_set_state() (when on AVX512 hardware, which presumably darwin will handle correctly).

Per the Darwin documentation:

In addition to AVX512 instructions causing promotion, the thread_set_state()
primitive with an AVX512 state flavor result in promotion.

AVX512 promotion of the first thread in a task causes the default xstate
of the task to be promoted so that any subsequently created or subsequently
DNA-faulted thread will have AVX512 xstate and it will not need to fault-in
a promoted xstate.

So this would only need to be done for the original thread in the runtime process. All subsequent threads should inherit this state. Con: This would make thread preemption of non-AVX512 Go programs on darwin less efficient than it would otherwise be (on hardware with AVX512). Pro: It would presumably make Darwin’s behavior consistent with all other platforms and require no changes to the cpu package. The only darwin “special case” would be promoting the initial process thread to force saving AVX512 state.

I’m actually pretty partial to option 2). Yes it would make OS task switching Go program threads less efficient on darwin than it would otherwise be when AVX512 is present and unused, but no less efficient than it already will be on any other OS platform in that case. Basically go would be “giving up” whatever performance advantage darwin provides with this rather ugly hack, in exchange for parity with every other OS and isolating the darwin specific code to this single initial thread state setup.

===

As an aside, it also seems clear now that the bug described in #42649 appears to be caused by this sequence:

  1. kernel traps an AVX-512 instruction fault
  2. kernel “promotes” thread to handle AVX512 state (modifying XCR0 bits) and restarts the trapped instruction
  3. kernel traps an actually unsupported AVX512 instruction (e.g. VPOPCNTQ)
  4. go to step 2
  5. profit? Nope. Infinite loop between 2-4

I mention this here because it’s possible that potential solution 2) above could alleviate this, depending on what darwin does internally with that call to set_thread_state()

@rasky
Copy link
Member

rasky commented Dec 10, 2020

It looks like Apple suggest to use sysctl or commpage to check for AVX512 capability rather than using XCR0. Spawning sysctl doesn't sound something we want to do in the runtime, but reading it from the commpage seems like a no-brainer. Why can't we just do that?

https://github.com/apple/darwin-xnu/blob/a449c6a3b8014d9406c2ddbdc81795da24aa7443/osfmk/i386/cpu_capabilities.h#L77
https://github.com/apple/darwin-xnu/blob/a449c6a3b8014d9406c2ddbdc81795da24aa7443/osfmk/i386/cpu_capabilities.h#L178

It looks like there's also a libsystem wrapper to access the CPU capabilities in the commpage, that we can call if we don't want to hardcode the commpage layout as that has bitten us already once:
https://github.com/apple/darwin-xnu/blob/0a798f6738bc1db01281fc08ae024145e84df927/libsyscall/wrappers/__get_cpu_capabilities.s#L35

@martisch
Copy link
Contributor

martisch commented Dec 10, 2020

It looks like Apple suggest to use sysctl or commpage to check for AVX512 capability rather than using XCR0.

x/sys/cpu already checks if AVX512 instruction extensions are supported by using CPUID and that works even on darwin/amd64 correctly on the machines I could test. AFAIK XCR0 does not contain information if AVX512 is supported by the CPU but if XMM, YMM, ZMM registers save and restore is supported by the Operating System. This can vary between Operating Systems on the same CPU.

https://github.com/apple/darwin-xnu/blob/a449c6a3b8014d9406c2ddbdc81795da24aa7443/osfmk/i386/cpu_capabilities.h
Does not mention ZMM or any restore or save behavior for ZMM it seems.

So the question open for me is what is the correct way on darwin/amd64 to check if ZMM save and restore will correctly be supported? Intel manuals say its XCR0 and that AVX512 even if supported by the CPU will throw an exception if OSXSAVE is not enabled and ZMM save and restore activated.

An even less computationally involved way would be to not check ZMM register support on darwin/amd64 and assume that any OS X that can run on a AVX512 supporting CPU always supports ZMM registers (not sure if that is true).

@vsivsi
Copy link
Author

vsivsi commented Dec 10, 2020

@martisch See back up this thread for my discussion of the risk of encountering a CPU supporting AVX512 with a version of Darwin that does not support the necessary register state save/restore.

TL;DR is for Apple Mac hardware running MacOS I think there is no risk (because as a rule, Mac hardware can’t run versions of MacOS earlier than the initial version that hardware shipped with.) For “Hackintoshes” (MacOS patched to run on non-Apple supported hardware) or open source Darwin distros (non-MacOS) there is no guarantee, since AVX512 supporting CPUs might be paired with versions of Darwin that pre-date its AVX512 support.

So if you as a developer know that your code will only ever be run on darwin using genuine Apple Mac hardware running unpatched MacOS, then is it probably sufficient to just check the CPUID bits for AVX512F and whenever other extensions your assembly code paths require, assuming in that case, darwin will promote the thread to set the correct XCR0 bits to support save/restore of the resister state used by your code.

But I don’t think golang can safely make that assumption, because strictly speaking, Darwin != MacOS. And also, it would violate both darwin and Intel’s documented recommendations.

So we are left with the two possibilities I articulated above:

  1. Modify the AVX512 checks in the cpu package for darwin, specifically to ignore the XCR0 bits, and instead use either the commpage info or call the sysctl interface, to determine whether the running version of darwin supports AVX512 (register state save/restore).

  2. Always use set_thread_state() in the go runtime on darwin at initialization to request the kernel to enable AVX512 (thread state save/restore) for all hardware threads in the process. In this case, the XCR0 bits will always correctly reflect the CPU capabilities at the time the cpu package inits, so no darwin specific checks there should be necessary.

There are trade offs for both of these that I discuss above and won’t repeat here. I’d be happy with either solution. It’s really up to the golang maintainers to decide which is the best fit given all of the other priorities and criteria you all need to consider.

@martisch
Copy link
Contributor

martisch commented Dec 10, 2020

and instead use either the commpage info or call the sysctl interface, to determine whether the running version of darwin supports AVX512 (register state save/restore).

What is the capability to check there? With a quick search I could not find anything about ZMM save and restore support. The CPUID already correctly reflects the instruction extension support state of the CPU. "hw.optional.avx512f" only seems to reflect if the CPU supports it which aligns with CPUID.

One scenario that is not yet clear to me where OS X might not support ZMM while the CPU does on original Apple hardware is if someone runs an older version of OS X in e.g. VMWare Fusion while the host OS is newer and has a CPU supporting AVX512.

If that cant happen then my gut feeling is to not check for XCR0 ZMM bits on darwin/arm64 when deciding if AVX512 is ok to use. Has no overhead and is a simple change.

@vsivsi
Copy link
Author

vsivsi commented Dec 10, 2020

Based on my understanding of everything, I think if the commpage advertises kHasAVX512F then it is safe to assume that darwin will promote threads attempting to use AVX512 instructions.

But I think that is different from what you are suggesting, if I’m understanding correctly, which it that if the CPUID bit for AVX512F is set then we can probably safely make that same assumption.

As I say above, I think that is almost certainly true for unmodified Apple hardware running unpatched MacOS.

Outside of that ideal walled garden, all bets are off. So one question is, does golang as a project care about this case? I assume it does, otherwise why call the OS target “darwin” and not “macos” or whatever? But I honestly don’t know the history of this.

Virtualization keeps coming up, and I think it also raises important potential cases. To my knowledge it is certainly plausible that an older (pre-AVX512) version of MacOS could be run on newer hardware that supports AVX512 under virtualization. If in that case the hypervisor chooses to enable AVX512 within that VM, then yes, it would be incorrect for golang to rely exclusively on the CPUID bits to determine support. If this scenario is indeed possible, it is another argument for checking the commpage for kHasAVX512F (which an older version of darwin shouldn’t set, regardless of what the CPUID says). What I don’t know is if the checks MacOS makes under virtualization to ensure it is running on genuine Mac hardware obey the same constraints as when it is running on bare metal.

To be clear, “older” here means MacOS 10.12 Sierra or before, i.e. versions of MacOS/OS X that are no longer supported by Apple.

@toothrot toothrot added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin labels Dec 11, 2020
@vsivsi vsivsi changed the title x/sys/cpu: cpu.X86.HasAVX512 incorrectly always returns false on darwin x/sys/cpu: cpu.X86.HasAVX512 is incorrectly always false on darwin Dec 13, 2020
@vsivsi
Copy link
Author

vsivsi commented Jan 14, 2021

This has been quiet for over a month now. It's an obvious bug with a straightforward solution.

What is the next step in the process to get this fixed in an upcoming release?

I'm happy to contribute in any way I can.

@martisch
Copy link
Contributor

martisch commented Jan 14, 2021

It's an obvious bug with a straightforward solution.

Feel free to contribute a CL. https://golang.org/doc/contribute.html

When writing the patch note that this is about x/sys/cpu and not internal/cpu (std lib doesnt use AVX512) and that x/sys/cpu should not pull in big dependencies such as x/sys/unix.

@vsivsi
Copy link
Author

vsivsi commented Jan 16, 2021

Before I go to the work of putting together a CL, here is my proposed implementation of the darwin commpage check:

In cpu_gc_x86.go add:

// darwinHasAVX512 is implemented in cpu_x86.s for gc compiler
// and in cpu_gccgo.go for gccgo.
func darwinHasAVX512() bool

In cpu_gccgo_x86.go add:

// gccgo doesn't build on Darwin, per:
// https://github.com/Homebrew/homebrew-core/blob/HEAD/Formula/gcc.rb#L7
func darwinHasAVX512() bool {
	return false
}

In cpu_x86.go (note addition of conditional darwin codepath):

if X86.HasOSXSAVE {
	eax, _ := xgetbv()
	// Check if XMM and YMM registers have OS support.
	osSupportsAVX = isSet(1, eax) && isSet(2, eax)

	if runtime.GOOS == "darwin" {
		// Check darwin commpage for AVX512 support
		osSupportsAVX512 = osSupportsAVX && darwinHasAVX512()
	} else {
		// Check if OPMASK and ZMM registers have OS support.
		osSupportsAVX512 = osSupportsAVX && isSet(5, eax) && isSet(6, eax) && isSet(7, eax)
	}
}

In cpu_x86.s add this code:

// func darwinHasAVX512() bool
TEXT ·darwinHasAVX512(SB), NOSPLIT, $0-1

    MOVB    $0, ret+0(FP) // default to false

#ifdef GOOS_darwin   // return if not darwin
#ifdef GOARCH_amd64  // return if not amd64

// These values from:
// https://github.com/apple/darwin-xnu/blob/xnu-4570.1.46/osfmk/i386/cpu_capabilities.h 
#define commpage64_base_address         0x00007fffffe00000
#define commpage64_cpu_capabilities64   (commpage64_base_address+0x010)
#define commpage64_version              (commpage64_base_address+0x01E) 
#define hasAVX512F                      0x0000004000000000

    MOVQ    $commpage64_version, BX
    MOVW    (BX), AX
    CMPW    AX, $13  // versions < 13 do not support AVX512
    JL      no_avx512
    MOVQ    $commpage64_cpu_capabilities64, BX
    MOVQ    (BX), AX
    MOVQ    $hasAVX512F, CX
    ANDQ    CX, AX
    JZ      no_avx512
    MOVB    $1, ret+0(FP) 
no_avx512:

#endif
#endif

    RET

Note that the cpu_capabilities field occurs above the page_version in the commpage layout.
Seems very unlikely these fields will ever move or change.

Also note the use of conditional compilation above to render the darwinHasAVX512 function a stub that always returns false in all cases except GOOS=="darwin" && GOARCH=="amd64". This seems preferable to the alternative, which would be to create two new separate assembly files for the only purpose of providing the above function and its stub in those two cases.

@gopherbot
Copy link

Change https://golang.org/cl/285572 mentions this issue: x/sys/cpu: Fix cpu.X86.HasAVX512 determination on Darwin

seiko2plus added a commit to seiko2plus/numpy that referenced this issue Jun 27, 2021
  On Darwin, machines with AVX512 support, by default, threads are created with
  AVX512 masked off in XCR0 and an AVX-sized savearea is used.
  However, AVX512 capabilities are advertised in the commpage and via sysctl.

  For more information, check:
   - https://github.com/apple/darwin-xnu/blob/0a798f6738bc1db01281fc08ae024145e84df927/osfmk/i386/fpu.c#L175-L201
   - golang/go#43089
   - numpy#19319
charris pushed a commit to charris/numpy that referenced this issue Jun 27, 2021
  On Darwin, machines with AVX512 support, by default, threads are created with
  AVX512 masked off in XCR0 and an AVX-sized savearea is used.
  However, AVX512 capabilities are advertised in the commpage and via sysctl.

  For more information, check:
   - https://github.com/apple/darwin-xnu/blob/0a798f6738bc1db01281fc08ae024145e84df927/osfmk/i386/fpu.c#L175-L201
   - golang/go#43089
   - numpy#19319
@golang golang locked and limited conversation to collaborators May 3, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin
Projects
None yet
Development

No branches or pull requests

6 participants