math: FMA is slower than non-FMA calculation #36196

Closed
mattn opened this issue Dec 18, 2019 · 17 comments

Comments

@mattn
Member

mattn commented Dec 18, 2019

What version of Go are you using (go version)?

$ go version
go version devel +0377f06168 Tue Dec 17 20:57:06 2019 +0000 windows/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
set GO111MODULE=auto
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\mattn\AppData\Local\go-build
set GOENV=C:\Users\mattn\AppData\Roaming\go\env
set GOEXE=.exe
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=C:\Users\mattn\go
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=C:\go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=C:\go\pkg\tool\windows_amd64
set GCCGO=gccgo
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=C:\Users\mattn\AppData\Local\Temp\go-build216348474=/tmp/go-build -gno-record-gcc-switches

What did you do?

package main

import (
	"fmt"
	"math"
	"math/rand"
	"time"
)

var a, b, c, d float64

func pure_fma_func() {
	d = math.FMA(a, b, c)
}

func non_fma_func() {
	d = a*b + c
}

func main() {
	const n = 1000000000

	a = rand.Float64()
	b = rand.Float64()
	c = rand.Float64()

	t1 := time.Now()
	for i := int64(0); i < n; i++ {
		non_fma_func()
	}
	t2 := time.Now()
	for i := int64(0); i < n; i++ {
		pure_fma_func()
	}
	t3 := time.Now()

	fmt.Println("non FMA", t2.Sub(t1))
	fmt.Println("    FMA", t3.Sub(t2))
}

Then I ran it with go run.

What did you expect to see?

math.FMA is faster than non-FMA code.

What did you see instead?

non FMA 548.0314ms
    FMA 924.0528ms

I confirmed that my CPU supports the FMA instructions. Is this function call overhead?

@martisch
Contributor

Which CPU is used for the benchmark?

Please benchmark with GODEBUG=cpu.fma=off and see if this changes anything. If FMA instructions are used I would expect a change in the FMA benchmark numbers.

Please also write the benchmark as a go benchmark:
https://dave.cheney.net/2013/06/30/how-to-write-benchmarks-in-go

Then execute the test with -count=20 and store the results in a file. Use a quiet machine with e.g. no browser or videos running.

Afterwards use https://godoc.org/golang.org/x/tools/cmd/benchcmp
to produce an "average" over the runs.

This will give better information on how consistent the results are between runs.

@martisch martisch added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Dec 18, 2019
@mattn
Member Author

mattn commented Dec 18, 2019

Intel Core i5 4460

package fma_test

import (
	"math"
	"math/rand"
	"testing"
)

var va, vb, vc, vd float64

func pure_fma_func() {
	vd = math.FMA(va, vb, vc)
}

func non_fma_func() {
	vd = va*vb + vc
}

func BenchmarkFMA(b *testing.B) {
	va = rand.Float64()
	vb = rand.Float64()
	vc = rand.Float64()

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		pure_fma_func()
	}
}

func BenchmarkNonFMA(b *testing.B) {
	va = rand.Float64()
	vb = rand.Float64()
	vc = rand.Float64()

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		non_fma_func()
	}
}
set GODEBUG=cpu.fma=off
go test -bench . > old
set GODEBUG=
go test -bench . > new
benchstat old new

old

goos: windows
goarch: amd64
pkg: github.com/mattn/fma-example
BenchmarkFMA-4      	1000000000	         0.930 ns/op
BenchmarkNonFMA-4   	1000000000	         0.615 ns/op
PASS
ok  	github.com/mattn/fma-example	1.855s

new

goos: windows
goarch: amd64
pkg: github.com/mattn/fma-example
BenchmarkFMA-4      	1000000000	         0.912 ns/op
BenchmarkNonFMA-4   	1000000000	         0.618 ns/op
PASS
ok  	github.com/mattn/fma-example	1.816s

benchstat

name      old time/op  new time/op  delta
FMA-4     0.93ns ± 0%  0.91ns ± 0%   ~     (p=1.000 n=1+1)
NonFMA-4  0.61ns ± 0%  0.62ns ± 0%   ~     (p=1.000 n=1+1)

@mattn
Member Author

mattn commented Dec 18, 2019

Added -count=20

old

goos: windows
goarch: amd64
pkg: github.com/mattn/fma-example
BenchmarkFMA-4      	1000000000	         0.937 ns/op
BenchmarkFMA-4      	1000000000	         0.926 ns/op
BenchmarkFMA-4      	1000000000	         0.933 ns/op
BenchmarkFMA-4      	1000000000	         0.932 ns/op
BenchmarkFMA-4      	1000000000	         0.935 ns/op
BenchmarkFMA-4      	1000000000	         0.934 ns/op
BenchmarkFMA-4      	1000000000	         0.960 ns/op
BenchmarkFMA-4      	1000000000	         0.956 ns/op
BenchmarkFMA-4      	1000000000	         0.958 ns/op
BenchmarkFMA-4      	1000000000	         0.930 ns/op
BenchmarkFMA-4      	1000000000	         0.944 ns/op
BenchmarkFMA-4      	1000000000	         0.942 ns/op
BenchmarkFMA-4      	1000000000	         0.943 ns/op
BenchmarkFMA-4      	1000000000	         0.940 ns/op
BenchmarkFMA-4      	1000000000	         0.942 ns/op
BenchmarkFMA-4      	1000000000	         0.943 ns/op
BenchmarkFMA-4      	1000000000	         0.935 ns/op
BenchmarkFMA-4      	1000000000	         0.929 ns/op
BenchmarkFMA-4      	1000000000	         0.930 ns/op
BenchmarkFMA-4      	1000000000	         0.927 ns/op
BenchmarkNonFMA-4   	1000000000	         0.617 ns/op
BenchmarkNonFMA-4   	1000000000	         0.617 ns/op
BenchmarkNonFMA-4   	1000000000	         0.619 ns/op
BenchmarkNonFMA-4   	1000000000	         0.621 ns/op
BenchmarkNonFMA-4   	1000000000	         0.618 ns/op
BenchmarkNonFMA-4   	1000000000	         0.617 ns/op
BenchmarkNonFMA-4   	1000000000	         0.618 ns/op
BenchmarkNonFMA-4   	1000000000	         0.622 ns/op
BenchmarkNonFMA-4   	1000000000	         0.622 ns/op
BenchmarkNonFMA-4   	1000000000	         0.618 ns/op
BenchmarkNonFMA-4   	1000000000	         0.640 ns/op
BenchmarkNonFMA-4   	1000000000	         0.625 ns/op
BenchmarkNonFMA-4   	1000000000	         0.620 ns/op
BenchmarkNonFMA-4   	1000000000	         0.624 ns/op
BenchmarkNonFMA-4   	1000000000	         0.620 ns/op
BenchmarkNonFMA-4   	1000000000	         0.625 ns/op
BenchmarkNonFMA-4   	1000000000	         0.621 ns/op
BenchmarkNonFMA-4   	1000000000	         0.616 ns/op
BenchmarkNonFMA-4   	1000000000	         0.647 ns/op
BenchmarkNonFMA-4   	1000000000	         0.647 ns/op
PASS
ok  	github.com/mattn/fma-example	34.560s

new

goos: windows
goarch: amd64
pkg: github.com/mattn/fma-example
BenchmarkFMA-4      	1000000000	         0.948 ns/op
BenchmarkFMA-4      	1000000000	         0.936 ns/op
BenchmarkFMA-4      	1000000000	         0.932 ns/op
BenchmarkFMA-4      	1000000000	         0.923 ns/op
BenchmarkFMA-4      	1000000000	         0.938 ns/op
BenchmarkFMA-4      	1000000000	         0.927 ns/op
BenchmarkFMA-4      	1000000000	         0.921 ns/op
BenchmarkFMA-4      	1000000000	         0.928 ns/op
BenchmarkFMA-4      	1000000000	         0.916 ns/op
BenchmarkFMA-4      	1000000000	         0.946 ns/op
BenchmarkFMA-4      	1000000000	         0.970 ns/op
BenchmarkFMA-4      	1000000000	         0.959 ns/op
BenchmarkFMA-4      	1000000000	         0.938 ns/op
BenchmarkFMA-4      	1000000000	         0.938 ns/op
BenchmarkFMA-4      	1000000000	         0.956 ns/op
BenchmarkFMA-4      	1000000000	         0.976 ns/op
BenchmarkFMA-4      	1000000000	         0.955 ns/op
BenchmarkFMA-4      	1000000000	         0.966 ns/op
BenchmarkFMA-4      	1000000000	         0.942 ns/op
BenchmarkFMA-4      	1000000000	         0.943 ns/op
BenchmarkNonFMA-4   	1000000000	         0.617 ns/op
BenchmarkNonFMA-4   	1000000000	         0.618 ns/op
BenchmarkNonFMA-4   	1000000000	         0.624 ns/op
BenchmarkNonFMA-4   	1000000000	         0.623 ns/op
BenchmarkNonFMA-4   	1000000000	         0.618 ns/op
BenchmarkNonFMA-4   	1000000000	         0.629 ns/op
BenchmarkNonFMA-4   	1000000000	         0.624 ns/op
BenchmarkNonFMA-4   	1000000000	         0.630 ns/op
BenchmarkNonFMA-4   	1000000000	         0.622 ns/op
BenchmarkNonFMA-4   	1000000000	         0.634 ns/op
BenchmarkNonFMA-4   	1000000000	         0.622 ns/op
BenchmarkNonFMA-4   	1000000000	         0.633 ns/op
BenchmarkNonFMA-4   	1000000000	         0.635 ns/op
BenchmarkNonFMA-4   	1000000000	         0.627 ns/op
BenchmarkNonFMA-4   	1000000000	         0.631 ns/op
BenchmarkNonFMA-4   	1000000000	         0.633 ns/op
BenchmarkNonFMA-4   	1000000000	         0.618 ns/op
BenchmarkNonFMA-4   	1000000000	         0.628 ns/op
BenchmarkNonFMA-4   	1000000000	         0.630 ns/op
BenchmarkNonFMA-4   	1000000000	         0.617 ns/op
PASS
ok  	github.com/mattn/fma-example	34.760s

benchstat

name      old time/op  new time/op  delta
FMA-4     0.94ns ± 2%  0.94ns ± 4%    ~     (p=0.465 n=20+20)
NonFMA-4  0.62ns ± 1%  0.63ns ± 1%  +0.91%  (p=0.005 n=17+20)

@martisch
Contributor

martisch commented Dec 18, 2019

I missed that you were using Windows, on which Go does not support using GODEBUG to disable the CPU features it uses. I should consider printing a warning.

Disassembly for me shows:

main_test.go:12	0x4fabae		f20f110d3ab91600	MOVSD_XMM X1, _/usr/local/google/home/moehrmann/test_test.vd(SB)	
  main_test.go:25	0x4fabb6		48ffc1			INCQ CX									
  main_test.go:25	0x4fabb9		48398810010000		CMPQ CX, 0x110(AX)							
  main_test.go:25	0x4fabc0		7e66			JLE 0x4fac28								
  main_test.go:26	0x4fabc2		90			NOPL									
  main_test.go:12	0x4fabc3		f20f100515b91600	MOVSD_XMM _/usr/local/google/home/moehrmann/test_test.vb(SB), X0	
  main_test.go:12	0x4fabcb		f20f100d15b91600	MOVSD_XMM _/usr/local/google/home/moehrmann/test_test.vc(SB), X1	
  main_test.go:12	0x4fabd3		f20f1015fdb81600	MOVSD_XMM _/usr/local/google/home/moehrmann/test_test.va(SB), X2	
  main_test.go:12	0x4fabdb		803d95b8160000		CMPB $0x0, runtime.x86HasFMA(SB)					
  main_test.go:12	0x4fabe2		7407			JE 0x4fabeb								
  main_test.go:12	0x4fabe4		c4e2e9b9c8ebc348	MOVL $0x48c3ebc8, CX							
  main_test.go:25	0x4fabec		894c2420		MOVL CX, 0x20(SP)							
  main_test.go:12	0x4fabf0		f20f111424		MOVSD_XMM X2, 0(SP)							
  main_test.go:12	0x4fabf5		f20f1005e3b81600	MOVSD_XMM _/usr/local/google/home/moehrmann/test_test.vb(SB), X0	
  main_test.go:12	0x4fabfd		f20f11442408		MOVSD_XMM X0, 0x8(SP)							
  main_test.go:12	0x4fac03		f20f1005ddb81600	MOVSD_XMM _/usr/local/google/home/moehrmann/test_test.vc(SB), X0	
  main_test.go:12	0x4fac0b		f20f11442410		MOVSD_XMM X0, 0x10(SP)							
  main_test.go:12	0x4fac11		e82a28f8ff		CALL math.FMA(SB)							
  main_test.go:12	0x4fac16		f20f104c2418		MOVSD_XMM 0x18(SP), X1							
  main_test.go:25	0x4fac1c		488b442438		MOVQ 0x38(SP), AX							
  main_test.go:25	0x4fac21		488b4c2420		MOVQ 0x20(SP), CX							
  main_test.go:12	0x4fac26		eb86			JMP 0x4fabae								
  main_test.go:25	0x4fac28		488b6c2428		MOVQ 0x28(SP), BP							
  main_test.go:25	0x4fac2d		4883c430		ADDQ $0x30, SP								
  main_test.go:25	0x4fac31		c3			RET				

This looks wrong, since it's missing the VFMADD231SD instruction.

Using GOSSAFUNC=BenchmarkFMA go test -c however I see the VFMADD231SD instruction:

00035 (+12) MOVSD "".vb(SB), X0
00036 (12) MOVSD "".vc(SB), X1
00037 (12) MOVSD "".va(SB), X2
00038 (12) CMPB runtime.x86HasFMA(SB), $0
00039 (12) JEQ 42
00040 (12) VFMADD231SD X0, X2, X1
00041 (12) JMP 26
00042 (25) PCDATA $0, $0
00043 (25) MOVQ CX, "".i-8(SP)
00044 (12) MOVSD X2, (SP)
00045 (12) MOVSD "".vb(SB), X0
00046 (12) MOVSD X0, 8(SP)
00047 (12) MOVSD "".vc(SB), X0
00048 (12) MOVSD X0, 16(SP)
00049 (12) CALL math.FMA(SB)
00050 (12) MOVSD 24(SP), X1
00051 (25) PCDATA $0, $1
00052 (25) MOVQ "".b(SP), AX
00053 (25) MOVQ "".i-8(SP), CX
00054 (12) JMP 26
00055 (25) PCDATA $0, $-1
00056 (25) PCDATA $1, $-1
00057 (25) RET
00058 (?) END

Running with GODEBUG=cpu.fma=off go test -bench=. -count=20 -cpu=1
does make a huge difference on my Intel Xeon E5-1650 v3:

name    old time/op  new time/op   delta
FMA     0.96ns ± 6%  23.02ns ±14%  +2291.32%  (p=0.000 n=20+20)
NonFMA  0.57ns ± 5%   0.55ns ± 4%       ~     (p=0.064 n=20+18)

but both are slower than the non-FMA version. Note that the FMA version is more precise, as it does not round the intermediate product to 64 bits between the two steps.

@martisch
Contributor

VFMADD231SD has a 5-cycle latency on Haswell, and two can be executed in parallel. The same holds for MULSD. The added check and jump, as well as other factors, can indeed make the FMA version slower. The difference seems to be 1 or 2 cycles here. This might be working as intended (WAI), due to a slight overhead for runtime dispatch within the loop.

@martisch
Contributor

To verify that the overhead comes from the runtime dispatch that determines whether the CPU supports FMA, I changed the compiler in cmd/compile/internal/gc/ssa.go to not add any checks:

name    time/op
FMA     0.58ns ± 2%
NonFMA  0.56ns ± 4%

As noted, even if the two are equally fast, FMA has the advantage of not rounding the intermediate step.

As long as the built Go binary needs to support both FMA-capable and non-FMA-capable CPUs, there will be some overhead. Ideally the check could be moved outside the loop, but we do not have that currently. For the latter, I thought we already had a general bug to move the checks.

@smasher164
Member

For the latter, I thought we already had a general bug to move the checks.

The closest bug I could find related to this is #34950, which is intended to mark ops requiring feature detection such that the check isn't optimized away. However, there likely needs to be some issue to track the hoisting optimization for loop invariants. This will likely become even more important if vector intrinsics land in the stdlib someday.

@gopherbot

Change https://golang.org/cl/212360 mentions this issue: cmd/compile: add intrinsic HasCPUFeature for checking cpu features

@smasher164
Member

Running @josharian's CL (212360) provides a ~16% speedup on my Windows laptop, which had the same overhead mentioned above.

old.txt

BenchmarkFMA            733668986                1.54 ns/op
BenchmarkFMA            791581790                1.46 ns/op
BenchmarkFMA            796827562                1.48 ns/op
BenchmarkFMA            903921406                1.49 ns/op
BenchmarkFMA            679788086                1.55 ns/op
BenchmarkFMA            796837087                1.58 ns/op
BenchmarkFMA            786417004                1.50 ns/op
BenchmarkFMA            871895959                1.51 ns/op
BenchmarkFMA            853358212                1.42 ns/op
BenchmarkFMA            687550634                1.48 ns/op
BenchmarkFMA            884715651                1.46 ns/op
BenchmarkFMA            699548965                1.50 ns/op
BenchmarkFMA            776279486                1.54 ns/op
BenchmarkFMA            841439197                1.50 ns/op
BenchmarkFMA            742726387                1.49 ns/op
BenchmarkFMA            938591630                1.44 ns/op
BenchmarkFMA            954924385                1.49 ns/op
BenchmarkFMA            771302407                1.39 ns/op
BenchmarkFMA            932724160                1.54 ns/op
BenchmarkFMA            878254848                1.45 ns/op
BenchmarkNonFMA         1000000000               1.10 ns/op
BenchmarkNonFMA         1000000000               1.12 ns/op
BenchmarkNonFMA         1000000000               1.04 ns/op
BenchmarkNonFMA         1000000000               1.15 ns/op
BenchmarkNonFMA         1000000000               1.13 ns/op
BenchmarkNonFMA         1000000000               1.11 ns/op
BenchmarkNonFMA         1000000000               1.07 ns/op
BenchmarkNonFMA         1000000000               1.17 ns/op
BenchmarkNonFMA         1000000000               1.06 ns/op
BenchmarkNonFMA         1000000000               1.05 ns/op
BenchmarkNonFMA         1000000000               1.08 ns/op
BenchmarkNonFMA         1000000000               1.05 ns/op
BenchmarkNonFMA         1000000000               1.08 ns/op
BenchmarkNonFMA         1000000000               1.10 ns/op
BenchmarkNonFMA         1000000000               1.06 ns/op
BenchmarkNonFMA         1000000000               1.01 ns/op
BenchmarkNonFMA         1000000000               1.13 ns/op
BenchmarkNonFMA         1000000000               1.08 ns/op
BenchmarkNonFMA         1000000000               1.08 ns/op
BenchmarkNonFMA         1000000000               1.05 ns/op

new.txt

BenchmarkFMA            911520248                1.20 ns/op
BenchmarkFMA            962585889                1.23 ns/op
BenchmarkFMA            940021165                1.23 ns/op
BenchmarkFMA            1000000000               1.35 ns/op
BenchmarkFMA            1000000000               1.26 ns/op
BenchmarkFMA            954936543                1.20 ns/op
BenchmarkFMA            947425016                1.17 ns/op
BenchmarkFMA            911521633                1.33 ns/op
BenchmarkFMA            977869788                1.27 ns/op
BenchmarkFMA            947416040                1.27 ns/op
BenchmarkFMA            1000000000               1.20 ns/op
BenchmarkFMA            897926014                1.21 ns/op
BenchmarkFMA            940010120                1.24 ns/op
BenchmarkFMA            1000000000               1.26 ns/op
BenchmarkFMA            796828092                1.28 ns/op
BenchmarkFMA            807528860                1.25 ns/op
BenchmarkFMA            1000000000               1.24 ns/op
BenchmarkFMA            956333464                1.27 ns/op
BenchmarkFMA            978562149                1.27 ns/op
BenchmarkFMA            911528557                1.23 ns/op
BenchmarkNonFMA         1000000000               1.12 ns/op
BenchmarkNonFMA         1000000000               1.06 ns/op
BenchmarkNonFMA         1000000000               1.20 ns/op
BenchmarkNonFMA         947412301                1.06 ns/op
BenchmarkNonFMA         1000000000               1.11 ns/op
BenchmarkNonFMA         918488014                1.11 ns/op
BenchmarkNonFMA         1000000000               1.08 ns/op
BenchmarkNonFMA         1000000000               1.05 ns/op
BenchmarkNonFMA         1000000000               1.09 ns/op
BenchmarkNonFMA         986236250                1.07 ns/op
BenchmarkNonFMA         1000000000               1.10 ns/op
BenchmarkNonFMA         994421295                1.06 ns/op
BenchmarkNonFMA         1000000000               1.06 ns/op
BenchmarkNonFMA         1000000000               1.07 ns/op
BenchmarkNonFMA         1000000000               1.13 ns/op
BenchmarkNonFMA         1000000000               1.07 ns/op
BenchmarkNonFMA         1000000000               1.12 ns/op
BenchmarkNonFMA         1000000000               1.14 ns/op
BenchmarkNonFMA         1000000000               1.09 ns/op
BenchmarkNonFMA         1000000000               1.10 ns/op

benchstat old.txt new.txt

name    old time/op  new time/op  delta
FMA     1.49ns ± 7%  1.24ns ± 7%  -16.63%  (p=0.000 n=20+19)
NonFMA  1.09ns ± 8%  1.09ns ± 5%     ~     (p=0.680 n=20+19)

@smasher164 smasher164 removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Dec 25, 2019
@davecheney
Contributor

Are you sure the compiler isn’t optimising your code away?

@smasher164
Member

smasher164 commented Dec 27, 2019

@davecheney Just checked and it is not. Disassembling the binary with go tool objdump shows this section inside BenchmarkFMA:

  fma_test.go:12	0x50cb7c		84c0			TESTL AL, AL							
  fma_test.go:12	0x50cb7e		7407			JE 0x50cb87							
  fma_test.go:12	0x50cb80		c4e2e9b9c8ebc848	MOVL $0x48c8ebc8, CX						
  fma_test.go:25	0x50cb88		89542428		MOVL DX, 0x28(SP)

objdump doesn't know about the VFMA* instructions yet, though, so that roughly translates to the following (in Intel syntax):

84 c0             test        al,al
74 07             je          0x50cb87							
c4 e2 e9 b9 c8    vfmadd231sd xmm1,xmm2,xmm0
eb c8             jmp         0x50cb4f

al stores the value in runtime.x86HasFMA.

@davecheney
Contributor

Have a look at what the benchmark loop is compiling to.

@davecheney
Contributor

Ahh, ignore me, I missed that va, vb, etc. were package-level declarations.

@cagedmantis cagedmantis added this to the Unplanned milestone Dec 30, 2019
@cagedmantis cagedmantis added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 30, 2019
@cagedmantis cagedmantis modified the milestones: Unplanned, Backlog Dec 30, 2019
@cagedmantis
Contributor

/cc @griesemer @rsc

@martisch
Contributor

martisch commented Jan 1, 2020

As investigated above, the slowdown is caused by dynamically checking on every iteration that the FMA CPU capability is present.

math.FMA is still useful even if it is slower, as it is more precise than doing the computation with explicitly rounded temporary results.

One possible improvement, when FMA operations are executed in a loop with a small body, is to hoist the CPU feature check and/or load out of the loop (possibly even creating two loops). I suggest we create a new generic CPU feature detection issue for that and close this issue.

@martisch martisch removed the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 1, 2020
@martisch martisch modified the milestones: Backlog, Unplanned Jan 1, 2020
@smasher164
Member

Created the general CPU feature detection issue. Closing this one.

gopherbot pushed a commit that referenced this issue Apr 4, 2020
Before using some CPU instructions, we must check for their presence.
We use global variables in the runtime package to record features.

Prior to this CL, we issued a regular memory load for these features.
The downside to this is that, because it is a regular memory load,
it cannot be hoisted out of loops or otherwise reordered with other loads.

This CL introduces a new intrinsic just for checking cpu features.
It still ends up resulting in a memory load, but that memory load can
now be floated to the entry block and rematerialized as needed.

One downside is that the regular load could be combined with the comparison
into a CMPBconstload+NE. This new intrinsic cannot; it generates MOVB+TESTB+NE.
(It is possible that MOVBQZX+TESTQ+NE would be better.)

This CL does only amd64. It is easy to extend to other architectures.

For the benchmark in #36196, on my machine, this offers a mild speedup.

name      old time/op  new time/op  delta
FMA-8     1.39ns ± 6%  1.29ns ± 9%  -7.19%  (p=0.000 n=97+96)
NonFMA-8  2.03ns ±11%  2.04ns ±12%    ~     (p=0.618 n=99+98)

Updates #15808
Updates #36196

Change-Id: I75e2fcfcf5a6df1bdb80657a7143bed69fca6deb
Reviewed-on: https://go-review.googlesource.com/c/go/+/212360
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Giovanni Bajo <rasky@develer.com>
@gopherbot

Change https://golang.org/cl/227238 mentions this issue: cmd/compile: use MOVBQZX for OpAMD64LoweredHasCPUFeature

gopherbot pushed a commit that referenced this issue Apr 7, 2020
In the commit message of CL 212360, I wrote:

> This new intrinsic ... generates MOVB+TESTB+NE.
> (It is possible that MOVBQZX+TESTQ+NE would be better.)

I should have tested. MOVBQZX+TESTQ+NE does in fact appear to be better.

For the benchmark in #36196, on my machine:

name      old time/op  new time/op  delta
FMA-8     0.86ns ± 6%  0.70ns ± 5%  -18.79%  (p=0.000 n=98+97)
NonFMA-8  0.61ns ± 5%  0.60ns ± 4%   -0.74%  (p=0.001 n=100+97)

Interestingly, these are both considerably faster than
the measurements I took a couple of months ago (1.4ns/2ns).
It appears that CL 219131 (clearing VZEROUPPER in asyncPreempt) helped a lot.
And FMA is now once again slower than NonFMA, although this change
helps it regain some ground.

Updates #15808
Updates #36351
Updates #36196

Change-Id: I8a326289a963b1939aaa7eaa2fab2ec536467c7d
Reviewed-on: https://go-review.googlesource.com/c/go/+/227238
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@golang golang locked and limited conversation to collaborators Apr 5, 2021