math: FMA is slower than non-FMA calculation #36196

mattn · 2019-12-18T06:28:50Z

What version of Go are you using (`go version`)?

$ go version
go version devel +0377f06168 Tue Dec 17 20:57:06 2019 +0000 windows/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (`go env`)?

go env Output

$ go env
set GO111MODULE=auto
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\mattn\AppData\Local\go-build
set GOENV=C:\Users\mattn\AppData\Roaming\go\env
set GOEXE=.exe
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=C:\Users\mattn\go
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=C:\go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=C:\go\pkg\tool\windows_amd64
set GCCGO=gccgo
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=C:\Users\mattn\AppData\Local\Temp\go-build216348474=/tmp/go-build -gno-record-gcc-switches

What did you do?

package main

import (
	"fmt"
	"math"
	"math/rand"
	"time"
)

var a, b, c, d float64

func pure_fma_func() {
	d = math.FMA(a, b, c)
}

func non_fma_func() {
	d = a*b + c
}

func main() {
	const n = 1000000000

	a = rand.Float64()
	b = rand.Float64()
	c = rand.Float64()

	t1 := time.Now()
	for i := int64(0); i < n; i++ {
		non_fma_func()
	}
	t2 := time.Now()
	for i := int64(0); i < n; i++ {
		pure_fma_func()
	}
	t3 := time.Now()

	fmt.Println("non FMA", t2.Sub(t1))
	fmt.Println("    FMA", t3.Sub(t2))
}

Then I ran it with go run.

What did you expect to see?

math.FMA is faster than non-FMA code.

What did you see instead?

non FMA 548.0314ms
    FMA 924.0528ms

I confirmed that my CPU supports SIMD FMA. Is this the overhead of a function call?


martisch · 2019-12-18T08:21:58Z

Which CPU is used for the benchmark?

Please benchmark with GODEBUG=cpu.fma=off and see if this changes anything. If FMA instructions are used I would expect a change in the FMA benchmark numbers.

Please also write the benchmark as a go benchmark:
https://dave.cheney.net/2013/06/30/how-to-write-benchmarks-in-go

Then execute the test with -count=20 and store the results in a file. Use a quiet machine with e.g. no browser or videos running.

Afterwards use https://godoc.org/golang.org/x/tools/cmd/benchcmp
to produce an "average" over the runs.

This will give better information on how consistent the results are between runs.

mattn · 2019-12-18T08:55:25Z

Intel Core i5 4460

package fma_test

import (
	"math"
	"math/rand"
	"testing"
)

var va, vb, vc, vd float64

func pure_fma_func() {
	vd = math.FMA(va, vb, vc)
}

func non_fma_func() {
	vd = va*vb + vc
}

func BenchmarkFMA(b *testing.B) {
	va = rand.Float64()
	vb = rand.Float64()
	vc = rand.Float64()

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		pure_fma_func()
	}
}

func BenchmarkNonFMA(b *testing.B) {
	va = rand.Float64()
	vb = rand.Float64()
	vc = rand.Float64()

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		non_fma_func()
	}
}

set GODEBUG=cpu.fma=off
go test -bench . > old
set GODEBUG=
go test -bench . > new
benchstat old new

old

goos: windows
goarch: amd64
pkg: github.com/mattn/fma-example
BenchmarkFMA-4      	1000000000	         0.930 ns/op
BenchmarkNonFMA-4   	1000000000	         0.615 ns/op
PASS
ok  	github.com/mattn/fma-example	1.855s

new

goos: windows
goarch: amd64
pkg: github.com/mattn/fma-example
BenchmarkFMA-4      	1000000000	         0.912 ns/op
BenchmarkNonFMA-4   	1000000000	         0.618 ns/op
PASS
ok  	github.com/mattn/fma-example	1.816s

benchstat

name      old time/op  new time/op  delta
FMA-4     0.93ns ± 0%  0.91ns ± 0%   ~     (p=1.000 n=1+1)
NonFMA-4  0.61ns ± 0%  0.62ns ± 0%   ~     (p=1.000 n=1+1)

mattn · 2019-12-18T09:35:53Z

Added -count=20

old

goos: windows
goarch: amd64
pkg: github.com/mattn/fma-example
BenchmarkFMA-4      	1000000000	         0.937 ns/op
BenchmarkFMA-4      	1000000000	         0.926 ns/op
BenchmarkFMA-4      	1000000000	         0.933 ns/op
BenchmarkFMA-4      	1000000000	         0.932 ns/op
BenchmarkFMA-4      	1000000000	         0.935 ns/op
BenchmarkFMA-4      	1000000000	         0.934 ns/op
BenchmarkFMA-4      	1000000000	         0.960 ns/op
BenchmarkFMA-4      	1000000000	         0.956 ns/op
BenchmarkFMA-4      	1000000000	         0.958 ns/op
BenchmarkFMA-4      	1000000000	         0.930 ns/op
BenchmarkFMA-4      	1000000000	         0.944 ns/op
BenchmarkFMA-4      	1000000000	         0.942 ns/op
BenchmarkFMA-4      	1000000000	         0.943 ns/op
BenchmarkFMA-4      	1000000000	         0.940 ns/op
BenchmarkFMA-4      	1000000000	         0.942 ns/op
BenchmarkFMA-4      	1000000000	         0.943 ns/op
BenchmarkFMA-4      	1000000000	         0.935 ns/op
BenchmarkFMA-4      	1000000000	         0.929 ns/op
BenchmarkFMA-4      	1000000000	         0.930 ns/op
BenchmarkFMA-4      	1000000000	         0.927 ns/op
BenchmarkNonFMA-4   	1000000000	         0.617 ns/op
BenchmarkNonFMA-4   	1000000000	         0.617 ns/op
BenchmarkNonFMA-4   	1000000000	         0.619 ns/op
BenchmarkNonFMA-4   	1000000000	         0.621 ns/op
BenchmarkNonFMA-4   	1000000000	         0.618 ns/op
BenchmarkNonFMA-4   	1000000000	         0.617 ns/op
BenchmarkNonFMA-4   	1000000000	         0.618 ns/op
BenchmarkNonFMA-4   	1000000000	         0.622 ns/op
BenchmarkNonFMA-4   	1000000000	         0.622 ns/op
BenchmarkNonFMA-4   	1000000000	         0.618 ns/op
BenchmarkNonFMA-4   	1000000000	         0.640 ns/op
BenchmarkNonFMA-4   	1000000000	         0.625 ns/op
BenchmarkNonFMA-4   	1000000000	         0.620 ns/op
BenchmarkNonFMA-4   	1000000000	         0.624 ns/op
BenchmarkNonFMA-4   	1000000000	         0.620 ns/op
BenchmarkNonFMA-4   	1000000000	         0.625 ns/op
BenchmarkNonFMA-4   	1000000000	         0.621 ns/op
BenchmarkNonFMA-4   	1000000000	         0.616 ns/op
BenchmarkNonFMA-4   	1000000000	         0.647 ns/op
BenchmarkNonFMA-4   	1000000000	         0.647 ns/op
PASS
ok  	github.com/mattn/fma-example	34.560s

new

goos: windows
goarch: amd64
pkg: github.com/mattn/fma-example
BenchmarkFMA-4      	1000000000	         0.948 ns/op
BenchmarkFMA-4      	1000000000	         0.936 ns/op
BenchmarkFMA-4      	1000000000	         0.932 ns/op
BenchmarkFMA-4      	1000000000	         0.923 ns/op
BenchmarkFMA-4      	1000000000	         0.938 ns/op
BenchmarkFMA-4      	1000000000	         0.927 ns/op
BenchmarkFMA-4      	1000000000	         0.921 ns/op
BenchmarkFMA-4      	1000000000	         0.928 ns/op
BenchmarkFMA-4      	1000000000	         0.916 ns/op
BenchmarkFMA-4      	1000000000	         0.946 ns/op
BenchmarkFMA-4      	1000000000	         0.970 ns/op
BenchmarkFMA-4      	1000000000	         0.959 ns/op
BenchmarkFMA-4      	1000000000	         0.938 ns/op
BenchmarkFMA-4      	1000000000	         0.938 ns/op
BenchmarkFMA-4      	1000000000	         0.956 ns/op
BenchmarkFMA-4      	1000000000	         0.976 ns/op
BenchmarkFMA-4      	1000000000	         0.955 ns/op
BenchmarkFMA-4      	1000000000	         0.966 ns/op
BenchmarkFMA-4      	1000000000	         0.942 ns/op
BenchmarkFMA-4      	1000000000	         0.943 ns/op
BenchmarkNonFMA-4   	1000000000	         0.617 ns/op
BenchmarkNonFMA-4   	1000000000	         0.618 ns/op
BenchmarkNonFMA-4   	1000000000	         0.624 ns/op
BenchmarkNonFMA-4   	1000000000	         0.623 ns/op
BenchmarkNonFMA-4   	1000000000	         0.618 ns/op
BenchmarkNonFMA-4   	1000000000	         0.629 ns/op
BenchmarkNonFMA-4   	1000000000	         0.624 ns/op
BenchmarkNonFMA-4   	1000000000	         0.630 ns/op
BenchmarkNonFMA-4   	1000000000	         0.622 ns/op
BenchmarkNonFMA-4   	1000000000	         0.634 ns/op
BenchmarkNonFMA-4   	1000000000	         0.622 ns/op
BenchmarkNonFMA-4   	1000000000	         0.633 ns/op
BenchmarkNonFMA-4   	1000000000	         0.635 ns/op
BenchmarkNonFMA-4   	1000000000	         0.627 ns/op
BenchmarkNonFMA-4   	1000000000	         0.631 ns/op
BenchmarkNonFMA-4   	1000000000	         0.633 ns/op
BenchmarkNonFMA-4   	1000000000	         0.618 ns/op
BenchmarkNonFMA-4   	1000000000	         0.628 ns/op
BenchmarkNonFMA-4   	1000000000	         0.630 ns/op
BenchmarkNonFMA-4   	1000000000	         0.617 ns/op
PASS
ok  	github.com/mattn/fma-example	34.760s

benchstat

name      old time/op  new time/op  delta
FMA-4     0.94ns ± 2%  0.94ns ± 4%    ~     (p=0.465 n=20+20)
NonFMA-4  0.62ns ± 1%  0.63ns ± 1%  +0.91%  (p=0.005 n=17+20)

martisch · 2019-12-18T12:36:40Z

I missed that you were using Windows, on which Go does not support using GODEBUG to disable the CPU features that Go uses. I should consider printing a warning.

Disassembly for me shows:

main_test.go:12	0x4fabae		f20f110d3ab91600	MOVSD_XMM X1, _/usr/local/google/home/moehrmann/test_test.vd(SB)	
  main_test.go:25	0x4fabb6		48ffc1			INCQ CX									
  main_test.go:25	0x4fabb9		48398810010000		CMPQ CX, 0x110(AX)							
  main_test.go:25	0x4fabc0		7e66			JLE 0x4fac28								
  main_test.go:26	0x4fabc2		90			NOPL									
  main_test.go:12	0x4fabc3		f20f100515b91600	MOVSD_XMM _/usr/local/google/home/moehrmann/test_test.vb(SB), X0	
  main_test.go:12	0x4fabcb		f20f100d15b91600	MOVSD_XMM _/usr/local/google/home/moehrmann/test_test.vc(SB), X1	
  main_test.go:12	0x4fabd3		f20f1015fdb81600	MOVSD_XMM _/usr/local/google/home/moehrmann/test_test.va(SB), X2	
  main_test.go:12	0x4fabdb		803d95b8160000		CMPB $0x0, runtime.x86HasFMA(SB)					
  main_test.go:12	0x4fabe2		7407			JE 0x4fabeb								
  main_test.go:12	0x4fabe4		c4e2e9b9c8ebc348	MOVL $0x48c3ebc8, CX							
  main_test.go:25	0x4fabec		894c2420		MOVL CX, 0x20(SP)							
  main_test.go:12	0x4fabf0		f20f111424		MOVSD_XMM X2, 0(SP)							
  main_test.go:12	0x4fabf5		f20f1005e3b81600	MOVSD_XMM _/usr/local/google/home/moehrmann/test_test.vb(SB), X0	
  main_test.go:12	0x4fabfd		f20f11442408		MOVSD_XMM X0, 0x8(SP)							
  main_test.go:12	0x4fac03		f20f1005ddb81600	MOVSD_XMM _/usr/local/google/home/moehrmann/test_test.vc(SB), X0	
  main_test.go:12	0x4fac0b		f20f11442410		MOVSD_XMM X0, 0x10(SP)							
  main_test.go:12	0x4fac11		e82a28f8ff		CALL math.FMA(SB)							
  main_test.go:12	0x4fac16		f20f104c2418		MOVSD_XMM 0x18(SP), X1							
  main_test.go:25	0x4fac1c		488b442438		MOVQ 0x38(SP), AX							
  main_test.go:25	0x4fac21		488b4c2420		MOVQ 0x20(SP), CX							
  main_test.go:12	0x4fac26		eb86			JMP 0x4fabae								
  main_test.go:25	0x4fac28		488b6c2428		MOVQ 0x28(SP), BP							
  main_test.go:25	0x4fac2d		4883c430		ADDQ $0x30, SP								
  main_test.go:25	0x4fac31		c3			RET

This looks wrong since it's missing the VFMADD231SD instruction.

Using GOSSAFUNC=BenchmarkFMA go test -c however I see the VFMADD231SD instruction:

00035 (+12) MOVSD "".vb(SB), X0
00036 (12) MOVSD "".vc(SB), X1
00037 (12) MOVSD "".va(SB), X2
00038 (12) CMPB runtime.x86HasFMA(SB), $0
00039 (12) JEQ 42
00040 (12) VFMADD231SD X0, X2, X1
00041 (12) JMP 26
00042 (25) PCDATA $0, $0
00043 (25) MOVQ CX, "".i-8(SP)
00044 (12) MOVSD X2, (SP)
00045 (12) MOVSD "".vb(SB), X0
00046 (12) MOVSD X0, 8(SP)
00047 (12) MOVSD "".vc(SB), X0
00048 (12) MOVSD X0, 16(SP)
00049 (12) CALL math.FMA(SB)
00050 (12) MOVSD 24(SP), X1
00051 (25) PCDATA $0, $1
00052 (25) MOVQ "".b(SP), AX
00053 (25) MOVQ "".i-8(SP), CX
00054 (12) JMP 26
00055 (25) PCDATA $0, $-1
00056 (25) PCDATA $1, $-1
00057 (25) RET
00058 (?) END

Running with GODEBUG=cpu.fma=off go test -bench=. -count=20 -cpu=1
does make a huge difference on my Intel Xeon E5-1650 v3:

name    old time/op  new time/op   delta
FMA     0.96ns ± 6%  23.02ns ±14%  +2291.32%  (p=0.000 n=20+20)
NonFMA  0.57ns ± 5%   0.55ns ± 4%       ~     (p=0.064 n=20+18)

but in both cases FMA is slower than the non-FMA version. Note that the FMA version has more precision, as it does not round the intermediate product to 64 bits between the steps.

martisch · 2019-12-18T12:55:08Z

VFMADD231SD has a 5-cycle latency on Haswell, and two can be executed in parallel. The same holds for MULSD. The added check and jump, as well as other factors, can indeed make the FMA version slower. The cost seems to be 1 or 2 cycles here. This might be working as intended (WAI), given the slight overhead of runtime dispatch within the loop.

martisch · 2019-12-18T13:19:33Z

To verify that the slowdown comes from the runtime dispatch that determines whether the CPU supports FMA, I changed the compiler in cmd/compile/internal/gc/ssa.go to not add any checks.

name    time/op
FMA     0.58ns ± 2%
NonFMA  0.56ns ± 4%

As noted, even if equally fast, FMA has the advantage of not rounding the intermediate step.

As long as the built Go binary needs to support both FMA-capable and non-FMA-capable CPUs, there will be some overhead. Ideally the check could be moved outside the loop, but we do not have that currently. For the latter, I thought we already had a general bug to move the checks.

smasher164 · 2019-12-19T08:03:47Z

> For the latter, I thought we already had a general bug to move the checks.

The closest bug I could find related to this is #34950, which is intended to mark ops requiring feature detection such that the check isn't optimized away. However, there likely needs to be some issue to track the hoisting optimization for loop invariants. This will likely become even more important if vector intrinsics land in the stdlib someday.

gopherbot · 2019-12-21T22:27:58Z

Change https://golang.org/cl/212360 mentions this issue: cmd/compile: add intrinsic HasCPUFeature for checking cpu features

smasher164 · 2019-12-23T02:54:35Z

Running @josharian's CL (212360) provides a ~16% speedup on my Windows laptop, which had the same overhead mentioned above.

old.txt

BenchmarkFMA            733668986                1.54 ns/op
BenchmarkFMA            791581790                1.46 ns/op
BenchmarkFMA            796827562                1.48 ns/op
BenchmarkFMA            903921406                1.49 ns/op
BenchmarkFMA            679788086                1.55 ns/op
BenchmarkFMA            796837087                1.58 ns/op
BenchmarkFMA            786417004                1.50 ns/op
BenchmarkFMA            871895959                1.51 ns/op
BenchmarkFMA            853358212                1.42 ns/op
BenchmarkFMA            687550634                1.48 ns/op
BenchmarkFMA            884715651                1.46 ns/op
BenchmarkFMA            699548965                1.50 ns/op
BenchmarkFMA            776279486                1.54 ns/op
BenchmarkFMA            841439197                1.50 ns/op
BenchmarkFMA            742726387                1.49 ns/op
BenchmarkFMA            938591630                1.44 ns/op
BenchmarkFMA            954924385                1.49 ns/op
BenchmarkFMA            771302407                1.39 ns/op
BenchmarkFMA            932724160                1.54 ns/op
BenchmarkFMA            878254848                1.45 ns/op
BenchmarkNonFMA         1000000000               1.10 ns/op
BenchmarkNonFMA         1000000000               1.12 ns/op
BenchmarkNonFMA         1000000000               1.04 ns/op
BenchmarkNonFMA         1000000000               1.15 ns/op
BenchmarkNonFMA         1000000000               1.13 ns/op
BenchmarkNonFMA         1000000000               1.11 ns/op
BenchmarkNonFMA         1000000000               1.07 ns/op
BenchmarkNonFMA         1000000000               1.17 ns/op
BenchmarkNonFMA         1000000000               1.06 ns/op
BenchmarkNonFMA         1000000000               1.05 ns/op
BenchmarkNonFMA         1000000000               1.08 ns/op
BenchmarkNonFMA         1000000000               1.05 ns/op
BenchmarkNonFMA         1000000000               1.08 ns/op
BenchmarkNonFMA         1000000000               1.10 ns/op
BenchmarkNonFMA         1000000000               1.06 ns/op
BenchmarkNonFMA         1000000000               1.01 ns/op
BenchmarkNonFMA         1000000000               1.13 ns/op
BenchmarkNonFMA         1000000000               1.08 ns/op
BenchmarkNonFMA         1000000000               1.08 ns/op
BenchmarkNonFMA         1000000000               1.05 ns/op

new.txt

BenchmarkFMA            911520248                1.20 ns/op
BenchmarkFMA            962585889                1.23 ns/op
BenchmarkFMA            940021165                1.23 ns/op
BenchmarkFMA            1000000000               1.35 ns/op
BenchmarkFMA            1000000000               1.26 ns/op
BenchmarkFMA            954936543                1.20 ns/op
BenchmarkFMA            947425016                1.17 ns/op
BenchmarkFMA            911521633                1.33 ns/op
BenchmarkFMA            977869788                1.27 ns/op
BenchmarkFMA            947416040                1.27 ns/op
BenchmarkFMA            1000000000               1.20 ns/op
BenchmarkFMA            897926014                1.21 ns/op
BenchmarkFMA            940010120                1.24 ns/op
BenchmarkFMA            1000000000               1.26 ns/op
BenchmarkFMA            796828092                1.28 ns/op
BenchmarkFMA            807528860                1.25 ns/op
BenchmarkFMA            1000000000               1.24 ns/op
BenchmarkFMA            956333464                1.27 ns/op
BenchmarkFMA            978562149                1.27 ns/op
BenchmarkFMA            911528557                1.23 ns/op
BenchmarkNonFMA         1000000000               1.12 ns/op
BenchmarkNonFMA         1000000000               1.06 ns/op
BenchmarkNonFMA         1000000000               1.20 ns/op
BenchmarkNonFMA         947412301                1.06 ns/op
BenchmarkNonFMA         1000000000               1.11 ns/op
BenchmarkNonFMA         918488014                1.11 ns/op
BenchmarkNonFMA         1000000000               1.08 ns/op
BenchmarkNonFMA         1000000000               1.05 ns/op
BenchmarkNonFMA         1000000000               1.09 ns/op
BenchmarkNonFMA         986236250                1.07 ns/op
BenchmarkNonFMA         1000000000               1.10 ns/op
BenchmarkNonFMA         994421295                1.06 ns/op
BenchmarkNonFMA         1000000000               1.06 ns/op
BenchmarkNonFMA         1000000000               1.07 ns/op
BenchmarkNonFMA         1000000000               1.13 ns/op
BenchmarkNonFMA         1000000000               1.07 ns/op
BenchmarkNonFMA         1000000000               1.12 ns/op
BenchmarkNonFMA         1000000000               1.14 ns/op
BenchmarkNonFMA         1000000000               1.09 ns/op
BenchmarkNonFMA         1000000000               1.10 ns/op

benchstat old.txt new.txt

name    old time/op  new time/op  delta
FMA     1.49ns ± 7%  1.24ns ± 7%  -16.63%  (p=0.000 n=20+19)
NonFMA  1.09ns ± 8%  1.09ns ± 5%     ~     (p=0.680 n=20+19)

davecheney · 2019-12-27T05:18:29Z

Are you sure the compiler isn’t optimising your code away?

smasher164 · 2019-12-27T06:13:24Z

@davecheney Just checked and it is not. Disassembling the binary with go tool objdump shows this section inside BenchmarkFMA:

  fma_test.go:12	0x50cb7c		84c0			TESTL AL, AL							
  fma_test.go:12	0x50cb7e		7407			JE 0x50cb87							
  fma_test.go:12	0x50cb80		c4e2e9b9c8ebc848	MOVL $0x48c8ebc8, CX						
  fma_test.go:25	0x50cb88		89542428		MOVL DX, 0x28(SP)

although objdump doesn't know about the VFMA* instructions yet, so that roughly translates to (in Intel syntax)

84 c0             test        al,al
74 07             je          0x50cb87							
c4 e2 e9 b9 c8    vfmadd231sd xmm1,xmm2,xmm0
eb c8             jmp         0x50cb4f

al stores the value in runtime.x86HasFMA.

davecheney · 2019-12-27T08:23:25Z

Have a look at what the benchmark loop is compiling to.

davecheney · 2019-12-27T08:24:18Z

Ahh, ignore me. I missed that va, vb, etc. were package-level declarations.

cagedmantis · 2019-12-30T15:49:34Z

/cc @griesemer @rsc

martisch · 2020-01-01T08:09:01Z

As investigated above, the slowdown is caused by dynamically checking on every iteration that the FMA CPU capability is present.

math.FMA is still useful even if slower, as it has more precision than doing the computation with explicit temporary results.

What could be improved when FMA operations are executed in a loop is hoisting the CPU feature checking and/or load out of the loop (up to even creating two loops) if the loop body is small. I would suggest we create a new generic CPU feature detection issue for that and close this issue.

smasher164 · 2020-01-01T18:20:07Z

Created the general CPU feature detection issue. Closing this one.

Before using some CPU instructions, we must check for their presence. We use global variables in the runtime package to record features.

Prior to this CL, we issued a regular memory load for these features. The downside to this is that, because it is a regular memory load, it cannot be hoisted out of loops or otherwise reordered with other loads.

This CL introduces a new intrinsic just for checking cpu features. It still ends up resulting in a memory load, but that memory load can now be floated to the entry block and rematerialized as needed.

One downside is that the regular load could be combined with the comparison into a CMPBconstload+NE. This new intrinsic cannot; it generates MOVB+TESTB+NE. (It is possible that MOVBQZX+TESTQ+NE would be better.)

This CL does only amd64. It is easy to extend to other architectures.

For the benchmark in #36196, on my machine, this offers a mild speedup.

name      old time/op  new time/op  delta
FMA-8     1.39ns ± 6%  1.29ns ± 9%  -7.19%  (p=0.000 n=97+96)
NonFMA-8  2.03ns ±11%  2.04ns ±12%    ~     (p=0.618 n=99+98)

Updates #15808
Updates #36196

Change-Id: I75e2fcfcf5a6df1bdb80657a7143bed69fca6deb
Reviewed-on: https://go-review.googlesource.com/c/go/+/212360
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Giovanni Bajo <rasky@develer.com>

gopherbot · 2020-04-05T03:48:03Z

Change https://golang.org/cl/227238 mentions this issue: cmd/compile: use MOVBQZX for OpAMD64LoweredHasCPUFeature

In the commit message of CL 212360, I wrote:

> This new intrinsic ... generates MOVB+TESTB+NE.
> (It is possible that MOVBQZX+TESTQ+NE would be better.)

I should have tested. MOVBQZX+TESTQ+NE does in fact appear to be better.

For the benchmark in #36196, on my machine:

name      old time/op  new time/op  delta
FMA-8     0.86ns ± 6%  0.70ns ± 5%  -18.79%  (p=0.000 n=98+97)
NonFMA-8  0.61ns ± 5%  0.60ns ± 4%   -0.74%  (p=0.001 n=100+97)

Interestingly, these are both considerably faster than the measurements I took a couple of months ago (1.4ns/2ns). It appears that CL 219131 (clearing VZEROUPPER in asyncPreempt) helped a lot. And FMA is now once again slower than NonFMA, although this change helps it regain some ground.

Updates #15808
Updates #36351
Updates #36196

Change-Id: I8a326289a963b1939aaa7eaa2fab2ec536467c7d
Reviewed-on: https://go-review.googlesource.com/c/go/+/227238
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>

martisch added the WaitingForInfo label Dec 18, 2019

smasher164 mentioned this issue Dec 21, 2019

cmd/compile: loads/constants not lifted out of loop #15808

Open

smasher164 removed the WaitingForInfo label Dec 25, 2019

cagedmantis added this to the Unplanned milestone Dec 30, 2019

cagedmantis added the NeedsInvestigation label Dec 30, 2019

cagedmantis modified the milestones: Unplanned, Backlog Dec 30, 2019

martisch removed the NeedsInvestigation label Jan 1, 2020

martisch modified the milestones: Backlog, Unplanned Jan 1, 2020

smasher164 mentioned this issue Jan 1, 2020

cmd/compile: optimize overhead from CPU feature detection #36351

Open

smasher164 closed this as completed Jan 1, 2020

golang locked and limited conversation to collaborators Apr 5, 2021

gopherbot added the FrozenDueToAge label Apr 5, 2021
