spec: allow the use of fused multiply-add floating point instructions #17895
Note: the important part of this outcome is the precedent, or more general rule, it establishes for when compiler optimizations may alter float32/float64 arithmetic. |
What is a "local variable", what is a "memory location", and what does it mean for an intermediate result to be "visible to the program"? These may seem like simple questions, but in practice the distinction is blurry and depends on choices made by the compiler. Examples: var global float64
var globalPtr *float64
// z is changed after it escapes. Is rounding forced?
// Note that z is a "local variable", but also a "memory location", since it escapes.
func case1(x, y float64) {
var z float64
globalPtr = &z
z = x + y
}
// z is changed before it escapes. Is rounding forced before or after Println?
func case2(x, y float64) {
var z float64
z = x + y
fmt.Println(z)
globalPtr = &z
}
// The first store into `global` may not be visible to other goroutines
// since there is no memory barrier between the two stores. The compiler
// may merge these stores into one. Is rounding forced on (x + y), or
// just (x + y + 2)?
func case3(x, y float64) {
global = x + y
global += 2
}
// A smart compiler might realize that z doesn't escape, even though its
// address is taken implicitly by the closure. Is rounding forced?
func case4(x, y float64) float64 {
var z float64
var wg sync.WaitGroup
wg.Add(1)
go func() {
defer wg.Done()
z = x + y
}()
wg.Wait()
return z
} I think I agree with rsc's suggestion from the second linked thread: "I would lean toward allowing FMA aggressively except for an explicit conversion." |
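A minimal sketch of that rule, with assumed names: only the explicit conversion materializes a rounded intermediate result.

func fmaRule(x, y, z float64) (a, b, c float64) {
    a = x*y + z          // may compile to a single fused multiply-add
    b = (x*y) + z        // parentheses alone do not force rounding
    c = float64(x*y) + z // explicit conversion forces rounding: no FMA
    return a, b, c
}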
Thanks. You're right of course, I was being imprecise. These properties aren't necessarily clear in code. They are somewhat clearer in the backend, but that might change as the compiler gets cleverer.
I think I also agree that would be a good rule, but I'm starting to wonder if a strict interpretation of the spec already implies that assignment should force rounding. The spec defines float32 and float64 as:

float32    the set of all IEEE-754 32-bit floating-point numbers
float64    the set of all IEEE-754 64-bit floating-point numbers

Since the intermediate result of a fused multiply-add may not be a valid IEEE-754 32- or 64-bit floating-point number, this definition would seem to suggest that an assignment should prevent the optimization. However, gri's and rsc's comments on the thread would seem to imply that they don't agree (and they obviously know much better than me). Assignments to variables currently block full precision constant propagation AFAICT, so forcing rounding at assignments would be in line with that behavior. Could the output of the following program change in the future?

package main
import "fmt"
const x = 1.0000000000000001
func main() {
x0 := x*10
fmt.Println(x0) // prints 10.000000000000002
x1 := x
x1 *= 10
fmt.Println(x1) // prints 10, but could be 10.000000000000002
} |
It sounds like we agree that a float64 conversion should be an explicit signal that a rounded float64 should be materialized, so that for example float64(x*y)+z cannot use an FMA, but x*y+z and (x*y)+z can. The question raised in @mundaym's latest comment is whether we're sure about case (7) above.

We don't know too much about what other languages do here. We know Fortran uses parentheses, although that only helps for FMA because * has higher precedence than +, so the parens are optional. We'd rather not overload parens this way. We don't think the C or C++ languages give easy control over this (possibly writing to a volatile and reading it back?). What about Java? How do they provide access to FMA?

/cc @MichaelTJones for thoughts or wisdom about any of this. |
Looks like Java is making it explicit with library additions: https://www.mail-archive.com/core-libs-dev@openjdk.java.net/msg39320.html |
When using GCC the -ffloat-store option can be used with C/C++ to force rounding when a floating-point value is assigned to a variable. |
My sense (anecdotal…but lots of anecdata) is that there are three user postures here, one of which touches on the question at hand:
Oblivious--Fused MulAdd Everywhere
Normal case: aggressive QMAF (IBM America/ etc.) is good for most everything…faster with greater precision so pervasive with the greatest reach possible is fine.
Cautious--No Fused MulAdd if I disable it (sorry, a new compiler option)
Users who want their results to be unchanged from one machine to another: examples in a book, data in a database, etc. This may be important to Go users or not. I don’t know.
Explicit--An absolute way to force not doing the fused MultiplyAdd in a specific, carefully hand-coded expression. This is the user case I was speaking for previously. It is a small group, but it includes people like Professor Kahan, and all the math library authors who need to track the difference between the two computation modes when evaluating basic functions. Both IBM and Intel/HP figured out how to get +/- 1 ULP evaluation of intrinsic functions (sin, cos, log, exp, …) using QMAF-style math, but it needs to be possible to keep track of error between the one-rounding and the two-rounding cases. The proposal for float64(x*y)+z to mean two ops, each rounded, preventing the fused x*y+z, is just fine for this purpose. Everywhere else is just like the oblivious case.
Michael
|
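For concreteness, the error-tracking use case described above relies on identities like the following. This is a sketch, assuming an FMA primitive with the shape that package math later adopted (math.FMA); it is not part of the proposal itself.

// twoProd splits a product into its rounded value and its exact
// rounding error (barring overflow/underflow): p + err == x*y exactly.
func twoProd(x, y float64) (p, err float64) {
    p = x * y                // one rounding
    err = math.FMA(x, y, -p) // x*y - p with a single rounding, hence exact
    return p, err
}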
@MichaelTJones There's also the opposite explicit form: Have a mechanism (intrinsified function call) to specify when to use FMA (and never, otherwise). Go already allows control over when not to use FMA (if available) by forcing an explicit conversion; e.g., float64(x*y) + z. |
@mundaym, I think everyone agrees about cases 1, 2, 3, 4, 5, 6. We are less sure about 7 and 8, which may be the same case for a sufficiently clever compiler. I propose that we tentatively assume "yes" on 7 and 8, with the understanding that we can back down from that if a strong real-world example arrives showing that we've made a mistake. |
Thanks, that sounds good to me. I'll prototype it for ppc64{,le} and s390x. |
@rsc Can you clarify what "yes" and "no" mean in your comment? |
"Yes" means check-mark above (FMA optimization allowed here), and "no" means X above (FMA optimization not allowed here). |
On hold until prototype arrives. We still need to figure out wording for the spec. |
CL https://golang.org/cl/36963 mentions this issue. |
Complex multiplication may be implemented using fused multiply-add instructions. There is no obvious way to prevent them from being emitted since multiplication is a single operation. I think this is fine, but I thought I'd note it here. |
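To illustrate the point (a minimal sketch): the multiplication below is a single language-level operation, so there is no sub-expression to which a conversion could be applied.

func mulComplex(a, b complex128) complex128 {
    // The real and imaginary parts of the result may each be computed
    // with fused multiply-add/-subtract instructions internally.
    return a * b
}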
Thanks. I agree that's probably fine, certainly until it comes up in practice. |
Explicitly block fused multiply-add pattern matching when a cast is used after the multiplication, for example:

- (a * b) + c        // can emit fused multiply-add
- float64(a * b) + c // cannot emit fused multiply-add

float{32,64} and complex{64,128} casts of matching types are now kept as OCONV operations rather than being replaced with OCONVNOP operations because they now imply a rounding operation (and therefore aren't a no-op anymore).

Operations (for example, multiplication) on complex types may utilize fused multiply-add and -subtract instructions internally. There is no way to disable this behavior at the moment.

Improves the performance of the floating point implementation of poly1305:

name         old speed     new speed     delta
64           246MB/s ± 0%  275MB/s ± 0%  +11.48%  (p=0.000 n=10+8)
1K           312MB/s ± 0%  357MB/s ± 0%  +14.41%  (p=0.000 n=10+10)
64Unaligned  246MB/s ± 0%  274MB/s ± 0%  +11.43%  (p=0.000 n=10+10)
1KUnaligned  312MB/s ± 0%  357MB/s ± 0%  +14.39%  (p=0.000 n=10+8)

Updates #17895.

Change-Id: Ia771d275bb9150d1a598f8cc773444663de5ce16
Reviewed-on: https://go-review.googlesource.com/36963
Run-TryBot: Michael Munday <munday@ca.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
CL https://golang.org/cl/38095 mentions this issue. |
Improvements on ppc64le for CL 38095 (math benchmarks are included in the CL description below). |
A follow-on to CL 36963 adding support for ppc64x.

Performance changes (as posted on the issue):

poly1305:
benchmark                old ns/op  new ns/op  delta
Benchmark64-16           172        151        -12.21%
Benchmark1K-16           1828       1523       -16.68%
Benchmark64Unaligned-16  172        151        -12.21%
Benchmark1KUnaligned-16  1827       1523       -16.64%

math:
BenchmarkAcos-16    43.9  39.9  -9.11%
BenchmarkAcosh-16   57.0  45.8  -19.65%
BenchmarkAsin-16    35.8  33.0  -7.82%
BenchmarkAsinh-16   68.6  60.8  -11.37%
BenchmarkAtan-16    19.8  16.2  -18.18%
BenchmarkAtanh-16   65.5  57.5  -12.21%
BenchmarkAtan2-16   45.4  34.2  -24.67%
BenchmarkGamma-16   37.6  26.0  -30.85%
BenchmarkLgamma-16  40.0  28.2  -29.50%
BenchmarkLog1p-16   35.1  29.1  -17.09%
BenchmarkSin-16     22.7  18.4  -18.94%
BenchmarkSincos-16  31.7  23.7  -25.24%
BenchmarkSinh-16    146   131   -10.27%
BenchmarkY0-16      130   107   -17.69%
BenchmarkY1-16      127   107   -15.75%
BenchmarkYn-16      278   235   -15.47%

Updates #17895.

Change-Id: I1c16199715d20c9c4bd97c4a950bcfa69eb688c1
Reviewed-on: https://go-review.googlesource.com/38095
Reviewed-by: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Thanks for the implementation. We experimented with the ppc64 compiler and confirmed that this behaves the way we expected from #17895 (comment). @griesemer will send a spec CL. |
CL https://golang.org/cl/40391 mentions this issue. |
I looked through the two linked discussions, and I didn't spot any discussion about supporting FMA via the standard library. For example, we could add an FMA function to package math, and optimize it via compiler intrinsics like Sqrt. Considering this is the approach C, C++, and Java have already taken, I think we should at least discuss it. |
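For reference, package math did eventually gain such a function (math.FMA, which computes x*y + z with a single rounding); a minimal usage sketch:

x, y, z := 2.0, 3.0, -5.0
r := math.FMA(x, y, z) // one rounding; can lower to a single hardware instruction
fmt.Println(r)         // 1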
I'm not a fan of the current language spec proposal because:
|
From what I understand, both of your points are already true in the language. From the spec:
This means a statement like |
@btracey Hm, I think you're right. I withdraw my objections. |
One plausible tweak might be to demand that the compiler use FMA when hardware is available and the spec allows it.
|
Although it's true that C/C++ provides a |
Added a paragraph and examples explaining when an implementation may use fused floating-point operations (such as FMA) and how to prevent operation fusion.

For #17895.

Change-Id: I64c9559fc1097e597525caca420cfa7032d67014
Reviewed-on: https://go-review.googlesource.com/40391
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Rob Pike <r@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
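The examples in the added spec paragraph take roughly the following form (paraphrased from the spec, not a verbatim quote; all variables assumed to be float64):

// FMA allowed for computing r, because x*y is not explicitly rounded:
r = x*y + z
r = z; r += x*y
t = x*y; r = t + z
*p = x*y; r = *p + z
r = x*y + float64(z)

// FMA disallowed for computing r, because it would omit rounding of x*y:
r = float64(x*y) + z
r = z; r += float64(x*y)
t = float64(x*y); r = t + z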
Would there be a way to "force" the use of FMA, instead of just "allow"? I mean, would there be a way to make sure that |
@TuomLarsen No, we're not providing a way to do that here. Do other languages do that? |
@rsc I apologise, I now realise it might be out of scope for this proposal: what I meant was basically support for a simple math.FMA function. I imagined that math.FMA would be kind of symmetrical to emitting an FMA instruction (mandatory vs. optional). In any case, are there plans for such a function? See e.g. http://en.cppreference.com/w/c/numeric/math/fma |
@TuomLarsen please file a new issue to propose/discuss. Thanks! |
With ppc64 doing this and the spec written, I think this is done. |
Fixes #20795
Updates #17895
Updates #20587

Change-Id: Iea375f3a6ffe3f51e3ffdae1fb3fd628b6b3316c
Reviewed-on: https://go-review.googlesource.com/46717
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Hi everyone, I was pretty excited to try out the perf benefits of the FMA operation in 1.9. However, it seems that I am still getting the same performance. According to @laboger's comment there seem to be improvements of math functions on ppc64. I was under the impression that I would be able to reap similar benefits on amd64 too? However, I still see the same performance. Here is the code:

package stdtest

import (
    "math"
    "testing"
)

func BenchmarkAtan2(b *testing.B) {
    for n := 0; n < b.N; n++ {
        _ = math.Atan2(480.0, 123.0) * 180 / math.Pi
    }
}

Under 1.8.1
Under 1.9.rc2
I have verified that my processor supports FMA (cat /proc/cpuinfo | grep fma). Is this expected? Or am I doing something wrong? |
@agnivade, if you run |
Thanks @bradfitz! Yes, I was expecting something like that. The tone of this announcement made it seem like it's there for all architectures. However, the commits seemed to show only s390x and ppc64. Hence I was a little confused. Is adding |
Thanks a lot ! Appreciate the help :) |
Fused multiply-add (FMA) floating point instructions typically provide improved accuracy and performance when compared to independent floating point multiply and add instructions. However, they may change the result of such an operation because they omit the rounding that would normally take place between a multiply instruction and an add instruction.
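The difference a single rounding makes is easy to demonstrate. Below is a minimal sketch using math.FMA (which postdates this proposal) to stand in for a fused operation:

package main

import (
    "fmt"
    "math"
)

func main() {
    const eps = 1.0 / (1 << 30) // 2^-30
    x, y, z := 1+eps, 1-eps, -1.0

    // Two roundings: the exact product 1 - 2^-60 rounds to 1.0,
    // so the 2^-60 term is lost before the addition.
    fmt.Println(float64(x*y) + z) // 0

    // One rounding: the fused operation keeps the exact product
    // internally, so the tiny residual survives.
    fmt.Println(math.FMA(x, y, z)) // -8.673617379884035e-19 (-2^-60)
}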
This proposal seeks to clarify the guarantees that Go provides as to when rounding to float32 or float64 must be performed so that FMA operations can be safely extracted by SSA rules. I assume that complex{64,128} casts will be lowered to float{32,64} casts for the purposes of this proposal.
The consensus from previous discussions on the subject is that explicit casts should force rounding, as is already specified for constants:
There is also consensus that parentheses should not force rounding. So in the following cases the intermediate rounding stage can be omitted and an FMA used:
It is also proposed that assignments to local variables should not force rounding to take place:
I also propose that an assignment to a memory location should force rounding (I lean towards forcing rounding whenever an intermediate result is visible to the program):
(SSA rules could optimize example 8 because they will replace the load from a with a reuse of the result of x * y.)
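A sketch of the distinction being drawn, with assumed shapes for examples 7 and 8 (reconstructed from the surrounding text, not the original numbered examples):

var a float64 // package-level: a "memory location"

func example7(x, y, z float64) float64 {
    t := x * y   // assignment to a local variable: rounding not forced,
    return t + z // so this could still become a single FMA
}

func example8(x, y, z float64) float64 {
    a = x * y    // store visible to the program: the proposal leans toward forcing rounding,
    return a + z // though SSA could replace the load of a with the x*y result
}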
I think the only real complexity in the implementation is how we plumb the casts from the compiler to the SSA backend so that optimization rules can be blocked as appropriate. I’m not sure if there is a pre-existing mechanism we can use.
See these links for previous discussion of this proposal on golang-dev:
https://groups.google.com/d/topic/golang-dev/qvOqcmAkKnA/discussion
https://groups.google.com/d/topic/golang-dev/cVnE1K08Aks/discussion