New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/compile: optimize f32Slice[i] += 1 across math.Float32bits, Float32frombits #17220
Comments
The duplicate bounds check is because it is reloading the length, and the length may have changed because you're reading from a global. It's reloading because the intermediate store/load of the float value is hiding the fact that the loads of the length can be combined. The extraneous write/read problem is #13095. I'd rather fix that issue than intrinsify, if we can. |
@nigeltao Could you take the address instead?
Presumably this gives you just one index check. |
@griesemer: yes, that gives me just one index check. |
Feel free to punt to Unplanned if you think this isn't worth doing any time soon. |
Was this fixed by CL 58732? |
This appears at least partially fixed by CL 58732. There are still some issues :(
On the plus side, |
Fixing this is tricky. There's a phase ordering problem. |
Out of curiosity why isn't that ok? From a performance perspective I would have thought that issuing two loads might be better than a load followed by a move (easier to schedule). Are there other limitations? |
We have to be careful that we don't break code that does something like:
We want to guarantee that y is always zero, even in the presence of races. I think the language spec doesn't require this, but as a practical matter we have to. I allowed this by accident once, and I broke sync.Mutex.Lock. It does essentially
If we allowed the m.state load to happen twice instead of once, then the two values of |
I'm not sure how often this arises in practice, but it did come up in https://go-review.googlesource.com/#/c/29691/ in a vector rasterizer
Roughly speaking, in func floatingAccumulateMask, I'm accumulating the elements of a []float32 and storing the (scaled) cumulative sums in an []uint32.
I don't actually need both of the individual and cumulative values at the same time. If the two slices were both []float32 or both []uint32, then I could halve the amount of memory that I need (and possibly have better cache access patterns) by writing the output elements in-place over the input elements.
I can actually still do this, even without using unsafe, with just one slice (of type []uint32), and sprinking some math.Float32bits and math.Float32frombits throughout my float32 code.
This works, in that it produces the correct output, but there is a performance penalty. In a simpler repro case, suppose that I had these global variables:
and these three lines of code:
The GOARCH=amd64 codegen for each of the three lines are:
The codegen for the first two lines are efficient. The codegen for the third line could be better in two ways. First, there's an unnecessary bounce via the stack from memory to XMM0:
and likewise on the way back. Second, there are two bounds checks instead of one, but the second is redundant.
I am not a compiler person, but it looks to me like the Float32{,from}bits calls are treated as black box functions and not like a (no-op) uint32 to int32 conversion until too late in the codegen for e.g. the relevant bounds check elimination.
I'll let y'all decide if this is a dupe of issue #17069.
The text was updated successfully, but these errors were encountered: