That is, reverse the order of the OR operands and a single load will be generated. So the compiler does recognize that pattern, but the mechanism is not very robust: you need to write the OR operands in ascending order.
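A minimal sketch of the ascending-order form, lowest byte first. The name load32Ascending and the slice-plus-offset signature are assumptions for illustration. Whether a given compiler version really merges this into one load is best confirmed by inspecting the generated assembly, for example with go build -gcflags=-S.

```go
package main

import "fmt"

// load32Ascending is a hypothetical helper. It ORs the bytes in
// ascending order (lowest byte first), the ordering described above.
func load32Ascending(b []byte, i int) uint32 {
	// Reslice once so the four index expressions below are provably in range.
	b = b[i : i+4]
	return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
}

func main() {
	buf := []byte{0x78, 0x56, 0x34, 0x12}
	fmt.Printf("%#x\n", load32Ascending(buf, 0)) // prints 0x12345678
}
```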
ALTree changed the title from "cmd/compile: Generate wider loads when feasible" to "cmd/compile: generate wider loads when feasible" on Feb 18, 2018
What version of Go are you using (go version)?
go version go1.10 windows/amd64
Does this issue reproduce with the latest release?
Yes. Tip not tested.
What did you do?
When working with file compression, a typical time-critical part is reading multiple bytes from a bit stream.
A fairly typical function looks like this:
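A minimal sketch of such a function, consistent with the description that follows. The name load32 and the slice-plus-offset signature are assumptions, and this is not necessarily the exact code that was benchmarked.

```go
package main

import "fmt"

// load32 is a hypothetical name. It assembles a little-endian uint32
// from the four bytes starting at b[i].
func load32(b []byte, i int) uint32 {
	// Reslicing once is intended to leave a single bounds check
	// for the four byte accesses below.
	b = b[i : i+4]
	v0 := uint32(b[0])
	v1 := uint32(b[1])
	v2 := uint32(b[2])
	v3 := uint32(b[3])
	// The operands are OR'ed from the highest byte down to the lowest.
	return (v3 << 24) | (v2 << 16) | (v1 << 8) | v0
}

func main() {
	buf := []byte{0x78, 0x56, 0x34, 0x12, 0xff}
	fmt.Printf("%#x\n", load32(buf, 0)) // prints 0x12345678
}
```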
This has a single bounds check, but has 4 loads to get each byte, before shifting and or'ing them into place. On my test setup this operates at 1.49 ns/op.
It seems like there is some SSA logic in place that can combine two adjacent byte loads (with shifts) into a single word load. This can be seen by changing the last line to ((v3 << 24) | (v2 << 16)) | ((v1 << 8) | v0). On my machine, this improves the time to 1.24 ns/op.
It is possible to also load the high word using a single word load, bringing this to 0.85 ns/op, which is the best I can get with the current compiler. The assembler shows two 16 bit loads.
However, it should be possible for the compiler to automatically reduce the original test case to a single 32 bit load on little endian systems.
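The regrouping of the last line only changes how the OR expression is parenthesized; the value computed is the same. A sketch of the two forms side by side, with hypothetical function names:

```go
package main

import "fmt"

// load32Flat ORs the four shifted bytes left to right, high byte first.
// The name is hypothetical.
func load32Flat(b []byte, i int) uint32 {
	b = b[i : i+4]
	v0, v1, v2, v3 := uint32(b[0]), uint32(b[1]), uint32(b[2]), uint32(b[3])
	return (v3 << 24) | (v2 << 16) | (v1 << 8) | v0
}

// load32Paired groups the two high bytes and the two low bytes, so each
// pair of adjacent bytes forms its own sub-expression. The name is hypothetical.
func load32Paired(b []byte, i int) uint32 {
	b = b[i : i+4]
	v0, v1, v2, v3 := uint32(b[0]), uint32(b[1]), uint32(b[2]), uint32(b[3])
	return ((v3 << 24) | (v2 << 16)) | ((v1 << 8) | v0)
}

func main() {
	buf := []byte{0x78, 0x56, 0x34, 0x12}
	// Both forms compute the same value; only the generated loads may differ.
	fmt.Println(load32Flat(buf, 0) == load32Paired(buf, 0)) // prints true
}
```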
You can download the example file and benchmark file here.
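For readers without the linked files, a rough sketch of how such a measurement could be set up with testing.Benchmark from the standard library. The function name load32, the buffer size, and the offset mask are assumptions, and the numbers it prints will not match the figures quoted above.

```go
package main

import (
	"fmt"
	"testing"
)

// sink receives every result so the compiler cannot discard the loads.
var sink uint32

// load32 is a hypothetical byte-wise little-endian load.
func load32(b []byte, i int) uint32 {
	b = b[i : i+4]
	v0, v1, v2, v3 := uint32(b[0]), uint32(b[1]), uint32(b[2]), uint32(b[3])
	return (v3 << 24) | (v2 << 16) | (v1 << 8) | v0
}

func main() {
	buf := make([]byte, 1024)
	for i := range buf {
		buf[i] = byte(i)
	}
	res := testing.Benchmark(func(b *testing.B) {
		for n := 0; n < b.N; n++ {
			// The mask keeps the offset in 0..511, well inside buf.
			sink = load32(buf, n&511)
		}
	})
	fmt.Println(res) // iteration count and ns/op for this machine
}
```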
What did you expect to see?
The compiler combining this to a single load on little endian platforms.
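The value a single 32 bit little-endian load would produce is the same one binary.LittleEndian.Uint32 from encoding/binary returns. A sketch checking the byte-wise form against it, with load32 as a hypothetical name:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// load32 is a hypothetical byte-wise little-endian load.
func load32(b []byte, i int) uint32 {
	b = b[i : i+4]
	v0, v1, v2, v3 := uint32(b[0]), uint32(b[1]), uint32(b[2]), uint32(b[3])
	return (v3 << 24) | (v2 << 16) | (v1 << 8) | v0
}

func main() {
	buf := []byte{0x01, 0x02, 0x03, 0x04, 0x05, 0x06}
	for i := 0; i+4 <= len(buf); i++ {
		// Compare against the standard library's little-endian decode.
		fmt.Println(load32(buf, i) == binary.LittleEndian.Uint32(buf[i:]))
	}
}
```

The loop prints true for each of the three valid offsets.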
What did you see instead?
4 single byte loads.