Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: generate wider loads when feasible #23907

Closed
klauspost opened this issue Feb 18, 2018 · 2 comments
Closed

cmd/compile: generate wider loads when feasible #23907

klauspost opened this issue Feb 18, 2018 · 2 comments

Comments

@klauspost
Copy link
Contributor

klauspost commented Feb 18, 2018

What version of Go are you using (go version)?

go version go1.10 windows/amd64

Does this issue reproduce with the latest release?

Yes. Tip not tested.

What did you do?

When working with file compression, a typical time-critical part is reading multiple bytes from a bit stream.

A fairly typical function looks like this:

// Uint32 will read a little endian uint32 from current offset.
func (b byteReader) Uint32() uint32 {
	b2 := b.b[b.off : b.off+4 : b.off+4]
	v3 := uint32(b2[3])
	v2 := uint32(b2[2])
	v1 := uint32(b2[1])
	v0 := uint32(b2[0])
	return (v3 << 24) | (v2 << 16) | (v1 << 8) | v0
}

This has a single bounds check, but has 4 loads to get each byte, before shifting and or'ing them into place. On my test setup this operates at 1.49 ns/op.

It seems like there is some SSA logic in place that can combine two adjacent byte loads (with shifts) into a single word load. This can be seen changing the last line to ((v3 << 24) | (v2 << 16)) | ((v1 << 8) | v0). On my machine, this improves time to 1.24 ns/op.

It is possible to also load the high word using a single word, bringing this to 0.85 ns/op, which it the best I can get with the current compiler. The assembler shows two 16 bit loads.

However, it should be possible for the compiler to automatically reduce the original test case to a single 32 bit load on little endian systems.

You can download the example file and benchmark file here.

What did you expect to see?

The compiler combining this to a single load on little endian platforms.

What did you see instead?

4 single byte loads.

@ALTree
Copy link
Member

ALTree commented Feb 18, 2018

You can change the last line of Uint32Typical to

return v0 | (v1 << 8) | (v2 << 16) | (v3 << 24)

i.e. reverse the order of the or operands, and a single load will be generated. So the compiler does recognize that pattern, but the mechanism is not very robust and you need to specify the or operators in ascending order.

@ALTree ALTree changed the title cmd/compile: Generate wider loads when feasible cmd/compile: generate wider loads when feasible Feb 18, 2018
@klauspost
Copy link
Contributor Author

You are completely correct. Awesome! This seems like a small issue, then. I will close for now.

@golang golang locked and limited conversation to collaborators Feb 18, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants