
math/bits: consider generating lookup len8tab with type int array instead of uint8 array #23919

Closed
odeke-em opened this issue Feb 19, 2018 · 10 comments


@odeke-em
Member

I was just studying math/bits and noticed that the generated bits_tables.go's len8tab is a uint8 array, [256]uint8. However, every use converts the result to int, i.e.

$ grep -Rn 'len8tab' *
bits.go:290:	return int(len8tab[x])
bits.go:299:	return n + int(len8tab[x])
bits.go:312:	return n + int(len8tab[x])
bits.go:329:	return n + int(len8tab[x])
bits_tables.go:66:var len8tab = [256]uint8{
make_tables.go:36:	gen(buf, "len8tab", len8)


I believe this usage pattern gives us grounds to generate a [256]int array directly; plus we can squeeze out a nanosecond or so of overhead, as per this gist: https://gist.github.com/odeke-em/b0cb6f29d04cdda934726e23b414694a

$ benchstat old.txt new.txt 
name      old time/op    new time/op    delta
Lookup-4    22.9ns ± 8%    21.7ns ± 6%  -5.30%        (p=0.000 n=20+20)

which I ran over 20 samples.
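For concreteness, here is a minimal sketch of the variant being proposed (hypothetical names; the real change would be to the generator in make_tables.go):

var len8tabInt [256]int

func init() {
        // Fill in the bit length of every byte value, mirroring len8tab.
        for i := 1; i < 256; i++ {
                n := 0
                for v := i; v > 0; v >>= 1 {
                        n++
                }
                len8tabInt[i] = n
        }
}

// What e.g. Len8 would reduce to: a plain index, no int(...) conversion.
func len8Int(x uint8) int {
        return len8tabInt[x]
}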

@odeke-em
Member Author

/cc @griesemer

@cznic
Contributor

cznic commented Feb 19, 2018

That wastes 3 out of 4 bytes in the L1 cache (7 out of 8 on 64-bit platforms).
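To spell out the arithmetic (a sketch, assuming 8-byte ints and 64-byte cache lines):

package main

import (
        "fmt"
        "unsafe"
)

func main() {
        var tab8 [256]uint8
        var tabInt [256]int
        // [256]uint8 occupies 256 bytes (4 cache lines); [256]int
        // occupies 2048 bytes (32 cache lines) on 64-bit, with only
        // the low byte of each entry carrying information.
        fmt.Println(unsafe.Sizeof(tab8), unsafe.Sizeof(tabInt)) // 256 2048
}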

@martisch
Contributor

I agree that increasing the lookup table by a factor of 8 does not seem worth the performance increase.

Unfortunately our microbenchmarks do not measure performance when the cache is cold, nor the negative cache-eviction effects on other parts of the code.

The effect is larger than I would have expected. Maybe it's worth looking at the assembly to see whether the difference is more than a MOVQ vs MOVBZX, and whether there are any load or loop alignment effects.
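For example, one way to inspect the generated code (assuming a compiled test binary for math/bits):

$ go test -c -o bits.test math/bits
$ go tool objdump -s 'Len8' bits.test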

@ericlagergren
Contributor

ericlagergren commented Feb 19, 2018

I thought most of these were going to be intrinsified, in which case the extra ns of overhead shouldn't exist.

@rasky
Member

rasky commented Feb 19, 2018

bits.Len is intrinsified on most platforms. For instance, on amd64, it generates:

	0x0000 00000 (btq.go:17)	MOVQ	"".s1+8(SP), AX
	0x0005 00005 (btq.go:18)	BSRQ	AX, AX
	0x0009 00009 (btq.go:18)	MOVQ	$-1, CX
	0x0010 00016 (btq.go:18)	CMOVQEQ	CX, AX
	0x0014 00020 (btq.go:18)	INCQ	AX
	0x0017 00023 (btq.go:18)	MOVQ	AX, "".~r1+16(SP)
	0x001c 00028 (btq.go:18)	RET
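For reference, a sketch of the kind of source that produces this listing, reconstructed from the listing itself rather than taken from the original btq.go (assumes import "math/bits"):

func length(s1 uint64) int {
        // BSRQ computes the index of the highest set bit; CMOVQEQ
        // substitutes -1 when s1 == 0 (BSR sets ZF there); INCQ then
        // turns the bit index into a length.
        return bits.Len64(s1)
}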

I'm not sure optimizing a few nanoseconds of performance in the reference implementation is worth it.

@odeke-em
Member Author

odeke-em commented Feb 20, 2018

Thanks for the input, y'all.

That wastes 3 out of 4 bytes in the L1 cache (7 out of 8 on 64-bit platforms).

@cznic wasteful indeed, and naive, but it just struck me that we were doing the int(uint8) conversion in every spot. I had hesitated to propose a "row compression" scheme, which still uses roughly 256 bytes of payload but stores it as a [64]int array, because I thought it would be less readable and more complex. Thankfully this feedback gave me the license to post it, so:
a) We could use a [64]int table
b) Pack four consecutive bit counts into each int, with no risk of overflow since the maximum packed value is 0x08080808. First row: t[0] = asInt(0x00, 0x01, 0x02, 0x02), packed in reverse of the order the values appear in len8tab
c) To look up: find the row with t[x>>2] (i.e. x/4), then extract the column with >> ((x & 0x03) << 3) (i.e. byte x modulo 4)

var len8Rows [64]int

func init() {
        // Pack len8tabInt (assumed to be the [256]int form of len8tab)
        // four entries at a time into each element of len8Rows, lowest
        // index in the lowest byte, so entry i lands in byte i%4 of
        // row i/4.
        for i, j := 0, 0; i < len(len8tabInt); {
                r := len8tabInt[i : i+4]
                len8Rows[j] = r[3]<<24 | r[2]<<16 | r[1]<<8 | r[0]
                i, j = i+4, j+1
        }
}

func len64Compressed(x uint64) (n int) {
        // Narrow x down to its top non-zero byte, accumulating the
        // number of bits shifted away in n.
        if x >= 1<<32 {
                x >>= 32
                n = 32
        }
        if x >= 1<<16 {
                x >>= 16
                n += 16
        }
        if x >= 1<<8 {
                x >>= 8
                n += 8
        }
        // Row x>>2 (x/4) holds the entry; byte x&0x03 (x%4) of that
        // row, shifted down and masked, is the 8-bit length.
        return n + (0xFF & (len8Rows[x>>2] >> ((x & 0x03) << 3)))
}

which I believe is cache-friendlier, but the time/op results aren't radically better.
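A quick sanity check for the packed lookup (a sketch; assumes a _test.go file importing "math/bits" and "testing"):

func TestLen64Compressed(t *testing.T) {
        // Compare against the stdlib result for every power of two.
        for i := uint(0); i < 64; i++ {
                x := uint64(1) << i
                if got, want := len64Compressed(x), bits.Len64(x); got != want {
                        t.Fatalf("x=%#x: got %d, want %d", x, got, want)
                }
        }
}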

Unfortunately our microbenchmarks do not measure performance when the cache is cold, nor the negative cache-eviction effects on other parts of the code.
The effect is larger than I would have expected. Maybe it's worth looking at the assembly to see whether the difference is more than a MOVQ vs MOVBZX, and whether there are any load or loop alignment effects.

@martisch I'd love to learn how to do this. If one day you are bored and have time to kill, please feel free to work on this and ping me too, if it interests you.

I thought most of these were going to be intrinsified, in which case the extra ns of overhead shouldn't exist.

@ericlagergren I thought so too

I'm not sure optimizing nanoseconds of performance of the reference implementation is worth it.

@rasky gotcha gotcha.

I'll leave this open for three days just in case anyone has any thoughts or ideas, and then close it.

UPDATE: I fixed the algorithm for correctness, as my byte-packing order didn't match the order in which the entries appear in len8tab.

@griesemer
Contributor

The choice of []uint8 (rather than []int) was deliberate, taking into account both the savings in space and the extra cost of the conversion (and the mix of byte vs non-byte operations on x86 processors, which in the past caused pipeline stalls). Let's leave this as is. There's no reason to further micro-optimize the default implementation at the cost of more memory/cache use when there are optimized versions available where needed.

@odeke-em I leave this to you to close.

@odeke-em
Member Author

@griesemer thanks, agreed. I'll close it. As a thought experiment, when you get bored or have a free moment, please feel free to take a look at my version in #23919 (comment), which is just as space- and cache-efficient.

@griesemer
Contributor

@odeke-em It sounds like your version is not radically better than what we have now, so I'd say let's stick with the more straightforward (simpler) solution we have. Thanks.

@odeke-em
Member Author

Cool, thanks for taking a look at it @griesemer, and everyone, it was a great discussion.

@golang locked and limited conversation to collaborators Feb 21, 2019