
math/bits: consider generating lookup len8tab with type int array instead of uint8 array #23919

Closed
odeke-em opened this issue Feb 19, 2018 · 10 comments


@odeke-em
Member

I was just studying math/bits and noticed that the generated bits_tables.go's len8tab is a uint8 array, [256]uint8. However, every use converts the result to int, i.e.

$ grep -Rn 'len8tab' *
bits.go:290:	return int(len8tab[x])
bits.go:299:	return n + int(len8tab[x])
bits.go:312:	return n + int(len8tab[x])
bits.go:329:	return n + int(len8tab[x])
bits_tables.go:66:var len8tab = [256]uint8{
make_tables.go:36:	gen(buf, "len8tab", len8)


I believe this usage pattern gives us grounds to generate a [256]int array directly; plus we can squeeze out a nanosecond or so of overhead, as per this gist: https://gist.github.com/odeke-em/b0cb6f29d04cdda934726e23b414694a

$ benchstat old.txt new.txt 
name      old time/op    new time/op    delta
Lookup-4    22.9ns ± 8%    21.7ns ± 6%  -5.30%        (p=0.000 n=20+20)

which I ran over 20 samples.
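For concreteness, here is a minimal sketch of the variant being proposed (hypothetical names; the real change would be to the generator in make_tables.go):

var len8tabInt [256]int

func init() {
        // Fill in the bit length of every byte value, mirroring len8tab.
        for i := 1; i < 256; i++ {
                n := 0
                for v := i; v > 0; v >>= 1 {
                        n++
                }
                len8tabInt[i] = n
        }
}

// What e.g. Len8 would reduce to: a plain index, no int(...) conversion.
func len8Int(x uint8) int {
        return len8tabInt[x]
}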

@odeke-em
Member Author

/cc @griesemer

@cznic
Contributor

cznic commented Feb 19, 2018

That wastes 3 out of 4 bytes in the L1 cache (7 out of 8 on 64-bit platforms).
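To spell out the arithmetic (a sketch, assuming 8-byte ints and 64-byte cache lines):

package main

import (
        "fmt"
        "unsafe"
)

func main() {
        var tab8 [256]uint8
        var tabInt [256]int
        // [256]uint8 occupies 256 bytes (4 cache lines); [256]int
        // occupies 2048 bytes (32 cache lines) on 64-bit, with only
        // the low byte of each entry carrying information.
        fmt.Println(unsafe.Sizeof(tab8), unsafe.Sizeof(tabInt)) // 256 2048
}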

@martisch
Contributor

I agree that increasing the lookup table by a factor of 8 does not seem worth the performance increase.

Unfortunately our microbenchmarks do not measure performance when the cache is cold, nor the negative cache-eviction effects on other parts of the code.

The effect is larger than I would have expected. Maybe it's worth looking at the assembly to see whether the difference is more than a MOVQ vs MOVBZX, and whether there are any load or loop alignment effects.
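For example, one way to inspect the generated code (assuming a compiled test binary for math/bits):

$ go test -c -o bits.test math/bits
$ go tool objdump -s 'Len8' bits.test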

@ericlagergren
Contributor

ericlagergren commented Feb 19, 2018

I thought most of these were going to be intrinsified, in which case the extra ns of overhead shouldn't exist.

@rasky
Member

rasky commented Feb 19, 2018

bits.Len is intrinsified on most platforms. For instance, on amd64, it generates:

	0x0000 00000 (btq.go:17)	MOVQ	"".s1+8(SP), AX
	0x0005 00005 (btq.go:18)	BSRQ	AX, AX
	0x0009 00009 (btq.go:18)	MOVQ	$-1, CX
	0x0010 00016 (btq.go:18)	CMOVQEQ	CX, AX
	0x0014 00020 (btq.go:18)	INCQ	AX
	0x0017 00023 (btq.go:18)	MOVQ	AX, "".~r1+16(SP)
	0x001c 00028 (btq.go:18)	RET
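For reference, a sketch of the kind of source that produces this listing, reconstructed from the listing itself rather than taken from the original btq.go (assumes import "math/bits"):

func length(s1 uint64) int {
        // BSRQ computes the index of the highest set bit; CMOVQEQ
        // substitutes -1 when s1 == 0 (BSR sets ZF there); INCQ then
        // turns the bit index into a length.
        return bits.Len64(s1)
}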

I'm not sure optimizing a few nanoseconds of performance in the reference implementation is worth it.

@odeke-em
Member Author

odeke-em commented Feb 20, 2018

Thanks for the input, y'all.

That wastes 3 out of 4 bytes in the L1 cache (7 out of 8 on 64-bit platforms).

@cznic wasteful indeed, and naive, but it just struck me that we were doing the int(uint8) conversion in every spot. I had hesitated to propose a "row compression" scheme, which still uses roughly 256 bytes of payload but stores it as a [64]int array, because I thought it would be less readable and more complex. Thankfully this feedback gave me the license to post it, so:
a) We could use a [64]int table
b) Pack four consecutive bit counts into each int, with no risk of overflow since the maximum packed value is 0x08080808. First row: t[0] = asInt(0x00, 0x01, 0x02, 0x02), packed in reverse of the order the values appear in len8tab
c) To look up: find the row with t[x>>2] (i.e. x/4), then extract the column with >> ((x & 0x03) << 3) (i.e. byte x modulo 4)

var len8Rows [64]int

func init() {
        // Pack len8tabInt (assumed to be the [256]int form of len8tab)
        // four entries at a time into each element of len8Rows, lowest
        // index in the lowest byte, so entry i lands in byte i%4 of
        // row i/4.
        for i, j := 0, 0; i < len(len8tabInt); {
                r := len8tabInt[i : i+4]
                len8Rows[j] = r[3]<<24 | r[2]<<16 | r[1]<<8 | r[0]
                i, j = i+4, j+1
        }
}

func len64Compressed(x uint64) (n int) {
        // Narrow x down to its top non-zero byte, accumulating the
        // number of bits shifted away in n.
        if x >= 1<<32 {
                x >>= 32
                n = 32
        }
        if x >= 1<<16 {
                x >>= 16
                n += 16
        }
        if x >= 1<<8 {
                x >>= 8
                n += 8
        }
        // Row x>>2 (x/4) holds the entry; byte x&0x03 (x%4) of that
        // row, shifted down and masked, is the 8-bit length.
        return n + (0xFF & (len8Rows[x>>2] >> ((x & 0x03) << 3)))
}

which I believe is cache-friendlier, but the time/op results aren't radically better.
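A quick sanity check for the packed lookup (a sketch; assumes a _test.go file importing "math/bits" and "testing"):

func TestLen64Compressed(t *testing.T) {
        // Compare against the stdlib result for every power of two.
        for i := uint(0); i < 64; i++ {
                x := uint64(1) << i
                if got, want := len64Compressed(x), bits.Len64(x); got != want {
                        t.Fatalf("x=%#x: got %d, want %d", x, got, want)
                }
        }
}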

Unfortunately our microbenchmarks do not measure performance when the cache is cold, nor the negative cache-eviction effects on other parts of the code.
The effect is larger than I would have expected. Maybe it's worth looking at the assembly to see whether the difference is more than a MOVQ vs MOVBZX, and whether there are any load or loop alignment effects.

@martisch I'd love to learn how to do this. If one day you are bored and have time to kill, please feel free to work on this and ping me too, if it interests you.

I thought most of these were going to be intrinsified, in which case the extra ns of overhead shouldn't exist.

@ericlagergren I thought so too

I'm not sure optimizing nanoseconds of performance of the reference implementation is worth it.

@rasky gotcha gotcha.

I'll leave this open for three days just in case anyone has any thoughts or ideas, and then close it.

UPDATE: I fixed the algorithm for correctness, as my byte-packing order didn't match the order in which the entries appear in len8tab.

@griesemer
Contributor

The choice of []uint8 (rather than []int) was deliberate, taking into account both the savings in space and the extra cost of the conversion (and the mix of byte vs non-byte operations on x86 processors, which in the past caused pipeline stalls). Let's leave this as is. There's no reason to further micro-optimize the default implementation at the cost of more memory/cache use when there are optimized versions available where needed.

@odeke-em I leave this to you to close.

@odeke-em
Member Author

@griesemer thanks, agreed. I'll close it. As a thought experiment, when you get bored or have a free moment, please feel free to take a look at my version in #23919 (comment), which is just as space- and cache-efficient.

@griesemer
Contributor

@odeke-em It sounds like your version is not radically better than what we have now, so I'd say let's stick with the more straightforward (simpler) solution we have. Thanks.

@odeke-em
Member Author

Cool, thanks for taking a look at it @griesemer, and everyone, it was a great discussion.

@golang locked and limited conversation to collaborators Feb 21, 2019