Description:

compress/bzip2: index huffman tree with uint for performance
Using uint instead of uint16 as the index into the tree
gives a small performance boost.
benchmark               old ns/op    new ns/op    delta
BenchmarkDecodeDigits   24388148     23859797     -2.17%
BenchmarkDecodeTwain    70463920     67773138     -3.82%

benchmark               old MB/s     new MB/s     speedup
BenchmarkDecodeDigits   1.77         1.81         1.02x
BenchmarkDecodeTwain    1.77         1.84         1.04x
Updates issue 6754.
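The core of the change described above can be sketched as follows. This is an illustrative fragment, not the actual package source: the field and type names mirror the CL, but the surrounding code is invented for demonstration.

```go
package main

import "fmt"

// Sketch of the CL's idea: the child *indexes* into the nodes slice
// become uint (the machine's natural word size), while the decoded
// *symbols* stay uint16, since bzip2 has more than 256 of them.
type huffmanNode struct {
	left, right           uint   // was uint16: indexes into the nodes slice
	leftValue, rightValue uint16 // symbols: still uint16
}

// rootChildren indexes the tree with a uint, as the CL does.
func rootChildren(nodes []huffmanNode) (uint, uint) {
	nodeIndex := uint(0) // was uint16(0); node 0 is the root
	node := &nodes[nodeIndex]
	return node.left, node.right
}

func main() {
	nodes := []huffmanNode{{left: 1, right: 2}}
	l, r := rootChildren(nodes)
	fmt.Println(l, r)
}
```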
Patch Set 1
Patch Set 2: diff -r 07a722aece2db54bb4cccc58350b8657afbcdc35 https://go.googlecode.com/hg/
Patch Set 3: diff -r 07a722aece2db54bb4cccc58350b8657afbcdc35 https://go.googlecode.com/hg/

Messages

Total messages: 9
Hello golang-codereviews@googlegroups.com, I'd like you to review this change to https://go.googlecode.com/hg/
uint16 conveyed something before. And now your nodes are bigger too. Is the compiler doing something dumb? If so, I'd like to see a compiler bug and note its URL in comments in this CL.

On Wed, Aug 27, 2014 at 12:41 PM, <jra@nella.org> wrote:
> Reviewers: golang-codereviews,
>
> Message:
> Hello golang-codereviews@googlegroups.com,
>
> I'd like you to review this change to
> https://go.googlecode.com/hg/
>
> Description:
> compress/bzip2: index huffman tree with uint for performance
>
> Using uint instead of uint16 as the index into the tree
> gives a small performance boost.
>
> benchmark old ns/op new ns/op delta
> BenchmarkDecodeDigits 24388148 23859797 -2.17%
> BenchmarkDecodeTwain 70463920 67773138 -3.82%
>
> benchmark old MB/s new MB/s speedup
> BenchmarkDecodeDigits 1.77 1.81 1.02x
> BenchmarkDecodeTwain 1.77 1.84 1.04x
>
> Updates issue 6754.
>
> Please review this at https://codereview.appspot.com/131470043/
>
> Affected files (+5, -5 lines):
> M src/pkg/compress/bzip2/huffman.go
>
> Index: src/pkg/compress/bzip2/huffman.go
> ===================================================================
> --- a/src/pkg/compress/bzip2/huffman.go
> +++ b/src/pkg/compress/bzip2/huffman.go
> @@ -20,11 +20,11 @@
>  // nodes slice of the tree. If left or right is invalidNodeValue then the child
>  // is a leaf node and its value is in leftValue/rightValue.
>  //
> -// The symbols are uint16s because bzip2 encodes not only MTF indexes in the
> +// The symbols are uint16's because bzip2 encodes not only MTF indexes in the
>  // tree, but also two magic values for run-length encoding and an EOF symbol.
>  // Thus there are more than 256 possible symbols.
>  type huffmanNode struct {
> -	left, right           uint16
> +	left, right           uint
>  	leftValue, rightValue uint16
>  }
>
> @@ -34,7 +34,7 @@
>  // Decode reads bits from the given bitReader and navigates the tree until a
>  // symbol is found.
>  func (t *huffmanTree) Decode(br *bitReader) (v uint16) {
> -	nodeIndex := uint16(0) // node 0 is the root of the tree.
> +	nodeIndex := uint(0) // node 0 is the root of the tree.
>
>  	for {
>  		node := &t.nodes[nodeIndex]
> @@ -174,7 +174,7 @@
>  // buildHuffmanNode takes a slice of sorted huffmanCodes and builds a node in
>  // the Huffman tree at the given level. It returns the index of the newly
>  // constructed node.
> -func buildHuffmanNode(t *huffmanTree, codes []huffmanCode, level uint32) (nodeIndex uint16, err error) {
> +func buildHuffmanNode(t *huffmanTree, codes []huffmanCode, level uint32) (nodeIndex uint, err error) {
>  	test := uint32(1) << (31 - level)
>
>  	// We have to search the list of codes to find the divide between the left and right sides.
> @@ -223,7 +223,7 @@
>  		return buildHuffmanNode(t, left, level+1)
>  	}
>
> -	nodeIndex = uint16(t.nextNode)
> +	nodeIndex = uint(t.nextNode)
>  	node := &t.nodes[t.nextNode]
>  	t.nextNode++
The tree is made up of nodes with left and right branches. Each node points to a symbol (0-255) or other special symbols (see comment). So the things the nodes point to need to be uint16s, and still are.

This change is about using the natural word size of the machine as the index into the array holding the tree structure, instead of using uint16s to index into the array. I'll look at the disassembly to try to see why it's better this way.

It is true that the expanded Huffman trees cost 2x (32-bit) or 4x (64-bit) more memory at runtime. But those trees are unreferenced at the end of each call to readBlock and freed on the next GC; they are not part of the state of the bzip2 reader.
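The tree walk being discussed might look like the sketch below. This is a hypothetical reconstruction: the real Decode reads bits from a bitReader, while this version takes a bit slice so it is self-contained; invalidNodeValue mirrors the package's leaf marker.

```go
package main

import "fmt"

// invalidNodeValue marks a leaf, as in the compress/bzip2 package.
const invalidNodeValue = 0xffff

type huffmanNode struct {
	left, right           uint // indexes into the nodes slice (the CL's change)
	leftValue, rightValue uint16
}

// decode walks the tree using the given bits (true = take the right
// branch), returning the symbol stored at the leaf it reaches.
func decode(nodes []huffmanNode, bits []bool) uint16 {
	nodeIndex := uint(0) // natural word size index, per this CL
	for _, bit := range bits {
		node := &nodes[nodeIndex]
		if bit {
			if node.right == invalidNodeValue {
				return node.rightValue
			}
			nodeIndex = node.right
		} else {
			if node.left == invalidNodeValue {
				return node.leftValue
			}
			nodeIndex = node.left
		}
	}
	panic("ran out of bits before reaching a leaf")
}

func main() {
	// Tiny tree: root's left child is leaf 'a'; root's right child is
	// node 1, whose children are leaves 'b' and 'c'.
	nodes := []huffmanNode{
		{left: invalidNodeValue, leftValue: 'a', right: 1},
		{left: invalidNodeValue, leftValue: 'b', right: invalidNodeValue, rightValue: 'c'},
	}
	fmt.Println(decode(nodes, []bool{false}))      // prints 97 ('a')
	fmt.Println(decode(nodes, []bool{true, true})) // prints 99 ('c')
}
```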
On Aug 27, 2014 5:34 PM, <jra@nella.org> wrote:
> The tree is made up of nodes with left and right branches. Each node
> points to a symbol (0-255) or other special symbols (see comment). So
> the things the nodes point to need to be uint16s and still are.
>
> This change is about using the natural word size of the machine as the
> index into the array holding the tree structure, instead of using
> uint16s to index into the array.

If this change speeds things up, then I also think we should fix the compiler instead, because it is not uncommon to use smaller ints as indexes.
not lgtm

Sorry, but Brad's comments are spot on. The marginal performance improvement is not worth the semantic loss and the increased node size.

> Is the compiler doing something dumb? If so, I'd like to see a compiler bug
> and note its URL in comments in this CL.

Nothing strikingly dumb. Two things in the generated assembly change with this modification:

* The address &t.nodes[nodeIndex] on line 40 goes from an LEAQ to a multiply+add because the size of a huffmanNode increased. This is a performance regression of 16% (for those particular instructions) on my machine -- that is, this CL makes this part a tiny bit worse.

* A pointless temp register gets removed. Without this CL, we get e.g. for line 57:

MOVWQZX node.right,BP
MOVQ BP,AX

with BP unused elsewhere. With this CL, it is just:

MOVQ node.right,AX

There's no need for the detour through BP. Feel free to file a compiler bug about that, if there isn't one already. If you would, please write a minimal, standalone test case. It shouldn't be more than a few lines. (To see the generated assembly for a file f.go, run 'go tool 6g -S f.go'.)
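One possible shape for the minimal repro requested above might be the following sketch. It is hypothetical, not an actual filed test case: the idea is simply to load a uint16 field both at its own width and widened, then inspect the generated assembly (with 'go tool 6g -S' on the toolchain of that era, or 'go build -gcflags=-S' today) for the extra MOVWQZX-into-a-scratch-register sequence.

```go
package main

import "fmt"

// node is a hypothetical minimal struct with a uint16 field, mirroring
// the shape of the load under discussion.
type node struct{ right uint16 }

// load16 returns the field at its own width; the extra temp register
// is expected in this variant's assembly.
func load16(n *node) uint16 { return n.right }

// loadW widens at the use site, which removed the temp register in the
// CL's assembly.
func loadW(n *node) uint { return uint(n.right) }

func main() {
	n := node{right: 7}
	fmt.Println(load16(&n), loadW(&n))
}
```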
The key line is this:

node := &t.nodes[nodeIndex]

The disassembly for the uint16 indexes is:

// base of nodes[] -> DI
0808c7b4: MOVL 0(BX), DI
// base + offset -> DI
0808c7b6: LEAL 0(DI)(BP*8), DI
// load node
0808c7b9: MOVL DI, 0x14(SP)

For the uint indexes it is:

// base of nodes[] -> DI
0808c7b3: MOVL 0(BX), DI
// calc nodeIndex * sizeof(nodes[0])
0808c7b5: IMULL $0xc, BP, BP
// base + offset -> DI
0808c7b8: ADDL BP, DI
// load node
0808c7ba: MOVL DI, 0x14(SP)

So the faster one takes more instructions. But apparently that IMULL is faster than the LEAL.

Perhaps it has to do with pipeline depth? Because it takes two steps, the ADDL can be getting set up while the IMULL is finishing. Whereas with the LEAL, it all has to happen in one big operation?
Message was sent while issue was closed.
> So the faster one takes more instructions. But apparently that IMULL is faster
> than the LEAL.

You can benchmark this: http://play.golang.org/p/62rNi-OUFn

At least on my machine, the LEAL is faster:

BenchmarkB8    3000000    551 ns/op
BenchmarkB24   2000000    635 ns/op
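The linked playground snippet is not reproduced in this thread, but a benchmark in the same spirit might look like the sketch below. The element sizes (8 bytes vs 24 bytes, as the B8/B24 names suggest) and all names here are assumptions; the actual playground code may differ.

```go
package main

import (
	"fmt"
	"testing"
)

// s8 is 8 bytes, so indexing can use scaled (LEA-style) addressing;
// s24 is 24 bytes, forcing a multiply+add for the element offset.
type s8 struct{ a, b uint32 }

type s24 struct {
	a, b uint32
	pad  [2]uint64
}

// sink defeats dead-code elimination of the benchmark bodies.
var sink uint32

func sum8(xs []s8) (t uint32) {
	for i := range xs {
		t += xs[i].a
	}
	return
}

func sum24(xs []s24) (t uint32) {
	for i := range xs {
		t += xs[i].a
	}
	return
}

func main() {
	x8 := make([]s8, 1024)
	x24 := make([]s24, 1024)
	for i := range x8 {
		x8[i].a = uint32(i)
		x24[i].a = uint32(i)
	}
	b8 := testing.Benchmark(func(b *testing.B) {
		for n := 0; n < b.N; n++ {
			sink = sum8(x8)
		}
	})
	b24 := testing.Benchmark(func(b *testing.B) {
		for n := 0; n < b.N; n++ {
			sink = sum24(x24)
		}
	})
	fmt.Println("B8:", b8.NsPerOp(), "ns/op; B24:", b24.NsPerOp(), "ns/op")
}
```

As the thread notes, which variant wins is micro-architecture specific, so results will vary from machine to machine.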
On Aug 27, 2014 5:50 PM, <jra@nella.org> wrote:
> The key line is this:
>
> node := &t.nodes[nodeIndex]
>
> The disassembly for the uint16 indexes is:
>
> // base of nodes[] -> DI
> 0808c7b4: MOVL 0(BX), DI
> // base + offset -> DI
> 0808c7b6: LEAL 0(DI)(BP*8), DI
> // load node
> 0808c7b9: MOVL DI, 0x14(SP)
>
> For the uint indexes it is:
>
> // base of nodes[] -> DI
> 0808c7b3: MOVL 0(BX), DI
> // calc nodeIndex * sizeof(nodes[0])
> 0808c7b5: IMULL $0xc, BP, BP
> // base + offset -> DI
> 0808c7b8: ADDL BP, DI
> // load node
> 0808c7ba: MOVL DI, 0x14(SP)
>
> So the faster one takes more instructions. But apparently that IMULL is
> faster than the LEAL.
>
> Perhaps it has to do with pipeline depth? Because it takes two steps,
> the ADDL can be getting set up while the IMULL is finishing. Whereas
> with the LEAL, it all has to happen in one big operation?

OK, then it's not worth fixing the compiler until we want to add chip-specific optimizations. These kinds of things are too micro-architecture specific for us to handle generally (that's why we don't do instruction scheduling on x86).
Message was sent while issue was closed.
Are you sure your results are not noise? How have you benchmarked? Running go test with default flags once won't do.

This is not necessarily a performance bug in the compiler; the performance difference can be due to the 16-bit types themselves. See the Intel® 64 and IA-32 Architectures Optimization Reference Manual:

===
3.4.2.3 Length-Changing Prefixes (LCP)

The length of an instruction can be up to 15 bytes. Some prefixes can dynamically change the length of an instruction that the decoder must recognize. Typically, the pre-decode unit will estimate the length of an instruction in the byte stream assuming the absence of LCP. When the predecoder encounters an LCP in the fetch line, it must use a slower length decoding algorithm. With the slower length decoding algorithm, the predecoder decodes the fetch in 6 cycles, instead of the usual 1 cycle. Normal queuing throughput of the machine pipeline generally cannot hide LCP penalties.

The prefixes that can dynamically change the length of an instruction include:
• operand size prefix (0x66)
• address size prefix (0x67)

The instruction MOV DX, 01234h is subject to LCP stalls in processors based on Intel Core microarchitecture, and in Intel Core Duo and Intel Core Solo processors. Instructions that contain imm16 as part of their fixed encoding but do not require LCP to change the immediate size are not subject to LCP stalls. The REX prefix (4xh) in 64-bit mode can change the size of two classes of instruction, but does not cause an LCP penalty.

If the LCP stall happens in a tight loop, it can cause significant performance degradation. When decoding is not a bottleneck, as in floating-point heavy code, isolated LCP stalls usually do not cause performance degradation.

Assembly/Compiler Coding Rule 21. (MH impact, MH generality) Favor generating code using imm8 or imm32 values instead of imm16 values. If imm16 is needed, load equivalent imm32 into a register and use the word value in the register instead.
===

This is exactly what we have in the current code (2 instructions with the 0x66 prefix):

	if node.left == invalidNodeValue {
  44de10: 48 0f b7 1a           movzwq (%rdx),%rbx
  44de14: 66 81 fb ff ff        cmp    $0xffff,%bx
  44de19: 75 0f                 jne    44de2a <compress/bzip2.(*huffmanTree).Decode+0xfa>
	return node.leftValue
  44de1b: 48 0f b7 6a 04        movzwq 0x4(%rdx),%rbp
  44de20: 66 89 6c 24 50        mov    %bp,0x50(%rsp)
  44de25: 48 83 c4 38           add    $0x38,%rsp
  44de29: c3                    retq

https://codereview.appspot.com/135920043 replaces uint16 with uint in these cases without changing any data structures. Speedup is very difficult to measure due to noise, but here is what I got after 15 round-robin (old-new-old-new) runs of:

$ taskset 255 ./bzip2.test -test.run=none -test.bench=Decode -test.benchtime=3s

benchmark               old ns/op   new ns/op   delta
BenchmarkDecodeDigits   7126791     7074369     -0.74%
BenchmarkDecodeTwain    23769836    23635937    -0.56%
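The CL referenced above is not reproduced in this thread, but the use-site widening it describes might look like the following sketch. The field types are unchanged (no data-structure change), and the comparison is done in uint so the compiler can emit an imm32 compare instead of an LCP-prone imm16 one. invalidNodeValue mirrors the package constant; the helper function is illustrative, not the actual CL code.

```go
package main

import "fmt"

// invalidNodeValue marks a leaf, as in the compress/bzip2 package.
const invalidNodeValue = 0xffff

// huffmanNode keeps its uint16 fields: only the *uses* are widened.
type huffmanNode struct {
	left, right           uint16
	leftValue, rightValue uint16
}

// leftStep reports whether the left child is a leaf, returning either
// its symbol or the next node index. The widened compare avoids the
// operand-size (0x66) prefix on the immediate comparison.
func leftStep(n *huffmanNode) (next uint, symbol uint16, leaf bool) {
	if uint(n.left) == invalidNodeValue { // compare in uint, not uint16
		return 0, n.leftValue, true
	}
	return uint(n.left), 0, false
}

func main() {
	leafNode := huffmanNode{left: invalidNodeValue, leftValue: 42}
	_, sym, leaf := leftStep(&leafNode)
	fmt.Println(sym, leaf) // prints 42 true
}
```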