Description:

compress/bzip2: index huffman tree with uint for performance
Using uint instead of uint16 as the index into the tree
gives a small performance boost.
benchmark               old ns/op    new ns/op    delta
BenchmarkDecodeDigits   24388148     23859797     -2.17%
BenchmarkDecodeTwain    70463920     67773138     -3.82%

benchmark               old MB/s     new MB/s     speedup
BenchmarkDecodeDigits   1.77         1.81         1.02x
BenchmarkDecodeTwain    1.77         1.84         1.04x
Updates issue 6754.
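The core of the change described above can be sketched as follows. This is an illustrative fragment, not the actual package source: the field and type names mirror the CL, but the surrounding code is invented for demonstration.

```go
package main

import "fmt"

// Sketch of the CL's idea: the child *indexes* into the nodes slice
// become uint (the machine's natural word size), while the decoded
// *symbols* stay uint16, since bzip2 has more than 256 of them.
type huffmanNode struct {
	left, right           uint   // was uint16: indexes into the nodes slice
	leftValue, rightValue uint16 // symbols: still uint16
}

// rootChildren indexes the tree with a uint, as the CL does.
func rootChildren(nodes []huffmanNode) (uint, uint) {
	nodeIndex := uint(0) // was uint16(0); node 0 is the root
	node := &nodes[nodeIndex]
	return node.left, node.right
}

func main() {
	nodes := []huffmanNode{{left: 1, right: 2}}
	l, r := rootChildren(nodes)
	fmt.Println(l, r)
}
```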
Patch Set 1
Patch Set 2: diff -r 07a722aece2db54bb4cccc58350b8657afbcdc35 https://go.googlecode.com/hg/
Patch Set 3: diff -r 07a722aece2db54bb4cccc58350b8657afbcdc35 https://go.googlecode.com/hg/

Messages

Total messages: 9
Hello golang-codereviews@googlegroups.com, I'd like you to review this change to https://go.googlecode.com/hg/
uint16 conveyed something before. And now your nodes are bigger too. Is the compiler doing something dumb? If so, I'd like to see a compiler bug and note its URL in comments in this CL.

On Wed, Aug 27, 2014 at 12:41 PM, <jra@nella.org> wrote:
> Reviewers: golang-codereviews,
>
> Message:
> Hello golang-codereviews@googlegroups.com,
>
> I'd like you to review this change to
> https://go.googlecode.com/hg/
>
> Description:
> compress/bzip2: index huffman tree with uint for performance
>
> Using uint instead of uint16 as the index into the tree
> gives a small performance boost.
>
> benchmark old ns/op new ns/op delta
> BenchmarkDecodeDigits 24388148 23859797 -2.17%
> BenchmarkDecodeTwain 70463920 67773138 -3.82%
>
> benchmark old MB/s new MB/s speedup
> BenchmarkDecodeDigits 1.77 1.81 1.02x
> BenchmarkDecodeTwain 1.77 1.84 1.04x
>
> Updates issue 6754.
>
> Please review this at https://codereview.appspot.com/131470043/
>
> Affected files (+5, -5 lines):
> M src/pkg/compress/bzip2/huffman.go
>
> Index: src/pkg/compress/bzip2/huffman.go
> ===================================================================
> --- a/src/pkg/compress/bzip2/huffman.go
> +++ b/src/pkg/compress/bzip2/huffman.go
> @@ -20,11 +20,11 @@
>  // nodes slice of the tree. If left or right is invalidNodeValue then the child
>  // is a leaf node and its value is in leftValue/rightValue.
>  //
> -// The symbols are uint16s because bzip2 encodes not only MTF indexes in the
> +// The symbols are uint16's because bzip2 encodes not only MTF indexes in the
>  // tree, but also two magic values for run-length encoding and an EOF symbol.
>  // Thus there are more than 256 possible symbols.
>  type huffmanNode struct {
> -	left, right           uint16
> +	left, right           uint
>  	leftValue, rightValue uint16
>  }
>
> @@ -34,7 +34,7 @@
>  // Decode reads bits from the given bitReader and navigates the tree until a
>  // symbol is found.
>  func (t *huffmanTree) Decode(br *bitReader) (v uint16) {
> -	nodeIndex := uint16(0) // node 0 is the root of the tree.
> +	nodeIndex := uint(0) // node 0 is the root of the tree.
>
>  	for {
>  		node := &t.nodes[nodeIndex]
> @@ -174,7 +174,7 @@
>  // buildHuffmanNode takes a slice of sorted huffmanCodes and builds a node in
>  // the Huffman tree at the given level. It returns the index of the newly
>  // constructed node.
> -func buildHuffmanNode(t *huffmanTree, codes []huffmanCode, level uint32) (nodeIndex uint16, err error) {
> +func buildHuffmanNode(t *huffmanTree, codes []huffmanCode, level uint32) (nodeIndex uint, err error) {
>  	test := uint32(1) << (31 - level)
>
>  	// We have to search the list of codes to find the divide between the left and right sides.
> @@ -223,7 +223,7 @@
>  		return buildHuffmanNode(t, left, level+1)
>  	}
>
> -	nodeIndex = uint16(t.nextNode)
> +	nodeIndex = uint(t.nextNode)
>  	node := &t.nodes[t.nextNode]
>  	t.nextNode++
The tree is made up of nodes with left and right branches. Each node points to a symbol (0-255) or other special symbols (see comment). So the things the nodes point to need to be uint16s, and still are.

This change is about using the natural word size of the machine as the index into the array holding the tree structure, instead of using uint16s to index into the array. I'll look at the disassembly to try to see why it's better this way.

It is true that the expanded Huffman trees cost 2x (32-bit) or 4x (64-bit) more memory at runtime. But those trees are unreferenced at the end of each call to readBlock and freed on the next GC; they are not part of the state of the bzip2 reader.
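The tree walk being discussed might look like the sketch below. This is a hypothetical reconstruction: the real Decode reads bits from a bitReader, while this version takes a bit slice so it is self-contained; invalidNodeValue mirrors the package's leaf marker.

```go
package main

import "fmt"

// invalidNodeValue marks a leaf, as in the compress/bzip2 package.
const invalidNodeValue = 0xffff

type huffmanNode struct {
	left, right           uint // indexes into the nodes slice (the CL's change)
	leftValue, rightValue uint16
}

// decode walks the tree using the given bits (true = take the right
// branch), returning the symbol stored at the leaf it reaches.
func decode(nodes []huffmanNode, bits []bool) uint16 {
	nodeIndex := uint(0) // natural word size index, per this CL
	for _, bit := range bits {
		node := &nodes[nodeIndex]
		if bit {
			if node.right == invalidNodeValue {
				return node.rightValue
			}
			nodeIndex = node.right
		} else {
			if node.left == invalidNodeValue {
				return node.leftValue
			}
			nodeIndex = node.left
		}
	}
	panic("ran out of bits before reaching a leaf")
}

func main() {
	// Tiny tree: root's left child is leaf 'a'; root's right child is
	// node 1, whose children are leaves 'b' and 'c'.
	nodes := []huffmanNode{
		{left: invalidNodeValue, leftValue: 'a', right: 1},
		{left: invalidNodeValue, leftValue: 'b', right: invalidNodeValue, rightValue: 'c'},
	}
	fmt.Println(decode(nodes, []bool{false}))      // prints 97 ('a')
	fmt.Println(decode(nodes, []bool{true, true})) // prints 99 ('c')
}
```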
On Aug 27, 2014 5:34 PM, <jra@nella.org> wrote:
> The tree is made up of nodes with left and right branches. Each node
> points to a symbol (0-255) or other special symbols (see comment). So
> the things the nodes point to need to be uint16s and still are.
>
> This change is about using the natural word size of the machine as the
> index into the array holding the tree structure, instead of using
> uint16s to index into the array.

If this change speeds things up, then I also think we should fix the compiler instead, because it is not uncommon to use smaller ints as indexes.
not lgtm

Sorry, but Brad's comments are spot on. The marginal performance improvement is not worth the semantic loss and the increased node size.

> Is the compiler doing something dumb? If so, I'd like to see a compiler bug
> and note its URL in comments in this CL.

Nothing strikingly dumb. Two things in the generated assembly change with this modification:

* The address &t.nodes[nodeIndex] on line 40 goes from an LEAQ to a multiply+add because the size of a huffmanNode increased. This is a performance regression of 16% (for those particular instructions) on my machine -- that is, this CL makes this part a tiny bit worse.

* A pointless temp register gets removed. Without this CL, we get e.g. for line 57:

MOVWQZX node.right,BP
MOVQ BP,AX

with BP unused elsewhere. With this CL, it is just:

MOVQ node.right,AX

There's no need for the detour through BP. Feel free to file a compiler bug about that, if there isn't one already. If you would, please write a minimal, standalone test case. It shouldn't be more than a few lines. (To see the generated assembly for a file f.go, run 'go tool 6g -S f.go'.)
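One possible shape for the minimal repro requested above might be the following sketch. It is hypothetical, not an actual filed test case: the idea is simply to load a uint16 field both at its own width and widened, then inspect the generated assembly (with 'go tool 6g -S' on the toolchain of that era, or 'go build -gcflags=-S' today) for the extra MOVWQZX-into-a-scratch-register sequence.

```go
package main

import "fmt"

// node is a hypothetical minimal struct with a uint16 field, mirroring
// the shape of the load under discussion.
type node struct{ right uint16 }

// load16 returns the field at its own width; the extra temp register
// is expected in this variant's assembly.
func load16(n *node) uint16 { return n.right }

// loadW widens at the use site, which removed the temp register in the
// CL's assembly.
func loadW(n *node) uint { return uint(n.right) }

func main() {
	n := node{right: 7}
	fmt.Println(load16(&n), loadW(&n))
}
```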
The key line is this:

node := &t.nodes[nodeIndex]

The disassembly for the uint16 indexes is:

// base of nodes[] -> DI
0808c7b4: MOVL 0(BX), DI
// base + offset -> DI
0808c7b6: LEAL 0(DI)(BP*8), DI
// load node
0808c7b9: MOVL DI, 0x14(SP)

For the uint indexes it is:

// base of nodes[] -> DI
0808c7b3: MOVL 0(BX), DI
// calc nodeIndex * sizeof(nodes[0])
0808c7b5: IMULL $0xc, BP, BP
// base + offset -> DI
0808c7b8: ADDL BP, DI
// load node
0808c7ba: MOVL DI, 0x14(SP)

So the faster one takes more instructions. But apparently that IMULL is faster than the LEAL.

Perhaps it has to do with pipeline depth? Because it takes two steps, the ADDL can be getting set up while the IMULL is finishing. Whereas with the LEAL, it all has to happen in one big operation?
Message was sent while issue was closed.
> So the faster one takes more instructions. But apparently that IMULL is faster
> than the LEAL.

You can benchmark this: http://play.golang.org/p/62rNi-OUFn

At least on my machine, the LEAL is faster:

BenchmarkB8    3000000    551 ns/op
BenchmarkB24   2000000    635 ns/op
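The linked playground snippet is not reproduced in this thread, but a benchmark in the same spirit might look like the sketch below. The element sizes (8 bytes vs 24 bytes, as the B8/B24 names suggest) and all names here are assumptions; the actual playground code may differ.

```go
package main

import (
	"fmt"
	"testing"
)

// s8 is 8 bytes, so indexing can use scaled (LEA-style) addressing;
// s24 is 24 bytes, forcing a multiply+add for the element offset.
type s8 struct{ a, b uint32 }

type s24 struct {
	a, b uint32
	pad  [2]uint64
}

// sink defeats dead-code elimination of the benchmark bodies.
var sink uint32

func sum8(xs []s8) (t uint32) {
	for i := range xs {
		t += xs[i].a
	}
	return
}

func sum24(xs []s24) (t uint32) {
	for i := range xs {
		t += xs[i].a
	}
	return
}

func main() {
	x8 := make([]s8, 1024)
	x24 := make([]s24, 1024)
	for i := range x8 {
		x8[i].a = uint32(i)
		x24[i].a = uint32(i)
	}
	b8 := testing.Benchmark(func(b *testing.B) {
		for n := 0; n < b.N; n++ {
			sink = sum8(x8)
		}
	})
	b24 := testing.Benchmark(func(b *testing.B) {
		for n := 0; n < b.N; n++ {
			sink = sum24(x24)
		}
	})
	fmt.Println("B8:", b8.NsPerOp(), "ns/op; B24:", b24.NsPerOp(), "ns/op")
}
```

As the thread notes, which variant wins is micro-architecture specific, so results will vary from machine to machine.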
On Aug 27, 2014 5:50 PM, <jra@nella.org> wrote:
> The key line is this:
>
> node := &t.nodes[nodeIndex]
>
> The disassembly for the uint16 indexes is:
>
> // base of nodes[] -> DI
> 0808c7b4: MOVL 0(BX), DI
> // base + offset -> DI
> 0808c7b6: LEAL 0(DI)(BP*8), DI
> // load node
> 0808c7b9: MOVL DI, 0x14(SP)
>
> For the uint indexes it is:
>
> // base of nodes[] -> DI
> 0808c7b3: MOVL 0(BX), DI
> // calc nodeIndex * sizeof(nodes[0])
> 0808c7b5: IMULL $0xc, BP, BP
> // base + offset -> DI
> 0808c7b8: ADDL BP, DI
> // load node
> 0808c7ba: MOVL DI, 0x14(SP)
>
> So the faster one takes more instructions. But apparently that IMULL is
> faster than the LEAL.
>
> Perhaps it has to do with pipeline depth? Because it takes two steps,
> the ADDL can be getting set up while the IMULL is finishing. Whereas
> with the LEAL, it all has to happen in one big operation?

OK, then it's not worth fixing the compiler until we want to add chip-specific optimizations. These kinds of things are too micro-architecture specific for us to handle generally (that's why we don't do instruction scheduling on x86).
Message was sent while issue was closed.
Are you sure your results are not noise? How have you benchmarked? Running go test with default flags once won't do.

This is not necessarily a performance bug in the compiler; the performance difference can be due to the 16-bit types themselves. See the Intel® 64 and IA-32 Architectures Optimization Reference Manual:

===
3.4.2.3 Length-Changing Prefixes (LCP)

The length of an instruction can be up to 15 bytes. Some prefixes can dynamically change the length of an instruction that the decoder must recognize. Typically, the pre-decode unit will estimate the length of an instruction in the byte stream assuming the absence of LCP. When the predecoder encounters an LCP in the fetch line, it must use a slower length decoding algorithm. With the slower length decoding algorithm, the predecoder decodes the fetch in 6 cycles, instead of the usual 1 cycle. Normal queuing throughput of the machine pipeline generally cannot hide LCP penalties.

The prefixes that can dynamically change the length of an instruction include:
• operand size prefix (0x66)
• address size prefix (0x67)

The instruction MOV DX, 01234h is subject to LCP stalls in processors based on Intel Core microarchitecture, and in Intel Core Duo and Intel Core Solo processors. Instructions that contain imm16 as part of their fixed encoding but do not require LCP to change the immediate size are not subject to LCP stalls. The REX prefix (4xh) in 64-bit mode can change the size of two classes of instruction, but does not cause an LCP penalty.

If the LCP stall happens in a tight loop, it can cause significant performance degradation. When decoding is not a bottleneck, as in floating-point heavy code, isolated LCP stalls usually do not cause performance degradation.

Assembly/Compiler Coding Rule 21. (MH impact, MH generality) Favor generating code using imm8 or imm32 values instead of imm16 values. If imm16 is needed, load equivalent imm32 into a register and use the word value in the register instead.
===

This is exactly what we have in the current code (2 instructions with the 0x66 prefix):

	if node.left == invalidNodeValue {
  44de10: 48 0f b7 1a           movzwq (%rdx),%rbx
  44de14: 66 81 fb ff ff        cmp    $0xffff,%bx
  44de19: 75 0f                 jne    44de2a <compress/bzip2.(*huffmanTree).Decode+0xfa>
	return node.leftValue
  44de1b: 48 0f b7 6a 04        movzwq 0x4(%rdx),%rbp
  44de20: 66 89 6c 24 50        mov    %bp,0x50(%rsp)
  44de25: 48 83 c4 38           add    $0x38,%rsp
  44de29: c3                    retq

https://codereview.appspot.com/135920043 replaces uint16 with uint in these cases without changing any data structures. Speedup is very difficult to measure due to noise, but here is what I got after 15 round-robin (old-new-old-new) runs of:

$ taskset 255 ./bzip2.test -test.run=none -test.bench=Decode -test.benchtime=3s

benchmark               old ns/op   new ns/op   delta
BenchmarkDecodeDigits   7126791     7074369     -0.74%
BenchmarkDecodeTwain    23769836    23635937    -0.56%
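The CL referenced above is not reproduced in this thread, but the use-site widening it describes might look like the following sketch. The field types are unchanged (no data-structure change), and the comparison is done in uint so the compiler can emit an imm32 compare instead of an LCP-prone imm16 one. invalidNodeValue mirrors the package constant; the helper function is illustrative, not the actual CL code.

```go
package main

import "fmt"

// invalidNodeValue marks a leaf, as in the compress/bzip2 package.
const invalidNodeValue = 0xffff

// huffmanNode keeps its uint16 fields: only the *uses* are widened.
type huffmanNode struct {
	left, right           uint16
	leftValue, rightValue uint16
}

// leftStep reports whether the left child is a leaf, returning either
// its symbol or the next node index. The widened compare avoids the
// operand-size (0x66) prefix on the immediate comparison.
func leftStep(n *huffmanNode) (next uint, symbol uint16, leaf bool) {
	if uint(n.left) == invalidNodeValue { // compare in uint, not uint16
		return 0, n.leftValue, true
	}
	return uint(n.left), 0, false
}

func main() {
	leafNode := huffmanNode{left: invalidNodeValue, leftValue: 42}
	_, sym, leaf := leftStep(&leafNode)
	fmt.Println(sym, leaf) // prints 42 true
}
```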