Rietveld Code Review Tool

Issue 131470043: code review 131470043: compress/bzip2: index huffman tree with uint for performance (Closed)

Created:
9 years, 7 months ago by jra
Modified:
9 years, 7 months ago
Visibility:
Public.

Description

compress/bzip2: index huffman tree with uint for performance

Using uint instead of uint16 as the index into the tree gives a small performance boost.

benchmark                 old ns/op     new ns/op     delta
BenchmarkDecodeDigits     24388148      23859797      -2.17%
BenchmarkDecodeTwain      70463920      67773138      -3.82%

benchmark                 old MB/s     new MB/s     speedup
BenchmarkDecodeDigits     1.77         1.81         1.02x
BenchmarkDecodeTwain      1.77         1.84         1.04x

Updates issue 6754.

Patch Set 1 #

Patch Set 2 : diff -r 07a722aece2db54bb4cccc58350b8657afbcdc35 https://go.googlecode.com/hg/ #

Patch Set 3 : diff -r 07a722aece2db54bb4cccc58350b8657afbcdc35 https://go.googlecode.com/hg/ #

Stats (+5 lines, -5 lines)
M src/pkg/compress/bzip2/huffman.go  4 chunks  +5 lines, -5 lines  0 comments

Messages

Total messages: 9
jra
Hello golang-codereviews@googlegroups.com, I'd like you to review this change to https://go.googlecode.com/hg/
9 years, 7 months ago (2014-08-27 19:41:47 UTC) #1
bradfitz
uint16 conveyed something before. And now your nodes are bigger too. Is the compiler doing ...
9 years, 7 months ago (2014-08-27 19:46:50 UTC) #2
jra
The tree is made up of nodes with left and right branches. Each node points ...
9 years, 7 months ago (2014-08-27 21:34:30 UTC) #3
minux
On Aug 27, 2014 5:34 PM, <jra@nella.org> wrote: > The tree is made up of ...
9 years, 7 months ago (2014-08-27 21:40:07 UTC) #4
josharian
not lgtm Sorry, but Brad's comments are spot on. The marginal performance improvement is not ...
9 years, 7 months ago (2014-08-27 21:46:27 UTC) #5
jra
The key line is this: node := &t.nodes[nodeIndex] The disassembly for the uint16 indexes is: ...
9 years, 7 months ago (2014-08-27 21:50:53 UTC) #6
josharian
> So the faster one takes more instructions. But apparently that IMULL is faster > ...
9 years, 7 months ago (2014-08-27 21:54:39 UTC) #7
minux
On Aug 27, 2014 5:50 PM, <jra@nella.org> wrote: > > The key line is this: ...
9 years, 7 months ago (2014-08-27 22:18:13 UTC) #8
dvyukov
9 years, 7 months ago (2014-08-28 08:54:22 UTC) #9
Message was sent while issue was closed.
Are you sure your results are not noise? How have you benchmarked? Running go
test with default flags once won't do.

This is not necessarily a performance bug in the compiler; the performance
difference can be due to the 16-bit types themselves. See the Intel® 64 and IA-32
Architectures Optimization Reference Manual:

===
3.4.2.3 Length-Changing Prefixes (LCP)
The length of an instruction can be up to 15 bytes in length. Some prefixes can
dynamically change the length of an instruction that the decoder must recognize.
Typically, the pre-decode unit will estimate the length of an instruction in the
byte stream assuming the absence of LCP. When the predecoder encounters an LCP
in the fetch line, it must use a slower length decoding algorithm. With the
slower length decoding algorithm, the predecoder decodes the fetch in 6 cycles,
instead of the usual 1 cycle. Normal queuing throughput of the machine pipeline
generally cannot hide LCP penalties.
The prefixes that can dynamically change the length of an instruction include:
• operand size prefix (0x66)
• address size prefix (0x67)
The instruction MOV DX, 01234h is subject to LCP stalls in processors based on
Intel Core microarchitecture, and in Intel Core Duo and Intel Core Solo
processors. Instructions that contain imm16 as part of their fixed encoding but
do not require LCP to change the immediate size are not subject to LCP stalls.
The REX prefix (4xh) in 64-bit mode can change the size of two classes of
instruction, but does not cause an LCP penalty.
If the LCP stall happens in a tight loop, it can cause significant performance
degradation. When decoding is not a bottleneck, as in floating-point heavy code,
isolated LCP stalls usually do not cause performance degradation.
Assembly/Compiler Coding Rule 21. (MH impact, MH generality) Favor generating
code using imm8 or imm32 values instead of imm16 values. If imm16 is needed,
load equivalent imm32 into a register and use the word value in the register
instead.
===

This is exactly what we have in the current code (2 instructions with 0x66
prefix):

                        if node.left == invalidNodeValue {
  44de10:       48 0f b7 1a             movzwq (%rdx),%rbx
  44de14:       66 81 fb ff ff          cmp    $0xffff,%bx
  44de19:       75 0f                   jne    44de2a <compress/bzip2.(*huffmanTree).Decode+0xfa>
                                return node.leftValue
  44de1b:       48 0f b7 6a 04          movzwq 0x4(%rdx),%rbp
  44de20:       66 89 6c 24 50          mov    %bp,0x50(%rsp)
  44de25:       48 83 c4 38             add    $0x38,%rsp
  44de29:       c3                      retq   


https://codereview.appspot.com/135920043 replaces uint16 with uint in these
cases without changing any data structures.
Speedup is very difficult to measure due to noise, but here is what I got after
15 round-robin (old-new-old-new) runs of
$ taskset 255 ./bzip2.test -test.run=none -test.bench=Decode -test.benchtime=3s

benchmark                 old ns/op     new ns/op     delta      
BenchmarkDecodeDigits     7126791       7074369       -0.74%     
BenchmarkDecodeTwain      23769836      23635937      -0.56%
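The round-robin methodology described above could be scripted along these lines (a sketch, not a verbatim command history: the binary names are hypothetical, and benchstat postdates this review but is the modern way to compare the two result files):

```shell
# Build old and new test binaries, then interleave runs to average out drift.
for i in $(seq 15); do
  taskset 255 ./bzip2.old.test -test.run=none -test.bench=Decode -test.benchtime=3s >> old.txt
  taskset 255 ./bzip2.new.test -test.run=none -test.bench=Decode -test.benchtime=3s >> new.txt
done
benchstat old.txt new.txt
```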

This is Rietveld f62528b