rsc
12 years, 3 months ago
(2012-01-24 15:36:05 UTC)
#3
I am not sure about this change.
Why does making the hash table bigger (which would seem
like it gives the compressor more work to do, and waste
memory besides) make the compressor run faster on
uncompressible data?
Why is uncompressible data something worth speeding up?
Also, this doesn't help compression ratios, which makes it
seem like not a good use of the extra memory.
imkrasin
12 years, 3 months ago
(2012-01-24 17:43:41 UTC)
#4
On 2012/01/24 15:36:05, rsc wrote:
> I am not sure about this change.
>
> Why does making the hash table bigger (which would seem
> like it gives the compressor more work to do, and waste
> memory besides) make the compressor run faster on
> uncompressible data?
For every hash value it encounters, the compressor adds that hash to a
single-linked list with its head in hashHead[hash] and the links between
nodes in prevHash[index]. So the compressor knows that the given hash
value occurs at position hashHead[hash], then at prevHash[hashHead[hash]],
then at prevHash[prevHash[hashHead[hash]]], and so on.
findMatch is then called to walk that list and find the longest match for
the current hash value. On a hash collision, findMatch walks the whole
list and returns without finding any match.
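Roughly, the structure looks like this (a simplified sketch with made-up
helper names and a fixed chain cap, not the actual deflate.go code):

const (
	hashBits   = 17
	hashSize   = 1 << hashBits
	windowSize = 1 << 15 // maximum backward offset allowed by DEFLATE
)

type chains struct {
	hashHead [hashSize]int   // hash -> most recent position + 1 (0 means empty)
	prevHash [windowSize]int // position -> previous position + 1 with the same hash
}

// insert records that the minMatchLength bytes starting at pos hash to h.
func (c *chains) insert(h uint32, pos int) {
	c.prevHash[pos&(windowSize-1)] = c.hashHead[h]
	c.hashHead[h] = pos + 1
}

// findMatch walks the chain for h and returns the start and length of the
// longest match against window[pos:]; bestLen stays 0 if every candidate
// on the chain was only a hash collision. minIndex bounds how far back a
// candidate may be (the DEFLATE offset limit).
func (c *chains) findMatch(window []byte, pos, minIndex int, h uint32) (best, bestLen int) {
	const maxChainLength = 128 // the real code varies this with the compression level
	cand := c.hashHead[h] - 1
	for i := 0; i < maxChainLength && cand >= minIndex; i++ {
		if n := matchLen(window[cand:], window[pos:]); n > bestLen {
			best, bestLen = cand, n
		}
		cand = c.prevHash[cand&(windowSize-1)] - 1
	}
	return best, bestLen
}

// matchLen returns the length of the common prefix of a and b.
func matchLen(a, b []byte) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}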
This usually happens on uncompressible data like .avi files, and it
rarely happens on text, because text uses only a small subset of byte
values and so produces few hash collisions. Real-world data is likely to
fall somewhere in between.
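This is easy to see with a toy experiment like the one below (my own
illustration, not part of the CL; the hash expression only approximates
the real one): count how many distinct 3-byte strings end up sharing a
bucket for a given table size.

// countCollisions hashes every 3-byte string in data into a table of
// 1<<hashBits buckets and counts how often two different strings land
// in the same bucket.
// e.g. compare countCollisions(randomBytes, 15) with countCollisions(asciiText, 15).
func countCollisions(data []byte, hashBits uint) int {
	const minMatch = 3
	shift := (hashBits + minMatch - 1) / minMatch
	mask := uint32(1)<<hashBits - 1
	firstSeen := make(map[uint32]string) // bucket -> first 3-byte string seen there
	collisions := 0
	for i := 0; i+minMatch <= len(data); i++ {
		var h uint32
		for j := 0; j < minMatch; j++ { // the real code updates the hash incrementally
			h = (h<<shift + uint32(data[i+j])) & mask
		}
		s := string(data[i : i+minMatch])
		if prev, ok := firstSeen[h]; !ok {
			firstSeen[h] = s
		} else if prev != s {
			collisions++ // two different 3-byte strings share a bucket
		}
	}
	return collisions
}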
This change does not add any extra work for the compressor. Quite the
opposite: it significantly reduces the probability of a hash collision,
and with it the number of times findMatch is called for no gain.
For example, if I compress the Google Chrome binary, I get the following results:
gzip: 3.18 s
Go: 7.26 s
patched Go: 6.86 s [6% improvement]
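Something along these lines is enough to reproduce that kind of
measurement (the input path is a placeholder; this is only a sketch of a
harness, not the exact one I used):

package main

import (
	"compress/flate"
	"fmt"
	"io/ioutil"
	"log"
	"time"
)

func main() {
	data, err := ioutil.ReadFile("chrome") // placeholder input file
	if err != nil {
		log.Fatal(err)
	}
	start := time.Now()
	w, err := flate.NewWriter(ioutil.Discard, flate.DefaultCompression)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := w.Write(data); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("compressed %d bytes in %v\n", len(data), time.Since(start))
}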
Also, this memory (512K for hashHead) is only used during compression. I
could even allocate it on the stack to avoid putting pressure on the GC
when the compressor is called frequently (in fact, that is one of the
optimisations on my pending list, though other inefficiencies likely
contribute much more to the overall picture).
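That figure is just the size of the table, assuming 4-byte entries:

// hashHead: 1<<17 buckets * 4 bytes per entry = 524288 bytes = 512 KiB,
// allocated only while a compressor is in use.
const hashHeadBytes = (1 << 17) * 4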
>
> Why is uncompressible data something worth speeding up?
>
> Also, this doesn't help compression ratios, which makes it
> seem like not a good use of the extra memory.
imkrasin
12 years, 3 months ago
(2012-01-24 17:59:38 UTC)
#6
On 2012/01/24 17:44:25, imkrasin wrote:
> err. s/single-linker list/single-linked list/, of course. :)
I would also explain why 1<<17 is much better than 1<<15. At most 1<<15
entries in the chains are live (positions >= minIndex) because of the
limit on the match offset. With len(hashHead) == 1<<15, that leads to no
collisions only in the best case (like English text). Increasing the
size of the array even just twofold, to 1<<16, would already
significantly reduce the number of hash collisions.
Unfortunately, 1<<16 is not a good size for the hash table (verified
experimentally): the compression ratio becomes slightly worse. That is
because
hashShift = (hashBits + minMatchLength - 1) / minMatchLength
(15 + 3 - 1) / 3 = 5
(16 + 3 - 1) / 3 = 6
so at hashBits == 16, hashShift is incremented, which means we keep less
information about the oldest of the three hashed bytes. 1<<17 adds one
more bit, which restores the compression quality and gives an even
bigger advantage in terms of hash collision probability.
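To make that concrete (the update form h = (h<<hashShift + b) & hashMask
over minMatchLength bytes is my shorthand for what the code does, not a
verbatim quote of it):

// oldestByteBits reports how many bits of the oldest of the 3 hashed
// bytes survive in the final hash for a given table size. The oldest
// byte ends up shifted left by 2*hashShift before the hash is masked
// down to hashBits bits.
func oldestByteBits(hashBits int) int {
	const minMatchLength = 3
	hashShift := (hashBits + minMatchLength - 1) / minMatchLength
	low := 2 * hashShift // lowest bit position occupied by the oldest byte
	high := low + 8      // one past its highest bit position
	if high > hashBits {
		high = hashBits // bits above hashBits are masked away
	}
	if high < low {
		return 0
	}
	return high - low
}

// oldestByteBits(15) == 5 (hashShift 5: bits 10..14 of the hash)
// oldestByteBits(16) == 4 (hashShift 6: only bits 12..15 survive, worse)
// oldestByteBits(17) == 5 (hashShift 6: bits 12..16, quality restored)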
Issue 5569048: compress/flate: increase the length of hash table from ...
Created 12 years, 3 months ago by Ivan Krasin
Modified 12 years, 3 months ago