Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compress/flate: HuffmanOnly is sub-optimal #16132

Closed
dsnet opened this issue Jun 21, 2016 · 2 comments
Closed

compress/flate: HuffmanOnly is sub-optimal #16132

dsnet opened this issue Jun 21, 2016 · 2 comments
Milestone

Comments

@dsnet
Copy link
Member

dsnet commented Jun 21, 2016

Using go1.7beta1

HuffmanOnly mode on flate.Writer sometimes encodes data as raw blocks instead of dynamic blocks when there is something to gain.

The attached file (repeats.bin inside the zip file) has the following byte histogram:

  |  00    01    02    03    04    05    06    07    08    09    0a    0b    0c    0d    0e    0f
--+-----------------------------------------------------------------------------------------------
00| 643,  364,  272,  689,  159,  116,  240,  200,  694,  584, 1364,  250,  317,  195,  654,  228,
10| 245,  194, 2002,  356,  185,  300,  174,  522,  258, 3314,  199,  123,  177,  420,  164,  314,
20| 287,  854,  492,  183,  183,  134,  297,  476, 1046,  283,  301,  189,  182,  186,  241,  571,
30| 815,  293, 1075,  272,  159,  340,  180,  627,  234,  267,  146,  343,  148,  252,  697,  230,
40| 118,  273,  252, 1139,  175,  276,  718,  205,  246,  197,  228,  205,  179,  233,  238,  113,
50| 200,  615,  461,  842, 1329,  311, 1260,  207,  698,  258,  893,  346,  266,  188,  193,  276,
60| 220,  311,  294, 1154,  628,  322,  198,  233,  186,  320,  177,  725,  170,  207,  263,  708,
70| 246,  222,  241,  310,  316,  886,  317, 1017,  288,  193,  534,  244,  917,  175,  303,  375,
80| 214,  259,  310,  308,  363,  774,  219,  170,  161,  149,  226, 1224,  245,  877,  155,  916,
90| 212,  226,  266,  232,  295,  463,  385,  210,  180,  204,  372,  204,  168,  269,  219,  861,
a0| 290,  257,  555,  247,  968,  526,  245,  183,  267,  344,  491,  206,   70, 1067,  614,  183,
b0| 331,  269,  868,  146,  233,  364,  244,  213,  208,  277,  151,  225,  169,  154,  237,  242,
c0| 290,  248,  215,  308,  235,  175,  295,  312,  251,  187,  696,  298,  331,  164,  355,  304,
d0| 236,  256, 1412,  664,  216,  313,  256,  713,  307,  337,  247,  327, 1133,  246,  754,  219,
e0| 240, 1182,  415,  512,  149,  935,  290,  281,  280,  882,  285,  258,  185,  208,  289,  642,
f0| 272,  406,  260,  662,  192,  180,  153,  290,  275,  261,  178,  271,  259,  479,  256,  356,

As can be seen, it is not completely random, and it is clear that some symbols appear more frequently than others. Currently, on go1.7beta1, the flate.Writer on HuffmanOnly mode does not even attempt to encode this dataset (and resorts to outputting raw blocks).

From an entropy perspective, we should be able to encode each symbol with 7.63 bits/symbol. With 10000 symbols, we should be able to compress this to approximately 95375 bytes. So unless the huffman table definition itself would have occupied more than 4625 bytes (unlikely), the writer should have chosen to use a dynamic block instead of a raw block.

On HuffmanOnly mode it may be worth exploring how we split up each block.

@dsnet dsnet added this to the Go1.8 milestone Jun 21, 2016
@dsnet dsnet self-assigned this Jun 21, 2016
@dsnet
Copy link
Member Author

dsnet commented Jun 21, 2016

/cc @nigeltao @klauspost

@klauspost
Copy link
Contributor

This is by design.

If we can only achieve less than a 16th of the input size in reduction, we do not spend CPU time encoding it, and store the block.

@golang golang locked and limited conversation to collaborators Jun 23, 2017
@rsc rsc unassigned dsnet Jun 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants