compress/flate: HuffmanOnly is sub-optimal #16132

dsnet · 2016-06-21T04:41:10Z

Using go1.7beta1

HuffmanOnly mode on flate.Writer sometimes encodes data as raw blocks instead of dynamic blocks when there is something to gain.

The attached file (repeats.bin inside the zip file) has the following byte histogram:

  |  00    01    02    03    04    05    06    07    08    09    0a    0b    0c    0d    0e    0f
--+-----------------------------------------------------------------------------------------------
00| 643,  364,  272,  689,  159,  116,  240,  200,  694,  584, 1364,  250,  317,  195,  654,  228,
10| 245,  194, 2002,  356,  185,  300,  174,  522,  258, 3314,  199,  123,  177,  420,  164,  314,
20| 287,  854,  492,  183,  183,  134,  297,  476, 1046,  283,  301,  189,  182,  186,  241,  571,
30| 815,  293, 1075,  272,  159,  340,  180,  627,  234,  267,  146,  343,  148,  252,  697,  230,
40| 118,  273,  252, 1139,  175,  276,  718,  205,  246,  197,  228,  205,  179,  233,  238,  113,
50| 200,  615,  461,  842, 1329,  311, 1260,  207,  698,  258,  893,  346,  266,  188,  193,  276,
60| 220,  311,  294, 1154,  628,  322,  198,  233,  186,  320,  177,  725,  170,  207,  263,  708,
70| 246,  222,  241,  310,  316,  886,  317, 1017,  288,  193,  534,  244,  917,  175,  303,  375,
80| 214,  259,  310,  308,  363,  774,  219,  170,  161,  149,  226, 1224,  245,  877,  155,  916,
90| 212,  226,  266,  232,  295,  463,  385,  210,  180,  204,  372,  204,  168,  269,  219,  861,
a0| 290,  257,  555,  247,  968,  526,  245,  183,  267,  344,  491,  206,   70, 1067,  614,  183,
b0| 331,  269,  868,  146,  233,  364,  244,  213,  208,  277,  151,  225,  169,  154,  237,  242,
c0| 290,  248,  215,  308,  235,  175,  295,  312,  251,  187,  696,  298,  331,  164,  355,  304,
d0| 236,  256, 1412,  664,  216,  313,  256,  713,  307,  337,  247,  327, 1133,  246,  754,  219,
e0| 240, 1182,  415,  512,  149,  935,  290,  281,  280,  882,  285,  258,  185,  208,  289,  642,
f0| 272,  406,  260,  662,  192,  180,  153,  290,  275,  261,  178,  271,  259,  479,  256,  356,

As can be seen, it is not completely random, and it is clear that some symbols appear more frequently than others. Currently, on go1.7beta1, the flate.Writer on HuffmanOnly mode does not even attempt to encode this dataset (and resorts to outputting raw blocks).

From an entropy perspective, we should be able to encode each symbol with 7.63 bits/symbol. With 10000 symbols, we should be able to compress this to approximately 95375 bytes. So unless the huffman table definition itself would have occupied more than 4625 bytes (unlikely), the writer should have chosen to use a dynamic block instead of a raw block.

On HuffmanOnly mode it may be worth exploring how we split up each block.

The text was updated successfully, but these errors were encountered:

dsnet · 2016-06-21T04:41:19Z

/cc @nigeltao @klauspost

klauspost · 2016-06-21T07:33:56Z

This is by design.

If we can only achieve less than a 16th of the input size in reduction, we do not spend CPU time encoding it, and store the block.

dsnet added this to the Go1.8 milestone Jun 21, 2016

dsnet self-assigned this Jun 21, 2016

nigeltao closed this as completed Jun 23, 2016

golang locked and limited conversation to collaborators Jun 23, 2017

gopherbot added the FrozenDueToAge label Jun 23, 2017

rsc unassigned dsnet Jun 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

compress/flate: HuffmanOnly is sub-optimal #16132

compress/flate: HuffmanOnly is sub-optimal #16132

dsnet commented Jun 21, 2016 •

edited

dsnet commented Jun 21, 2016

klauspost commented Jun 21, 2016

compress/flate: HuffmanOnly is sub-optimal #16132

compress/flate: HuffmanOnly is sub-optimal #16132

Comments

dsnet commented Jun 21, 2016 • edited

dsnet commented Jun 21, 2016

klauspost commented Jun 21, 2016

dsnet commented Jun 21, 2016 •

edited