Web Compression 101
Compression 101
LZ77
LZ77 replaces some portions of the input string with pointers/backreferences to identical substrings already seen. These backreferences are encoded as a <length, distance> tuple. So for example abcdeabcdf can be compressed to abcde<4,5>f.
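As a sketch of those mechanics (using a made-up token format, not any real LZ77 bitstream), here is how a decoder expands backreferences:

```python
# Minimal LZ77-style decoder sketch. Tokens are either a literal character
# or a (length, distance) backreference into the output produced so far.
def lz77_decode(tokens):
    out = []
    for tok in tokens:
        if isinstance(tok, str):          # literal
            out.append(tok)
        else:                             # (length, distance) backreference
            length, distance = tok
            start = len(out) - distance
            for i in range(length):       # byte-by-byte copy (allows overlapping matches)
                out.append(out[start + i])
    return "".join(out)

# "abcdeabcdf" compressed as: a b c d e <4,5> f
tokens = ["a", "b", "c", "d", "e", (4, 5), "f"]
assert lz77_decode(tokens) == "abcdeabcdf"
```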
Huffman Coding
Huffman Coding takes an alphabet and an input string, and walks the input string to find the frequency of the characters in the alphabet. Every character in the alphabet used 1 or more times is given a variable-length prefix code, where the length is inversely proportional to its frequency. Then every character’s unique code can be laid end-to-end to losslessly encode the input string.
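Here is a minimal sketch of that construction in Python (a plain min-heap over subtrees, not how zlib or brotli actually build their code tables):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix code: frequent symbols get shorter bit strings."""
    freq = Counter(text)
    # Heap entries: (frequency, tie-breaker, {symbol: code_so_far})
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one bit to every code inside them.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("abracadabra")
encoded = "".join(codes[ch] for ch in "abracadabra")
print(codes)         # e.g. {'a': '0', 'b': '110', ...}: 'a' gets the shortest code
print(len(encoded))  # well under 8 bits per character
```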
A prefix code is a code with the ‘prefix property’ where no whole code word in the system is a prefix of another code word in the system. My assumption was that this results in ‘reserving’ characters for unique prefixes (and would result in rarely-used characters having longer Huffman codes than their normal ascii representation, and thus a completely random input would be larger after Huffman coding), but it seems that is wrong.
From Wikipedia:
A fixed-length code is necessarily a prefix code. It is possible to turn any code into a fixed-length code by padding fixed symbols to the shorter prefixes in order to meet the length of the longest prefixes.
So it seems to follow that a prefix code without padding ensures every symbol in the alphabet can be represented by no more bits than the fixed-length code.
The ‘well technically’ is that a Huffman encoding of an input string must be sent along with the dictionary (mapping codes to characters) in order to decompress. However, this dictionary’s size is proportional to your alphabet (small), whereas the compression achieved is mostly proportional to your input string’s length and diversity. Scrabble tile values, Zipf’s law, and ETAOIN SHRDLU are all real phenomena suggesting that symbol frequencies are power-law distributed, so even with the overhead of transmitting a dictionary, Huffman Coding generally has ROI > 1.
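As a quick illustration of that skew (the sample sentence below is made up, not from any benchmark), counting letters in ordinary English text shows a heavily lopsided distribution, which is what lets the variable-length codes pay for the dictionary overhead:

```python
from collections import Counter

sample = (
    "compression works because real text is repetitive and its "
    "symbol frequencies are heavily skewed rather than uniform"
)
counts = Counter(sample.replace(" ", ""))
for sym, n in counts.most_common(5):
    print(sym, n)      # the usual ETAOIN-style suspects dominate
print(len(counts), "distinct symbols,", sum(counts.values()), "total")
```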
gzip
The compression algorithm used in gzip is called “deflate,” and it’s a combination of the LZ77 and Huffman algorithms discussed above.
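If you want to see deflate in action, Python’s standard-library zlib and gzip modules wrap it; level 9 below is just an example setting:

```python
import gzip
import zlib

data = b"abcdeabcdf" * 1000           # repetitive input, so LZ77 has lots to find
raw_deflate = zlib.compress(data, 9)  # zlib-wrapped deflate stream
gz = gzip.compress(data, 9)           # same deflate stream with a gzip header/trailer
print(len(data), len(raw_deflate), len(gz))
```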
The Huffman coding technique is not applied only to alphabet symbols, but also to the length and distance tokens LZ77 uses to specify substring deduplication.
For LZ77’s length and distance, gzip allows substrings of length 3..258, and distances of at most 32768 bytes.
For Huffman coding, deflate uses two alphabets. The first alphabet uses 9 bits (286 values) to encode the input literals (“letters” 0-255), the end-of-block symbol (256), and tokens representing the length (257-285). These 29 length tokens aren’t enough to represent all 256 possible length values, so each of them maps to a number of ‘extra bits’ (0-5) that should be written (when compressing) or read (when decompressing) to disambiguate between them. references: stack overflow question, RFC 1951, wikipedia
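For concreteness, here is a Python sketch of that mapping, with the length table transcribed from RFC 1951; the distance alphabet described next works the same way, just with its own table and bit counts:

```python
# How deflate maps an LZ77 match length (3..258) onto a length symbol
# (257..285) plus "extra bits", per the table in RFC 1951 section 3.2.5.
# Each row: (symbol, number of extra bits, smallest length it covers).
LENGTH_TABLE = [
    (257, 0, 3), (258, 0, 4), (259, 0, 5), (260, 0, 6), (261, 0, 7),
    (262, 0, 8), (263, 0, 9), (264, 0, 10), (265, 1, 11), (266, 1, 13),
    (267, 1, 15), (268, 1, 17), (269, 2, 19), (270, 2, 23), (271, 2, 27),
    (272, 2, 31), (273, 3, 35), (274, 3, 43), (275, 3, 51), (276, 3, 59),
    (277, 4, 67), (278, 4, 83), (279, 4, 99), (280, 4, 115), (281, 5, 131),
    (282, 5, 163), (283, 5, 195), (284, 5, 227), (285, 0, 258),
]

def encode_length(length):
    """Return (symbol, extra_bits, extra_value) for a match length."""
    assert 3 <= length <= 258
    for symbol, extra_bits, base in reversed(LENGTH_TABLE):
        if length >= base:
            return symbol, extra_bits, length - base
    raise AssertionError

print(encode_length(3))    # (257, 0, 0): no extra bits needed
print(encode_length(100))  # (279, 4, 1): symbol 279 plus 4 extra bits
```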
The second alphabet encodes only the distances that LZ77 found. It’s an alphabet of 30 symbols, where each symbol implies 0-13 additional bits.
Brotli
Brotli, deflate, and gzip
Brotli also uses a sliding window to make backreferences. However, whereas the max distance supported by gzip is 32KB, for brotli it is configurable to any power of 2 between 1KB and 16MB.
Here is a quote justifying that default which I found hilarious:
This difference is almost irrelevant in web-server context, as text files larger than 32KB are the minority.
This article is from 2015, yet somehow this makes it feel a hundred years old. Sure, Google’s response body is only 19kB, but CNN’s is 2.5MB and Amazon’s is 571kB.
Brotli also tolerates a wider range of match lengths (2-16,779,333 bytes).
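If you want to experiment with the window size, Google’s brotli Python bindings expose it as lgwin (log2 of the window size); the file name and settings below are just illustrative:

```python
import brotli  # Google's reference Python bindings (pip install brotli)

data = open("big_page.html", "rb").read()                   # hypothetical input file
small_window = brotli.compress(data, quality=11, lgwin=16)  # ~64KB window
large_window = brotli.compress(data, quality=11, lgwin=24)  # ~16MB window
print(len(small_window), len(large_window))  # larger window usually wins on big inputs
```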
Static dictionary
In addition to the dynamic dictionary computed for LZ77 deduplication (where re-used substrings can only be deduplicated within the same block), brotli also ships with a static dictionary.
Context Modelling
Context modelling makes it possible to have different Huffman tree dictionaries within a single block. When a single block contains sections expected to have different prevalence of characters, you can instruct brotli to switch context (and thus which dictionaries get used) to get locally improved compression.
In brotli, each meta-block is made up of a header and a sequence of commands. Commands have 3 parts. The first part is an <insert, copy> word, which specifies how many literals to insert and how many bytes to copy; insert can be 0, and copy can only be 0 if it’s the last command in the meta-block. The next insert words are <lit> words (“insert literals”, values specified by their Huffman code). After the literals comes a <distance> word, which (together with copy) specifies which earlier bytes to copy via a backreference.
There are 3 dictionaries here: one for <lit>s (where the alphabet covers all byte values, 0-255), one for <distance>s (where the alphabet size depends on parameters like the sliding window size), and one for <insert, copy>s (where the alphabet is 704 symbols, encoding combinations of insert and copy lengths).
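As a toy illustration of that command structure, here is a simplified decoder in Python. It is not the real brotli wire format (no Huffman decoding, no context modelling, no static dictionary), just the control flow the <insert, copy> commands describe:

```python
# Toy decoder for a brotli-like command stream (not the real bit format).
# Each command: (literals_to_insert, copy_length, copy_distance).
def decode_commands(commands):
    out = bytearray()
    for literals, copy_len, distance in commands:
        out.extend(literals)                 # "insert" part: emit the literals
        if copy_len:                         # "copy" part: expand a backreference
            start = len(out) - distance
            for i in range(copy_len):
                out.append(out[start + i])
    return bytes(out)

commands = [
    (b"abcde", 4, 5),   # insert "abcde", then copy 4 bytes from 5 back
    (b"f", 0, 0),       # final command may have copy == 0
]
print(decode_commands(commands))  # b'abcdeabcdf'
```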
This means that, for each of the alphabets, up to 256 different Huffman trees can be used in the same block. The switch between different trees is determined by “context”. This can be useful when the compressed file consists of different types of characters, for example binary data interleaved with UTF-8 strings, or a multilingual dictionary.
Benchmarking
Compression Quality
Compression Speed
Connection Speedup
For on-the-fly compression, the most important question is how much time to invest in compression to make the overall data transfer faster. Because increasing compression quality gives only incremental improvement over the previous level, we need the bytes saved (and therefore the transfer time saved) to outweigh the additional time spent compressing.
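A back-of-the-envelope version of that trade-off, with made-up numbers, looks like this: bumping the quality level is only worth it if the transfer time saved by the smaller payload exceeds the extra time spent compressing.

```python
# Hypothetical numbers for a single response, to illustrate the break-even point.
size_low_quality = 120_000    # bytes after compressing at a lower quality
size_high_quality = 100_000   # bytes after compressing at a higher quality
extra_compress_time = 0.050   # extra seconds spent on the higher quality
bandwidth = 1_000_000 / 8     # 1 Mbit/s connection, in bytes per second

transfer_time_saved = (size_low_quality - size_high_quality) / bandwidth  # 0.16 s
worth_it = transfer_time_saved > extra_compress_time
print(transfer_time_saved, worth_it)  # the higher quality wins on this slow link
```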
Conclusions
There is no yes/no answer to the question “Is Brotli better than gzip?”. It definitely looks like a big win for static content compression, but on the web where the content is dynamic we also need to consider on-the-fly compression.
The way I see it, Brotli already has an advantage over zlib for large files (larger than 64KB) on slow connections. However, those constitute only 20% of our sampled dataset (and 80% of the total size).