HPACK was designed to provide header compression in HTTP/2. It uses both Huffman encoding for compressing strings and index tables to reduce the amount of over-the-wire header data needed in an HTTP/2 request. I found the specification interesting and wrote a Golang library that implements it, skip here to read about it.
Huffman encoding works by assigning short codes to frequently used characters. The HPACK specification defines the various codes to encode all characters, so when a peer recieves a header that is Huffman encoded it consults the list of short codes to resolve the character (aka symbol). The end result is that when a header uses frequently used characters, like alphanumeric, it takes less bits to transmit.
For example, in the string accept we would count the frequencies of each character and generate a code (with the shortest codes going to the most frequent):
c - 0 a - 101 e - 111 p - 110 t - 100
Then, to transmit accept we just replace each character with its code:
101 0 0 111 110 100
The original string encoded with 7 bits per character (ASCII) would take 6 bytes. Our Huffman encoded version only takes 14 bits, or 2 bytes.
Header indexing is another method in HPACK to reduce the amount of data needed for headers.
The Static Table contains frequently used headers (like :method, :status, user-agent, etc,.) and is defined in the HPACK specification. If a peer sends a header that exists in the static table, the index is used instead of the string literal.
The Dynamic Table is per-connection and takes advantage of the fact that during HTTP message exchanges there are headers that are frequently re-used which can benefit from being indexed. The User-Agent header is a good example of this.
The library that I wrote was written with performance in mind and implements a very fast Huffman decoding routine.
Simple Huffman decoders use a binary tree and traverse the tree while reading individual bits from the input stream (in this case, a header field). This can take quite a few operations to resolve the symbol as each bit needs to be looked at. The technique that I implemented is to use a series of lookup tables that work within the bounds of the max code length, which is 30 bits in HPACK.
The lookup table design works off the notion that Huffman codes can’t contain a prefix that maps to an existing code (aka the “prefix property”). This unique property allows us to build tables that are indexed by the Huffman code plus all other values (up until the 8-bit boundary).
Huffman and the “Prefix Property”
The prefix property states that there exists no other code that has, as a prefix, another existing code. For example, given a system with the following code:
If we were to add the following code to the system
it violates the prefix property because it has a prefix that is an already existing code. But, this can be confusing, as the following is a valid code:
And one might comment, but don’t codes 1 and 3 have the same prefix? The answer is yes, but while they both start with the same series of bits, there is no code in the system for 01 and therefore there is no ambiguity. Thus, the prefix property states that no other codes can start with a series of bits that corresponds to an existing code.
If you look at a Huffman tree it is easy to see this property in action – only leaf nodes have a symbol associated with them. Therefore, by definition, a leaf node has no children and thus can never be part of another Huffman code.