Huffman Coding: Squeezing Names into Fewer Bits
How a 1952 idea by a graduate student lets us pack a name into 60 bits instead of 200 — and why a3.0 primes can be reversed.
Huffman Coding: Squeezing Names into Fewer Bits
One of the prettiest ideas in computer science — and directly useful any time you need to pack text into as few bits as possible.
The problem fixed-width encoding solves badly
ASCII gives every character 8 bits. So “Alexander” (9 chars) = 72 bits regardless of what the letters are. But:
- We only need ~30 symbols for names:
A-Z, space, hyphen, apostrophe. - 30 symbols fits in 5 bits (2^5 = 32). Already we’re at 45 bits for “Alexander”. Better.
- But even 5 bits is wasteful: we’re paying the same for the rare “Q” as for the common “E”.
Letter frequency in English names (approximate, from US census data):
A 8.1% H 3.4% O 6.4% V 1.0%
B 1.5% I 6.5% P 2.0% W 1.3%
C 3.8% J 1.5% Q 0.1% X 0.2%
D 3.5% K 1.5% R 6.7% Y 2.0%
E 10.5% L 5.5% S 6.8% Z 0.3%
F 1.2% M 3.3% T 6.5% space 8%
G 2.0% N 7.8% U 2.7% other ~1%
Notice: A, E, N, R, S, T are each >6%. Q, X, Z are <0.5%. A 10x ratio.
The core insight: variable-length codes
What if common letters got short codes and rare letters got long codes? Like Morse code:
E = . (1 symbol)
T = - (1 symbol)
Q = --.- (4 symbols)
Morse achieves this informally; Huffman achieves it provably optimally for any given frequency distribution.
The prefix-code constraint
There’s a problem with variable-length codes: how do you know where one letter ends and the next begins? If A=01 and B=011, then 011 is ambiguous.
Solution: require that no codeword is a prefix of another. This is called a prefix code (or “prefix-free code”). Then decoding is unambiguous — you read bits left-to-right and emit a letter the moment you’ve matched a complete codeword.
Prefix codes correspond exactly to binary trees where letters live at the leaves:
root
/ \
0 1
/ \ / \
A B ...
To decode: start at root, read a bit, go left (0) or right (1), repeat until you hit a leaf, emit that letter, restart at root.
Huffman’s algorithm
Given letter frequencies, build the optimal prefix-code tree:
- Make a leaf node for each letter, weighted by its frequency.
- Pick the two lowest-frequency nodes. Merge them into a new internal node whose weight = sum of children.
- Repeat until one node remains. That’s the root.
- Label left edges 0, right edges 1. Each leaf’s path from root is its codeword.
Worked example with a tiny alphabet (frequencies in %):
E: 40 A: 25 N: 20 R: 10 Q: 5
Step 1 — pick R(10) and Q(5), merge → RQ(15):
E: 40 A: 25 N: 20 RQ: 15
Step 2 — pick N(20) and RQ(15), merge → NRQ(35):
E: 40 NRQ: 35 A: 25
Step 3 — pick A(25) and NRQ(35), merge → ANRQ(60):
ANRQ: 60 E: 40
Step 4 — merge → root(100). Done. The tree:
root
/ \
E ANRQ
(0) / \
A NRQ
(10) / \
N RQ
(110) / \
R Q
(1110)(1111)
Codebook:
E = 0 (1 bit) — the most common letter, cheapest
A = 10 (2 bits)
N = 110 (3 bits)
R = 1110 (4 bits)
Q = 1111 (4 bits) — rarest, most expensive
Average bits per letter = 0.40·1 + 0.25·2 + 0.20·3 + 0.10·4 + 0.05·4 = 1.9 bits, vs. 3 bits for fixed-width (5 letters needs ⌈log₂5⌉ = 3 bits). Already a ~37% saving.
Why it works
- Huffman is greedy but provably optimal — a rare property. The proof is a slick exchange argument: in any optimal tree, the two least-frequent symbols must be siblings at maximum depth, so you might as well pair them first.
- The lower bound it approaches is Shannon entropy:
H = -Σ p_i log₂(p_i). For English letters, H ≈ 4.1 bits/char. Huffman gets within 1 bit of entropy, always. - Huffman is optimal per-symbol. Arithmetic coding and ANS (the algorithm in Zstandard and most modern compressors) beat it by encoding multiple symbols into fractional bits, getting arbitrarily close to entropy.
A bit of history
Huffman coding was invented in 1952 by David Huffman, then a 25-year-old MIT grad student. His professor, Robert Fano, offered students a choice: take a final exam, or write a term paper finding an optimal prefix code. Huffman chose the paper, struggled for months, and almost gave up — until he hit on the bottom-up merging approach. The result beat the best-known scheme (Shannon-Fano coding, devised by his own professor and Claude Shannon).
It has since been used in JPEG, MP3, ZIP, PNG, HTTP/2 header compression (HPACK), and countless other formats.
Applied: how much can it save?
For a full English-name alphabet (~30 symbols including space/hyphen/apostrophe), realistic numbers:
Fixed 8-bit ASCII: "Alexander" = 72 bits
Fixed 5-bit (30 symbols): "Alexander" = 45 bits
Huffman (English names): "Alexander" ≈ 36 bits (avg ~4 bits/char)
Huffman (entropy bound): "Alexander" ≈ 37 bits (very close to optimal)
For “Alexander Oscar Daniel Taggart” (30 chars):
Fixed 8-bit: 240 bits
Fixed 5-bit: 150 bits
Huffman ~4-bit: 120 bits — half the size of ASCII
Huffman buys you the most savings on long inputs, where the per-character savings compound.
The codebook problem
To decode a Huffman-compressed string, you need the codebook (the tree). Two options:
- Static codebook — bake one fixed tree into the algorithm, derived from a corpus. Both encoder and decoder hardcode it. Simple, fast, no overhead per message. Slight inefficiency for unusual inputs (“Xander” pays a lot for that X).
- Dynamic codebook — build a fresh tree per message and ship it alongside the data. Optimal compression but huge overhead for short inputs.
For most applications where messages are small and the domain is known (names, log entries, a known protocol), static is the obvious choice. Build the tree once from a representative corpus, commit it as a constant, and version it if it ever changes.
Practical encoding shape
In TypeScript, the encoder is delightfully short:
// codebook derived offline from a representative corpus
const HUFFMAN: Record<string, string> = {
E: '0', A: '100', N: '101',
R: '1100', S: '1101', T: '11100',
// ...
Q: '111111110',
Z: '111111111',
};
function encode(name: string): Uint8Array {
const bits = [...name.toUpperCase()]
.map(c => HUFFMAN[c] ?? HUFFMAN['?'])
.join('');
return packBits(bits); // pad to byte boundary
}
Decoding walks the tree bit-by-bit, exactly as described above.
Caveats worth knowing
- Unknown characters — accented letters (José), CJK, emoji. You need an “escape hatch” symbol in the codebook that means “next 16/24 bits are a literal Unicode code point.” This is what real compressors do.
- The padding has to be unambiguous. If you pad bits to fit a byte or block boundary, the decoder needs to know where the real data ends. Easiest fix: encode length as a prefix (e.g., 6 bits for “length up to 63”), or use a dedicated end-of-message symbol in the tree.
- Frequency drift — a codebook tuned on US English will compress international inputs worse. Worth measuring against the actual data distribution.
- Bigram / trigram coding beats single-character Huffman significantly for natural text — “TH”, “ER”, “IN”, “AN” are extremely common pairs. But the codebook gets large and the marginal gain is small. Worth it only when every bit counts.
The bigger picture
Huffman coding is a window into one of the deep ideas of computer science: information has a measurable size, and it’s smaller than the way we usually write it down. Shannon’s entropy formula gives the theoretical floor; Huffman gets you within one bit of it; arithmetic coding and ANS get arbitrarily close. The reason .zip and .jpg and .mp3 files exist at all is that the naive encoding of text, images, and sound is wildly redundant — and that redundancy is exactly what compression algorithms eat.