1 bit = {0 or 1} = 2 possible values
1 byte = 8 bits = 256 possible combinations of 0s and 1s = 256 possible values
These 256 values usually map to characters: the letters of the English alphabet (both lowercase and uppercase), the digits 0 through 9, and symbols like a backslash ( \ ) or a percent sign ( % ).
If the word is “bananarama”, why represent each character with 8 bits?
01100010 01100001 01101110 01100001 01101110 01100001 01110010 01100001 01101101 01100001 = 10 bytes
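Here’s a quick Python sketch, just for illustration, that reproduces that bit string:

```python
# Print the 8-bit ASCII code of each character in the word.
word = "bananarama"
print(" ".join(format(ord(c), "08b") for c in word))
print(len(word.encode("ascii")), "bytes")  # 10 bytes
```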
First of all, there are only 5 unique characters, so we need at most 3 bits per character (3 bits give 2³ = 8 possible codes, more than enough for 5):
b - 000
a - 001
n - 010
r - 011
m - 100
Making:
000 001 010 001 010 001 011 001 100 001
= 00000101 00010100 01011001 100001(00) = 4 bytes, where the trailing (00) pads the last byte out to a full 8 bits
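As a rough sketch of how that packing works in Python (assuming, as above, that the final byte is padded with zeros):

```python
# Encode "bananarama" with the fixed 3-bit codes, then pack into bytes.
codes = {"b": "000", "a": "001", "n": "010", "r": "011", "m": "100"}

bitstring = "".join(codes[c] for c in "bananarama")
padded = bitstring + "0" * (-len(bitstring) % 8)  # pad to a byte boundary
packed = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))

print(bitstring)    # 000001010001010001011001100001
print(len(packed))  # 4 bytes
```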
But we can compress even further if we assign shorter codes to values that appear more frequently. This is called “entropy coding”: frequent values cost fewer bits than rare ones, so the average number of bits per value approaches the theoretical minimum set by the data’s entropy.
| value | frequency | code  |
|-------|-----------|-------|
| a     | 5         | 1     |
| n     | 2         | 01    |
| b     | 1         | 001   |
| r     | 1         | 0001  |
| m     | 1         | 00001 |
Making:
001 1 01 1 01 1 0001 1 00001 1
= 00110110 11000110 00011(000) = 3 bytes
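The same sketch works with the variable-length codes from the table:

```python
# Encode with the variable-length codes from the frequency table.
codes = {"a": "1", "n": "01", "b": "001", "r": "0001", "m": "00001"}

bitstring = "".join(codes[c] for c in "bananarama")
print(bitstring, len(bitstring), "bits")  # 001101101100011000011 21 bits
print(-(-len(bitstring) // 8), "bytes")   # 3 bytes after padding
```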
Notice that each code just adds an extra 0 to the front of the previous one. It is important that no code is a prefix of another code, since there are no spaces to mark where one code ends and the next begins. With a standard fixed-width encoding, every code is 8 bits long, so this is only a concern when you have variable-length codes. Think about this encoding scheme:
a - 1
b - 10
c - 01
With the word:
“cab”
The encoding of the word would be:
01110
When decoding the word, we start at the first bit:
0
There is no value with the code [0], so we add the next bit:
01
The letter “c” corresponds to the code [01], so this must be a “c”. We now have:
c110
Proceeding to the next bit, we see:
1
The letter “a” corresponds to the code [1], so this must be an “a”. We now have:
ca10
Proceeding to the next bit, it is another 1, so we add another “a”:
caa0
Proceeding to the next bit, it is a 0, which does not match any of our codes. There are no more bits left, so we discard the leftover 0 and are left with:
caa
As you can see, since we had no way to know that we were supposed to read “10” and get “b” instead of reading “1” and getting an “a”, we decoded the message incorrectly.
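To make the failure concrete, here’s a minimal greedy decoder sketch in Python; it mirrors the bit-by-bit walkthrough and quietly produces the wrong answer because the scheme is not prefix-free:

```python
# Greedy bit-by-bit decoder: emit a value as soon as the buffered bits
# match a code. This is only reliable when no code is a prefix of another.
def greedy_decode(bits, codes):
    lookup = {code: value for value, code in codes.items()}
    decoded, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in lookup:
            decoded.append(lookup[buffer])
            buffer = ""
    return "".join(decoded), buffer  # buffer holds any leftover bits

codes = {"a": "1", "b": "10", "c": "01"}  # "1" is a prefix of "10"
print(greedy_decode("01110", codes))      # ('caa', '0'), not 'cab'
```

Because [1] is a prefix of [10], this decoder can never match a “b” at all: the buffer empties the moment it contains [1].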
This idea of giving frequent values shorter prefix-free codes is the basis of Huffman coding, an algorithm that builds the optimal set of such codes for a given set of frequencies. If there’s interest, I may explore more complex techniques.
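For the curious, here is a rough sketch of Huffman’s algorithm itself, using Python’s heapq as a min-heap: repeatedly merge the two lowest-frequency nodes into one, so that rare symbols end up deeper in the tree and receive longer codes.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    # One heap entry per symbol: [frequency, tie-breaker, payload].
    # The unique tie-breaker keeps heapq from ever comparing payloads.
    heap = [[freq, i, sym] for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)  # the two lowest-frequency nodes...
        hi = heapq.heappop(heap)
        heapq.heappush(heap, [lo[0] + hi[0], tiebreak, (lo, hi)])  # ...merged
        tiebreak += 1
    codes = {}
    def walk(node, prefix):
        payload = node[2]
        if isinstance(payload, str):        # leaf: a single symbol
            codes[payload] = prefix or "0"  # edge case: one-symbol input
        else:                               # internal node: recurse
            walk(payload[0], prefix + "0")
            walk(payload[1], prefix + "1")
    walk(heap[0], "")
    return codes

print(huffman_codes("bananarama"))
# {'a': '0', 'b': '100', 'r': '101', 'm': '110', 'n': '111'}
```

On “bananarama” this comes to 5×1 + 1×3 + 1×3 + 1×3 + 2×3 = 20 bits, one bit fewer than the hand-built code above, and still 3 bytes after padding.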