Problem 3: LZW

Simple routines to encode/decode sequences using the Lempel–Ziv–Welch algorithm [LZW]. Note that this implementation does not deal with bytes and such. It uses a bag of unbounded size (well, I guess Py_ssize_t although Python’s integer type claims to be unbounded, and so maybe are the sizes of collections) and represents sequences of arbitrary elements as a sequence of integers. It is the job of the user to provide an adequate initial “alphabet” to cover all possible elements.

Wikipedia says “[t]he algorithm is simple to implement”, and sure enough both lzw_encode and lzw_decode are essentially one-liners with a bag initialisation step to boot.

Here is an example:

>>> list(lzw_encode('supercalifragilisticexpialidocious', string.ascii_lowercase))
[18, 20, 15, 4, 17, 2, 0, 11, 8, 5, 17, 0, 6, 8, 33, 18, 19, 8, 2, 4, 23, 15, 8,
 32, 8, 3, 14, 2, 8, 14, 20, 18]

If you do the counting, you’ll notice we saved two characters (compressed from 34 characters to 32 integers), accounting for the repetitions supercalifragi-li-sticexpi-al-idocious (these get indices 33 and 32 in the bag, respectively). You’ll notice that LZW is a little short-sighted since it only looks one character ahead (so that it can rebuild the bag afterwards), and so it misses the opportunity to repeat the whole “ali” the second time it occurs.

Round-tripping is exact only elementwise (i.e. strings get decoded as sequences of characters). Also, decoding requires the alphabet again, which can be annoying… or “useful” to e.g. make a needlessly complicated capitaliser:

>>> list(
...     lzw_decode(
...         lzw_encode('supercalifragilisticexpialidocious', string.ascii_lowercase),
...         string.ascii_uppercase))
['S', 'U', 'P', 'E', 'R', 'C', 'A', 'L', 'I', 'F', 'R', 'A', 'G', 'I', 'L', 'I', 'S',
 'T', 'I', 'C', 'E', 'X', 'P', 'I', 'A', 'L', 'I', 'D', 'O', 'C', 'I', 'O', 'U', 'S']

Since this is scientific computing, we will finish off with a plot. Have you ever wondered how compressible the digits of \(\pi\) in different bases are?1

_images/lzw-pi-bases.svg

Implicit information is something I made up in order to have pronounced peak and trough around 26 and 42. Get in touch if you want to know exactly what it is.

1

I know for one that \(\pi\) in base-\(\pi\) is very compressible!…


scicomp.exam.kk.lzw_encode(seq: Sequence[_T], alphabet: Iterable[_T])Iterable[int][source]

Encode a sequence using the Lempel–Ziv–Welch algorithm.

Parameters
  • seq – the sequence to encode.

  • alphabet – a collection of possible unique elements in seq. If you don’t know the alphabet, go back to first grade, or get it via set(seq).

Returns

The LZW encoding of seq as an iterable of indices into an ordered “bag” that starts off with just the alphabet and is dynamically built.

Warning

Returns an iterator, so make sure to “consume” it before e.g. saving, or if you plan to re-use the result.

See also

lzw_decode

scicomp.exam.kk.lzw_decode(idx: Iterable[int], alphabet: Sequence[_T])Iterable[_T][source]

Decode a sequence previously encoded using lzw_encode.

Parameters
  • idx – the output of lzw_encode, being an iterable of indices into a bag of elements.

  • alphabet – the alphabet used to encode. This has to be the same as the one passed to lzw_encode: no more, no fewer elements, in the same order, otherwise gibberish will ensue!

Returns

The decoded sequence. It is generally not of the same object type as the original, though, so care should be taken to do item-by-item comparison, especially with strings, which get decoded into an iterator of length-one strings (characters).

Warning

Returns an iterator, so make sure to “consume” it before e.g. saving, or if you plan to re-use the result.

See also

lzw_encode


References

LZW

Welch, Terry (1984). “A Technique for High-Performance Data Compression”. Computer. 17 (6): 8–19. doi:10.1109/MC.1984.1659158.