Problem 2: BWT

Wikipedia has a sample Python implementation of the Burrows-Wheeler transform. Let’s one it up and write a one-liner:

map(last, sorted(circular_shifts(s + '\0')))

Arguably, we need more_itertools.circular_shifts (source) and more_itertools.last (source), but they are also essentially one-liners with error-prone padding and some preparation.

This is naïve and “slow”. The alternative implementation here follows [F11], itself after [OS09], and as per the assignment uses suffix arrays. As a result it has linear time complexity.1 2 Turns out that even with all the bloat and Python involved, it is faster than the naïve variant for sequences above around 1000 elements. In addition, the memory requirements are “reasonable” (i.e. again linear) in contrast to the naïve method that requires storing all cyclic permutations at the same time for the sorting.

_images/bwt-runtime.svg

Finally, the claim in the problem statement is that BW encodings are more compressible. We verify this for random strings (and our own implementation of LZW compression!):

_images/bwt-compression.svg

Looks like I should have said “refute this”: for random strings there is no difference between compressing the original and the BWT: supposedly, they are both equally random; structured (and entertaining) text3, however, suffers noticeably from being weirdly transformed: the structures captured by the LZW algorithm are destroyed by the BW transform.4

A note on the implementation

In contrast to the codes in the other “modules” in this library, the implementation of the BWT does not focus (so much) on one-liner-ness. As such, it contains functions and even classes! The main workhorse is SABucket that, naturally, represents a “character bucket” for the suffix array construction. A hierarchy of helper classes is present in bwt.collections, while the suffix array-related functionality resides in bwt.suffix_array. Some thoughts and comments can be found throughout the source code, but they will not be expanded upon in the documentation.

Footnotes

1

The only caveat is the complexity in terms of the alphabet size, which is definitely not linear in our implementation since it requires sorting “character” buckets. Nevertheless, we are working under the paradigm that the given sequences are much longer than the alphabet size.

2

To do the inverse transformation also in linear time, we follow a recipe by [G11], which relies on being extremely clever.

3

It is left as an exercise to the reader to guess the “corpus” used. A hint can be found at the end of the discussion of Problem 3: LZW.

4

For reference, the compression level reaches 65% for the “meaningful” text. There is a slight tendency towards decreasing of the “BWT surplus” for very long sequences, which we do not explore.


scicomp.exam.kk.bwt.bwt_encode(seq: Iterable[_T])Iterable[_T][source]

Burrows-Wheeler transform (forward) using suffix arrays.

Parameters

seq – the sequence to encode. If not a collections.abc.Sequence, it will be implicitly converted to a list. Also, an end marker is not strictly required and will be appended automatically. If seq is a string, a zero byte ('\0') is recognised as an end marker and converted to the internal representation.

Returns

The Burrows-Wheeler encoded seq. See Wikipedia for more details.

Warning

Returns an iterator, so make sure to “consume” it before e.g. saving, or if you plan to re-use the result.

See also

bwt_decode

scicomp.exam.kk.bwt.bwt_decode(seq: Iterable[_T])Iterable[_T][source]

Burrows-Wheeler transform (reverse).

Method due to Gusfield [link].

Parameters

seq – the sequence to decode. In practice any iterables are accepted, particularly the result of bwt_encode. Note that the end marker has to be the internal representation. The easiest way to get it right is to use bwt_encode, and you should build custom sequences to decode only if you know what you’re doing.

Returns

The Burrows-Wheeler decoded seq. It is generally not of the same object type as the original, though, so care should be taken to do item-by-item comparison, especially with strings, which get decoded into an iterator of length-one strings (characters).

Warning

Returns an iterator, so make sure to “consume” it before e.g. saving, or if you plan to re-use the result.

See also

bwt_encode


References

F11

Fischer, Johannes (2011). Inducing the LCP-Array. Algorithms and Data Structures. Lecture Notes in Computer Science. 6844. p. 374. arXiv:1101.3448. doi:10.1007/978-3-642-22300-6_32.

OS09

D. Okanohara and K. Sadakane. A linear-time Burrows-Wheeler transform using induced sorting. Proc. SPIRE. 5721. pp. 90–101. Springer, 2009.

G11

Gusfield, Daniel (2011). A linear time BWT inversion method. https://www.cs.ucdavis.edu/~gusfield/cs224f11/BWTcs224.pdf