Problem 2: BWT¶
Wikipedia has a sample Python implementation of the Burrows-Wheeler transform. Let’s one it up and write a one-liner:
map(last, sorted(circular_shifts(s + '\0')))
Arguably, we need more_itertools.circular_shifts (source)
and more_itertools.last (source),
but they are also essentially one-liners with error-prone padding and some
preparation.
This is naïve and “slow”. The alternative implementation here follows [F11], itself after [OS09], and as per the assignment uses suffix arrays. As a result it has linear time complexity.1 2 Turns out that even with all the bloat and Python involved, it is faster than the naïve variant for sequences above around 1000 elements. In addition, the memory requirements are “reasonable” (i.e. again linear) in contrast to the naïve method that requires storing all cyclic permutations at the same time for the sorting.
Finally, the claim in the problem statement is that BW encodings are more compressible. We verify this for random strings (and our own implementation of LZW compression!):
Looks like I should have said “refute this”: for random strings there is no difference between compressing the original and the BWT: supposedly, they are both equally random; structured (and entertaining) text3, however, suffers noticeably from being weirdly transformed: the structures captured by the LZW algorithm are destroyed by the BW transform.4
A note on the implementation
In contrast to the codes in the other “modules” in this library, the
implementation of the BWT does not focus (so much) on one-liner-ness. As such,
it contains functions and even classes! The main workhorse is SABucket that,
naturally, represents a “character bucket” for the suffix array construction.
A hierarchy of helper classes is present in bwt.collections, while the suffix array-related
functionality resides in bwt.suffix_array.
Some thoughts and comments can be found throughout the source code, but they
will not be expanded upon in the documentation.
Footnotes
- 1
The only caveat is the complexity in terms of the alphabet size, which is definitely not linear in our implementation since it requires sorting “character” buckets. Nevertheless, we are working under the paradigm that the given sequences are much longer than the alphabet size.
- 2
To do the inverse transformation also in linear time, we follow a recipe by [G11], which relies on being extremely clever.
- 3
It is left as an exercise to the reader to guess the “corpus” used. A hint can be found at the end of the discussion of Problem 3: LZW.
- 4
For reference, the compression level reaches 65% for the “meaningful” text. There is a slight tendency towards decreasing of the “BWT surplus” for very long sequences, which we do not explore.
-
scicomp.exam.kk.bwt.bwt_encode(seq: Iterable[_T]) → Iterable[_T][source]¶ Burrows-Wheeler transform (forward) using suffix arrays.
- Parameters
seq – the sequence to encode. If not a
collections.abc.Sequence, it will be implicitly converted to alist. Also, an end marker is not strictly required and will be appended automatically. Ifseqis a string, a zero byte ('\0') is recognised as an end marker and converted to the internal representation.- Returns
The Burrows-Wheeler encoded
seq. See Wikipedia for more details.
Warning
Returns an iterator, so make sure to “consume” it before e.g. saving, or if you plan to re-use the result.
See also
-
scicomp.exam.kk.bwt.bwt_decode(seq: Iterable[_T]) → Iterable[_T][source]¶ Burrows-Wheeler transform (reverse).
Method due to Gusfield [link].
- Parameters
seq – the sequence to decode. In practice any iterables are accepted, particularly the result of
bwt_encode. Note that the end marker has to be the internal representation. The easiest way to get it right is to usebwt_encode, and you should build custom sequences to decode only if you know what you’re doing.- Returns
The Burrows-Wheeler decoded
seq. It is generally not of the same object type as the original, though, so care should be taken to do item-by-item comparison, especially with strings, which get decoded into an iterator of length-one strings (characters).
Warning
Returns an iterator, so make sure to “consume” it before e.g. saving, or if you plan to re-use the result.
See also
References
- F11
Fischer, Johannes (2011). Inducing the LCP-Array. Algorithms and Data Structures. Lecture Notes in Computer Science. 6844. p. 374. arXiv:1101.3448. doi:10.1007/978-3-642-22300-6_32.
- OS09
D. Okanohara and K. Sadakane. A linear-time Burrows-Wheeler transform using induced sorting. Proc. SPIRE. 5721. pp. 90–101. Springer, 2009.
- G11
Gusfield, Daniel (2011). A linear time BWT inversion method. https://www.cs.ucdavis.edu/~gusfield/cs224f11/BWTcs224.pdf