Problem 1: KMP

The Knuth–Morris–Pratt [KMP] algorithm for locating a needle in a haystack (if, of course, the needle and haystack are ordered sequences and the needle is not infinite). This is not a particularly difficult task if you are smart enough to fugure out a “(non-)backtracking table” which helps you exploit the fact that you already know the “characters” you looked at.

kmp_search1 works as you would expect (or not?):

>>> kmp_search('Waldo', 'Where\'s Waldo?')
8
>>> kmp_search('', '')
0
>>> kmp_search('love', 'a hopeless place')  # None

It is faster, due to a better algorithmic complexity, than the naïve search-at-every-position approach2 and scales asymptotically the same as the builtin subsequence search routines in Python. There is a mere \(\sim 426 \times\) difference in the constant; however, we totally don’t care about it.

_images/kmp.svg

Footnotes

1

It is a rather perverted one-liner: see the source of kmp_search. In fact, the table initialisation can be further inlined, but enough is enough.

2

Also a one-liner: first((i for i in range(len(haystack) - len(needle)) if haystack[i:i+len(needle)] == needle), None) using more_itertools.first.


Find the first occurence of needle in haystack using the KMP algorithm.

Parameters
  • needle – subsequence to search for

  • haystack – sequence to search in

Returns

The index of the first occurence of needle in haystack or None if not found. In general (unless None, of course),:

haystack[kmp_search(needle, haystack):][:len(needle)] == needle

which means that, like the builtin index / find methods of Python strings, for an empty needle, this returns 0.

Notes

Since random access into the needle is needed, it is required to be a sequence. The haystack on the other hand can be any iterable.

Comparison is performed using ==, so the usual equality rules apply.


References

KMP

Knuth, Donald; Morris, James H.; Pratt, Vaughan (1977). “Fast pattern matching in strings”. SIAM Journal on Computing. 6 (2): 323–350. doi:10.1137/0206024.