Problem 1: KMP¶
The Knuth–Morris–Pratt [KMP] algorithm for locating a needle in a haystack (if, of course, the needle and haystack are ordered sequences and the needle is not infinite). This is not a particularly difficult task if you are smart enough to fugure out a “(non-)backtracking table” which helps you exploit the fact that you already know the “characters” you looked at.
kmp_search1 works as you would expect (or not?):
>>> kmp_search('Waldo', 'Where\'s Waldo?')
8
>>> kmp_search('', '')
0
>>> kmp_search('love', 'a hopeless place') # None
It is faster, due to a better algorithmic complexity, than the naïve search-at-every-position approach2 and scales asymptotically the same as the builtin subsequence search routines in Python. There is a mere \(\sim 426 \times\) difference in the constant; however, we totally don’t care about it.
Footnotes
- 1
It is a rather perverted one-liner: see the source of kmp_search. In fact, the table initialisation can be further inlined, but enough is enough.
- 2
Also a one-liner:
first((i for i in range(len(haystack) - len(needle)) if haystack[i:i+len(needle)] == needle), None)usingmore_itertools.first.
-
scicomp.exam.kk.kmp.kmp_search(needle: Sequence[_T], haystack: Iterable[_T]) → Optional[int][source]¶ Find the first occurence of
needleinhaystackusing the KMP algorithm.- Parameters
needle – subsequence to search for
haystack – sequence to search in
- Returns
The index of the first occurence of
needleinhaystackorNoneif not found. In general (unlessNone, of course),:haystack[kmp_search(needle, haystack):][:len(needle)] == needle
which means that, like the builtin
index/findmethods of Python strings, for an emptyneedle, this returns0.
Notes
Since random access into the
needleis needed, it is required to be a sequence. Thehaystackon the other hand can be any iterable.Comparison is performed using
==, so the usual equality rules apply.
References
- KMP
Knuth, Donald; Morris, James H.; Pratt, Vaughan (1977). “Fast pattern matching in strings”. SIAM Journal on Computing. 6 (2): 323–350. doi:10.1137/0206024.