String Algorithms
String algorithms are fundamental to both competitive programming and NLP applications. This lesson covers five contest-style problems using KMP pattern matching, Rabin-Karp hashing, trie data structures, suffix arrays, and polynomial string hashing. Each problem is framed in an AI/ML context with complete solutions.
Problem 1: KMP Pattern Matching — Log Pattern Detection
Input: String S (1 ≤ |S| ≤ 10^6), String P (1 ≤ |P| ≤ |S|)
Output: Count of occurrences, followed by the 0-indexed starting positions
Approach: KMP (Knuth-Morris-Pratt) Algorithm
KMP avoids redundant comparisons by precomputing a failure function (also called the prefix function or partial match table). When a mismatch occurs at position j of the pattern, instead of restarting from the beginning, KMP jumps to the longest proper prefix of P[0..j-1] that is also a suffix. This guarantees O(n + m) time complexity.
def compute_lps(pattern):
    """Compute Longest Proper Prefix which is also Suffix array.
    lps[i] = length of longest proper prefix of pattern[0..i]
    that is also a suffix of pattern[0..i].
    Time: O(m), Space: O(m)
    """
    m = len(pattern)
    lps = [0] * m
    length = 0  # length of the previous longest prefix suffix
    i = 1
    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                length = lps[length - 1]  # Key insight: don't restart from 0
            else:
                lps[i] = 0
                i += 1
    return lps
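To see what the failure function computes, a brute-force version straight from the definition makes a handy test oracle for the optimized routine above (a standalone sketch; lps_brute is an illustrative helper, not part of the solution):

```python
# Brute-force LPS by definition: for each prefix pattern[0..i], try every
# proper prefix length k and keep the longest one that is also a suffix.
def lps_brute(p):
    out = []
    for i in range(len(p)):
        best = 0
        for k in range(1, i + 1):  # proper prefix lengths of p[0..i]
            if p[:k] == p[i - k + 1:i + 1]:
                best = k
        out.append(best)
    return out

print(lps_brute("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```

Comparing lps_brute against compute_lps on random strings is a quick way to stress-test the O(m) implementation.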
def kmp_search(text, pattern):
    """Find all occurrences of pattern in text using KMP.
    Time: O(n + m), Space: O(m)
    """
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    lps = compute_lps(pattern)
    matches = []
    i = 0  # index for text
    j = 0  # index for pattern
    while i < n:
        if text[i] == pattern[j]:
            i += 1
            j += 1
        if j == m:
            matches.append(i - j)
            j = lps[j - 1]  # Look for next match
        elif i < n and text[i] != pattern[j]:
            if j != 0:
                j = lps[j - 1]
            else:
                i += 1
    return matches
# Example: Find error patterns in training log
log = "EPOCH1_OK_EPOCH2_ERROR_EPOCH3_OK_EPOCH4_ERROR_EPOCH5_OK"
pattern = "ERROR"
matches = kmp_search(log, pattern)
print(f"Found {len(matches)} occurrences at positions: {matches}")
# Found 2 occurrences at positions: [17, 40]
# Stress test with repeated pattern
text = "aaaaaa"
pattern = "aa"
print(kmp_search(text, pattern)) # [0, 1, 2, 3, 4] - overlapping matches
Complexity: O(N + M) time, O(M) space where N = |text|, M = |pattern|.
Problem 2: Rabin-Karp — Multi-Pattern Token Search
Input: String S (|S| ≤ 10^6), Q patterns of length L (Q ≤ 10^4, L ≤ 1000)
Output: List of patterns found in S
Approach: Rabin-Karp Rolling Hash
Compute the hash of each query pattern and store in a set. Then slide a window of length L across S, maintaining the hash using the rolling hash formula. When the window hash matches a query hash, verify with a string comparison to handle collisions.
def rabin_karp_multi(text, patterns):
    """Find which patterns (all same length) appear in text.
    Uses double hashing to minimize collision probability.
    Time: O(N + Q*L) average, Space: O(Q*L)
    """
    if not patterns:
        return []
    L = len(patterns[0])
    N = len(text)
    if L > N:
        return []
    BASE1, MOD1 = 131, 10**9 + 7
    BASE2, MOD2 = 137, 10**9 + 9

    def compute_hash(s, base, mod):
        h = 0
        for c in s:
            h = (h * base + ord(c)) % mod
        return h

    # Precompute hashes of all patterns
    pattern_set = {}  # (hash1, hash2) -> list of patterns
    for p in patterns:
        h1 = compute_hash(p, BASE1, MOD1)
        h2 = compute_hash(p, BASE2, MOD2)
        key = (h1, h2)
        if key not in pattern_set:
            pattern_set[key] = []
        pattern_set[key].append(p)

    # Precompute BASE^L mod for rolling hash removal
    pow1 = pow(BASE1, L, MOD1)
    pow2 = pow(BASE2, L, MOD2)

    # Compute hash of first window
    h1 = compute_hash(text[:L], BASE1, MOD1)
    h2 = compute_hash(text[:L], BASE2, MOD2)
    found = set()

    # Check first window
    key = (h1, h2)
    if key in pattern_set:
        window = text[:L]
        for p in pattern_set[key]:
            if window == p:
                found.add(p)

    # Slide window
    for i in range(1, N - L + 1):
        # Remove leftmost char, add rightmost char
        h1 = (h1 * BASE1 - ord(text[i - 1]) * pow1 + ord(text[i + L - 1])) % MOD1
        h2 = (h2 * BASE2 - ord(text[i - 1]) * pow2 + ord(text[i + L - 1])) % MOD2
        key = (h1, h2)
        if key in pattern_set:
            window = text[i:i + L]
            for p in pattern_set[key]:
                if window == p:
                    found.add(p)
    return list(found)
# Example: Search for NLP tokens in corpus
corpus = "the quick brown fox jumps over the lazy dog and the cat"
tokens = ["the", "fox", "cat", "owl", "dog"]
print(sorted(rabin_karp_multi(corpus, tokens)))  # sorted: set iteration order is arbitrary
# ['cat', 'dog', 'fox', 'the']
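The correctness of the sliding step hinges on one identity: multiplying the window hash by the base shifts every character's coefficient up by one power, after which the departing character (now weighted BASE^L) can be subtracted and the new character added. A minimal standalone check of that identity, using the same constants as above:

```python
# Sanity check of the rolling-hash update: rolling the window from "abc"
# to "bcd" must give the same value as hashing "bcd" from scratch.
BASE, MOD = 131, 10**9 + 7

def poly_hash(s):
    h = 0
    for c in s:
        h = (h * BASE + ord(c)) % MOD
    return h

text, L = "abcd", 3
powL = pow(BASE, L, MOD)                                   # BASE^L mod MOD
h = poly_hash(text[:L])                                    # hash("abc")
h = (h * BASE - ord(text[0]) * powL + ord(text[3])) % MOD  # roll to "bcd"
assert h == poly_hash("bcd")
print("rolling update matches direct hash")
```

Python's % operator returns a non-negative result even when the subtraction goes negative, which is why no extra `+ MOD` correction is needed here (unlike in C or C++).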
Complexity: O(N + Q*L) average time with double hashing. O(Q*L) space for pattern storage.
Problem 3: Trie Operations — Autocomplete for Code Suggestions
Input: Q operations (Q ≤ 10^5), words of length ≤ 100
Output: Results for each SEARCH and PREFIX query
Approach: Trie with Frequency Tracking
class TrieNode:
    __slots__ = ['children', 'is_end', 'word', 'freq']

    def __init__(self):
        self.children = {}
        self.is_end = False
        self.word = None
        self.freq = 0

class AutocompleteTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, freq):
        """Insert word with frequency: O(L)"""
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_end = True
        node.word = word
        node.freq = freq

    def search(self, word):
        """Exact match search: O(L)"""
        node = self.root
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_end

    def _collect_words(self, node, results):
        """DFS to collect all words under a node"""
        if node.is_end:
            results.append((node.freq, node.word))
        for ch in sorted(node.children):
            self._collect_words(node.children[ch], results)

    def prefix_search(self, prefix, k):
        """Find top-K words by frequency with given prefix: O(N log N)"""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []
        self._collect_words(node, results)
        # Sort by frequency descending, return top K
        results.sort(key=lambda x: -x[0])
        return [word for freq, word in results[:k]]
# Example: AI code assistant autocomplete
trie = AutocompleteTrie()
functions = [
("model.fit", 95), ("model.predict", 90), ("model.evaluate", 80),
("model.save", 70), ("model.load", 65), ("model.compile", 85),
("matplotlib.pyplot", 60), ("matplotlib.figure", 40),
("numpy.array", 92), ("numpy.zeros", 75), ("numpy.ones", 70),
]
for word, freq in functions:
trie.insert(word, freq)
print(trie.search("model.fit")) # True
print(trie.search("model.train")) # False
print(trie.prefix_search("model.", 3)) # ['model.fit', 'model.predict', 'model.compile']
print(trie.prefix_search("numpy.", 2)) # ['numpy.array', 'numpy.zeros']
print(trie.prefix_search("matplotlib.", 5)) # ['matplotlib.pyplot', 'matplotlib.figure']
Complexity: Insert/Search O(L) where L is word length. Prefix search is O(N log N) where N is the number of words under the prefix node, since all matches are collected and fully sorted; a size-K heap would reduce this to O(N log K).
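When K is much smaller than the number of matches, the full sort in the top-K step can be replaced by a heap. A minimal sketch using the standard library's heapq.nlargest (the sample data here is hypothetical):

```python
import heapq

# (freq, word) pairs as _collect_words would produce them (illustrative data)
results = [(95, "model.fit"), (85, "model.compile"),
           (90, "model.predict"), (70, "model.save")]

# heapq.nlargest keeps a heap of size K internally: O(N log K)
# instead of the O(N log N) cost of sorting everything.
top = heapq.nlargest(3, results)
print([word for freq, word in top])  # ['model.fit', 'model.predict', 'model.compile']
```

Note that on frequency ties, tuple comparison falls back to comparing the words themselves; pass `key=lambda t: t[0]` to rank strictly by frequency.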
Problem 4: Suffix Arrays — Longest Repeated Substring in Training Data
Input: String S (1 ≤ |S| ≤ 10^5)
Output: The longest repeated substring (or empty string if none)
Approach: Suffix Array + LCP Array
Build a suffix array (sorted array of all suffixes), then compute the Longest Common Prefix (LCP) array between adjacent suffixes. The maximum value in the LCP array gives the length of the longest repeated substring.
def build_suffix_array(s):
    """Build suffix array by sorting explicit suffixes.
    Simple and fast enough in practice for small inputs, but each
    comparison can cost O(N), so the worst case is O(N^2 log N).
    Prefix doubling achieves O(N log^2 N); production code would
    use SA-IS for O(N).
    """
    n = len(s)
    suffixes = [(s[i:], i) for i in range(n)]
    suffixes.sort()
    return [idx for _, idx in suffixes]
def build_lcp_array(s, sa):
    """Build LCP array using Kasai's algorithm: O(N)"""
    n = len(s)
    rank = [0] * n
    lcp = [0] * n
    for i in range(n):
        rank[sa[i]] = i
    k = 0
    for i in range(n):
        if rank[i] == 0:
            k = 0
            continue
        j = sa[rank[i] - 1]  # Previous suffix in sorted order
        while i + k < n and j + k < n and s[i + k] == s[j + k]:
            k += 1
        lcp[rank[i]] = k
        if k > 0:
            k -= 1
    return lcp
def longest_repeated_substring(s):
    """Find longest substring appearing at least twice.
    Runtime is dominated by suffix-array construction.
    """
    if len(s) <= 1:
        return ""
    sa = build_suffix_array(s)
    lcp = build_lcp_array(s, sa)
    max_lcp = 0
    max_idx = 0
    for i in range(1, len(s)):
        if lcp[i] > max_lcp:
            max_lcp = lcp[i]
            max_idx = sa[i]
    return s[max_idx:max_idx + max_lcp]
# Example: Detect duplicated training data
training_data = "the cat sat on the mat the cat sat on the rug"
result = longest_repeated_substring(training_data)
print(f"Longest repeated substring: '{result}'")
# 'the cat sat on the '
# Another example
code = "def train_model(): pass\ndef train_model(): pass"
print(f"Duplicated code: '{longest_repeated_substring(code)}'")
# 'def train_model(): pass'
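A tiny worked example makes the two structures concrete. For "banana", the suffix array and adjacent-suffix LCPs can be computed by brute force (this standalone snippet is for illustration only; common_prefix is an ad-hoc helper):

```python
# Suffix array and LCP values for "banana", computed directly.
s = "banana"
sa = sorted(range(len(s)), key=lambda i: s[i:])
print(sa)  # [5, 3, 1, 0, 4, 2]: suffixes a, ana, anana, banana, na, nana

def common_prefix(a, b):
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k

lcps = [common_prefix(s[sa[i - 1]:], s[sa[i]:]) for i in range(1, len(sa))]
print(lcps)  # [1, 3, 0, 0, 2]: the max is 3, so the answer is "ana"
```

The maximum LCP of 3 occurs between the adjacent suffixes "ana" and "anana", recovering the longest repeated substring "ana".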
Complexity: dominated by suffix array construction — O(N^2 log N) worst case for the naive sort above, O(N log^2 N) with prefix doubling, O(N) with SA-IS. Kasai's LCP algorithm is O(N).
Problem 5: String Hashing — Counting Distinct Substrings
Input: String S (|S| ≤ 10^5), integer L (1 ≤ L ≤ |S|)
Output: Number of distinct substrings of length L
Approach: Polynomial Rolling Hash
Use rolling polynomial hash to compute the hash of each substring of length L in O(1) amortized time. Store hashes in a set. Use double hashing to virtually eliminate collision probability.
def count_distinct_substrings(s, L):
    """Count distinct substrings of length L using double rolling hash.
    Time: O(N), Space: O(N)
    """
    n = len(s)
    if L > n:
        return 0
    BASE1, MOD1 = 131, 10**9 + 7
    BASE2, MOD2 = 137, 10**9 + 9

    # Precompute BASE^L for hash removal
    pow1 = pow(BASE1, L, MOD1)
    pow2 = pow(BASE2, L, MOD2)

    # Compute hash of first window
    h1 = h2 = 0
    for i in range(L):
        h1 = (h1 * BASE1 + ord(s[i])) % MOD1
        h2 = (h2 * BASE2 + ord(s[i])) % MOD2
    seen = {(h1, h2)}

    # Slide window
    for i in range(1, n - L + 1):
        h1 = (h1 * BASE1 - ord(s[i - 1]) * pow1 + ord(s[i + L - 1])) % MOD1
        h2 = (h2 * BASE2 - ord(s[i - 1]) * pow2 + ord(s[i + L - 1])) % MOD2
        seen.add((h1, h2))
    return len(seen)
# Example: Count distinct n-grams
text = "abracadabra"
for L in range(1, 6):
count = count_distinct_substrings(text, L)
print(f"Distinct {L}-grams: {count}")
# Distinct 1-grams: 5 (a, b, r, c, d)
# Distinct 2-grams: 7 (ab, br, ra, ac, ca, ad, da)
# Distinct 3-grams: 7 (abr, bra, rac, aca, cad, ada, dab)
# Distinct 4-grams: 7
# Distinct 5-grams: 7
# ML application: vocabulary size at different n-gram levels
corpus = "the quick brown fox jumps over the lazy dog"
print(f"\nCorpus vocabulary analysis:")
for n in [1, 2, 3]:
print(f" {n}-char substrings: {count_distinct_substrings(corpus, n)}")
Complexity: O(N) time with rolling hash, O(N) space for the hash set.
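"Virtually eliminate" can be quantified. By the birthday bound, hashing n windows into a space of M possible values yields roughly n^2 / (2M) expected collisions; with double hashing the effective space is MOD1 * MOD2. A back-of-envelope check with the constants above:

```python
# Birthday-bound estimate for the double hash: n windows, combined
# modulus space M = MOD1 * MOD2 ~ 1e18.
n = 10**5
M = (10**9 + 7) * (10**9 + 9)
estimate = n * n / (2 * M)
print(f"expected collisions ~ {estimate:.1e}")  # ~ 5.0e-09
```

With a single 10^9-sized modulus the same estimate is about 5e-3, which is why a lone hash is risky against adversarial inputs while the double hash is safe for contest purposes.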
Key Takeaways
- KMP achieves O(N + M) pattern matching by precomputing the failure function to avoid redundant comparisons.
- Rabin-Karp uses rolling hash for efficient multi-pattern search, making it ideal when searching for many patterns simultaneously.
- Tries provide O(L) insert/search and are the foundation for autocomplete systems, spell checkers, and IP routing.
- Suffix arrays with LCP arrays solve the longest repeated substring problem and many other substring queries efficiently.
- Polynomial rolling hash enables O(1) substring hash computation after O(N) preprocessing — use double hashing for collision resistance.
Lilly Tech Systems