Intermediate

String Algorithms

String algorithms are fundamental to both competitive programming and NLP applications. This lesson covers five contest-style problems using KMP pattern matching, Rabin-Karp hashing, trie data structures, suffix arrays, and polynomial string hashing. Each problem is framed in an AI/ML context with complete solutions.

Problem 1: KMP Pattern Matching — Log Pattern Detection

📝
Problem Statement: Your ML training pipeline generates a massive log file as a string S. You need to find all occurrences of an error pattern P in S. Return the count of occurrences and the 0-indexed starting positions of all matches; occurrences may overlap.

Input: String S (1 ≤ |S| ≤ 10^6), String P (1 ≤ |P| ≤ |S|)
Output: Count of occurrences, followed by the 0-indexed starting positions

Approach: KMP (Knuth-Morris-Pratt) Algorithm

KMP avoids redundant comparisons by precomputing a failure function (also called the prefix function or partial match table). When a mismatch occurs at position j of the pattern, instead of restarting from the beginning, KMP falls back to lps[j-1], the length of the longest proper prefix of P[0..j-1] that is also a suffix of it, and resumes matching there. This guarantees O(n + m) time complexity.

def compute_lps(pattern):
    """Compute Longest Proper Prefix which is also Suffix array.
    lps[i] = length of longest proper prefix of pattern[0..i]
             that is also a suffix of pattern[0..i].
    Time: O(m), Space: O(m)
    """
    m = len(pattern)
    lps = [0] * m
    length = 0  # length of the previous longest prefix suffix
    i = 1

    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                length = lps[length - 1]  # Key insight: don't restart from 0
            else:
                lps[i] = 0
                i += 1
    return lps

def kmp_search(text, pattern):
    """Find all occurrences of pattern in text using KMP.
    Time: O(n + m), Space: O(m)
    """
    n, m = len(text), len(pattern)
    if m == 0:
        return []

    lps = compute_lps(pattern)
    matches = []
    i = 0  # index for text
    j = 0  # index for pattern

    while i < n:
        if text[i] == pattern[j]:
            i += 1
            j += 1

        if j == m:
            matches.append(i - j)
            j = lps[j - 1]  # Look for next match
        elif i < n and text[i] != pattern[j]:
            if j != 0:
                j = lps[j - 1]
            else:
                i += 1

    return matches

# Example: Find error patterns in training log
log = "EPOCH1_OK_EPOCH2_ERROR_EPOCH3_OK_EPOCH4_ERROR_EPOCH5_OK"
pattern = "ERROR"
matches = kmp_search(log, pattern)
print(f"Found {len(matches)} occurrences at positions: {matches}")
# Found 2 occurrences at positions: [17, 40]

# Stress test with repeated pattern
text = "aaaaaa"
pattern = "aa"
print(kmp_search(text, pattern))  # [0, 1, 2, 3, 4] - overlapping matches

Complexity: O(N + M) time, O(M) space where N = |text|, M = |pattern|.
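The failure function is easy to cross-check against its definition: lps[i] is the length of the longest proper prefix of pattern[0..i] that is also its suffix. A brute-force oracle (quadratic, for illustration and testing only; not part of the solution above) makes that definition concrete:

```python
def lps_naive(pattern):
    """Brute-force LPS oracle: for each prefix, try every proper-prefix
    length and keep the longest one that is also a suffix. Test-only."""
    out = []
    for i in range(len(pattern)):
        chunk = pattern[:i + 1]
        best = 0
        for l in range(1, len(chunk)):  # proper prefixes only
            if chunk[:l] == chunk[-l:]:
                best = l
        out.append(best)
    return out

print(lps_naive("ababaca"))  # [0, 0, 1, 2, 3, 0, 1] (matches compute_lps)
```

Comparing lps_naive and compute_lps on random strings is a quick way to gain confidence in the linear-time version during a contest.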

Problem 2: Rabin-Karp — Multi-Pattern Token Search

📝
Problem Statement: Given a corpus string S and Q query patterns, all of the same length L, determine which query patterns appear in S. Use a rolling hash to achieve O(N + Q*L) average time instead of the O(N*Q*L) of the naive approach.

Input: String S (|S| ≤ 10^6), Q patterns of length L (Q ≤ 10^4, L ≤ 1000)
Output: List of patterns found in S

Approach: Rabin-Karp Rolling Hash

Compute the hash of each query pattern and store in a set. Then slide a window of length L across S, maintaining the hash using the rolling hash formula. When the window hash matches a query hash, verify with a string comparison to handle collisions.
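The rolling update is worth sanity-checking in isolation before the full implementation. It relies on the identity H(s[i..i+L-1]) = H(s[i-1..i+L-2])*B - s[i-1]*B^L + s[i+L-1] (mod M). A minimal sketch with a single hash (the real code carries two base/mod pairs):

```python
BASE, MOD = 131, 10**9 + 7

def poly_hash(s, base=BASE, mod=MOD):
    """Direct polynomial hash: h = s[0]*B^(L-1) + ... + s[L-1] (mod M)."""
    h = 0
    for c in s:
        h = (h * base + ord(c)) % mod
    return h

s, L = "abcde", 3
pw = pow(BASE, L, MOD)    # B^L, used to remove the outgoing character
h = poly_hash(s[:L])      # hash of the first window "abc"
for i in range(1, len(s) - L + 1):
    # Roll the window one character to the right
    h = (h * BASE - ord(s[i - 1]) * pw + ord(s[i + L - 1])) % MOD
    assert h == poly_hash(s[i:i + L])  # rolled hash equals the direct hash
print("rolling update verified")
```

Note that Python's % operator always returns a non-negative result, so the subtraction needs no extra correction step.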

def rabin_karp_multi(text, patterns):
    """Find which patterns (all same length) appear in text.
    Uses double hashing to minimize collision probability.
    Time: O(N + Q*L) average, Space: O(Q*L)
    """
    if not patterns:
        return []

    L = len(patterns[0])
    N = len(text)
    if L > N:
        return []

    BASE1, MOD1 = 131, 10**9 + 7
    BASE2, MOD2 = 137, 10**9 + 9

    def compute_hash(s, base, mod):
        h = 0
        for c in s:
            h = (h * base + ord(c)) % mod
        return h

    # Precompute hashes of all patterns
    pattern_set = {}  # (hash1, hash2) -> list of patterns
    for p in patterns:
        h1 = compute_hash(p, BASE1, MOD1)
        h2 = compute_hash(p, BASE2, MOD2)
        key = (h1, h2)
        if key not in pattern_set:
            pattern_set[key] = []
        pattern_set[key].append(p)

    # Precompute BASE^L mod for rolling hash removal
    pow1 = pow(BASE1, L, MOD1)
    pow2 = pow(BASE2, L, MOD2)

    # Compute hash of first window
    h1 = compute_hash(text[:L], BASE1, MOD1)
    h2 = compute_hash(text[:L], BASE2, MOD2)

    found = set()
    # Check first window
    key = (h1, h2)
    if key in pattern_set:
        window = text[:L]
        for p in pattern_set[key]:
            if window == p:
                found.add(p)

    # Slide window
    for i in range(1, N - L + 1):
        # Remove leftmost char, add rightmost char
        h1 = (h1 * BASE1 - ord(text[i - 1]) * pow1 + ord(text[i + L - 1])) % MOD1
        h2 = (h2 * BASE2 - ord(text[i - 1]) * pow2 + ord(text[i + L - 1])) % MOD2

        key = (h1, h2)
        if key in pattern_set:
            window = text[i:i + L]
            for p in pattern_set[key]:
                if window == p:
                    found.add(p)

    return list(found)

# Example: Search for NLP tokens in corpus
corpus = "the quick brown fox jumps over the lazy dog and the cat"
tokens = ["the", "fox", "cat", "owl", "dog"]
print(sorted(rabin_karp_multi(corpus, tokens)))  # sort: set order is arbitrary
# ['cat', 'dog', 'fox', 'the']

Complexity: O(N + Q*L) average time with double hashing. O(Q*L) space for pattern storage.

Problem 3: Trie Operations — Autocomplete for Code Suggestions

📝
Problem Statement: Build an autocomplete system for an AI code assistant. Support three operations: (1) INSERT a word with a frequency score, (2) SEARCH for exact match, (3) PREFIX query returning the top-K words by frequency that start with the given prefix.

Input: Q operations (Q ≤ 10^5), words of length ≤ 100
Output: Results for each SEARCH and PREFIX query

Approach: Trie with Frequency Tracking

class TrieNode:
    __slots__ = ['children', 'is_end', 'word', 'freq']

    def __init__(self):
        self.children = {}
        self.is_end = False
        self.word = None
        self.freq = 0

class AutocompleteTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, freq):
        """Insert word with frequency: O(L)"""
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_end = True
        node.word = word
        node.freq = freq

    def search(self, word):
        """Exact match search: O(L)"""
        node = self.root
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_end

    def _collect_words(self, node, results):
        """DFS to collect all words under a node"""
        if node.is_end:
            results.append((node.freq, node.word))
        for ch in sorted(node.children):
            self._collect_words(node.children[ch], results)

    def prefix_search(self, prefix, k):
        """Find top-K words by frequency with given prefix: O(N log K)"""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]

        results = []
        self._collect_words(node, results)

        # Sort by frequency descending, return top K
        results.sort(key=lambda x: -x[0])
        return [word for freq, word in results[:k]]

# Example: AI code assistant autocomplete
trie = AutocompleteTrie()
functions = [
    ("model.fit", 95), ("model.predict", 90), ("model.evaluate", 80),
    ("model.save", 70), ("model.load", 65), ("model.compile", 85),
    ("matplotlib.pyplot", 60), ("matplotlib.figure", 40),
    ("numpy.array", 92), ("numpy.zeros", 75), ("numpy.ones", 70),
]

for word, freq in functions:
    trie.insert(word, freq)

print(trie.search("model.fit"))              # True
print(trie.search("model.train"))            # False
print(trie.prefix_search("model.", 3))       # ['model.fit', 'model.predict', 'model.compile']
print(trie.prefix_search("numpy.", 2))       # ['numpy.array', 'numpy.zeros']
print(trie.prefix_search("matplotlib.", 5))  # ['matplotlib.pyplot', 'matplotlib.figure']

Complexity: Insert/Search O(L) where L is word length. Prefix search is O(N log N) with the full sort used here, where N is the number of words under the prefix node; heapq.nlargest would bring this down to O(N log K).
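When K is much smaller than the number of matches, a heap avoids sorting everything. A standalone sketch using the standard library's heapq.nlargest on (freq, word) pairs like those gathered by _collect_words:

```python
import heapq

def top_k_by_freq(results, k):
    """Pick the k highest-frequency (freq, word) pairs without a full sort.
    heapq.nlargest runs in O(N log K); ties break on the word itself."""
    return [word for freq, word in heapq.nlargest(k, results)]

candidates = [(95, "model.fit"), (85, "model.compile"),
              (90, "model.predict"), (70, "model.save")]
print(top_k_by_freq(candidates, 2))  # ['model.fit', 'model.predict']
```

Dropping this into prefix_search in place of the sort is a one-line change and matters when a popular prefix covers thousands of words.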

Problem 4: Suffix Arrays — Longest Repeated Substring in Training Data

📝
Problem Statement: Given a string S representing tokenized training data, find the longest substring that appears at least twice. This helps detect data duplication in training sets.

Input: String S (1 ≤ |S| ≤ 10^5)
Output: The longest repeated substring (or empty string if none)

Approach: Suffix Array + LCP Array

Build a suffix array (sorted array of all suffixes), then compute the Longest Common Prefix (LCP) array between adjacent suffixes. The maximum value in the LCP array gives the length of the longest repeated substring.
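To see why the maximum LCP of adjacent sorted suffixes is exactly the answer, here is a small worked example on "banana" (naive construction, for intuition only):

```python
s = "banana"
# Sort suffix start indices by the suffixes themselves (fine at this size)
order = sorted(range(len(s)), key=lambda i: s[i:])
print(order)  # [5, 3, 1, 0, 4, 2] -> suffixes: a, ana, anana, banana, na, nana

def lcp_len(a, b):
    """Length of the longest common prefix of two strings."""
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    return k

# LCP between each pair of adjacent sorted suffixes
lcps = [lcp_len(s[order[i - 1]:], s[order[i]:]) for i in range(1, len(order))]
print(lcps)  # [1, 3, 0, 0, 2] -> max is 3: "ana" occurs twice in "banana"
```

Any substring that occurs twice is a common prefix of two suffixes, and two such suffixes end up adjacent (or bracketing only closer suffixes) in sorted order, which is why scanning adjacent LCPs suffices.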

def build_suffix_array(s):
    """Build suffix array by prefix doubling: O(N log^2 N).
    Production code would use SA-IS for O(N)."""
    n = len(s)
    sa = list(range(n))
    rank = list(map(ord, s))
    k = 1
    while n > 1:
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)  # sort by (rank of first k chars, rank of next k)
        new_rank = [0] * n
        for t in range(1, n):
            new_rank[sa[t]] = new_rank[sa[t - 1]] + (key(sa[t]) != key(sa[t - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: order is final
            break
        k *= 2
    return sa

def build_lcp_array(s, sa):
    """Build LCP array using Kasai's algorithm: O(N)"""
    n = len(s)
    rank = [0] * n
    lcp = [0] * n

    for i in range(n):
        rank[sa[i]] = i

    k = 0
    for i in range(n):
        if rank[i] == 0:
            k = 0
            continue

        j = sa[rank[i] - 1]  # Previous suffix in sorted order
        while i + k < n and j + k < n and s[i + k] == s[j + k]:
            k += 1

        lcp[rank[i]] = k
        if k > 0:
            k -= 1

    return lcp

def longest_repeated_substring(s):
    """Find longest substring appearing at least twice: O(N log N)"""
    if len(s) <= 1:
        return ""

    sa = build_suffix_array(s)
    lcp = build_lcp_array(s, sa)

    max_lcp = 0
    max_idx = 0
    for i in range(1, len(s)):
        if lcp[i] > max_lcp:
            max_lcp = lcp[i]
            max_idx = sa[i]

    return s[max_idx:max_idx + max_lcp]

# Example: Detect duplicated training data
training_data = "the cat sat on the mat the cat sat on the rug"
result = longest_repeated_substring(training_data)
print(f"Longest repeated substring: '{result}'")
# 'the cat sat on the '

# Another example
code = "def train_model(): pass\ndef train_model(): pass"
print(f"Duplicated code: '{longest_repeated_substring(code)}'")
# 'def train_model(): pass'

Complexity: O(N log^2 N) for suffix array construction by prefix doubling (naively sorting whole-suffix slices would be O(N^2 log N) and too slow at |S| = 10^5), plus O(N) for Kasai's LCP algorithm.

Problem 5: String Hashing — Counting Distinct Substrings

📝
Problem Statement: Given a text corpus S, count the number of distinct substrings of length exactly L. This is useful for computing vocabulary size at a given n-gram level for language models.

Input: String S (|S| ≤ 10^5), integer L (1 ≤ L ≤ |S|)
Output: Number of distinct substrings of length L

Approach: Polynomial Rolling Hash

Use rolling polynomial hash to compute the hash of each substring of length L in O(1) amortized time. Store hashes in a set. Use double hashing to virtually eliminate collision probability.

def count_distinct_substrings(s, L):
    """Count distinct substrings of length L using double rolling hash.
    Time: O(N), Space: O(N)
    """
    n = len(s)
    if L > n:
        return 0

    BASE1, MOD1 = 131, 10**9 + 7
    BASE2, MOD2 = 137, 10**9 + 9

    # Precompute BASE^L for hash removal
    pow1 = pow(BASE1, L, MOD1)
    pow2 = pow(BASE2, L, MOD2)

    # Compute hash of first window
    h1 = h2 = 0
    for i in range(L):
        h1 = (h1 * BASE1 + ord(s[i])) % MOD1
        h2 = (h2 * BASE2 + ord(s[i])) % MOD2

    seen = {(h1, h2)}

    # Slide window
    for i in range(1, n - L + 1):
        h1 = (h1 * BASE1 - ord(s[i - 1]) * pow1 + ord(s[i + L - 1])) % MOD1
        h2 = (h2 * BASE2 - ord(s[i - 1]) * pow2 + ord(s[i + L - 1])) % MOD2
        seen.add((h1, h2))

    return len(seen)

# Example: Count distinct n-grams
text = "abracadabra"
for L in range(1, 6):
    count = count_distinct_substrings(text, L)
    print(f"Distinct {L}-grams: {count}")

# Distinct 1-grams: 5 (a, b, r, c, d)
# Distinct 2-grams: 7 (ab, br, ra, ac, ca, ad, da)
# Distinct 3-grams: 7 (abr, bra, rac, aca, cad, ada, dab)
# Distinct 4-grams: 7 (the second "abra" window repeats the first)
# Distinct 5-grams: 7

# ML application: vocabulary size at different n-gram levels
corpus = "the quick brown fox jumps over the lazy dog"
print(f"\nCorpus vocabulary analysis:")
for n in [1, 2, 3]:
    print(f"  {n}-char substrings: {count_distinct_substrings(corpus, n)}")

Complexity: O(N) time with rolling hash, O(N) space for the hash set.
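On small inputs, the hash-based count is easy to cross-check against a brute-force oracle that stores the substrings themselves (O(N*L) memory, so test-only):

```python
def count_distinct_brute(s, L):
    """Reference count: materialize every length-L substring in a set."""
    return len({s[i:i + L] for i in range(len(s) - L + 1)})

# Agrees with the hashed counts for "abracadabra"
print([count_distinct_brute("abracadabra", L) for L in range(1, 6)])
# [5, 7, 7, 7, 7]
```

Because the hashed version counts distinct (h1, h2) pairs rather than verified strings, a collision could in principle undercount; with double hashing the probability is negligible for contest sizes, and this oracle is how you would catch it in testing.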

💡
Contest tip: String hashing is the most versatile string technique in competitive programming. It can solve pattern matching, palindrome detection, longest common substring, and distinct substring counting — all in linear or near-linear time. Always use double hashing (two different base/mod pairs) to avoid anti-hash test cases that some contest setters include.

Key Takeaways

  • KMP achieves O(N + M) pattern matching by precomputing the failure function to avoid redundant comparisons.
  • Rabin-Karp uses rolling hash for efficient multi-pattern search, making it ideal when searching for many patterns simultaneously.
  • Tries provide O(L) insert/search and are the foundation for autocomplete systems, spell checkers, and IP routing.
  • Suffix arrays with LCP arrays solve the longest repeated substring problem and many other substring queries efficiently.
  • Polynomial rolling hash enables O(1) substring hash computation after O(N) preprocessing — use double hashing for collision resistance.