Hash Maps & Sets
Hash maps (Python dictionaries) and sets provide O(1) average-case lookup, insertion, and deletion. They are the most powerful tool for eliminating nested loops and are tested in nearly every coding interview.
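A quick sketch of why that matters: testing membership against a plain list rescans the list on every query, while a hash set answers the same question in O(1) on average. (The id/query data below is made up purely for illustration.)

ids = [7, 3, 9, 3, 1]
queries = [3, 5, 9]

# O(n * q): the `in` check scans the whole list for every query
hits_slow = [q for q in queries if q in ids]

# O(n + q): build the set once, then each membership test is O(1) on average
id_set = set(ids)
hits_fast = [q for q in queries if q in id_set]

assert hits_slow == hits_fast == [3, 9]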
Problem 1: Top K Frequent Elements
Given an integer array nums and an integer k, return the k most frequent elements. You may return the answer in any order.
Brute Force: O(n log n) Time, O(n) Space
def top_k_frequent_brute(nums, k):
    """Count frequencies, sort by frequency, return top k."""
    freq = {}
    for num in nums:
        freq[num] = freq.get(num, 0) + 1
    # Sort by frequency (descending)
    sorted_items = sorted(freq.items(), key=lambda x: -x[1])
    return [item[0] for item in sorted_items[:k]]

# Test
print(top_k_frequent_brute([1, 1, 1, 2, 2, 3], 2))  # [1, 2]
Optimal (Bucket Sort): O(n) Time, O(n) Space
def top_k_frequent(nums, k):
    """
    Bucket sort approach:
    1. Count frequencies
    2. Create buckets where index = frequency
    3. Collect elements from the highest-frequency buckets
    The maximum frequency is at most n, so we create n + 1 buckets.
    """
    freq = {}
    for num in nums:
        freq[num] = freq.get(num, 0) + 1
    # Buckets: index i holds elements that appear i times
    buckets = [[] for _ in range(len(nums) + 1)]
    for num, count in freq.items():
        buckets[count].append(num)
    # Collect from the highest-frequency buckets
    result = []
    for i in range(len(buckets) - 1, 0, -1):
        for num in buckets[i]:
            result.append(num)
            if len(result) == k:
                return result
    return result

# Test
print(top_k_frequent([1, 1, 1, 2, 2, 3], 2))       # [1, 2]
print(top_k_frequent([1], 1))                      # [1]
print(top_k_frequent([4, 1, -1, 2, -1, 2, 3], 2))  # [-1, 2]
AI/ML context: Finding top-K frequent elements is the core operation in building vocabulary for NLP models. When training a tokenizer, you count word frequencies across the entire corpus and keep only the top-K most frequent tokens.
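As a rough illustration of that pipeline (the toy corpus, the `<unk>` token, and the build_vocab helper below are hypothetical, not taken from any particular library):

from collections import Counter

def build_vocab(corpus, k, specials=("<unk>",)):
    """Sketch: keep the k most frequent tokens plus any special tokens."""
    counts = Counter(token for sentence in corpus for token in sentence.split())
    vocab = list(specials) + [tok for tok, _ in counts.most_common(k)]
    return {tok: idx for idx, tok in enumerate(vocab)}

corpus = ["the cat sat", "the cat ran", "a dog ran"]
print(build_vocab(corpus, k=3))
# e.g. {'<unk>': 0, 'the': 1, 'cat': 2, 'ran': 3} (tie order may vary)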
Problem 2: Group Anagrams
Given an array of strings strs, group the anagrams together. Return the groups in any order.
Brute Force: O(n² * m log m) Time, O(n * m) Space
def group_anagrams_brute(strs):
    """Compare each pair of strings for an anagram relationship."""
    used = [False] * len(strs)
    groups = []
    for i in range(len(strs)):
        if used[i]:
            continue
        group = [strs[i]]
        used[i] = True
        for j in range(i + 1, len(strs)):
            if not used[j] and sorted(strs[i]) == sorted(strs[j]):
                group.append(strs[j])
                used[j] = True
        groups.append(group)
    return groups

print(group_anagrams_brute(["eat", "tea", "tan", "ate", "nat", "bat"]))
# [['eat', 'tea', 'ate'], ['tan', 'nat'], ['bat']]
Optimal (Sorted Key): O(n * m log m) Time, O(n * m) Space
from collections import defaultdict

def group_anagrams(strs):
    """
    Use the sorted string as the hash key. All anagrams produce
    the same sorted string, so they map to the same bucket.
    """
    groups = defaultdict(list)
    for s in strs:
        key = tuple(sorted(s))  # "eat" -> ('a', 'e', 't')
        groups[key].append(s)
    return list(groups.values())

# Test
print(group_anagrams(["eat", "tea", "tan", "ate", "nat", "bat"]))
# [['eat', 'tea', 'ate'], ['tan', 'nat'], ['bat']]
print(group_anagrams([""]))   # [['']]
print(group_anagrams(["a"]))  # [['a']]
Even Better (Character Count Key): O(n * m) Time, O(n * m) Space
def group_anagrams_optimal(strs):
    """
    Use character frequency as the key instead of sorting.
    This avoids the O(m log m) sort per string.
    Assumes lowercase English letters only.
    """
    groups = defaultdict(list)
    for s in strs:
        # Create a frequency tuple as the key
        count = [0] * 26
        for char in s:
            count[ord(char) - ord('a')] += 1
        groups[tuple(count)].append(s)
    return list(groups.values())

# Test
print(group_anagrams_optimal(["eat", "tea", "tan", "ate", "nat", "bat"]))
# [['eat', 'tea', 'ate'], ['tan', 'nat'], ['bat']]
AI/ML context: Grouping similar items by a canonical representation is fundamental in data preprocessing. In NLP, this pattern is used for lemmatization grouping, finding duplicate documents with different word orders, and clustering similar feature vectors.
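A minimal sketch of that idea, reusing the same canonical-key trick to flag documents that contain the same words in a different order (the example documents and the helper name are made up):

from collections import defaultdict

def group_reordered_duplicates(docs):
    """Sketch: documents with the same word multiset (ignoring order and case)
    share a canonical key and land in the same bucket."""
    groups = defaultdict(list)
    for doc in docs:
        key = tuple(sorted(doc.lower().split()))
        groups[key].append(doc)
    return [g for g in groups.values() if len(g) > 1]

docs = ["machine learning rocks", "rocks machine learning", "deep learning rocks"]
print(group_reordered_duplicates(docs))
# [['machine learning rocks', 'rocks machine learning']]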
Problem 3: Two Sum II - Sorted Array
Given a 1-indexed array of integers numbers, sorted in non-decreasing order, and a target, return the 1-based indices of the two numbers that add up to target, ideally using only O(1) extra space.
Hash Map Approach: O(n) Time, O(n) Space
def two_sum_sorted_hash(numbers, target):
    """Works, but uses O(n) space - not optimal for sorted input."""
    seen = {}
    for i, num in enumerate(numbers):
        complement = target - num
        if complement in seen:
            return [seen[complement] + 1, i + 1]
        seen[num] = i
    return []
Optimal (Two Pointers): O(n) Time, O(1) Space
def two_sum_sorted(numbers, target):
    """
    Two-pointer approach for sorted arrays.
    The left pointer starts at the beginning, the right at the end.
    - If the sum is too small: move the left pointer right
    - If the sum is too large: move the right pointer left
    - If the sum equals the target: found it
    """
    left, right = 0, len(numbers) - 1
    while left < right:
        current_sum = numbers[left] + numbers[right]
        if current_sum == target:
            return [left + 1, right + 1]  # 1-indexed
        elif current_sum < target:
            left += 1
        else:
            right -= 1
    return []

# Test
print(two_sum_sorted([2, 7, 11, 15], 9))  # [1, 2]
print(two_sum_sorted([2, 3, 4], 6))       # [1, 3]
print(two_sum_sorted([-1, 0], -1))        # [1, 2]
AI/ML context: Two-pointer techniques on sorted data are used in k-nearest neighbor search, threshold-based classification, and efficiently finding feature pairs that satisfy constraints in feature selection.
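As a hedged sketch of that last use case, here is the same pointer movement counting how many pairs drawn from two sorted cost lists fit under a budget; the scenario and names are illustrative, not a standard API:

def count_pairs_within_budget(costs_a, costs_b, budget):
    """Sketch (hypothetical feature-selection scenario): count pairs whose
    combined cost fits a budget. Both lists are assumed sorted ascending;
    the right pointer only moves left, so the scan is O(n + m)."""
    pairs = 0
    j = len(costs_b) - 1
    for a in costs_a:
        while j >= 0 and a + costs_b[j] > budget:
            j -= 1
        if j < 0:
            break
        pairs += j + 1
    return pairs

print(count_pairs_within_budget([1, 2, 3], [1, 2, 3], 4))  # 6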
Problem 4: Intersection of Two Arrays II
Given two integer arrays nums1 and nums2, return an array of their intersection. Each element should appear as many times as it shows in both arrays.
Brute Force: O(n * m) Time, O(m) Space
def intersect_brute(nums1, nums2):
    """For each element in nums1, find and remove it from a copy of nums2."""
    result = []
    nums2_copy = list(nums2)
    for num in nums1:
        if num in nums2_copy:        # O(m) search
            result.append(num)
            nums2_copy.remove(num)   # O(m) removal
    return result
Optimal (Hash Map): O(n + m) Time, O(min(n,m)) Space
from collections import Counter

def intersect(nums1, nums2):
    """
    Count frequencies in the smaller array, then iterate
    through the larger array collecting matches.
    """
    # Optimization: count the smaller array
    if len(nums1) > len(nums2):
        return intersect(nums2, nums1)
    freq = Counter(nums1)
    result = []
    for num in nums2:
        if freq.get(num, 0) > 0:
            result.append(num)
            freq[num] -= 1
    return result

# Test
print(intersect([1, 2, 2, 1], [2, 2]))        # [2, 2]
print(intersect([4, 9, 5], [9, 4, 9, 8, 4]))  # [4, 9] or [9, 4]
AI/ML context: Set intersection is used in computing Jaccard similarity between document representations, finding common features between datasets during data integration, and identifying shared vocabulary between different text corpora.
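For example, a minimal Jaccard-similarity sketch over token sets (whitespace tokenization is assumed purely for simplicity):

def jaccard_similarity(doc_a, doc_b):
    """Sketch: Jaccard similarity |A ∩ B| / |A ∪ B| over token sets,
    computed with hash sets in O(n + m)."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

print(jaccard_similarity("the cat sat on the mat", "the cat sat on the log"))
# 0.666... (4 shared tokens out of 6 unique tokens total)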
Problem 5: LRU Cache
Design a data structure that supports get(key) and put(key, value) in O(1) time. When the cache reaches capacity, evict the least recently used item.
Optimal (OrderedDict): O(1) Time per Operation
from collections import OrderedDict

class LRUCache:
    """
    LRU cache using OrderedDict.
    OrderedDict remembers insertion order and supports
    move_to_end() and popitem() in O(1).
    Strategy:
    - On get: move the key to the end (most recently used)
    - On put: add/update the key at the end, evict from the front if over capacity
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, key):
        if key not in self.cache:
            return -1
        # Move to end (most recently used)
        self.cache.move_to_end(key)
        return self.cache[key]

    def put(self, key, value):
        if key in self.cache:
            # Refresh recency before updating the value
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            # Evict least recently used (first item)
            self.cache.popitem(last=False)

# Test
cache = LRUCache(2)
cache.put(1, 1)      # cache: {1: 1}
cache.put(2, 2)      # cache: {1: 1, 2: 2}
print(cache.get(1))  # 1, cache: {2: 2, 1: 1}
cache.put(3, 3)      # evicts key 2, cache: {1: 1, 3: 3}
print(cache.get(2))  # -1 (evicted)
cache.put(4, 4)      # evicts key 1, cache: {3: 3, 4: 4}
print(cache.get(1))  # -1 (evicted)
print(cache.get(3))  # 3
print(cache.get(4))  # 4
From Scratch (Hash Map + Doubly Linked List): O(1) per Operation
class Node:
    """Doubly linked list node."""
    def __init__(self, key=0, val=0):
        self.key = key
        self.val = val
        self.prev = None
        self.next = None

class LRUCacheManual:
    """
    Hash map: key -> Node (O(1) lookup)
    Doubly linked list: maintains access order (O(1) move/remove)
    head -> least recent ... most recent <- tail
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = {}  # key -> Node
        # Dummy head and tail to simplify edge cases
        self.head = Node()
        self.tail = Node()
        self.head.next = self.tail
        self.tail.prev = self.head

    def _remove(self, node):
        """Remove node from the doubly linked list."""
        node.prev.next = node.next
        node.next.prev = node.prev

    def _add_to_end(self, node):
        """Add node right before tail (most recently used)."""
        node.prev = self.tail.prev
        node.next = self.tail
        self.tail.prev.next = node
        self.tail.prev = node

    def get(self, key):
        if key not in self.cache:
            return -1
        node = self.cache[key]
        self._remove(node)
        self._add_to_end(node)
        return node.val

    def put(self, key, value):
        if key in self.cache:
            self._remove(self.cache[key])
        node = Node(key, value)
        self._add_to_end(node)
        self.cache[key] = node
        if len(self.cache) > self.capacity:
            # Remove least recently used (right after head)
            lru = self.head.next
            self._remove(lru)
            del self.cache[lru.key]

# Test
cache = LRUCacheManual(2)
cache.put(1, 1)
cache.put(2, 2)
print(cache.get(1))  # 1
cache.put(3, 3)      # evicts key 2
print(cache.get(2))  # -1
AI/ML context: LRU caches are critical in ML serving systems. Embedding lookup caches for recommendation models, tokenizer caches in NLP pipelines, and model prediction caches all use LRU eviction to balance memory and latency.
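A small sketch of that pattern using the standard library's functools.lru_cache, which implements the same eviction policy; the embedding table here is a stand-in for a real parameter store or feature service:

from functools import lru_cache

# Hypothetical embedding store: in a real serving system this would hit a
# parameter server or disk-backed index; here it is simulated with a dict.
EMBEDDINGS = {"cat": (0.1, 0.3), "dog": (0.2, 0.7)}

@lru_cache(maxsize=2)  # stdlib LRU cache, same eviction policy as above
def get_embedding(token):
    print(f"miss: {token}")  # only printed on a cache miss
    return EMBEDDINGS.get(token, (0.0, 0.0))

get_embedding("cat")  # miss
get_embedding("cat")  # hit, served from the cache
get_embedding("dog")  # miss
get_embedding("unk")  # miss, evicts "cat" (least recently used)
print(get_embedding.cache_info())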
Problem 6: Word Frequency in Corpus
Given a corpus of documents, compute the TF-IDF score of a query word in a given document, using hash maps for the frequency counts.
Complete Solution: O(n * m) Time, O(n * m) Space (n documents, m words per document)
import math
from collections import Counter

def compute_tf(document):
    """
    Term Frequency: count of word / total words in document.
    TF("neural", doc) = (times "neural" appears in doc) / (total words in doc)
    """
    words = document.lower().split()
    word_count = Counter(words)
    total_words = len(words)
    tf = {}
    for word, count in word_count.items():
        tf[word] = count / total_words
    return tf

def compute_idf(documents):
    """
    Inverse Document Frequency:
    IDF(word) = log(total documents / documents containing word)
    Words that appear in many documents get lower IDF (less discriminating).
    Words that appear in few documents get higher IDF (more discriminating).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Use a set to count each word once per document
        unique_words = set(doc.lower().split())
        for word in unique_words:
            doc_freq[word] += 1
    idf = {}
    for word, count in doc_freq.items():
        idf[word] = math.log(n_docs / count)
    return idf

def tf_idf(documents, query_word, doc_index):
    """
    TF-IDF = TF * IDF
    Combines local importance (TF) with global rarity (IDF).
    """
    tf = compute_tf(documents[doc_index])
    idf = compute_idf(documents)
    word = query_word.lower()
    word_tf = tf.get(word, 0)
    word_idf = idf.get(word, 0)
    return word_tf * word_idf

# Test with a mini corpus
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cats and dogs are friends",
    "neural networks process data efficiently"
]

# "the" appears in 3 docs -> low IDF -> low TF-IDF
print(f"TF-IDF('the', doc 0): {tf_idf(documents, 'the', 0):.4f}")
# 0.0959 (common word, low score)

# "neural" appears in 1 doc -> high IDF -> high TF-IDF
print(f"TF-IDF('neural', doc 3): {tf_idf(documents, 'neural', 3):.4f}")
# 0.2773 (rare word, high score)

# "cat" appears in 1 doc
print(f"TF-IDF('cat', doc 0): {tf_idf(documents, 'cat', 0):.4f}")
# 0.2310 (specific to document)

# Word not in document
print(f"TF-IDF('neural', doc 0): {tf_idf(documents, 'neural', 0):.4f}")
# 0.0000 (word not in this document)
AI/ML context: TF-IDF is one of the most fundamental algorithms in NLP and information retrieval. It powers search engines, document classification, and text feature extraction. Understanding its hash-map-based implementation helps you build custom text processing pipelines and debug scikit-learn's TfidfVectorizer.
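If you want to sanity-check the implementation above against scikit-learn (assuming it is installed), keep in mind that TfidfVectorizer will not reproduce these numbers exactly: by default it uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, and then L2-normalizes each document vector. A rough sketch:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(documents)  # `documents` from the test above
vocab = vectorizer.get_feature_names_out()
print(dict(zip(vocab, matrix.toarray()[0].round(3))))  # TF-IDF weights for doc 0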
Summary: Complexity Comparison
| Problem | Brute Force | Optimal | Key Technique |
|---|---|---|---|
| Top K Frequent | O(n log n) | O(n) | Bucket sort by frequency |
| Group Anagrams | O(n² * m log m) | O(n * m) | Character count as hash key |
| Two Sum II | O(n) | O(n) / O(1) space | Two pointers on sorted array |
| Intersection | O(n * m) | O(n + m) | Counter on smaller array |
| LRU Cache | N/A | O(1) per op | OrderedDict or hash + DLL |
| TF-IDF | N/A | O(n * m) | Frequency counting + math |