Hash Maps & Sets
Hash maps (Python dictionaries) and sets provide O(1) average-case lookup, insertion, and deletion. They are the most powerful tool for eliminating nested loops and are tested in nearly every coding interview.
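A quick sketch of why that matters: testing membership against a plain list rescans the list on every query, while a hash set answers the same question in O(1) on average. (The id/query data below is made up purely for illustration.)

ids = [7, 3, 9, 3, 1]
queries = [3, 5, 9]

# O(n * q): the `in` check scans the whole list for every query
hits_slow = [q for q in queries if q in ids]

# O(n + q): build the set once, then each membership test is O(1) on average
id_set = set(ids)
hits_fast = [q for q in queries if q in id_set]

assert hits_slow == hits_fast == [3, 9]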
Problem 1: Top K Frequent Elements
Given an integer array nums and an integer k, return the k most frequent elements. You may return the answer in any order.
Brute Force: O(n log n) Time, O(n) Space
def top_k_frequent_brute(nums, k):
    """Count frequencies, sort by frequency, return top k."""
    freq = {}
    for num in nums:
        freq[num] = freq.get(num, 0) + 1
    # Sort by frequency (descending)
    sorted_items = sorted(freq.items(), key=lambda x: -x[1])
    return [item[0] for item in sorted_items[:k]]

# Test
print(top_k_frequent_brute([1, 1, 1, 2, 2, 3], 2))  # [1, 2]
Optimal (Bucket Sort): O(n) Time, O(n) Space
def top_k_frequent(nums, k):
    """
    Bucket sort approach:
    1. Count frequencies
    2. Create buckets where index = frequency
    3. Collect elements from the highest-frequency buckets
    The maximum frequency is at most n, so we create n + 1 buckets.
    """
    freq = {}
    for num in nums:
        freq[num] = freq.get(num, 0) + 1
    # Buckets: index i holds elements that appear i times
    buckets = [[] for _ in range(len(nums) + 1)]
    for num, count in freq.items():
        buckets[count].append(num)
    # Collect from the highest-frequency buckets
    result = []
    for i in range(len(buckets) - 1, 0, -1):
        for num in buckets[i]:
            result.append(num)
            if len(result) == k:
                return result
    return result

# Test
print(top_k_frequent([1, 1, 1, 2, 2, 3], 2))       # [1, 2]
print(top_k_frequent([1], 1))                      # [1]
print(top_k_frequent([4, 1, -1, 2, -1, 2, 3], 2))  # [-1, 2]
AI/ML context: Finding top-K frequent elements is the core operation in building vocabulary for NLP models. When training a tokenizer, you count word frequencies across the entire corpus and keep only the top-K most frequent tokens.
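As a rough illustration of that pipeline (the toy corpus, the `<unk>` token, and the build_vocab helper below are hypothetical, not taken from any particular library):

from collections import Counter

def build_vocab(corpus, k, specials=("<unk>",)):
    """Sketch: keep the k most frequent tokens plus any special tokens."""
    counts = Counter(token for sentence in corpus for token in sentence.split())
    vocab = list(specials) + [tok for tok, _ in counts.most_common(k)]
    return {tok: idx for idx, tok in enumerate(vocab)}

corpus = ["the cat sat", "the cat ran", "a dog ran"]
print(build_vocab(corpus, k=3))
# e.g. {'<unk>': 0, 'the': 1, 'cat': 2, 'ran': 3} (tie order may vary)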
Problem 2: Group Anagrams
Given an array of strings strs, group the anagrams together. Return the groups in any order.
Brute Force: O(n² * m log m) Time, O(n * m) Space
def group_anagrams_brute(strs):
    """Compare each pair of strings for an anagram relationship."""
    used = [False] * len(strs)
    groups = []
    for i in range(len(strs)):
        if used[i]:
            continue
        group = [strs[i]]
        used[i] = True
        for j in range(i + 1, len(strs)):
            if not used[j] and sorted(strs[i]) == sorted(strs[j]):
                group.append(strs[j])
                used[j] = True
        groups.append(group)
    return groups

print(group_anagrams_brute(["eat", "tea", "tan", "ate", "nat", "bat"]))
# [['eat', 'tea', 'ate'], ['tan', 'nat'], ['bat']]
Optimal (Sorted Key): O(n * m log m) Time, O(n * m) Space
from collections import defaultdict

def group_anagrams(strs):
    """
    Use the sorted string as the hash key. All anagrams produce
    the same sorted string, so they map to the same bucket.
    """
    groups = defaultdict(list)
    for s in strs:
        key = tuple(sorted(s))  # "eat" -> ('a', 'e', 't')
        groups[key].append(s)
    return list(groups.values())

# Test
print(group_anagrams(["eat", "tea", "tan", "ate", "nat", "bat"]))
# [['eat', 'tea', 'ate'], ['tan', 'nat'], ['bat']]
print(group_anagrams([""]))   # [['']]
print(group_anagrams(["a"]))  # [['a']]
Even Better (Character Count Key): O(n * m) Time, O(n * m) Space
def group_anagrams_optimal(strs):
    """
    Use character frequency as the key instead of sorting.
    This avoids the O(m log m) sort per string.
    Assumes lowercase English letters only.
    """
    groups = defaultdict(list)
    for s in strs:
        # Create a frequency tuple as the key
        count = [0] * 26
        for char in s:
            count[ord(char) - ord('a')] += 1
        groups[tuple(count)].append(s)
    return list(groups.values())

# Test
print(group_anagrams_optimal(["eat", "tea", "tan", "ate", "nat", "bat"]))
# [['eat', 'tea', 'ate'], ['tan', 'nat'], ['bat']]
AI/ML context: Grouping similar items by a canonical representation is fundamental in data preprocessing. In NLP, this pattern is used for lemmatization grouping, finding duplicate documents with different word orders, and clustering similar feature vectors.
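A minimal sketch of that idea, reusing the same canonical-key trick to flag documents that contain the same words in a different order (the example documents and the helper name are made up):

from collections import defaultdict

def group_reordered_duplicates(docs):
    """Sketch: documents with the same word multiset (ignoring order and case)
    share a canonical key and land in the same bucket."""
    groups = defaultdict(list)
    for doc in docs:
        key = tuple(sorted(doc.lower().split()))
        groups[key].append(doc)
    return [g for g in groups.values() if len(g) > 1]

docs = ["machine learning rocks", "rocks machine learning", "deep learning rocks"]
print(group_reordered_duplicates(docs))
# [['machine learning rocks', 'rocks machine learning']]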
Problem 3: Two Sum II - Sorted Array
Given a 1-indexed array of integers numbers, sorted in non-decreasing order, and a target, return the 1-based indices of the two numbers that add up to target, ideally using only O(1) extra space.
Hash Map Approach: O(n) Time, O(n) Space
def two_sum_sorted_hash(numbers, target):
    """Works, but uses O(n) space - not optimal for sorted input."""
    seen = {}
    for i, num in enumerate(numbers):
        complement = target - num
        if complement in seen:
            return [seen[complement] + 1, i + 1]
        seen[num] = i
    return []
Optimal (Two Pointers): O(n) Time, O(1) Space
def two_sum_sorted(numbers, target):
    """
    Two-pointer approach for sorted arrays.
    The left pointer starts at the beginning, the right at the end.
    - If the sum is too small: move the left pointer right
    - If the sum is too large: move the right pointer left
    - If the sum equals the target: found it
    """
    left, right = 0, len(numbers) - 1
    while left < right:
        current_sum = numbers[left] + numbers[right]
        if current_sum == target:
            return [left + 1, right + 1]  # 1-indexed
        elif current_sum < target:
            left += 1
        else:
            right -= 1
    return []

# Test
print(two_sum_sorted([2, 7, 11, 15], 9))  # [1, 2]
print(two_sum_sorted([2, 3, 4], 6))       # [1, 3]
print(two_sum_sorted([-1, 0], -1))        # [1, 2]
AI/ML context: Two-pointer techniques on sorted data are used in k-nearest neighbor search, threshold-based classification, and efficiently finding feature pairs that satisfy constraints in feature selection.
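As a hedged sketch of that last use case, here is the same pointer movement counting how many pairs drawn from two sorted cost lists fit under a budget; the scenario and names are illustrative, not a standard API:

def count_pairs_within_budget(costs_a, costs_b, budget):
    """Sketch (hypothetical feature-selection scenario): count pairs whose
    combined cost fits a budget. Both lists are assumed sorted ascending;
    the right pointer only moves left, so the scan is O(n + m)."""
    pairs = 0
    j = len(costs_b) - 1
    for a in costs_a:
        while j >= 0 and a + costs_b[j] > budget:
            j -= 1
        if j < 0:
            break
        pairs += j + 1
    return pairs

print(count_pairs_within_budget([1, 2, 3], [1, 2, 3], 4))  # 6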
Problem 4: Intersection of Two Arrays II
Given two integer arrays nums1 and nums2, return an array of their intersection. Each element should appear as many times as it shows in both arrays.
Brute Force: O(n * m) Time, O(m) Space
def intersect_brute(nums1, nums2):
    """For each element in nums1, find and remove it from a copy of nums2."""
    result = []
    nums2_copy = list(nums2)
    for num in nums1:
        if num in nums2_copy:        # O(m) search
            result.append(num)
            nums2_copy.remove(num)   # O(m) removal
    return result
Optimal (Hash Map): O(n + m) Time, O(min(n,m)) Space
from collections import Counter

def intersect(nums1, nums2):
    """
    Count frequencies in the smaller array, then iterate
    through the larger array collecting matches.
    """
    # Optimization: count the smaller array
    if len(nums1) > len(nums2):
        return intersect(nums2, nums1)
    freq = Counter(nums1)
    result = []
    for num in nums2:
        if freq.get(num, 0) > 0:
            result.append(num)
            freq[num] -= 1
    return result

# Test
print(intersect([1, 2, 2, 1], [2, 2]))        # [2, 2]
print(intersect([4, 9, 5], [9, 4, 9, 8, 4]))  # [4, 9] or [9, 4]
AI/ML context: Set intersection is used in computing Jaccard similarity between document representations, finding common features between datasets during data integration, and identifying shared vocabulary between different text corpora.
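For example, a minimal Jaccard-similarity sketch over token sets (whitespace tokenization is assumed purely for simplicity):

def jaccard_similarity(doc_a, doc_b):
    """Sketch: Jaccard similarity |A ∩ B| / |A ∪ B| over token sets,
    computed with hash sets in O(n + m)."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

print(jaccard_similarity("the cat sat on the mat", "the cat sat on the log"))
# 0.666... (4 shared tokens out of 6 unique tokens total)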
Problem 5: LRU Cache
Design a data structure that supports get(key) and put(key, value) in O(1) time. When the cache reaches capacity, evict the least recently used item.
Optimal (OrderedDict): O(1) Time per Operation
from collections import OrderedDict

class LRUCache:
    """
    LRU cache using OrderedDict.
    OrderedDict remembers insertion order and supports
    move_to_end() and popitem() in O(1).
    Strategy:
    - On get: move the key to the end (most recently used)
    - On put: add/update the key at the end, evict from the front if over capacity
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, key):
        if key not in self.cache:
            return -1
        # Move to end (most recently used)
        self.cache.move_to_end(key)
        return self.cache[key]

    def put(self, key, value):
        if key in self.cache:
            # Refresh recency before updating the value
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            # Evict least recently used (first item)
            self.cache.popitem(last=False)

# Test
cache = LRUCache(2)
cache.put(1, 1)      # cache: {1: 1}
cache.put(2, 2)      # cache: {1: 1, 2: 2}
print(cache.get(1))  # 1, cache: {2: 2, 1: 1}
cache.put(3, 3)      # evicts key 2, cache: {1: 1, 3: 3}
print(cache.get(2))  # -1 (evicted)
cache.put(4, 4)      # evicts key 1, cache: {3: 3, 4: 4}
print(cache.get(1))  # -1 (evicted)
print(cache.get(3))  # 3
print(cache.get(4))  # 4
From Scratch (Hash Map + Doubly Linked List): O(1) per Operation
class Node:
    """Doubly linked list node."""
    def __init__(self, key=0, val=0):
        self.key = key
        self.val = val
        self.prev = None
        self.next = None

class LRUCacheManual:
    """
    Hash map: key -> Node (O(1) lookup)
    Doubly linked list: maintains access order (O(1) move/remove)
    head -> least recent ... most recent <- tail
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = {}  # key -> Node
        # Dummy head and tail to simplify edge cases
        self.head = Node()
        self.tail = Node()
        self.head.next = self.tail
        self.tail.prev = self.head

    def _remove(self, node):
        """Remove node from the doubly linked list."""
        node.prev.next = node.next
        node.next.prev = node.prev

    def _add_to_end(self, node):
        """Add node right before tail (most recently used)."""
        node.prev = self.tail.prev
        node.next = self.tail
        self.tail.prev.next = node
        self.tail.prev = node

    def get(self, key):
        if key not in self.cache:
            return -1
        node = self.cache[key]
        self._remove(node)
        self._add_to_end(node)
        return node.val

    def put(self, key, value):
        if key in self.cache:
            self._remove(self.cache[key])
        node = Node(key, value)
        self._add_to_end(node)
        self.cache[key] = node
        if len(self.cache) > self.capacity:
            # Remove least recently used (right after head)
            lru = self.head.next
            self._remove(lru)
            del self.cache[lru.key]

# Test
cache = LRUCacheManual(2)
cache.put(1, 1)
cache.put(2, 2)
print(cache.get(1))  # 1
cache.put(3, 3)      # evicts key 2
print(cache.get(2))  # -1
AI/ML context: LRU caches are critical in ML serving systems. Embedding lookup caches for recommendation models, tokenizer caches in NLP pipelines, and model prediction caches all use LRU eviction to balance memory and latency.
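A small sketch of that pattern using the standard library's functools.lru_cache, which implements the same eviction policy; the embedding table here is a stand-in for a real parameter store or feature service:

from functools import lru_cache

# Hypothetical embedding store: in a real serving system this would hit a
# parameter server or disk-backed index; here it is simulated with a dict.
EMBEDDINGS = {"cat": (0.1, 0.3), "dog": (0.2, 0.7)}

@lru_cache(maxsize=2)  # stdlib LRU cache, same eviction policy as above
def get_embedding(token):
    print(f"miss: {token}")  # only printed on a cache miss
    return EMBEDDINGS.get(token, (0.0, 0.0))

get_embedding("cat")  # miss
get_embedding("cat")  # hit, served from the cache
get_embedding("dog")  # miss
get_embedding("unk")  # miss, evicts "cat" (least recently used)
print(get_embedding.cache_info())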
Problem 6: Word Frequency in Corpus
Given a corpus of documents, compute the TF-IDF score of a query word in a given document, using hash maps for the frequency counts.
Complete Solution: O(n * m) Time, O(n * m) Space (n documents, m words per document)
import math
from collections import Counter

def compute_tf(document):
    """
    Term Frequency: count of word / total words in document.
    TF("neural", doc) = (times "neural" appears in doc) / (total words in doc)
    """
    words = document.lower().split()
    word_count = Counter(words)
    total_words = len(words)
    tf = {}
    for word, count in word_count.items():
        tf[word] = count / total_words
    return tf

def compute_idf(documents):
    """
    Inverse Document Frequency:
    IDF(word) = log(total documents / documents containing word)
    Words that appear in many documents get lower IDF (less discriminating).
    Words that appear in few documents get higher IDF (more discriminating).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Use a set to count each word once per document
        unique_words = set(doc.lower().split())
        for word in unique_words:
            doc_freq[word] += 1
    idf = {}
    for word, count in doc_freq.items():
        idf[word] = math.log(n_docs / count)
    return idf

def tf_idf(documents, query_word, doc_index):
    """
    TF-IDF = TF * IDF
    Combines local importance (TF) with global rarity (IDF).
    """
    tf = compute_tf(documents[doc_index])
    idf = compute_idf(documents)
    word = query_word.lower()
    word_tf = tf.get(word, 0)
    word_idf = idf.get(word, 0)
    return word_tf * word_idf

# Test with a mini corpus
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cats and dogs are friends",
    "neural networks process data efficiently"
]

# "the" appears in 3 docs -> low IDF -> low TF-IDF
print(f"TF-IDF('the', doc 0): {tf_idf(documents, 'the', 0):.4f}")
# 0.0959 (common word, low score)

# "neural" appears in 1 doc -> high IDF -> high TF-IDF
print(f"TF-IDF('neural', doc 3): {tf_idf(documents, 'neural', 3):.4f}")
# 0.2773 (rare word, high score)

# "cat" appears in 1 doc
print(f"TF-IDF('cat', doc 0): {tf_idf(documents, 'cat', 0):.4f}")
# 0.2310 (specific to document)

# Word not in document
print(f"TF-IDF('neural', doc 0): {tf_idf(documents, 'neural', 0):.4f}")
# 0.0000 (word not in this document)
AI/ML context: TF-IDF is one of the most fundamental algorithms in NLP and information retrieval. It powers search engines, document classification, and text feature extraction. Understanding its hash-map-based implementation helps you build custom text processing pipelines and debug scikit-learn's TfidfVectorizer.
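If you want to sanity-check the implementation above against scikit-learn (assuming it is installed), keep in mind that TfidfVectorizer will not reproduce these numbers exactly: by default it uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, and then L2-normalizes each document vector. A rough sketch:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(documents)  # `documents` from the test above
vocab = vectorizer.get_feature_names_out()
print(dict(zip(vocab, matrix.toarray()[0].round(3))))  # TF-IDF weights for doc 0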
Summary: Complexity Comparison
| Problem | Brute Force | Optimal | Key Technique |
|---|---|---|---|
| Top K Frequent | O(n log n) | O(n) | Bucket sort by frequency |
| Group Anagrams | O(n² * m log m) | O(n * m) | Character count as hash key |
| Two Sum II | O(n) | O(n) / O(1) space | Two pointers on sorted array |
| Intersection | O(n * m) | O(n + m) | Counter on smaller array |
| LRU Cache | N/A | O(1) per op | OrderedDict or hash + DLL |
| TF-IDF | N/A | O(n * m) | Frequency counting + math |