Intermediate

Unsupervised Learning

Discover patterns in unlabeled data with clustering, dimensionality reduction, anomaly detection, and association rules.

What is Unsupervised Learning?

Unlike supervised learning, unsupervised learning works with unlabeled data. There are no correct answers to learn from — instead, the algorithm discovers hidden structure, patterns, and groupings in the data on its own.

Clustering

Clustering algorithms group similar data points together. Common applications include customer segmentation, document categorization, and image segmentation.

K-Means Clustering

The most popular clustering algorithm. It partitions data into K clusters by iteratively assigning points to the nearest cluster center and updating the centers:

Python (sklearn)
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
import numpy as np

# Generate sample data
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.6, random_state=42)

# Scale features (important for distance-based algorithms)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit K-Means
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)

print(f"Cluster centers:\n{kmeans.cluster_centers_}")
print(f"Inertia: {kmeans.inertia_:.2f}")

# Finding optimal K with the Elbow Method
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
# Plot inertias vs K to find the "elbow"

DBSCAN

Short for Density-Based Spatial Clustering of Applications with Noise. Unlike K-Means, DBSCAN does not require you to specify the number of clusters up front. It finds clusters as dense regions — areas with many nearby points — separated by sparser regions. Key advantages:

  • Automatically determines the number of clusters.
  • Can find clusters of arbitrary shape (not just spherical).
  • Identifies outliers as noise points.
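As a sketch, here is DBSCAN applied to sklearn's two-moons dataset — a non-spherical shape where K-Means struggles. The `eps` and `min_samples` values are illustrative choices, not universal defaults:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
import numpy as np

# Two interleaving half-moons -- clusters of arbitrary shape
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# eps: neighborhood radius; min_samples: points needed to form a dense region
db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit_predict(X_scaled)

# Label -1 marks noise points, not a cluster
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"Clusters found: {n_clusters}, noise points: {n_noise}")
```

Both `eps` and `min_samples` matter: too small an `eps` fragments clusters into noise, too large merges them.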

Hierarchical Clustering

Builds a tree (dendrogram) of clusters by either merging small clusters (agglomerative, bottom-up) or splitting large ones (divisive, top-down). You can cut the dendrogram at any level to get the desired number of clusters. Useful for understanding the hierarchical structure of data.
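A minimal agglomerative example, using scipy's linkage matrix to show the "cut the dendrogram at any level" idea (the blob data and Ward linkage here are illustrative choices):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import linkage, fcluster

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# Agglomerative (bottom-up): repeatedly merge the closest pair of clusters
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)

# The scipy linkage matrix encodes the full merge tree (dendrogram);
# cutting it at different levels yields different numbers of clusters
Z = linkage(X, method="ward")
labels_2 = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
labels_4 = fcluster(Z, t=4, criterion="maxclust")  # cut into 4 clusters
print(len(set(labels)), len(set(labels_2)), len(set(labels_4)))
```

The same tree supports every cut, so you only compute the hierarchy once.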

Dimensionality Reduction

Reducing the number of features while preserving important information. Useful for visualization, denoising, and speeding up other algorithms.

PCA (Principal Component Analysis)

Finds the directions (principal components) of maximum variance in the data and projects the data onto these directions. The first component captures the most variance, the second captures the next most, and so on.

Python (sklearn)
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

# Load high-dimensional data (64 features)
digits = load_digits()
X = digits.data  # Shape: (1797, 64)

# Reduce to 2D for visualization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance preserved: {sum(pca.explained_variance_ratio_):.2%}")

# Keep 95% of variance
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X)
print(f"Components needed for 95% variance: {pca_95.n_components_}")

t-SNE and UMAP

  • t-SNE (t-Distributed Stochastic Neighbor Embedding): Non-linear dimensionality reduction designed for visualization. Preserves local structure — nearby points in high dimensions stay nearby in low dimensions. Computationally expensive; mainly for 2D/3D visualization.
  • UMAP (Uniform Manifold Approximation and Projection): Faster alternative to t-SNE that also preserves more global structure. Increasingly preferred for visualization and as a preprocessing step.
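A brief t-SNE sketch on the digits data (subsampled here purely to keep the run fast — t-SNE is computationally expensive). UMAP is not in sklearn; it lives in the third-party umap-learn package with a similar fit_transform-style API:

```python
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits

digits = load_digits()
X = digits.data[:500]  # subsample 500 of 1797 samples for speed

# perplexity balances local vs global structure; values of 5-50 are typical
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X)
print(X_embedded.shape)
```

Note that t-SNE has no transform method for new data — the embedding is learned for this dataset only, which is one reason UMAP is often preferred as a preprocessing step.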

Anomaly Detection

Identifying data points that deviate significantly from the norm. Critical for fraud detection, network intrusion detection, and manufacturing quality control.

  • Isolation Forest: Isolates anomalies by randomly partitioning features. Anomalies are isolated quickly (fewer partitions needed) because they are rare and different. Fast and effective.
  • One-Class SVM: Learns a boundary around normal data. Points outside the boundary are anomalies. Works well in high dimensions.

Python (Anomaly Detection)
from sklearn.ensemble import IsolationForest
import numpy as np

# Generate normal data with some anomalies
np.random.seed(42)
X_normal = np.random.randn(200, 2)
X_anomaly = np.random.uniform(-4, 4, (10, 2))
X = np.vstack([X_normal, X_anomaly])

# Fit Isolation Forest
iso_forest = IsolationForest(contamination=0.05,
                             random_state=42)
predictions = iso_forest.fit_predict(X)
# -1 = anomaly, 1 = normal

anomalies = X[predictions == -1]
print(f"Detected {len(anomalies)} anomalies")
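For comparison, a One-Class SVM sketch under the same setup — trained on normal data only, then asked to flag new points. The `nu` value here is an illustrative choice:

```python
from sklearn.svm import OneClassSVM
import numpy as np

np.random.seed(42)
X_train = np.random.randn(200, 2)  # normal data only
X_test = np.vstack([np.random.randn(20, 2),          # more normal points
                    np.random.uniform(-4, 4, (5, 2))])  # likely anomalies

# nu upper-bounds the fraction of training points treated as outliers
oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
oc_svm.fit(X_train)
preds = oc_svm.predict(X_test)  # -1 = anomaly, 1 = normal
print(f"Flagged {int((preds == -1).sum())} of {len(X_test)} test points")
```

Unlike Isolation Forest's `contamination`, which assumes anomalies are present in the fitting data, One-Class SVM is naturally suited to the case where you can only collect normal examples.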

Association Rules

Discover relationships between items in transactional data. The classic example is market basket analysis: "customers who buy bread and butter also tend to buy milk." Key metrics:

  • Support: How frequently an itemset appears in the data.
  • Confidence: How often a rule is true (given A, how often B?).
  • Lift: How much more likely B is given A, compared to B alone. Lift > 1 means a positive association.
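The three metrics above can be computed by hand on a toy basket dataset — a sketch with made-up transactions, no library needed:

```python
# Toy market-basket data: each transaction is a set of items
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"butter"},
    {"bread", "butter", "milk"},
    {"eggs"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

# Rule: {bread, butter} -> {milk}
antecedent, consequent = {"bread", "butter"}, {"milk"}
supp = support(antecedent | consequent)       # how often all items co-occur
conf = supp / support(antecedent)             # P(milk | bread, butter)
lift = conf / support(consequent)             # vs. P(milk) alone
print(f"support={supp:.2f}, confidence={conf:.2f}, lift={lift:.2f}")
```

Here lift is 1.5 (> 1), so buying bread and butter is positively associated with buying milk; a lift below 1 would indicate the opposite.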

Key difference from supervised learning: There is no "correct answer" in unsupervised learning. Evaluation is harder — you use metrics like silhouette score for clustering or explained variance for PCA, but ultimately domain expertise is needed to judge whether the discovered patterns are useful.
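As a sketch of that evaluation in practice, the silhouette score can help compare candidate values of K for K-Means (using blob data like the earlier example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4,
                  cluster_std=0.6, random_state=42)

# Silhouette score ranges from -1 to 1:
# near 1 = well-separated clusters, near 0 = overlapping, negative = misassigned
scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")
```

On clean synthetic blobs the true K tends to score highest, but on real data the metric is only a guide — different K values can score similarly, and domain knowledge still decides.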