Unsupervised Learning
Discover patterns in unlabeled data with clustering, dimensionality reduction, anomaly detection, and association rules.
What is Unsupervised Learning?
Unlike supervised learning, unsupervised learning works with unlabeled data. There are no correct answers to learn from — instead, the algorithm discovers hidden structure, patterns, and groupings in the data on its own.
Clustering
Clustering algorithms group similar data points together. Common applications include customer segmentation, document categorization, and image segmentation.
K-Means Clustering
K-Means is the most popular clustering algorithm. It partitions data into K clusters by iteratively assigning each point to the nearest cluster center and then updating each center to the mean of its assigned points:
```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
import numpy as np

# Generate sample data
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Scale features (important for distance-based algorithms)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit K-Means
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)

print(f"Cluster centers:\n{kmeans.cluster_centers_}")
print(f"Inertia: {kmeans.inertia_:.2f}")

# Finding optimal K with the Elbow Method
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
# Plot inertias vs. K to find the "elbow"
```
DBSCAN
Density-Based Spatial Clustering of Applications with Noise. Unlike K-Means, DBSCAN does not require specifying the number of clusters. It finds clusters based on density — areas with many nearby points. Key advantages:
- Automatically determines the number of clusters.
- Can find clusters of arbitrary shape (not just spherical).
- Identifies outliers as noise points.
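A minimal sketch of these properties, using scikit-learn's `DBSCAN` on the two-moons dataset (an assumed sample shape that K-Means would split poorly). The `eps` and `min_samples` values here are illustrative choices, not universal defaults:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
import numpy as np

# Two interleaving half-moons: non-spherical clusters
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# eps: neighborhood radius; min_samples: points needed for a dense region
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

# Label -1 marks noise points; all other labels are cluster IDs
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"Clusters found: {n_clusters}, noise points: {n_noise}")
```

Note that the number of clusters was never specified; it falls out of the density parameters.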
Hierarchical Clustering
Builds a tree (dendrogram) of clusters by either merging small clusters (agglomerative, bottom-up) or splitting large ones (divisive, top-down). You can cut the dendrogram at any level to get the desired number of clusters. Useful for understanding the hierarchical structure of data.
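A short sketch of the agglomerative (bottom-up) variant, assuming blob-shaped sample data; `scipy.cluster.hierarchy.linkage` builds the full merge tree that a dendrogram plot would display:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import linkage
import numpy as np

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# Agglomerative clustering with Ward linkage, cut at 3 clusters
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
print(f"Cluster sizes: {np.bincount(labels)}")

# Full linkage matrix: one row per merge, (n_samples - 1) merges total
Z = linkage(X, method="ward")
print(f"Linkage matrix shape: {Z.shape}")
# scipy.cluster.hierarchy.dendrogram(Z) would plot the tree
```

Cutting the same linkage matrix at a different height yields a different number of clusters without refitting.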
Dimensionality Reduction
Reducing the number of features while preserving important information. Useful for visualization, denoising, and speeding up other algorithms.
PCA (Principal Component Analysis)
Finds the directions (principal components) of maximum variance in the data and projects the data onto these directions. The first component captures the most variance, the second captures the next most, and so on.
```python
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

# Load high-dimensional data (64 features)
digits = load_digits()
X = digits.data  # Shape: (1797, 64)

# Reduce to 2D for visualization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance preserved: {sum(pca.explained_variance_ratio_):.2%}")

# Keep 95% of variance
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X)
print(f"Components needed for 95% variance: {pca_95.n_components_}")
```
t-SNE and UMAP
- t-SNE (t-Distributed Stochastic Neighbor Embedding): Non-linear dimensionality reduction designed for visualization. Preserves local structure — nearby points in high dimensions stay nearby in low dimensions. Computationally expensive; mainly for 2D/3D visualization.
- UMAP (Uniform Manifold Approximation and Projection): Faster alternative to t-SNE that also preserves more global structure. Increasingly preferred for visualization and as a preprocessing step.
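A minimal t-SNE sketch on a subsample of the digits data (subsampled here only to keep runtime short; full-dataset embedding works the same way). UMAP is not part of scikit-learn; it lives in the separate `umap-learn` package, whose `umap.UMAP` class exposes a similar `fit_transform` interface:

```python
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits

digits = load_digits()
X = digits.data[:500]  # subsample for speed

# Embed to 2D; perplexity balances local vs. global neighborhood size
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X)
print(f"Embedded shape: {X_2d.shape}")
```

The resulting 2D coordinates are meaningful only relative to each other; unlike PCA, t-SNE has no `transform` for new points and distances between far-apart clusters are not directly interpretable.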
Anomaly Detection
Identifying data points that deviate significantly from the norm. Critical for fraud detection, network intrusion detection, and manufacturing quality control.
- Isolation Forest: Isolates anomalies by randomly partitioning features. Anomalies are isolated quickly (fewer partitions needed) because they are rare and different. Fast and effective.
- One-Class SVM: Learns a boundary around normal data. Points outside the boundary are anomalies. Works well in high dimensions.
```python
from sklearn.ensemble import IsolationForest
import numpy as np

# Generate normal data with some anomalies
np.random.seed(42)
X_normal = np.random.randn(200, 2)
X_anomaly = np.random.uniform(-4, 4, (10, 2))
X = np.vstack([X_normal, X_anomaly])

# Fit Isolation Forest
iso_forest = IsolationForest(contamination=0.05, random_state=42)
predictions = iso_forest.fit_predict(X)  # -1 = anomaly, 1 = normal

anomalies = X[predictions == -1]
print(f"Detected {len(anomalies)} anomalies")
```
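One-Class SVM can be sketched the same way. Here it trains on normal data only and then flags held-out points; the injected outliers are an assumed example, and `nu` (the bound on the training outlier fraction) is an illustrative choice:

```python
from sklearn.svm import OneClassSVM
import numpy as np

np.random.seed(42)
X_train = np.random.randn(200, 2)  # normal data only

# Test set: 20 normal points plus 5 far-away outliers
X_test = np.vstack([
    np.random.randn(20, 2),
    np.random.uniform(4, 6, (5, 2)),
])

# nu bounds the fraction of training points treated as outliers
oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
oc_svm.fit(X_train)

preds = oc_svm.predict(X_test)  # -1 = anomaly, 1 = normal
print(f"Anomalies flagged in test set: {int(np.sum(preds == -1))}")
```

Unlike Isolation Forest, One-Class SVM learns an explicit boundary, so it benefits from feature scaling just as other kernel methods do.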
Association Rules
Discover relationships between items in transactional data. The classic example is market basket analysis: "customers who buy bread and butter also tend to buy milk." Key metrics:
- Support: How frequently an itemset appears in the data.
- Confidence: How often a rule is true (given A, how often B?).
- Lift: How much more likely B is given A, compared to B alone. Lift > 1 means a positive association.
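The three metrics can be computed directly from their definitions. This sketch uses a tiny made-up basket dataset and the rule {bread, butter} → {milk}; libraries such as `mlxtend` automate the itemset search (Apriori) over real data:

```python
# Tiny made-up market-basket dataset (each transaction is a set of items)
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread"},
    {"butter"},
    {"milk", "eggs"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / n

# Rule: {bread, butter} -> {milk}
antecedent = {"bread", "butter"}
consequent = {"milk"}

sup = support(antecedent | consequent)          # P(A and B)
conf = sup / support(antecedent)                # P(B | A)
lift = conf / support(consequent)               # P(B | A) / P(B)

print(f"support={sup:.2f}, confidence={conf:.2f}, lift={lift:.2f}")
```

Here lift is above 1, so buying bread and butter makes milk more likely than its baseline rate, matching the intuition behind the rule.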