Logistic Regression Deep Dive
Despite its name, Logistic Regression is a classification algorithm — the go-to method for predicting categories like spam/not-spam, sick/healthy, or buy/don't-buy.
Why Linear Regression Fails for Classification
If you try to use Linear Regression for a binary classification problem (predicting 0 or 1), you run into several issues:
- Unbounded output: Linear Regression can predict values like -3.5 or 7.2, but probabilities must be between 0 and 1.
- Sensitive to outliers: A single extreme data point can shift the decision boundary dramatically.
- No probability interpretation: Predictions don't represent meaningful probabilities.
The Sigmoid Function
The sigmoid (logistic) function is the heart of Logistic Regression:
sigmoid(z) = 1 / (1 + e^(-z))
Where z = W^T * X + b (the linear combination, same as Linear Regression)
Properties:
- Output is always between 0 and 1
- sigmoid(0) = 0.5 (the decision boundary)
- sigmoid(large positive) approaches 1
- sigmoid(large negative) approaches 0
- S-shaped curve, smooth and differentiable
Interpretation:
P(y=1|X) = sigmoid(W^T * X + b)
P(y=0|X) = 1 - P(y=1|X)
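The sigmoid's properties above can be checked numerically. A minimal NumPy sketch (the function name `sigmoid` is our own, not from a library):

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))     # 0.5 -- the decision boundary
print(sigmoid(10))    # close to 1
print(sigmoid(-10))   # close to 0
```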
Log-Odds and Decision Boundary
Taking the log-odds (logit) of the predicted probability recovers the linear model hidden inside the sigmoid:
log(p / (1-p)) = W^T * X + b
Where:
p = probability of class 1
p/(1-p) = odds of class 1
log(p/(1-p)) = log-odds (logit)
Decision Boundary:
- Predict class 1 if P(y=1|X) >= 0.5 (i.e., z >= 0)
- Predict class 0 if P(y=1|X) < 0.5 (i.e., z < 0)
- The boundary is where W^T * X + b = 0
- This forms a linear boundary (line in 2D, plane in 3D)
- You can adjust the threshold (e.g., 0.3) for imbalanced classes
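The logit/sigmoid round-trip and the adjustable threshold can both be illustrated in a few lines (a sketch; the probability values are made up for demonstration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Probability -> log-odds and back: the logit is the inverse of the sigmoid
p = 0.8
z = np.log(p / (1 - p))            # logit(0.8), about 1.386
print(np.isclose(sigmoid(z), p))   # True

# Thresholding: default cut at 0.5, lowered to 0.3 for an imbalanced task
probs = np.array([0.2, 0.35, 0.5, 0.9])
print((probs >= 0.5).astype(int))  # [0 0 1 1]
print((probs >= 0.3).astype(int))  # [0 1 1 1]
```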
Cost Function: Binary Cross-Entropy
MSE is a poor fit for Logistic Regression: composing the squared error with the sigmoid yields a non-convex cost surface, so gradient descent can stall in local minima. Instead, we use Binary Cross-Entropy (Log Loss):
Cost = -(1/n) * SUM[y_i * log(p_i) + (1-y_i) * log(1-p_i)]
Intuition:
When y=1: Cost = -log(p) -> high cost if p is near 0
When y=0: Cost = -log(1-p) -> high cost if p is near 1
Properties:
- Convex: guaranteed to find global minimum
- Penalizes confident wrong predictions heavily
- Equivalent to maximizing likelihood (MLE)
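The cost formula above translates directly into code. A sketch (the toy labels and probabilities are invented to show that confident wrong predictions are punished):

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # Clip predictions away from 0 and 1 to avoid log(0)
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y    = np.array([1, 0, 1, 0])
good = np.array([0.9, 0.1, 0.8, 0.2])  # confident and correct
bad  = np.array([0.1, 0.9, 0.2, 0.8])  # confident and wrong

print(f"good predictions: {binary_cross_entropy(y, good):.4f}")  # low cost
print(f"bad predictions:  {binary_cross_entropy(y, bad):.4f}")   # high cost
```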
Multi-Class Classification
Logistic Regression extends beyond binary classification using two strategies:
Softmax (Multinomial)
P(y=k|X) = e^(z_k) / SUM(e^(z_j)) for all classes j
# Each class gets its own weight vector
# Outputs sum to 1 (valid probability distribution)
# Used when classes are mutually exclusive
# sklearn: LogisticRegression(multi_class='multinomial')
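The softmax formula above can be implemented directly; a minimal sketch (subtracting the max is a standard numerical-stability trick, not part of the formula):

```python
import numpy as np

def softmax(z):
    # Shift by the max score; the result is mathematically unchanged
    # but avoids overflow in np.exp for large scores
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p)                          # largest score -> largest probability
print(np.isclose(p.sum(), 1.0))  # True: a valid probability distribution
```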
One-vs-Rest (OvR)
# Train K separate binary classifiers:
# Classifier 1: class 1 vs. all others
# Classifier 2: class 2 vs. all others
# ...
# Classifier K: class K vs. all others
# Predict: class with highest probability
# sklearn: LogisticRegression(multi_class='ovr')
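The OvR scheme can also be hand-rolled from K binary classifiers, which makes the mechanics explicit. A sketch using sklearn's binary LogisticRegression on the iris dataset (the dataset choice and training-set accuracy check are our own illustration):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# One binary classifier per class: "class k" vs. "everything else"
clfs = []
for k in classes:
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, (y == k).astype(int))
    clfs.append(clf)

# Predict: pick the class whose binary classifier is most confident
scores = np.column_stack([clf.predict_proba(X)[:, 1] for clf in clfs])
y_pred = classes[scores.argmax(axis=1)]
print(f"OvR training accuracy: {(y_pred == y).mean():.3f}")
```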
Regularization
Just like Linear Regression, Logistic Regression benefits from regularization to prevent overfitting:
- L2 (Ridge): Default in sklearn. Shrinks coefficients, keeps all features. penalty='l2'
- L1 (Lasso): Produces sparse models, automatic feature selection. penalty='l1', solver='liblinear'
- ElasticNet: Combines L1 and L2. penalty='elasticnet', solver='saga', l1_ratio=0.5
- C parameter: Inverse of regularization strength. Smaller C = stronger regularization. Default: C=1.0
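The effect of C can be checked empirically: as C shrinks, the L2 penalty dominates and coefficients are pulled toward zero. A sketch on the breast-cancer data (the specific C values are arbitrary illustrations):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Smaller C -> stronger regularization -> smaller average coefficient size
norms = []
for C in [100.0, 1.0, 0.01]:
    model = LogisticRegression(C=C, penalty='l2', max_iter=1000).fit(X, y)
    norms.append(np.abs(model.coef_).mean())
    print(f"C={C:>6}: mean |coef| = {norms[-1]:.3f}")
```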
Evaluation Metrics for Classification
| Metric | Formula | Use When |
|---|---|---|
| Accuracy | (TP+TN) / Total | Balanced classes; quick overview |
| Precision | TP / (TP+FP) | Cost of false positives is high (spam filter) |
| Recall | TP / (TP+FN) | Cost of false negatives is high (disease detection) |
| F1 Score | 2 * (Prec * Rec) / (Prec + Rec) | Need balance between precision and recall |
| ROC-AUC | Area under ROC curve | Overall model ranking ability; threshold-independent |
TP (True Positive): Predicted positive, actually positive
TN (True Negative): Predicted negative, actually negative
FP (False Positive): Predicted positive, actually negative (Type I error)
FN (False Negative): Predicted negative, actually positive (Type II error)
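The formulas in the table reduce to simple arithmetic on the four confusion-matrix counts. A worked sketch (the counts are invented for illustration):

```python
# Computing the table's metrics from raw confusion-matrix counts
TP, TN, FP, FN = 80, 90, 10, 20
total = TP + TN + FP + FN

accuracy  = (TP + TN) / total
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # 0.850
print(f"Precision: {precision:.3f}")  # 0.889
print(f"Recall:    {recall:.3f}")     # 0.800
print(f"F1:        {f1:.3f}")         # 0.842
```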
Python Implementation
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, confusion_matrix,
    classification_report, roc_curve
)
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
# Load real dataset: Breast Cancer Wisconsin
data = load_breast_cancer()
X, y = data.data, data.target
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Classes: {data.target_names}") # ['malignant', 'benign']
print(f"Class distribution: {np.bincount(y)}")
# Preprocessing: scale features (important for Logistic Regression)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)
# Train Logistic Regression
model = LogisticRegression(C=1.0, penalty='l2', max_iter=1000)
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1] # probability of class 1
# --- Evaluation Metrics ---
print("\n--- Classification Report ---")
print(classification_report(y_test, y_pred, target_names=data.target_names))
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
# --- Confusion Matrix ---
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(f" TN={cm[0][0]} FP={cm[0][1]}")
print(f" FN={cm[1][0]} TP={cm[1][1]}")
# --- ROC Curve Visualization ---
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, 'b-', linewidth=2,
         label=f'Logistic Regression (AUC = {roc_auc_score(y_test, y_prob):.3f})')
plt.plot([0, 1], [0, 1], 'r--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Breast Cancer Classification')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# --- Top Features ---
importance = np.abs(model.coef_[0])
top_features = np.argsort(importance)[::-1][:10]
print("\nTop 10 Most Important Features:")
for i, idx in enumerate(top_features):
    print(f"  {i+1}. {data.feature_names[idx]:<25} coef={model.coef_[0][idx]:+.4f}")
Note: feature scaling matters. The example above standardizes the inputs with StandardScaler; without scaling, features with large ranges will dominate the model.
Lilly Tech Systems