Intermediate

Scikit-learn

Master the most popular Python ML library. Learn the Pipeline API, preprocessing tools, model training, hyperparameter search, and build a complete end-to-end project.

The Scikit-learn Ecosystem

Scikit-learn (sklearn) is the go-to Python library for classical machine learning. It provides a consistent API across dozens of algorithms, preprocessing tools, model selection utilities, and evaluation metrics. Every sklearn estimator exposes the same small set of methods:

  • fit(X, y) — Learn from training data
  • predict(X) — Make predictions on new data
  • transform(X) — Transform data (for preprocessors)
  • score(X, y) — Evaluate model performance
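This contract is easiest to see on a toy dataset. A minimal sketch (the synthetic data and the choice of LogisticRegression are just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 samples, 5 features
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# A transformer: fit() learns column means/stds, transform() applies them
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# An estimator: fit() learns coefficients; predict() and score() use them
clf = LogisticRegression().fit(X_scaled, y)
preds = clf.predict(X_scaled)
accuracy = clf.score(X_scaled, y)
```

Because every estimator honors this contract, you can swap LogisticRegression for any other classifier without touching the surrounding code.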

The Pipeline API

Pipelines chain multiple steps (preprocessing + model) into a single object. This prevents data leakage (preprocessing statistics are fit only on the training folds, never on validation or test data), simplifies code, and makes deployment easier:

Python (Pipeline)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Simple pipeline: scale, then classify
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100))
])

# Fit and predict in one go
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
score = pipeline.score(X_test, y_test)

# Cross-validation works on entire pipeline
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipeline, X, y, cv=5)
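Pipeline steps are addressed by the names you give them, and nested parameters use a double-underscore convention (`step__param`). This is worth knowing up front, because the hyperparameter grids later in this section rely on it. A quick sketch:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100))
])

# Steps are accessible by name
rf = pipeline.named_steps['classifier']

# Nested parameters use '<step_name>__<param_name>';
# this is the same syntax the param grids below use
pipeline.set_params(classifier__n_estimators=50)
```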

Column Transformer for Mixed Data Types

Real datasets usually mix numerical and categorical features, and each type needs different preprocessing:

Python (ColumnTransformer)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

numeric_features = ['age', 'salary', 'experience']
categorical_features = ['department', 'education']

# Different processing for different column types
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), numeric_features),
        ('cat', Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(handle_unknown='ignore'))
        ]), categorical_features)
    ]
)

# Full pipeline: preprocess + classify
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])
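To see what the transformer actually produces, it helps to fit it on a small frame. A self-contained sketch (the tiny DataFrame and its column names are invented for the example): the two numeric columns come out scaled, and the categorical column expands into one one-hot column per category.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Tiny illustrative frame with missing values in every column
df = pd.DataFrame({
    'age': [25, 32, np.nan, 41],
    'salary': [50000, 64000, 58000, np.nan],
    'department': ['eng', 'sales', np.nan, 'eng'],
})

preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]), ['age', 'salary']),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ]), ['department']),
])

# 2 scaled numeric columns + 2 one-hot columns ('eng', 'sales')
X = preprocessor.fit_transform(df)
```

On sklearn 1.1+, `preprocessor.get_feature_names_out()` recovers the generated column names, which is handy for inspecting feature importances later.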

Hyperparameter Search

Finding the best hyperparameters is crucial for model performance. Sklearn provides two main approaches:

Python (Hyperparameter Tuning)
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Grid Search: try every combination (exhaustive)
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [5, 10, 20, None],
    'classifier__min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    full_pipeline, param_grid,
    cv=5, scoring='f1_weighted', n_jobs=-1
)
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")
best_model = grid_search.best_estimator_

# Random Search: sample random combinations (faster)
from scipy.stats import randint
param_distributions = {
    'classifier__n_estimators': randint(50, 500),
    'classifier__max_depth': randint(3, 30),
    'classifier__min_samples_split': randint(2, 20)
}

random_search = RandomizedSearchCV(
    full_pipeline, param_distributions,
    n_iter=50, cv=5, scoring='f1_weighted',
    random_state=42, n_jobs=-1
)
random_search.fit(X_train, y_train)

When to use which: use GridSearchCV when the parameter space is small (under roughly 100 combinations). Use RandomizedSearchCV for larger spaces; it often finds equally good parameters in a fraction of the time. For even better results, consider Optuna or Bayesian optimization.

Model Persistence

Save trained models for later use or deployment:

Python (Model Saving)
import joblib
import pickle

# joblib (recommended for sklearn models)
joblib.dump(best_model, 'model.joblib')
loaded_model = joblib.load('model.joblib')

# pickle (Python standard library)
with open('model.pkl', 'wb') as f:
    pickle.dump(best_model, f)

with open('model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

# Use the loaded model
predictions = loaded_model.predict(new_data)
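One caveat: pickled models are tied to the sklearn version that produced them, and loading under a different version can fail or warn. A common mitigation is to store the version alongside the model; the bundle below is just a convention for this example, not a sklearn API:

```python
import joblib
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a small model to persist (synthetic data for illustration)
X, y = make_classification(n_samples=100, random_state=0)
model = LogisticRegression().fit(X, y)

# Bundle the model with the sklearn version it was trained under
bundle = {'model': model, 'sklearn_version': sklearn.__version__}
joblib.dump(bundle, 'model_bundle.joblib')

# At load time, compare versions before trusting the model
loaded = joblib.load('model_bundle.joblib')
if loaded['sklearn_version'] != sklearn.__version__:
    print(f"Warning: model trained under sklearn "
          f"{loaded['sklearn_version']}")
model = loaded['model']
```

In production it is safest to pin the sklearn version in your deployment environment to the one used for training.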

Complete End-to-End Example

Python (End-to-End)
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
import joblib

# 1. Load data
df = pd.read_csv('customer_churn.csv')

# 2. Define features and target
target = 'churned'
numeric = ['tenure', 'monthly_charges', 'total_charges']
categorical = ['contract', 'payment_method']

X = df[numeric + categorical]
y = df[target]

# 3. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 4. Build pipeline
preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]), numeric),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='constant',
                                  fill_value='missing')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ]), categorical)
])

model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(
        n_estimators=200, max_depth=5,
        learning_rate=0.1, random_state=42))
])

# 5. Cross-validate
cv_scores = cross_val_score(model, X_train, y_train,
                            cv=5, scoring='f1')
print(f"CV F1: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")

# 6. Train final model
model.fit(X_train, y_train)

# 7. Evaluate on test set
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

# 8. Save model
joblib.dump(model, 'churn_model.joblib')