Scikit-learn
Master the most popular Python ML library. Learn the Pipeline API, preprocessing tools, model training, hyperparameter search, and build a complete end-to-end project.
The Scikit-learn Ecosystem
Scikit-learn (sklearn) is the go-to Python library for classical machine learning. It provides a consistent API for dozens of algorithms, preprocessing tools, model selection utilities, and evaluation metrics. Every sklearn object follows the same pattern:
- fit(X, y) — Learn from training data
- predict(X) — Make predictions on new data
- transform(X) — Transform data (for preprocessors)
- score(X, y) — Evaluate model performance
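This uniform pattern means any estimator is used the same way; a minimal sketch using LogisticRegression and StandardScaler on sklearn's built-in iris dataset (chosen here just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Transformers implement fit() and transform()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # fit + transform in one call

# Estimators implement fit(), predict(), and score()
clf = LogisticRegression(max_iter=1000)
clf.fit(X_scaled, y)
preds = clf.predict(X_scaled)
accuracy = clf.score(X_scaled, y)
print(f"Training accuracy: {accuracy:.3f}")
```

Because every estimator exposes the same four methods, you can swap LogisticRegression for any other classifier without changing the surrounding code.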
The Pipeline API
Pipelines chain multiple steps (preprocessing + model) into a single object. This prevents data leakage, simplifies code, and makes deployment easier:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Simple pipeline: scale, then classify
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100))
])

# Fit and predict in one go
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
score = pipeline.score(X_test, y_test)

# Cross-validation works on the entire pipeline
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipeline, X, y, cv=5)
```
Column Transformer for Mixed Data Types
Real datasets have both numerical and categorical features that need different preprocessing:
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

numeric_features = ['age', 'salary', 'experience']
categorical_features = ['department', 'education']

# Different processing for different column types
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), numeric_features),
        ('cat', Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(handle_unknown='ignore'))
        ]), categorical_features)
    ]
)

# Full pipeline: preprocess + classify
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])
```
Hyperparameter Search
Finding the best hyperparameters is crucial for model performance. Sklearn provides two main approaches:
```python
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Grid Search: try every combination (exhaustive)
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [5, 10, 20, None],
    'classifier__min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    full_pipeline, param_grid, cv=5,
    scoring='f1_weighted', n_jobs=-1
)
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")
best_model = grid_search.best_estimator_

# Random Search: sample random combinations (faster)
from scipy.stats import randint, uniform

param_distributions = {
    'classifier__n_estimators': randint(50, 500),
    'classifier__max_depth': randint(3, 30),
    'classifier__min_samples_split': randint(2, 20)
}

random_search = RandomizedSearchCV(
    full_pipeline, param_distributions, n_iter=50, cv=5,
    scoring='f1_weighted', random_state=42, n_jobs=-1
)
random_search.fit(X_train, y_train)
```
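Either search object also records per-candidate scores in its cv_results_ attribute, which is handy for seeing how close the runner-up settings were. A self-contained sketch (using a deliberately small grid on the iris dataset, rather than the full pipeline above, so it runs on its own):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': [10, 50], 'max_depth': [3, None]},
    cv=3, scoring='f1_weighted', n_jobs=-1
)
search.fit(X, y)

# cv_results_ holds one row per hyperparameter combination
results = pd.DataFrame(search.cv_results_)
cols = ['param_n_estimators', 'param_max_depth',
        'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score'))
```

Comparing mean_test_score against std_test_score across candidates tells you whether the "best" configuration is meaningfully better or within noise of its neighbors.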
Model Persistence
Save trained models for later use or deployment:
```python
import joblib
import pickle

# joblib (recommended for sklearn models)
joblib.dump(best_model, 'model.joblib')
loaded_model = joblib.load('model.joblib')

# pickle (Python standard library)
with open('model.pkl', 'wb') as f:
    pickle.dump(best_model, f)
with open('model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

# Use the loaded model
predictions = loaded_model.predict(new_data)
```
Complete End-to-End Example
```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
import joblib

# 1. Load data
df = pd.read_csv('customer_churn.csv')

# 2. Define features and target
target = 'churned'
numeric = ['tenure', 'monthly_charges', 'total_charges']
categorical = ['contract', 'payment_method']
X = df[numeric + categorical]
y = df[target]

# 3. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 4. Build pipeline
preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]), numeric),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ]), categorical)
])

model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(
        n_estimators=200, max_depth=5,
        learning_rate=0.1, random_state=42))
])

# 5. Cross-validate
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1')
print(f"CV F1: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")

# 6. Train final model
model.fit(X_train, y_train)

# 7. Evaluate on test set
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

# 8. Save model
joblib.dump(model, 'churn_model.joblib')
```