Beginner

Scikit-learn Basics

Master the Scikit-learn API pattern: fit, predict, transform. Learn to load datasets, split data, cross-validate, build pipelines, and persist models.

The Scikit-learn API

Every sklearn estimator follows the same four-step pattern: instantiate, fit, predict, evaluate.

Python
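from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative only) so the snippet runs on its own
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 1. Instantiate the model
model = LinearRegression()

# 2. Fit (train) the model
model.fit(X_train, y_train)

# 3. Predict
predictions = model.predict(X_test)

# 4. Evaluate
score = model.score(X_test, y_test)  # R² for regression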
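💡 Estimator types: supervised models (classifiers and regressors) expose fit() and predict(); preprocessing transformers expose fit() and transform(), plus fit_transform() as a shortcut; clustering estimators also offer fit_predict(), which fits the model and returns cluster labels in one call. Because every estimator shares this interface, you can swap one model for another or chain several in a pipeline without changing the surrounding code.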
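As a rough illustration of the two non-predict patterns, the sketch below runs a transformer (StandardScaler) and a clustering estimator (KMeans) on a tiny hand-made array. The toy data and the choice of two clusters are assumptions made for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two well-separated groups of 2-D points
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 9.0], [8.5, 9.5], [9.0, 8.8]])

# Transformer: fit() learns per-column mean and standard deviation,
# transform() applies them to produce standardized features
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)  # same result as scaler.fit_transform(X)

# Clustering: fit_predict() fits the model and returns one label per sample
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

print("Column means after scaling:", X_scaled.mean(axis=0))
print("Cluster labels:", labels)
```

The label values themselves (0 or 1) are arbitrary; what matters is which samples share a label.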

Loading Datasets

Python
from sklearn.datasets import load_iris, make_classification

# Built-in datasets
iris = load_iris()
X, y = iris.data, iris.target

# Synthetic datasets
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)
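Before modelling, it usually helps to check what a loader actually returned. The sketch below inspects the iris Bunch object and the synthetic arrays; the variable names (X_iris, X_syn, and so on) are chosen here for illustration and do not appear elsewhere in this tutorial.

```python
import numpy as np
from sklearn.datasets import load_iris, make_classification

# load_iris() returns a Bunch: a dict-like object with arrays and metadata
iris = load_iris()
print(iris.data.shape)       # (n_samples, n_features)
print(iris.feature_names)    # column names for iris.data
print(iris.target_names)     # class names, indexed by the integer labels

# return_X_y=True skips the Bunch and returns the two arrays directly
X_iris, y_iris = load_iris(return_X_y=True)
iris_class_counts = np.bincount(y_iris)
print("Iris samples per class:", iris_class_counts)

# make_classification generates a binary problem by default
X_syn, y_syn = make_classification(n_samples=1000, n_features=20,
                                   n_informative=10, random_state=42)
syn_class_counts = np.bincount(y_syn)
print("Synthetic samples per class:", syn_class_counts)
```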

Train/Test Split

Python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
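The stratify=y argument asks the splitter to keep class proportions roughly equal in the training and test sets. A minimal check, assuming the iris data as a stand-in for X and y:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Iris is used here only as a convenient balanced example
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Count samples per class in each split
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)

print("Train size:", X_train.shape[0], "per class:", train_counts)
print("Test size:", X_test.shape[0], "per class:", test_counts)
```

Without stratify, a small test set can end up with a noticeably different class mix from the training set, which makes test scores harder to interpret.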

Cross-Validation

Python
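from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# scoring="accuracy" needs a classifier, not the LinearRegression from earlier
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"Mean: {scores.mean():.4f} (+/- {scores.std():.4f})")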
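When more than one metric is needed, or when the folds should be shuffled, cross_validate with an explicit splitter is one option. The sketch below is an assumption-laden example: it uses iris, a LogisticRegression classifier, and the accuracy and macro-F1 metrics, none of which are prescribed by the text above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# Explicit splitter: 5 stratified folds, shuffled with a fixed seed
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# cross_validate returns a dict with one "test_<metric>" array per metric,
# plus fit_time and score_time
results = cross_validate(clf, X, y, cv=cv, scoring=["accuracy", "f1_macro"])

for key in ("test_accuracy", "test_f1_macro"):
    values = results[key]
    print(f"{key}: {values.mean():.4f} (+/- {values.std():.4f})")
```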

Pipelines

Python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression())
])

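# fit() fits the scaler on X_train, transforms it, then fits the classifier
pipe.fit(X_train, y_train)
# score() applies the already-fitted scaler to X_test, then reports accuracy
score = pipe.score(X_test, y_test)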
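A pipeline behaves like a single estimator, so it can be cross-validated as a whole, which refits the scaler inside each fold rather than on all the data. The sketch below also shows how step names are used to set parameters and to reach fitted steps. The iris data and the value C=0.5 are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Parameters of a step are addressed as <step name>__<parameter name>
pipe.set_params(classifier__C=0.5)

# The scaler is refit on the training portion of every fold
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.4f}")

# After fitting, individual steps are available through named_steps
pipe.fit(X, y)
fitted_scaler = pipe.named_steps["scaler"]
print("Feature means learned by the scaler:", fitted_scaler.mean_)
```

The same step__parameter naming is what grid-search utilities expect when tuning a pipeline.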

Model Persistence

Python
import joblib

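# Save the fitted pipeline (preprocessing and model together)
joblib.dump(pipe, "model.joblib")

# Load it back and predict
loaded_model = joblib.load("model.joblib")
X_new = X_test[:5]  # stand-in for new, unseen samples with the same features
predictions = loaded_model.predict(X_new)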
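A quick sanity check after saving is to reload the file and confirm that predictions match the original model. The sketch below does this in a temporary directory; the data, the pipeline, and the file name are assumptions made for the example. Note that joblib files are pickle-based, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that wrote them.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(pipe, path)
    loaded_model = joblib.load(path)

# The reloaded pipeline should reproduce the original predictions exactly
same_predictions = np.array_equal(pipe.predict(X), loaded_model.predict(X))
print("Predictions identical after reload:", same_predictions)
```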

Complete Workflow Example

Python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load data
X, y = load_iris(return_X_y=True)
# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Build pipeline
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42))
])

# Cross-validate
scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(f"CV Accuracy: {scores.mean():.4f}")

# Train and evaluate
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(classification_report(y_test, y_pred))