Beginner
Scikit-learn Basics
Master the Scikit-learn API pattern: fit, predict, transform. Learn to load datasets, split data, cross-validate, build pipelines, and persist models.
The Scikit-learn API
Every scikit-learn estimator follows the same consistent pattern:
```python
from sklearn.linear_model import LinearRegression

# 1. Instantiate the model
model = LinearRegression()

# 2. Fit (train) the model
model.fit(X_train, y_train)

# 3. Predict
predictions = model.predict(X_test)

# 4. Evaluate
score = model.score(X_test, y_test)  # R² for regression
```
Estimator types:
fit() + predict() for supervised models, fit() + transform() for preprocessing, and fit_predict() for clustering. This consistency is what makes sklearn so powerful.

Loading Datasets
```python
from sklearn.datasets import load_iris, make_classification

# Built-in datasets
iris = load_iris()
X, y = iris.data, iris.target

# Synthetic datasets
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)
```
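To make the three estimator flavors mentioned earlier concrete, here is a minimal sketch on the iris data: StandardScaler as a fit() + transform() preprocessor and KMeans as a fit_predict() clusterer (the choice of 3 clusters simply matches iris's three species):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Preprocessing: fit() learns the scaling parameters, transform() applies them
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # shorthand for fit() followed by transform()

# Clustering: fit_predict() fits the model and returns cluster labels in one call
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)
```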
Train/Test Split
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```
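The stratify=y argument keeps class proportions identical across the two splits, which matters for imbalanced or small datasets. A quick check on iris (150 samples, 50 per class, so a 20% test split holds exactly 10 of each class):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Each class keeps the same share in train and test as in the full dataset
print(np.bincount(y) / len(y))              # overall class proportions
print(np.bincount(y_train) / len(y_train))  # train proportions
print(np.bincount(y_test) / len(y_test))    # test proportions
```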
Cross-Validation
```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"Mean: {scores.mean():.4f} (+/- {scores.std():.4f})")
```
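cross_val_score reports a single metric. When you want several metrics from one run, the cross_validate function in the same module accepts a list of scorers; a sketch using logistic regression on iris (the model and metric choices here are just illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# One 5-fold run, two metrics; results is a dict of arrays, one entry per fold
results = cross_validate(model, X, y, cv=5, scoring=["accuracy", "f1_macro"])
print(results["test_accuracy"].mean())
print(results["test_f1_macro"].mean())
```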
Pipelines
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression())
])
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
```
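Pipeline steps can be reached by name through named_steps, and nested hyperparameters are addressed with the step__param naming convention (which grid search also uses). A short sketch on iris; the C=0.5 value is an arbitrary illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# Access a fitted step by its name
print(pipe.named_steps["scaler"].mean_)

# Change a nested hyperparameter with the step__param syntax
pipe.set_params(classifier__C=0.5)
```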
Model Persistence
```python
import joblib

# Save model
joblib.dump(pipe, "model.joblib")

# Load model
loaded_model = joblib.load("model.joblib")
predictions = loaded_model.predict(X_new)
```
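A save/load round trip can be verified end to end; the sketch below uses a temporary directory so no file is left behind. Note that joblib files are not guaranteed portable across scikit-learn versions, so reload under the same version you saved with:

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(model, path)    # serialize the fitted model
    loaded = joblib.load(path)  # deserialize it again

# The reloaded model predicts identically to the original
assert (loaded.predict(X) == model.predict(X)).all()
```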
Complete Workflow Example
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Build pipeline
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42))
])

# Cross-validate
scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(f"CV Accuracy: {scores.mean():.4f}")

# Train and evaluate
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(classification_report(y_test, y_pred))
```
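A natural next step from this workflow is searching over hyperparameters. Because the whole pipeline is a single estimator, GridSearchCV can tune any step through the step__param naming convention; a sketch (the candidate values for n_estimators are arbitrary illustrations):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(random_state=42)),
])

# "rf__n_estimators" routes the parameter to the step named "rf"
grid = GridSearchCV(pipe, {"rf__n_estimators": [50, 100]}, cv=3)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))  # best model, refit on all training data
```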
Lilly Tech Systems