Model Training
Master classification, regression, and clustering algorithms in Spark ML, plus hyperparameter tuning with CrossValidator and TrainValidationSplit.
Classification Algorithms
from pyspark.ml.classification import (
    LogisticRegression,
    DecisionTreeClassifier,
    RandomForestClassifier,
    GBTClassifier
)

# Logistic Regression
lr = LogisticRegression(featuresCol="features", labelCol="label",
                        maxIter=100, regParam=0.01)

# Random Forest
rf = RandomForestClassifier(featuresCol="features", labelCol="label",
                            numTrees=100, maxDepth=5)

# Gradient-Boosted Trees
gbt = GBTClassifier(featuresCol="features", labelCol="label",
                    maxIter=50, maxDepth=5)

# Train and predict
model = rf.fit(train_df)
predictions = model.transform(test_df)
Regression Algorithms
from pyspark.ml.regression import (
    LinearRegression,
    DecisionTreeRegressor,
    RandomForestRegressor,
    GBTRegressor
)

# Linear Regression
lr = LinearRegression(featuresCol="features", labelCol="label",
                      maxIter=100, regParam=0.01, elasticNetParam=0.5)

# Random Forest Regressor
rf = RandomForestRegressor(featuresCol="features", labelCol="label",
                           numTrees=100, maxDepth=5)
Clustering Algorithms
from pyspark.ml.clustering import KMeans, BisectingKMeans

# K-Means
kmeans = KMeans(featuresCol="features", k=3, seed=42)
model = kmeans.fit(df)
predictions = model.transform(df)

# Evaluate with silhouette score
from pyspark.ml.evaluation import ClusteringEvaluator

evaluator = ClusteringEvaluator(predictionCol="prediction",
                                featuresCol="features",
                                metricName="silhouette")
score = evaluator.evaluate(predictions)
Hyperparameter Tuning
CrossValidator
CrossValidator performs k-fold cross-validation. It splits data into k folds, trains on k-1 folds, evaluates on the remaining fold, and averages the results.
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Define parameter grid
paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [50, 100, 200]) \
    .addGrid(rf.maxDepth, [3, 5, 10]) \
    .build()

# Define evaluator
evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")

# Create CrossValidator
cv = CrossValidator(
    estimator=pipeline,            # Pipeline or single estimator
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=5,                    # 5-fold cross-validation
    parallelism=4                  # Train 4 models in parallel
)

# Fit CrossValidator
cv_model = cv.fit(train_df)
best_model = cv_model.bestModel
Exam focus: CrossValidator trains numFolds × numParamCombinations models. With 5 folds and 9 parameter combinations (a 3 × 3 grid), it trains 45 models. This is computationally expensive; use TrainValidationSplit for faster (but less robust) tuning.
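The model-count arithmetic can be checked in plain Python, no Spark required. This mirrors the 3 × 3 grid above:

```python
from itertools import product

# Mirrors the grid above: 3 numTrees values x 3 maxDepth values
num_trees_options = [50, 100, 200]
max_depth_options = [3, 5, 10]
num_folds = 5

combos = list(product(num_trees_options, max_depth_options))
total_models = num_folds * len(combos)
print(len(combos), total_models)  # 9 45
```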
TrainValidationSplit
TrainValidationSplit is a faster alternative that uses a single train/validation split instead of k-fold cross-validation.
from pyspark.ml.tuning import TrainValidationSplit

tvs = TrainValidationSplit(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    trainRatio=0.8                 # 80% train, 20% validation
)

tvs_model = tvs.fit(train_df)
best_model = tvs_model.bestModel
CrossValidator vs. TrainValidationSplit: CrossValidator is more robust (uses all data for validation across folds) but slower. TrainValidationSplit is faster but uses only one split. The exam tests when to use each: CrossValidator for small datasets, TrainValidationSplit for large datasets.
Practice Questions
Question 1
A CrossValidator is configured with numFolds=3 and a ParamGridBuilder with 4 parameter combinations. How many models are trained in total?
A) 3
B) 4
C) 7
D) 12
Answer: D — CrossValidator trains numFolds x numParamCombinations models. With 3 folds and 4 parameter combinations: 3 x 4 = 12 models. Each combination is trained 3 times (once per fold).
Question 2
Which Spark ML class should you use for fast hyperparameter tuning on a very large dataset where cross-validation would be too slow?
A) CrossValidator with numFolds=10
B) TrainValidationSplit with trainRatio=0.8
C) ParamGridBuilder alone
D) BinaryClassificationEvaluator
Answer: B — TrainValidationSplit uses a single train/validation split, making it faster than CrossValidator. For large datasets, the single split provides sufficient validation data. ParamGridBuilder (C) defines the grid but does not perform tuning. Evaluator (D) measures metrics but does not tune.
Question 3
Which algorithm does NOT support the Spark ML Pipeline API?
A) LogisticRegression
B) RandomForestClassifier
C) scikit-learn RandomForestClassifier
D) GBTClassifier
Answer: C — scikit-learn's RandomForestClassifier is a separate library that does not implement the Spark ML Estimator/Transformer interface. It cannot be used as a pipeline stage. Only Spark ML classes (from pyspark.ml) work with the Pipeline API.
Question 4
After tuning with CrossValidator, how do you access the best model?
A) cv_model.bestParams
B) cv_model.bestModel
C) cv_model.getModel()
D) cv_model.result
Answer: B — The bestModel attribute of the CrossValidatorModel contains the model trained with the best parameter combination (the one with the best evaluation metric averaged across folds).
Question 5
Which evaluation metric is appropriate for measuring a clustering model's quality in Spark ML?
A) areaUnderROC
B) accuracy
C) silhouette
D) rmse
Answer: C — The silhouette score measures how similar each point is to its own cluster compared to other clusters. ClusteringEvaluator with metricName="silhouette" is the standard for clustering evaluation. areaUnderROC and accuracy are for classification. RMSE is for regression.
Lilly Tech Systems