Model Training
Master classification, regression, and clustering algorithms in Spark ML, plus hyperparameter tuning with CrossValidator and TrainValidationSplit.
Classification Algorithms
from pyspark.ml.classification import (
    LogisticRegression,
    DecisionTreeClassifier,
    RandomForestClassifier,
    GBTClassifier
)

# Logistic Regression
lr = LogisticRegression(featuresCol="features", labelCol="label",
                        maxIter=100, regParam=0.01)

# Random Forest
rf = RandomForestClassifier(featuresCol="features", labelCol="label",
                            numTrees=100, maxDepth=5)

# Gradient-Boosted Trees
gbt = GBTClassifier(featuresCol="features", labelCol="label",
                    maxIter=50, maxDepth=5)

# Train and predict
model = rf.fit(train_df)
predictions = model.transform(test_df)
Regression Algorithms
from pyspark.ml.regression import (
    LinearRegression,
    DecisionTreeRegressor,
    RandomForestRegressor,
    GBTRegressor
)

# Linear Regression
lr = LinearRegression(featuresCol="features", labelCol="label",
                      maxIter=100, regParam=0.01, elasticNetParam=0.5)

# Random Forest Regressor
rf = RandomForestRegressor(featuresCol="features", labelCol="label",
                           numTrees=100, maxDepth=5)
Clustering Algorithms
from pyspark.ml.clustering import KMeans, BisectingKMeans

# K-Means
kmeans = KMeans(featuresCol="features", k=3, seed=42)
model = kmeans.fit(df)
predictions = model.transform(df)

# Evaluate with silhouette score
from pyspark.ml.evaluation import ClusteringEvaluator

evaluator = ClusteringEvaluator(predictionCol="prediction",
                                featuresCol="features",
                                metricName="silhouette")
score = evaluator.evaluate(predictions)
Hyperparameter Tuning
CrossValidator
CrossValidator performs k-fold cross-validation. It splits data into k folds, trains on k-1 folds, evaluates on the remaining fold, and averages the results.
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Define parameter grid
paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [50, 100, 200]) \
    .addGrid(rf.maxDepth, [3, 5, 10]) \
    .build()

# Define evaluator
evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")

# Create CrossValidator
cv = CrossValidator(
    estimator=pipeline,            # Pipeline or single estimator
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=5,                    # 5-fold cross-validation
    parallelism=4                  # Train 4 models in parallel
)

# Fit CrossValidator
cv_model = cv.fit(train_df)
best_model = cv_model.bestModel
Exam focus: CrossValidator trains numFolds × numParamCombinations models. With 5 folds and 9 parameter combinations (a 3 × 3 grid), it trains 45 models. This is computationally expensive; use TrainValidationSplit for faster (but less robust) tuning.
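The model-count arithmetic can be checked in plain Python, no Spark required. This mirrors the 3 × 3 grid above:

```python
from itertools import product

# Mirrors the grid above: 3 numTrees values x 3 maxDepth values
num_trees_options = [50, 100, 200]
max_depth_options = [3, 5, 10]
num_folds = 5

combos = list(product(num_trees_options, max_depth_options))
total_models = num_folds * len(combos)
print(len(combos), total_models)  # 9 45
```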
TrainValidationSplit
TrainValidationSplit is a faster alternative that uses a single train/validation split instead of k-fold cross-validation.
from pyspark.ml.tuning import TrainValidationSplit

tvs = TrainValidationSplit(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    trainRatio=0.8                 # 80% train, 20% validation
)

tvs_model = tvs.fit(train_df)
best_model = tvs_model.bestModel
CrossValidator vs. TrainValidationSplit: CrossValidator is more robust (uses all data for validation across folds) but slower. TrainValidationSplit is faster but uses only one split. The exam tests when to use each: CrossValidator for small datasets, TrainValidationSplit for large datasets.
Practice Questions
Question 1
A CrossValidator is configured with numFolds=3 and a ParamGridBuilder with 4 parameter combinations. How many models are trained in total?
A) 3
B) 4
C) 7
D) 12
Answer: D — CrossValidator trains numFolds x numParamCombinations models. With 3 folds and 4 parameter combinations: 3 x 4 = 12 models. Each combination is trained 3 times (once per fold).
Question 2
Which Spark ML class should you use for fast hyperparameter tuning on a very large dataset where cross-validation would be too slow?
A) CrossValidator with numFolds=10
B) TrainValidationSplit with trainRatio=0.8
C) ParamGridBuilder alone
D) BinaryClassificationEvaluator
Answer: B — TrainValidationSplit uses a single train/validation split, making it faster than CrossValidator. For large datasets, the single split provides sufficient validation data. ParamGridBuilder (C) defines the grid but does not perform tuning. Evaluator (D) measures metrics but does not tune.
Question 3
Which algorithm does NOT support the Spark ML Pipeline API?
A) LogisticRegression
B) RandomForestClassifier
C) scikit-learn RandomForestClassifier
D) GBTClassifier
Answer: C — scikit-learn's RandomForestClassifier is a separate library that does not implement the Spark ML Estimator/Transformer interface. It cannot be used as a pipeline stage. Only Spark ML classes (from pyspark.ml) work with the Pipeline API.
Question 4
After tuning with CrossValidator, how do you access the best model?
A) cv_model.bestParams
B) cv_model.bestModel
C) cv_model.getModel()
D) cv_model.result
Answer: B — The bestModel attribute of the CrossValidatorModel contains the model trained with the best parameter combination (the one with the best evaluation metric averaged across folds).
Question 5
Which evaluation metric is appropriate for measuring a clustering model's quality in Spark ML?
A) areaUnderROC
B) accuracy
C) silhouette
D) rmse
Answer: C — The silhouette score measures how similar each point is to its own cluster compared to other clusters. ClusteringEvaluator with metricName="silhouette" is the standard for clustering evaluation. areaUnderROC and accuracy are for classification. RMSE is for regression.
Lilly Tech Systems