Intermediate

Spark Fundamentals

Master the foundation of every Spark application — RDDs, DataFrames, SparkSQL, transformations, actions, lazy evaluation, and the distributed architecture that powers large-scale data processing.

Spark Architecture

Apache Spark is a distributed computing engine that processes data in parallel across a cluster. Understanding the architecture is essential for the exam.

  • Driver: The master process that coordinates the application, creates the SparkContext/SparkSession, and distributes work to executors
  • Executors: Worker processes that run on cluster nodes, execute tasks, and store data in memory or on disk
  • Cluster Manager: Allocates resources (YARN, Mesos, Kubernetes, or Standalone)
  • SparkSession: The unified entry point for all Spark functionality (replaces SparkContext in modern Spark)

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ML Pipeline") \
    .config("spark.sql.shuffle.partitions", 200) \
    .getOrCreate()

RDDs vs. DataFrames

RDD (Resilient Distributed Dataset)

  • Low-level distributed collection
  • Unstructured: no schema or columns
  • Functional API (map, filter, reduce)
  • No Catalyst optimizer
  • Use only when you need fine-grained control

DataFrame (Preferred for ML)

  • High-level structured data with schema
  • Columnar: named columns with types
  • SQL-like API (select, filter, groupBy)
  • Catalyst optimizer for query planning
  • Required by Spark ML Pipeline API

💡 Exam focus: The Spark ML API exclusively uses DataFrames. RDD knowledge is tested for fundamentals, but all ML pipeline questions use DataFrame operations. Know when to use each.

Transformations vs. Actions

This distinction is critical for the exam. Spark uses lazy evaluation — transformations build a plan but do not execute. Actions trigger execution.

Transformations (Lazy)

Create a new DataFrame/RDD from an existing one. Nothing is computed until an action is called.

  • Narrow: select(), filter(), withColumn(), map() — no shuffle, each partition processes independently
  • Wide: groupBy(), join(), orderBy(), repartition() — require shuffling data across partitions

Actions (Trigger Execution)

Return results or write data. These trigger the execution of all preceding transformations.

  • show(), collect(), count(), first(), take(n)
  • write.parquet(), write.csv(), save()

# Transformations (lazy - nothing executes yet)
df_filtered = df.filter(df["age"] > 25)
df_selected = df_filtered.select("name", "age", "salary")
df_grouped = df_selected.groupBy("age").avg("salary")

# Action (triggers execution of the entire plan)
df_grouped.show()

Common exam trap: cache() and persist() are transformations, not actions. They mark a DataFrame for caching but the actual caching happens when the next action is called. The exam tests this distinction.

SparkSQL

SparkSQL allows you to query DataFrames using standard SQL syntax:

# Register DataFrame as a temp view
df.createOrReplaceTempView("customers")

# Query with SQL
result = spark.sql("""
    SELECT age, AVG(salary) as avg_salary
    FROM customers
    WHERE age > 25
    GROUP BY age
    ORDER BY avg_salary DESC
""")

Caching and Persistence

Caching stores intermediate results in memory to avoid recomputation:

  • cache(): Shorthand for persist() with the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames)
  • persist(level): Stores with specified storage level (MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, etc.)
  • unpersist(): Removes from cache

📚 When to cache: Cache DataFrames that are used multiple times in your pipeline (e.g., training data used for both model training and evaluation). Do not cache DataFrames that are used only once — it wastes memory.

Practice Questions

Question 1

Which of the following is an ACTION in Spark (not a transformation)?

A) filter()
B) select()
C) count()
D) withColumn()

Answer: C — count() is an action that triggers execution and returns a value (the number of rows). filter(), select(), and withColumn() are all transformations that return new DataFrames without triggering execution.

Question 2

Why does Spark use lazy evaluation?

A) To save memory on the driver
B) To optimize the execution plan by analyzing all transformations before running them
C) To make the API easier to use
D) To support Python syntax

Answer: B — Lazy evaluation allows Spark's Catalyst optimizer to analyze the entire chain of transformations and produce an optimized physical execution plan. It can eliminate unnecessary computations, push down predicates, and optimize join strategies.

Question 3

A DataFrame is used twice in a pipeline: once for model training and once for evaluation. What should you do to avoid recomputing it?

A) Create two separate DataFrames
B) Cache the DataFrame using .cache() before the first use
C) Write it to disk and read it back
D) Use an RDD instead

Answer: B — Caching stores the DataFrame in memory so it is not recomputed when used the second time. Creating two DataFrames (A) would duplicate the computation. Writing to disk (C) is slower. Using an RDD (D) is not compatible with Spark ML.

Question 4

Which transformation causes a shuffle across partitions?

A) filter()
B) select()
C) groupBy()
D) withColumn()

Answer: C — groupBy() is a wide transformation that requires shuffling data across partitions to group rows with the same key onto the same partition. filter(), select(), and withColumn() are narrow transformations that operate on each partition independently.

Question 5

What is the modern unified entry point for all Spark functionality?

A) SparkContext
B) SparkSession
C) SQLContext
D) HiveContext

Answer: B — SparkSession (introduced in Spark 2.0) is the unified entry point that combines SparkContext, SQLContext, and HiveContext. It is used for creating DataFrames, registering tables, executing SQL, and configuring Spark settings.