In this chapter, you’ll learn how Apache Spark’s internal architecture works. Whether you’re running a job on your local machine or a massive cluster, Spark follows the same fundamental principles. Understanding these will help you write better PySpark code, optimize performance, and debug issues more effectively.
Core Components: Driver, Executors, Cluster Manager
Apache Spark follows a master-worker (driver/executor) architecture with three main components:
1. Driver Program
- The driver is your PySpark script or application entry point.
- It creates the SparkSession and translates your code into execution plans.
- It coordinates the cluster by:
- Requesting resources from the Cluster Manager
- Dividing jobs into tasks
- Collecting and returning results
The driver stays active for the entire lifetime of the Spark application, not just a single job.
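As a minimal sketch, this is roughly what the driver side of a PySpark application looks like (the application name and master URL below are placeholders, not recommendations):

from pyspark.sql import SparkSession

# Building the SparkSession starts the driver-side context that plans
# and coordinates all of the work described in this chapter.
spark = (
    SparkSession.builder
    .appName("architecture-demo")   # placeholder name
    .master("local[*]")             # run everything locally while experimenting
    .getOrCreate()
)

# ... transformations and actions go here ...

spark.stop()  # shut down the driver and release resources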
2. Executors
- Executors are launched on worker nodes in the cluster.
- Each executor:
- Executes assigned tasks
- Stores data in memory or disk
- Reports back results to the driver
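Executor resources are requested through configuration rather than created in code. A hedged sketch of the settings commonly involved (the values are illustrative only):

# At submit time (--num-executors applies on YARN):
# spark-submit --executor-memory 4g --executor-cores 2 --num-executors 10 my_app.py

# Or when building the SparkSession:
spark = (
    SparkSession.builder
    .appName("executor-config-demo")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    .getOrCreate()
)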
3. Cluster Manager
- Manages the physical resources (CPU, memory) across nodes.
- Spark supports multiple managers:
- Standalone (default)
- YARN (used with Hadoop clusters)
- Apache Mesos
- Kubernetes (popular in cloud-native setups)
+------------------+    resource management     +------------------+
|  Driver Program  | <------------------------> |  Cluster Manager |
+--------+---------+                            +--------+---------+
         |                                               |
         | sends tasks,                                  | launches
         | collects results                              | executors
         v                                               v
+------------------+   +------------------+   +------------------+
|  Executor Node 1 |   |  Executor Node 2 |   |  Executor Node 3 |
+------------------+   +------------------+   +------------------+
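Which cluster manager Spark talks to is chosen through the master URL passed at submit time. A few representative examples (host names and ports are placeholders):

# Standalone cluster manager
# spark-submit --master spark://master-host:7077 my_app.py

# YARN on a Hadoop cluster
# spark-submit --master yarn my_app.py

# Kubernetes
# spark-submit --master k8s://https://k8s-apiserver:6443 my_app.py

# Local mode: no cluster manager, driver and executors share one JVM
# spark-submit --master "local[*]" my_app.py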
DAG, Jobs, Stages, and Tasks
Spark uses a Directed Acyclic Graph (DAG) to track and optimize the flow of operations.
What is a DAG?
A DAG is a logical flowchart of operations (transformations and actions) where each node is an RDD/DataFrame and each edge is an operation. It has no cycles.
How Execution Happens:
- User code is parsed by the Driver and converted to a DAG.
- The DAG is optimized and broken into Stages.
- Each Stage contains Tasks that run in parallel.
- Results are returned to the Driver.
Example Flow
# User code
rdd = spark.sparkContext.textFile("file.txt")
rdd = rdd.map(lambda x: x.upper())
rdd = rdd.filter(lambda x: "ERROR" in x)
rdd.count()
This triggers:
- 1 Job (because count() is an action)
- Stages split at shuffle boundaries; this pipeline has no wide transformation, so it runs as a single stage
- Many Tasks (one per partition of the input data)
Execution Terms:
Term | Meaning |
---|---|
Job | The work triggered by one action, such as count() |
Stage | A set of tasks that run the same computation on different partitions; stages are separated by shuffles |
Task | The smallest unit of work, processing a single partition |
Partition | A logical chunk of the data |
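The one-task-per-partition relationship is easy to observe from the API. A small sketch (the exact partition counts depend on the input file and your configuration):

rdd = spark.sparkContext.textFile("file.txt")
print(rdd.getNumPartitions())   # number of partitions = number of tasks per stage

rdd8 = rdd.repartition(8)       # changing the partitioning changes the task count
print(rdd8.getNumPartitions())  # 8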
Transformations vs Actions
Transformations
- Lazy operations that define a new RDD/DataFrame.
- Executed only when an action is triggered.
- Examples: map(), filter(), groupBy(), select()
rdd = rdd.map(lambda x: x.upper())
No computation happens here yet.
Actions
- Trigger actual execution.
- Results are returned or written to external storage.
- Examples: count(), collect(), show(), and write operations such as write.parquet()
print(rdd.count()) # This triggers computation
Lazy Evaluation: Why It Matters
Spark builds a plan first and executes it only when needed. This helps:
- Optimize queries
- Minimize data shuffling
- Avoid unnecessary computation
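A quick way to see lazy evaluation in action (the file path is a placeholder):

rdd = spark.sparkContext.textFile("file.txt")    # nothing is read yet
upper = rdd.map(lambda x: x.upper())             # still nothing: Spark is only building the plan
errors = upper.filter(lambda x: "ERROR" in x)    # still nothing

print(errors.count())  # only now does Spark read the file and run the whole pipeline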
Spark Internals and Optimization
Spark has two key components for internal processing:
1. Catalyst Optimizer (for DataFrames/Spark SQL)
- Converts SQL and DSL operations into an optimized logical plan.
- Applies rules like predicate pushdown, constant folding, etc.
- Generates the final physical plan, which Spark then executes as a DAG of stages
df = spark.read.csv("data.csv", header=True)
df.filter("age > 30").select("name").explain()
explain() shows the Catalyst-generated plan.
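Reusing df from the snippet above, explain() also accepts arguments that expose more of Catalyst's work (the mode parameter assumes Spark 3.0 or later):

df.filter("age > 30").select("name").explain(True)              # parsed, analyzed, optimized, and physical plans
df.filter("age > 30").select("name").explain(mode="formatted")  # a more readable layout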
2. Tungsten Execution Engine
- Manages memory explicitly (including off-heap memory) rather than relying only on JVM objects
- Generates JVM bytecode at runtime (whole-stage code generation) instead of interpreting operators row by row
- Enables cache-aware computation, operator pipelining, and vectorized execution
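Whole-stage code generation is visible in the physical plan: operators that Tungsten compiles together typically appear under a WholeStageCodegen node, marked with an asterisk. A hedged sketch for comparison:

df = spark.range(1_000_000)

# With codegen enabled (the default), fused operators typically show up as "*(1) ..." in the plan.
df.selectExpr("id * 2 AS doubled").filter("doubled > 10").explain()

# Toggling the setting makes the difference easy to see; then restore the default.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
df.selectExpr("id * 2 AS doubled").filter("doubled > 10").explain()
spark.conf.set("spark.sql.codegen.wholeStage", "true")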
Visualization Tools
Spark UI (by default at http://localhost:4040 while the application is running)
When a job runs, Spark launches a local web UI with:
- DAG Visualization
- Job/Stage/Task status
- Executor metrics (CPU, memory)
- Environment and logs
Example screenshots: (placeholder for Spark UI figures: DAG graph, task timeline, Storage tab)
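If you are not sure which port the UI ended up on (Spark tries 4041, 4042, and so on when 4040 is already taken), you can ask the running SparkContext:

print(spark.sparkContext.uiWebUrl)  # e.g. http://<driver-host>:4040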
Recap
- The Driver manages the application and controls job flow.
- Executors perform actual computations and store data.
- Cluster Managers assign resources across a cluster.
- Spark builds a DAG, breaks it into Stages and Tasks.
- Transformations are lazy; Actions trigger computation.
- Internally, Spark uses Catalyst and Tungsten for optimization.
- Use Spark UI for debugging and performance tuning.
Exercise: Call df.explain(True) on any DataFrame to view the logical and physical plans (the underlying queryExecution object is exposed on the Scala/Java Dataset; in PySpark, explain(True) prints the same plan information). Then run an action like df.count() and observe the resulting job and its DAG in the Spark UI.
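A possible starting point for the exercise (file path is a placeholder):

df = spark.read.csv("data.csv", header=True)

df.explain(True)   # logical and physical plans
print(df.count())  # triggers a job; watch it appear in the Jobs and Stages tabs of the Spark UI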