In this chapter, you’ll learn how Apache Spark’s internal architecture works. Whether you’re running a job on your local machine or a massive cluster, Spark follows the same fundamental principles. Understanding these will help you write better PySpark code, optimize performance, and debug issues more effectively.
Core Components: Driver, Executors, Cluster Manager
Apache Spark follows a master-worker (driver/executor) architecture with three main components:
1. Driver Program
- The driver is your PySpark script or application entry point.
- It creates the SparkSession and translates your code into execution plans.
- It coordinates the cluster by:
- Requesting resources from the Cluster Manager
- Dividing jobs into tasks
- Collecting and returning results
The driver stays active for the entire lifetime of the Spark application, not just a single job.
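As a minimal sketch, this is roughly what the driver side of a PySpark application looks like (the application name and master URL below are placeholders, not recommendations):

from pyspark.sql import SparkSession

# Building the SparkSession starts the driver-side context that plans
# and coordinates all of the work described in this chapter.
spark = (
    SparkSession.builder
    .appName("architecture-demo")   # placeholder name
    .master("local[*]")             # run everything locally while experimenting
    .getOrCreate()
)

# ... transformations and actions go here ...

spark.stop()  # shut down the driver and release resources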
2. Executors
- Executors are launched on worker nodes in the cluster.
- Each executor:
- Executes assigned tasks
- Stores data in memory or disk
- Reports back results to the driver
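Executor resources are requested through configuration rather than created in code. A hedged sketch of the settings commonly involved (the values are illustrative only):

# At submit time (--num-executors applies on YARN):
# spark-submit --executor-memory 4g --executor-cores 2 --num-executors 10 my_app.py

# Or when building the SparkSession:
spark = (
    SparkSession.builder
    .appName("executor-config-demo")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    .getOrCreate()
)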
3. Cluster Manager
- Manages the physical resources (CPU, memory) across nodes.
- Spark supports multiple managers:
- Standalone (default)
- YARN (used with Hadoop clusters)
- Apache Mesos
- Kubernetes (popular in cloud-native setups)
+------------------+    resource management     +------------------+
|  Driver Program  | <------------------------> |  Cluster Manager |
+--------+---------+                            +--------+---------+
         |                                               |
         | sends tasks,                                  | launches
         | collects results                              | executors
         v                                               v
+------------------+   +------------------+   +------------------+
|  Executor Node 1 |   |  Executor Node 2 |   |  Executor Node 3 |
+------------------+   +------------------+   +------------------+
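Which cluster manager Spark talks to is chosen through the master URL passed at submit time. A few representative examples (host names and ports are placeholders):

# Standalone cluster manager
# spark-submit --master spark://master-host:7077 my_app.py

# YARN on a Hadoop cluster
# spark-submit --master yarn my_app.py

# Kubernetes
# spark-submit --master k8s://https://k8s-apiserver:6443 my_app.py

# Local mode: no cluster manager, driver and executors share one JVM
# spark-submit --master "local[*]" my_app.py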
DAG, Jobs, Stages, and Tasks
Spark uses a Directed Acyclic Graph (DAG) to track and optimize the flow of operations.
What is a DAG?
A DAG is a logical flowchart of operations (transformations and actions) where each node is an RDD/DataFrame and each edge is an operation. It has no cycles.
How Execution Happens:
- User code is parsed by the Driver and converted to a DAG.
- The DAG is optimized and broken into Stages.
- Each Stage contains Tasks that run in parallel.
- Results are returned to the Driver.
Example Flow
# User code
rdd = spark.sparkContext.textFile("file.txt")
rdd = rdd.map(lambda x: x.upper())
rdd = rdd.filter(lambda x: "ERROR" in x)
rdd.count()
This triggers:
- 1 Job (because count() is an action)
- Stages split at shuffle boundaries; this pipeline has no wide transformation, so it runs as a single stage
- Many Tasks (one per partition of the input data)
Execution Terms:
Term | Meaning |
---|---|
Job | The work triggered by one action, such as count() |
Stage | A set of tasks that run the same computation on different partitions; stages are separated by shuffles |
Task | The smallest unit of work, processing a single partition |
Partition | A logical chunk of the data |
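The one-task-per-partition relationship is easy to observe from the API. A small sketch (the exact partition counts depend on the input file and your configuration):

rdd = spark.sparkContext.textFile("file.txt")
print(rdd.getNumPartitions())   # number of partitions = number of tasks per stage

rdd8 = rdd.repartition(8)       # changing the partitioning changes the task count
print(rdd8.getNumPartitions())  # 8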
Transformations vs Actions
Transformations
- Lazy operations that define a new RDD/DataFrame.
- Executed only when an action is triggered.
- Examples: map(), filter(), groupBy(), select()
rdd = rdd.map(lambda x: x.upper())
No computation happens here yet.
Actions
- Trigger actual execution.
- Results are returned or written to external storage.
- Examples: count(), collect(), show(), and write operations such as write.parquet()
print(rdd.count()) # This triggers computation
Lazy Evaluation: Why It Matters
Spark builds a plan first and executes it only when needed. This helps:
- Optimize queries
- Minimize data shuffling
- Avoid unnecessary computation
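A quick way to see lazy evaluation in action (the file path is a placeholder):

rdd = spark.sparkContext.textFile("file.txt")    # nothing is read yet
upper = rdd.map(lambda x: x.upper())             # still nothing: Spark is only building the plan
errors = upper.filter(lambda x: "ERROR" in x)    # still nothing

print(errors.count())  # only now does Spark read the file and run the whole pipeline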
Spark Internals and Optimization
Spark has two key components for internal processing:
1. Catalyst Optimizer (for DataFrames/Spark SQL)
- Converts SQL and DSL operations into an optimized logical plan.
- Applies rules like predicate pushdown, constant folding, etc.
- Generates the final physical plan, which Spark then executes as a DAG of stages
df = spark.read.csv("data.csv", header=True)
df.filter("age > 30").select("name").explain()
explain() shows the Catalyst-generated plan.
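Reusing df from the snippet above, explain() also accepts arguments that expose more of Catalyst's work (the mode parameter assumes Spark 3.0 or later):

df.filter("age > 30").select("name").explain(True)              # parsed, analyzed, optimized, and physical plans
df.filter("age > 30").select("name").explain(mode="formatted")  # a more readable layout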
2. Tungsten Execution Engine
- Manages memory explicitly (including off-heap memory) rather than relying only on JVM objects
- Generates JVM bytecode at runtime (whole-stage code generation) instead of interpreting operators row by row
- Enables cache-aware computation, operator pipelining, and vectorized execution
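Whole-stage code generation is visible in the physical plan: operators that Tungsten compiles together typically appear under a WholeStageCodegen node, marked with an asterisk. A hedged sketch for comparison:

df = spark.range(1_000_000)

# With codegen enabled (the default), fused operators typically show up as "*(1) ..." in the plan.
df.selectExpr("id * 2 AS doubled").filter("doubled > 10").explain()

# Toggling the setting makes the difference easy to see; then restore the default.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
df.selectExpr("id * 2 AS doubled").filter("doubled > 10").explain()
spark.conf.set("spark.sql.codegen.wholeStage", "true")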
Visualization Tools
Spark UI (by default at http://localhost:4040 while the application is running)
When a job runs, Spark launches a local web UI with:
- DAG Visualization
- Job/Stage/Task status
- Executor metrics (CPU, memory)
- Environment and logs
Example screenshots: (placeholder for Spark UI figures: DAG graph, task timeline, Storage tab)
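If you are not sure which port the UI ended up on (Spark tries 4041, 4042, and so on when 4040 is already taken), you can ask the running SparkContext:

print(spark.sparkContext.uiWebUrl)  # e.g. http://<driver-host>:4040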
Recap
- The Driver manages the application and controls job flow.
- Executors perform actual computations and store data.
- Cluster Managers assign resources across a cluster.
- Spark builds a DAG, breaks it into Stages and Tasks.
- Transformations are lazy; Actions trigger computation.
- Internally, Spark uses Catalyst and Tungsten for optimization.
- Use Spark UI for debugging and performance tuning.
Exercise: Call df.explain(True) on any DataFrame to view the logical and physical plans (the underlying queryExecution object is exposed on the Scala/Java Dataset; in PySpark, explain(True) prints the same plan information). Then run an action like df.count() and observe the resulting job and its DAG in the Spark UI.
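A possible starting point for the exercise (file path is a placeholder):

df = spark.read.csv("data.csv", header=True)

df.explain(True)   # logical and physical plans
print(df.count())  # triggers a job; watch it appear in the Jobs and Stages tabs of the Spark UI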