Understanding Spark Architecture

In this chapter, you’ll learn how Apache Spark’s internal architecture works. Whether you’re running a job on your local machine or a massive cluster, Spark follows the same fundamental principles. Understanding these will help you write better PySpark code, optimize performance, and debug issues more effectively.

Core Components: Driver, Executors, Cluster Manager

Apache Spark follows a master-worker architecture with three main components:

1. Driver Program

  • The driver is your PySpark script or application entry point.
  • It creates the SparkSession and translates your code into execution plans.
  • It coordinates the cluster by:
    • Requesting resources from the Cluster Manager
    • Dividing jobs into tasks
    • Collecting and returning results

The Driver stays active for the entire lifetime of the Spark application, not just a single job.
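
In practice, the driver program is simply the script that builds a SparkSession. A minimal sketch (the appName value and local master are illustrative):

from pyspark.sql import SparkSession

# The driver process starts here: building a SparkSession creates (or reuses)
# the SparkContext that talks to the cluster manager.
spark = (
    SparkSession.builder
    .appName("architecture-demo")   # shown in the Spark UI
    .master("local[*]")             # local mode: driver and executors share one JVM
    .getOrCreate()
)

print(spark.sparkContext.master)    # confirms which mode / cluster manager is in use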

2. Executors

  • Executors are launched on worker nodes in the cluster.
  • Each executor:
    • Executes assigned tasks
    • Stores data in memory or disk
    • Reports back results to the driver
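
Executor resources are set through configuration before the session starts. A hedged sketch (the values are illustrative; on a real cluster they are more often passed to spark-submit via flags such as --executor-memory and --executor-cores):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("executor-config-demo")
    .config("spark.executor.memory", "4g")    # heap memory per executor
    .config("spark.executor.cores", "2")      # concurrent task slots per executor
    .config("spark.executor.instances", "3")  # number of executors (YARN/Kubernetes)
    .getOrCreate()
)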

3. Cluster Manager

  • Manages the physical resources (CPU, memory) across nodes.
  • Spark supports multiple managers, each selected by the master URL (see the sketch after the diagram below):
    • Standalone (Spark's built-in cluster manager)
    • YARN (used with Hadoop clusters)
    • Apache Mesos (deprecated in recent Spark releases)
    • Kubernetes (popular in cloud-native setups)

   +------------------+      Resource management      +-------------------+
   |    Driver App    | <---------------------------> |  Cluster Manager  |
   |     (Tasks)      |                               +-------------------+
   +--------+---------+                                (launches executors
            |              Send tasks to executors      on worker nodes)
            v
   +-----------------+   +-----------------+   +-----------------+
   | Executor Node 1 |   | Executor Node 2 |   | Executor Node 3 |
   +-----------------+   +-----------------+   +-----------------+
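
Which cluster manager the driver talks to is decided by the master URL passed when the application starts. A minimal sketch (host names and ports are placeholders):

from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("cluster-manager-demo")

# Standalone:  .master("spark://master-host:7077")
# YARN:        .master("yarn")
# Kubernetes:  .master("k8s://https://k8s-apiserver:6443")
# Local mode:  .master("local[*]")

spark = builder.master("local[*]").getOrCreate()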

DAG, Jobs, Stages, and Tasks

Spark uses a Directed Acyclic Graph (DAG) to track and optimize the flow of operations.

What is a DAG?

A DAG is a logical graph of operations (transformations and actions) in which each node is an RDD/DataFrame and each edge is the operation that produces it. The graph has no cycles, so computation always flows forward.

How Execution Happens:

  1. User code is parsed by the Driver and converted to a DAG.
  2. The DAG is optimized and broken into Stages.
  3. Each Stage contains Tasks that run in parallel.
  4. Results are returned to the Driver.

Example Flow

# User code
rdd = spark.sparkContext.textFile("file.txt")   # transformation: nothing runs yet
rdd = rdd.map(lambda x: x.upper())              # transformation (lazy)
rdd = rdd.filter(lambda x: "ERROR" in x)        # transformation (lazy)
rdd.count()                                     # action: triggers a job

This triggers:

  • 1 Job (triggered by count())
  • Stages, split at shuffle boundaries (this pipeline has no shuffle, so it runs as a single stage; see the sketch below for a two-stage variant)
  • Many Tasks (one per data partition)
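
To see a shuffle split the work into more than one stage, add a wide transformation such as reduceByKey. A minimal sketch (same placeholder file as above):

rdd = spark.sparkContext.textFile("file.txt")             # stage 1 starts here
pairs = rdd.flatMap(lambda line: line.split()) \
           .map(lambda word: (word, 1))                   # still stage 1 (narrow ops)
counts = pairs.reduceByKey(lambda a, b: a + b)            # shuffle: stage boundary
counts.count()                                            # action: 1 job, 2 stages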

Execution Terms:

Term        Meaning
Job         Triggered by an action like count()
Stage       Group of tasks that can run in parallel
Task        Smallest unit of work
Partition   Logical chunk of data

Transformations vs Actions

Transformations

  • Lazy operations that define a new RDD/DataFrame.
  • Executed only when an action is triggered.
  • Examples: map(), filter(), groupBy(), select()

rdd = rdd.map(lambda x: x.upper())

No computation happens here yet.

Actions

  • Trigger actual execution.
  • Results are returned or written to external storage.
  • Examples: count(), collect(), show(), write()

print(rdd.count())  # This triggers computation

Lazy Evaluation: Why It Matters

Spark builds a plan first and executes it only when needed. This helps:

  • Optimize queries
  • Minimize data shuffling
  • Avoid unnecessary computation
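
A quick way to see the last point in action: the transformations below return instantly because they only extend the plan, and an action like take() can stop early. A sketch using the same placeholder file as before:

rdd = spark.sparkContext.textFile("file.txt")    # nothing is read yet
errors = rdd.filter(lambda x: "ERROR" in x)      # still nothing: only the plan grows

# Because evaluation is lazy, take(5) can stop scanning once it has 5 matching lines
# instead of filtering the entire file first.
print(errors.take(5))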

Spark Internals and Optimization

Spark has two key components for internal processing:

1. Catalyst Optimizer (for DataFrames/Spark SQL)

  • Converts SQL and DSL operations into an optimized logical plan.
  • Applies rules like predicate pushdown, constant folding, etc.
  • Generates the final physical plan that Spark executes as a DAG of stages

df = spark.read.csv("data.csv", header=True)
df.filter("age > 30").select("name").explain()

explain() shows the Catalyst-generated plan.
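
Passing True requests the extended output, which also shows the intermediate plans Catalyst produced:

# Extended output: parsed, analyzed, and optimized logical plans plus the physical plan
df.filter("age > 30").select("name").explain(True)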

2. Tungsten Execution Engine

  • Manages memory efficiently using a compact, off-heap binary data format
  • Uses whole-stage code generation, compiling query plans into JVM bytecode instead of interpreting them row by row
  • Supports cache-aware computation, pipelining of operators within a stage, and vectorized columnar readers

Visualization Tools

Spark UI (by default at http://localhost:4040 while the application is running)

When a job runs, Spark launches a local web UI with:

  • DAG Visualization
  • Job/Stage/Task status
  • Executor metrics (CPU, memory)
  • Environment and logs
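
If port 4040 is already taken (for example, by another running application), Spark falls back to the next free port (4041, 4042, and so on). The port can also be pinned explicitly; a minimal sketch (the port value is illustrative):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ui-port-demo")
    .config("spark.ui.port", "4050")   # serve the Spark UI on this port
    .getOrCreate()
)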

Example Screenshots:

(Placeholder: screenshots of the DAG visualization, the task timeline, and the Storage tab)

Recap

  • The Driver manages the application and controls job flow.
  • Executors perform actual computations and store data.
  • Cluster Managers assign resources across a cluster.
  • Spark builds a DAG, breaks it into Stages and Tasks.
  • Transformations are lazy; Actions trigger computation.
  • Internally, Spark uses Catalyst and Tungsten for optimization.
  • Use Spark UI for debugging and performance tuning.

Exercise: Call df.explain(True) on any DataFrame to view the parsed, analyzed, and optimized logical plans along with the physical plan. Then run an action like df.count() and observe the resulting job's DAG in the Spark UI.
