- What is an RDD?
- Creating RDDs
- RDD Operations
- Lazy Evaluation and Lineage
- RDD Caching and Persistence
- When to Use RDD vs DataFrame
- Summary
In this chapter, we will explore one of the foundational abstractions in Apache Spark—Resilient Distributed Datasets (RDDs). Understanding RDDs is essential for working with low-level distributed data processing tasks in PySpark.
What is an RDD?
RDD stands for Resilient Distributed Dataset. It is the original core data structure in Apache Spark, representing an immutable, distributed collection of objects that can be processed in parallel across a cluster.
Key Characteristics:
- Resilient: Automatically recovers from node failures.
- Distributed: Data is partitioned across multiple nodes.
- Immutable: Once created, the RDD cannot be changed.
- Lazy Evaluation: Operations are only evaluated when an action is called.
RDDs provide fine-grained control over transformations and computations, and are particularly useful when a schema is not required or when you need low-level operations, as in the sketch below.
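For example, here is a minimal sketch of the kind of low-level control RDDs offer, using mapPartitions to parse raw strings one partition at a time (the data and variable names are illustrative, and sc is the SparkContext created in the next section):

# Process each partition in a single pass; per-partition setup cost is paid only once
raw = sc.parallelize(["10,alice", "20,bob", "7,carol"], 2)

def parse_partition(rows):
    for row in rows:
        amount, name = row.split(",")
        yield (name, int(amount))

parsed = raw.mapPartitions(parse_partition)
print(parsed.collect())  # [('alice', 10), ('bob', 20), ('carol', 7)]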
Creating RDDs
RDDs can be created in multiple ways using PySpark.
1. From an existing collection (in-memory data)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("RDDExample").getOrCreate()
sc = spark.sparkContext
rdd1 = sc.parallelize([1, 2, 3, 4, 5])
print(rdd1.collect())
2. From an external file (text file, CSV, etc.)
rdd2 = sc.textFile("/path/to/file.txt")
print(rdd2.first())
3. From existing RDD transformations
rdd3 = rdd1.map(lambda x: x * x)
print(rdd3.collect())
RDD Operations
RDDs support two types of operations:
- Transformations: Return a new RDD
- Actions: Trigger computation and return results (see the sketch below)
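Before looking at transformations in detail, here is a quick sketch of a few common actions; each call forces Spark to compute the RDD and return a result to the driver:

nums = sc.parallelize([1, 2, 3, 4, 5])
print(nums.count())                      # 5 -- number of elements
print(nums.take(3))                      # [1, 2, 3] -- first three elements
print(nums.reduce(lambda x, y: x + y))   # 15 -- aggregate all elements
print(nums.first())                      # 1 -- first element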
Common Transformations
map()
Applies a function to each element.
rdd = sc.parallelize([1, 2, 3])
rdd.map(lambda x: x * 2).collect()
flatMap()
Applies a function to each element and flattens the results, so one input element can produce zero or more output elements.
lines = sc.parallelize(["Hello world", "Spark RDD"])
words = lines.flatMap(lambda line: line.split(" "))
print(words.collect())
filter()
Filters elements based on a condition.
rdd = sc.parallelize([1, 2, 3, 4, 5])
even = rdd.filter(lambda x: x % 2 == 0)
print(even.collect())
reduceByKey()
Combines the values that share the same key using the given function.
data = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
result = data.reduceByKey(lambda x, y: x + y)
print(result.collect())
Lazy Evaluation and Lineage
Lazy Evaluation
Spark evaluates transformations only when an action is invoked. This allows Spark to optimize the execution plan, minimizing data shuffles and I/O operations.
rdd = sc.parallelize([1, 2, 3, 4])
transformed = rdd.map(lambda x: x * 2) # No computation yet
print(transformed.collect()) # Computation happens here
Lineage
RDDs maintain a record of all transformations applied to build the current RDD. This information is stored as a lineage graph, which Spark uses to recompute lost partitions after a failure.
You can inspect the lineage of an RDD using:
print(transformed.toDebugString().decode())  # toDebugString() returns bytes in PySpark
RDD Caching and Persistence
When an RDD is reused multiple times in an application, it is efficient to cache it in memory to avoid recomputation.
cache()
Stores the RDD in memory, using the default storage level (MEMORY_ONLY).
cached_rdd = rdd.map(lambda x: x * 2).cache()
print(cached_rdd.count())
print(cached_rdd.collect()) # No recomputation
persist()
Allows specifying different storage levels (memory, disk, etc.).
from pyspark import StorageLevel
rdd.persist(StorageLevel.MEMORY_AND_DISK)
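To confirm which level is in effect, getStorageLevel() reports the RDD's current storage level (the printed flags are useDisk, useMemory, useOffHeap, deserialized, replication):

print(rdd.getStorageLevel())  # e.g. StorageLevel(True, True, False, False, 1) after MEMORY_AND_DISK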
You can unpersist it when no longer needed:
rdd.unpersist()
When to Use RDD vs DataFrame
| Feature | RDD | DataFrame |
|---|---|---|
| Schema | No | Yes (structured and typed) |
| Performance | Lower (manual optimization) | Higher (Catalyst Optimizer) |
| Ease of Use | Low (functional API) | High (SQL-like syntax) |
| Use Case | Complex transformations, custom logic | Structured data, SQL-style queries |
| Debugging | Verbose | Easier with Spark UI and Catalyst plan |
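To make the contrast concrete, here is a minimal sketch of the same aggregation written with the RDD API and with the DataFrame API (column and variable names are illustrative):

data = [("a", 1), ("b", 1), ("a", 2)]

# RDD API: functional style, no schema
rdd_result = sc.parallelize(data).reduceByKey(lambda x, y: x + y).collect()
print(rdd_result)  # [('a', 3), ('b', 1)] (order may vary)

# DataFrame API: named columns, optimized by the Catalyst optimizer
df = spark.createDataFrame(data, ["key", "value"])
print(df.groupBy("key").sum("value").collect())  # [Row(key='a', sum(value)=3), ...]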
Use RDD When:
- You need full control over low-level transformations
- Your data lacks a schema or is unstructured
- You need fine-grained control over performance tuning
Use DataFrame When:
- You work with structured data (CSV, Parquet, Hive)
- You want to use SQL queries
- You want better optimization and concise syntax
Summary
- RDD is the fundamental data structure of Spark, representing immutable, distributed collections.
- RDDs support two types of operations: transformations and actions.
- Spark uses lazy evaluation and tracks lineage for fault tolerance.
- Caching and persistence improve performance for repeated RDD usage.
- DataFrames are generally preferred for structured data and performance.
In the next chapter, we will explore PySpark DataFrames in detail and learn how they offer more expressive and optimized data processing capabilities.