- What is an RDD?
- Creating RDDs
- RDD Operations
- Lazy Evaluation and Lineage
- RDD Caching and Persistence
- When to Use RDD vs DataFrame
- Summary
In this chapter, we will explore one of the foundational abstractions in Apache Spark—Resilient Distributed Datasets (RDDs). Understanding RDDs is essential for working with low-level distributed data processing tasks in PySpark.
What is an RDD?
RDD stands for Resilient Distributed Dataset. It is the original core data structure in Apache Spark, representing an immutable, distributed collection of objects that can be processed in parallel across a cluster.
Key Characteristics:
- Resilient: Automatically recovers from node failures.
- Distributed: Data is partitioned across multiple nodes.
- Immutable: Once created, the RDD cannot be changed.
- Lazy Evaluation: Operations are only evaluated when an action is called.
RDDs provide fine-grained control over transformations and computations, and are particularly useful when a schema is not required or when you need low-level operations, as in the sketch below.
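For example, here is a minimal sketch of the kind of low-level control RDDs offer, using mapPartitions to parse raw strings one partition at a time (the data and variable names are illustrative, and sc is the SparkContext created in the next section):

# Process each partition in a single pass; per-partition setup cost is paid only once
raw = sc.parallelize(["10,alice", "20,bob", "7,carol"], 2)

def parse_partition(rows):
    for row in rows:
        amount, name = row.split(",")
        yield (name, int(amount))

parsed = raw.mapPartitions(parse_partition)
print(parsed.collect())  # [('alice', 10), ('bob', 20), ('carol', 7)]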
Creating RDDs
RDDs can be created in multiple ways using PySpark.
1. From an existing collection (in-memory data)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("RDDExample").getOrCreate()
sc = spark.sparkContext
rdd1 = sc.parallelize([1, 2, 3, 4, 5])
print(rdd1.collect())
2. From an external file (text file, CSV, etc.)
rdd2 = sc.textFile("/path/to/file.txt")
print(rdd2.first())
3. From existing RDD transformations
rdd3 = rdd1.map(lambda x: x * x)
print(rdd3.collect())
RDD Operations
RDDs support two types of operations:
- Transformations: Return a new RDD
- Actions: Trigger computation and return results (see the sketch below)
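Before looking at transformations in detail, here is a quick sketch of a few common actions; each call forces Spark to compute the RDD and return a result to the driver:

nums = sc.parallelize([1, 2, 3, 4, 5])
print(nums.count())                      # 5 -- number of elements
print(nums.take(3))                      # [1, 2, 3] -- first three elements
print(nums.reduce(lambda x, y: x + y))   # 15 -- aggregate all elements
print(nums.first())                      # 1 -- first element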
Common Transformations
map()
Applies a function to each element.
rdd = sc.parallelize([1, 2, 3])
rdd.map(lambda x: x * 2).collect()
flatMap()
Applies a function to each element and flattens the results, so one input element can produce zero or more output elements.
lines = sc.parallelize(["Hello world", "Spark RDD"])
words = lines.flatMap(lambda line: line.split(" "))
print(words.collect())
filter()
Filters elements based on a condition.
rdd = sc.parallelize([1, 2, 3, 4, 5])
even = rdd.filter(lambda x: x % 2 == 0)
print(even.collect())
reduceByKey()
Combines the values that share the same key using the given function.
data = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
result = data.reduceByKey(lambda x, y: x + y)
print(result.collect())
Lazy Evaluation and Lineage
Lazy Evaluation
Spark evaluates transformations only when an action is invoked. This allows Spark to optimize the execution plan, minimizing data shuffles and I/O operations.
rdd = sc.parallelize([1, 2, 3, 4])
transformed = rdd.map(lambda x: x * 2) # No computation yet
print(transformed.collect()) # Computation happens here
Lineage
RDDs maintain a record of all transformations applied to build the current RDD. This information is stored as a lineage graph, which Spark uses to recompute lost partitions after a failure.
You can inspect the lineage of an RDD using:
print(transformed.toDebugString().decode())  # toDebugString() returns bytes in PySpark
RDD Caching and Persistence
When an RDD is reused multiple times in an application, it is efficient to cache it in memory to avoid recomputation.
cache()
Stores the RDD in memory, using the default storage level (MEMORY_ONLY).
cached_rdd = rdd.map(lambda x: x * 2).cache()
print(cached_rdd.count())
print(cached_rdd.collect()) # No recomputation
persist()
Allows specifying different storage levels (memory, disk, etc.).
from pyspark import StorageLevel
rdd.persist(StorageLevel.MEMORY_AND_DISK)
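To confirm which level is in effect, getStorageLevel() reports the RDD's current storage level (the printed flags are useDisk, useMemory, useOffHeap, deserialized, replication):

print(rdd.getStorageLevel())  # e.g. StorageLevel(True, True, False, False, 1) after MEMORY_AND_DISK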
You can unpersist it when no longer needed:
rdd.unpersist()
When to Use RDD vs DataFrame
| Feature | RDD | DataFrame |
|---|---|---|
| Schema | No | Yes (structured and typed) |
| Performance | Lower (manual optimization) | Higher (Catalyst Optimizer) |
| Ease of Use | Low (functional API) | High (SQL-like syntax) |
| Use Case | Complex transformations, custom logic | Structured data, SQL-style queries |
| Debugging | Verbose | Easier with Spark UI and Catalyst plan |
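To make the contrast concrete, here is a minimal sketch of the same aggregation written with the RDD API and with the DataFrame API (column and variable names are illustrative):

data = [("a", 1), ("b", 1), ("a", 2)]

# RDD API: functional style, no schema
rdd_result = sc.parallelize(data).reduceByKey(lambda x, y: x + y).collect()
print(rdd_result)  # [('a', 3), ('b', 1)] (order may vary)

# DataFrame API: named columns, optimized by the Catalyst optimizer
df = spark.createDataFrame(data, ["key", "value"])
print(df.groupBy("key").sum("value").collect())  # [Row(key='a', sum(value)=3), ...]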
Use RDD When:
- You need full control over low-level transformations
- Your data lacks a schema or is unstructured
- You need fine-grained control over performance tuning
Use DataFrame When:
- You work with structured data (CSV, Parquet, Hive)
- You want to use SQL queries
- You want better optimization and concise syntax
Summary
- RDD is the fundamental data structure of Spark, representing immutable, distributed collections.
- RDDs support two types of operations: transformations and actions.
- Spark uses lazy evaluation and tracks lineage for fault tolerance.
- Caching and persistence improve performance for repeated RDD usage.
- DataFrames are generally preferred for structured data and performance.
In the next chapter, we will explore PySpark DataFrames in detail and learn how they offer more expressive and optimized data processing capabilities.