PySpark Installation and Environment Setup

In this chapter, you’ll learn how to set up PySpark on your local machine, configure it for modern tools like Jupyter and VS Code, and optionally connect it to Hadoop. By the end of this tutorial, you’ll be ready to run your first Spark jobs in Python.

What You’ll Need

Before we start, make sure you have:

  • Python 3.8 to 3.12
  • Java 11 or 17 (Java 8 is deprecated)
  • pip (Python package manager)

You don’t need to install Spark manually anymore. The pyspark package includes everything.

Step-by-Step PySpark Installation Guide

Step 1: Install Python and Java

Install Python and Java 11 or 17 from the official websites. After installation, check their versions:

python3 --version
java -version
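
If everything is in place, you should see version strings along these lines (your exact versions will differ):

Python 3.11.6
openjdk version "17.0.9" 2023-10-17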

Step 2: Install PySpark Using pip

Run the following command to install PySpark. Pinning a specific version keeps your environment reproducible; at the time of writing, 3.5.1 is a recent stable release:

pip install pyspark==3.5.1

This will install Spark and all necessary components with Python bindings.

Step 3: Verify the Installation

Open your terminal or command prompt and launch Python:

python3

Now run this test code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TestApp").getOrCreate()
df = spark.range(5)
df.show()

You should see a simple table printed as Spark successfully creates a DataFrame.
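
The output should look roughly like this: a single id column with five rows.

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+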

OS-Specific Setup Instructions

Windows

  1. Set the JAVA_HOME environment variable (adjust the path to match your JDK install):
JAVA_HOME = C:\Program Files\Java\jdk-17
  2. Add %JAVA_HOME%\bin to the PATH
  3. Run:
pip install pyspark
pyspark

Linux (Ubuntu)

Install dependencies and configure your shell:

sudo apt update
sudo apt install openjdk-17-jdk python3-pip
pip install pyspark

In your ~/.bashrc file, add:

export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH

Then reload:

source ~/.bashrc

macOS (Intel/M1/M2)

brew install openjdk@17 python
pip3 install pyspark

Add the following to your shell profile (~/.zshrc on recent macOS versions). If you installed the JDK through Homebrew, also follow the symlink step Homebrew prints in its caveats so that /usr/libexec/java_home can locate the JDK:

export JAVA_HOME="$(/usr/libexec/java_home -v17)"
export PATH="$JAVA_HOME/bin:$PATH"
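
Whichever OS you are on, you can confirm that the pip-installed package is visible to Python with a quick check (the exact version string depends on what you installed):

# Quick sanity check that the pyspark package is importable
import pyspark

print(pyspark.__version__)  # e.g. 3.5.1, depending on your installed version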

Using PySpark in Jupyter Notebooks

Step 1: Install Jupyter + PySpark

pip install jupyterlab pyspark

Step 2: Launch Jupyter Lab

jupyter lab

Step 3: Use PySpark

In your notebook cell:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("NotebookApp").getOrCreate()
df = spark.range(10)
df.show()

You do NOT need findspark; pip-installed PySpark works natively in notebooks since version 3.3.
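
If you want more control over the local session, you can pass configuration options through the builder. The values below are purely illustrative; tune them for your machine. Note that getOrCreate() returns any session that already exists, so stop the old one first if you want new settings to take effect.

from pyspark.sql import SparkSession

# Illustrative settings for a small local session; adjust for your machine
spark = (
    SparkSession.builder
    .appName("NotebookApp")
    .config("spark.driver.memory", "2g")           # memory for the local driver JVM
    .config("spark.sql.shuffle.partitions", "8")   # fewer shuffle partitions for small local data
    .getOrCreate()
)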

Running PySpark in VS Code

Step 1: Install Required Extensions

  • Python Extension
  • Jupyter Extension (optional)

Step 2: Create and Run Script

Create a file example.py:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("VSCodeApp").getOrCreate()
df = spark.range(20)
df.show()

Run it from the terminal:

python example.py
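
Once the basic script runs, you can grow it into something more realistic. The sketch below is only an illustration (the filter and derived column are arbitrary examples) and stops the session when it finishes:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("VSCodeApp").getOrCreate()

# Keep the even ids and add a derived column (purely illustrative)
df = spark.range(20)
result = df.filter(F.col("id") % 2 == 0).withColumn("id_squared", F.col("id") * F.col("id"))
result.show()

spark.stop()  # shut down the local Spark session when the script is done

You can also launch the same file with spark-submit example.py if you prefer Spark's own launcher.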

Sample HDFS Code (Optional)

If you have a Hadoop HDFS service running (the example assumes a local NameNode on port 9000), Spark can read files from it directly using an hdfs:// path:

df = spark.read.text("hdfs://localhost:9000/user/data/sample.txt")
df.show()
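
The same pattern works for structured formats. For example, reading a CSV file from HDFS; the file path below is hypothetical, and the NameNode address matches the example above:

# Hypothetical CSV path on the same local HDFS NameNode
csv_df = spark.read.csv(
    "hdfs://localhost:9000/user/data/sales.csv",
    header=True,       # first row contains column names
    inferSchema=True,  # let Spark infer column types
)
csv_df.printSchema()
csv_df.show(5)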

Recap: What You Learned

  • You installed PySpark using pip install pyspark
  • You ran PySpark in the terminal, Jupyter Notebook, and VS Code
  • You configured environment variables on Windows, Linux, and macOS
  • You optionally integrated PySpark with Hadoop HDFS

In the next chapter, we’ll dive deep into Spark’s architecture, including the DAG engine, job execution stages, and how Spark optimizes your code.

Exercise: Try loading a CSV file using spark.read.csv() and display the first few rows in your Jupyter Notebook.
