In this chapter, you’ll learn how to set up PySpark on your local machine, configure it for modern tools like Jupyter and VS Code, and optionally connect it to Hadoop. By the end of this tutorial, you’ll be ready to run your first Spark jobs in Python.
What You’ll Need
Before we start, make sure you have:
- Python 3.8 to 3.12
- Java 11 or 17 (Java 8 is deprecated)
- pip (Python package manager)
You don’t need to install Spark manually anymore. The pyspark package includes everything.
Step-by-Step PySpark Installation Guide
Step 1: Install Python and Java
Install Python and Java 11 or 17 from the official websites. After installation, check their versions:
python3 --version
java -version
Step 2: Install PySpark Using pip
Run the following command to install the latest stable PySpark:
pip install pyspark==3.5.1
This will install Spark and all necessary components with Python bindings.
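If you want to confirm which version was installed, you can check it from Python (a quick sanity check, nothing more):
import pyspark
print(pyspark.__version__)  # should print 3.5.1 if the command above succeeded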
Step 3: Verify the Installation
Open your terminal or command prompt and launch Python:
python3
Now run this test code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("TestApp").getOrCreate()
df = spark.range(5)
df.show()
You should see a simple table printed as Spark successfully creates a DataFrame.
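For spark.range(5), the table is a single id column containing the values 0 through 4:
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+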
OS-Specific Setup Instructions
Windows
- Set environment variables:
JAVA_HOME = C:\Program Files\Java\jdk-17
- Add JAVA_HOME\bin to the PATH
- Run:
pip install pyspark
pyspark
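To confirm the variable is visible to your Python environment (and therefore to PySpark), a quick sanity check from a Python prompt looks like this; the printed path should match the example value set above:
import os
print(os.environ.get("JAVA_HOME"))  # e.g. C:\Program Files\Java\jdk-17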
Linux (Ubuntu)
Install dependencies and configure your shell:
sudo apt update
sudo apt install openjdk-17-jdk python3-pip
pip install pyspark
In your ~/.bashrc file, add:
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH
Then reload:
source ~/.bashrc
macOS (Intel/M1/M2)
brew install openjdk@17 python
pip3 install pyspark
Add the following to your shell profile file (for example, ~/.zshrc):
export JAVA_HOME="$(/usr/libexec/java_home -v17)"
export PATH="$JAVA_HOME/bin:$PATH"
Using PySpark in Jupyter Notebooks
Step 1: Install Jupyter + PySpark
pip install jupyterlab pyspark
Step 2: Launch Jupyter Lab
jupyter lab
Step 3: Use PySpark
In your notebook cell:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("NotebookApp").getOrCreate()
df = spark.range(10)
df.show()
You do NOT need findspark; running PySpark in Jupyter has been supported natively since PySpark 3.3+.
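If you need to adjust Spark settings from a notebook, you can pass them through the builder. The memory value below is only an illustrative assumption; pick one that suits your machine:
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("NotebookApp")
    .config("spark.driver.memory", "2g")  # hypothetical driver memory setting
    .getOrCreate()
)
Settings like spark.driver.memory only take effect when the session is first created, so restart the kernel if a session is already running.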
Running PySpark in VS Code
Step 1: Install Required Extensions
- Python Extension
- Jupyter Extension (optional)
Step 2: Create and Run Script
Create a file example.py:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("VSCodeApp").getOrCreate()
df = spark.range(20)
df.show()
Run it from the terminal:
python example.py
Sample HDFS Code
If you also have Hadoop running (for example, a local HDFS NameNode listening on port 9000), PySpark can read files directly from HDFS:
df = spark.read.text("hdfs://localhost:9000/user/data/sample.txt")
df.show()
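As a sketch of the reverse direction, writing results back to HDFS (same NameNode address assumed; the output path is hypothetical):
# write the DataFrame's text column back out; overwrite replaces the path if it already exists
df.write.mode("overwrite").text("hdfs://localhost:9000/user/data/output")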
Recap: What You Learned
- You installed PySpark using pip install pyspark
- You ran PySpark in the terminal, Jupyter Notebook, and VS Code
- You configured environment variables on Windows, Linux, and macOS
- You optionally integrated PySpark with Hadoop HDFS
In the next chapter, we’ll dive deep into Spark’s architecture, including the DAG engine, job execution stages, and how Spark optimizes your code.
Exercise: Try loading a CSV file using spark.read.csv() and display the first few rows in your Jupyter Notebook.
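If you get stuck, here is a minimal sketch, assuming a local file named data.csv with a header row (the file name and options are only an example):
df = spark.read.csv("data.csv", header=True, inferSchema=True)  # header=True uses the first row as column names; inferSchema=True guesses column types
df.show(5)  # display the first five rows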