In this chapter, you’ll learn how to set up PySpark on your local machine, configure it for modern tools like Jupyter and VS Code, and optionally connect it to Hadoop. By the end of this tutorial, you’ll be ready to run your first Spark jobs in Python.
Before we start, make sure you have Python 3 and Java 11 or 17 installed.
You don’t need to install Spark manually anymore. The pyspark package includes everything.
Install Python and Java 11 or 17 from the official websites. After installation, check their versions:
python3 --version
java -version
Run the following command to install the latest stable PySpark:
pip install pyspark==3.5.1
This will install Spark and all necessary components with Python bindings.
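Before starting Spark itself, you can confirm the package is importable with a quick standard-library check (no JVM involved):

```python
import importlib.util

# Look up the pyspark package without importing it (and without starting a JVM).
spec = importlib.util.find_spec("pyspark")
print("pyspark found:", spec is not None)
```

If this prints False, revisit the pip install step before continuing.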
Open your terminal or command prompt and launch Python:
python3
Now run this test code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("TestApp").getOrCreate()
df = spark.range(5)
df.show()
You should see a five-row table with a single id column (values 0 through 4), confirming that Spark successfully created the DataFrame.
On Windows, set the JAVA_HOME environment variable to your JDK directory:

JAVA_HOME = C:\Program Files\Java\jdk-17

Add %JAVA_HOME%\bin to the PATH, then install and launch PySpark:

pip install pyspark
pyspark
Install dependencies and configure your shell:
sudo apt update
sudo apt install openjdk-17-jdk python3-pip
pip install pyspark
In your ~/.bashrc file, add:
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH
Then reload:
source ~/.bashrc
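To double-check the environment Spark will see, this standard-library snippet reports the relevant settings without starting Spark:

```python
import os
import shutil

# Spark's launcher reads JAVA_HOME, falling back to whichever `java` is on the PATH.
print("JAVA_HOME:", os.environ.get("JAVA_HOME", "<not set>"))
print("java on PATH:", shutil.which("java") or "<not found>")
```

If both lines come back empty, re-check the exports in ~/.bashrc.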
On macOS, install the dependencies with Homebrew:

brew install openjdk@17 python
pip3 install pyspark
Add the following to your terminal profile file:
export JAVA_HOME="$(/usr/libexec/java_home -v17)"
export PATH="$JAVA_HOME/bin:$PATH"
To use PySpark in notebooks, install JupyterLab and launch it:

pip install jupyterlab pyspark
jupyter lab
In your notebook cell:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("NotebookApp").getOrCreate()
df = spark.range(10)
df.show()
You do NOT need findspark. This works natively since PySpark 3.3+.
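If you want more control than the bare getOrCreate() call above, the builder also accepts an explicit master and config entries. Here is a hedged sketch: local[*] and the partition count are illustrative choices, and the try/except (with the import inside it) keeps the snippet runnable even on a machine where PySpark or Java isn't set up yet:

```python
count = None
try:
    # Import inside the guard so the sketch degrades gracefully if pyspark is missing.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")                           # run in-process on all CPU cores
        .appName("NotebookApp")
        .config("spark.sql.shuffle.partitions", "4")  # fewer partitions suit small local data
        .getOrCreate()
    )
    count = spark.range(10).count()
    print("rows:", count)
    spark.stop()  # release the JVM when you're done
except Exception as exc:
    print("Spark could not start:", exc)
```

In your own notebook you would drop the try/except and call the builder directly.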
Create a file example.py:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("VSCodeApp").getOrCreate()
df = spark.range(20)
df.show()
Run it from the terminal:
python example.py
If you have a Hadoop cluster running, Spark can read files directly from HDFS:

df = spark.read.text("hdfs://localhost:9000/user/data/sample.txt")
df.show()
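A note on path schemes: the hdfs:// prefix and the localhost:9000 port above depend on your cluster's fs.defaultFS setting, while local files use the file:// scheme. These URIs are purely illustrative:

```python
# The URI scheme tells Spark which filesystem connector to use.
local_path = "file:///tmp/sample.txt"                     # local filesystem
hdfs_path = "hdfs://localhost:9000/user/data/sample.txt"  # HDFS namenode
print(local_path.split("://")[0], hdfs_path.split("://")[0])
```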
In the next chapter, we’ll dive deep into Spark’s architecture, including the DAG engine, job execution stages, and how Spark optimizes your code.
Exercise: Try loading a CSV file using spark.read.csv() and display the first few rows in your Jupyter Notebook.
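One way to set up the exercise end to end: write a tiny CSV with the standard library, then read it back with spark.read.csv(). The file contents and option choices here are just an example, and the guard keeps the snippet runnable even where Spark can't start:

```python
import csv
import os
import tempfile

# Create a small CSV so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "people.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "age"])
    writer.writerows([["Alice", 34], ["Bob", 29]])

try:
    from pyspark.sql import SparkSession  # inside the guard so the sketch degrades gracefully

    spark = SparkSession.builder.appName("CsvExercise").getOrCreate()
    # header=True takes the first row as column names; inferSchema=True guesses column types.
    df = spark.read.csv(path, header=True, inferSchema=True)
    df.show(5)
    spark.stop()
except Exception as exc:
    print("Spark could not start:", exc)
```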