Course
data-engineering-zoomcamp
Question
What is the difference between a Spark application, job, stage, and task?
Answer
Basic Spark Concepts
When we start working with Apache Spark, one of the first places where many new concepts appear is the graphical interface. There we see terms like application, job, stage, or task, but at first it's not always clear how they relate to each other. Understanding this hierarchy is very useful because it allows us to interpret what Spark is doing internally, debug problems, and better understand the performance of our processes.
Applications
A Spark application is the complete program we execute. It is the entire process that begins when we submit a script with spark-submit, or when we start a session in PySpark or a notebook:
import os
from pyspark.sql import SparkSession

# Fall back to a local master if SPARK_MASTER is not set
spark = SparkSession.builder \
    .master(os.environ.get('SPARK_MASTER', 'local[*]')) \
    .appName("csv-to-parquet") \
    .getOrCreate()
A Spark application includes:
- The driver program, which coordinates execution.
- The executors, which perform the distributed work.
- All operations executed by the program until it finishes.
We can think of the application as the complete execution of our program.
For example, if we run a PySpark script that 1) reads a dataset, 2) performs several transformations, and 3) writes a result, all of that forms a single Spark application.
The graphical interface shows one entry for each executed application.
Job
Within an application, Spark divides the work into jobs. A job is created every time we execute an action on a DataFrame. As we saw in previous chapters, in Spark there are two types of operations:
- Transformations: describe a transformation but are not executed immediately.
- Actions: trigger the immediate execution of the transformations described up to that point.
Some examples of actions are show(), count(), collect(), write, or save. Every time we call an action, Spark creates a new job.
In the script:
df = spark.read.parquet("data.parquet")
df_filtered = df.filter("price > 10")
df_filtered.count()
df_filtered.show()
... two separate jobs will be executed, one for count() and one for show(), even though both use the same DataFrame.
This happens because Spark evaluates transformations lazily and only executes the plan when a result is requested.
Stages
Each job is divided into stages. Stages represent groups of operations that can be executed without needing to redistribute data between nodes.
The reason they are separated into stages is usually an operation called a shuffle. A shuffle occurs when data must be redistributed between partitions; for example in operations like: groupBy, join, distinct, and reduceByKey.
When Spark detects that a shuffle is needed, it divides the job into several stages.
df.groupBy("city").count()
This typically generates an execution plan roughly like this:
- Stage 1: data reading and initial transformation
- Shuffle
- Stage 2: final aggregation
Each stage can be executed in parallel across multiple nodes.
Task
A task is the smallest unit of work in Spark. Each stage is divided into multiple tasks, and each task processes one partition of data.
For example, if a stage processes a dataset with 200 partitions, Spark will launch 200 tasks, and each task will be executed by an executor.
In other words:
Stage
├─ Task 1: processes partition 1
├─ Task 2: processes partition 2
├─ Task 3: processes partition 3
...
The more partitions there are, the more tasks Spark can execute in parallel.
Full Hierarchy in an Example
Imagine this code:
df = spark.read.parquet("rides.parquet")
result = (
df
.filter("passenger_count > 2")
.groupBy("PULocationID")
.count()
)
result.show()
The execution might look like this:
- Application: the complete script.
- Job: created by show().
- Stages:
- Reading and filtering.
- Shuffle and aggregation.
- Tasks: one per partition.
Relevant Links
To get in-depth information about these concepts, check:
Checklist