stackabletech · razvan · Jun 26, 2026 · Jun 29, 2026 · Jun 29, 2026 · Jun 29, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,6 +7,7 @@ All notable changes to this project will be documented in this file.
 ### Added
 
 - BREAKING: Add required CLI argument and env var to set the image repository used to construct final product image names: `IMAGE_REPOSITORY` (`--image-repository`), eg. `oci.example.org/my/namespace` ([#684]).
+- Add CRD version `v1alpha2` for `SparkApplication` and `SparkApplicationTemplate`. The conversion webhook converts `v1alpha1` objects to `v1alpha2` ([#711]).
 
 ### Fixed
 
@@ -15,6 +16,9 @@ All notable changes to this project will be documented in this file.
 
 ### Changed
 
+- BREAKING: The operator no longer runs a separate `spark-submit` process for `SparkApplication`s. The driver is launched directly as a Kubernetes `Job` (built from `spec.driver`) running `spark-submit` in client mode; executors are still created by the driver via Spark's Kubernetes backend. A headless driver `Service` is created so executors can reach the driver. This affects both `v1alpha1` (after conversion) and `v1alpha2` objects ([#711]).
+- The driver `Job` is no longer retried on failure (`backoffLimit` is `0`); the previous `spec.job.retryOnFailureCount` is deprecated and ignored ([#711]).
+- The `SparkApplication` status (`status.phase`) is now derived from the driver `Job` status by the application controller, rather than by a dedicated pod-driver controller watching driver pods. The pod-driver controller and the operator's `pods` RBAC permissions have been removed ([#711]).
 - Document Helm deployed RBAC permissions and remove unnecessary permissions ([#674]).
 - BREAKING: Each custom resource accepts now only the known config files in `configOverrides`:
   - `SparkApplication`: `spark-env.sh` and `security.properties`
@@ -27,6 +31,11 @@ All notable changes to this project will be documented in this file.
 - Fix the `SparkApplication` CRD description, which incorrectly described it as a "Spark cluster stacklet" rather than a Spark application ([#705]).
 - BREAKING: make application templates namespaced instead of cluster wide objects ([#694]).
 
+### Deprecated
+
+- `SparkApplication`/`SparkApplicationTemplate` `spec.job` is deprecated and ignored since `v1alpha2` (renamed to `deprecatedJob` in that version). The driver `Job` is now built from `spec.driver` ([#711]).
+- `SparkApplication`/`SparkApplicationTemplate` `spec.mode` is deprecated and ignored (renamed to `deprecatedMode` in `v1alpha2`): the operator always runs the driver in client mode internally. `mode` is now optional in `v1alpha1` as well ([#711]).
+
 [#674]: https://github.com/stackabletech/spark-k8s-operator/pull/674
 [#679]: https://github.com/stackabletech/spark-k8s-operator/pull/679
 [#680]: https://github.com/stackabletech/spark-k8s-operator/pull/680
@@ -36,6 +45,7 @@ All notable changes to this project will be documented in this file.
 [#694]: https://github.com/stackabletech/spark-k8s-operator/pull/694
 [#696]: https://github.com/stackabletech/spark-k8s-operator/pull/696
 [#705]: https://github.com/stackabletech/spark-k8s-operator/pull/705
+[#711]: https://github.com/stackabletech/spark-k8s-operator/pull/711
 
 ## [26.3.0] - 2026-03-16
 

diff --git a/deploy/helm/spark-k8s-operator/templates/roles.yaml b/deploy/helm/spark-k8s-operator/templates/roles.yaml
@@ -13,19 +13,6 @@ rules:
       - nodes/proxy
     verbs:
       - get
-  # The pod-driver controller watches Spark driver pods
-  # (labelled spark-role=driver) to track SparkApplication completion. It also
-  # deletes driver pods once the application reaches a terminal phase (Succeeded
-  # or Failed).
-  - apiGroups:
-      - ""
-    resources:
-      - pods
-    verbs:
-      - delete
-      - get
-      - list
-      - watch
   # ConfigMaps hold pod templates and Spark configuration. All three controllers apply
   # them via Server-Side Apply (create + patch). The history and connect controllers
   # track them for orphan cleanup (list + delete). All controllers watch ConfigMaps via
@@ -43,9 +30,10 @@ rules:
       - patch
       - watch
   # Services expose Spark History Server and Spark Connect Server for metrics and
-  # inter-component communication. Applied via Server-Side Apply and tracked for orphan
-  # cleanup by the history and connect controllers. The history and connect controllers
-  # watch Services via .owns(Service) to trigger re-reconciliation on change.
+  # inter-component communication, and the app controller creates a headless driver Service per
+  # SparkApplication so executors can reach the driver in client mode. Applied via Server-Side
+  # Apply and tracked for orphan cleanup by the history and connect controllers. The app, history
+  # and connect controllers watch Services via .owns(Service) to trigger re-reconciliation on change.
   # get is required for the ReconciliationPaused strategy in cluster_resources.add().
   - apiGroups:
       - ""
@@ -115,17 +103,20 @@ rules:
       - list
       - patch
       - watch
-  # A Kubernetes Job is created per SparkApplication via Server-Side Apply to run
-  # spark-submit. The app controller applies Jobs directly (not via cluster_resources),
-  # so only create + patch (SSA) are needed. Jobs are not watched and not tracked for
-  # orphan cleanup by any controller.
+  # The driver Job is created per SparkApplication via Server-Side Apply. The app controller
+  # applies Jobs directly (create + patch via SSA) and watches them via .owns(Job) to track
+  # SparkApplication progress, so get + list + watch are also required. Jobs are garbage collected
+  # via their owner reference and ttlSecondsAfterFinished, so no explicit delete is needed.
   - apiGroups:
       - batch
     resources:
       - jobs
     verbs:
       - create
+      - get
+      - list
       - patch
+      - watch
   # PodDisruptionBudgets limit voluntary disruptions to Spark History Server pods.
   # Applied via Server-Side Apply and tracked for orphan cleanup by the history
   # controller. No controller watches PDBs via .owns().

diff --git a/docs/modules/spark-k8s/pages/getting_started/first_steps.adoc b/docs/modules/spark-k8s/pages/getting_started/first_steps.adoc
@@ -6,11 +6,13 @@ Afterwards you can <<_verify_that_it_works, verify that it works>> by looking at
 
 == Starting a Spark job
 
-A Spark application is made of up three components:
+A Spark application is made up of two components:
 
-* Job: this builds a `spark-submit` command from the resource, passing this to internal spark code together with templates for building the driver and executor pods
-* Driver: the driver starts the designated number of executors and removes them when the job is completed.
-* Executor(s): responsible for executing the job itself
+* Driver: the operator creates a Kubernetes `Job` (built from `spec.driver`) whose pod runs `spark-submit` in client mode and therefore _is_ the Spark driver.
+  The driver starts the designated number of executors and removes them when the job is completed.
+* Executor(s): responsible for executing the job itself. They are created by the driver via Spark's Kubernetes backend and connect back to the driver through a headless `Service` created by the operator.
+
+NOTE: Previous versions ran an additional `spark-submit` process in a dedicated pod that then created the driver pod. This is no longer the case; `spec.job` and `spec.mode` are deprecated and ignored.
 
 Create a Spark application by running:
 
@@ -27,22 +29,21 @@ include::example$getting_started/application.yaml[Create a Spark application]
 ----
 <1> `metadata.name` contains the name of the SparkApplication
 <2> `spec.sparkImage`: the image used by the job, driver and executor pods. This can be a custom image built by the user or an official Stackable image. Available official images are stored in the Stackable https://oci.stackable.tech/[image registry,window=_blank]. Information on how to browse the registry can be found xref:contributor:project-overview.adoc#docker-images[here,window=_blank].
-<3> `spec.mode`: only `cluster` is currently supported
+<3> `spec.mode`: deprecated and ignored; the driver always runs in client mode internally.
 <4> `spec.mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job.
   This path is relative to the image, so in this case an example python script (that calculates the value of pi) is running: it is bundled with the Spark code and therefore already present in the job image
-<5> `spec.job`: submit command specific settings.
-<6> `spec.driver`: driver-specific settings.
+<5> `spec.job`: deprecated and ignored. In previous versions this configured the dedicated `spark-submit` job.
+<6> `spec.driver`: driver-specific settings, used to build the pod of the driver `Job`.
 <7> `spec.executor`: executor-specific settings.
 
 == Verify that it works
 
-As mentioned above, the SparkApplication that has just been created builds a `spark-submit` command and pass it to the driver Pod, which in turn creates executor Pods that run for the duration of the job before being clean up.
+As mentioned above, the operator creates a driver `Job` whose pod runs `spark-submit` in client mode and thus acts as the Spark driver. The driver in turn creates executor Pods, which run for the duration of the job and are then cleaned up.
 A running process looks like this:
 
 image::getting_started/spark_running.png[Spark job]
 
-* `pyspark-pi-xxxx`: this is the initializing job that creates the spark-submit command (named as `metadata.name` with a unique suffix)
-* `pyspark-pi-xxxxxxx-driver`: the driver pod that drives the execution
+* `pyspark-pi-xxxxxxx-driver`: the driver pod (created by the driver `Job`) that drives the execution
 * `pythonpi-xxxxxxxxx-exec-x`: the set of executors started by the driver (in the example `spec.executor.instances` was set to 3 which is why 3 executors are running)
 
 Job progress can be followed by issuing this command:

diff --git a/docs/modules/spark-k8s/pages/usage-guide/examples.adoc b/docs/modules/spark-k8s/pages/usage-guide/examples.adoc
@@ -6,7 +6,7 @@ The following examples have the following `spec` fields in common:
 * `version`: the current version is "1.0"
 * `sparkImage`: the docker image that is used by job, driver and executor pods.
   This can be provided by the user.
-* `mode`: only `cluster` is currently supported
+* `mode`: deprecated and ignored; the driver always runs in client mode internally
 * `mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job.
 * `args`: these are the arguments passed directly to the application. In the examples below it is e.g. the input path for part of the public New York taxi dataset.
 * `sparkConf`: these list spark configuration settings that are passed directly to `spark-submit` and which are best defined explicitly by the user. Since the `SparkApplication` "knows" that there is an external dependency (the s3 bucket where the data and/or the application is located) and how that dependency should be treated (i.e. what type of credential checks are required, if any), it is better to have these things declared together.