Note: xpk workload create works only on clusters created through XPK. See docs on how to create a cluster via XPK.
- Workload Create (submit training job):
  xpk workload create \
  --workload xpk-test-workload --command "echo goodbye" \
  --cluster xpk-test \
  --tpu-type=v5litepod-16 --project=$PROJECT
- Workload create (DWS flex with queued provisioning):
  xpk workload create \
  --workload xpk-test-workload --command "echo goodbye" \
  --cluster xpk-test --flex \
  --tpu-type=v5litepod-16 --project=$PROJECT
- Workload Create for Pathways: Pathways workloads can be submitted using `workload create-pathways` on a Pathways-enabled cluster (created with `cluster create-pathways`).
  Pathways workload example:
  xpk workload create-pathways \
  --workload xpk-pw-test \
  --num-slices=1 \
  --tpu-type=v5litepod-16 \
  --cluster xpk-pw-test \
  --docker-name='user-workload' \
  --docker-image=<maxtext docker image> \
  --command='python3 -m MaxText.train MaxText/configs/base.yml base_output_directory=<output directory> dataset_path=<dataset path> per_device_batch_size=1 enable_checkpointing=false enable_profiler=false remat_policy=full global_parameter_scale=4 steps=300 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=$(USER)-pw-xpk-test-1 enable_single_controller=True'
  Regular workloads can also be submitted on a Pathways-enabled cluster (created with `cluster create-pathways`).
  Regular workload example:
  xpk workload create-pathways \
  --workload xpk-regular-test \
  --num-slices=1 \
  --tpu-type=v5litepod-16 \
  --cluster xpk-pw-test \
  --docker-name='user-workload' \
  --docker-image=<maxtext docker image> \
  --command='python3 -m MaxText.train MaxText/configs/base.yml base_output_directory=<output directory> dataset_path=<dataset path> per_device_batch_size=1 enable_checkpointing=false enable_profiler=false remat_policy=full global_parameter_scale=4 steps=300 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=$(USER)-pw-xpk-test-1'
  Pathways in headless mode - Pathways now offers the capability to run JAX workloads in Vertex AI notebooks or in GCE VMs! Specify `--headless` with `workload create-pathways` when the user workload is not provided in a docker container.
  xpk workload create-pathways --headless \
  --workload xpk-pw-headless \
  --num-slices=1 \
  --tpu-type=v5litepod-16 \
  --cluster xpk-pw-test
  Executing the command above would provide the address of the proxy that the user job should connect to.
  kubectl get pods
  kubectl port-forward pod/<proxy-pod-name> 29000:29000
  JAX_PLATFORMS=proxy JAX_BACKEND_TARGET=grpc://127.0.0.1:29000 python -c 'import pathwaysutils; import jax; print(jax.devices())'
  Specify `JAX_PLATFORMS=proxy` and `JAX_BACKEND_TARGET=<proxy address from above>` and `import pathwaysutils` to establish this connection between the user's JAX code and the Pathways proxy. Execute Pathways workloads interactively on Vertex AI notebooks!
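As a minimal sketch of the environment setup for headless mode, the snippet below builds the proxy address and exports the two variables named above. The helper name `pathways_backend_target` is hypothetical, and the default host and port simply mirror the port-forward example (`127.0.0.1:29000`); substitute the proxy address reported for your own workload.

```shell
# Hypothetical helper: format the gRPC address that JAX_BACKEND_TARGET expects.
# Defaults mirror the port-forward example (127.0.0.1:29000).
pathways_backend_target() {
  printf 'grpc://%s:%s\n' "${1:-127.0.0.1}" "${2:-29000}"
}

# Export the two variables for the current shell session.
export JAX_PLATFORMS=proxy
export JAX_BACKEND_TARGET="$(pathways_backend_target)"

echo "JAX_PLATFORMS=$JAX_PLATFORMS"
echo "JAX_BACKEND_TARGET=$JAX_BACKEND_TARGET"
```

With these exported, a JAX script started from the same shell only needs `import pathwaysutils` before using `jax`.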
`--max-restarts <value>`: By default, this is 0. This will restart the job `<value>` times when the job terminates. For production jobs, it is recommended to increase this to a large number, say 50. Real jobs can be interrupted due to hardware failures and software updates. We assume your job has implemented checkpointing so the job restarts near where it was interrupted.
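One way to apply this recommendation is to choose the restart count per job tier. The sketch below is illustrative only: the helper `max_restarts_for` and the tier name `prod` are hypothetical, and the command is printed rather than executed so it can be reviewed first. The workload, cluster, and TPU type reuse the placeholder values from the earlier examples.

```shell
# Hypothetical helper: pick a --max-restarts value for a job tier.
# "prod" uses 50 (the example value suggested above); anything else keeps the default of 0.
max_restarts_for() {
  case "$1" in
    prod) echo 50 ;;
    *)    echo 0 ;;
  esac
}

# Print (rather than run) the resulting command.
echo "xpk workload create --workload xpk-test-workload --command \"echo goodbye\"" \
     "--cluster xpk-test --tpu-type=v5litepod-16 --max-restarts $(max_restarts_for prod)"
```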
To submit jobs on a cluster with A3 or A4 machines, run the command with the selected device type. To create a cluster with A3 or A4 machines, see here.
| Machine | Device type |
|---|---|
| A3 Mega | h100-mega-80gb-8 |
| A3 Ultra | h200-141gb-8 |
| A4 | b200-8 |
xpk workload create \
--workload=$WORKLOAD_NAME --command="echo goodbye" \
--cluster=$CLUSTER_NAME --device-type DEVICE_TYPE \
--zone=$COMPUTE_ZONE --project=$PROJECT_ID \
--num-nodes=$WORKLOAD_NUM_NODES
The docker image flags/arguments introduced in the workloads section can be used with A3 or A4 machines as well.
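The machine-to-device-type table above can be captured in a small lookup so scripts do not hard-code the strings. This is a sketch; the function name `device_type_for` is hypothetical, and it covers only the three machines listed in the table.

```shell
# Map a machine family from the table above to its device type.
# Returns non-zero for machines not listed in the table.
device_type_for() {
  case "$1" in
    "A3 Mega")  echo "h100-mega-80gb-8" ;;
    "A3 Ultra") echo "h200-141gb-8" ;;
    "A4")       echo "b200-8" ;;
    *)          echo "unknown machine: $1" >&2; return 1 ;;
  esac
}

DEVICE_TYPE="$(device_type_for "A3 Ultra")"
echo "--device-type $DEVICE_TYPE"
```

The resulting `DEVICE_TYPE` value is what replaces the `DEVICE_TYPE` placeholder in the command above.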
In order to run NCCL test on A3 machines check out this guide.
To schedule a workload on a Super-slicing cluster, specify the TPU type with the desired slice configuration (e.g., tpu7x-4x4x12).
Example Usage:
WORKLOAD_SLICE=4x4x12
xpk workload create \
--workload=$WORKLOAD_NAME \
--cluster=$CLUSTER_NAME \
--project="$PROJECT_ID" \
--zone="$ZONE" \
--tpu-type="tpu7x-${WORKLOAD_SLICE}" \
--command="python3 fake_training.py"
- Set the priority level of your workload with `--priority=LEVEL`
  We have five priorities defined: [`very-low`, `low`, `medium`, `high`, `very-high`]. The default priority is `medium`.
  Priority determines:
  - Order of queued jobs.
    Queued jobs are ordered by `very-low` < `low` < `medium` < `high` < `very-high`
  - Preemption of lower priority workloads.
    A higher priority job will `evict` lower priority jobs. Evicted jobs are brought back to the queue and will re-hydrate appropriately.
  xpk workload create \
  --workload xpk-test-medium-workload --command "echo goodbye" --cluster \
  xpk-test --tpu-type=v5litepod-16 --priority=medium
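The ordering and preemption rules above can be sketched as a pair of shell helpers. Both function names (`priority_rank`, `outranks`) are hypothetical and exist only for illustration. They model the documented ordering, not the scheduler's actual implementation.

```shell
# Hypothetical helper: map a priority name to its position in the ordering
# very-low < low < medium < high < very-high. Unknown names return non-zero.
priority_rank() {
  case "$1" in
    very-low)  echo 0 ;;
    low)       echo 1 ;;
    medium)    echo 2 ;;
    high)      echo 3 ;;
    very-high) echo 4 ;;
    *)         echo "invalid priority: $1" >&2; return 1 ;;
  esac
}

# Succeeds when the first priority is strictly higher than the second,
# i.e. when the first workload could evict the second.
outranks() {
  [ "$(priority_rank "$1")" -gt "$(priority_rank "$2")" ]
}

if outranks high medium; then
  echo "a high-priority workload can evict a medium-priority one"
fi
```

A check like `priority_rank "$LEVEL"` before submission catches typos in the `--priority` value early.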
Note: This feature is available in XPK >= 0.4.0. Enable Vertex AI API in your Google Cloud console to use this feature. Make sure you have Vertex AI Administrator role assigned to your user account and to the Compute Engine Service account attached to the node pools in the cluster.
Vertex AI Experiment is a tool that helps to track and analyze an experiment run on Vertex AI Tensorboard. To learn more about Vertex AI Experiments, visit this.
XPK will create a Vertex AI Experiment in the `workload create` command and attach the Vertex AI Tensorboard created for the cluster during `cluster create`. If a cluster was created before this feature was released, no Vertex AI Tensorboard exists for that cluster and `workload create` will fail. Re-run `cluster create` to create a Vertex AI Tensorboard, then run `workload create` again to schedule your workload.
- Create Vertex AI Experiment with default Experiment name:
xpk workload create \
--cluster xpk-test --workload xpk-workload \
--use-vertex-tensorboard
This will create a Vertex AI Experiment with the name `xpk-test-xpk-workload` (`<args.cluster>-<args.workload>`).
- Create Vertex AI Experiment with user-specified Experiment name:
xpk workload create \
--cluster xpk-test --workload xpk-workload \
--use-vertex-tensorboard --experiment-name=test-experiment
This will create a Vertex AI Experiment with the name `test-experiment`.
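The naming rule in the two cases above can be sketched as follows. The function `experiment_name` is a hypothetical helper that reproduces the documented rule (user-specified name if given, otherwise `<cluster>-<workload>`). It is not XPK's own code.

```shell
# Hypothetical helper mirroring the naming rule described above.
# $1 = cluster, $2 = workload, $3 = optional --experiment-name value
experiment_name() {
  if [ -n "${3:-}" ]; then
    echo "$3"
  else
    echo "$1-$2"
  fi
}

experiment_name xpk-test xpk-workload                  # default name
experiment_name xpk-test xpk-workload test-experiment  # user-specified name
```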
Check out MaxText example on how to update your workload to automatically upload logs collected in your Tensorboard directory to the Vertex AI Experiment created by workload create.
- Workload Delete (delete training job):
  xpk workload delete \
  --workload xpk-test-workload --cluster xpk-test
  This will only delete `xpk-test-workload` workload in `xpk-test` cluster.
- Workload Delete (delete all training jobs in the cluster):
  xpk workload delete \
  --cluster xpk-test
  This will delete all the workloads in `xpk-test` cluster. Deletion will only begin if you type `y` or `yes` at the prompt. Multiple workload deletions are processed in batches for optimized processing.
- Workload Delete supports filtering. Delete a portion of jobs that match user criteria.
  - Filter by Job: `filter-by-job`
    xpk workload delete \
    --cluster xpk-test --filter-by-job=$USER
    This will delete all the workloads in `xpk-test` cluster whose names start with `$USER`. Deletion will only begin if you type `y` or `yes` at the prompt.
  - Filter by Status: `filter-by-status`
    xpk workload delete \
    --cluster xpk-test --filter-by-status=QUEUED
    This will delete all the workloads in `xpk-test` cluster that have the status as Admitted or Evicted, and the number of running VMs is 0. Deletion will only begin if you type `y` or `yes` at the prompt. Status can be: `EVERYTHING`, `FINISHED`, `RUNNING`, `QUEUED`, `FAILED`, `SUCCESSFUL`.
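Because deletion is destructive, a script may want to validate the status filter before building the command. The sketch below is one way to do that. The helper `is_valid_status_filter` is hypothetical, and treating the values as case-sensitive is an assumption of this sketch, not a documented XPK behavior. The command is printed rather than run.

```shell
# Hypothetical pre-check: accept only the status values listed above.
is_valid_status_filter() {
  case "$1" in
    EVERYTHING|FINISHED|RUNNING|QUEUED|FAILED|SUCCESSFUL) return 0 ;;
    *) return 1 ;;
  esac
}

STATUS_FILTER="QUEUED"
if is_valid_status_filter "$STATUS_FILTER"; then
  # Print the command instead of running it, since deletion is destructive.
  echo "xpk workload delete --cluster xpk-test --filter-by-status=$STATUS_FILTER"
else
  echo "unsupported status filter: $STATUS_FILTER" >&2
fi
```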
- Workload List (see training jobs):
  xpk workload list \
  --cluster xpk-test
- Example Workload List Output:
  The example below shows five jobs with different statuses:
  - `user-first-job-failed`: filter-status is `FINISHED` and `FAILED`.
  - `user-second-job-success`: filter-status is `FINISHED` and `SUCCESSFUL`.
  - `user-third-job-running`: filter-status is `RUNNING`.
  - `user-forth-job-in-queue`: filter-status is `QUEUED`.
  - `user-fifth-job-preempted`: filter-status is `QUEUED`.

  | Jobset Name | Created Time | Priority | TPU VMs Needed | TPU VMs Running/Ran | TPU VMs Done | Status | Status Message | Status Time |
  |---|---|---|---|---|---|---|---|---|
  | user-first-job-failed | 2023-1-1T1:00:00Z | medium | 4 | 4 | `<none>` | Finished | JobSet failed | 2023-1-1T1:05:00Z |
  | user-second-job-success | 2023-1-1T1:10:00Z | medium | 4 | 4 | 4 | Finished | JobSet finished successfully | 2023-1-1T1:14:00Z |
  | user-third-job-running | 2023-1-1T1:15:00Z | medium | 4 | 4 | `<none>` | Admitted | Admitted by ClusterQueue cluster-queue | 2023-1-1T1:16:00Z |
  | user-forth-job-in-queue | 2023-1-1T1:16:05Z | medium | 4 | `<none>` | `<none>` | Admitted | couldn't assign flavors to pod set slice-job: insufficient unused quota for google.com/tpu in flavor 2xv4-8, 4 more need | 2023-1-1T1:16:10Z |
  | user-fifth-job-preempted | 2023-1-1T1:10:05Z | low | 4 | `<none>` | `<none>` | Evicted | Preempted to accommodate a higher priority Workload | 2023-1-1T1:10:00Z |

- Workload List supports filtering. Observe a portion of jobs that match user criteria.
  - Filter by Status: `filter-by-status`
    Filter the workload list by the status of respective jobs. Status can be: `EVERYTHING`, `FINISHED`, `RUNNING`, `QUEUED`, `FAILED`, `SUCCESSFUL`.
  - Filter by Job: `filter-by-job`
    Filter the workload list by the name of a job.
    xpk workload list \
    --cluster xpk-test --filter-by-job=$USER
- Workload List supports waiting for the completion of a specific job. XPK will follow an existing job until it has finished or the `timeout`, if provided, has been reached and then list the job. If no `timeout` is specified, the default value is set to the max value, 1 week. You may also set `timeout=0` to poll the job once.
  Wait for a job to complete.
  xpk workload list \
  --cluster xpk-test --wait-for-job-completion=xpk-test-workload
  Wait for a job to complete with a timeout of 300 seconds.
  xpk workload list \
  --cluster xpk-test --wait-for-job-completion=xpk-test-workload \
  --timeout=300
  Return codes
  - `0`: Workload finished and completed successfully.
  - `124`: Timeout was reached before workload finished.
  - `125`: Workload finished but did not complete successfully.
  - `1`: Other failure.
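In a CI script, the return codes listed above can be turned into readable messages. The following is a minimal sketch. The function name `describe_wait_result` is hypothetical, and treating every unlisted code as "other failure" is a simplification of this sketch (the list above names only `1`). The `xpk` invocation is left as a comment so the snippet runs without a cluster.

```shell
# Translate the documented return codes into a short message.
describe_wait_result() {
  case "$1" in
    0)   echo "workload finished and completed successfully" ;;
    124) echo "timeout reached before workload finished" ;;
    125) echo "workload finished but did not complete successfully" ;;
    *)   echo "other failure (exit code $1)" ;;
  esac
}

# Typical use (commented out so the sketch runs without a cluster):
#   xpk workload list --cluster xpk-test \
#     --wait-for-job-completion=xpk-test-workload --timeout=300
#   describe_wait_result "$?"
describe_wait_result 124
```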