Describe the solution you'd like
Add support in Trident's ONTAP REST job handling to use ONTAP's return_timeout parameter for async REST operations where supported, instead of accepting an immediate 202 Accepted response and then polling the job with exponential backoff.
For operations such as volume create, volume mount/unmount, and volume export policy updates, ONTAP supports return_timeout. Setting return_timeout=120 allows ONTAP to hold the REST request until the operation completes, avoiding multiple follow-up /api/cluster/jobs/{uuid} polling calls and the sleep time introduced by exponential backoff.
In local testing with an ontap-nas-style workflow, rolling up the async call plus job polling showed the following averages:
| Operation | Current exponential backoff avg | return_timeout=120 avg | Improvement |
| --- | --- | --- | --- |
| Volume create | 3.86s | 1.02s | 73.61% faster |
| Volume mount | 1.98s | 0.53s | 73.06% faster |
| Apply export policy | 1.12s | 0.17s | 84.55% faster |
| Total for these operations | 6.95s | 1.72s | 75.21% faster |
Considering only the highest latencies observed for each approach, return_timeout is still faster.
| Scope | Exponential backoff worst total | return_timeout=120 worst total | Improvement | Faster |
| --- | --- | --- | --- | --- |
| create + mount + apply export policy | 14.58s | 2.49s | 12.09s | 82.95% |
The current approach also generates extra REST traffic. In the same test flow, the exponential-backoff implementation averaged multiple job polling calls per operation:
| Operation | Avg job poll calls | Avg wait time between polls |
| --- | --- | --- |
| Volume create | 4.5 | 3.56s |
| Volume mount | 3.6 | 1.88s |
| Apply export policy | 3.0 | 1.00s |
The desired behavior is to use return_timeout for supported ONTAP REST POST/PATCH/DELETE operations, and only fall back to job polling if ONTAP still returns 202 Accepted after the timeout expires.
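The desired behavior above can be sketched roughly as follows. This is a minimal illustration, not Trident's implementation: `with_return_timeout`, `run_operation`, and the injected `send` and `poll_job` callables are hypothetical names, and the sketch assumes that a 202 Accepted body carries the job UUID under `job.uuid`.

```python
from typing import Callable, Dict, Tuple
from urllib.parse import urlencode

# Hypothetical response shape: (HTTP status code, decoded JSON body).
Response = Tuple[int, Dict]


def with_return_timeout(path: str, seconds: int = 120) -> str:
    """Append return_timeout to a REST path, keeping any existing query string."""
    separator = "&" if "?" in path else "?"
    return f"{path}{separator}{urlencode({'return_timeout': seconds})}"


def run_operation(
    send: Callable[[str], Response],
    poll_job: Callable[[str], Dict],
    path: str,
    return_timeout: int = 120,
) -> Dict:
    """Issue one request with return_timeout; poll only if ONTAP answers 202."""
    status, body = send(with_return_timeout(path, return_timeout))
    if status == 202:
        # The operation did not finish within return_timeout, so fall back
        # to the existing job polling path (assumed body layout: job.uuid).
        return poll_job(body["job"]["uuid"])
    if 200 <= status < 300:
        # Common case: the operation completed while ONTAP held the request.
        return body
    raise RuntimeError(f"ONTAP request failed with HTTP {status}: {body}")
```

Passing `send` and `poll_job` in as callables keeps the existing polling code untouched and makes the fallback path easy to exercise in tests without a cluster.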
Describe alternatives you've considered
The current behavior uses exponential backoff job polling:
InitialInterval = 1s
Multiplier = 1.414
MaxInterval = 2s
RandomizationFactor = 0.1
MaxElapsedTime = 2m
This works correctly, but it adds latency because Trident waits between job status checks even when the job may have completed shortly after the prior poll. It also increases REST API traffic due to repeated calls to /api/cluster/jobs/{uuid}?fields=state and a final job details lookup.
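To illustrate where the added latency comes from, here is a simplified model of the polling schedule using the parameters listed above. It is a sketch under stated assumptions: it ignores RandomizationFactor and per-request latency, assumes the first poll happens after the first sleep, and uses the hypothetical helper names `backoff_intervals` and `first_observation`.

```python
from itertools import accumulate
from typing import List, Optional


def backoff_intervals(
    initial: float = 1.0,
    multiplier: float = 1.414,
    max_interval: float = 2.0,
    max_elapsed: float = 120.0,
) -> List[float]:
    """Sleep intervals in seconds, without jitter, until max_elapsed is used up."""
    intervals: List[float] = []
    elapsed = 0.0
    current = initial
    while elapsed + current <= max_elapsed:
        intervals.append(current)
        elapsed += current
        # Grow the interval geometrically, capped at max_interval.
        current = min(current * multiplier, max_interval)
    return intervals


def first_observation(job_duration: float, intervals: List[float]) -> Optional[float]:
    """Time of the first poll at or after the job finishes, or None if never seen."""
    for poll_time in accumulate(intervals):
        if poll_time >= job_duration:
            return poll_time
    return None
```

In this model a job that finishes 1.1s after submission is not observed until the second poll at roughly 2.41s (1s + 1.414s), so more than a second of the measured latency is sleep time rather than ONTAP work.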
Another alternative would be tuning the backoff parameters to poll more aggressively, but that would trade latency for more API traffic. Using return_timeout lets ONTAP handle the wait server-side and avoids most client-side polling in the common case.
Additional context
I tested two versions of the same ONTAP NAS FlexVol-per-volume workflow:
- Current Trident-style behavior using async job polling with exponential backoff.
- A REST variant using return_timeout=120 for supported volume operations.
Both flows performed the same high-level operations:
- Create volume
- Mount volume
- Create/apply export policy
- Unmount volume (cleanup after the test)
- Delete volume (cleanup after the test)
- Delete export policy (cleanup after the test)
The focused comparison for the operations that trigger job polling showed:
| Operation | Exponential backoff rolled-up avg | return_timeout=120 avg | Difference |
| --- | --- | --- | --- |
| Volume create | 3.86s | 1.02s | -2.84s |
| Volume mount | 1.98s | 0.53s | -1.44s |
| Apply export policy | 1.12s | 0.17s | -0.94s |
| Total | 6.95s | 1.72s | -5.23s |
The rolled-up exponential backoff time includes the initial async REST call, all job polling calls, job details lookup, and sleep time between polls. The return_timeout=120 time is the single REST operation latency.
This suggests Trident could reduce provisioning latency and REST API chatter by preferring return_timeout for supported ONTAP REST operations, with existing job polling retained as a fallback when a job does not complete within the return timeout.