Use ONTAP REST return_timeout to reduce job polling latency #1144

@ashokmuthyalapati

Describe the solution you'd like
Add support in Trident's ONTAP REST job handling for ONTAP's return_timeout parameter on async REST operations that support it, instead of accepting an immediate 202 Accepted response and then polling the job with exponential backoff.

For operations such as volume create, volume mount/unmount, and volume export policy updates, ONTAP supports return_timeout. Setting return_timeout=120 allows ONTAP to hold the REST request until the operation completes, avoiding multiple follow-up /api/cluster/jobs/{uuid} polling calls and the sleep time introduced by exponential backoff.
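As a rough illustration of the request shape, the sketch below appends return_timeout to an ONTAP REST URL. The helper name `with_return_timeout`, the host `cluster.example`, and the 0 to 120 second bound are assumptions made for this example; this is not Trident code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed upper bound in seconds; 120 is the value proposed in this issue.
MAX_RETURN_TIMEOUT = 120


def with_return_timeout(url: str, seconds: int = MAX_RETURN_TIMEOUT) -> str:
    """Return url with a return_timeout query parameter added or replaced."""
    if not 0 <= seconds <= MAX_RETURN_TIMEOUT:
        raise ValueError(f"return_timeout must be within 0..{MAX_RETURN_TIMEOUT}")
    parts = urlsplit(url)
    # Keep every existing query parameter except a previous return_timeout.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "return_timeout"
    ]
    query.append(("return_timeout", str(seconds)))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    # Hypothetical cluster address; the path is the ONTAP volume collection.
    print(with_return_timeout("https://cluster.example/api/storage/volumes"))
```

The same helper could be applied to the POST, PATCH, and DELETE URLs of any operation that supports the parameter.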

In local testing with an ontap-nas-style workflow, rolling up the async call plus job polling showed the following averages:

| Operation | Current exponential backoff avg | return_timeout=120 avg | Improvement |
| --- | --- | --- | --- |
| Volume create | 3.86s | 1.02s | 73.61% faster |
| Volume mount | 1.98s | 0.53s | 73.06% faster |
| Apply export policy | 1.12s | 0.17s | 84.55% faster |
| Total for these operations | 6.95s | 1.72s | 75.21% faster |

Even when comparing only the highest latencies observed for each approach, return_timeout is still faster.

| Scope | Exponential backoff worst total | return_timeout=120 worst total | Improvement | Faster |
| --- | --- | --- | --- | --- |
| create + mount + apply export policy | 14.58s | 2.49s | 12.09s | 82.95% |

The current approach also generates extra REST traffic. In the same test flow, the exponential-backoff implementation averaged multiple job polling calls per operation:

| Operation | Avg job poll calls | Avg wait time between polls |
| --- | --- | --- |
| Volume create | 4.5 | 3.56s |
| Volume mount | 3.6 | 1.88s |
| Apply export policy | 3.0 | 1.00s |

The desired behavior is to use return_timeout for supported ONTAP REST POST/PATCH/DELETE operations, and only fall back to job polling if ONTAP still returns 202 Accepted after the timeout expires.
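A minimal sketch of that control flow, assuming a 202 response body carries the job UUID under `job.uuid`. The names `run_with_fallback`, `send`, and `poll_job` are hypothetical and the transport is injected, so nothing here is Trident's actual implementation.

```python
from typing import Any, Callable, Dict, Tuple

Response = Tuple[int, Dict[str, Any]]  # (HTTP status, decoded JSON body)


def run_with_fallback(
    send: Callable[[int], Response],
    poll_job: Callable[[str], Dict[str, Any]],
    return_timeout: int = 120,
) -> Tuple[Dict[str, Any], bool]:
    """Issue one request with return_timeout; poll only if ONTAP answers 202.

    Returns (result, polled), where polled says whether the fallback ran.
    """
    status, body = send(return_timeout)
    if status >= 400:
        raise RuntimeError(f"request failed with HTTP {status}: {body}")
    if status != 202:
        # The operation finished within the timeout; no job polling needed.
        return body, False
    # Still running after the timeout: fall back to the existing job polling.
    job_uuid = body["job"]["uuid"]  # assumed shape of the 202 body
    return poll_job(job_uuid), True


if __name__ == "__main__":
    # Fake transport that completes synchronously, as in the common case.
    result, polled = run_with_fallback(
        send=lambda timeout: (201, {"num_records": 1}),
        poll_job=lambda uuid: {"state": "success"},
    )
    print(result, polled)
```

Keeping the polling function as the fallback means the existing backoff logic stays in place for jobs that outlast the timeout.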

Describe alternatives you've considered
The current behavior uses exponential backoff job polling:

  • InitialInterval = 1s
  • Multiplier = 1.414
  • MaxInterval = 2s
  • RandomizationFactor = 0.1
  • MaxElapsedTime = 2m

This works correctly, but it adds latency because Trident waits between job status checks even when the job may have completed shortly after the prior poll. It also increases REST API traffic due to repeated calls to /api/cluster/jobs/{uuid}?fields=state and a final job details lookup.
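To make the added latency concrete, the sketch below replays the backoff schedule listed above under simplifying assumptions: randomization is ignored, each REST call is treated as instantaneous, and the first status check happens after the initial interval. The function names are made up for this example.

```python
from typing import Iterator, Tuple

# Parameters listed in this issue (RandomizationFactor is ignored here).
INITIAL_INTERVAL = 1.0
MULTIPLIER = 1.414
MAX_INTERVAL = 2.0
MAX_ELAPSED_TIME = 120.0


def backoff_intervals() -> Iterator[float]:
    """Yield successive sleep intervals, capped at MAX_INTERVAL."""
    interval = INITIAL_INTERVAL
    while True:
        yield interval
        interval = min(interval * MULTIPLIER, MAX_INTERVAL)


def observe_completion(job_seconds: float) -> Tuple[float, int]:
    """Return (time at which completion is observed, number of polls made)."""
    elapsed = 0.0
    polls = 0
    for wait in backoff_intervals():
        elapsed += wait
        if elapsed > MAX_ELAPSED_TIME:
            raise TimeoutError("job did not finish within MaxElapsedTime")
        polls += 1
        if elapsed >= job_seconds:
            return elapsed, polls
    raise AssertionError("unreachable: the interval generator is infinite")


if __name__ == "__main__":
    for job_seconds in (0.5, 1.1, 3.0):
        seen_at, polls = observe_completion(job_seconds)
        print(f"job={job_seconds:.1f}s observed at {seen_at:.3f}s after {polls} poll(s)")
```

Under these assumptions a job that finishes after 1.1s is only observed at the second poll, roughly 2.41s in, so more than a second of the measured latency is sleep time rather than ONTAP work.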

Another alternative would be tuning the backoff parameters to poll more aggressively, but that would trade latency for more API traffic. Using return_timeout lets ONTAP handle the wait server-side and avoids most client-side polling in the common case.

Additional context
I tested two versions of the same ONTAP NAS FlexVol-per-volume workflow:

  1. Current Trident-style behavior using async job polling with exponential backoff.
  2. A REST variant using return_timeout=120 for supported volume operations.

Both flows performed the same high-level operations:

  • Create volume
  • Mount volume
  • Create/apply export policy
  • Unmount volume (cleanup after the test)
  • Delete volume (cleanup after the test)
  • Delete export policy (cleanup after the test)

The focused comparison for the operations that trigger job polling showed:

| Operation | Exponential backoff rolled-up avg | return_timeout=120 avg | Difference |
| --- | --- | --- | --- |
| Volume create | 3.86s | 1.02s | -2.84s |
| Volume mount | 1.98s | 0.53s | -1.44s |
| Apply export policy | 1.12s | 0.17s | -0.94s |
| Total | 6.95s | 1.72s | -5.23s |

The rolled-up exponential backoff time includes the initial async REST call, all job polling calls, job details lookup, and sleep time between polls. The return_timeout=120 time is the single REST operation latency.

This suggests Trident could reduce provisioning latency and REST API chatter by preferring return_timeout for supported ONTAP REST operations, with existing job polling retained as a fallback when a job does not complete within the return timeout.
