Wait handlers: hardcoded sleeps cause spurious 404s on postgresflex/iaas creation #7165

@qaiser42

Problem description

The SDK's Wait* handlers call SetSleepBeforeWait internally with hardcoded values that callers can't override. When STACKIT's control plane is slower than that hardcoded sleep — fairly common for postgresflex instance creation and iaas server creation — the first poll fires before the resource is queryable, the API returns 404, and the wait handler treats this as a fatal error instead of "not yet visible."

The 404 then propagates straight up through terraform-provider-stackit / pulumi-stackit as a failed apply:

error: Error creating instance: Instance creation waiting: 404 Not Found, status code 404, Body: 
{"message":"Requested instance with ID: d4a2c1eb-a696-4b3d-b571-a25dd8c11002 cannot be found","code":404,"type":"NotFound"}

This leaves inconsistent state — the instance is actually being created on the STACKIT side, but the IaC tool thinks it failed. Manual cleanup or a terraform apply retry is needed, and on retry the apply usually succeeds because the timing happens to fall on the right side of the sleep window.

This is a known recurring issue — #314 attempted to fix it for postgresflex by bumping the hardcoded sleep, but the value still isn't sufficient in all cases (see attached error from a recent run).

Proposed solution

Two complementary changes:

1. Make the values passed to SetSleepBeforeWait (and SetThrottle, SetTimeout) overridable by callers, uniformly across all wait handlers (postgresflex, iaas, loadbalancer, etc.). The handler type already provides these setters; what's missing is a way for consumers to have their own values take precedence over the SDK-internal defaults without forking:

waiter := postgresflex.NewAPIClient(...).
    CreateInstanceWaitHandler(ctx, projectId, instanceId)

// override the SDK-set defaults
waiter.SetSleepBeforeWait(60 * time.Second)
waiter.SetThrottle(15 * time.Second)
waiter.SetTimeout(45 * time.Minute)

_, err := waiter.WaitWithContext(ctx)

2. Treat 404 immediately after a successful create as a transient "not yet visible" state, not as a fatal error. The current behavior is fragile by design — any sleep value, no matter how generous, will occasionally lose the race. A short retry window (e.g. tolerate 404s for the first N seconds / M attempts after creation) would make the handlers robust regardless of how the sleep is tuned.

Additional information

A code search for SetSleepBeforeWait across the repo shows the same hardcoded-sleep pattern across most service modules, so a fix should probably be applied uniformly rather than service-by-service. Happy to open a PR once there's agreement on the API surface — the post-create 404 tolerance feels like the more impactful of the two changes, since it removes the race entirely instead of widening it.

Labels: enhancement
