Problem description
The SDK's Wait* handlers call SetSleepBeforeWait internally with hardcoded values that callers can't override. When STACKIT's control plane is slower than that hardcoded sleep — fairly common for postgresflex instance creation and iaas server creation — the first poll fires before the resource is queryable, the API returns 404, and the wait handler treats this as a fatal error instead of "not yet visible."
The 404 then propagates straight up through terraform-provider-stackit / pulumi-stackit as a failed apply:
error: Error creating instance: Instance creation waiting: 404 Not Found, status code 404, Body:
{"message":"Requested instance with ID: d4a2c1eb-a696-4b3d-b571-a25dd8c11002 cannot be found","code":404,"type":"NotFound"}
This leaves inconsistent state — the instance is actually being created on the STACKIT side, but the IaC tool thinks it failed. Manual cleanup or a terraform apply retry is needed, and on retry the apply usually succeeds because the timing happens to fall on the right side of the sleep window.
This is a known recurring issue — #314 attempted to fix it for postgresflex by bumping the hardcoded sleep, but the value still isn't sufficient in all cases (see attached error from a recent run).
Proposed solution
Two complementary changes:
1. Make the values passed to SetSleepBeforeWait (and SetThrottle, SetTimeout) overridable by callers, uniformly across all wait handlers — postgresflex, iaas, loadbalancer, etc. The handler API already supports this; what's missing is exposing it so consumers can override the SDK-internal defaults without forking:
waiter := postgresflex.NewAPIClient(...).
CreateInstanceWaitHandler(ctx, projectId, instanceId)
// override the SDK-set defaults
waiter.SetSleepBeforeWait(60 * time.Second)
waiter.SetThrottle(15 * time.Second)
waiter.SetTimeout(45 * time.Minute)
_, err := waiter.WaitWithContext(ctx)
2. Treat 404 immediately after a successful create as a transient "not yet visible" state, not as a fatal error. The current behavior is fragile by design — any sleep value, no matter how generous, will occasionally lose the race. A short retry window (e.g. tolerate 404s for the first N seconds / M attempts after creation) would make the handlers robust regardless of how the sleep is tuned.
Additional information
A code search for SetSleepBeforeWait across the repo shows the same hardcoded-sleep pattern in most service modules, so a fix should probably be applied uniformly rather than service by service. Happy to open a PR once there's agreement on the API surface. The post-create 404 tolerance feels like the more impactful of the two changes, since it removes the race entirely instead of only widening the sleep window.