fix: provider state race by toddbaert · Pull Request #380 · open-feature/spec

toddbaert · 2026-04-23T11:51:01Z

moves "provider status" back into the provider interface
clarifies some ordering and atomicity
adds "migration" appendix describing the proposed migration strategy

For background, see: #365

PoCs for the migration strategy (not the actual spec changes) by @toddbaert here:

Go: open-feature/go-sdk#487
Java: open-feature/java-sdk#1892
JS: open-feature/js-sdk#1362

Fixes: #365

If/when this is merged, I will create issues for every implementation (thought I think Kotlin/Swift might already be corrected).

⚠️ IMO this is superseded by #385

Signed-off-by: Todd Baert <todd.baert@dynatrace.com>

gemini-code-assist

Code Review

This pull request updates the OpenFeature specification to shift provider status ownership from the SDK to the provider itself, addressing race conditions in multi-threaded environments by ensuring status updates and event emissions are atomic. Key changes include the addition of Section 2.8 (Provider status), updates to lifecycle and event requirements to reflect delegation to the provider, and a new migration guide in Appendix E. Review feedback correctly identifies that Requirement 1.7.2.1 is inconsistent with other sections, as it omits the "FATAL" status and uses non-normative language instead of the required "MUST" keyword.

Signed-off-by: Todd Baert <todd.baert@dynatrace.com>

dd-oleksii · 2026-04-28T16:30:43Z

+> Status changes and any associated event emissions **MUST** be atomic from the perspective of external observers.
+
+When a provider transitions between statuses and emits an event associated with that transition, external observers (such as SDK event handlers) must observe a consistent view: the updated `status` value and the emitted event are visible together.
+This prevents ordering anomalies where, for example, a `PROVIDER_READY` handler runs while `status` still indicates `NOT_READY` or `ERROR`, or where the provider transitions out of a status before the associated event is dispatched.


"Dispatch" may be a big ambiguous here:

or where the provider transitions out of a status before the associated event is dispatched

Is the intent here that event handlers see the exact status that triggered the event? If so, this may be problematic as it requires holding the status lock while all handlers run, preventing the provider from changing its own status.

I think what we're after here is establishing an observable "happens-before" relationship:

Status change must happen before the corresponding event handlers run.

Event handlers must only be run after all event handlers for the previous event has finished.

So an event handler may observe next status changes, the important bit is that it must observe the status change that triggered the event.

I will remove this line, I think.

I have some bigger concerns though.

this may be problematic as it requires holding the status lock while all handlers run

I think we would also need some sort of lock to :

Event handlers must only be run after all event handlers for the previous event has finished.

As far as I can tell, none of the SDKs currently do this. I'm a bit hesitant to add this though, since it means a misbehaving or blocked handler could block subsequent event emissions.

Possible relates to: #380 (comment)

I think we would also need some sort of lock to :

Event handlers must only be run after all event handlers for the previous event has finished.

As far as I can tell, none of the SDKs currently do this. I'm a bit hesitant to add this though, since it means a misbehaving or blocked handler could block subsequent event emissions.

Given that all event handlers are synchronous and most SDKs use either a queue or direct event firing, I think this is actually the default behavior we have today?

Perhaps relaxing this a bit we can have: for each event handler, it should be fired in the order of events happening and each invocation should wait for the previous invocation to finish (this corresponds to a per-subscriber queue). If we relax it more, that would allow the publisher to either publish events out-of-order or concurrently, and subscriber would have no way to establish the real order — not very great.

fwiw, I think we can drop this from the spec — SDKs should be able to figure out what's appropriate for each language

Moot in the alternative approach: #380 (comment)

dd-oleksii · 2026-04-28T16:57:35Z

+
+- Call `initialize()`, `shutdown()`, and `on context change` lifecycle methods on the provider
+- Forward provider-emitted events to registered domain and API-level event handlers
+- Run late-attached handlers immediately if the provider is already in the associated state


major: I believe SDKs can't do this safely as this requires that subscription and emission of the current status is atomic (there should be no new events emitted between reading current status and running the handler) — and you can't guarantee that when another thread is emitting events (unless SDK implements event buffering which is cumbersome).

I agree this is a real concern. With provider-owned state, I don't see how the SDK can safely guarantee atomic late-attached handler execution or short-circuit evaluation without adding significant complexity (e.g. shared locks between the SDK and provider, or subscription-awareness in the provider).

That said, this is existing behavior that's important for usability; users need to be able to attach PROVIDER_READY handlers after initialization and still have them fire. Removing it would be a significant regression in developer experience.

Your suggestion about an event queue is interesting; if the SDK serializes event dispatch through a queue, late-attached handler checks could go through the same queue, which might also address concerns about event handler interleaving. Maybe we need this even though it will be heavy?

I've taken a different approach: #380 (comment)

Swift SDK is actually doing this events/subscription serialization with a lock in ProviderStatusTracker. Actual event handlers are run on a separate (per-subscriber) dispatch queue to prevent event handlers from locking the provider.

Thinking about this more, I believe we're confusing "event observing" and "state observing" and this adds a lot of accidental complexity.

Many languages have separate tools for a regular channel and a channel that remembers the last value:

Swift: PassthroughSubject vs CurrentValueSubject

Kotlin: SharedFlow vs StateFlow

Rust: broadcast vs watch channels

The issue is that we want a weird mix of the two: we want it to remember the last state... except "not ready" status because it doesn't have a corresponding event to re-emit)... and don't remember "configuration changed" event (because it doesn't change status)... and "context changed" should be remapped to "ready" (but only if it's a replay and not an online observation).

This leads to a couple of issues. The obvious one is that the behavior is convoluted and is complicated to implement. In MultiProvider, we get a mismatch between emitted events and actual provider state (the spec requires MultiProvider to re-emit all events from the underlying providers, but underlying provider going into fatal state does not mean that the whole MultiProvider does).

And the last one is that waiting for PROVIDER_READY is already an unreliable way to wait for "ready" status. If provider is currently in reconciling status, most SDKs don't emit PROVIDER_READY after reconciliation is done. (This can easily happen if user does api.setProvider(...); api.setContext(...); api.addEventHandler(PROVIDER_READY, ...);)

I think a much cleaner API is to add "status observers" that remember the last status and are guaranteed to fire when the provider switches to the corresponding status (i.e. a successful reconciliation should invoke ready handler). And keep event observers without this last-value memory

We can even keep this backward-compatible for SDKs that have subscription topics (JS, Go) by saying that PROVIDER_READY/PROVIDER_STALE/PROVIDER_ERROR/PROVIDER_RECONCILING are status observers. This would make PROVIDER_CONFIGURATION_CHANGED and PROVIDER_CONTEXT_CHANGED the only true events (which is kinda already true today).

An alternative take is that this issue only exists for languages that have these native tools for broadcast and last-state observers (swift, kotlin), so SDKs decided to expose native observers instead of named event handlers as the spec describes. So now these SDKs are fighting with synchronizing these subjects/flows. As they are already deviating from the spec, perhaps they can deviate further to implement status subscription in addition to events (or alternatively align with the spec and implement named event handlers instead of subject/flow — though this is less idiomatic and will be more cumbersome for users)

cc @nicklasl @fabriziodemaria @typotter for your thoughts as this is a pain point in kotlin and swift

I haven't had time to fully consider this, but I think it's close to what I've proposed in my other pr: #385

It doesn't mention the "BehaviorSubject" construct exactly, but describes something basically similar. I agree that we could use those abstractions though.

dd-oleksii · 2026-04-28T16:58:01Z

+- Call `initialize()`, `shutdown()`, and `on context change` lifecycle methods on the provider
+- Forward provider-emitted events to registered domain and API-level event handlers
+- Run late-attached handlers immediately if the provider is already in the associated state
+- Enforce short-circuit behavior for `NOT_READY` and `FATAL` statuses during flag evaluation


major: multi-threaded SDKs can't enforce this because of TOCTOU

See #380 (comment)

dd-oleksii · 2026-04-28T17:06:40Z

+At registration time, check whether the provider implements the `StateManagingProvider` interface (or equivalent).
+Store this as a flag on the internal provider wrapper for use during lifecycle calls and event handling.
+
+#### SDK wrapper behavior


minor/aside: I'd argue that the better design is implementing an adapter that takes a non-StateManagingProvider and implements the StateManagingProvider interface. This way, all special-casing/mapping is concentrated in one place, and it is about the only thing that needs to be deleted in the next major release.

IIUC this is essentially what the PoCs already do; FeatureProviderStateManager in Java and providerReference in Go are effectively that adapter/wrapper. They wrap legacy providers and handle state management on their behalf, so downstream SDK code doesn't need to special-case. At the next major version, the wrapper is the only thing that gets deleted.

Or am I missing something?

I was referring to using adapter pattern to adapt legacy interface to the new one. Strictly speaking, none of the PoCs do that.

Kotlin PR is an example.:LegacyFeatureProviderAdapter wraps a non-state-managing provider and implement state-managing interface (manage status, emits events, etc.). It nicely contains/abstracts the knowledge of non-managing providers, and the rest of openfeature sdk handles all providers equally as if they can manage their state.

This also reduces the cyclomatic complexity overall: the adapter knows the wrapped provider doesn't manage status, so it doesn't need any ifs. The SDK can treat all providers as state-managing, so it doesn't need any ifs either. The only if is in setProvider to determine whether wrapping is needed.

In other PoCs, the provider wrapper now has a double-duty of whatever-it-was-doing-before + adapting legacy providers, so it's riddled with ifs. And it still doesn't contain the adapting completely and this part doesn't hold:

At the next major version, the wrapper is the only thing that gets deleted.

In JS, OpenFeatureCommonAPI reaches inside with wrappedProvider.delegateManagesState to determine whether to emit events or not. In Java, ProviderRepository reaches in with newManager.delegateManagesState. Go is doing the checks in both event executor and openfeature API. So we'd need to modify these components as well.

It's not a huge deal though, especially for a temporary migration. I find the proper adapter slightly cleaner but it is more work than throwing a bunch of if's in

Ah, OK, you were just talking about encapsulating this logic more elegantly then? If that's the case I agree. The PoCs were just there to prove the flow was possible, not necessarily in the cleanest way.

Co-authored-by: Oleksii Shmalko <oleksii.shmalko@datadoghq.com> Signed-off-by: Todd Baert <todd.baert@dynatrace.com>

Signed-off-by: Todd Baert <todd.baert@dynatrace.com>

toddbaert · 2026-05-18T16:08:20Z

I kept running into TOCTOU issues with the provider state, and no matter what abstractions I used to solve it, synthetic events were a problem. Rather than adding state back to the provider, this proposal keeps the state in the SDK but removes all the synthetic event emission, which I believe was the real source of the issue. The SDK state is fully derived from the provider's event emission. The only remaining "synthetic" behavior is running a handler immediately when it's registered and the provider is already in the matching state (e.g. attaching a READY handler when the provider is already READY). This is less problematic since it doesn't involve an actual state transition; it's just replaying the current status to a late subscriber.

Marking this one as draft for now. cc @nicklasl @dd-oleksii

toddbaert added 2 commits April 23, 2026 07:43

fix: provider state race

fa0b999

Signed-off-by: Todd Baert <todd.baert@dynatrace.com>

fixup: break up migration

fe60ffc

Signed-off-by: Todd Baert <todd.baert@dynatrace.com>

toddbaert requested a review from a team as a code owner April 23, 2026 11:51

gemini-code-assist Bot reviewed Apr 23, 2026

View reviewed changes

Comment thread specification.json Outdated

Comment thread specification/sections/01-flag-evaluation.md Outdated

fixup: keywords and parser

8f1228f

Signed-off-by: Todd Baert <todd.baert@dynatrace.com>

toddbaert requested review from aepfli, beeme1mr, cupofcat, dd-oleksii, erka, federicobond, jonathannorris, kinyoklion, lukas-reining, moredip and nicklasl April 23, 2026 12:27

dd-oleksii reviewed Apr 28, 2026

View reviewed changes

nicklasl reviewed May 12, 2026

View reviewed changes

Comment thread specification/sections/01-flag-evaluation.md Outdated

toddbaert and others added 3 commits May 12, 2026 15:34

Update specification/sections/01-flag-evaluation.md

630ef55

Co-authored-by: Oleksii Shmalko <oleksii.shmalko@datadoghq.com> Signed-off-by: Todd Baert <todd.baert@dynatrace.com>

fixup: wording

6d70ac9

Signed-off-by: Todd Baert <todd.baert@dynatrace.com>

fixup: remove redundant line

885c67f

Signed-off-by: Todd Baert <todd.baert@dynatrace.com>

toddbaert requested a review from nicklasl May 17, 2026 19:19

toddbaert mentioned this pull request May 18, 2026

fix: provider state ownership via events #385

Open

toddbaert marked this pull request as draft May 18, 2026 16:11

Conversation

toddbaert commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

toddbaert May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

toddbaert May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

toddbaert commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

toddbaert commented Apr 23, 2026 •

edited

Loading

toddbaert May 17, 2026 •

edited

Loading

toddbaert May 21, 2026 •

edited

Loading

toddbaert commented May 18, 2026 •

edited

Loading