feat: add per-context replay status tracking by zhongkechen · Pull Request #393 · aws/aws-durable-execution-sdk-python

zhongkechen · 2026-05-13T21:44:37Z

Issue #, if available:

Fixes #389

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

ayushiahjolia · 2026-05-27T20:45:18Z

+        )
+
+    def peek_next_step_id(self):
+        next_step = self._operation_counter.get_current() + 1


Do we need to look at _virtual_operation_counter?

This method is used to determine the replay status. _virtual_operation_counter is not used for that.

If the next operation is a virtual operation, what would self._operation_counter.get_current() + 1 return?

It returns the ID of the next "real" operation.
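The counter semantics discussed here can be sketched as follows. This is a minimal, hypothetical model, not the SDK's actual generator. It assumes that real and virtual operations draw from independent counters and that peek_next_step_id() reads only the real counter, so a pending virtual child context does not change the peeked ID.

```python
class OperationIdGenerator:
    """Hypothetical sketch: real and virtual operations use separate counters."""

    def __init__(self) -> None:
        self._operation_counter = 0          # real (checkpointed) operations
        self._virtual_operation_counter = 0  # virtual child contexts

    def create_step_id(self, is_virtual: bool = False) -> str:
        # Advances exactly one of the two counters.
        if is_virtual:
            self._virtual_operation_counter += 1
            return self._format(self._virtual_operation_counter, is_virtual=True)
        self._operation_counter += 1
        return self._format(self._operation_counter, is_virtual=False)

    def peek_next_step_id(self) -> str:
        # Read-only look-ahead: the ID the next *real* operation will receive.
        # Virtual operations are ignored because replay status is decided by
        # checkpointed (real) operations only.
        return self._format(self._operation_counter + 1, is_virtual=False)

    @staticmethod
    def _format(step: int, is_virtual: bool) -> str:
        return f"v-{step}" if is_virtual else str(step)


gen = OperationIdGenerator()
gen.create_step_id(is_virtual=True)  # "v-1"
gen.peek_next_step_id()              # "1": the virtual op did not advance it
gen.create_step_id()                 # "1"
gen.peek_next_step_id()              # "2"
```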

ayushiahjolia · 2026-05-27T20:46:32Z

+        This allows us to recover operation ids or even look
+        forward without changing the internal state of this context.
+        """
+        parts = [self._step_id_prefix, "v" if is_virtual else None, step]


Isn't this a breaking change?

How the SDK generates operation IDs is internal to the SDK. Users only see a hash value.

I think we still want deterministic IDs AND unique IDs for all virtual operations. With the current proposal, if there are multiple virtual child contexts between "real" steps, wouldn't they all have the same ID?

[UPDATE] Sorry, I misunderstood the counter. Currently, the operation IDs in a user context that contains virtual child contexts (pre-hashing) would look like the following:

v-1 v-2 1 2 v-3 3-v-1

The only thing that bugs me here is that the virtual IDs don't preserve order within the context as a whole, but I guess it doesn't really matter in Python since it's all hashed anyway.
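The ID layout in the update can be reproduced with a small, hypothetical generator. It assumes that the parts list from the diff is joined with "-" (inferred from the listing above), that a child context uses its parent operation's ID as its prefix, and that the final ID is a SHA-256 hex digest. The 64-hex-character ID asserted in the test diff is consistent with SHA-256, but the hash function is not confirmed in this thread.

```python
import hashlib
from typing import List, Optional


class StepIdGenerator:
    """Hypothetical sketch of per-context step IDs before hashing."""

    def __init__(self, prefix: Optional[str] = None) -> None:
        self._prefix = prefix
        self._real = 0
        self._virtual = 0

    def next_id(self, is_virtual: bool = False) -> str:
        if is_virtual:
            self._virtual += 1
            step = self._virtual
        else:
            self._real += 1
            step = self._real
        # Mirrors parts = [prefix, "v" if is_virtual else None, step] from the diff.
        parts = [self._prefix, "v" if is_virtual else None, str(step)]
        return "-".join(p for p in parts if p is not None)

    def child(self, parent_step_id: str) -> "StepIdGenerator":
        # Assumption: a child context scopes its IDs under the parent op's ID.
        return StepIdGenerator(prefix=parent_step_id)


def hashed(step_id: str) -> str:
    # Assumption: SHA-256 hex digest (64 hex characters).
    return hashlib.sha256(step_id.encode("utf-8")).hexdigest()


root = StepIdGenerator()
ids: List[str] = [
    root.next_id(is_virtual=True),  # v-1
    root.next_id(is_virtual=True),  # v-2
    root.next_id(),                 # 1
    root.next_id(),                 # 2
    root.next_id(is_virtual=True),  # v-3
]
op3 = root.next_id()                # 3: a real operation that owns a child context
ids.append(root.child(op3).next_id(is_virtual=True))  # 3-v-1
```

Because every ID is hashed before users see it, only determinism and uniqueness matter; the interleaving of the virtual and real counters is not observable after hashing.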

SilanHe · 2026-05-28T17:05:11Z

+        ).increment()
+        return self.create_step_id_for_logical_step(new_counter, is_virtual=is_virtual)
+
+    def create_step_id_for_logical_step(self, step: int, is_virtual: bool) -> str:


nit: personally I think it would have been better to keep the virtual child context id tracking in a separate PR

SilanHe · 2026-05-28T17:26:47Z

-    executor_context._create_step_id_for_logical_step = lambda *args: "1"  # noqa SLF001
+    executor_context._operation_id_generator.create_step_id_for_logical_step = (
+        lambda *args: "1"
+    )  # noqa SLF001


nit: interesting pattern. Since it looks like we have to override the linter, I wonder if there are other (preferred) testing patterns.
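One way to avoid the linter override is to inject the ID generator instead of assigning to an attribute reached through a private member of the context. The sketch below uses hypothetical ExecutorContext, OperationIdGenerator, and FixedIdGenerator classes, not the SDK's real ones. It also shows unittest.mock.patch.object applied to the generator class, which replaces a public method and so never touches a private attribute from the test.

```python
from typing import Optional
from unittest.mock import patch


class OperationIdGenerator:
    """Hypothetical generator: hands out sequential step IDs."""

    def __init__(self) -> None:
        self._counter = 0

    def create_step_id(self) -> str:
        self._counter += 1
        return str(self._counter)


class ExecutorContext:
    """Hypothetical context that accepts its ID generator as a dependency."""

    def __init__(self, id_generator: Optional[OperationIdGenerator] = None) -> None:
        self._operation_id_generator = id_generator or OperationIdGenerator()

    def next_operation_id(self) -> str:
        return self._operation_id_generator.create_step_id()


class FixedIdGenerator(OperationIdGenerator):
    """Test double equivalent to `lambda *args: "1"`, with no private access."""

    def create_step_id(self) -> str:
        return "1"


# Option 1: constructor injection.
ctx = ExecutorContext(id_generator=FixedIdGenerator())
assert ctx.next_operation_id() == "1"

# Option 2: patch the public method on the class for the duration of a test.
with patch.object(OperationIdGenerator, "create_step_id", return_value="1"):
    assert ExecutorContext().next_operation_id() == "1"
```

Injection keeps the test independent of the context's internal attribute names. Patching the class is less invasive when the constructor cannot change.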

SilanHe · 2026-05-28T17:30:10Z

+    mock_state.get_checkpoint_result.assert_called_with(
+        "1ced8f5be2db23a6513eba4d819c73806424748a7bc6fa0d792cc1c7d1775a97"
+    )



nit: are we missing a new test asserting the behaviour of _track_replay() when the next operation is a virtual child context?

zhongkechen · 2026-05-29T01:08:15Z

Abandoned in favor of a backend enhancement that helps the SDK track which operations were updated while the execution was suspended. That would be better than any heuristic the SDK could use to guess it.

yaythomas · 2026-05-29T00:19:55Z

            ),
            context_logger=self.logger,
        )
+        self._track_replay()


_track_replay() runs after create_step_id() increments the counter, so peek_next_step_id resolves to N+1 instead of N. The check asks "is the op after the current one checkpointed?" rather than "is the current one checkpointed?". The JS SDK deliberately uses the opposite order (checkAndUpdateReplayMode runs before createStepId).

The same pattern applies to the other operations.

This will result in is_replaying being incorrect during the last replayed op's processing window.

No customer-observable effect today because the only internal reader of the flag is Logger._should_log.
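The off-by-one can be reproduced with a tiny, hypothetical model in which operations 1 and 2 are checkpointed and operation 3 is new. The model assumes is_replaying starts as True whenever checkpoints exist and flips to False once the peeked ID is not checkpointed; it is not the SDK's code.

```python
from typing import List, Set


def simulate(checkpointed: Set[int], n_ops: int, track_before_increment: bool) -> List[bool]:
    """Hypothetical model: is_replaying as observed while each operation runs."""
    counter = 0
    is_replaying = bool(checkpointed)
    observed: List[bool] = []

    def track_replay() -> None:
        nonlocal is_replaying
        # peek_next_step_id(): the ID the next create_step_id() call would return.
        if is_replaying and (counter + 1) not in checkpointed:
            is_replaying = False

    for _ in range(n_ops):
        if track_before_increment:
            track_replay()  # peeks N (the op about to run), like the JS SDK order
            counter += 1    # create_step_id()
        else:
            counter += 1    # create_step_id()
            track_replay()  # peeks N + 1 (the op after this one)
        observed.append(is_replaying)
    return observed


after = simulate({1, 2}, 3, track_before_increment=False)
before = simulate({1, 2}, 3, track_before_increment=True)
```

With checkpoints {1, 2}, tracking after the increment yields [True, False, False], so operation 2 (the last replayed one) already runs with is_replaying false. Tracking before the increment yields [True, True, False].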

yaythomas · 2026-05-29T00:19:55Z


    def _should_log(self) -> bool:
-        return not self._execution_state.is_replaying()
+        return not self._context.is_replaying


Logger holds a DurableContext reference, and with_log_info (line 83) preserves it across Logger.from_log_info(..., context=self._context).

When create_child_context builds a child logger via parent_logger.with_log_info(...), the child's _context is permanently the parent. _should_log reads not parent.is_replaying, never the child's own flag.

This means the PR doesn't fix what it intends to: it keeps the broken behaviour where FLAT parallel/map branches log identically to each other, gated on the parent context's flag. Previously, this happened via the global state.is_replaying().
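The logger issue can be illustrated with simplified, hypothetical Context and Logger classes. Here with_log_info() mirrors the behaviour described in the review (the derived logger keeps the parent's context), while bound_to() is a hypothetical alternative, not an SDK method, that rebinds the derived logger to the child context so _should_log() reads the child's own flag.

```python
from typing import Any, Dict, Optional


class Context:
    """Hypothetical stand-in for DurableContext with a per-context replay flag."""

    def __init__(self, name: str, is_replaying: bool) -> None:
        self.name = name
        self.is_replaying = is_replaying


class Logger:
    def __init__(self, context: Context, info: Optional[Dict[str, Any]] = None) -> None:
        self._context = context
        self._info = dict(info or {})

    def with_log_info(self, **info: Any) -> "Logger":
        # Pattern described in the review: the derived logger keeps the parent context.
        return Logger(self._context, {**self._info, **info})

    def bound_to(self, context: Context, **info: Any) -> "Logger":
        # Hypothetical fix: bind the derived logger to the child context.
        return Logger(context, {**self._info, **info})

    def _should_log(self) -> bool:
        return not self._context.is_replaying


parent = Context("parent", is_replaying=False)
branch = Context("branch-0", is_replaying=True)  # this branch is still replaying
parent_logger = Logger(parent)

stale = parent_logger.with_log_info(branch="0")
fixed = parent_logger.bound_to(branch, branch="0")
stale._should_log()  # True: gated on the parent's flag, so replayed logs leak
fixed._should_log()  # False: gated on the branch's own flag
```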

yaythomas · 2026-05-29T00:19:55Z

-                    self._replay_status = ReplayStatus.NEW
+    def transition_replay_status(self) -> None:
+        """Transition to NEW status"""
+        if self._replay_status is ReplayStatus.REPLAY:


Shouldn't this be under a lock? Two threads can both pass the if, both serially acquire the lock, both log "Transitioning…", both reassign.

Since _track_replay is now called per context, per operation, this could also happen more frequently.
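A check-then-act race like this is usually closed by re-checking the status after acquiring the lock. The sketch below is a hypothetical stand-in for the method in the diff. It assumes a threading.Lock owned by the same object, and the transitions counter stands in for the "Transitioning…" log line.

```python
import threading
from enum import Enum


class ReplayStatus(Enum):
    REPLAY = "replay"
    NEW = "new"


class ReplayState:
    """Hypothetical holder of a per-context replay status."""

    def __init__(self) -> None:
        self._replay_status = ReplayStatus.REPLAY
        self._lock = threading.Lock()
        self.transitions = 0  # counts how many times the transition body ran

    def transition_replay_status(self) -> None:
        """Transition to NEW status exactly once, even with concurrent callers."""
        if self._replay_status is not ReplayStatus.REPLAY:
            return  # unlocked fast path only; the decision is made under the lock
        with self._lock:
            # Re-check: another thread may have transitioned while we waited.
            if self._replay_status is ReplayStatus.REPLAY:
                self.transitions += 1  # stands in for logging "Transitioning…"
                self._replay_status = ReplayStatus.NEW

    @property
    def status(self) -> ReplayStatus:
        return self._replay_status
```

Without the second check, every thread that passed the first if would log and reassign after acquiring the lock in turn, which is exactly the scenario described above.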

zhongkechen force-pushed the replay branch from dd926ab to a15da20 Compare May 13, 2026 21:45

zhongkechen self-assigned this May 16, 2026

zhongkechen marked this pull request as ready for review May 16, 2026 04:52

yaythomas force-pushed the replay branch from a15da20 to 7035d3b Compare May 20, 2026 04:49

ayushiahjolia previously approved these changes May 25, 2026

SilanHe reviewed May 25, 2026

Comment thread src/aws_durable_execution_sdk_python/context.py

zhongkechen dismissed ayushiahjolia’s stale review via 5a77d0f May 26, 2026 17:40

zhongkechen force-pushed the replay branch from 6c24d73 to 5a77d0f Compare May 26, 2026 17:40

feat: add per-context replay status tracking

8d9205f

zhongkechen force-pushed the replay branch from 5a77d0f to 8d9205f Compare May 26, 2026 18:11

zhongkechen commented May 26, 2026

Comment thread src/aws_durable_execution_sdk_python/context.py

zhongkechen force-pushed the replay branch from 848e6ab to 93ac768 Compare May 27, 2026 00:09

fix: fix operation counter for virtual contexts

549db29

zhongkechen force-pushed the replay branch from 93ac768 to 549db29 Compare May 27, 2026 00:10

ayushiahjolia reviewed May 27, 2026

SilanHe reviewed May 28, 2026

ayushiahjolia mentioned this pull request May 28, 2026

Clarify mode management step ID lookup - current vs next operation aws/aws-durable-execution-sdk-js#586

Open

zhongkechen closed this May 29, 2026

yaythomas reviewed May 29, 2026

zhongkechen deleted the replay branch May 29, 2026 01:34


4 participants
