Skip to content

feat(workflow-engine): add retryOnTimeout opt-in for step timeout retries#5089

Merged
abcxff merged 1 commit into
mainfrom
05-21-feat_workflow-engine_add_retryontimeout_opt-in_for_step_timeout_retries
Jun 3, 2026
Merged

feat(workflow-engine): add retryOnTimeout opt-in for step timeout retries#5089
abcxff merged 1 commit into
mainfrom
05-21-feat_workflow-engine_add_retryontimeout_opt-in_for_step_timeout_retries

Conversation

@abcxff
Copy link
Copy Markdown
Contributor

@abcxff abcxff commented May 21, 2026

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5089 May 21, 2026 19:45 Destroyed
@railway-app
Copy link
Copy Markdown

railway-app Bot commented May 21, 2026

🚅 Deployed to the rivet-pr-5089 environment in rivet-frontend

Service Status Web Updated (UTC)
frontend-inspector ❌ Build Failed (View Logs) Web Jun 3, 2026 at 1:08 am
kitchen-sink ❌ Build Failed (View Logs) Web Jun 3, 2026 at 1:07 am
frontend-cloud 🕒 Building (View Logs) Web Jun 3, 2026 at 1:07 am
website 😴 Sleeping (View Logs) Web Jun 2, 2026 at 12:25 am
ladle ✅ Success (View Logs) Web May 21, 2026 at 7:49 pm
mcp-hub ✅ Success (View Logs) Web May 21, 2026 at 7:47 pm

Copy link
Copy Markdown
Contributor Author

abcxff commented May 21, 2026

@abcxff abcxff mentioned this pull request May 21, 2026
11 tasks
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 21, 2026

Review: feat(workflow-engine): add retryOnTimeout opt-in for step timeout retries

The feature is well-scoped and the core logic change is clean. Three issues worth addressing before merging.


Bug: getTryStepFailureFromExhaustedError hardcodes kind: "exhausted" on the replay path

When a workflow exhausts (via timeout + retryOnTimeout: true), then recover()'d and replayed, the resume path in context.ts:825-838 reaches getTryStepFailureFromExhaustedError (line 240), which always emits kind: "exhausted":

// context.ts:240
function getTryStepFailureFromExhaustedError(...): TryStepFailure {
    return {
        kind: "exhausted",  // always, even if exhaustion was due to a timeout
        ...
    };
}

After recover(), the persisted error is just a string (metadata.error), so the runtime cannot distinguish a timed-out step from a normally-exhausted one. The docs claim:

when retries exhaust on a timeout the try-step failure kind is "timeout"

That is only true on the first exhaust run. Subsequent replays after recover() report "exhausted" instead. Either the docs should carve out the post-recovery case, or getTryStepFailureFromExhaustedError needs a stored error-kind hint. Worth deciding before shipping.


Missing test: tryStep failure kind when retryOnTimeout exhausts

The added test only calls ctx.step(...) and checks StepExhaustedError is thrown. The docs explicitly say the tryStep failure kind will be "timeout" after exhaustion, but there is no test covering it:

// not covered
const result = await ctx.tryStep({
    name: "timeout-step",
    retryOnTimeout: true,
    maxRetries: 2,
    timeout: 5,
    retryBackoffBase: 1,
    retryBackoffMax: 1,
    run: async () => { ... },
});
expect(result.ok).toBe(false);
expect(result.failure.kind).toBe("timeout");  // never asserted

DEFAULT_TRY_STEP_CATCH already includes "timeout", so this is reachable and documented behavior.


Test uses real wall-clock timers with tight values

The new test uses await new Promise(resolve => setTimeout(resolve, 10)) between yield-mode runs, with a 5 ms step timeout and 1 ms backoff. Per CLAUDE.md:

Never paper over flakes with retry loops or bumped waits. Write a deterministic repro using vi.useFakeTimers() or event-ordered Promise resolution.

The existing retry tests at line 179 use the same pattern (consistent with prior art in the file), but the tighter values here (5 ms timeout, 25 ms operation, 10 ms inter-run sleep) are more likely to spuriously fail on slow CI runners. Consider vi.useFakeTimers() or widening the margins.


Nits

The logic at lines 946 and 1013-1015 is correct. The jsdoc and retries.md updates are accurate and follow comment-style conventions. TryStepConfig inheriting retryOnTimeout from StepConfig without extra wiring is clean.

@abcxff abcxff force-pushed the 05-21-feat_workflow-engine_add_retryontimeout_opt-in_for_step_timeout_retries branch from f7b5398 to a7e1162 Compare May 22, 2026 04:17
@abcxff abcxff force-pushed the 05-18-fix_rivetkit_wire_cancellationtoken_through_waitfornamesavailable_to_eliminate_kv_busy-polling branch from 9dc7e20 to 9a50acb Compare May 22, 2026 04:17
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5089 May 22, 2026 04:17 Destroyed
@abcxff abcxff mentioned this pull request May 22, 2026
11 tasks
@abcxff abcxff force-pushed the 05-21-feat_workflow-engine_add_retryontimeout_opt-in_for_step_timeout_retries branch from a7e1162 to fccc091 Compare May 22, 2026 05:08
@abcxff abcxff force-pushed the 05-18-fix_rivetkit_wire_cancellationtoken_through_waitfornamesavailable_to_eliminate_kv_busy-polling branch from 9a50acb to 8f91d91 Compare May 22, 2026 05:08
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5089 May 22, 2026 05:08 Destroyed
@abcxff abcxff marked this pull request as ready for review June 1, 2026 20:59
@abcxff abcxff force-pushed the 05-18-fix_rivetkit_wire_cancellationtoken_through_waitfornamesavailable_to_eliminate_kv_busy-polling branch from 8f91d91 to f83acdc Compare June 2, 2026 20:09
@abcxff abcxff force-pushed the 05-21-feat_workflow-engine_add_retryontimeout_opt-in_for_step_timeout_retries branch from fccc091 to ea20605 Compare June 2, 2026 20:09
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5089 June 2, 2026 20:09 Destroyed
Copy link
Copy Markdown
Contributor Author

abcxff commented Jun 3, 2026

Merge activity

  • Jun 3, 1:04 AM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Jun 3, 1:08 AM UTC: Graphite rebased this pull request as part of a merge.
  • Jun 3, 1:08 AM UTC: @abcxff merged this pull request with Graphite.

@abcxff abcxff changed the base branch from 05-18-fix_rivetkit_wire_cancellationtoken_through_waitfornamesavailable_to_eliminate_kv_busy-polling to graphite-base/5089 June 3, 2026 01:05
@abcxff abcxff changed the base branch from graphite-base/5089 to main June 3, 2026 01:06
@abcxff abcxff force-pushed the 05-21-feat_workflow-engine_add_retryontimeout_opt-in_for_step_timeout_retries branch from ea20605 to e91e64f Compare June 3, 2026 01:07
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5089 June 3, 2026 01:07 Destroyed
@abcxff abcxff merged commit c2073af into main Jun 3, 2026
11 of 18 checks passed
@abcxff abcxff deleted the 05-21-feat_workflow-engine_add_retryontimeout_opt-in_for_step_timeout_retries branch June 3, 2026 01:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant