.NET: Add checkpoint on super step started event: issue #4280 #4604

Open
elgold92 wants to merge 8 commits into microsoft:main from elgold92:ericgold/CheckpointOnSuperStepStarted

@elgold92

Motivation and Context

Addresses issue #4280 by allowing workflows to resume from checkpoints saved on SuperStepStarted events.

Description

Adds a CheckpointInfo? field to the SuperStepStartInfo class and populates it in InProcessRunner and InProcStepTracer. Also updates the associated unit tests to expect the additional checkpoints that checkpointed workflows now create.

Eric Gold added 7 commits March 9, 2026 16:53
Copilot AI review requested due to automatic review settings March 10, 2026 20:11
@markwallace-microsoft added the .NET and workflows (Related to Workflows in agent-framework) labels Mar 10, 2026
@github-actions bot changed the title from "Add checkpoint on super step started event: issue #4280" to ".NET: Add checkpoint on super step started event: issue #4280" Mar 10, 2026

Copilot AI left a comment

Pull request overview

Adds support for creating/checkpointing workflow state at the SuperStepStarted boundary so runs can resume from “pre-delivery” checkpoints (addressing #4280), and updates tests accordingly.

Changes:

  • Add a CheckpointInfo? field to SuperStepStartInfo and populate it on SuperStepStartedEvent.
  • Create a checkpoint at the start of each superstep (capturing pre-delivery queued messages) by extending runner state export to accept an override StepContext.
  • Update and expand unit tests to account for the additional checkpoints and validate parent chaining/resume behavior.
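
To make the effect of these changes concrete, here is a minimal sketch of how a consumer might gather checkpoints from both event kinds, using the same pattern-matching style as the tests in this PR. The record types below are simplified stand-ins declared only for this example (assumptions, not the real Microsoft.Agents.AI.Workflows classes), and CheckpointCollector is a hypothetical helper. It assumes C# 12 for collection expressions.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Stand-in types (assumptions): simplified shapes for illustration only,
// not the real Microsoft.Agents.AI.Workflows classes.
public sealed record CheckpointInfo(string CheckpointId);
public sealed record SuperStepStartInfo(CheckpointInfo? Checkpoint);
public sealed record SuperStepCompletionInfo(CheckpointInfo? Checkpoint);
public abstract record WorkflowEvent;
public sealed record SuperStepStartedEvent(SuperStepStartInfo? StartInfo) : WorkflowEvent;
public sealed record SuperStepCompletedEvent(SuperStepCompletionInfo? CompletionInfo) : WorkflowEvent;

// Hypothetical helper: collects every checkpoint attached to a start or completion event.
public static class CheckpointCollector
{
    public static List<CheckpointInfo> Collect(IEnumerable<WorkflowEvent> events)
    {
        List<CheckpointInfo> checkpoints = [];
        foreach (WorkflowEvent evt in events)
        {
            if (evt is SuperStepStartedEvent started && started.StartInfo?.Checkpoint is { } startCp)
            {
                checkpoints.Add(startCp); // checkpoint taken at the superstep-start boundary
            }
            else if (evt is SuperStepCompletedEvent completed && completed.CompletionInfo?.Checkpoint is { } endCp)
            {
                checkpoints.Add(endCp); // checkpoint taken when the superstep completed
            }
        }
        return checkpoints;
    }
}

public static class Program
{
    public static void Main()
    {
        WorkflowEvent[] events =
        [
            new SuperStepStartedEvent(new SuperStepStartInfo(new CheckpointInfo("start-1"))),
            new SuperStepCompletedEvent(new SuperStepCompletionInfo(new CheckpointInfo("end-1"))),
            new SuperStepStartedEvent(new SuperStepStartInfo(null)), // no checkpoint attached
        ];

        foreach (CheckpointInfo cp in CheckpointCollector.Collect(events))
        {
            Console.WriteLine(cp.CheckpointId);
        }
    }
}
```

Events without an attached checkpoint are skipped, which mirrors the nullable CheckpointInfo? field described above.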

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| dotnet/tests/Microsoft.Agents.AI.Workflows.UnitTests/InProcessStateTests.cs | Updates expected checkpoint count due to start+end checkpointing per superstep. |
| dotnet/tests/Microsoft.Agents.AI.Workflows.UnitTests/CheckpointParentTests.cs | Extends tests to include checkpoints emitted on SuperStepStartedEvent and adds new resume/count assertions. |
| dotnet/src/Microsoft.Agents.AI.Workflows/SuperStepStartInfo.cs | Adds Checkpoint property to expose the checkpoint emitted at superstep start. |
| dotnet/src/Microsoft.Agents.AI.Workflows/InProc/InProcessRunnerContext.cs | Allows exporting runner state from an alternate StepContext (pre-delivery snapshot). |
| dotnet/src/Microsoft.Agents.AI.Workflows/InProc/InProcessRunner.cs | Saves a checkpoint before superstep execution and wires it into the started event. |
| dotnet/src/Microsoft.Agents.AI.Workflows/InProc/InProcStepTracer.cs | Plumbs the start-checkpoint into SuperStepStartedEvent payload. |

public bool HasExternalMessages { get; init; }

/// <summary>
/// Gets the <see cref="CheckpointInfo"/> corresponding to the checkpoint restored at the start of this SuperStep, if any.

Copilot AI Mar 10, 2026

The XML doc for SuperStepStartInfo.Checkpoint says this is the checkpoint restored at the start of the SuperStep, but this property is populated from the checkpoint created immediately before raising SuperStepStartedEvent (pre-delivery). Please update the summary to match the actual semantics (and optionally clarify whether its StepNumber corresponds to the previous completed step or the upcoming step).

Suggested change
- /// Gets the <see cref="CheckpointInfo"/> corresponding to the checkpoint restored at the start of this SuperStep, if any.
+ /// Gets the <see cref="CheckpointInfo"/> for the checkpoint created immediately before raising
+ /// the <c>SuperStepStartedEvent</c> for this SuperStep (i.e., the checkpoint taken after the
+ /// previous SuperStep completed). The <see cref="CheckpointInfo.StepNumber"/> therefore
+ /// corresponds to the previously completed SuperStep, not the upcoming one.

Copilot uses AI. Check for mistakes.
Comment on lines 35 to +41
StreamingRun run =
await env.WithCheckpointing(checkpointManager).RunStreamingAsync(workflow, "Hello");

List<CheckpointInfo> checkpoints = [];
await foreach (WorkflowEvent evt in run.WatchStreamAsync())
{
if (evt is SuperStepCompletedEvent stepEvt && stepEvt.CompletionInfo?.Checkpoint is { } cp)
if (evt is SuperStepStartedEvent superStepStartEvt && superStepStartEvt.StartInfo?.Checkpoint is { } startCp)

Copilot AI Mar 10, 2026

StreamingRun is IAsyncDisposable, but this test creates run without await using/disposing it. Disposing the run is important to release workflow ownership and underlying resources; please wrap it in await using or ensure DisposeAsync() is called in a finally block.
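
The disposal pattern requested here can be sketched without the workflow library. In the example below, FakeRun is a hypothetical stand-in for StreamingRun that models only its IAsyncDisposable shape and an event stream; the event names are plain strings chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for StreamingRun; only the IAsyncDisposable shape matters here.
public sealed class FakeRun : IAsyncDisposable
{
    public bool Disposed { get; private set; }

    public async IAsyncEnumerable<string> WatchStreamAsync()
    {
        yield return "SuperStepStarted";
        await Task.Yield();
        yield return "SuperStepCompleted";
    }

    public ValueTask DisposeAsync()
    {
        Disposed = true;
        return ValueTask.CompletedTask;
    }
}

public static class Program
{
    public static async Task<(int Count, bool Disposed)> ConsumeAsync()
    {
        FakeRun captured;
        int count = 0;
        {
            // The using declaration ties disposal to the end of this block.
            await using FakeRun run = new();
            captured = run;
            await foreach (string evt in run.WatchStreamAsync())
            {
                count++;
            }
        } // run.DisposeAsync() is awaited here, even if the loop throws.

        return (count, captured.Disposed);
    }

    public static async Task Main()
    {
        (int count, bool disposed) = await ConsumeAsync();
        Console.WriteLine($"{count} events, disposed: {disposed}"); // prints "2 events, disposed: True"
    }
}
```

Scoping the await using declaration in its own block means disposal happens at the closing brace, so a test that needs to release the first run before resuming does not also need an explicit DisposeAsync() call.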

Comment on lines +223 to +231
List<CheckpointInfo> firstRunCheckpoints = [];
using CancellationTokenSource cts = new();
await foreach (WorkflowEvent evt in run.WatchStreamAsync(cts.Token))
{
if (evt is SuperStepStartedEvent superStepStartEvt && superStepStartEvt.StartInfo?.Checkpoint is { } startCp)
{
firstRunCheckpoints.Add(startCp);
}
}

Copilot AI Mar 10, 2026

In Checkpoint_AfterResumeFromSuperstepStart_CountCheckpointsEmittedAsync, CancellationTokenSource cts is created and passed to WatchStreamAsync, but it is never canceled. Either remove the CTS and use WatchStreamAsync() directly, or cancel once the expected number of checkpoints has been collected to keep the test bounded and avoid potential hangs if the stream doesn’t complete.
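
The second option (cancel once enough items have been collected) might look like the sketch below. EndlessStreamAsync is a hypothetical never-ending stream standing in for WatchStreamAsync(token). How the real stream reacts to cancellation (throwing OperationCanceledException versus completing quietly) is an assumption here, so the catch block should be adjusted to the actual behavior.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

public static class Program
{
    // Hypothetical endless stream; it observes the token and throws once cancellation is requested.
    public static async IAsyncEnumerable<int> EndlessStreamAsync(
        [EnumeratorCancellation] CancellationToken token = default)
    {
        for (int i = 0; ; i++)
        {
            token.ThrowIfCancellationRequested();
            yield return i;
            await Task.Yield();
        }
    }

    public static async Task<List<int>> CollectAsync(int expected)
    {
        List<int> collected = [];
        using CancellationTokenSource cts = new();
        try
        {
            await foreach (int item in EndlessStreamAsync(cts.Token))
            {
                collected.Add(item);
                if (collected.Count == expected)
                {
                    // Enough items gathered: cancel so the loop cannot hang on a stream that never ends.
                    cts.Cancel();
                }
            }
        }
        catch (OperationCanceledException)
        {
            // Expected in this sketch: the stream stops by throwing once the token is canceled.
        }

        return collected;
    }

    public static async Task Main()
    {
        List<int> items = await CollectAsync(3);
        Console.WriteLine(string.Join(", ", items)); // prints "0, 1, 2"
    }
}
```

If the stream is known to complete on its own, dropping the CancellationTokenSource and calling the parameterless overload is the simpler fix.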

Comment on lines +236 to +238
// Dispose the first run to release workflow ownership before resuming.
await run.DisposeAsync();


Copilot AI Mar 10, 2026

run is declared with await using, but later DisposeAsync() is called explicitly as well. Since await using already disposes the run, drop the explicit DisposeAsync() (or remove await using and keep the manual dispose) to avoid redundant disposal and keep the test clearer.

Suggested change
- // Dispose the first run to release workflow ownership before resuming.
- await run.DisposeAsync();

Comment on lines +282 to +290
List<CheckpointInfo> firstRunCheckpoints = [];
using CancellationTokenSource cts = new();
await foreach (WorkflowEvent evt in run.WatchStreamAsync(cts.Token))
{
if (evt is SuperStepCompletedEvent completedEvent && completedEvent.CompletionInfo?.Checkpoint is { } startCp)
{
firstRunCheckpoints.Add(startCp);
}
}

Copilot AI Mar 10, 2026

In Checkpoint_AfterResumeFromSuperstepCompleted_CountCheckpointsEmittedAsync, CancellationTokenSource cts is created and passed to WatchStreamAsync, but it is never canceled. Either remove the CTS or cancel once you’ve collected the expected checkpoints so the test stays bounded if the stream doesn’t complete.

…tParentTests.cs


rename local variable

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 10, 2026 21:07
@elgold92

@copilot open a new pull request to apply changes based on the comments in this thread. Changes look to be generally minor improvements.

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

// Dispose the first run to release workflow ownership before resuming.
await run.DisposeAsync();

// Act: Resume from the first checkpoint

Copilot AI Mar 10, 2026

This comment says "Resume from the first checkpoint", but resumePoint is taken from firstRunCheckpoints[1] (the second checkpoint). Please update the comment to match the code to avoid confusion when reading/fixing these tests.

Suggested change
- // Act: Resume from the first checkpoint
+ // Act: Resume from the second checkpoint

Comment on lines +223 to +226
List<CheckpointInfo> firstRunCheckpoints = [];
using CancellationTokenSource cts = new();
await foreach (WorkflowEvent evt in run.WatchStreamAsync(cts.Token))
{

Copilot AI Mar 10, 2026

This test creates a CancellationTokenSource and passes cts.Token to WatchStreamAsync, but never cancels it (the stream is consumed to completion). Consider removing the CTS and using WatchStreamAsync() directly, or canceling once the needed checkpoints have been collected to keep the test intent clear.

// Dispose the first run to release workflow ownership before resuming.
await run.DisposeAsync();

// Act: Resume from the first checkpoint

Copilot AI Mar 10, 2026

This comment says "Resume from the first checkpoint", but resumePoint is taken from firstRunCheckpoints[1] (the second checkpoint). Please update the comment to match the code to avoid confusion when reading/fixing these tests.

Suggested change
- // Act: Resume from the first checkpoint
+ // Act: Resume from the second checkpoint

Comment on lines +282 to +285
List<CheckpointInfo> firstRunCheckpoints = [];
using CancellationTokenSource cts = new();
await foreach (WorkflowEvent evt in run.WatchStreamAsync(cts.Token))
{

Copilot AI Mar 10, 2026

This test creates a CancellationTokenSource and passes cts.Token to WatchStreamAsync, but never cancels it (the stream is consumed to completion). Consider removing the CTS and using WatchStreamAsync() directly, or canceling once the needed checkpoints have been collected to keep the test intent clear.

Labels

.NET, workflows (Related to Workflows in agent-framework)

3 participants