
feat(agentserver): Add durable long-running agents to azure-ai-agentserver-core#46839

Draft
RaviPidaparthi wants to merge 14 commits into Azure:main from RaviPidaparthi:feature/agentserver-durable-tasks


@RaviPidaparthi
Member

Durable Task Framework for azure-ai-agentserver-core

Adds a crash-resilient durable task system to azure-ai-agentserver-core, enabling hosted agent scenarios that need persistence, retry, and lifecycle management.

Key Features

  • @durable_task decorator — Turns async functions into crash-resilient tasks with full lifecycle (start, run, get, cancel, terminate)
  • TaskResult[Output] — Generic result wrapper with .output, .status, .is_suspended, .suspension_reason
  • Cooperative cancellation: ctx.cancel event + configurable grace period before hard cancellation
  • Configurable timeouts — Per-task execution timeouts with cooperative → hard cancellation flow
  • Retry policies — Fixed, linear, and exponential backoff with max-attempt limits
  • Callable factories: tags, title, description accept Callable[[Any, str], ...] for dynamic per-task values
  • Local in-memory provider — Development/testing provider implementing the TaskStoreProvider protocol
  • Task streaming: AsyncIterator-based streaming with durable checkpointing
  • Lease-based locking — Distributed lock support for concurrent task execution
  • Ephemeral & persistent modes — Auto-cleanup or retain task records after completion
  • Metadata & provenance — Task source tracking and metadata management
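
To illustrate the retry policies listed above, here is a minimal, self-contained sketch of how fixed, linear, and exponential backoff delays are commonly computed. It does not use this PR's API: the names `backoff_delay`, `schedule`, `base_seconds`, and `max_attempts` are hypothetical, and the actual retry policy classes in `azure-ai-agentserver-core` may be shaped differently.

```python
from typing import List


def backoff_delay(strategy: str, attempt: int, base_seconds: float = 1.0) -> float:
    """Return the delay in seconds before retry number `attempt` (1-based).

    Hypothetical helper for illustration; not part of the PR's API.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    if strategy == "fixed":
        # Same delay before every retry.
        return base_seconds
    if strategy == "linear":
        # Delay grows by base_seconds with each retry.
        return base_seconds * attempt
    if strategy == "exponential":
        # Delay doubles with each retry.
        return base_seconds * 2 ** (attempt - 1)
    raise ValueError(f"unknown strategy: {strategy!r}")


def schedule(strategy: str, max_attempts: int, base_seconds: float = 1.0) -> List[float]:
    """Delays between attempts: max_attempts attempts allow max_attempts - 1 retries."""
    return [backoff_delay(strategy, n, base_seconds) for n in range(1, max_attempts)]
```

With `max_attempts=4` and a one-second base, the three strategies yield `[1.0, 1.0, 1.0]`, `[1.0, 2.0, 3.0]`, and `[1.0, 2.0, 4.0]` respectively. Production retry code often adds jitter and a maximum delay cap, which this sketch omits.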

Testing

  • 248 tests across 17 test modules (all passing)
  • Covers: lifecycle, retry, cancellation/timeout, streaming, entry modes, callable factories, TaskResult, metadata, models, decorator, source, resume routing, local provider

Samples & Docs

  • 3 sample applications: durable_retry, durable_source, durable_streaming
  • 2 integration samples: durable_langgraph, durable_multiturn
  • Developer guide: docs/durable-task-developer-guide.md
  • Design specs: 6 spec documents covering all architectural decisions

Files Changed

  • azure-ai-agentserver-core/azure/ai/agentserver/core/durable/ — 15 new modules
  • azure-ai-agentserver-core/tests/durable/ — 17 test files
  • azure-ai-agentserver-core/samples/ — 3 sample directories
  • azure-ai-agentserver-core/docs/ — Developer guide
  • azure-ai-agentserver-invocations/samples/ — 2 integration samples

…-core

Implements a crash-resilient durable task system with:

- @durable_task decorator with full lifecycle management (start, run, get, cancel, terminate)
- TaskResult[Output] wrapper replacing exception-based suspension handling
- Cooperative cancellation and configurable timeouts
- Configurable retry policies with backoff
- Callable factories for tags, title, and description
- Local in-memory provider for development/testing
- Task streaming support via AsyncIterator
- Lease-based distributed locking
- Ephemeral and persistent task modes
- Task metadata and source provenance tracking
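
The cooperative-then-hard cancellation flow mentioned above can be sketched with plain `asyncio`. This is an illustration under stated assumptions, not the framework's implementation: `run_with_timeout`, the `cancel_event` argument, and the returned outcome strings are all hypothetical stand-ins for whatever `ctx.cancel` and the task status model actually provide.

```python
import asyncio
from typing import Awaitable, Callable


async def run_with_timeout(
    work: Callable[[asyncio.Event], Awaitable[None]],
    timeout: float,
    grace: float,
) -> str:
    """Run work(cancel_event); on timeout, signal first, hard-cancel after `grace`."""
    cancel_event = asyncio.Event()
    task = asyncio.ensure_future(work(cancel_event))

    done, _ = await asyncio.wait({task}, timeout=timeout)
    if done:
        task.result()  # re-raise any exception from the work function
        return "completed"

    # Cooperative phase: ask the function to stop and give it a grace period.
    cancel_event.set()
    done, _ = await asyncio.wait({task}, timeout=grace)
    if done:
        task.result()
        return "stopped cooperatively"

    # Hard phase: the function ignored the signal, so cancel the asyncio task.
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return "hard cancelled"


async def polite(cancel_event: asyncio.Event) -> None:
    # Checks the cancel signal between small units of work.
    while not cancel_event.is_set():
        await asyncio.sleep(0.01)


async def stubborn(cancel_event: asyncio.Event) -> None:
    # Never looks at the cancel signal.
    await asyncio.sleep(10)
```

Calling `asyncio.run(run_with_timeout(polite, 0.05, 0.5))` ends in the cooperative phase, while `stubborn` with a short grace period is hard-cancelled.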

Includes:
- 248 passing tests across 17 test modules
- 3 sample applications (retry, source, streaming)
- Developer guide documentation
- Spec files (001-006) covering all design decisions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
github-actions bot added the Hosted Agents sdk/agentserver/* label on May 12, 2026
RaviPidaparthi changed the title from "feat(agentserver): Add durable task framework to azure-ai-agentserver-core" to "feat(agentserver): Add durable long-running agents to azure-ai-agentserver-core" on May 12, 2026
RaviPidaparthi and others added 13 commits May 12, 2026 19:22
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- TaskMetadata: add MutableMapping dict protocol (__setitem__,
  __getitem__, __delitem__, __contains__, __iter__, __len__, keys,
  values, items) with dirty-tracking on mutations
- Fix cspell CI failures: rename 'sess' abbreviations in _models.py,
  test_local_provider.py, test_models.py, test_source.py
- CHANGELOG 2.0.0b4: document all durable long-running agent features
- README: add durable agents section with code examples and dev guide link
- Developer guide: update metadata examples to dict-style syntax
- Invocations: bump core dep to >=2.0.0b4, add durable samples changelog
- Specs 001-007 and backlog: all 16 items resolved
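
For readers unfamiliar with the pattern, a dict-style mapping with dirty-tracking can be sketched as follows. This is not the PR's `TaskMetadata` class: `TrackedMetadata` and its `dirty` attribute are hypothetical names, and the real class may track changes at a finer granularity.

```python
from collections.abc import MutableMapping
from typing import Any, Dict, Iterator, Optional


class TrackedMetadata(MutableMapping):
    """Dict-like container that records whether it was mutated (hypothetical sketch)."""

    def __init__(self, initial: Optional[Dict[str, Any]] = None) -> None:
        self._data: Dict[str, Any] = dict(initial or {})
        self.dirty = False  # set to True by any mutation

    def __getitem__(self, key: str) -> Any:
        return self._data[key]

    def __setitem__(self, key: str, value: Any) -> None:
        self._data[key] = value
        self.dirty = True

    def __delitem__(self, key: str) -> None:
        del self._data[key]
        self.dirty = True

    def __iter__(self) -> Iterator[str]:
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)
```

Subclassing `collections.abc.MutableMapping` supplies `__contains__`, `keys`, `values`, `items`, `update`, and `pop` as mixin methods built on the five methods above, so mutations made through `update` or `pop` also set the dirty flag.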

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Explain the problem (containers can die), the 4-step durability mechanism
(persist → lease → recover → complete), and the net effect before listing
what the developer doesn't need to think about.
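
The four steps named above (persist, lease, recover, complete) can be sketched with an in-memory store. This is a simplified illustration, not the framework's provider: `Store`, `Record`, the status strings, and the explicit `now` parameter are all hypothetical, and owners "A" and "B" stand in for a container before and after a restart.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    task_id: str
    payload: str
    status: str = "pending"
    lease_owner: Optional[str] = None
    lease_expires: float = 0.0


class Store:
    """In-memory stand-in for a durable task store (hypothetical sketch)."""

    def __init__(self) -> None:
        self.records: Dict[str, Record] = {}

    def persist(self, task_id: str, payload: str) -> None:
        # Step 1: record the task before any work starts.
        self.records[task_id] = Record(task_id, payload)

    def acquire_lease(self, task_id: str, owner: str, ttl: float, now: float) -> bool:
        # Step 2: claim the task unless it is done or someone holds a live lease.
        rec = self.records[task_id]
        if rec.status == "completed":
            return False
        if rec.lease_owner is not None and rec.lease_expires > now:
            return False
        rec.lease_owner, rec.lease_expires, rec.status = owner, now + ttl, "running"
        return True

    def recoverable(self, now: float) -> List[str]:
        # Step 3: running tasks whose lease expired were likely interrupted.
        return [r.task_id for r in self.records.values()
                if r.status == "running" and r.lease_expires <= now]

    def complete(self, task_id: str, owner: str) -> None:
        # Step 4: only the current lease holder may mark the task done.
        rec = self.records[task_id]
        if rec.lease_owner != owner:
            raise PermissionError("lease is held by another owner")
        rec.status, rec.lease_owner = "completed", None
```

Passing `now` explicitly keeps the sketch deterministic. A real store would use a clock and renew the lease while the task is still running.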

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Clarify that durable tasks are not a checkpoint/replay engine, not a
result store, not a stream log, not app-level persistence, and not
unbounded storage. Fix misleading 'checkpoint progress' language to
'lightweight progress signals'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Clarify that the framework recovers crashed tasks on container restart
automatically, not in response to a caller calling .run() again.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix name default: __qualname__, not 'Function name'
- Add missing ctx.agent_name and ctx.lease_generation to properties table
- Fix recovery description: automatic at startup + on .run()/.start()
- Fix cancel semantics: function returning normally = success, not TaskCancelled
- Update cancel vs terminate table with accurate outcomes
- Fix resume docs: both .run() and .start() handle suspended tasks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Sphinx: remove durable re-exports from core/__init__.py to fix
  duplicate object description warnings (symbols documented at both
  core and core.durable levels)
- MyPy: fix 3 type errors (_run.py Future type, _manager.py narrowing)
- Pylint: fix 55 issues across 7 files (docstrings, unused imports,
  import ordering, complexity suppressions)
- Constitution v1.3.0: add pre-push validation gate (NON-NEGOTIABLE)

All checks pass locally: pylint 10.00/10, mypy clean, sphinx clean,
261 tests passed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ng, samples

Steering:
- Full steering implementation with generation model, pending queue, drain logic
- ctx.was_steered, ctx.previous_input, ctx.pending_inputs, ctx.generation
- SteeringQueueFull exception, TaskResult.is_superseded
- Completion-vs-steering race handling with etag
- Crash recovery with drain_in_progress flag

Task listing:
- DurableTask.list(status, session_id) with auto-scoping per function
- Server-side: agent_name, session_id, tag, status filters
- Client-side: source.type filter (until DEV-009 resolved)
- Provider protocol + local provider tag AND filtering
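
As a rough illustration of the filtering described above, the sketch below combines status, session, and tag filters, where every requested tag must match (AND semantics). The `TaskRecord` shape and the `list_tasks` function are hypothetical and assume tags are string key/value pairs. The actual `DurableTask.list` signature is only partially described in this commit message.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class TaskRecord:
    task_id: str
    status: str
    session_id: str
    tags: Dict[str, str] = field(default_factory=dict)


def list_tasks(
    records: Iterable[TaskRecord],
    status: Optional[str] = None,
    session_id: Optional[str] = None,
    tags: Optional[Dict[str, str]] = None,
) -> List[TaskRecord]:
    """Return records matching every supplied filter; None means 'no filter'."""
    wanted = tags or {}
    return [
        r for r in records
        if (status is None or r.status == status)
        and (session_id is None or r.session_id == session_id)
        # AND semantics: each requested tag must be present with the same value.
        and all(r.tags.get(k) == v for k, v in wanted.items())
    ]
```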

Reserved tag protection:
- _strip_reserved_tags() at all entry points (decorator, callsite, options)
- Framework auto-stamps _durable_task_name tag, always wins
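
A minimal sketch of the idea: user-supplied tags are stripped of reserved keys at every entry point, and the framework's own tag is stamped last so it always wins. Only the tag name `_durable_task_name` comes from this commit message. The rule that reserved tags are those starting with an underscore, the dict-of-strings tag shape, and the helper names are assumptions made for illustration.

```python
from typing import Dict, Optional

RESERVED_PREFIX = "_"  # assumption: reserved tags are underscore-prefixed
TASK_NAME_TAG = "_durable_task_name"  # tag name taken from the commit message


def strip_reserved_tags(tags: Optional[Dict[str, str]]) -> Dict[str, str]:
    """Drop any user-supplied tag whose key is in the reserved namespace."""
    return {k: v for k, v in (tags or {}).items() if not k.startswith(RESERVED_PREFIX)}


def build_tags(task_name: str, *sources: Optional[Dict[str, str]]) -> Dict[str, str]:
    """Merge tag sources in order (later wins), then stamp the framework tag."""
    merged: Dict[str, str] = {}
    for source in sources:
        merged.update(strip_reserved_tags(source))
    merged[TASK_NAME_TAG] = task_name  # stamped last, so it cannot be overridden
    return merged
```

For example, `build_tags("my_task", {"team": "a", "_durable_task_name": "spoof"}, {"team": "b"})` returns `{"team": "b", "_durable_task_name": "my_task"}`.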

Recovery routing:
- _find_resume_callback() matches source.name first (stable anchor)
- name param documented as stable identity anchor

Other:
- Local provider payload merge fixed to strict shallow (spec §11)
- steering_poll_seconds removed from public API (internal 2s default kept)
- Multi-worker references removed (single-container model)
- Developer guide cleaned of internal implementation details
- Steering spec updated to match implementation
- Samples: durable_claude, durable_copilot, updated durable_langgraph

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ming

Replace hardcoded asyncio.Queue with a pluggable StreamHandler protocol
(put/get/close) for the durable task streaming path.

Changes:
- New _stream.py: StreamHandler protocol + QueueStreamHandler default
- Refactored _context.py, _run.py, _manager.py: _stream_queue -> _stream_handler
- Added stream_handler param to start()/run() in _decorator.py
- Updated __init__.py exports
- Updated test_streaming.py and test_sample_e2e.py
- Updated developer guide with Custom Stream Handlers section
- SSE streaming samples and invocations framework updates
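
The pluggable handler described above can be sketched as a `typing.Protocol` with a queue-backed default. Only the method names `put`, `get`, and `close` and the class names come from this commit message. Whether the real methods are async, how end-of-stream is signalled, and the `StreamClosed` exception are assumptions made for this sketch.

```python
import asyncio
from typing import Any, List, Protocol


class StreamClosed(Exception):
    """Raised by get() once the stream has been closed (hypothetical)."""


class StreamHandler(Protocol):
    async def put(self, item: Any) -> None: ...
    async def get(self) -> Any: ...
    async def close(self) -> None: ...


_SENTINEL = object()  # marks end-of-stream inside the queue


class QueueStreamHandler:
    """Default handler backed by asyncio.Queue (sketch, not the PR's code)."""

    def __init__(self) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()

    async def put(self, item: Any) -> None:
        await self._queue.put(item)

    async def get(self) -> Any:
        item = await self._queue.get()
        if item is _SENTINEL:
            raise StreamClosed()
        return item

    async def close(self) -> None:
        await self._queue.put(_SENTINEL)


async def collect(handler: StreamHandler) -> List[Any]:
    """Drain a handler until it reports that the stream is closed."""
    items: List[Any] = []
    while True:
        try:
            items.append(await handler.get())
        except StreamClosed:
            return items


async def demo() -> List[Any]:
    handler = QueueStreamHandler()  # created inside the running event loop
    consumer = asyncio.ensure_future(collect(handler))
    for chunk in ("a", "b", "c"):
        await handler.put(chunk)
    await handler.close()
    return await consumer
```

Because `StreamHandler` is a structural protocol, any object with matching `put`, `get`, and `close` methods can replace the queue-backed default, for example one that forwards chunks to an SSE response.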

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>