[DRAFT] BREAKING FEAT: Scenario Core Refactor Proposal by ValbuenaVC · Pull Request #1767 · microsoft/PyRIT

ValbuenaVC · 2026-05-20T21:23:46Z

Description

Note: this is a huge PR and it's a proposal. It shouldn't be merged in as one PR due to its size and scope, but it's here as a point of reference for ongoing changes in other PRs.

Re-architects pyrit/scenario/core/ around an explicit state-machine layer so scenarios can express non-linear control flow (branching, escalation, retries-by-state) instead of the previous flat `for atomic in atomic_attacks` loop. Landed additively across 10 phases so that each commit stayed independently green, with no destructive renames and no schema-breaking migrations.

The motivation comes from a review comment on #1622 (rapid response scenario) and discussion around #1654 (cyber technique registry) and #1760 (text adaptive scenario). The flat-loop pattern works for content-harms-style scenarios that are semantically heterogeneous but operationally identical, but it can't cleanly express things like "sweep, then deep-dive only on weak categories" or "select the next technique adaptively based on the last response." This PR introduces the abstraction without changing observable behavior for any existing scenario.

What landed

ScenarioStep ABC (pyrit/scenario/core/scenario_step.py) — the unit of work the new graph dispatches. Every scenario step is one. AtomicAttackScenarioStep is the back-compat adapter that wraps an AtomicAttack and exposes the same surface so the default linear policy continues to work for everyone.
StrategyGraph + StrategyPolicy + PolicyAction (pyrit/scenario/core/strategy_graph.py) — a generic state machine over (StepT, StateT). StrategyGraph.event_loop_async walks the graph, dispatches the policy action for the current state, and accumulates ScenarioStepResult history. linear_strategy_policy(steps) is the default and is used by every legacy scenario unchanged.
ScenarioCoreState (pyrit/scenario/core/scenario_state.py) — the scenario-level state enum (UNINITIALIZED, INITIALIZED, RUNNING, COMPLETED, FAILED) consumed by Scenario.run_async for lifecycle telemetry.
OutcomeScorer (pyrit/score/decorators/outcome_scorer.py) — composition decorator over a Scorer that maps its output to a transition label via an outcome_map: dict[label, predicate]. The "unscored" sentinel is always declared so step validators can pin missing transitions early.
step_identifier persistence (pyrit/identifiers/step_identifier.py + an alembic migration adding the column to AttackResultEntries) — additive nullable column. Legacy rows continue to load. Step identity nests inner attack identifiers via children={"key": [list_of_identifiers]}.
Adaptive scenario (pyrit/scenario/scenarios/adaptive/*) — vendored from FEAT text adaptive scenario #1760 (commit 99fa9dce) then migrated to a ScenarioStep-based AdaptiveStep driven by a linear StrategyPolicy[ScenarioStep, int] (commit a151bedb).
BroadSweepThenDeepDive (pyrit/scenario/scenarios/airt/sweep_then_deep_dive.py) — first real branching scenario. Sweep step classifies each response via OutcomeScorer; the policy emits DEEP_DIVING only when a category was flagged. Two terminal states (COMPLETE vs ALL_SAFE) so downstream can tell why the scenario stopped. Validates the abstraction end-to-end.
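
To make the state-machine idea above concrete, here is a minimal sketch of a policy that maps states to async actions, plus an event loop that walks them until a terminal state is reached. Every name in it (`MiniGraph`, `SweepState`, `StepResult`, `build_graph`) is a hypothetical stand-in written for illustration; it is not the PyRIT `StrategyGraph` API, and the real classes may differ in shape.

```python
import asyncio
from dataclasses import dataclass, field
from enum import Enum
from typing import Awaitable, Callable, Optional


class SweepState(str, Enum):
    SWEEPING = "sweeping"
    DEEP_DIVING = "deep_diving"
    COMPLETE = "complete"  # terminal: a deep dive ran
    ALL_SAFE = "all_safe"  # terminal: nothing was flagged


@dataclass(frozen=True)
class StepResult:
    step_name: str
    outcome: str


# Each action runs one step and returns (result, next_state).
Action = Callable[[], Awaitable[tuple[StepResult, SweepState]]]


@dataclass
class MiniGraph:
    actions: dict[SweepState, Action]
    initial_state: SweepState
    terminal_states: frozenset[SweepState]
    history: list[StepResult] = field(default_factory=list)
    final_state: Optional[SweepState] = None

    async def event_loop_async(self):
        state = self.initial_state
        while state not in self.terminal_states:
            # Dispatch the action registered for the current state.
            result, state = await self.actions[state]()
            self.history.append(result)
            yield result
        self.final_state = state


def build_graph(flagged: list[str]) -> MiniGraph:
    async def sweep():
        # Branch: deep-dive only when the sweep flagged a category.
        if flagged:
            return StepResult("sweep", "violation"), SweepState.DEEP_DIVING
        return StepResult("sweep", "safe"), SweepState.ALL_SAFE

    async def deep_dive():
        return StepResult("deep_dive", "done"), SweepState.COMPLETE

    return MiniGraph(
        actions={SweepState.SWEEPING: sweep, SweepState.DEEP_DIVING: deep_dive},
        initial_state=SweepState.SWEEPING,
        terminal_states=frozenset({SweepState.COMPLETE, SweepState.ALL_SAFE}),
    )


async def run(flagged: list[str]):
    graph = build_graph(flagged)
    outcomes = [result.outcome async for result in graph.event_loop_async()]
    return outcomes, graph.final_state
```

In this sketch the two terminal states play the role described for BroadSweepThenDeepDive: a caller can inspect `final_state` to tell whether the run stopped because a deep dive finished or because nothing was flagged.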

Why this is marked `[BREAKING]`

No public API was removed, but the orchestration contract that scenario authors rely on changed shape:

Scenario.run_async now drives steps through StrategyGraph.event_loop_async instead of an inline for loop. Scenario subclasses that previously overrode run_async directly (rather than just _get_atomic_attacks_async) need to either move to the new _build_execution_graph hook or accept the default linear policy.
AdaptiveDispatchAttack is deprecated in favor of AdaptiveStep (removal targeted for 0.17.0). Downstream code instantiating AdaptiveDispatchAttack directly will need to migrate.
The new step_identifier column on AttackResultEntries requires an alembic upgrade on existing memory databases. The migration is additive (column defaults to NULL for legacy rows) and includes a downgrade.
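
As an illustration of why an additive nullable column leaves legacy rows loadable, the sketch below uses the standard-library `sqlite3` module rather than alembic. The table layout is a hypothetical simplification, and a `TEXT` column stands in for the JSON column; only the column name `step_identifier` comes from this PR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified stand-in for the existing table.
conn.execute("CREATE TABLE AttackResultEntries (id INTEGER PRIMARY KEY, objective TEXT)")
conn.execute("INSERT INTO AttackResultEntries (objective) VALUES ('legacy row')")

# Additive change: a nullable column, with no default value and no backfill.
conn.execute("ALTER TABLE AttackResultEntries ADD COLUMN step_identifier TEXT")

# Rows written after the upgrade can populate the new column.
conn.execute(
    "INSERT INTO AttackResultEntries (objective, step_identifier) VALUES (?, ?)",
    ("new row", '{"step_name": "sweep"}'),
)

rows = conn.execute(
    "SELECT objective, step_identifier FROM AttackResultEntries ORDER BY id"
).fetchall()
conn.close()
```

The legacy row reads back with `None` in the new column, which is the behavior the migration relies on.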

Everything else — AtomicAttack, AttackTechniqueSpec, AttackTechniqueRegistry, AttackTechniqueFactory, ScenarioStrategy (the technique-enum), atomic_attack_identifier — is unchanged.

What's deferred

ScenarioWizard CLI — Phase 8 in the plan; queued for a follow-up PR.
Opportunistic migration of existing scenarios — Phase 9 in the plan. Every legacy scenario (rapid_response, encoding, jailbreak, cyber, scam, psychosocial, leakage, red_team_agent, adversarial_benchmark, fairness_bias) continues to drive its AtomicAttacks through the default linear policy via the back-compat adapter. Each port is its own small PR.
Generic _build_execution_graph signature — the base method types StrategyGraph[ScenarioStep, int]. BroadSweepThenDeepDive widens StateT to a per-scenario enum and uses a single targeted # ty: ignore[invalid-method-override]. Making the base method generic over StateT is tracked as a Phase 9 cleanup.

Related PRs and discussions

MAINT: Rapid response Scenario #1622 (rapid response — the review comment that sparked this design)
MAINT: Refactor Cyber scenario to use technique registry pattern #1654 (cyber technique registry — informed the keep-everything-additive constraint)
FEAT text adaptive scenario #1760 (text adaptive — vendored verbatim, then migrated)

Tests and Documentation

Test coverage is the centerpiece of this PR. Each phase landed with its tests, then a targeted Phase 10 sweep added missing coverage for the six concerns the user called out (performance, scenario state, resumability, attack-id dedup, AtomicAttack → ScenarioStep migration safety, AttackTechniqueSpec + ScenarioStrategy integration).

Full suite

Unit: 8066 passed / 118 skipped / 0 failed in ~2m17s on uv run python -m pytest tests/unit -n 4 --dist=loadfile (baseline main was 7912 passed).
Integration: tests/integration/scenarios/test_notebooks_scenarios.py collects 4 notebooks cleanly, including 3_adaptive_scenarios.ipynb.
No ruff / format / ty regressions introduced by any phase.

New test files

tests/unit/scenario/test_scenario_step.py — ScenarioStep ABC contract.
tests/unit/scenario/test_atomic_attack_scenario_step.py — back-compat adapter + duck-typed attrs.
tests/unit/scenario/test_strategy_graph.py and test_strategy_graph_branching.py — policy construction, traversal, and multi-way dispatch.
tests/unit/scenario/test_linear_strategy_policy.py — the default linear policy builder.
tests/unit/scenario/test_scenario_state.py — ScenarioCoreState lifecycle.
tests/unit/scenario/test_scenario_graph_execution.py — Scenario.run_async graph rewire.
tests/unit/scenario/scenarios/adaptive/test_*.py — vendored + migrated adaptive scenario.
tests/unit/scenario/scenarios/airt/test_sweep_then_deep_dive.py — branching scenario end-to-end (25 tests).
tests/unit/identifiers/test_step_identifier.py + test_step_evaluation_identifier.py — additive persistence.
tests/unit/score/decorators/test_outcome_scorer.py — OutcomeScorer decorator.

Phase 10 audit augmentations

10a — OutcomeScorer (+14): UNSCORED sentinel, declared-outcomes contract, wrapped-scorer exception propagation, defensive outcome_map copy.
10b — ScenarioStep + adapter (+11): ScenarioStepResult defaults freshness (mutable-default footgun), keyword-only remaining_objectives, ABC instantiation failure paths.
10c — StrategyGraph (+7): counting-mock action-invocation parity, frozenset external-mutation immunity, deterministic history order, 3-way branch dispatch.
10d — step_identifier (+10): no false dedup across attack-execution child params, execution-order semantics, multi-result share of one identifier, end-to-end SQLite round-trip of legacy NULL rows, alembic upgrade ↔ downgrade round-trip.
10e — Scenario.run_async rewire (+4): graph-rebuild terminal-state shrinkage on retry, step_identifier stamping without duplication (single + multi result), factory-produced technique end-to-end through StrategyGraph.
10f — adaptive migration (+10): AdaptiveStep is ScenarioStep not AtomicAttack, identifier stability under reversed-input technique order, bind_current_step invariants around each action.
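
The "defaults freshness" check in 10b guards against instances sharing one mutable default. A minimal sketch of that property, using a hypothetical `MiniStepResult` rather than the real `ScenarioStepResult`:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class MiniStepResult:
    outcome: str
    # default_factory builds a new dict per instance. A bare `= {}` default
    # is rejected by dataclasses with a ValueError at class-definition time.
    metadata: dict[str, Any] = field(default_factory=dict)


first = MiniStepResult(outcome="done")
second = MiniStepResult(outcome="done")

# frozen=True blocks rebinding attributes, but the dict itself is still mutable.
first.metadata["step_name"] = "sweep"
```

Because each instance gets its own dict, mutating `first.metadata` leaves `second.metadata` empty.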

Migration concerns surfaced during audit (non-blocking)

pyrit/scenario/core/scenario.py:1174 — the default linear policy dispatches via isinstance(_step, AtomicAttack) to preserve max_concurrency plumbing the bare ScenarioStep.process_async doesn't accept. Anywhere a non-AtomicAttack ScenarioStep is plugged into the default linear policy, max_concurrency is silently bypassed. Correct for today's only consumer (AtomicAttack); flagged for cleanup when more ScenarioStep subclasses appear (Phase 9).
step_identifier.attack_executions preserves caller order, so reversed-input order produces a different hash. 10d locked in the actual behavior with a regression test rather than silently switching to sort-for-stability. If sorted nesting is wanted long-term, the implementation change is small and isolated.
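
The order-sensitivity noted above for step_identifier.attack_executions can be reproduced with a small hashing sketch. The hashing scheme (SHA-256 over canonical JSON) and the technique names are assumptions made for illustration; the PR does not state how PyRIT computes identifier hashes.

```python
import hashlib
import json


def identifier_hash(attack_executions: list[dict]) -> str:
    # sort_keys canonicalizes dict key order only; list order is preserved.
    payload = json.dumps(
        {"children": {"attack_executions": attack_executions}}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def sorted_identifier_hash(attack_executions: list[dict]) -> str:
    # Hypothetical "sort-for-stability" variant: order children canonically first.
    ordered = sorted(attack_executions, key=lambda d: json.dumps(d, sort_keys=True))
    return identifier_hash(ordered)


a = {"technique": "technique_a"}
b = {"technique": "technique_b"}

caller_order_differs = identifier_hash([a, b]) != identifier_hash([b, a])
sorted_order_matches = sorted_identifier_hash([a, b]) == sorted_identifier_hash([b, a])
```

With caller order preserved, reversing the inputs changes the hash; sorting before hashing removes that dependence, at the cost of discarding execution order as part of identity.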

JupyText

No new notebook content in this PR — the adaptive scenarios notebook (doc/code/scenarios/3_adaptive_scenarios.py / .ipynb) was vendored as-is from #1760 in commit 99fa9dce and notebook collection is unchanged. Existing JupyText workflow applies: jupytext --sync doc/code/scenarios/*.py.

Local test command

uv run python -m pytest tests/unit -n 4 --dist=loadfile

(There is one pre-existing failure in tests/unit/cli/test_pyrit_scan.py::TestMain::test_main_prints_startup_message from an ODBC connection attempt to airtdev.database.windows.net that is unrelated to this refactor.)

…rer)

Land the empty-but-tested new abstractions for the scenario core refactor side by side with the existing flat-loop scenario plumbing. Nothing in pyrit/scenario/core/scenario.py changes yet; later phases wire these in.

New modules:
- pyrit/scenario/core/scenario_state.py: ScenarioCoreState enum (UNINITIALIZED, INITIALIZING, EXECUTING, COMPLETE, FAILED) plus ScenarioStateLike runtime-checkable protocol. Per-scenario state enums extend the vocabulary by satisfying the protocol.
- pyrit/scenario/core/scenario_step.py: ScenarioStep(Identifiable) ABC plus frozen ScenarioStepResult dataclass. One step owns one outcome decision (may wrap N attack executions).
- pyrit/scenario/core/strategy_graph.py: generic StrategyGraph orchestrator over a policy dict[state, async-action]. Restartable event_loop_async yields ScenarioStepResults; history tracked for resume. Constructor validates terminal_states, initial_state, and policy/terminal overlap.
- pyrit/score/decorators/outcome_scorer.py: OutcomeScorer composition wrapper around a Scorer. resolve_outcome_async returns the first matching label from outcome_map, or the 'unscored' sentinel. Not a Scorer subclass on purpose (composition keeps the Scorer ABC's validator and abstract methods out of the way).
- pyrit/identifiers/step_identifier.py: build_step_identifier factory plus STEP_EVAL_VERSION constant. Composite identifier wraps N atomic_attack_identifiers under children['attack_executions']. atomic_attack_identifier is unchanged: step identity is additive.

Exports:
- pyrit.identifiers re-exports build_step_identifier, STEP_EVAL_VERSION
- pyrit.score re-exports OutcomeScorer

Tests (44 new, all green): construction validation, identifier determinism, event-loop traversal, restartability across exceptions, outcome resolution and ordering, unscored fallback for empty score lists and unmatched predicates.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
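
The first-match resolution described for OutcomeScorer in the commit above can be sketched as follows. `MiniOutcomeScorer`, `fake_score`, and the score thresholds are hypothetical; the real class wraps a PyRIT `Scorer`, whose interface is not reproduced here.

```python
import asyncio
from typing import Awaitable, Callable

UNSCORED = "unscored"


class MiniOutcomeScorer:
    """Composition wrapper: owns a scoring callable instead of subclassing it."""

    def __init__(
        self,
        *,
        score_fn: Callable[[str], Awaitable[list[float]]],
        outcome_map: dict[str, Callable[[float], bool]],
    ) -> None:
        self._score_fn = score_fn
        # Defensive copy so later caller-side mutation cannot change the mapping.
        self._outcome_map = dict(outcome_map)

    @property
    def outcomes(self) -> frozenset[str]:
        # The sentinel is always declared alongside the mapped labels.
        return frozenset(self._outcome_map) | {UNSCORED}

    async def resolve_outcome_async(self, text: str) -> str:
        scores = await self._score_fn(text)
        if not scores:
            return UNSCORED
        # dicts preserve insertion order, so "first match" is well defined.
        for label, predicate in self._outcome_map.items():
            if predicate(scores[0]):
                return label
        return UNSCORED


async def fake_score(text: str) -> list[float]:
    # Hypothetical scorer: canned scores, empty list for unknown input.
    table = {"harmful": [0.9], "benign": [0.1], "ambiguous": [0.5]}
    return table.get(text, [])


scorer = MiniOutcomeScorer(
    score_fn=fake_score,
    outcome_map={"violation": lambda s: s >= 0.8, "safe": lambda s: s < 0.3},
)
```

An input whose score matches no predicate, or which yields no scores at all, resolves to the sentinel rather than raising, which is what lets a step validator treat `'unscored'` as an ordinary declared transition.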

Aligns the Phase 0 scaffold with the codebase's policy patterns used by TargetCapabilities / TargetRequirements / ScorerOverridePolicy:
- ScenarioCoreState now inherits (str, Enum) like CapabilityName and ScorerOverridePolicy, keeping state values JSON-serializable for resume payloads.
- New frozen StrategyPolicy dataclass wraps actions / initial_state / terminal_states with MappingProxyType defensive copy and a keyword-only get_action(*, state=...) / is_terminal(*, state=...) lookup API, mirroring CapabilityHandlingPolicy.behaviors / get_behavior.
- StrategyGraph is reduced to a thin orchestrator that consumes a single StrategyPolicy. Construction-time validation moved onto StrategyPolicy.__post_init__ so the policy is its own typed invariant.
- bind_current_step(*, step=...) is now keyword-only.

AtomicAttack inherits from ScenarioStep:
- name property aliases atomic_attack_name (the resume / dedup key).
- outputs returns a defensive copy of the single hard-coded `done` transition label.
- process_async wraps run_async into a ScenarioStepResult; incomplete_objectives and input_indices ride in result.metadata so the orchestrator (Phase 5) can consume them without forcing every step to invent its own payload type.
- _build_identifier nests the underlying AttackTechnique identifier under children.

ScenarioStepResult gains a metadata: dict[str, Any] field so steps can carry per-step bookkeeping (incomplete objectives, adaptive selector state, etc.) without polluting the outcome label.

Tests: 13 new ScenarioStep-contract tests for AtomicAttack and a full rewrite of test_strategy_graph.py to construct via StrategyPolicy. Scoped suite (tests/unit/scenario tests/unit/identifiers tests/unit/score) green: 1825 passed, 15 skipped.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
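
The frozen-policy pattern described in the commit above (defensive `MappingProxyType` copy, validation in `__post_init__`, keyword-only lookups) might look roughly like this. `MiniPolicy` and its specific validation rules are assumptions for illustration; the commit names the checks but does not spell out their exact conditions.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Callable, Mapping


@dataclass(frozen=True)
class MiniPolicy:
    actions: Mapping[str, Callable[[], str]]
    initial_state: str
    terminal_states: frozenset

    def __post_init__(self) -> None:
        # A frozen dataclass forbids normal assignment, so use object.__setattr__
        # to swap in a read-only view over a private copy of the caller's dict.
        object.__setattr__(self, "actions", MappingProxyType(dict(self.actions)))
        object.__setattr__(self, "terminal_states", frozenset(self.terminal_states))
        if not self.terminal_states:
            raise ValueError("at least one terminal state is required")
        if self.initial_state not in self.actions and self.initial_state not in self.terminal_states:
            raise ValueError("initial_state must have an action or be terminal")
        overlap = self.terminal_states & set(self.actions)
        if overlap:
            raise ValueError(f"terminal states must not have actions: {sorted(overlap)}")

    def get_action(self, *, state: str) -> Callable[[], str]:
        return self.actions[state]

    def is_terminal(self, *, state: str) -> bool:
        return state in self.terminal_states


source = {"start": lambda: "end"}
policy = MiniPolicy(actions=source, initial_state="start", terminal_states={"end"})

# Mutating the caller's dict afterwards does not leak into the policy.
source["extra"] = lambda: "end"
```

Putting validation on the policy itself means any code that holds a policy instance can assume its invariants already hold, which is presumably why the graph could be reduced to a thin orchestrator.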

…coverage (Phase 3)

Completes Phase 3 of the scenario-core refactor by adding the convenience policy builder and the branching-graph proof-of-concept tests called out in the rubber-duck pass.

linear_strategy_policy(steps):
- Produces a StrategyPolicy[ScenarioStep, int] that walks an ordered list of steps state-by-state, with action i binding steps[i] as current_step, awaiting its process_async, and transitioning to state i+1. State len(steps) is the sole terminal state.
- Captures step / next_state via default-argument binding to dodge the classic late-binding closure bug in for-loops.
- Always clears current_step in a finally so a step raising mid-execution doesn't leave the graph in an inconsistent state; the graph stays at the failed state so the existing retry loop can re-enter.
- This is the policy Phase 5 will use to silently upgrade legacy scenarios that still declare their steps via _get_atomic_attacks_async.

test_linear_strategy_policy.py (6 tests):
- Locks the silent-upgrade contract: order preservation, binding lifecycle, late-binding bug guard, finally-clear on failure, and the empty-input guardrail.

test_strategy_graph_branching.py (4 tests):
- Forces the policy API through a non-trivial branching scenario (BroadSweepThenDeepDive) before Phase 5 commits to it: opening phase emits safe or violation; safe short-circuits to COMPLETE, violation routes through ESCALATION_PHASE first.
- Confirms that history records both branch states, that escalation step metadata survives the round trip, and that graph.reset() correctly replays the branching path.

Full unit suite: 7929 passed, 118 skipped (the one CLI test failure is the pre-existing ODBC driver missing on this host, unrelated to the refactor).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
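
The late-binding closure bug mentioned in the commit above, and the default-argument fix, can be shown in a few lines. The helper names (`build_linear_actions`, `build_buggy_actions`, `make_step`, `walk`) are hypothetical, and steps are modeled as plain async callables rather than ScenarioStep objects.

```python
import asyncio


def build_linear_actions(steps):
    if not steps:
        raise ValueError("steps must not be empty")
    actions = {}
    for index, step in enumerate(steps):
        # Default arguments are evaluated now, so each action keeps its own
        # step and next state instead of reading the loop variables later.
        async def action(_step=step, _next_state=index + 1):
            return await _step(), _next_state

        actions[index] = action
    return actions


def build_buggy_actions(steps):
    actions = {}
    for index, step in enumerate(steps):
        # Bug: `step` and `index` are looked up at call time, after the loop
        # has finished, so every action sees the last step and last index.
        async def action():
            return await step(), index + 1

        actions[index] = action
    return actions


def make_step(name):
    async def run():
        return name

    return run


async def walk(actions):
    # State len(steps) has no action, so it acts as the sole terminal state.
    state, seen = 0, []
    while state in actions:
        result, state = await actions[state]()
        seen.append(result)
    return seen


steps = [make_step("a"), make_step("b"), make_step("c")]
correct = asyncio.run(walk(build_linear_actions(steps)))
buggy = asyncio.run(walk(build_buggy_actions(steps)))
```

Here `correct` visits all three steps in order, while `buggy` runs only the last step once and jumps straight to the terminal state, which is the failure mode the regression test is meant to pin.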

Lands the additive `step_identifier` column on `AttackResultEntry` so `AttackResult` rows produced through the new `StrategyGraph` orchestrator carry the composite `ScenarioStep` identity built by `pyrit.identifiers.step_identifier.build_step_identifier` (introduced in Phase 0). Old rows stay null - no backfill, no destructive migration.

Per the Phase 4 plan, `atomic_attack_identifier` is NOT renamed and NOT removed. `step_identifier` is purely additive metadata that records *which step inside which scenario* produced the attack result. Direct attack invocations continue to set only `atomic_attack_identifier` and write `step_identifier = null`.

Changes:
* pyrit/identifiers/evaluation_identifier.py - new `StepEvaluationIdentifier` mirroring `AtomicAttackEvaluationIdentifier.CHILD_EVAL_RULES` so nested attack-execution children get filtered identically inside step-level eval grouping. The step's own params (`step_name`, `outcome`, `eval_version`) are fully included - a `STEP_EVAL_VERSION` bump splits two semantically-equivalent step runs.
* pyrit/identifiers/identifier_filters.py - `IdentifierType.STEP`.
* pyrit/identifiers/__init__.py - exports `StepEvaluationIdentifier`.
* pyrit/memory/alembic/versions/a1c2e4f80b3d_add_step_identifier.py - new migration chaining off `7a1b2c3d4e5f` adding a nullable JSON column.
* pyrit/memory/memory_models.py - `AttackResultEntry.step_identifier` JSON column; `__init__` populates `eval_hash` via `StepEvaluationIdentifier` BEFORE the `to_dict` truncation pass so the hash survives DB storage, mirroring the atomic_attack_identifier precedent; `get_attack_result` reconstructs via `ComponentIdentifier.from_dict`.
* pyrit/memory/memory_interface.py - `identifier_column_map` extended so `IdentifierType.STEP` filters route to the new column.
* pyrit/models/attack_result.py - `step_identifier: Optional` field added to the dataclass + `to_dict` / `from_dict`. Old payloads without the key still hydrate cleanly.

Tests (+18 new, all passing; full unit suite 7947 passed, 118 skipped, 1 pre-existing ODBC env failure):
* test_step_evaluation_identifier.py - eval-hash stability, outcome / nested-target / eval_version sensitivity, scorer / operational-param exclusions, rule parity with AtomicAttackEvaluationIdentifier.
* test_memory_models.py - AttackResultEntry round-trip with and without step_identifier, eval_hash preservation through the column.
* test_attack_result.py - to_dict / from_dict round-trip; null behavior.
* test_interface_attack_results.py - SQLite filter by `IdentifierType.STEP` matches step_name and skips legacy rows.
* test_identifier_filters.py - guard test count + value assertion.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Phase 5 of the scenario-core refactor moves Scenario.run_async from a flat for-loop over AtomicAttacks to a StrategyGraph event loop, without changing observable behavior for any existing scenario.

Key changes in pyrit/scenario/core/scenario.py:
* New `_build_execution_graph(*, steps=None)` factory returns the StrategyGraph that drives the execution attempt. Default implementation wraps the supplied steps (or self._atomic_attacks) via `_build_default_linear_policy`, which preserves AtomicAttack-level concurrency semantics (max_concurrency, return_partial_on_failure) and stamps each step's name into ScenarioStepResult.metadata['step_name'] so the orchestrator can identify yields without depending on graph.current_step.
* `_execute_scenario_async` now iterates `self._execution_graph.event_loop_async()` instead of the flat remaining_attacks list. Resume-by-name semantics are preserved: `_get_remaining_atomic_attacks_async` runs first, the graph is built from its output, and already-completed steps simply aren't in the policy. Partial-failure handling, retry, scenario_run_state transitions, error_attack_result_ids persistence, and progress-bar continuity all behave identically.
* Each AttackResult flowing out of the graph is stamped with a step_identifier (the Phase 4 column) and that identifier is pushed to the existing AttackResultEntry row via update_attack_result_by_id, mirroring AtomicAttack._enrich_atomic_attack_identifiers. Steps that pre-stamp their own step_identifier (e.g., future adaptive steps) are not overwritten.
* New public properties `execution_graph` and `execution_history` expose the active attempt's state machine for inspection and downstream tooling.

Tests:
* New tests/unit/scenario/test_scenario_graph_execution.py (11 tests) pins the new public surface: graph factory contract, execution_graph/execution_history properties, step_identifier stamping (default and pre-stamped), max_concurrency propagation, partial-failure surfacing, and non-AtomicAttack ScenarioStep dispatch through process_async via subclass override.
* Full unit suite: 7958 passed, 118 skipped, 1 pre-existing ODBC env failure unrelated to the refactor.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…e 6a)

Phase 6a brings the in-flight adaptive scenario landing (PR microsoft#1760, hawestra/text_adaptive_scenario) into this branch as a sibling module so Phase 6b can migrate it onto the new StrategyGraph without blocking on upstream merge order.

Files vendored verbatim from the PR head (1375974):
* pyrit/scenario/scenarios/adaptive/{__init__.py, adaptive_scenario.py, dispatcher.py, selector.py, text_adaptive.py}
* tests/unit/scenario/scenarios/adaptive/{test_dispatcher.py, test_selector.py, test_text_adaptive.py}
* doc/code/scenarios/3_adaptive_scenarios.{ipynb, py}
* doc/myst.yml: added 3_adaptive_scenarios entry

Only edit applied locally:
* pyrit/scenario/__init__.py: merged the PR's adaptive export with this branch's existing Phase 0-3 scaffold exports (PolicyAction, StrategyGraph, StrategyPolicy, ScenarioStep, ScenarioStepResult, ScenarioCoreState, ScenarioStateLike, linear_strategy_policy). Re-sorted the __all__ block to keep submodule names grouped.

Test counts: vendored adaptive suite runs 63 tests green; full unit suite 8021 passed / 118 skipped / 1 pre-existing ODBC env failure (test_main_prints_startup_message, unrelated).

Phase 6b will rewrite AdaptiveScenario to drive its event loop through StrategyGraph + a recurring SELECTING state, deprecating AdaptiveDispatchAttack in favor of an AdaptiveStep whose process_async owns one selector tick. The vendored tests become the regression net for that port.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Introduces `AdaptiveStep(ScenarioStep)` as the per-objective execution unit and migrates `AdaptiveScenario` to dispatch through `StrategyGraph`. The new step extracts the per-objective adaptive loop from `AdaptiveDispatchAttack._perform_async` and emits `ScenarioStepResult` with outcome label `'success'` or `'exhausted'` (lifting the static `'done'` outcome). It duck-types the `AtomicAttack`-like attributes (`atomic_attack_name`, `objectives`, `seed_groups`, `display_group`, `filter_seed_groups_by_objectives`) so the orchestrator's resume bookkeeping continues to work without changes.

`AdaptiveScenario` now overrides `_build_execution_graph` with a custom linear policy (`_build_adaptive_linear_policy`) that always dispatches via `step.process_async()`, bypassing the base class's `isinstance(_step, AtomicAttack)` branch that would otherwise flatten outcomes to `'done'`. The scenario caches its single `AdaptiveTechniqueSelector` on `self._selector` for external introspection and shares the same reference across every emitted `AdaptiveStep`.

`AdaptiveDispatchAttack` is deprecated via `print_deprecation_message` pointing to `AdaptiveStep`; scheduled for removal in 0.17.0. Module docstring updated accordingly.

Tests: adds `tests/unit/scenario/scenarios/adaptive/test_adaptive_step.py` (19 tests across init validation, AtomicAttack parity, process loop, identifier shape, adaptive-context labels). Migrates 3 assertions in `test_text_adaptive.py` (selector sharing, seed-technique compat) to introspect `step._techniques`/`step._selector` directly. Suppresses dispatcher deprecation noise via module-level `pytestmark` in `test_dispatcher.py` and adds a dedicated `TestDeprecation` class that explicitly asserts the warning fires.

Adaptive package: 83 tests pass (was 64). Full unit suite: 7984 passed (no regressions outside the pre-existing ODBC env failure in test_pyrit_scan.py).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>