[DRAFT] BREAKING FEAT: Scenario Core Refactor Proposal #1767

Draft
ValbuenaVC wants to merge 46 commits into microsoft:main from ValbuenaVC:vvalbuena-microsoft/scenario-core-refactor

ValbuenaVC (Contributor) commented May 20, 2026

Description

Note: this is a huge PR and it's a proposal. It shouldn't be merged in as one PR due to its size and scope, but it's here as a point of reference for ongoing changes in other PRs.

Re-architects pyrit/scenario/core/ around an explicit state-machine layer so scenarios can express non-linear control flow (branching, escalation, retries-by-state) instead of the previous flat `for atomic in atomic_attacks` loop. Landed additively across 10 phases so each commit stayed independently green; no destructive renames and no schema-breaking migrations.

The motivation comes from a review comment on #1622 (rapid response scenario) and discussion around #1654 (cyber technique registry) and #1760 (text adaptive scenario). The flat-loop pattern works for content-harms-style scenarios that are semantically heterogeneous but operationally identical, but it can't cleanly express things like "sweep, then deep-dive only on weak categories" or "select the next technique adaptively based on the last response." This PR introduces the abstraction without changing observable behavior for any existing scenario.

What landed

  • ScenarioStep ABC (pyrit/scenario/core/scenario_step.py) — the unit of work the new graph dispatches. Every scenario step is one. AtomicAttackScenarioStep is the back-compat adapter that wraps an AtomicAttack and exposes the same surface so the default linear policy continues to work for everyone.
  • StrategyGraph + StrategyPolicy + PolicyAction (pyrit/scenario/core/strategy_graph.py) — a generic state machine over (StepT, StateT). StrategyGraph.event_loop_async walks the graph, dispatches the policy action for the current state, and accumulates ScenarioStepResult history. linear_strategy_policy(steps) is the default and is used by every legacy scenario unchanged.
  • ScenarioCoreState (pyrit/scenario/core/scenario_state.py) — the scenario-level state enum (UNINITIALIZED, INITIALIZED, RUNNING, COMPLETED, FAILED) consumed by Scenario.run_async for lifecycle telemetry.
  • OutcomeScorer (pyrit/score/decorators/outcome_scorer.py) — composition decorator over a Scorer that maps its output to a transition label via an outcome_map: dict[label, predicate]. The "unscored" sentinel is always declared so step validators can pin missing transitions early.
  • step_identifier persistence (pyrit/identifiers/step_identifier.py + an alembic migration adding the column to AttackResultEntries) — additive nullable column. Legacy rows continue to load. Step identity nests inner attack identifiers via children={"key": [list_of_identifiers]}.
  • Adaptive scenario (pyrit/scenario/scenarios/adaptive/*) — vendored from FEAT text adaptive scenario #1760 (commit 99fa9dce) then migrated to a ScenarioStep-based AdaptiveStep driven by a linear StrategyPolicy[ScenarioStep, int] (commit a151bedb).
  • BroadSweepThenDeepDive (pyrit/scenario/scenarios/airt/sweep_then_deep_dive.py) — first real branching scenario. Sweep step classifies each response via OutcomeScorer; the policy emits DEEP_DIVING only when a category was flagged. Two terminal states (COMPLETE vs ALL_SAFE) so downstream can tell why the scenario stopped. Validates the abstraction end-to-end.
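
To make the branching idea concrete, here is a minimal, self-contained sketch of a two-terminal state machine in the spirit of BroadSweepThenDeepDive. It is an illustration only and does not use the PR's classes: `SweepState`, `StepResult`, `build_actions`, `event_loop`, and `run_scenario` are hypothetical names, and only the terminal names COMPLETE and ALL_SAFE are taken from the description above.

```python
import asyncio
from dataclasses import dataclass, field
from enum import Enum


class SweepState(str, Enum):
    # COMPLETE and ALL_SAFE are named in the description; the other
    # members and all string values are illustrative assumptions.
    SWEEPING = "sweeping"
    DEEP_DIVING = "deep_diving"
    COMPLETE = "complete"
    ALL_SAFE = "all_safe"


@dataclass
class StepResult:
    # Simplified stand-in for ScenarioStepResult.
    outcome: str
    metadata: dict = field(default_factory=dict)


def build_actions(flagged):
    """Return a state -> async action mapping; each action returns (result, next_state)."""

    async def sweep():
        # Branch on whether the sweep flagged any category.
        if flagged:
            return StepResult("violation", {"flagged": sorted(flagged)}), SweepState.DEEP_DIVING
        return StepResult("safe"), SweepState.ALL_SAFE

    async def deep_dive():
        return StepResult("done", {"targets": sorted(flagged)}), SweepState.COMPLETE

    return {SweepState.SWEEPING: sweep, SweepState.DEEP_DIVING: deep_dive}


async def event_loop(actions, initial_state, terminal_states):
    # Dispatch the action for the current state until a terminal state is reached.
    state, history = initial_state, []
    while state not in terminal_states:
        result, state = await actions[state]()
        history.append(result)
    return state, history


TERMINAL_STATES = frozenset({SweepState.COMPLETE, SweepState.ALL_SAFE})


def run_scenario(flagged):
    return asyncio.run(event_loop(build_actions(flagged), SweepState.SWEEPING, TERMINAL_STATES))
```

Calling `run_scenario({"hate"})` walks SWEEPING, then DEEP_DIVING, and stops at COMPLETE, while `run_scenario(set())` stops at ALL_SAFE after one step, so a caller can tell from the final state why the run ended.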

Why this is marked [BREAKING]

No public API was removed, but the orchestration contract that scenario authors rely on changed shape:

  • Scenario.run_async now drives steps through StrategyGraph.event_loop_async instead of an inline for loop. Scenario subclasses that previously overrode run_async directly (rather than just _get_atomic_attacks_async) need to either move to the new _build_execution_graph hook or accept the default linear policy.
  • AdaptiveDispatchAttack is deprecated in favor of AdaptiveStep (removal targeted for 0.17.0). Downstream code instantiating AdaptiveDispatchAttack directly will need to migrate.
  • The new step_identifier column on AttackResultEntries requires an alembic upgrade on existing memory databases. The migration is additive (column defaults to NULL for legacy rows) and includes a downgrade.

Everything else — AtomicAttack, AttackTechniqueSpec, AttackTechniqueRegistry, AttackTechniqueFactory, ScenarioStrategy (the technique-enum), atomic_attack_identifier — is unchanged.

What's deferred

  • ScenarioWizard CLI — Phase 8 in the plan; queued for a follow-up PR.
  • Opportunistic migration of existing scenarios — Phase 9 in the plan. Every legacy scenario (rapid_response, encoding, jailbreak, cyber, scam, psychosocial, leakage, red_team_agent, adversarial_benchmark, fairness_bias) continues to drive its AtomicAttacks through the default linear policy via the back-compat adapter. Each port is its own small PR.
  • Generic _build_execution_graph signature — the base method types StrategyGraph[ScenarioStep, int]. BroadSweepThenDeepDive widens StateT to a per-scenario enum and uses a single targeted # ty: ignore[invalid-method-override]. Making the base method generic over StateT is tracked as a Phase 9 cleanup.

Related PRs and discussions

Tests and Documentation

Test coverage is the centerpiece of this PR. Each phase landed with its tests, then a targeted Phase 10 sweep added missing coverage for the six concerns the user called out (performance, scenario state, resumability, attack-id dedup, AtomicAttackScenarioStep migration safety, AttackTechniqueSpec + ScenarioStrategy integration).

Full suite

  • Unit: 8066 passed / 118 skipped / 0 failed in ~2m17s on uv run python -m pytest tests/unit -n 4 --dist=loadfile (baseline main was 7912 passed).
  • Integration: tests/integration/scenarios/test_notebooks_scenarios.py collects 4 notebooks cleanly, including 3_adaptive_scenarios.ipynb.
  • No ruff / format / ty regressions introduced by any phase.

New test files

  • tests/unit/scenario/test_scenario_step.py: ScenarioStep ABC contract.
  • tests/unit/scenario/test_atomic_attack_scenario_step.py: back-compat adapter + duck-typed attrs.
  • tests/unit/scenario/test_strategy_graph.py and test_strategy_graph_branching.py: policy construction, traversal, and multi-way dispatch.
  • tests/unit/scenario/test_linear_strategy_policy.py: the default linear policy builder.
  • tests/unit/scenario/test_scenario_state.py: ScenarioCoreState lifecycle.
  • tests/unit/scenario/test_scenario_graph_execution.py: Scenario.run_async graph rewire.
  • tests/unit/scenario/scenarios/adaptive/test_*.py: vendored + migrated adaptive scenario.
  • tests/unit/scenario/scenarios/airt/test_sweep_then_deep_dive.py: branching scenario end-to-end (25 tests).
  • tests/unit/identifiers/test_step_identifier.py + test_step_evaluation_identifier.py: additive persistence.
  • tests/unit/score/decorators/test_outcome_scorer.py: OutcomeScorer decorator.

Phase 10 audit augmentations

  • 10a — OutcomeScorer (+14): UNSCORED sentinel, declared-outcomes contract, wrapped-scorer exception propagation, defensive outcome_map copy.
  • 10b — ScenarioStep + adapter (+11): ScenarioStepResult defaults freshness (mutable-default footgun), keyword-only remaining_objectives, ABC instantiation failure paths.
  • 10c — StrategyGraph (+7): counting-mock action-invocation parity, frozenset external-mutation immunity, deterministic history order, 3-way branch dispatch.
  • 10d — step_identifier (+10): no false dedup across attack-execution child params, execution-order semantics, multi-result share of one identifier, end-to-end SQLite round-trip of legacy NULL rows, alembic upgrade ↔ downgrade round-trip.
  • 10e — Scenario.run_async rewire (+4): graph-rebuild terminal-state shrinkage on retry, step_identifier stamping without duplication (single + multi result), factory-produced technique end-to-end through StrategyGraph.
  • 10f — adaptive migration (+10): AdaptiveStep is ScenarioStep not AtomicAttack, identifier stability under reversed-input technique order, bind_current_step invariants around each action.

Migration concerns surfaced during audit (non-blocking)

  • pyrit/scenario/core/scenario.py:1174 — the default linear policy dispatches via isinstance(_step, AtomicAttack) to preserve max_concurrency plumbing the bare ScenarioStep.process_async doesn't accept. Anywhere a non-AtomicAttack ScenarioStep is plugged into the default linear policy, max_concurrency is silently bypassed. Correct for today's only consumer (AtomicAttack); flagged for cleanup when more ScenarioStep subclasses appear (Phase 9).
  • step_identifier.attack_executions preserves caller order, so reversed-input order produces a different hash. 10d locked in the actual behavior with a regression test rather than silently switching to sort-for-stability. If sorted nesting is wanted long-term, the implementation change is small and isolated.
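
The order-sensitivity concern can be reproduced with a small sketch. This is not the PR's `build_step_identifier`; `identifier_hash` and its `sort_children` flag are hypothetical, and the JSON + SHA-256 hashing scheme is an assumption chosen only to show why caller order changes the hash and how sorting would stabilize it.

```python
import hashlib
import json


def identifier_hash(step_name, attack_executions, *, sort_children=False):
    # Hypothetical helper: hash a step payload whose children are a list.
    # Lists are order-sensitive in JSON, so caller order affects the digest
    # unless the children are sorted first.
    children = sorted(attack_executions) if sort_children else list(attack_executions)
    payload = {"step_name": step_name, "children": {"attack_executions": children}}
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


forward = identifier_hash("sweep", ["attack_a", "attack_b"])
reverse = identifier_hash("sweep", ["attack_b", "attack_a"])
print(forward == reverse)  # False: caller order is part of the identity
```

With `sort_children=True` both orderings produce the same digest, which is the "sorted nesting" alternative mentioned above.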

JupyText

No new notebook content in this PR — the adaptive scenarios notebook (doc/code/scenarios/3_adaptive_scenarios.py / .ipynb) was vendored as-is from #1760 in commit 99fa9dce and notebook collection is unchanged. Existing JupyText workflow applies: jupytext --sync doc/code/scenarios/*.py.

Local test command

uv run python -m pytest tests/unit -n 4 --dist=loadfile

(There is one pre-existing failure in tests/unit/cli/test_pyrit_scan.py::TestMain::test_main_prints_startup_message from an ODBC connection attempt to airtdev.database.windows.net that is unrelated to this refactor.)

Victor Valbuena and others added 30 commits May 20, 2026 12:14
…rer)

Land the empty-but-tested new abstractions for the scenario core
refactor side by side with the existing flat-loop scenario plumbing.
Nothing in pyrit/scenario/core/scenario.py changes yet; later phases
wire these in.

New modules:
- pyrit/scenario/core/scenario_state.py: ScenarioCoreState enum
  (UNINITIALIZED, INITIALIZING, EXECUTING, COMPLETE, FAILED) plus
  ScenarioStateLike runtime-checkable protocol. Per-scenario state
  enums extend the vocabulary by satisfying the protocol.
- pyrit/scenario/core/scenario_step.py: ScenarioStep(Identifiable)
  ABC plus frozen ScenarioStepResult dataclass. One step owns one
  outcome decision (may wrap N attack executions).
- pyrit/scenario/core/strategy_graph.py: generic StrategyGraph
  orchestrator over a policy dict[state, async-action]. Restartable
  event_loop_async yields ScenarioStepResults; history tracked for
  resume. Constructor validates terminal_states, initial_state, and
  policy/terminal overlap.
- pyrit/score/decorators/outcome_scorer.py: OutcomeScorer composition
  wrapper around a Scorer. resolve_outcome_async returns the first
  matching label from outcome_map, or the 'unscored' sentinel. Not a
  Scorer subclass on purpose (composition keeps the Scorer ABC's
  validator and abstract methods out of the way).
- pyrit/identifiers/step_identifier.py: build_step_identifier
  factory plus STEP_EVAL_VERSION constant. Composite identifier wraps
  N atomic_attack_identifiers under children['attack_executions'].
  atomic_attack_identifier is unchanged: step identity is additive.
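
As a rough illustration of the first-match-or-'unscored' behavior described for OutcomeScorer, the following sketch resolves a label from an ordered mapping of predicates. It is synchronous and works on plain floats for brevity; `resolve_outcome` and the thresholds in `outcome_map` are hypothetical and not the PR's API.

```python
from typing import Callable, Mapping, Sequence

UNSCORED = "unscored"  # sentinel name taken from the commit message


def resolve_outcome(
    scores: Sequence[float],
    outcome_map: Mapping[str, Callable[[Sequence[float]], bool]],
) -> str:
    # Empty score lists fall back to the sentinel.
    if not scores:
        return UNSCORED
    # Dicts preserve insertion order, so the first matching label wins.
    for label, predicate in outcome_map.items():
        if predicate(scores):
            return label
    # No predicate matched.
    return UNSCORED


# Illustrative thresholds only.
outcome_map = {
    "violation": lambda scores: max(scores) >= 0.5,
    "safe": lambda scores: max(scores) < 0.2,
}
```

Because the sentinel is always a possible return value, a graph built on these labels needs a transition for 'unscored' as well as for each declared label.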

Exports:
- pyrit.identifiers re-exports build_step_identifier, STEP_EVAL_VERSION
- pyrit.score re-exports OutcomeScorer

Tests (44 new, all green): construction validation, identifier
determinism, event-loop traversal, restartability across exceptions,
outcome resolution and ordering, unscored fallback for empty score
lists and unmatched predicates.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Aligns the Phase 0 scaffold with the codebase's policy patterns used by TargetCapabilities / TargetRequirements / ScorerOverridePolicy:

- ScenarioCoreState now inherits (str, Enum) like CapabilityName and ScorerOverridePolicy, keeping state values JSON-serializable for resume payloads.
- New frozen StrategyPolicy dataclass wraps actions / initial_state / terminal_states with MappingProxyType defensive copy and a keyword-only get_action(*, state=...) / is_terminal(*, state=...) lookup API, mirroring CapabilityHandlingPolicy.behaviors / get_behavior.
- StrategyGraph is reduced to a thin orchestrator that consumes a single StrategyPolicy. Construction-time validation moved onto StrategyPolicy.__post_init__ so the policy is its own typed invariant.
- bind_current_step(*, step=...) is now keyword-only.
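
A minimal sketch of the pattern described above (frozen dataclass, MappingProxyType defensive copy, keyword-only lookups, validation in `__post_init__`, and a `(str, Enum)` state). `CoreState` and `Policy` are hypothetical stand-ins, and the specific validation rules are assumptions rather than the PR's exact checks.

```python
import json
from dataclasses import dataclass
from enum import Enum
from types import MappingProxyType
from typing import Any, Callable, Mapping


class CoreState(str, Enum):
    # Hypothetical two-member enum; the str mixin keeps members JSON-serializable.
    RUNNING = "running"
    DONE = "done"


@dataclass(frozen=True)
class Policy:
    actions: Mapping[Any, Callable]
    initial_state: Any
    terminal_states: frozenset

    def __post_init__(self):
        # Assumed invariants, checked once at construction time.
        if not self.terminal_states:
            raise ValueError("at least one terminal state is required")
        if self.initial_state not in self.actions and self.initial_state not in self.terminal_states:
            raise ValueError("initial_state has no action and is not terminal")
        overlap = set(self.actions) & set(self.terminal_states)
        if overlap:
            raise ValueError(f"terminal states must not have actions: {overlap}")
        # Frozen dataclasses need object.__setattr__ to store the defensive copies.
        object.__setattr__(self, "actions", MappingProxyType(dict(self.actions)))
        object.__setattr__(self, "terminal_states", frozenset(self.terminal_states))

    def get_action(self, *, state):
        return self.actions[state]

    def is_terminal(self, *, state):
        return state in self.terminal_states


raw_actions = {CoreState.RUNNING: lambda: CoreState.DONE}
policy = Policy(actions=raw_actions, initial_state=CoreState.RUNNING, terminal_states={CoreState.DONE})
raw_actions.clear()  # mutating the caller's dict does not affect the policy
serialized = json.dumps({"state": CoreState.RUNNING})
```

The copy-then-proxy step is what makes the policy immune to later mutation of the input mapping, and the read-only proxy rejects item assignment with a TypeError.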

AtomicAttack inherits from ScenarioStep:

- name property aliases atomic_attack_name (the resume / dedup key).
- outputs returns a defensive copy of the single hard-coded `done` transition label.
- process_async wraps run_async into a ScenarioStepResult; incomplete_objectives and input_indices ride in result.metadata so the orchestrator (Phase 5) can consume them without forcing every step to invent its own payload type.
- _build_identifier nests the underlying AttackTechnique identifier under children.

ScenarioStepResult gains a metadata: dict[str, Any] field so steps can carry per-step bookkeeping (incomplete objectives, adaptive selector state, etc.) without polluting the outcome label.

Tests: 13 new ScenarioStep-contract tests for AtomicAttack and a full rewrite of test_strategy_graph.py to construct via StrategyPolicy. Scoped suite (tests/unit/scenario tests/unit/identifiers tests/unit/score) green: 1825 passed, 15 skipped.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…coverage (Phase 3)

Completes Phase 3 of the scenario-core refactor by adding the convenience policy builder and the branching-graph proof-of-concept tests called out in the rubber-duck pass.

linear_strategy_policy(steps):

- Produces a StrategyPolicy[ScenarioStep, int] that walks an ordered list of steps state-by-state, with action i binding steps[i] as current_step, awaiting its process_async, and transitioning to state i+1. State len(steps) is the sole terminal state.
- Captures step / next_state via default-argument binding to dodge the classic late-binding closure bug in for-loops.
- Always clears current_step in a finally so a step raising mid-execution doesn't leave the graph in an inconsistent state — the graph stays at the failed state so the existing retry loop can re-enter.
- This is the policy Phase 5 will use to silently upgrade legacy scenarios that still declare their steps via _get_atomic_attacks_async.
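
The closure and cleanup points above can be sketched in a few lines. This is a simplified illustration, not the PR's `linear_strategy_policy`: `Graph`, `linear_policy`, and `walk` are hypothetical, and steps are modeled as plain async callables.

```python
import asyncio


class Graph:
    """Hypothetical holder for the currently bound step."""

    def __init__(self):
        self.current_step = None


def linear_policy(graph, steps):
    if not steps:
        raise ValueError("steps must not be empty")
    actions = {}
    for i, step in enumerate(steps):
        # Default arguments freeze step and next_state for this iteration.
        # A plain closure would see only the last loop values when called later.
        async def action(step=step, next_state=i + 1):
            graph.current_step = step
            try:
                return await step(), next_state
            finally:
                # Always unbind, even if the step raises.
                graph.current_step = None

        actions[i] = action
    # State len(steps) is the sole terminal state.
    return actions, len(steps)


async def walk(actions, terminal):
    state, outcomes = 0, []
    while state != terminal:
        outcome, state = await actions[state]()
        outcomes.append(outcome)
    return outcomes


async def step_a():
    return "a"


async def step_b():
    return "b"


graph = Graph()
actions, terminal = linear_policy(graph, [step_a, step_b])
outcomes = asyncio.run(walk(actions, terminal))
```

Here `outcomes` preserves the input order, and `graph.current_step` is back to `None` after the walk whether or not a step raised.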

test_linear_strategy_policy.py (6 tests):

- Locks the silent-upgrade contract: order preservation, binding lifecycle, late-binding bug guard, finally-clear on failure, and the empty-input guardrail.

test_strategy_graph_branching.py (4 tests):

- Forces the policy API through a non-trivial branching scenario (BroadSweepThenDeepDive) before Phase 5 commits to it: opening phase emits safe or violation; safe short-circuits to COMPLETE, violation routes through ESCALATION_PHASE first.
- Confirms that history records both branch states, that escalation step metadata survives the round trip, and that graph.reset() correctly replays the branching path.

Full unit suite: 7929 passed, 118 skipped (the one CLI test failure is the pre-existing ODBC driver missing on this host — unrelated to the refactor).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Lands the additive `step_identifier` column on `AttackResultEntry` so
`AttackResult` rows produced through the new `StrategyGraph` orchestrator
carry the composite `ScenarioStep` identity built by
`pyrit.identifiers.step_identifier.build_step_identifier` (introduced in
Phase 0). Old rows stay null - no backfill, no destructive migration.

Per the Phase 4 plan, `atomic_attack_identifier` is NOT renamed and NOT
removed. `step_identifier` is purely additive metadata that records *which
step inside which scenario* produced the attack result. Direct attack
invocations continue to set only `atomic_attack_identifier` and write
`step_identifier = null`.
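
The additive, no-backfill behavior can be demonstrated with the standard-library `sqlite3` module. This sketch is not the alembic migration from the PR; the table name `attack_results` and the stored JSON text are assumptions used only to show that legacy rows read back as NULL after a nullable column is added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical pre-migration schema with one legacy row.
conn.execute(
    "CREATE TABLE attack_results (id INTEGER PRIMARY KEY, atomic_attack_identifier TEXT)"
)
conn.execute("INSERT INTO attack_results (atomic_attack_identifier) VALUES ('legacy')")

# Additive change: a nullable column with no default and no backfill.
conn.execute("ALTER TABLE attack_results ADD COLUMN step_identifier TEXT")

# New rows may carry a step identifier; direct attack invocations would leave it NULL.
conn.execute(
    "INSERT INTO attack_results (atomic_attack_identifier, step_identifier) VALUES (?, ?)",
    ("new", '{"step_name": "sweep"}'),
)

rows = conn.execute(
    "SELECT atomic_attack_identifier, step_identifier FROM attack_results ORDER BY id"
).fetchall()
```

The legacy row comes back with `None` in the new column, so readers that treat `step_identifier` as optional keep working without touching old data.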

Changes:
 * pyrit/identifiers/evaluation_identifier.py - new
   `StepEvaluationIdentifier` mirroring
   `AtomicAttackEvaluationIdentifier.CHILD_EVAL_RULES` so nested
   attack-execution children get filtered identically inside step-level
   eval grouping. The step's own params (`step_name`, `outcome`,
   `eval_version`) are fully included - a `STEP_EVAL_VERSION` bump
   splits two semantically-equivalent step runs.
 * pyrit/identifiers/identifier_filters.py - `IdentifierType.STEP`.
 * pyrit/identifiers/__init__.py - exports `StepEvaluationIdentifier`.
 * pyrit/memory/alembic/versions/a1c2e4f80b3d_add_step_identifier.py - new
   migration chaining off `7a1b2c3d4e5f` adding a nullable JSON column.
 * pyrit/memory/memory_models.py - `AttackResultEntry.step_identifier`
   JSON column; `__init__` populates `eval_hash` via
   `StepEvaluationIdentifier` BEFORE the `to_dict` truncation pass so
   the hash survives DB storage, mirroring the atomic_attack_identifier
   precedent; `get_attack_result` reconstructs via
   `ComponentIdentifier.from_dict`.
 * pyrit/memory/memory_interface.py - `identifier_column_map` extended
   so `IdentifierType.STEP` filters route to the new column.
 * pyrit/models/attack_result.py - `step_identifier: Optional` field
   added to the dataclass + `to_dict` / `from_dict`. Old payloads
   without the key still hydrate cleanly.

Tests (+18 new, all passing; full unit suite 7947 passed, 118 skipped,
1 pre-existing ODBC env failure):
 * test_step_evaluation_identifier.py - eval-hash stability, outcome /
   nested-target / eval_version sensitivity, scorer / operational-param
   exclusions, rule parity with AtomicAttackEvaluationIdentifier.
 * test_memory_models.py - AttackResultEntry round-trip with and without
   step_identifier, eval_hash preservation through the column.
 * test_attack_result.py - to_dict / from_dict round-trip; null behavior.
 * test_interface_attack_results.py - SQLite filter by
   `IdentifierType.STEP` matches step_name and skips legacy rows.
 * test_identifier_filters.py - guard test count + value assertion.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Phase 5 of the scenario-core refactor moves Scenario.run_async from a flat for-loop over AtomicAttacks to a StrategyGraph event loop, without changing observable behavior for any existing scenario.

Key changes in pyrit/scenario/core/scenario.py:

* New `_build_execution_graph(*, steps=None)` factory returns the StrategyGraph that drives the execution attempt. Default implementation wraps the supplied steps (or self._atomic_attacks) via `_build_default_linear_policy`, which preserves AtomicAttack-level concurrency semantics (max_concurrency, return_partial_on_failure) and stamps each step's name into ScenarioStepResult.metadata['step_name'] so the orchestrator can identify yields without depending on graph.current_step.

* `_execute_scenario_async` now iterates `self._execution_graph.event_loop_async()` instead of the flat remaining_attacks list. Resume-by-name semantics are preserved: `_get_remaining_atomic_attacks_async` runs first, the graph is built from its output, and already-completed steps simply aren't in the policy. Partial-failure handling, retry, scenario_run_state transitions, error_attack_result_ids persistence, and progress-bar continuity all behave identically.

* Each AttackResult flowing out of the graph is stamped with a step_identifier (the Phase 4 column) and that identifier is pushed to the existing AttackResultEntry row via update_attack_result_by_id, mirroring AtomicAttack._enrich_atomic_attack_identifiers. Steps that pre-stamp their own step_identifier (e.g., future adaptive steps) are not overwritten.

* New public properties `execution_graph` and `execution_history` expose the active attempt's state machine for inspection and downstream tooling.

Tests:

* New tests/unit/scenario/test_scenario_graph_execution.py (11 tests) pins the new public surface: graph factory contract, execution_graph/execution_history properties, step_identifier stamping (default and pre-stamped), max_concurrency propagation, partial-failure surfacing, and non-AtomicAttack ScenarioStep dispatch through process_async via subclass override.

* Full unit suite: 7958 passed, 118 skipped, 1 pre-existing ODBC env failure unrelated to the refactor.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e 6a)

Phase 6a brings the in-flight adaptive scenario landing (PR microsoft#1760, hawestra/text_adaptive_scenario) into this branch as a sibling module so Phase 6b can migrate it onto the new StrategyGraph without blocking on upstream merge order.

Files vendored verbatim from the PR head (1375974):

* pyrit/scenario/scenarios/adaptive/{__init__.py, adaptive_scenario.py, dispatcher.py, selector.py, text_adaptive.py}

* tests/unit/scenario/scenarios/adaptive/{test_dispatcher.py, test_selector.py, test_text_adaptive.py}

* doc/code/scenarios/3_adaptive_scenarios.{ipynb, py}

* doc/myst.yml — added 3_adaptive_scenarios entry

Only edit applied locally:

* pyrit/scenario/__init__.py — merged the PR's adaptive export with this branch's existing Phase 0-3 scaffold exports (PolicyAction, StrategyGraph, StrategyPolicy, ScenarioStep, ScenarioStepResult, ScenarioCoreState, ScenarioStateLike, linear_strategy_policy). Re-sorted the __all__ block to keep submodule names grouped.

Test counts: vendored adaptive suite runs 63 tests green; full unit suite 8021 passed / 118 skipped / 1 pre-existing ODBC env failure (test_main_prints_startup_message, unrelated).

Phase 6b will rewrite AdaptiveScenario to drive its event loop through StrategyGraph + a recurring SELECTING state, deprecating AdaptiveDispatchAttack in favor of an AdaptiveStep whose process_async owns one selector tick. The vendored tests become the regression net for that port.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Introduces `AdaptiveStep(ScenarioStep)` as the per-objective execution unit and migrates `AdaptiveScenario` to dispatch through `StrategyGraph`. The new step extracts the per-objective adaptive loop from `AdaptiveDispatchAttack._perform_async` and emits `ScenarioStepResult` with outcome label `'success'` or `'exhausted'` (lifting the static `'done'` outcome). It duck-types the `AtomicAttack`-like attributes (`atomic_attack_name`, `objectives`, `seed_groups`, `display_group`, `filter_seed_groups_by_objectives`) so the orchestrator's resume bookkeeping continues to work without changes.

`AdaptiveScenario` now overrides `_build_execution_graph` with a custom linear policy (`_build_adaptive_linear_policy`) that always dispatches via `step.process_async()` — bypassing the base class's `isinstance(_step, AtomicAttack)` branch that would otherwise flatten outcomes to `'done'`. The scenario caches its single `AdaptiveTechniqueSelector` on `self._selector` for external introspection and shares the same reference across every emitted `AdaptiveStep`.

`AdaptiveDispatchAttack` is deprecated via `print_deprecation_message` pointing to `AdaptiveStep`; scheduled for removal in 0.17.0. Module docstring updated accordingly.

Tests: adds `tests/unit/scenario/scenarios/adaptive/test_adaptive_step.py` (19 tests across init validation, AtomicAttack parity, process loop, identifier shape, adaptive-context labels). Migrates 3 assertions in `test_text_adaptive.py` (selector sharing, seed-technique compat) to introspect `step._techniques`/`step._selector` directly. Suppresses dispatcher deprecation noise via module-level `pytestmark` in `test_dispatcher.py` and adds a dedicated `TestDeprecation` class that explicitly asserts the warning fires. Adaptive package: 83 tests pass (was 64). Full unit suite: 7984 passed (no regressions outside the pre-existing ODBC env failure in test_pyrit_scan.py).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds regression coverage for the Phase 2 ScenarioStep ABC and the AtomicAttack ScenarioStep adapter:

- ScenarioStepResult: outcome is required; metadata/attack_results default factories produce fresh per-instance containers (Python mutable-default footgun); accepts all four fields when provided.

- ScenarioStep ABC: subclass missing process_async cannot instantiate; subclass that overrides only process_async inherits the default _build_identifier.

- AtomicAttack adapter: filter_seed_groups_by_objectives is keyword-only and correctly filters/preserves/empties seed_groups + objectives.
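
The mutable-default footgun these tests guard against is easy to show with a simplified dataclass. `StepResult` below is a hypothetical three-field stand-in, not the PR's ScenarioStepResult.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StepResult:
    # outcome has no default, so it is required.
    outcome: str
    # default_factory builds a fresh container per instance; a shared
    # literal default ([] or {}) is rejected by dataclasses for this reason.
    attack_results: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


first = StepResult(outcome="done")
second = StepResult(outcome="done")

# frozen=True blocks attribute rebinding, not mutation of the containers.
first.metadata["step_name"] = "sweep"
print(second.metadata)  # {} because each instance owns its own dict
```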

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds tests for the Phase 3 state-machine layer covering gaps in the existing suite:

- Performance: counting-mock assertion that an N-step linear graph invokes exactly N policy actions (guards against N**2 retraversal).

- State correctness: terminal_states is immune to external mutation of the input set; multi-terminal policies can reach an alternate terminal (FAILED, not just COMPLETE).

- Determinism: history ordering is identical across reset + re-run.

- Branching dispatch: parametrized 3-way branch confirms transitions are dict-lookup based rather than isinstance chains.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds high-signal tests for the Phase 6b adaptive scenario migration:

* AdaptiveStep is a ScenarioStep subclass (not AtomicAttack), with name aliasing atomic_attack_name for resume bookkeeping.

* _build_identifier output is stable when techniques dict is constructed in reversed key order.

* _build_adaptive_linear_policy + _build_execution_graph build a StrategyPolicy[ScenarioStep, int] with initial_state=0, terminal_states={len(steps)}, and one action per pre-terminal state.

* Event loop visits each step exactly once, terminates, propagates 'success'/'exhausted' outcomes verbatim, and binds/unbinds current_step around each action.

* End-to-end smoke: a real AdaptiveStep plugged into the adaptive linear policy emits 'success' as a real transition label.

Test count: 83 -> 93 (10 new tests).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds tests that fill the gaps Phase 4 left around the additive step_identifier column:

- step_identifier: no false dedup across attack-execution child configs, list (not nested dict) shape, execution-order is preserved, child param changes propagate to hash, eval_version is in params.

- memory interface: legacy AttackResult rows (NULL step_identifier) round-trip cleanly, and multiple results sharing one step_identifier are retrievable via a single STEP filter.

- alembic: a1c2e4f80b3d revision metadata, upgrade adds the column, full upgrade->downgrade round-trip restores the pre-Phase-4 schema.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add 4 tests covering Phase 5 (commit 952311d) gaps in Scenario.run_async:

- TestStepIdentifierStampingNoDuplication (2 tests): verify the step_identifier stamping path uses update_attack_result_by_id and never inserts duplicate rows, both for single- and multi-result steps.

- TestExecutionGraphRebuildOnRetry (1 test): verify the execution graph is rebuilt from the resume-filtered remaining steps after a partial failure, so terminal_states shrinks on retry.

- TestFactoryAtomicAttackGraphIntegration (1 test): end-to-end integration through AttackTechniqueFactory -> AttackTechnique -> AtomicAttack -> StrategyGraph execution path, asserting the factory-built attack is the one the executor receives and that step_identifier is stamped on the resulting AttackResult.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-factory catalog

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nstructor

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ve copy

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
3. **Deep dive phase**: run each provided multi-turn ``AtomicAttack`` ONLY
against the categories the sweep flagged. Untargeted categories are
skipped; their names are stamped into ``ScenarioStepResult.metadata``
for diagnostics.

Would it be simpler to keep the current design and instead add a composite scenario?

E.g.

ScenarioPipeline(phases=[
    PhaseSpec(scenario=rapid_response, name="cursory"),
    PhaseSpec(factory=re_probe_successes(...), name="deep_dive"),
    PhaseSpec(factory=adversarial_followup(...), name="amplify"),
])

To me that seems simpler than keeping track of state and messing with techniques.

rlundeen2 (Contributor) commented May 21, 2026

Branching and state management add a ton of complexity. That is sometimes needed, but it would be good to have a concrete example of scenarios this can help us unlock or make easier. It'd be worth a design meeting to chat about it!

Victor Valbuena and others added 12 commits May 21, 2026 11:25
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…typing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…pat stub

Two duck-driven follow-ups to e61df6f (R5 ScenarioPipeline):

- Class docstring rewritten to be explicit that per-phase outcomes
  live only on in-memory self._phase_executions in v1, not in the
  persisted outer ScenarioResult. The previous wording implied
  cross-process readers could inspect per-phase outcomes; they
  cannot until R5.1 wires phase_executions into metadata.

- _ScenarioPipelinePhaseStep.set_scenario_result_id added as a
  no-op stub. Today the base orchestrator's isinstance(_step,
  AtomicAttack) guard makes this unreachable, but R1 plans to
  collapse that guard and dispatch uniformly via process_async.
  Any non-AtomicAttack ScenarioStep needs this method or R1 will
  break with AttributeError. Regression test pins the contract.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ure modes)

Closes the R5.1 rubber-duck follow-ups on top of R5 (ScenarioPipeline):

- Add Scenario._finalize_scenario_result_async base hook (no-op default)
  called once between the last successful step and the COMPLETED state
  transition, giving composition subclasses a place to write run-summary
  state into ScenarioResult.metadata.
- Override the hook on ScenarioPipeline to persist per-phase outcomes as
  metadata['phase_executions'] (a list of name/outcome/inner_scenario_result_id
  dicts), so cross-process readers can reload the pipeline result and
  walk phases without holding a live pipeline instance. Class docstring
  updated to reflect the new persistence contract.
- Invert metadata merge order in _build_phase_action: pipeline-stamped
  diagnostic keys (step_name, phase_index) now win over inner-step result
  metadata. Regression test pins the inversion against a NoisyStep that
  emits colliding keys.
- Document PipelineContext immutability nuance: structurally frozen at
  the dataclass level, but inner ScenarioResult payloads are not deep-
  immutable and should be treated as read-only by convention.
- Sharpen input_schema docstring on the kept-but-broken 'phases' role:
  explicit guidance that the OPAQUE tag is an authoring-refusal signal
  for the wizard until pipelines can round-trip.
- Add TestPipelineFailureModes covering Duck #1's M1 gaps: inner
  initialize_async / run_async exceptions, predicate exceptions, and
  partial-progress phase_executions on mid-flight failure.
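
The merge-order inversion in _build_phase_action can be illustrated with plain dict unpacking. This is a minimal sketch, not the PR's code: `merge_phase_metadata` and the sample values are hypothetical, and only the key names step_name and phase_index come from the notes above.

```python
def merge_phase_metadata(inner: dict, stamped: dict) -> dict:
    """Merge inner-step metadata with pipeline-stamped diagnostic keys.

    Hypothetical helper: later unpacking wins on key collisions, so the
    pipeline-stamped keys override whatever the inner step emitted.
    """
    return {**inner, **stamped}


# A "noisy" inner step that emits keys colliding with the pipeline's own.
inner_metadata = {"step_name": "noisy", "phase_index": 99, "custom": "kept"}
pipeline_stamped = {"step_name": "sweep", "phase_index": 0}

merged = merge_phase_metadata(inner_metadata, pipeline_stamped)
```

Non-colliding inner keys (here `custom`) survive the merge; only the colliding diagnostic keys are overwritten.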

50/50 composite tests pass (was 41 at R5 ship); 1066/1066 scenario tests
pass overall. Pre-commit clean (ruff format/check, ty).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Unify scenario step dispatch on `ScenarioStep.process_async` so the base linear policy handles AtomicAttack, AdaptiveStep, and any future ScenarioStep subclass through one code path. Adds a setter on `AtomicAttack` so the base policy can push the scenario-level `max_concurrency` into atomic steps without the orchestrator special-casing step types. Introduces `LinearScenario` as the L0 authoring tier so users can construct a scenario from a list of pre-built steps without subclassing.

- `AtomicAttack.set_scenario_max_concurrency` + `_scenario_max_concurrency` instance state, with `process_async` honoring the bound value when delegating to `run_async`.

- `Scenario._build_default_linear_policy` now pushes max_concurrency into every `AtomicAttack` step before the action loop and always dispatches via `process_async` (removes the isinstance branch that forced AdaptiveStep authors into L2).

- `AdaptiveScenario._build_execution_graph` and `_build_adaptive_linear_policy` (~80 LOC) deleted; the base linear policy now drives adaptive correctly because outcomes propagate verbatim from `AdaptiveStep.process_async`.

- `LinearScenario(steps=[...], objective_scorer=...)` returns a runnable scenario with zero subclassing — the L0 entry point sketched in the R1 plan response to rlundeen's PR microsoft#1767 review.

Test fixture pattern: under `MagicMock(spec=AtomicAttack)`, the default AsyncMock fallback for `process_async` returns coroutines that fail metadata unpacking. Five test fixtures are updated to wire `process_async` to delegate to `run_async`, so existing `run_async.assert_called_with` assertions continue to work through the new dispatch chain. New tests cover the setter validation, `process_async` max_concurrency forwarding, and end-to-end LinearScenario execution.
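
A rough sketch of the fixture wiring, under stated assumptions: `FakeAtomicAttack` is a hypothetical stand-in for the real class, and the keyword passthrough is illustrative rather than the actual `process_async` signature. It relies only on standard `unittest.mock` behavior, where async methods on a spec'd class are mocked as `AsyncMock`.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


class FakeAtomicAttack:
    """Hypothetical stand-in exposing the two async entry points."""

    async def run_async(self, **kwargs):
        raise NotImplementedError

    async def process_async(self, **kwargs):
        raise NotImplementedError


def make_step_mock() -> MagicMock:
    step = MagicMock(spec=FakeAtomicAttack)
    step.run_async.return_value = "ok"

    # The side effect must itself be a coroutine function so that
    # AsyncMock awaits it and returns run_async's result.
    async def _delegate(**kwargs):
        return await step.run_async(**kwargs)

    step.process_async = AsyncMock(side_effect=_delegate)
    return step


step = make_step_mock()
result = asyncio.run(step.process_async(max_concurrency=3))
```

With this wiring, assertions written against `run_async` still observe calls that arrive through `process_async`.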

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Renames the step-builder hook on Scenario from _get_atomic_attacks_async to _get_steps_async to honestly reflect that subclasses may return any ScenarioStep (AtomicAttack, AdaptiveStep, _ScenarioPipelinePhaseStep, etc.), not just AtomicAttacks. The legacy name keeps working as a passthrough through 0.16.0.

Base class now exposes _get_steps_async as the real factory (cross-product over selected techniques and datasets). _get_atomic_attacks_async stays as a thin delegate. __init_subclass__ detects subclasses that still override only the legacy name and emits a DeprecationWarning once at class-creation time so authors see the rename horizon before their next run_async() call. Internal callsite in initialize_async now invokes _get_steps_async; the existing baseline-injection rescue path is unchanged.
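
The class-creation-time detection could look roughly like the following. This is a sketch under assumptions, not the PR's implementation: `ScenarioSketch` is a hypothetical stand-in for the base class, the warning text is invented, and the legacy-override rescue path is omitted.

```python
import warnings


class ScenarioSketch:
    """Hypothetical stand-in for the Scenario base class."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Inspect only names defined directly on the new subclass.
        overrides_legacy = "_get_atomic_attacks_async" in cls.__dict__
        overrides_new = "_get_steps_async" in cls.__dict__
        if overrides_legacy and not overrides_new:
            warnings.warn(
                f"{cls.__name__} overrides _get_atomic_attacks_async; "
                "override _get_steps_async instead.",
                DeprecationWarning,
                stacklevel=2,
            )

    async def _get_steps_async(self):
        return []

    async def _get_atomic_attacks_async(self):
        # Legacy name kept as a thin delegate to the new hook.
        return await self._get_steps_async()
```

Because the check runs in `__init_subclass__`, the warning fires once when the subclass is defined rather than on every call.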

Migrates all 8 first-party Scenario subclasses (adaptive, adversarial, red_team_agent, encoding, jailbreak, psychosocial, scam, sweep_then_deep_dive), LinearScenario, and ScenarioPipeline to the new name. Test fixtures across the scenario suite are migrated except for the two that intentionally exercise the legacy rescue path (test_baseline_deprecation, test_scenario._LegacyOverrideScenario). Walkthroughs in doc/ and .github/instructions/scenarios.instructions.md updated with the rename plus a deprecation pointer.

Adds tests/unit/scenario/test_get_steps_async_rename.py pinning: legacy-override-only emits the warning, new-override-only stays quiet, both-overrides stays quiet, neither-override stays quiet, legacy override reached via _get_steps_async delegation, and new override reached via _get_atomic_attacks_async passthrough.

Per rlundeen review on microsoft#1767: surfaces R2 from the R-series rollout (R1 collapsed the adaptive override into the base linear policy; R3 will split scenario/step state).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Renames the singular current_step abstraction to active_steps (tuple) to prepare for R4 concurrent dispatch. Adds active_steps property + bind_active_steps mutator on StrategyGraph; keeps current_step + bind_current_step as backward-compat shims. current_step emits DeprecationWarning only when ambiguous (len(active_steps) > 1).
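
One way to picture the shim semantics is sketched below. `GraphSketch` is a hypothetical reduction of StrategyGraph, and returning the first active step in the ambiguous case is an assumption; the notes above only state that the warning fires when more than one step is active.

```python
import warnings


class GraphSketch:
    """Hypothetical reduction of StrategyGraph to its step-binding surface."""

    def __init__(self):
        self._active_steps = ()

    @property
    def active_steps(self) -> tuple:
        return self._active_steps

    def bind_active_steps(self, steps) -> None:
        self._active_steps = tuple(steps)

    def bind_current_step(self, step) -> None:
        # Backward-compat shim: a single step is a one-element tuple.
        self.bind_active_steps(() if step is None else (step,))

    @property
    def current_step(self):
        if len(self._active_steps) > 1:
            warnings.warn(
                "current_step is ambiguous with multiple active steps; "
                "use active_steps instead.",
                DeprecationWarning,
                stacklevel=2,
            )
        # Assumption: fall back to the first active step, or None if empty.
        return self._active_steps[0] if self._active_steps else None
```

Sequential callers that bind one step at a time never see the warning, so existing code keeps working until it opts into concurrent binding.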

Migrates all four first-party callsites to bind_active_steps: linear_strategy_policy, Scenario._build_default_linear_policy, ScenarioPipeline._build_phase_action, and BroadSweepThenDeepDive sweep+deep actions.

Adds tests/unit/scenario/test_active_steps_split.py (10 tests) covering default state, sequential binding, shim semantics, concurrent binding warning, and reset behavior.

Per rlundeen review on microsoft#1767: surfaces R3 from the R-series rollout (R2 renamed the step builder; R4 wires concurrent dispatch on top of this split).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds max_step_concurrency (int, default 1) to BroadSweepThenDeepDive and FilteredDeepDiveStep. The default of 1 preserves pre-R4 sequential semantics bit-for-bit. Values above 1 wrap the per-atomic dispatch in an asyncio.Semaphore and await via asyncio.gather; dispatched_categories and attack_results retain input order because gather preserves it.
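
The semaphore-plus-gather pattern described above can be sketched generically. `run_bounded` and its `worker` callback are hypothetical names, not part of the PR; only `asyncio.Semaphore` and `asyncio.gather` are taken from the description.

```python
import asyncio


async def run_bounded(items, worker, max_step_concurrency: int = 1) -> list:
    """Run worker(item) for each item with at most N calls in flight."""
    if max_step_concurrency < 1:
        raise ValueError("max_step_concurrency must be >= 1")

    semaphore = asyncio.Semaphore(max_step_concurrency)

    async def guarded(item):
        # The semaphore caps how many workers run at the same time.
        async with semaphore:
            return await worker(item)

    # gather returns results in the order the awaitables were passed,
    # regardless of which worker finishes first.
    return await asyncio.gather(*(guarded(item) for item in items))
```

With `max_step_concurrency=1` the workers run one at a time, which matches the sequential default; larger values only change how many run at once, not the order of the returned results.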

Validates inputs at both layers (>= 1) so wizard / programmatic callers fail fast on bogus values. Stamps the effective concurrency cap into ScenarioStepResult.metadata['max_step_concurrency'] for downstream diagnostics.

Surfaces the new scalar role through input_schema() so the wizard can elicit it (4 roles -> 5: 3 OPAQUE + 2 SCALAR).

Adds tests/unit/scenario/scenarios/airt/test_concurrent_deep_dive.py (13 tests) covering: validation, order preservation, empty short-circuit, peak in-flight observation via asyncio.Event gating, and semaphore upper-bound enforcement. Updates test_sweep_then_deep_dive_input_schema.py for the 5-role schema.

Per rlundeen review on microsoft#1767: R4 is the concrete concurrent-dispatch payload made possible by R3's active_steps split. Per-atomic active_steps publication (graph.mark_step_running per the plan's example) is a follow-up that requires the StepStatus sidecar abstraction that R3 explicitly deferred.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>