[DRAFT] BREAKING FEAT: Scenario Core Refactor Proposal #1767
Draft
ValbuenaVC wants to merge 46 commits into
…rer)

Land the empty-but-tested new abstractions for the scenario core refactor side by side with the existing flat-loop scenario plumbing. Nothing in pyrit/scenario/core/scenario.py changes yet; later phases wire these in.

New modules:
- pyrit/scenario/core/scenario_state.py: ScenarioCoreState enum (UNINITIALIZED, INITIALIZING, EXECUTING, COMPLETE, FAILED) plus ScenarioStateLike runtime-checkable protocol. Per-scenario state enums extend the vocabulary by satisfying the protocol.
- pyrit/scenario/core/scenario_step.py: ScenarioStep(Identifiable) ABC plus frozen ScenarioStepResult dataclass. One step owns one outcome decision (may wrap N attack executions).
- pyrit/scenario/core/strategy_graph.py: generic StrategyGraph orchestrator over a policy dict[state, async-action]. Restartable event_loop_async yields ScenarioStepResults; history tracked for resume. Constructor validates terminal_states, initial_state, and policy/terminal overlap.
- pyrit/score/decorators/outcome_scorer.py: OutcomeScorer composition wrapper around a Scorer. resolve_outcome_async returns the first matching label from outcome_map, or the 'unscored' sentinel. Not a Scorer subclass on purpose (composition keeps the Scorer ABC's validator and abstract methods out of the way).
- pyrit/identifiers/step_identifier.py: build_step_identifier factory plus STEP_EVAL_VERSION constant. Composite identifier wraps N atomic_attack_identifiers under children['attack_executions']. atomic_attack_identifier is unchanged: step identity is additive.

Exports:
- pyrit.identifiers re-exports build_step_identifier, STEP_EVAL_VERSION
- pyrit.score re-exports OutcomeScorer

Tests (44 new, all green): construction validation, identifier determinism, event-loop traversal, restartability across exceptions, outcome resolution and ordering, unscored fallback for empty score lists and unmatched predicates.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
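The "first matching label, else 'unscored'" rule described for OutcomeScorer can be illustrated with a small synchronous sketch. This is not the PR's implementation: the real method is `resolve_outcome_async` and wraps a `Scorer`. The function name `resolve_outcome`, the predicate signature, and the example thresholds are all assumptions made for illustration.

```python
from typing import Callable, Sequence

UNSCORED = "unscored"  # sentinel label named in the commit message

# Hypothetical predicate type: receives the raw score values for one step.
Predicate = Callable[[Sequence[float]], bool]


def resolve_outcome(scores: Sequence[float], outcome_map: dict[str, Predicate]) -> str:
    """Return the first label whose predicate matches, else the sentinel."""
    if not scores:
        # An empty score list can never match a predicate.
        return UNSCORED
    # dicts preserve insertion order, so earlier labels take priority.
    for label, predicate in outcome_map.items():
        if predicate(scores):
            return label
    return UNSCORED


# Illustrative map: ordering matters because 0.9 satisfies both predicates.
example_outcome_map: dict[str, Predicate] = {
    "violation": lambda s: max(s) >= 0.8,
    "borderline": lambda s: max(s) >= 0.5,
}
```

Because the map is walked in insertion order, a score that satisfies several predicates resolves to the label declared first, which is why the tests listed above cover ordering as well as the unscored fallback.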
Aligns the Phase 0 scaffold with the codebase's policy patterns used by TargetCapabilities / TargetRequirements / ScorerOverridePolicy:
- ScenarioCoreState now inherits (str, Enum) like CapabilityName and ScorerOverridePolicy, keeping state values JSON-serializable for resume payloads.
- New frozen StrategyPolicy dataclass wraps actions / initial_state / terminal_states with MappingProxyType defensive copy and a keyword-only get_action(*, state=...) / is_terminal(*, state=...) lookup API, mirroring CapabilityHandlingPolicy.behaviors / get_behavior.
- StrategyGraph is reduced to a thin orchestrator that consumes a single StrategyPolicy. Construction-time validation moved onto StrategyPolicy.__post_init__ so the policy is its own typed invariant.
- bind_current_step(*, step=...) is now keyword-only.

AtomicAttack inherits from ScenarioStep:
- name property aliases atomic_attack_name (the resume / dedup key).
- outputs returns a defensive copy of the single hard-coded `done` transition label.
- process_async wraps run_async into a ScenarioStepResult; incomplete_objectives and input_indices ride in result.metadata so the orchestrator (Phase 5) can consume them without forcing every step to invent its own payload type.
- _build_identifier nests the underlying AttackTechnique identifier under children.

ScenarioStepResult gains a metadata: dict[str, Any] field so steps can carry per-step bookkeeping (incomplete objectives, adaptive selector state, etc.) without polluting the outcome label.

Tests: 13 new ScenarioStep-contract tests for AtomicAttack and a full rewrite of test_strategy_graph.py to construct via StrategyPolicy. Scoped suite (tests/unit/scenario tests/unit/identifiers tests/unit/score) green: 1825 passed, 15 skipped.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
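A frozen policy dataclass with a MappingProxyType defensive copy and `__post_init__` validation might look roughly like the sketch below. `PolicySketch` is a hypothetical stand-in, not the PR's `StrategyPolicy`; the exact validation rules shown here are assumptions inferred from the commit text (terminal states present, initial state reachable, no action registered for a terminal state).

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Any, Callable, Mapping


@dataclass(frozen=True)
class PolicySketch:
    actions: Mapping[Any, Callable[..., Any]]
    initial_state: Any
    terminal_states: frozenset

    def __post_init__(self) -> None:
        # Frozen dataclasses block normal assignment, so use object.__setattr__.
        # Copy first, then wrap read-only: later mutation of the caller's dict
        # cannot leak into the policy, and the policy's view cannot be mutated.
        object.__setattr__(self, "actions", MappingProxyType(dict(self.actions)))
        object.__setattr__(self, "terminal_states", frozenset(self.terminal_states))
        if not self.terminal_states:
            raise ValueError("at least one terminal state is required")
        if self.initial_state not in self.actions and self.initial_state not in self.terminal_states:
            raise ValueError("initial_state must have an action or be terminal")
        overlap = self.terminal_states & set(self.actions)
        if overlap:
            raise ValueError(f"terminal states must not have actions: {sorted(map(str, overlap))}")

    def get_action(self, *, state: Any) -> Callable[..., Any]:
        # Keyword-only lookup, mirroring the API shape named in the commit.
        return self.actions[state]

    def is_terminal(self, *, state: Any) -> bool:
        return state in self.terminal_states
```

Putting the checks in `__post_init__` means an invalid policy fails at construction time rather than partway through an event loop.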
…coverage (Phase 3)

Completes Phase 3 of the scenario-core refactor by adding the convenience policy builder and the branching-graph proof-of-concept tests called out in the rubber-duck pass.

linear_strategy_policy(steps):
- Produces a StrategyPolicy[ScenarioStep, int] that walks an ordered list of steps state-by-state, with action i binding steps[i] as current_step, awaiting its process_async, and transitioning to state i+1. State len(steps) is the sole terminal state.
- Captures step / next_state via default-argument binding to dodge the classic late-binding closure bug in for-loops.
- Always clears current_step in a finally so a step raising mid-execution doesn't leave the graph in an inconsistent state; the graph stays at the failed state so the existing retry loop can re-enter.
- This is the policy Phase 5 will use to silently upgrade legacy scenarios that still declare their steps via _get_atomic_attacks_async.

test_linear_strategy_policy.py (6 tests):
- Locks the silent-upgrade contract: order preservation, binding lifecycle, late-binding bug guard, finally-clear on failure, and the empty-input guardrail.

test_strategy_graph_branching.py (4 tests):
- Forces the policy API through a non-trivial branching scenario (BroadSweepThenDeepDive) before Phase 5 commits to it: opening phase emits safe or violation; safe short-circuits to COMPLETE, violation routes through ESCALATION_PHASE first.
- Confirms that history records both branch states, that escalation step metadata survives the round trip, and that graph.reset() correctly replays the branching path.

Full unit suite: 7929 passed, 118 skipped (the one CLI test failure is the pre-existing ODBC driver missing on this host, unrelated to the refactor).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
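The late-binding closure bug mentioned above, and the default-argument fix, can be reproduced in a few lines. The function names below are hypothetical and the actions are plain lambdas rather than the async actions the real policy builder produces; only the binding behavior is being illustrated.

```python
def make_actions_buggy(steps):
    actions = []
    for i, step in enumerate(steps):
        # The lambda looks up `step` and `i` when it is *called*, so after the
        # loop finishes every action sees the values from the last iteration.
        actions.append(lambda: (step, i + 1))
    return actions


def make_actions_fixed(steps):
    actions = []
    for i, step in enumerate(steps):
        # Default arguments are evaluated when the lambda is *defined*, so each
        # action keeps its own step and its own next state.
        actions.append(lambda step=step, next_state=i + 1: (step, next_state))
    return actions
```

With `steps = ["a", "b", "c"]`, every buggy action returns the last step paired with state 3, while the fixed actions return each step paired with its own successor state.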
Lands the additive `step_identifier` column on `AttackResultEntry` so `AttackResult` rows produced through the new `StrategyGraph` orchestrator carry the composite `ScenarioStep` identity built by `pyrit.identifiers.step_identifier.build_step_identifier` (introduced in Phase 0). Old rows stay null - no backfill, no destructive migration.

Per the Phase 4 plan, `atomic_attack_identifier` is NOT renamed and NOT removed. `step_identifier` is purely additive metadata that records *which step inside which scenario* produced the attack result. Direct attack invocations continue to set only `atomic_attack_identifier` and write `step_identifier = null`.

Changes:
* pyrit/identifiers/evaluation_identifier.py - new `StepEvaluationIdentifier` mirroring `AtomicAttackEvaluationIdentifier.CHILD_EVAL_RULES` so nested attack-execution children get filtered identically inside step-level eval grouping. The step's own params (`step_name`, `outcome`, `eval_version`) are fully included - a `STEP_EVAL_VERSION` bump splits two semantically-equivalent step runs.
* pyrit/identifiers/identifier_filters.py - `IdentifierType.STEP`.
* pyrit/identifiers/__init__.py - exports `StepEvaluationIdentifier`.
* pyrit/memory/alembic/versions/a1c2e4f80b3d_add_step_identifier.py - new migration chaining off `7a1b2c3d4e5f` adding a nullable JSON column.
* pyrit/memory/memory_models.py - `AttackResultEntry.step_identifier` JSON column; `__init__` populates `eval_hash` via `StepEvaluationIdentifier` BEFORE the `to_dict` truncation pass so the hash survives DB storage, mirroring the atomic_attack_identifier precedent; `get_attack_result` reconstructs via `ComponentIdentifier.from_dict`.
* pyrit/memory/memory_interface.py - `identifier_column_map` extended so `IdentifierType.STEP` filters route to the new column.
* pyrit/models/attack_result.py - `step_identifier: Optional` field added to the dataclass + `to_dict` / `from_dict`. Old payloads without the key still hydrate cleanly.

Tests (+18 new, all passing; full unit suite 7947 passed, 118 skipped, 1 pre-existing ODBC env failure):
* test_step_evaluation_identifier.py - eval-hash stability, outcome / nested-target / eval_version sensitivity, scorer / operational-param exclusions, rule parity with AtomicAttackEvaluationIdentifier.
* test_memory_models.py - AttackResultEntry round-trip with and without step_identifier, eval_hash preservation through the column.
* test_attack_result.py - to_dict / from_dict round-trip; null behavior.
* test_interface_attack_results.py - SQLite filter by `IdentifierType.STEP` matches step_name and skips legacy rows.
* test_identifier_filters.py - guard test count + value assertion.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Phase 5 of the scenario-core refactor moves Scenario.run_async from a flat for-loop over AtomicAttacks to a StrategyGraph event loop, without changing observable behavior for any existing scenario. Key changes in pyrit/scenario/core/scenario.py: * New `_build_execution_graph(*, steps=None)` factory returns the StrategyGraph that drives the execution attempt. Default implementation wraps the supplied steps (or self._atomic_attacks) via `_build_default_linear_policy`, which preserves AtomicAttack-level concurrency semantics (max_concurrency, return_partial_on_failure) and stamps each step's name into ScenarioStepResult.metadata['step_name'] so the orchestrator can identify yields without depending on graph.current_step. * `_execute_scenario_async` now iterates `self._execution_graph.event_loop_async()` instead of the flat remaining_attacks list. Resume-by-name semantics are preserved: `_get_remaining_atomic_attacks_async` runs first, the graph is built from its output, and already-completed steps simply aren't in the policy. Partial-failure handling, retry, scenario_run_state transitions, error_attack_result_ids persistence, and progress-bar continuity all behave identically. * Each AttackResult flowing out of the graph is stamped with a step_identifier (the Phase 4 column) and that identifier is pushed to the existing AttackResultEntry row via update_attack_result_by_id, mirroring AtomicAttack._enrich_atomic_attack_identifiers. Steps that pre-stamp their own step_identifier (e.g., future adaptive steps) are not overwritten. * New public properties `execution_graph` and `execution_history` expose the active attempt's state machine for inspection and downstream tooling. 
Tests: * New tests/unit/scenario/test_scenario_graph_execution.py (11 tests) pins the new public surface: graph factory contract, execution_graph/execution_history properties, step_identifier stamping (default and pre-stamped), max_concurrency propagation, partial-failure surfacing, and non-AtomicAttack ScenarioStep dispatch through process_async via subclass override. * Full unit suite: 7958 passed, 118 skipped, 1 pre-existing ODBC env failure unrelated to the refactor. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e 6a) Phase 6a brings the in-flight adaptive scenario landing (PR microsoft#1760, hawestra/text_adaptive_scenario) into this branch as a sibling module so Phase 6b can migrate it onto the new StrategyGraph without blocking on upstream merge order. Files vendored verbatim from the PR head (1375974): * pyrit/scenario/scenarios/adaptive/{__init__.py, adaptive_scenario.py, dispatcher.py, selector.py, text_adaptive.py} * tests/unit/scenario/scenarios/adaptive/{test_dispatcher.py, test_selector.py, test_text_adaptive.py} * doc/code/scenarios/3_adaptive_scenarios.{ipynb, py} * doc/myst.yml — added 3_adaptive_scenarios entry Only edit applied locally: * pyrit/scenario/__init__.py — merged the PR's adaptive export with this branch's existing Phase 0-3 scaffold exports (PolicyAction, StrategyGraph, StrategyPolicy, ScenarioStep, ScenarioStepResult, ScenarioCoreState, ScenarioStateLike, linear_strategy_policy). Re-sorted the __all__ block to keep submodule names grouped. Test counts: vendored adaptive suite runs 63 tests green; full unit suite 8021 passed / 118 skipped / 1 pre-existing ODBC env failure (test_main_prints_startup_message, unrelated). Phase 6b will rewrite AdaptiveScenario to drive its event loop through StrategyGraph + a recurring SELECTING state, deprecating AdaptiveDispatchAttack in favor of an AdaptiveStep whose process_async owns one selector tick. The vendored tests become the regression net for that port. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Introduces `AdaptiveStep(ScenarioStep)` as the per-objective execution unit and migrates `AdaptiveScenario` to dispatch through `StrategyGraph`. The new step extracts the per-objective adaptive loop from `AdaptiveDispatchAttack._perform_async` and emits `ScenarioStepResult` with outcome label `'success'` or `'exhausted'` (lifting the static `'done'` outcome). It duck-types the `AtomicAttack`-like attributes (`atomic_attack_name`, `objectives`, `seed_groups`, `display_group`, `filter_seed_groups_by_objectives`) so the orchestrator's resume bookkeeping continues to work without changes. `AdaptiveScenario` now overrides `_build_execution_graph` with a custom linear policy (`_build_adaptive_linear_policy`) that always dispatches via `step.process_async()` — bypassing the base class's `isinstance(_step, AtomicAttack)` branch that would otherwise flatten outcomes to `'done'`. The scenario caches its single `AdaptiveTechniqueSelector` on `self._selector` for external introspection and shares the same reference across every emitted `AdaptiveStep`. `AdaptiveDispatchAttack` is deprecated via `print_deprecation_message` pointing to `AdaptiveStep`; scheduled for removal in 0.17.0. Module docstring updated accordingly. Tests: adds `tests/unit/scenario/scenarios/adaptive/test_adaptive_step.py` (19 tests across init validation, AtomicAttack parity, process loop, identifier shape, adaptive-context labels). Migrates 3 assertions in `test_text_adaptive.py` (selector sharing, seed-technique compat) to introspect `step._techniques`/`step._selector` directly. Suppresses dispatcher deprecation noise via module-level `pytestmark` in `test_dispatcher.py` and adds a dedicated `TestDeprecation` class that explicitly asserts the warning fires. Adaptive package: 83 tests pass (was 64). Full unit suite: 7984 passed (no regressions outside the pre-existing ODBC env failure in test_pyrit_scan.py). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds regression coverage for the Phase 2 ScenarioStep ABC and the AtomicAttack ScenarioStep adapter: - ScenarioStepResult: outcome is required; metadata/attack_results default factories produce fresh per-instance containers (Python mutable-default footgun); accepts all four fields when provided. - ScenarioStep ABC: subclass missing process_async cannot instantiate; subclass that overrides only process_async inherits the default _build_identifier. - AtomicAttack adapter: filter_seed_groups_by_objectives is keyword-only and correctly filters/preserves/empties seed_groups + objectives. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
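The "fresh per-instance containers" property these tests pin down comes from `dataclasses.field(default_factory=...)`. The sketch below uses a hypothetical `StepResultSketch` with only the three field names mentioned in this thread (`outcome`, `attack_results`, `metadata`); the real `ScenarioStepResult` has a fourth field that is not named here and is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class StepResultSketch:
    # Required: constructing without an outcome raises TypeError.
    outcome: str
    # default_factory runs once per instance, so instances never share a
    # container (a bare `= []` default is rejected by dataclasses anyway).
    attack_results: list[Any] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)
```

Note that `frozen=True` only prevents rebinding the attributes; the list and dict themselves remain mutable, so writing to `result.metadata["key"]` still works and affects only that instance.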
Adds tests for the Phase 3 state-machine layer covering gaps in the existing suite: - Performance: counting-mock assertion that an N-step linear graph invokes exactly N policy actions (guards against N**2 retraversal). - State correctness: terminal_states is immune to external mutation of the input set; multi-terminal policies can reach an alternate terminal (FAILED, not just COMPLETE). - Determinism: history ordering is identical across reset + re-run. - Branching dispatch: parametrized 3-way branch confirms transitions are dict-lookup based rather than isinstance chains. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds high-signal tests for the Phase 6b adaptive scenario migration:
* AdaptiveStep is a ScenarioStep subclass (not AtomicAttack), with name aliasing atomic_attack_name for resume bookkeeping.
* _build_identifier output is stable when techniques dict is constructed in reversed key order.
* _build_adaptive_linear_policy + _build_execution_graph build a StrategyPolicy[ScenarioStep, int] with initial_state=0, terminal_states={len(steps)}, and one action per pre-terminal state.
* Event loop visits each step exactly once, terminates, propagates 'success'/'exhausted' outcomes verbatim, and binds/unbinds current_step around each action.
* End-to-end smoke: a real AdaptiveStep plugged into the adaptive linear policy emits 'success' as a real transition label.
Test count: 83 -> 93 (10 new tests).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds tests that fill the gaps Phase 4 left around the additive step_identifier column: - step_identifier: no false dedup across attack-execution child configs, list (not nested dict) shape, execution-order is preserved, child param changes propagate to hash, eval_version is in params. - memory interface: legacy AttackResult rows (NULL step_identifier) round-trip cleanly, and multiple results sharing one step_identifier are retrievable via a single STEP filter. - alembic: a1c2e4f80b3d revision metadata, upgrade adds the column, full upgrade->downgrade round-trip restores the pre-Phase-4 schema. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add 4 tests covering Phase 5 (commit 952311d) gaps in Scenario.run_async: - TestStepIdentifierStampingNoDuplication (2 tests): verify the step_identifier stamping path uses update_attack_result_by_id and never inserts duplicate rows, both for single- and multi-result steps. - TestExecutionGraphRebuildOnRetry (1 test): verify the execution graph is rebuilt from the resume-filtered remaining steps after a partial failure, so terminal_states shrinks on retry. - TestFactoryAtomicAttackGraphIntegration (1 test): end-to-end integration through AttackTechniqueFactory -> AttackTechnique -> AtomicAttack -> StrategyGraph execution path, asserting the factory-built attack is the one the executor receives and that step_identifier is stamped on the resulting AttackResult. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-factory catalog Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nstructor Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ve copy Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
rlundeen2
reviewed
May 21, 2026
    3. **Deep dive phase**: run each provided multi-turn ``AtomicAttack`` ONLY
       against the categories the sweep flagged. Untargeted categories are
       skipped; their names are stamped into ``ScenarioStepResult.metadata``
       for diagnostics.
Contributor
Would it be simpler to keep the current design and instead add a composite scenario? E.g.

    ScenarioPipeline(phases=[
        PhaseSpec(scenario=rapid_response, name="cursory"),
        PhaseSpec(factory=re_probe_successes(...), name="deep_dive"),
        PhaseSpec(factory=adversarial_followup(...), name="amplify"),
    ])

To me that seems simpler than keeping track of state and messing with techniques.
Contributor
Branching and state management add a ton of complexity, which is sometimes needed, but it would be good to have a concrete example of scenarios they can help us unlock or make easier. It'd be worth a design meeting to chat about it!
…typing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…pat stub Two duck-driven follow-ups to e61df6f (R5 ScenarioPipeline): - Class docstring rewritten to be explicit that per-phase outcomes live only on in-memory self._phase_executions in v1, not in the persisted outer ScenarioResult. The previous wording implied cross-process readers could inspect per-phase outcomes; they cannot until R5.1 wires phase_executions into metadata. - _ScenarioPipelinePhaseStep.set_scenario_result_id added as a no-op stub. Today the base orchestrator's isinstance(_step, AtomicAttack) guard makes this unreachable, but R1 plans to collapse that guard and dispatch uniformly via process_async. Any non-AtomicAttack ScenarioStep needs this method or R1 will break with AttributeError. Regression test pins the contract. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ure modes) Closes the R5.1 rubber-duck follow-ups on top of R5 (ScenarioPipeline): - Add Scenario._finalize_scenario_result_async base hook (no-op default) called once between the last successful step and the COMPLETED state transition, giving composition subclasses a place to write run-summary state into ScenarioResult.metadata. - Override the hook on ScenarioPipeline to persist per-phase outcomes as metadata['phase_executions'] (a list of name/outcome/inner_scenario_result_id dicts), so cross-process readers can reload the pipeline result and walk phases without holding a live pipeline instance. Class docstring updated to reflect the new persistence contract. - Invert metadata merge order in _build_phase_action: pipeline-stamped diagnostic keys (step_name, phase_index) now win over inner-step result metadata. Regression test pins the inversion against a NoisyStep that emits colliding keys. - Document PipelineContext immutability nuance: structurally frozen at the dataclass level, but inner ScenarioResult payloads are not deep- immutable and should be treated as read-only by convention. - Sharpen input_schema docstring on the kept-but-broken 'phases' role: explicit guidance that the OPAQUE tag is an authoring-refusal signal for the wizard until pipelines can round-trip. - Add TestPipelineFailureModes covering Duck microsoft#1's M1 gaps: inner initialize_async / run_async exceptions, predicate exceptions, and partial-progress phase_executions on mid-flight failure. 50/50 composite tests pass (was 41 at R5 ship); 1066/1066 scenario tests pass overall. Pre-commit clean (ruff format/check, ty). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Unify scenario step dispatch on `ScenarioStep.process_async` so the base linear policy handles AtomicAttack, AdaptiveStep, and any future ScenarioStep subclass through one code path. Adds a setter on `AtomicAttack` so the base policy can push the scenario-level `max_concurrency` into atomic steps without the orchestrator special-casing step types. Introduces `LinearScenario` as the L0 authoring tier so users can construct a scenario from a list of pre-built steps without subclassing. - `AtomicAttack.set_scenario_max_concurrency` + `_scenario_max_concurrency` instance state, with `process_async` honoring the bound value when delegating to `run_async`. - `Scenario._build_default_linear_policy` now pushes max_concurrency into every `AtomicAttack` step before the action loop and always dispatches via `process_async` (removes the isinstance branch that forced AdaptiveStep authors into L2). - `AdaptiveScenario._build_execution_graph` and `_build_adaptive_linear_policy` (~80 LOC) deleted; the base linear policy now drives adaptive correctly because outcomes propagate verbatim from `AdaptiveStep.process_async`. - `LinearScenario(steps=[...], objective_scorer=...)` returns a runnable scenario with zero subclassing — the L0 entry point sketched in the R1 plan response to rlundeen's PR microsoft#1767 review. Test fixture pattern: `MagicMock(spec=AtomicAttack)` AsyncMock fallback for `process_async` returns coroutines that fail metadata unpacking. Five test fixtures updated to wire `process_async` to delegate to `run_async` so existing `run_async.assert_called_with` assertions continue to work through the new dispatch chain. New tests cover the setter validation, process_async max_concurrency forwarding, and end-to-end LinearScenario execution. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Renames the step-builder hook on Scenario from _get_atomic_attacks_async to _get_steps_async to honestly reflect that subclasses may return any ScenarioStep (AtomicAttack, AdaptiveStep, _ScenarioPipelinePhaseStep, etc.), not just AtomicAttacks. The legacy name keeps working as a passthrough through 0.16.0. Base class now exposes _get_steps_async as the real factory (cross-product over selected techniques and datasets). _get_atomic_attacks_async stays as a thin delegate. __init_subclass__ detects subclasses that still override only the legacy name and emits a DeprecationWarning once at class-creation time so authors see the rename horizon before their next run_async() call. Internal callsite in initialize_async now invokes _get_steps_async; the existing baseline-injection rescue path is unchanged. Migrates all 8 first-party Scenario subclasses (adaptive, adversarial, red_team_agent, encoding, jailbreak, psychosocial, scam, sweep_then_deep_dive), LinearScenario, and ScenarioPipeline to the new name. Test fixtures across the scenario suite are migrated except for the two that intentionally exercise the legacy rescue path (test_baseline_deprecation, test_scenario._LegacyOverrideScenario). Walkthroughs in doc/ and .github/instructions/scenarios.instructions.md updated with the rename plus a deprecation pointer. Adds tests/unit/scenario/test_get_steps_async_rename.py pinning: legacy-override-only emits the warning, new-override-only stays quiet, both-overrides stays quiet, neither-override stays quiet, legacy override reached via _get_steps_async delegation, and new override reached via _get_atomic_attacks_async passthrough. Per rlundeen review on microsoft#1767: surfaces R2 from the R-series rollout (R1 collapsed the adaptive override into the base linear policy; R3 will split scenario/step state). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
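Detecting a legacy-only override at class-creation time can be done in `__init_subclass__` by inspecting the subclass's own `__dict__`. The sketch below is an assumption about how such a check could be written, not the PR's code: `BaseScenarioSketch` is hypothetical, only the two hook names come from the commit message, and the delegation path from `_get_steps_async` back to a legacy override is omitted.

```python
import warnings


class BaseScenarioSketch:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # cls.__dict__ holds only names defined directly on this subclass,
        # so inherited implementations do not count as overrides.
        overrides_legacy = "_get_atomic_attacks_async" in cls.__dict__
        overrides_new = "_get_steps_async" in cls.__dict__
        if overrides_legacy and not overrides_new:
            warnings.warn(
                f"{cls.__name__} overrides _get_atomic_attacks_async; "
                "rename the override to _get_steps_async.",
                DeprecationWarning,
                stacklevel=2,
            )

    async def _get_steps_async(self):
        # Stand-in for the real step factory.
        return []

    async def _get_atomic_attacks_async(self):
        # Legacy name kept as a thin passthrough to the new hook.
        return await self._get_steps_async()
```

Because the check runs when the class body is evaluated, the warning fires once per subclass definition rather than on every `run_async()` call.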
Renames the singular current_step abstraction to active_steps (tuple) to prepare for R4 concurrent dispatch. Adds active_steps property + bind_active_steps mutator on StrategyGraph; keeps current_step + bind_current_step as backward-compat shims. current_step emits DeprecationWarning only when ambiguous (len(active_steps) > 1). Migrates all four first-party callsites to bind_active_steps: linear_strategy_policy, Scenario._build_default_linear_policy, ScenarioPipeline._build_phase_action, and BroadSweepThenDeepDive sweep+deep actions. Adds tests/unit/scenario/test_active_steps_split.py (10 tests) covering default state, sequential binding, shim semantics, concurrent binding warning, and reset behavior. Per rlundeen review on microsoft#1767: surfaces R3 from the R-series rollout (R2 renamed the step builder; R4 wires concurrent dispatch on top of this split). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
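A backward-compat shim of the kind described here could be shaped like the sketch below. `GraphSketch` is hypothetical; the property and mutator names come from the commit message, but returning the first active step when the binding is ambiguous is an assumption, since the commit does not say what `current_step` returns in that case.

```python
import warnings


class GraphSketch:
    def __init__(self) -> None:
        self._active_steps: tuple = ()

    @property
    def active_steps(self) -> tuple:
        return self._active_steps

    def bind_active_steps(self, *, steps) -> None:
        # Store an immutable snapshot of whatever iterable was passed in.
        self._active_steps = tuple(steps)

    @property
    def current_step(self):
        # Shim: only warn when the singular view is ambiguous.
        if len(self._active_steps) > 1:
            warnings.warn(
                "current_step is ambiguous while several steps are active; use active_steps.",
                DeprecationWarning,
                stacklevel=2,
            )
        # Assumption: fall back to the first active step, or None when idle.
        return self._active_steps[0] if self._active_steps else None

    def bind_current_step(self, *, step) -> None:
        # Shim: a single step (or None to clear) maps onto the tuple form.
        self.bind_active_steps(steps=() if step is None else (step,))
```

Sequential callers that bind one step at a time see no warning, so existing code keeps working until it actually runs steps concurrently.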
Adds max_step_concurrency (int, default 1) to BroadSweepThenDeepDive and FilteredDeepDiveStep. Default 1 preserves pre-R4 sequential semantics bit-for-bit. >1 wraps the per-atomic dispatch in an asyncio.Semaphore and awaits via asyncio.gather; dispatched_categories and attack_results retain input order because gather preserves it. Validates inputs at both layers (>= 1) so wizard / programmatic callers fail fast on bogus values. Stamps the effective concurrency cap into ScenarioStepResult.metadata['max_step_concurrency'] for downstream diagnostics. Surfaces the new scalar role through input_schema() so the wizard can elicit it (4 roles -> 5: 3 OPAQUE + 2 SCALAR). Adds tests/unit/scenario/scenarios/airt/test_concurrent_deep_dive.py (13 tests) covering: validation, order preservation, empty short-circuit, peak in-flight observation via asyncio.Event gating, and semaphore upper-bound enforcement. Updates test_sweep_then_deep_dive_input_schema.py for the 5-role schema. Per rlundeen review on microsoft#1767: R4 is the concrete concurrent-dispatch payload made possible by R3's active_steps split. Per-atomic active_steps publication (graph.mark_step_running per the plan's example) is a follow-up that requires the StepStatus sidecar abstraction that R3 explicitly deferred. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
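The Semaphore-plus-gather pattern described above can be sketched generically as follows. `gather_bounded` and its `worker` parameter are hypothetical names; only `max_step_concurrency` and the `>= 1` validation come from the commit message. The sketch relies on the documented behavior that `asyncio.gather` returns results in the order the awaitables were passed, regardless of completion order.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def gather_bounded(
    items: Iterable[T],
    worker: Callable[[T], Awaitable[R]],
    *,
    max_step_concurrency: int = 1,
) -> list[R]:
    """Run worker over items with at most max_step_concurrency in flight."""
    if max_step_concurrency < 1:
        raise ValueError("max_step_concurrency must be >= 1")
    semaphore = asyncio.Semaphore(max_step_concurrency)

    async def guarded(item: T) -> R:
        # Only max_step_concurrency coroutines can hold the semaphore at once.
        async with semaphore:
            return await worker(item)

    # gather preserves input order in its result list.
    return list(await asyncio.gather(*(guarded(item) for item in items)))
```

With the default of 1, the semaphore serializes the workers, which matches the sequential behavior the commit says is preserved; larger values only raise the cap on simultaneous workers and leave result ordering unchanged.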
Description
Note: this is a huge PR and it's a proposal. It shouldn't be merged in as one PR due to its size and scope, but it's here as a point of reference for ongoing changes in other PRs.
Re-architects `pyrit/scenario/core/` around an explicit state-machine layer so scenarios can express non-linear control flow (branching, escalation, retries-by-state) instead of the previous flat `for atomic in atomic_attacks` loop. Landed additively across 10 phases so each commit stayed independently green; no destructive renames and no schema-breaking migrations.

The motivation comes from a review comment on #1622 (rapid response scenario) and discussion around #1654 (cyber technique registry) and #1760 (text adaptive scenario). The flat-loop pattern works for content-harms-style scenarios that are semantically heterogeneous but operationally identical, but it can't cleanly express things like "sweep, then deep-dive only on weak categories" or "select the next technique adaptively based on the last response." This PR introduces the abstraction without changing observable behavior for any existing scenario.
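As a rough illustration of the "sweep, then deep-dive only on weak categories" control flow, the sketch below drives a tiny state machine from a policy dict of async actions. It is not the PR's `StrategyGraph` API: `SketchState`, `POLICY`, `event_loop`, the context dict, and the 0.5 threshold are all hypothetical, chosen only to show how a state-keyed policy replaces a flat loop.

```python
import asyncio
from enum import Enum


class SketchState(str, Enum):
    SWEEPING = "sweeping"
    DEEP_DIVING = "deep_diving"
    COMPLETE = "complete"
    ALL_SAFE = "all_safe"


TERMINAL_STATES = frozenset({SketchState.COMPLETE, SketchState.ALL_SAFE})


async def sweep(context: dict) -> tuple[str, SketchState]:
    # Flag every category whose (made-up) score crosses the threshold.
    context["flagged"] = [name for name, score in context["scores"].items() if score >= 0.5]
    if context["flagged"]:
        return "violation", SketchState.DEEP_DIVING
    return "safe", SketchState.ALL_SAFE


async def deep_dive(context: dict) -> tuple[str, SketchState]:
    # Only the flagged categories are probed further.
    context["probed"] = list(context["flagged"])
    return "done", SketchState.COMPLETE


# The policy maps each non-terminal state to the action that runs in it.
POLICY = {SketchState.SWEEPING: sweep, SketchState.DEEP_DIVING: deep_dive}


async def event_loop(context: dict) -> tuple[SketchState, list[tuple[SketchState, str]]]:
    state = SketchState.SWEEPING
    history: list[tuple[SketchState, str]] = []
    while state not in TERMINAL_STATES:
        outcome, next_state = await POLICY[state](context)
        history.append((state, outcome))
        state = next_state
    return state, history
```

Calling `asyncio.run(event_loop({"scores": {...}}))` returns the terminal state together with the visited-state history, so a caller can distinguish a run that stopped because nothing was flagged from one that completed a deep dive.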
What landed
- `ScenarioStep` ABC (`pyrit/scenario/core/scenario_step.py`): the unit of work the new graph dispatches. Every scenario step is one. `AtomicAttackScenarioStep` is the back-compat adapter that wraps an `AtomicAttack` and exposes the same surface so the default linear policy continues to work for everyone.
- `StrategyGraph` + `StrategyPolicy` + `PolicyAction` (`pyrit/scenario/core/strategy_graph.py`): a generic state machine over `(StepT, StateT)`. `StrategyGraph.event_loop_async` walks the graph, dispatches the policy action for the current state, and accumulates `ScenarioStepResult` history. `linear_strategy_policy(steps)` is the default and is used by every legacy scenario unchanged.
- `ScenarioCoreState` (`pyrit/scenario/core/scenario_state.py`): the scenario-level state enum (`UNINITIALIZED`, `INITIALIZED`, `RUNNING`, `COMPLETED`, `FAILED`) consumed by `Scenario.run_async` for lifecycle telemetry.
- `OutcomeScorer` (`pyrit/score/decorators/outcome_scorer.py`): composition decorator over a `Scorer` that maps its output to a transition label via an `outcome_map: dict[label, predicate]`. The `"unscored"` sentinel is always declared so step validators can pin missing transitions early.
- `step_identifier` persistence (`pyrit/identifiers/step_identifier.py` + an alembic migration adding the column to `AttackResultEntries`): additive nullable column. Legacy rows continue to load. Step identity nests inner attack identifiers via `children={"key": [list_of_identifiers]}`.
- Adaptive scenario (`pyrit/scenario/scenarios/adaptive/*`): vendored from FEAT text adaptive scenario #1760 (commit `99fa9dce`), then migrated to a `ScenarioStep`-based `AdaptiveStep` driven by a linear `StrategyPolicy[ScenarioStep, int]` (commit `a151bedb`).
- `BroadSweepThenDeepDive` (`pyrit/scenario/scenarios/airt/sweep_then_deep_dive.py`): first real branching scenario. The sweep step classifies each response via `OutcomeScorer`; the policy emits `DEEP_DIVING` only when a category was flagged. Two terminal states (`COMPLETE` vs `ALL_SAFE`) let downstream consumers tell why the scenario stopped. Validates the abstraction end-to-end.

Why this is marked `[BREAKING]`

No public API was removed, but the orchestration contract that scenario authors rely on changed shape:
- `Scenario.run_async` now drives steps through `StrategyGraph.event_loop_async` instead of an inline `for` loop. Scenario subclasses that previously overrode `run_async` directly (rather than just `_get_atomic_attacks_async`) need to either move to the new `_build_execution_graph` hook or accept the default linear policy.
- `AdaptiveDispatchAttack` is deprecated in favor of `AdaptiveStep` (removal targeted for 0.17.0). Downstream code instantiating `AdaptiveDispatchAttack` directly will need to migrate.
- `step_identifier` column on `AttackResultEntries` requires an alembic upgrade on existing memory databases. The migration is additive (column defaults to `NULL` for legacy rows) and includes a downgrade.

Everything else is unchanged: `AtomicAttack`, `AttackTechniqueSpec`, `AttackTechniqueRegistry`, `AttackTechniqueFactory`, `ScenarioStrategy` (the technique enum), and `atomic_attack_identifier`.

What's deferred
- `ScenarioWizard` CLI: Phase 8 in the plan; queued for a follow-up PR.
- Per-scenario ports: every existing scenario (`rapid_response`, `encoding`, `jailbreak`, `cyber`, `scam`, `psychosocial`, `leakage`, `red_team_agent`, `adversarial_benchmark`, `fairness_bias`) continues to drive its `AtomicAttack`s through the default linear policy via the back-compat adapter. Each port is its own small PR.
- `_build_execution_graph` signature: the base method is typed as `StrategyGraph[ScenarioStep, int]`. `BroadSweepThenDeepDive` widens `StateT` to a per-scenario enum and uses a single targeted `# ty: ignore[invalid-method-override]`. Making the base method generic over `StateT` is tracked as a Phase 9 cleanup.

Related PRs and discussions
Tests and Documentation
Test coverage is the centerpiece of this PR. Each phase landed with its tests, then a targeted Phase 10 sweep added missing coverage for the six concerns the user called out (performance, scenario state, resumability, attack-id dedup, `AtomicAttack` → `ScenarioStep` migration safety, `AttackTechniqueSpec` + `ScenarioStrategy` integration).

Full suite
- `uv run python -m pytest tests/unit -n 4 --dist=loadfile` (baseline `main` was 7912 passed).
- `tests/integration/scenarios/test_notebooks_scenarios.py` collects 4 notebooks cleanly, including `3_adaptive_scenarios.ipynb`.

New test files
- `tests/unit/scenario/test_scenario_step.py`: `ScenarioStep` ABC contract.
- `tests/unit/scenario/test_atomic_attack_scenario_step.py`: back-compat adapter + duck-typed attrs.
- `tests/unit/scenario/test_strategy_graph.py` and `test_strategy_graph_branching.py`: policy construction, traversal, and multi-way dispatch.
- `tests/unit/scenario/test_linear_strategy_policy.py`: the default linear policy builder.
- `tests/unit/scenario/test_scenario_state.py`: `ScenarioCoreState` lifecycle.
- `tests/unit/scenario/test_scenario_graph_execution.py`: `Scenario.run_async` graph rewire.
- `tests/unit/scenario/scenarios/adaptive/test_*.py`: vendored + migrated adaptive scenario.
- `tests/unit/scenario/scenarios/airt/test_sweep_then_deep_dive.py`: branching scenario end-to-end (25 tests).
- `tests/unit/identifiers/test_step_identifier.py` + `test_step_evaluation_identifier.py`: additive persistence.
- `tests/unit/score/decorators/test_outcome_scorer.py`: `OutcomeScorer` decorator.

Phase 10 audit augmentations
- `UNSCORED` sentinel, declared-outcomes contract, wrapped-scorer exception propagation, defensive `outcome_map` copy.
- `ScenarioStepResult` defaults freshness (mutable-default footgun), keyword-only `remaining_objectives`, ABC instantiation failure paths.
- `frozenset` external-mutation immunity, deterministic history order, 3-way branch dispatch.
- `NULL` rows, alembic upgrade ↔ downgrade round-trip.
- `step_identifier` stamping without duplication (single + multi result), factory-produced technique end-to-end through `StrategyGraph`.
- `AdaptiveStep` is a `ScenarioStep`, not an `AtomicAttack`; identifier stability under reversed-input technique order; `bind_current_step` invariants around each action.

Migration concerns surfaced during audit (non-blocking)
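
The caller-order sensitivity noted in this section can be illustrated with a minimal sketch. The hashing scheme here (SHA-256 over a JSON payload) and the `step_hash` helper with its `sort_children` flag are assumptions made for illustration only; the actual `step_identifier` hashing in PyRIT may differ. The point is only that nesting a list in caller order makes the hash depend on that order, and that sorting before hashing would remove the dependence.

```python
import hashlib
import json


def step_hash(attack_ids: list[str], *, sort_children: bool = False) -> str:
    """Hash a step from its nested attack identifiers (hypothetical scheme)."""
    # Caller order is preserved unless sorting is explicitly requested.
    children = sorted(attack_ids) if sort_children else list(attack_ids)
    payload = json.dumps(
        {"children": {"attack_executions": children}},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


ids = ["attack-a", "attack-b", "attack-c"]
reversed_ids = list(reversed(ids))

# Preserving caller order: the same set of attacks hashes differently.
order_sensitive = step_hash(ids) != step_hash(reversed_ids)

# Sorting first: the hash no longer depends on input order.
order_stable = (
    step_hash(ids, sort_children=True)
    == step_hash(reversed_ids, sort_children=True)
)
```

Under these assumptions, switching to sorted nesting is a one-line change, but it also changes every previously computed hash, which is one reason to pin the current behavior with a regression test before deciding.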
- `pyrit/scenario/core/scenario.py:1174`: the default linear policy dispatches via `isinstance(_step, AtomicAttack)` to preserve `max_concurrency` plumbing that the bare `ScenarioStep.process_async` doesn't accept. Anywhere a non-`AtomicAttack` `ScenarioStep` is plugged into the default linear policy, `max_concurrency` is silently bypassed. This is correct for today's only consumer (`AtomicAttack`) and is flagged for cleanup when more `ScenarioStep` subclasses appear (Phase 9).
- `step_identifier.attack_executions` preserves caller order, so reversed-input order produces a different hash. 10d locked in the actual behavior with a regression test rather than silently switching to sort-for-stability. If sorted nesting is wanted long-term, the implementation change is small and isolated.

JupyText
No new notebook content in this PR: the adaptive scenarios notebook (`doc/code/scenarios/3_adaptive_scenarios.py` / `.ipynb`) was vendored as-is from #1760 in commit `99fa9dce`, and notebook collection is unchanged. The existing JupyText workflow applies: `jupytext --sync doc/code/scenarios/*.py`.

Local test command
(There is one pre-existing failure in `tests/unit/cli/test_pyrit_scan.py::TestMain::test_main_prints_startup_message` from an ODBC connection attempt to `airtdev.database.windows.net` that is unrelated to this refactor.)