FIX: Set default refusal scorer behavior for scenarios and initializers to STRICT, add support for underlying model name in initializers by fdubut · Pull Request #1537 · microsoft/PyRIT

fdubut · 2026-03-25T04:28:05Z

Description

- Refusal scorer variants: Replace the single refusal scorer with 4 variants (OBJECTIVE_BLOCK_SAFE, OBJECTIVE_ALLOW_SAFE, NO_OBJECTIVE_BLOCK_SAFE, NO_OBJECTIVE_ALLOW_SAFE) and dynamically select the best one by F1 from evaluation metrics
- Auto-detected scorer dependencies: Add find_dependents_of_tag() to BaseInstanceRegistry. Wrapper/composite scorers that use a refusal scorer are discovered automatically via eval_hash matching, so no explicit depends_on is needed
- Phased scorer initialization: Refactor ScorerInitializer.initialize_async into 5 phases so that dependent scorers (e.g., TrueFalseInverterScorer) use the best refusal variant
- Tag-based evaluation filtering: Add a --tags CLI arg to evaluate_scorers.py to run evals for specific scorer groups (e.g., --tags refusal)
- Underlying model support in initializers: Add OPENAI_CHAT_UNDERLYING_MODEL and AZURE_OPENAI_GPT4O_UNSAFE_CHAT_UNDERLYING_MODEL{,2} env vars, falling back to the model name when they are not set (credit: @fdubut)
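As a rough illustration of "select the best one by F1", here is a minimal sketch assuming each variant's evaluation metrics expose precision and recall. `ScorerMetrics` and `select_best_by_f1` are hypothetical names, not PyRIT's API (the PR's own logic lives in `_register_best_refusal_f1()`), and falling back to OBJECTIVE_ALLOW_SAFE when no metrics exist is an assumption borrowed from the default mentioned in the commits.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ScorerMetrics:
    """Hypothetical per-variant metrics record (not PyRIT's real metrics class)."""

    name: str
    precision: float
    recall: float

    @property
    def f1(self) -> float:
        # F1 is the harmonic mean of precision and recall; treat 0/0 as 0.
        total = self.precision + self.recall
        return 0.0 if total == 0 else 2 * self.precision * self.recall / total


def select_best_by_f1(metrics: Sequence[ScorerMetrics], default: str = "OBJECTIVE_ALLOW_SAFE") -> str:
    """Return the name of the highest-F1 variant, or `default` when no metrics are available."""
    if not metrics:
        return default
    # max() keeps the first of several equal maxima, so ties resolve by input order.
    return max(metrics, key=lambda m: m.f1).name
```

F1 is a reasonable selection criterion here because it penalizes a variant that scores well on only one of precision or recall.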

Co-authored with Rich; merged into @fdubut's branch so that their changes are also included.
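The underlying-model bullet in the description amounts to a fallback lookup: use the `*_UNDERLYING_MODEL` variable when it is set, otherwise the model name. A minimal sketch follows; the helper name and the `OPENAI_CHAT_MODEL` fallback variable are assumptions for illustration, not taken from PyRIT's initializer code.

```python
import os
from typing import Optional


def resolve_underlying_model(underlying_var: str, model_var: str) -> Optional[str]:
    """Hypothetical helper: prefer the underlying-model env var, else fall back to the model-name var."""
    underlying = os.environ.get(underlying_var)
    if underlying:  # an unset or empty variable both trigger the fallback
        return underlying
    return os.environ.get(model_var)


# OPENAI_CHAT_MODEL is an assumed name for the model/deployment-name variable.
model = resolve_underlying_model("OPENAI_CHAT_UNDERLYING_MODEL", "OPENAI_CHAT_MODEL")
```

One plausible reason for keeping the two values separate is that the configured model name can be, for example, an Azure deployment name that differs from the model actually behind it.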

rlundeen2 · 2026-03-26T04:24:22Z

I have another related PR; let's hold off on merging this one for a sec. I'll merge mine into this branch and consolidate.

…tion
- Add _collect_child_eval_hashes() to ComponentIdentifier for recursive child eval_hash collection
- Add find_dependents_of_tag() to BaseInstanceRegistry for auto-detecting wrapper/composite scorer dependencies via eval_hash matching
- Add 4 refusal scorer variants with REFUSAL tag in ScorerInitializer
- Add _register_best_refusal_f1() to tag the best refusal scorer by F1 from existing metrics (parallels _register_best_objective_f1)
- Refactor initialize_async into 5 phases: base refusal, best refusal selection, dependent scorers, other scorers, best objective selection
- Add --tags CLI filtering to evaluate_scorers.py via argparse
- Add comprehensive unit tests for all new functionality

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
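The dependency detection described in this commit can be modeled in a few lines: a wrapper depends on a tagged scorer if the tagged scorer's eval_hash appears anywhere among the wrapper's descendants. This is a toy sketch; `Identifier`, `collect_descendant_hashes`, and `dependents_of_tag` are hypothetical stand-ins, and the real `_collect_child_eval_hashes()` and `find_dependents_of_tag()` on ComponentIdentifier and BaseInstanceRegistry may have different signatures and data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Identifier:
    """Toy stand-in for ComponentIdentifier: an eval_hash plus wrapped child components."""

    eval_hash: str
    children: List["Identifier"] = field(default_factory=list)


def collect_descendant_hashes(ident: Identifier) -> Set[str]:
    """Recursively gather the eval_hashes of every descendant (excluding `ident` itself)."""
    hashes: Set[str] = set()
    for child in ident.children:
        hashes.add(child.eval_hash)
        hashes |= collect_descendant_hashes(child)
    return hashes


def dependents_of_tag(registry: Dict[str, Identifier], tags: Dict[str, Set[str]], tag: str) -> List[str]:
    """Names of registered scorers that wrap, at any depth, a scorer carrying `tag`."""
    tagged = {ident.eval_hash for name, ident in registry.items() if tag in tags.get(name, set())}
    return [name for name, ident in registry.items() if collect_descendant_hashes(ident) & tagged]
```

In this model, an inverter wrapping a refusal scorer and a composite that contains that inverter are both found, which is why the phased initialization registers the refusal scorers (and picks the best one) before building their dependents.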

STRICT was removed in favor of the named variants (OBJECTIVE_BLOCK_SAFE, OBJECTIVE_ALLOW_SAFE, etc.). Use the default OBJECTIVE_ALLOW_SAFE path, which aligns with the dynamic best-refusal selection system. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
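To make the four names concrete, here is a hypothetical enum that decomposes each variant into two flags read directly off its name: whether an objective is involved (OBJECTIVE_ vs NO_OBJECTIVE_) and whether safe responses are blocked or allowed (_BLOCK_SAFE vs _ALLOW_SAFE). How PyRIT's refusal scorer interprets those axes is not spelled out here, so the property names are assumptions; the sketch also shows that the removed STRICT name no longer resolves.

```python
from enum import Enum


class RefusalVariant(Enum):
    """Hypothetical enum mirroring the four refusal variant names in this PR."""

    OBJECTIVE_BLOCK_SAFE = "objective_block_safe"
    OBJECTIVE_ALLOW_SAFE = "objective_allow_safe"
    NO_OBJECTIVE_BLOCK_SAFE = "no_objective_block_safe"
    NO_OBJECTIVE_ALLOW_SAFE = "no_objective_allow_safe"

    @property
    def uses_objective(self) -> bool:
        # Derived purely from the member name, not from PyRIT behavior.
        return not self.name.startswith("NO_OBJECTIVE")

    @property
    def blocks_safe(self) -> bool:
        # Also name-derived: True for the *_BLOCK_SAFE variants.
        return self.name.endswith("_BLOCK_SAFE")


# Default named in the commit message above.
DEFAULT_REFUSAL_VARIANT = RefusalVariant.OBJECTIVE_ALLOW_SAFE


def parse_variant(name: str) -> RefusalVariant:
    """Look up a variant by member name; the removed STRICT name raises KeyError."""
    return RefusalVariant[name]
```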

rlundeen2 · 2026-03-26T23:10:55Z

pyrit/score/scorer_evaluation/scorer_evaluator.py

+        # Use harm_definition from CSV headers if available (e.g., "fairness_bias.yaml"),
+        # otherwise fall back to deriving from harm_category (e.g., "bias" -> "bias.yaml").
+        # The CSV header is authoritative since the harm_category name may differ from
+        # the YAML filename (e.g., harm_category="bias" but file is "fairness_bias.yaml").


This was moved; need to check whether this fix still matters.
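The precedence in the quoted comment (CSV header first, category-derived filename second) can be sketched as follows, reusing the comment's own fairness_bias.yaml example. The function name and the assumption that headers arrive as a key/value mapping are illustrative only, not PyRIT's scorer_evaluator.py code; whether this logic still applies after the move is the open question in the review.

```python
from typing import Mapping


def resolve_harm_definition(csv_headers: Mapping[str, str], harm_category: str) -> str:
    """Pick the harm-definition YAML filename for a dataset.

    Hypothetical sketch: `csv_headers` is assumed to be a mapping of header
    keys to values parsed from the evaluation CSV.
    """
    # The CSV header is authoritative: the category name may not match the file name.
    from_header = csv_headers.get("harm_definition")
    if from_header:
        return from_header
    # Fallback: derive the filename from the category, e.g. "bias" -> "bias.yaml".
    return f"{harm_category}.yaml"
```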

Document evaluate_scorers.py usage with --tags filtering and the recommended two-step workflow: evaluate refusal scorers first, then re-run all scorers so dependents use the best refusal variant. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
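Since the tag filtering is added "via argparse", a minimal sketch of such a flag plus the selection step looks like this. Accepting multiple space-separated tags (`nargs="+"`) and matching a scorer if it carries any requested tag are assumptions; `parse_args` and `select_scorers` are hypothetical names, not the actual contents of evaluate_scorers.py.

```python
import argparse
from typing import Dict, List, Optional, Sequence, Set


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Hypothetical CLI for a scorer-evaluation script with optional tag filtering."""
    parser = argparse.ArgumentParser(description="Evaluate registered scorers.")
    parser.add_argument(
        "--tags",
        nargs="+",
        default=None,
        help="only evaluate scorers carrying at least one of these tags (e.g. --tags refusal)",
    )
    return parser.parse_args(argv)


def select_scorers(scorer_tags: Dict[str, Set[str]], tags: Optional[Sequence[str]]) -> List[str]:
    """Return scorer names to evaluate; no tags means evaluate everything."""
    if not tags:
        return list(scorer_tags)
    wanted = set(tags)
    return [name for name, scorer_tag_set in scorer_tags.items() if scorer_tag_set & wanted]
```

Under these assumptions, the two-step workflow is: run the script with `--tags refusal` so refusal metrics exist, then run it again without `--tags` so dependent scorers are evaluated with the best refusal variant in place.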

rlundeen2 and others added 14 commits February 24, 2026 17:06

moving refusal yaml

60618f6

adding more human csv sets

5eaa4d7

adding all refusal scorers to the scripts

d573336

saving work

adfb2a7

Switch Jailbreak scenario default to STRICT scorer

9b5fcac

merging main

af20b15

adding gpt5.4

90e9474

updating

bf58e84

Adding refusals from gpt_4_5

affc1a5

updates but need to re-eval

1029647

Merge branch 'main' into users/rlundeen/2026_02_26_refusal_scorer_update

fb95954

Switch AIRT default refusal scorer to STRICT, add support for underlying model in initializers

9870049

Merge branch 'main' of https://github.com/Azure/PyRIT into minor_fixes

56e730b

pre-commit

52f0e0b

rlundeen2 self-assigned this Mar 26, 2026

rlundeen2 and others added 6 commits March 26, 2026 11:24

Merge branch 'main' into users/rlundeen/2026_02_26_refusal_scorer_update

72bcf36

fix: resolve merge artifacts from main merge

530c310

- Remove duplicate seed_type in harms.prompt (both sides added it independently) - Update stale REFUSAL_GPT4O docstring reference to REFUSAL_GPT4O_OBJECTIVE_ALLOW_SAFE Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Merge branch 'users/rlundeen/2026_02_26_refusal_scorer_update' into pr-1537

335badf

chore: remove jailbreak.ipynb and labeled entries json

62e196c

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

rlundeen2 reviewed Mar 26, 2026

rlundeen2 and others added 4 commits March 26, 2026 16:12

revert: restore metrics JSONL files to main versions

fefdd2e

Metrics should be regenerated with evaluate_scorers.py after the new refusal scorer variants are finalized. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: resolve mypy strict errors in _try_register tags param

941d09a

Change tags parameter type from list[str] to Sequence[str] to accept list[ScorerInitializerTags] (list is invariant, Sequence is covariant). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Merge branch 'users/rlundeen/2026_02_26_refusal_scorer_update' into pr-1537

da9f4ab
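The mypy fix in commit 941d09a comes down to variance. In the small sketch below, `ScorerTag` is a hypothetical stand-in for ScorerInitializerTags, assumed to be a str-valued enum. Both functions run fine at runtime; the difference only shows up under a type checker.

```python
from enum import Enum
from typing import List, Sequence


class ScorerTag(str, Enum):
    """Hypothetical stand-in for ScorerInitializerTags (a str-valued enum)."""

    REFUSAL = "refusal"
    DEFAULT = "default"


def register_with_list(tags: List[str]) -> List[str]:
    # mypy rejects a List[ScorerTag] argument here: list is invariant, because
    # the callee could append a plain str to the caller's List[ScorerTag].
    return list(tags)


def register_with_sequence(tags: Sequence[str]) -> List[str]:
    # Sequence is read-only and covariant, so a List[ScorerTag] argument is
    # accepted: ScorerTag is a subclass of str.
    return list(tags)


enum_tags: List[ScorerTag] = [ScorerTag.REFUSAL]
register_with_sequence(enum_tags)  # accepted by mypy
# register_with_list(enum_tags)    # mypy reports an incompatible argument type
```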

fdubut wants to merge 24 commits into microsoft:main from fdubut:minor_fixes

fdubut commented Mar 25, 2026 • edited by rlundeen2

rlundeen2 commented Mar 26, 2026

2 participants
