FIX: Set default refusal scorer behavior for scenarios and initializers to STRICT, add support for underlying model name in initializers #1537

Open
fdubut wants to merge 24 commits into microsoft:main from fdubut:minor_fixes

Conversation

@fdubut
Contributor

@fdubut fdubut commented Mar 25, 2026

Description

  • Refusal scorer variants: Replace single refusal scorer with 4 variants (OBJECTIVE_BLOCK_SAFE, OBJECTIVE_ALLOW_SAFE,
    NO_OBJECTIVE_BLOCK_SAFE, NO_OBJECTIVE_ALLOW_SAFE) and dynamically select the best one by F1 from evaluation metrics
  • Auto-detected scorer dependencies: Add find_dependents_of_tag() to BaseInstanceRegistry — wrapper/composite scorers that use a refusal
    scorer are automatically discovered via eval_hash matching, no explicit depends_on needed
  • Phased scorer initialization: Refactor ScorerInitializer.initialize_async into 5 phases so dependent scorers (e.g.
    TrueFalseInverterScorer) use the best refusal path
  • Tag-based evaluation filtering: Add --tags CLI arg to evaluate_scorers.py to run evals for specific scorer groups (e.g. --tags
    refusal)
  • Underlying model support in initializers: Add OPENAI_CHAT_UNDERLYING_MODEL and AZURE_OPENAI_GPT4O_UNSAFE_CHAT_UNDERLYING_MODEL{,2} env
    vars with fallback to model name (credit: @fdubut)
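
The best-by-F1 selection described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `VariantMetrics`, `pick_best_by_f1`, the fallback behavior when no metrics exist, and the F1 numbers are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VariantMetrics:
    """Hypothetical container for one refusal scorer variant's evaluation result."""

    name: str
    f1: Optional[float]  # None when no metrics have been generated yet


def pick_best_by_f1(metrics: list[VariantMetrics], default: str) -> str:
    """Return the variant with the highest F1, or `default` if none has metrics."""
    scored = [m for m in metrics if m.f1 is not None]
    if not scored:
        # Assumed behavior: fall back to a fixed default when nothing was evaluated.
        return default
    return max(scored, key=lambda m: m.f1).name


# Illustrative numbers only; they are not results from this PR.
variants = [
    VariantMetrics("OBJECTIVE_BLOCK_SAFE", 0.81),
    VariantMetrics("OBJECTIVE_ALLOW_SAFE", 0.88),
    VariantMetrics("NO_OBJECTIVE_BLOCK_SAFE", None),
    VariantMetrics("NO_OBJECTIVE_ALLOW_SAFE", 0.79),
]
best = pick_best_by_f1(variants, default="OBJECTIVE_ALLOW_SAFE")
```

Variants without metrics are skipped rather than treated as F1 = 0, so an unevaluated variant can never be picked by accident.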

Co-authored with Rich; merged into @fdubut's branch to pick up their changes as well.
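
For the underlying-model bullet in the description, the "fallback to model name" behavior might look like the sketch below. The helper `resolve_underlying_model` is hypothetical, and the model-name variable `OPENAI_CHAT_MODEL` used in the example is an assumption; only `OPENAI_CHAT_UNDERLYING_MODEL` is named in the description.

```python
import os
from typing import Mapping, Optional


def resolve_underlying_model(
    underlying_var: str,
    model_var: str,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Prefer the explicit underlying-model variable, else fall back to the model name.

    Hypothetical helper: empty strings are treated the same as unset variables.
    """
    return env.get(underlying_var) or env.get(model_var) or None


# Example with an explicit mapping instead of the real process environment.
example_env = {"OPENAI_CHAT_MODEL": "my-deployment"}
resolved = resolve_underlying_model(
    "OPENAI_CHAT_UNDERLYING_MODEL", "OPENAI_CHAT_MODEL", env=example_env
)
```

Passing the environment as a mapping keeps the lookup easy to unit test without patching `os.environ`.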

@rlundeen2
Contributor

I have another related PR; let's hold off on merging this one for a sec. I'll merge mine into this branch and consolidate.

@rlundeen2 rlundeen2 self-assigned this Mar 26, 2026
rlundeen2 and others added 6 commits March 26, 2026 11:24
- Remove duplicate seed_type in harms.prompt (both sides added it independently)
- Update stale REFUSAL_GPT4O docstring reference to REFUSAL_GPT4O_OBJECTIVE_ALLOW_SAFE

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tion

- Add _collect_child_eval_hashes() to ComponentIdentifier for recursive
  child eval_hash collection
- Add find_dependents_of_tag() to BaseInstanceRegistry for auto-detecting
  wrapper/composite scorer dependencies via eval_hash matching
- Add 4 refusal scorer variants with REFUSAL tag in ScorerInitializer
- Add _register_best_refusal_f1() to tag the best refusal scorer by F1
  from existing metrics (parallels _register_best_objective_f1)
- Refactor initialize_async into 5 phases: base refusal, best refusal
  selection, dependent scorers, other scorers, best objective selection
- Add --tags CLI filtering to evaluate_scorers.py via argparse
- Add comprehensive unit tests for all new functionality

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
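
The dependency auto-detection in the commit above can be pictured as follows. This is a sketch under stated assumptions: the `Component` dataclass and the list-based registry are hypothetical stand-ins, and the two functions only mirror the names `_collect_child_eval_hashes()` and `find_dependents_of_tag()` from the commit message, not their real signatures.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical stand-in for a registered scorer and its identifier."""

    name: str
    eval_hash: str
    tags: set[str] = field(default_factory=set)
    children: list["Component"] = field(default_factory=list)


def collect_child_eval_hashes(component: Component) -> set[str]:
    """Recursively gather the eval_hash of every descendant (not the component itself)."""
    hashes: set[str] = set()
    for child in component.children:
        hashes.add(child.eval_hash)
        hashes |= collect_child_eval_hashes(child)
    return hashes


def find_dependents_of_tag(registry: list[Component], tag: str) -> list[Component]:
    """Return entries that wrap, at any depth, a component carrying `tag`."""
    tagged_hashes = {c.eval_hash for c in registry if tag in c.tags}
    return [
        c
        for c in registry
        if tag not in c.tags and collect_child_eval_hashes(c) & tagged_hashes
    ]


refusal = Component("refusal", eval_hash="h-refusal", tags={"refusal"})
inverter = Component("inverter", eval_hash="h-inverter", children=[refusal])
unrelated = Component("likert", eval_hash="h-likert")

dependents = find_dependents_of_tag([refusal, inverter, unrelated], "refusal")
```

Because matching is done on hashes of child components, a wrapper is found without declaring an explicit `depends_on`.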
STRICT was removed in favor of named variants (OBJECTIVE_BLOCK_SAFE,
OBJECTIVE_ALLOW_SAFE, etc.). Use the default OBJECTIVE_ALLOW_SAFE path
which aligns with the dynamic best-refusal selection system.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
# Use harm_definition from CSV headers if available (e.g., "fairness_bias.yaml"),
# otherwise fall back to deriving from harm_category (e.g., "bias" -> "bias.yaml").
# The CSV header is authoritative since the harm_category name may differ from
# the YAML filename (e.g., harm_category="bias" but file is "fairness_bias.yaml").
Contributor

this was moved, need to check if this fix still matters

rlundeen2 and others added 4 commits March 26, 2026 16:12
Metrics should be regenerated with evaluate_scorers.py after
the new refusal scorer variants are finalized.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Document evaluate_scorers.py usage with --tags filtering and the
recommended two-step workflow: evaluate refusal scorers first,
then re-run all scorers so dependents use the best refusal variant.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
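
A `--tags` filter of the kind documented in the commit above could be wired up with `argparse` along these lines. This is only a sketch: the space-separated `nargs="+"` form, the `filter_by_tags` helper, and the scorer names are assumptions, not the actual contents of evaluate_scorers.py.

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the command line; `--tags` is optional and accepts one or more values."""
    parser = argparse.ArgumentParser(description="Evaluate registered scorers.")
    parser.add_argument(
        "--tags",
        nargs="+",
        default=None,
        help="Only evaluate scorers carrying at least one of these tags.",
    )
    return parser.parse_args(argv)


def filter_by_tags(scorers: dict[str, set[str]], tags: Optional[Sequence[str]]) -> list[str]:
    """Return scorer names to evaluate; no tags means evaluate everything."""
    if not tags:
        return sorted(scorers)
    wanted = set(tags)
    return sorted(name for name, scorer_tags in scorers.items() if scorer_tags & wanted)


# Hypothetical registry contents, mapping scorer name to its tags.
registry = {
    "refusal_a": {"refusal"},
    "refusal_b": {"refusal"},
    "inverter": {"objective"},
}
args = parse_args(["--tags", "refusal"])
selected = filter_by_tags(registry, args.tags)
```

With this shape, the two-step workflow amounts to one run with `--tags refusal` followed by a second run with no `--tags` argument.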
Change tags parameter type from list[str] to Sequence[str] to accept
list[ScorerInitializerTags] (list is invariant, Sequence is covariant).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
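
The variance point in the commit above can be illustrated with a small example. It assumes `ScorerInitializerTags` is a `str`-based enum, which the commit message implies but does not state; the members and the `normalize_tags` helper are hypothetical.

```python
from collections.abc import Sequence
from enum import Enum


class ScorerInitializerTags(str, Enum):
    """Assumed shape of the tag enum: members are also str instances."""

    REFUSAL = "refusal"
    OBJECTIVE = "objective"


def normalize_tags(tags: Sequence[str]) -> list[str]:
    """Accept plain strings or str-based enum members and return plain strings.

    With `tags: list[str]`, a type checker rejects list[ScorerInitializerTags]
    because list is invariant. Sequence is read-only and therefore covariant,
    so Sequence[ScorerInitializerTags] is accepted as Sequence[str].
    """
    return [t.value if isinstance(t, Enum) else t for t in tags]


enum_tags: list[ScorerInitializerTags] = [ScorerInitializerTags.REFUSAL]
plain = normalize_tags(enum_tags)           # accepted by a type checker
mixed = normalize_tags(["objective", ScorerInitializerTags.REFUSAL])
```

The difference is only visible to a static type checker; at runtime both annotations behave the same.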

2 participants