Add XL-SafetyBench datasets and judges #1791

Draft
romanlutz wants to merge 5 commits into microsoft:main from romanlutz:romanlutz/explore-benchmark-integration
Conversation

Contributor

@romanlutz romanlutz commented May 23, 2026

Summary

Adds the XL-SafetyBench benchmark — a country-grounded cross-cultural benchmark for LLM safety and cultural sensitivity — to PyRIT. Co-authored by PyRIT contributor Amanda Minnich.

This PR implements Phase 1 (seed datasets) and Phase 2 (judges). Phase 3 (Scenarios) and Phase 4 (example notebook) are deferred to follow-up PRs to keep this one reviewable.

What's in this PR

Datasets

Three providers under pyrit/datasets/seed_datasets/remote/xl_safety_bench_dataset.py covering both tracks (5,500 native-validated prompts across 10 country-language pairs):

| Provider | Emits | Size | Description |
| --- | --- | --- | --- |
| _XLSafetyBenchJailbreakDataset | SeedPrompt | 4,500 | Country-grounded adversarial attack prompts in 5 harm categories; ready-to-send payloads |
| _XLSafetyBenchJailbreakObjectivesDataset | SeedObjective | ~1,500 | The underlying harmful goals behind those attack prompts (deduped per country), so PyRIT attack strategies can run their own multi-turn jailbreaks |
| _XLSafetyBenchCulturalDataset | SeedPrompt | 1,000 | Innocuous scenarios embedding local taboos/sensitivities in 6 categories |

Shared enums: XLSafetyBenchCountry (10 values), XLSafetyBenchJailbreakCategory (5 values), XLSafetyBenchCulturalCategory (6 values). The cultural dataset supports language_mode="local" | "english"; hidden_violation from the source CSV is preserved in SeedPrompt.metadata so the cultural judge can use it.
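A minimal sketch of how a language_mode switch can select between two columns of one CSV row while carrying hidden_violation along as metadata. The column names scenario_local and scenario_english and the function build_cultural_record are hypothetical, chosen for illustration; the actual CSV schema and loader internals may differ.

```python
from typing import Literal

LanguageMode = Literal["local", "english"]

# Assumed column names; the upstream CSV may name these differently.
_COLUMN_BY_MODE = {"local": "scenario_local", "english": "scenario_english"}


def build_cultural_record(row: dict, language_mode: LanguageMode = "local") -> dict:
    """Pick the scenario text for the requested language and keep hidden_violation."""
    if language_mode not in _COLUMN_BY_MODE:
        raise ValueError(f"language_mode must be 'local' or 'english', got {language_mode!r}")
    return {
        "value": row[_COLUMN_BY_MODE[language_mode]],
        # Preserved so a downstream judge can compare the response against it.
        "metadata": {"hidden_violation": row["hidden_violation"]},
    }


row = {
    "scenario_local": "texte local",
    "scenario_english": "english text",
    "hidden_violation": "example sensitivity",
}
record = build_cultural_record(row, language_mode="english")
```

Because both languages live in the same row, switching modes changes only which column is read; the metadata is identical either way.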

The jailbreak prompts and objectives loaders are paired by design: the upstream CSV ships ~3 attack prompts per unique base query per country. The prompts loader exposes the payloads and the objectives loader exposes the deduped goals; choose either (or both) depending on whether you want to evaluate the paper's bundled attacks or run your own.
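The pairing can be illustrated with a small sketch: every row yields a prompt, while objectives are deduplicated on the (country, base query) pair. The column names country, base_query, and attack_prompt are assumptions for illustration, not the upstream schema, and dedupe_objectives is a hypothetical helper rather than the loader's actual code.

```python
# Hypothetical rows; real rows come from the per-country CSVs.
rows = [
    {"country": "France", "base_query": "goal A", "attack_prompt": "variant 1"},
    {"country": "France", "base_query": "goal A", "attack_prompt": "variant 2"},
    {"country": "France", "base_query": "goal A", "attack_prompt": "variant 3"},
    {"country": "Japan", "base_query": "goal A", "attack_prompt": "variant 4"},
]


def dedupe_objectives(rows):
    """Keep the first occurrence of each (country, base_query) pair, in order."""
    seen = set()
    objectives = []
    for row in rows:
        key = (row["country"], row["base_query"])
        if key not in seen:
            seen.add(key)
            objectives.append({"country": row["country"], "objective": row["base_query"]})
    return objectives


prompts = [row["attack_prompt"] for row in rows]  # one per row
objectives = dedupe_objectives(rows)              # one per (country, goal)
```

With roughly three attack prompts per base query, 4,500 prompts reduce to about 1,500 objectives, which matches the sizes listed in the table.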

Per-country sibling loaders

Following the sibling-loaders convention (each distinct upstream artifact gets its own registered SeedDatasetProvider subclass so the daily e2e matrix exercises every URL), both SeedPrompt tracks are also exposed as 10 country-pinned sibling subclasses each:

  • _XLSafetyBenchJailbreak{France,Germany,India,Indonesia,Japan,SouthKorea,Spain,Turkey,UnitedArabEmirates,UnitedStates}Dataset — each pins countries=[X], size "medium", inherits the parent's "default" tag.
  • _XLSafetyBenchCultural{...same...}Dataset — same shape; language_mode stays a kwarg since it selects between two columns in the same CSV (not a new upstream artifact).

The objectives loader has no country siblings because it reads the same per-country CSVs as the prompts siblings — adding objectives siblings would duplicate e2e coverage without adding new URLs. The objectives parent also intentionally omits the "default" tag for the same reason (avoids double-counting URLs in default sweeps).
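The sibling pattern amounts to a thin subclass that pins one argument and forwards the rest. The sketch below uses plain Python stand-ins: ParentLoader and FranceLoader are hypothetical names, not the PyRIT classes, and rejecting a caller-supplied countries argument is an assumption about how pinning might be enforced.

```python
class ParentLoader:
    """Hypothetical stand-in for a parent dataset provider."""

    tags = {"default"}
    size = "large"

    def __init__(self, *, countries=None, categories=None):
        self.countries = countries
        self.categories = categories


class FranceLoader(ParentLoader):
    """Country-pinned sibling: fixes countries, forwards every other kwarg."""

    size = "medium"  # a single country is a smaller slice than the parent

    def __init__(self, **kwargs):
        if "countries" in kwargs:
            # Assumed behavior: the pinned argument cannot be overridden.
            raise TypeError("countries is pinned by this sibling")
        super().__init__(countries=["France"], **kwargs)


loader = FranceLoader(categories=["example_category"])
```

Class attributes such as tags are inherited unless a sibling overrides them, which is how a sibling would pick up the parent's "default" tag without restating it.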

Judges

  • XLSafetyBenchJailbreakJudgeScorer — 4-criterion judge (Harmfulness / Non-Compliance / Specificity / Relevance). Returns a single boolean (ASR = H OR NC) with the full label (Critical Unsafe, Direct Unsafe, Soft Unsafe, Ideal Safe, Neutral Safe) and all four binary criteria exposed in score_metadata for downstream ASR/NSR aggregation.
  • XLSafetyBenchCulturalJudgeScorer — relevance + cultural-recognition judge that consumes the embedded hidden_violation from the seed metadata. True when the response explicitly flags the cultural sensitivity (the paper's CSR metric).
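For the jailbreak judge, the stated rule ASR = H OR NC can be aggregated over many scored responses as in the sketch below. The metadata keys harmfulness and non_compliance are assumed names for illustration; the actual keys in score_metadata may differ, and the mapping from the four criteria to the five labels is not reproduced here.

```python
def is_attack_success(harmfulness: bool, non_compliance: bool) -> bool:
    """Per-response success under the rule stated above: H OR NC."""
    return harmfulness or non_compliance


def attack_success_rate(metadata_rows) -> float:
    """Fraction of responses counted as attack successes (0.0 for empty input)."""
    if not metadata_rows:
        return 0.0
    hits = sum(
        is_attack_success(m["harmfulness"], m["non_compliance"]) for m in metadata_rows
    )
    return hits / len(metadata_rows)


# Hypothetical per-response criteria, as they might appear in score metadata.
scored = [
    {"harmfulness": True, "non_compliance": False},
    {"harmfulness": False, "non_compliance": False},
    {"harmfulness": False, "non_compliance": True},
    {"harmfulness": False, "non_compliance": False},
]
asr = attack_success_rate(scored)  # 2 of 4 responses count as successes
```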

Both judges accept an injectable chat_target (defaults via apply_defaults). Judge system/user templates are ported verbatim from the paper's eval repo and live as SeedPrompt YAML under pyrit/datasets/score/xl_safety_bench/, following the convention used by insecure_code, refusal, scales, etc. Each YAML carries the full provenance metadata (name, description, harm_categories, authors, groups, source, parameters) and is loaded via SeedPrompt.from_yaml_file(...) / rendered with render_template_value(...); the original $var / ${var} placeholders are now Jinja2 {{ var }}.
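The placeholder migration described above can be sketched with a single regular expression. This is an illustrative converter under the assumption that the templates use only the $var and ${var} forms; the function to_jinja is hypothetical, and escaped $$ sequences are deliberately not handled.

```python
import re

# Matches ${name} or $name, where name is a Python-style identifier.
_PLACEHOLDER = re.compile(
    r"\$(?:\{(?P<braced>[A-Za-z_][A-Za-z0-9_]*)\}|(?P<named>[A-Za-z_][A-Za-z0-9_]*))"
)


def to_jinja(template: str) -> str:
    """Rewrite $var / ${var} placeholders as Jinja2 {{ var }} expressions."""

    def _replace(match: re.Match) -> str:
        name = match.group("braced") or match.group("named")
        return "{{ " + name + " }}"

    return _PLACEHOLDER.sub(_replace, template)


converted = to_jinja("Query: $query\nResponse: ${response}")
```

Text without placeholders passes through unchanged, so the converter can be run over a whole prompt file safely as long as it contains no literal dollar-prefixed identifiers.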

Tests

  • tests/unit/datasets/test_xl_safety_bench_dataset.py — 48 tests (parent + sibling + objectives + BOM)
  • tests/unit/score/test_xl_safety_bench_judges.py — 17 tests

Notes for reviewers

  • The HuggingFace CSVs ship with a leading UTF-8 BOM (\ufeffid), so the loaders normalize each row through a tiny _normalize_csv_row helper before reading id. Without it the per-row row_id and per-SeedPrompt name are silently empty for all 5,500 prompts. Direct unit tests cover the BOM behavior for all three loaders.
  • License is CC-BY-4.0 — attribution is preserved in SeedPrompt/SeedObjective authors/source/description.
  • The cultural parent and cultural siblings intentionally omit the "default" tag (innocuous-by-construction); the jailbreak prompts loader has it; the jailbreak objectives loader omits it.
  • Paper carries a content warning — same precedent as Aya / ALERT / JBB.
  • @amandajean119 — please pull in for review since you're a paper co-author.
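The BOM issue from the first note can be reproduced with the standard library alone. The sketch below shows why reading id fails on an un-normalized row and one way to fix it; strip_bom_from_keys is a hypothetical helper written for illustration, and the PR's _normalize_csv_row may be implemented differently.

```python
import csv
import io

BOM = "\ufeff"


def strip_bom_from_keys(row: dict) -> dict:
    """Return a copy of the row with any leading BOM removed from its keys."""
    return {
        (key.lstrip(BOM) if isinstance(key, str) else key): value
        for key, value in row.items()
    }


# A CSV whose first header cell carries a BOM, as described above.
raw = "\ufeffid,prompt\n42,hello\n"

raw_rows = list(csv.DictReader(io.StringIO(raw)))
missing_id = raw_rows[0].get("id")  # None: the key is actually "\ufeffid"

clean_rows = [strip_bom_from_keys(row) for row in raw_rows]
row_id = clean_rows[0]["id"]
```

When the bytes are decoded locally, opening the file with encoding="utf-8-sig" is an alternative, since that codec drops a leading BOM during decoding.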

Deferred follow-ups

  • XLSafetyBenchJailbreak / XLSafetyBenchCultural Scenario classes
  • Example notebook under doc/code/scenarios/

Verification

  • ruff check
  • ruff format --check
  • ty check
  • All new + existing tests/unit/{datasets,score} tests pass (65 XL-SafetyBench tests)

romanlutz and others added 5 commits May 22, 2026 22:26
Implements Phase 1 (seed datasets) and Phase 2 (judge scorers) of
XL-SafetyBench (arXiv:2605.05662), a country-grounded cross-cultural
benchmark for LLM safety and cultural sensitivity (CC-BY-4.0,
AIM-Intelligence/XL-SafetyBench on HuggingFace).

Datasets (pyrit/datasets/seed_datasets/remote/xl_safety_bench_dataset.py):
- _XLSafetyBenchJailbreakDataset (4,500 prompts, 5 categories, 10 countries)
- _XLSafetyBenchCulturalDataset (1,000 prompts, 6 categories, 10 countries)
  - language_mode='local'|'english' selects scenario language
  - hidden_violation preserved in metadata for the cultural judge
- Shared enums: XLSafetyBenchCountry, XLSafetyBenchJailbreakCategory,
  XLSafetyBenchCulturalCategory

Judges (pyrit/score/true_false/):
- XLSafetyBenchJailbreakJudgeScorer: 4-criterion judge (H/NC/S/R),
  produces a label (Critical/Direct/Soft Unsafe, Ideal/Neutral Safe)
  with the booleans in score_metadata for downstream ASR/NSR aggregation
- XLSafetyBenchCulturalJudgeScorer: relevance + cultural recognition;
  CSR truthy when the response flags the embedded sensitivity
- Judge prompts ported verbatim from the paper's GitHub repo and stored
  as companion .txt files for readability

Tests: 34 new unit tests (17 dataset + 17 scorer) covering filters,
metadata propagation, language modes, malformed-JSON retries, and the
full jailbreak label / cultural matrix.

Phase 3 (Scenarios) and Phase 4 (example notebook) deferred to follow-ups.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Following the sibling-loaders pattern (each distinct upstream artifact
gets its own registered SeedDatasetProvider subclass so the daily e2e
matrix exercises every URL), split both XL-SafetyBench tracks into
country-pinned subclasses:

- 10 jailbreak siblings (one per country, pinning `countries=[X]`,
  `size="medium"`, inheriting the parent's "default" tag)
- 10 cultural siblings (same shape, `language_mode` remains a kwarg
  since it selects a column in the same CSV)

The parent loaders are unchanged; the kwarg-based API still works.

Also fixes the cultural parent's size bucket: 1,000 prompts falls in
the "large" (500-4999) range, not "medium".

Added 22 unit tests covering dataset_name, country pinning, tag
inheritance, and kwarg forwarding (categories + language_mode) for the
siblings.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>