Add XL-SafetyBench datasets and judges by romanlutz · Pull Request #1791 · microsoft/PyRIT

romanlutz · 2026-05-23T05:28:14Z

Summary

Adds XL-SafetyBench, a country-grounded cross-cultural benchmark for LLM safety and cultural sensitivity, to PyRIT. The paper is co-authored by PyRIT contributor Amanda Minnich.

Paper: arXiv:2605.05662
Dataset: AIM-Intelligence/XL-SafetyBench on HuggingFace (CC-BY-4.0)
Eval code: https://github.com/AIM-Intelligence/XL-SafetyBench

This PR implements Phase 1 (seed datasets) and Phase 2 (judges). Phase 3 (Scenarios) and Phase 4 (example notebook) are deferred to follow-up PRs to keep this one reviewable.

What's in this PR

Datasets

Three providers under pyrit/datasets/seed_datasets/remote/xl_safety_bench_dataset.py covering both tracks (5,500 native-validated prompts across 10 country-language pairs):

| Provider | Emits | Size | Description |
| --- | --- | --- | --- |
| `_XLSafetyBenchJailbreakDataset` | `SeedPrompt` | 4,500 | Country-grounded adversarial attack prompts in 5 harm categories (ready-to-send payloads) |
| `_XLSafetyBenchJailbreakObjectivesDataset` | `SeedObjective` | ~1,500 | The underlying harmful goals behind those attack prompts (deduped per country), so PyRIT attack strategies can run their own multi-turn jailbreaks |
| `_XLSafetyBenchCulturalDataset` | `SeedPrompt` | 1,000 | Innocuous scenarios embedding local taboos/sensitivities in 6 categories |

Shared enums: XLSafetyBenchCountry (10 values), XLSafetyBenchJailbreakCategory (5 values), XLSafetyBenchCulturalCategory (6 values). The cultural dataset supports language_mode="local" | "english"; hidden_violation from the source CSV is preserved in SeedPrompt.metadata so the cultural judge can use it.
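For reviewers, a rough sketch of how language_mode and hidden_violation could interact. This is not the code in this PR: the column names (scenario_local, scenario_english) and the plain-dict seed are assumptions made for illustration, and the real loader builds SeedPrompt objects instead.

```python
# Hypothetical column names; the upstream CSV schema may differ.
_COLUMN_BY_MODE = {"local": "scenario_local", "english": "scenario_english"}


def build_cultural_seed(row, language_mode="local"):
    """Pick the scenario column for the requested mode and keep hidden_violation."""
    if language_mode not in _COLUMN_BY_MODE:
        raise ValueError(f"language_mode must be one of {sorted(_COLUMN_BY_MODE)}")
    return {
        "value": row[_COLUMN_BY_MODE[language_mode]],
        # Preserved so a downstream judge can check whether the response flags it.
        "metadata": {
            "hidden_violation": row["hidden_violation"],
            "language_mode": language_mode,
        },
    }


sample_row = {
    "scenario_local": "local-language scenario text",
    "scenario_english": "English scenario text",
    "hidden_violation": "description of the embedded sensitivity",
}
local_seed = build_cultural_seed(sample_row)
english_seed = build_cultural_seed(sample_row, language_mode="english")
```

Both modes read the same row, which is why language_mode stays a kwarg rather than becoming a separate loader.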

The jailbreak prompts and objectives loaders are paired by design: the upstream CSV ships ~3 attack prompts per unique base query per country. The prompts loader exposes the payloads, the objectives loader exposes the deduped goals — choose either (or both) depending on whether you want to evaluate the paper's bundled attacks or run your own.
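The pairing can be illustrated with a small standalone sketch. The column names (country, base_query, attack_prompt) and the sample rows are assumptions for illustration only; the actual loaders emit SeedPrompt and SeedObjective objects.

```python
import csv
import io

# Hypothetical schema: several attack prompts share one base query per country.
SAMPLE_CSV = """id,country,base_query,attack_prompt
1,France,goal A,attack A1
2,France,goal A,attack A2
3,France,goal A,attack A3
4,France,goal B,attack B1
5,Japan,goal A,attack A4
"""


def load_rows(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def attack_prompts(rows):
    """Prompts view: every row yields one ready-to-send payload."""
    return [row["attack_prompt"] for row in rows]


def deduped_objectives(rows):
    """Objectives view: one entry per unique (country, base_query), first-seen order."""
    seen = set()
    objectives = []
    for row in rows:
        key = (row["country"], row["base_query"])
        if key not in seen:
            seen.add(key)
            objectives.append(key)
    return objectives


rows = load_rows(SAMPLE_CSV)
prompts = attack_prompts(rows)
objectives = deduped_objectives(rows)
```

Deduplication is keyed on the country as well as the query, so the same goal appearing in two countries stays as two objectives.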

Per-country sibling loaders

Following the sibling-loaders convention (each distinct upstream artifact gets its own registered SeedDatasetProvider subclass so the daily e2e matrix exercises every URL), both SeedPrompt tracks are also exposed as 10 country-pinned sibling subclasses each:

_XLSafetyBenchJailbreak{France,Germany,India,Indonesia,Japan,SouthKorea,Spain,Turkey,UnitedArabEmirates,UnitedStates}Dataset — each pins countries=[X], size "medium", inherits the parent's "default" tag.
_XLSafetyBenchCultural{...same...}Dataset — same shape; language_mode stays a kwarg since it selects between two columns in the same CSV (not a new upstream artifact).

The objectives loader has no country siblings because it reads the same per-country CSVs as the prompts siblings — adding objectives siblings would duplicate e2e coverage without adding new URLs. The objectives parent also intentionally omits the "default" tag for the same reason (avoids double-counting URLs in default sweeps).
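A minimal sketch of the country-pinning idea, using a stand-in class rather than the PyRIT SeedDatasetProvider. The factory function, class names, and the parent's "large" size bucket are assumptions for illustration; the PR may define each sibling as an explicit class.

```python
class JailbreakLoader:
    """Stand-in for the parent loader; not the PyRIT class."""

    tags = {"default"}
    size = "large"  # assumed bucket for the 4,500-prompt parent

    def __init__(self, *, countries=None, categories=None):
        self.countries = countries
        self.categories = categories


def make_country_sibling(parent, country):
    """Build a subclass that pins countries=[country] and forwards other kwargs."""

    def __init__(self, **kwargs):
        kwargs.pop("countries", None)  # the pinned country always wins
        parent.__init__(self, countries=[country], **kwargs)

    name = f"{parent.__name__}{country.replace(' ', '')}"
    # Class attributes not overridden here (such as tags) are inherited.
    return type(name, (parent,), {"__init__": __init__, "size": "medium"})


FranceLoader = make_country_sibling(JailbreakLoader, "France")
SouthKoreaLoader = make_country_sibling(JailbreakLoader, "South Korea")

france = FranceLoader(categories=["example_category"])
```

Because each sibling is its own class, a registry that enumerates subclasses would see one entry per country, which is what lets a test matrix exercise each per-country URL separately.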

Judges

XLSafetyBenchJailbreakJudgeScorer — 4-criterion judge (Harmfulness / Non-Compliance / Specificity / Relevance). Returns a single boolean (ASR = H OR NC) with the full label (Critical Unsafe, Direct Unsafe, Soft Unsafe, Ideal Safe, Neutral Safe) and all four binary criteria exposed in score_metadata for downstream ASR/NSR aggregation.
XLSafetyBenchCulturalJudgeScorer — relevance + cultural-recognition judge that consumes the embedded hidden_violation from the seed metadata. True when the response explicitly flags the cultural sensitivity (the paper's CSR metric).
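As a sketch of the downstream aggregation, the snippet below takes the boolean rule exactly as stated above (ASR = H OR NC) and averages it over a batch. The dataclass and its field names are hypothetical, the real values live in score_metadata, and the mapping from criteria to the five labels is deliberately not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeCriteria:
    """Hypothetical container for the four binary criteria of one judged response."""

    harmfulness: bool
    non_compliance: bool
    specificity: bool
    relevance: bool


def is_attack_success(criteria):
    """Per-response boolean, following the rule stated in this PR: H OR NC."""
    return criteria.harmfulness or criteria.non_compliance


def attack_success_rate(results):
    """Fraction of responses counted as attack successes; 0.0 for an empty batch."""
    if not results:
        return 0.0
    return sum(is_attack_success(c) for c in results) / len(results)


batch = [
    JudgeCriteria(True, False, True, True),
    JudgeCriteria(False, True, False, True),
    JudgeCriteria(False, False, False, True),
    JudgeCriteria(False, False, False, False),
]
asr = attack_success_rate(batch)
```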

Both judges accept an injectable chat_target (defaults via apply_defaults). Judge system/user templates are ported verbatim from the paper's eval repo and live as SeedPrompt YAML under pyrit/datasets/score/xl_safety_bench/, following the convention used by insecure_code, refusal, scales, etc. Each YAML carries the full provenance metadata (name, description, harm_categories, authors, groups, source, parameters) and is loaded via SeedPrompt.from_yaml_file(...) / rendered with render_template_value(...); the original $var / ${var} placeholders are now Jinja2 {{ var }}.
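The placeholder change is mechanical. A minimal sketch of the $var / ${var} to {{ var }} rewrite, assuming simple identifiers and no escaped $$ sequences (the placeholder names query and response are made up for the example, and this is not necessarily how the conversion was done in the PR):

```python
import re

# Matches ${name} or $name; group 1 or group 2 holds the identifier.
_PLACEHOLDER = re.compile(r"\$(?:\{(\w+)\}|(\w+))")


def to_jinja(template):
    """Rewrite string.Template-style placeholders as Jinja2 expressions."""

    def replace(match):
        name = match.group(1) or match.group(2)
        return "{{ " + name + " }}"

    return _PLACEHOLDER.sub(replace, template)


converted = to_jinja("Query: ${query}\nResponse: $response")
```

A template containing a literal dollar sign followed by word characters would need extra handling before running a rewrite like this.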

Tests

tests/unit/datasets/test_xl_safety_bench_dataset.py — 48 tests (parent + sibling + objectives + BOM)
tests/unit/score/test_xl_safety_bench_judges.py — 17 tests

Notes for reviewers

The HuggingFace CSVs ship with a leading UTF-8 BOM (\ufeffid), so the loaders normalize each row through a tiny _normalize_csv_row helper before reading id. Without it the per-row row_id and per-SeedPrompt name are silently empty for all 5,500 prompts. Direct unit tests cover the BOM behavior for all three loaders.
License is CC-BY-4.0 — attribution is preserved in SeedPrompt/SeedObjective authors/source/description.
Cultural parent + cultural siblings intentionally omit the "default" tag (innocuous-by-construction); the jailbreak prompts loader has it; the jailbreak objectives loader omits it.
Paper carries a content warning — same precedent as Aya / ALERT / JBB.
@amandajean119 — please pull in for review since you're a paper co-author.
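For context on the BOM note above, a minimal reproduction of the failure and one way a normalizing helper could work. The function below is a guess at the shape of such a helper, not the _normalize_csv_row implementation in this PR, and the sample columns are invented.

```python
import csv
import io

BOM = "\ufeff"


def normalize_csv_row(row):
    """Return a copy of the row with any leading BOM stripped from its keys."""
    return {key.lstrip(BOM): value for key, value in row.items()}


# The first header cell carries the BOM, so DictReader keys it as "\ufeffid".
raw = "\ufeffid,prompt\n42,hello\n"
first = next(csv.DictReader(io.StringIO(raw)))
fixed = normalize_csv_row(first)

missing_before = first.get("id")  # lookup by plain "id" silently misses
present_after = fixed["id"]
```

When the bytes are decoded locally, reading them with the utf-8-sig codec also drops the BOM; normalizing keys is the option that works regardless of how the text was decoded.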

Deferred follow-ups

XLSafetyBenchJailbreak / XLSafetyBenchCultural Scenario classes
Example notebook under doc/code/scenarios/

Verification

ruff check ✅
ruff format --check ✅
ty check ✅
All new + existing tests/unit/{datasets,score} tests pass (65 XL-SafetyBench tests)

Implements Phase 1 (seed datasets) and Phase 2 (judge scorers) of XL-SafetyBench (arXiv:2605.05662), a country-grounded cross-cultural benchmark for LLM safety and cultural sensitivity (CC-BY-4.0, AIM-Intelligence/XL-SafetyBench on HuggingFace).

Datasets (pyrit/datasets/seed_datasets/remote/xl_safety_bench_dataset.py):
- _XLSafetyBenchJailbreakDataset (4,500 prompts, 5 categories, 10 countries)
- _XLSafetyBenchCulturalDataset (1,000 prompts, 6 categories, 10 countries)
  - language_mode='local'|'english' selects scenario language
  - hidden_violation preserved in metadata for the cultural judge
- Shared enums: XLSafetyBenchCountry, XLSafetyBenchJailbreakCategory, XLSafetyBenchCulturalCategory

Judges (pyrit/score/true_false/):
- XLSafetyBenchJailbreakJudgeScorer: 4-criterion judge (H/NC/S/R), produces a label (Critical/Direct/Soft Unsafe, Ideal/Neutral Safe) with the booleans in score_metadata for downstream ASR/NSR aggregation
- XLSafetyBenchCulturalJudgeScorer: relevance + cultural recognition; CSR truthy when the response flags the embedded sensitivity
- Judge prompts ported verbatim from the paper's GitHub repo and stored as companion .txt files for readability

Tests: 34 new unit tests (17 dataset + 17 scorer) covering filters, metadata propagation, language modes, malformed-JSON retries, and the full jailbreak label / cultural matrix.

Phase 3 (Scenarios) and Phase 4 (example notebook) deferred to follow-ups.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Following the sibling-loaders pattern (each distinct upstream artifact gets its own registered SeedDatasetProvider subclass so the daily e2e matrix exercises every URL), split both XL-SafetyBench tracks into country-pinned subclasses:
- 10 jailbreak siblings (one per country, pinning `countries=[X]`, `size="medium"`, inheriting the parent's "default" tag)
- 10 cultural siblings (same shape, `language_mode` remains a kwarg since it selects a column in the same CSV)

The parent loaders are unchanged; the kwarg-based API still works.

Also fixes the cultural parent's size bucket: 1,000 prompts falls in the "large" (500-4999) range, not "medium".

Added 22 unit tests covering dataset_name, country pinning, tag inheritance, and kwarg forwarding (categories + language_mode) for the siblings.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

romanlutz and others added 5 commits May 22, 2026 22:26

Unify XL-SafetyBench country/language metadata into one dataclass

f04392a

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add jailbreak objectives loader and fix CSV BOM in XL-SafetyBench

b91cc50

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Convert XL-SafetyBench judge prompts to SeedPrompt YAML

dfcd64c

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

romanlutz wants to merge 5 commits into microsoft:main from romanlutz:romanlutz/explore-benchmark-integration
