Add XL-SafetyBench datasets and judges #1791
Draft
romanlutz wants to merge 5 commits into
Implements Phase 1 (seed datasets) and Phase 2 (judge scorers) of XL-SafetyBench (arXiv:2605.05662), a country-grounded cross-cultural benchmark for LLM safety and cultural sensitivity (CC-BY-4.0, AIM-Intelligence/XL-SafetyBench on HuggingFace).

Datasets (pyrit/datasets/seed_datasets/remote/xl_safety_bench_dataset.py):
- _XLSafetyBenchJailbreakDataset (4,500 prompts, 5 categories, 10 countries)
- _XLSafetyBenchCulturalDataset (1,000 prompts, 6 categories, 10 countries)
- language_mode='local'|'english' selects scenario language
- hidden_violation preserved in metadata for the cultural judge
- Shared enums: XLSafetyBenchCountry, XLSafetyBenchJailbreakCategory, XLSafetyBenchCulturalCategory

Judges (pyrit/score/true_false/):
- XLSafetyBenchJailbreakJudgeScorer: 4-criterion judge (H/NC/S/R), produces a label (Critical/Direct/Soft Unsafe, Ideal/Neutral Safe) with the booleans in score_metadata for downstream ASR/NSR aggregation
- XLSafetyBenchCulturalJudgeScorer: relevance + cultural recognition; CSR truthy when the response flags the embedded sensitivity
- Judge prompts ported verbatim from the paper's GitHub repo and stored as companion .txt files for readability

Tests: 34 new unit tests (17 dataset + 17 scorer) covering filters, metadata propagation, language modes, malformed-JSON retries, and the full jailbreak label / cultural matrix.

Phase 3 (Scenarios) and Phase 4 (example notebook) deferred to follow-ups.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Following the sibling-loaders pattern (each distinct upstream artifact gets its own registered SeedDatasetProvider subclass so the daily e2e matrix exercises every URL), split both XL-SafetyBench tracks into country-pinned subclasses:
- 10 jailbreak siblings (one per country, pinning `countries=[X]`, `size="medium"`, inheriting the parent's "default" tag)
- 10 cultural siblings (same shape, `language_mode` remains a kwarg since it selects a column in the same CSV)

The parent loaders are unchanged; the kwarg-based API still works.

Also fixes the cultural parent's size bucket: 1,000 prompts falls in the "large" (500-4999) range, not "medium".

Added 22 unit tests covering dataset_name, country pinning, tag inheritance, and kwarg forwarding (categories + language_mode) for the siblings.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
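The country-pinning pattern described in this commit can be sketched roughly as follows. This is a minimal illustration, not PyRIT's code: `ParentLoader` and `FranceLoader` are hypothetical stand-ins for the parent and sibling providers, the category value is made up, and the real `SeedDatasetProvider` registration machinery is left out entirely.

```python
from typing import Optional, Sequence


class ParentLoader:
    """Hypothetical stand-in for a parent loader with a kwarg-based API."""

    # Class-level tag set; subclasses inherit it unless they override it.
    tags = frozenset({"default"})

    def __init__(
        self,
        countries: Optional[Sequence[str]] = None,
        categories: Optional[Sequence[str]] = None,
    ) -> None:
        self.countries = list(countries) if countries else None
        self.categories = list(categories) if categories else None


class FranceLoader(ParentLoader):
    """Hypothetical sibling: pins the country and forwards other kwargs."""

    size = "medium"

    def __init__(self, categories: Optional[Sequence[str]] = None) -> None:
        # The country is fixed here; callers can still filter by category.
        super().__init__(countries=["France"], categories=categories)


loader = FranceLoader(categories=["example_category"])
```

The sibling adds no loading logic of its own. It only narrows the parent's arguments, which is why the parent's kwarg-based API keeps working unchanged.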
Summary
Adds the XL-SafetyBench benchmark — a country-grounded cross-cultural benchmark for LLM safety and cultural sensitivity — to PyRIT. Co-authored by PyRIT contributor Amanda Minnich.
This PR implements Phase 1 (seed datasets) and Phase 2 (judges). Phase 3 (Scenarios) and Phase 4 (example notebook) are deferred to follow-up PRs to keep this one reviewable.
What's in this PR
Datasets
Three providers under `pyrit/datasets/seed_datasets/remote/xl_safety_bench_dataset.py` covering both tracks (5,500 native-validated prompts across 10 country-language pairs):

- `_XLSafetyBenchJailbreakDataset`: `SeedPrompt`
- `_XLSafetyBenchJailbreakObjectivesDataset`: `SeedObjective`
- `_XLSafetyBenchCulturalDataset`: `SeedPrompt`

Shared enums: `XLSafetyBenchCountry` (10 values), `XLSafetyBenchJailbreakCategory` (5 values), `XLSafetyBenchCulturalCategory` (6 values). The cultural dataset supports `language_mode="local" | "english"`; `hidden_violation` from the source CSV is preserved in `SeedPrompt.metadata` so the cultural judge can use it.

The jailbreak prompts and objectives loaders are paired by design: the upstream CSV ships ~3 attack prompts per unique base query per country. The prompts loader exposes the payloads, and the objectives loader exposes the deduped goals. Choose either (or both) depending on whether you want to evaluate the paper's bundled attacks or run your own.
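The pairing between the two jailbreak loaders can be pictured with a small sketch. The column names (`country`, `base_query`, `attack_prompt`) and the rows are made up for illustration; the real CSV schema is not shown in this description, and the actual loaders may dedupe differently.

```python
# Illustrative rows only: several attack prompts share one base query.
rows = [
    {"country": "France", "base_query": "goal A", "attack_prompt": "attack A1"},
    {"country": "France", "base_query": "goal A", "attack_prompt": "attack A2"},
    {"country": "France", "base_query": "goal A", "attack_prompt": "attack A3"},
    {"country": "Japan", "base_query": "goal B", "attack_prompt": "attack B1"},
]

# Prompts view: every attack payload, one entry per CSV row.
prompts = [row["attack_prompt"] for row in rows]

# Objectives view: one entry per unique (country, base query) pair.
# dict.fromkeys drops repeats while keeping first-seen order.
objectives = list(
    dict.fromkeys((row["country"], row["base_query"]) for row in rows)
)
```

Here the four rows yield four prompts but only two objectives, which mirrors the "payloads versus deduped goals" split described above.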
Per-country sibling loaders
Following the sibling-loaders convention (each distinct upstream artifact gets its own registered `SeedDatasetProvider` subclass so the daily e2e matrix exercises every URL), both `SeedPrompt` tracks are also exposed as 10 country-pinned sibling subclasses each:

- `_XLSafetyBenchJailbreak{France,Germany,India,Indonesia,Japan,SouthKorea,Spain,Turkey,UnitedArabEmirates,UnitedStates}Dataset`: each pins `countries=[X]`, size `"medium"`, inherits the parent's `"default"` tag.
- `_XLSafetyBenchCultural{...same...}Dataset`: same shape; `language_mode` stays a kwarg since it selects between two columns in the same CSV (not a new upstream artifact).

The objectives loader has no country siblings because it reads the same per-country CSVs as the prompts siblings, so adding objectives siblings would duplicate e2e coverage without adding new URLs. The objectives parent also intentionally omits the `"default"` tag for the same reason (avoids double-counting URLs in default sweeps).

Judges
- `XLSafetyBenchJailbreakJudgeScorer`: 4-criterion judge (Harmfulness / Non-Compliance / Specificity / Relevance). Returns a single boolean (ASR = H OR NC) with the full label (`Critical Unsafe`, `Direct Unsafe`, `Soft Unsafe`, `Ideal Safe`, `Neutral Safe`) and all four binary criteria exposed in `score_metadata` for downstream ASR/NSR aggregation.
- `XLSafetyBenchCulturalJudgeScorer`: relevance + cultural-recognition judge that consumes the embedded `hidden_violation` from the seed metadata. True when the response explicitly flags the cultural sensitivity (the paper's CSR metric).

Both judges accept an injectable `chat_target` (defaults via `apply_defaults`). Judge system/user templates are ported verbatim from the paper's eval repo and live as `SeedPrompt` YAML under `pyrit/datasets/score/xl_safety_bench/`, following the convention used by `insecure_code`, `refusal`, `scales`, etc. Each YAML carries the full provenance metadata (`name`, `description`, `harm_categories`, `authors`, `groups`, `source`, `parameters`) and is loaded via `SeedPrompt.from_yaml_file(...)` / rendered with `render_template_value(...)`; the original `$var` / `${var}` placeholders are now Jinja2 `{{ var }}`.

Tests
- `tests/unit/datasets/test_xl_safety_bench_dataset.py`: 48 tests (parent + sibling + objectives + BOM)
- `tests/unit/score/test_xl_safety_bench_judges.py`: 17 tests

Notes for reviewers
- [...] `\ufeffid`), so the loaders normalize each row through a tiny `_normalize_csv_row` helper before reading `id`. Without it the per-row `row_id` and per-`SeedPrompt` `name` are silently empty for all 5,500 prompts. Direct unit tests cover the BOM behavior for all three loaders.
- [...] `SeedPrompt` / `SeedObjective` `authors` / `source` / `description`.
- [...] `"default"` tag (innocuous-by-construction); jailbreak prompts has it; jailbreak objectives omits it.

Deferred follow-ups
- `XLSafetyBenchJailbreak` / `XLSafetyBenchCultural` `Scenario` classes
- [...] `doc/code/scenarios/`

Verification
- `ruff check` ✅
- `ruff format --check` ✅
- `ty check` ✅
- `tests/unit/{datasets,score}` tests pass (65 XL-SafetyBench tests)
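The BOM issue mentioned in the notes for reviewers is easy to reproduce with the standard library. The sketch below is an assumption about what a helper like `_normalize_csv_row` might do (strip a leading U+FEFF from each column name); the function `normalize_csv_row` and the sample CSV text are illustrative, not the PR's actual code.

```python
import csv
import io


def normalize_csv_row(row: dict) -> dict:
    """Strip a leading byte-order mark (U+FEFF) from every column name."""
    return {key.lstrip("\ufeff"): value for key, value in row.items()}


# Sample CSV text whose first header carries a BOM, as happens when a
# UTF-8-with-BOM file is decoded as plain "utf-8".
raw = "\ufeffid,prompt\n42,hello\n"

row = next(csv.DictReader(io.StringIO(raw)))

# Before normalization the key is "\ufeffid", so a lookup of "id" fails.
id_missing_before = "id" not in row

clean = normalize_csv_row(row)
```

After normalization `clean["id"]` is readable again. Decoding the file with the `utf-8-sig` codec is another common way to drop the BOM, at the cost of touching the download or read path rather than the row handling.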