
FEAT: Dataset Loading Changes#1451

Open
ValbuenaVC wants to merge 15 commits into Azure:main from ValbuenaVC:datasetloader

Conversation


@ValbuenaVC ValbuenaVC commented Mar 10, 2026

Description

Features:

  • Addition of filters argument to get_all_dataset_names, which rejects datasets that don't meet filter criteria. filters has type SeedDatasetFilters.
  • SeedDatasetProviders have two options for storing static metadata (dynamic metadata, like derived attributes at runtime, has been scoped out of this PR):
  • For remote datasets, static metadata is stored directly as named class attributes (e.g. harm_categories), using types like SeedDatasetFoobar.
  • For local datasets, static metadata is stored in the *.prompt file as tags and extracted from there.
  • In all cases, SeedDatasetMetadata acts as a unified schema and ground truth for logic related to parsing metadata, and we expect only a few class attributes to count as metadata.
  • SeedDatasetMetadata dataclass contains: tags: set[str]; size: SeedDatasetSize; modalities: list[SeedDatasetModality]; source: SeedDatasetSourceType; rank: SeedDatasetLoadingRank; harm_categories: list[str].
  • Datasets that don't have any metadata fields are excluded from filtering logic, with the sole exception that if the filter asks for tags = {"all"}, all filtering logic is bypassed.
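To make the schema and the "all" bypass concrete, here is a minimal sketch of the shape described above. The class and field names come from this PR description, but the exact signatures, the `matches` helper, and the use of immutable containers (required by a frozen dataclass) are my assumptions, not the PR's actual code.

```python
from dataclasses import dataclass
from enum import Enum


class SeedDatasetModality(Enum):
    TEXT = "text"
    IMAGE = "image"


class SeedDatasetSize(Enum):
    TINY = "tiny"
    LARGE = "large"
    HUGE = "huge"


@dataclass(frozen=True)
class SeedDatasetMetadata:
    """Unified schema for static dataset metadata (field names from the PR description)."""

    tags: frozenset[str] = frozenset()
    size: SeedDatasetSize = SeedDatasetSize.LARGE
    modalities: tuple[SeedDatasetModality, ...] = (SeedDatasetModality.TEXT,)
    harm_categories: tuple[str, ...] = ()


def matches(metadata: SeedDatasetMetadata, *, tags: set[str]) -> bool:
    """Hypothetical filter check: tags == {"all"} bypasses filtering entirely."""
    if tags == {"all"}:
        return True
    return bool(metadata.tags & tags)
```

Note the immutable defaults (`frozenset`, tuples): a frozen dataclass with mutable defaults like `set[str] = {"default"}` would either be rejected by `dataclass` or shared across instances.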

Issues: reviewers, please provide your opinion on these.

  • The necessary imports to set up metadata are very verbose (import SeedDatasetMetadata, SeedDatasetSize, ...). This could be fixed by adding constructors that build out the actual metadata enum types from primitive types, but this adds a layer of indirection.
  • Who gets responsibility of filter parsing and matching? I prefer keeping it in SeedDatasetProvider since the logic of filtering is different from the filtering fields itself, but this may be more confusing.
  • Some fields like SeedDatasetLoadingRank and SeedDatasetSourceType seem like they could be excluded.
  • It's not easy for users to interact with filters or metadata. The intuitive and Pythonic angle would be to pass dictionaries or typed dictionaries, but we need strong type safety.
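One hedged way to reconcile the dict-ergonomics vs. type-safety tension from the last issue: keep the filter object strongly typed, but offer a validating constructor that accepts primitives. Everything below (the `from_dict` classmethod, the field set) is a hypothetical sketch, not the PR's actual API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SeedDatasetSize(Enum):
    TINY = "tiny"
    LARGE = "large"


@dataclass(frozen=True)
class SeedDatasetFilters:
    tags: frozenset[str] = frozenset()
    max_size: Optional[SeedDatasetSize] = None

    @classmethod
    def from_dict(cls, raw: dict) -> "SeedDatasetFilters":
        """Accept primitives for ergonomics, but fail fast on unknown keys."""
        known = {"tags", "max_size"}
        unknown = set(raw) - known
        if unknown:
            raise ValueError(f"Unknown filter keys: {sorted(unknown)}")
        return cls(
            tags=frozenset(raw.get("tags", ())),
            max_size=SeedDatasetSize(raw["max_size"]) if "max_size" in raw else None,
        )
```

Users who want the dict style pay one conversion call; everything downstream of `get_all_dataset_names` still sees a typed, frozen object.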

Potential Follow-Up PRs:

  • Populating Metadata: I've started working on a SeedDatasetMetadataUtilities class that could parse the datasets in PyRIT and extract metadata from them programmatically, but this feels like it could be worth a second PR given that the parsing logic makes a lot of implicit decisions.
  • Dynamic Metadata: things like timestamps, exact size, and caching of changes made to remote datasets on local disk. Not possible with the class attribute and frozen dataclass approach we have currently.
  • SQL Passthrough: from SeedDatasetProvider to CentralMemory, to allow for complex operations across datasets. For example, consider a user that wants to get all text prompts containing the string "harm" from two datasets. Something like a SQL JOIN would be ideal in this situation.
  • Rich Encoding/Decoding: for metadata filtering and storage. Not quite a DSL, but something that makes it easier to convert filters and CentralMemory queries.

Tests and Documentation

  • Addition of test_seed_dataset_metadata.py under unit tests.
  • Addition of two integration tests under test_seed_dataset_provider_integration.py in tests.integration.datasets to account for local and remote metadata population.

```python
invalid_categories = {
    cat for cat in harm_categories if cat not in self.HARM_CATEGORIES
}
if invalid_categories:
    raise ValueError(
```
Contributor

I think we likely still want to load these; should we use a default harm category here?

Contributor Author

I agree. The current implementation will just crash the dataset loading if any invalid categories are detected though, so I think we have three options if we find invalid categories:

Option 1: Load them anyway, along with whatever else is in the dataset.
Option 2: Don't load them. Just pass them over and return whatever is in an accepted harm category.
Option 3: Set a default harm category, and if we find anything invalid, only return what matches the default harm category.

I feel like option 3 is the best one and the one you were describing, so I'm going to implement that unless you have any objections.
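A minimal sketch of what option 3 could look like, replacing the `raise ValueError` path above. The default category value and the helper name are placeholders I chose for illustration, not decisions made in this thread.

```python
# Hypothetical default; the actual choice would be made in the PR.
DEFAULT_HARM_CATEGORY = "harmful"
HARM_CATEGORIES = {"cybercrime", "illegal", "harmful", "chemical_biological", "harassment"}


def resolve_harm_categories(requested: list[str]) -> list[str]:
    """Option 3: if any requested category is invalid, fall back to the
    default harm category instead of crashing dataset loading."""
    invalid = {cat for cat in requested if cat not in HARM_CATEGORIES}
    if invalid:
        return [DEFAULT_HARM_CATEGORY]
    return requested
```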

@ValbuenaVC ValbuenaVC requested a review from rlundeen2 March 13, 2026 20:02
@ValbuenaVC ValbuenaVC marked this pull request as ready for review March 13, 2026 20:02
@ValbuenaVC ValbuenaVC changed the title [DRAFT] FEAT: Dataset Loading Changes FEAT: Dataset Loading Changes Mar 13, 2026
```python
modalities: list[SeedDatasetModality] = [SeedDatasetModality.TEXT]
size: SeedDatasetSize = SeedDatasetSize.LARGE  # 504 seeds
# "default" means included in curated set
tags: set[str] = {"default", "safety"}
```
Contributor

I don't like mixing the pieces with tags. Would "default" actually be part of "ranks"?

Contributor Author

Agreed, this needs a refactor.

```python
class SeedDatasetSize(Enum):
    """Ordinal size (by bucket) of the dataset."""

    TINY = "tiny"  # < 10
```
Contributor

I recommend adjusting these values; some are truly gigantic. One thing to note, though, is that it's not necessarily correlated with the number of datasets. WDYT if we write a script that can base this on time? That way we can bucket these on the time it takes to load.

So we could change this to SeedDatasetApproximateLoadTime
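A rough sketch of what that timing script could do, assuming a SeedDatasetApproximateLoadTime enum as suggested. The bucket names, thresholds, and `bucket_load_time` helper are all illustrative assumptions, not anything agreed on in this thread.

```python
import time
from enum import Enum


class SeedDatasetApproximateLoadTime(Enum):
    """Hypothetical buckets; thresholds below are illustrative."""

    FAST = "fast"      # < 1 s
    MEDIUM = "medium"  # 1-10 s
    SLOW = "slow"      # >= 10 s


def bucket_load_time(loader, *, medium_threshold=1.0, slow_threshold=10.0):
    """Time a dataset-loading callable and map wall-clock duration to a bucket."""
    start = time.perf_counter()
    loader()
    elapsed = time.perf_counter() - start
    if elapsed < medium_threshold:
        return SeedDatasetApproximateLoadTime.FAST
    if elapsed < slow_threshold:
        return SeedDatasetApproximateLoadTime.MEDIUM
    return SeedDatasetApproximateLoadTime.SLOW
```

Running this once per provider and writing the resulting bucket back into the class attribute would keep the metadata static while grounding it in measured behavior.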

Contributor Author

There are two concerns I think we should separate in line with my other comments: the literal size, which is more useful for dataset filtering and for setting up AtomicAttacks, etc., and the performance, which is more useful for runtime concerns.

I actually think reviving my original idea to have a SeedDatasetMetadataUtilities class that holds logic for performance and metadata population would be better than a script, but I think it would be better to have as a follow-up because there are a lot of other design choices that aren't relevant to the metadata structuring or filtering logic.

```python
    HUGE = "huge"  # >= 5000


class SeedDatasetLoadingRank(Enum):
```
Contributor

What do you think of this being the one class attribute that isn't optional?

Contributor Author

I think it makes a lot of sense to me, but I also want to make sure new datasets can be added easily without users spending too much time adding metadata. Suggestion: loading_rank: SeedDatasetLoadingRank is included in the SeedDatasetProvider base class with the slowest/lowest default value?
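That suggestion could look roughly like this. The rank names, the ordering convention, and the IntEnum choice are my assumptions for illustration; only the idea of a non-optional `loading_rank` with the slowest default comes from the comment above.

```python
from enum import IntEnum


class SeedDatasetLoadingRank(IntEnum):
    """Hypothetical ordering: lower value = slower to load."""

    SLOW = 0
    MEDIUM = 1
    FAST = 2


class SeedDatasetProvider:
    # Non-optional, with the slowest value as the default, so new datasets
    # can be added without authors having to benchmark anything up front.
    loading_rank: SeedDatasetLoadingRank = SeedDatasetLoadingRank.SLOW


class MyNewDataset(SeedDatasetProvider):
    pass  # inherits SLOW until someone measures it
```

This keeps the field mandatory at the schema level while making the authoring cost zero, which matches the goal of not burdening contributors with metadata work.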

Contributor Author

Also a thought that just came to me. The suggestion I just made lets us make rank non-optional while scoping the actual determination of which datasets are faster/slower for a follow-up PR.

```python
# Metadata
harm_categories: list[str] = ["cybercrime", "illegal", "harmful", "chemical_biological", "harassment"]
modalities: list[SeedDatasetModality] = [SeedDatasetModality.TEXT]
size: SeedDatasetSize = SeedDatasetSize.LARGE  # 504 seeds
```
Contributor

I mention this in another comment, but harmbench actually loads super fast, which makes me think we may want a better measure.

Contributor Author

This is the role I imagined for SeedDatasetLoadingRank, since that's more of a performance measurement than a literal description of the dataset. I'll change it a bit to make that more obvious. What do you think of using SeedDatasetLoadingRank?
