
FEAT: Dataset Loading Changes#1451

Open
ValbuenaVC wants to merge 15 commits into Azure:main from ValbuenaVC:datasetloader

Conversation


@ValbuenaVC ValbuenaVC commented Mar 10, 2026

Description

Features:

  • Addition of filters argument to get_all_dataset_names, which rejects datasets that don't meet filter criteria. filters has type SeedDatasetFilters.
  • SeedDatasetProviders have two options for storing static metadata (dynamic metadata, like derived attributes at runtime, has been scoped out of this PR):
  • For remote datasets, static metadata is stored directly as named class attributes (e.g. harm_categories), using types like SeedDatasetFoobar.
  • For local datasets, static metadata is stored in the *.prompt file as tags and extracted from there.
  • In all cases, SeedDatasetMetadata acts as a unified schema and ground truth for logic related to parsing metadata, and we expect only a few class attributes to count as metadata.
  • SeedDatasetMetadata dataclass contains: tags: set[str]; size: SeedDatasetSize; modalities: list[SeedDatasetModality]; source: SeedDatasetSourceType; rank: SeedDatasetLoadingRank; harm_categories: list[str].
  • Datasets that don't have any metadata fields are excluded from filtering logic, with the sole exception that if the filter asks for tags = {"all"}, all filtering logic is bypassed.
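To make the schema and the "all" bypass concrete, here is a minimal sketch of the shape described above. The class and field names come from this PR description, but the exact signatures, the `matches` helper, and the use of immutable containers (required by a frozen dataclass) are my assumptions, not the PR's actual code.

```python
from dataclasses import dataclass
from enum import Enum


class SeedDatasetModality(Enum):
    TEXT = "text"
    IMAGE = "image"


class SeedDatasetSize(Enum):
    TINY = "tiny"
    LARGE = "large"
    HUGE = "huge"


@dataclass(frozen=True)
class SeedDatasetMetadata:
    """Unified schema for static dataset metadata (field names from the PR description)."""

    tags: frozenset[str] = frozenset()
    size: SeedDatasetSize = SeedDatasetSize.LARGE
    modalities: tuple[SeedDatasetModality, ...] = (SeedDatasetModality.TEXT,)
    harm_categories: tuple[str, ...] = ()


def matches(metadata: SeedDatasetMetadata, *, tags: set[str]) -> bool:
    """Hypothetical filter check: tags == {"all"} bypasses filtering entirely."""
    if tags == {"all"}:
        return True
    return bool(metadata.tags & tags)
```

Note the immutable defaults (`frozenset`, tuples): a frozen dataclass with mutable defaults like `set[str] = {"default"}` would either be rejected by `dataclass` or shared across instances.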

Issues: reviewers, please provide your opinion on these.

  • The necessary imports to set up metadata are very verbose (import SeedDatasetMetadata, SeedDatasetSize, ...). This could be fixed by adding constructors that build out the actual metadata enum types from primitive types, but this adds a layer of indirection.
  • Who gets responsibility of filter parsing and matching? I prefer keeping it in SeedDatasetProvider since the logic of filtering is different from the filtering fields itself, but this may be more confusing.
  • Some fields like SeedDatasetLoadingRank and SeedDatasetSourceType seem like they could be excluded.
  • It's not easy for users to interact with filters or metadata. The intuitive and Pythonic angle would be to pass dictionaries or typed dictionaries, but we need strong type safety.
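One hedged way to reconcile the dict-ergonomics vs. type-safety tension from the last issue: keep the filter object strongly typed, but offer a validating constructor that accepts primitives. Everything below (the `from_dict` classmethod, the field set) is a hypothetical sketch, not the PR's actual API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SeedDatasetSize(Enum):
    TINY = "tiny"
    LARGE = "large"


@dataclass(frozen=True)
class SeedDatasetFilters:
    tags: frozenset[str] = frozenset()
    max_size: Optional[SeedDatasetSize] = None

    @classmethod
    def from_dict(cls, raw: dict) -> "SeedDatasetFilters":
        """Accept primitives for ergonomics, but fail fast on unknown keys."""
        known = {"tags", "max_size"}
        unknown = set(raw) - known
        if unknown:
            raise ValueError(f"Unknown filter keys: {sorted(unknown)}")
        return cls(
            tags=frozenset(raw.get("tags", ())),
            max_size=SeedDatasetSize(raw["max_size"]) if "max_size" in raw else None,
        )
```

Users who want the dict style pay one conversion call; everything downstream of `get_all_dataset_names` still sees a typed, frozen object.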

Potential Follow-Up PRs:

  • Populating Metadata: I've started working on a SeedDatasetMetadataUtilities class that could parse the datasets in PyRIT and extract metadata from them programmatically, but this feels like it could be worth a second PR given that the parsing logic makes a lot of implicit decisions.
  • Dynamic Metadata: things like timestamps, exact size, and caching of changes made to remote datasets on local disk. Not possible with the class attribute and frozen dataclass approach we have currently.
  • SQL Passthrough: from SeedDatasetProvider to CentralMemory, to allow for complex operations across datasets. For example, consider a user that wants to get all text prompts containing the string "harm" from two datasets. Something like a SQL JOIN would be ideal in this situation.
  • Rich Encoding/Decoding: for metadata filtering and storage. Not quite a DSL, but something that makes it easier to convert filters and CentralMemory queries.

Tests and Documentation

  • Addition of test_seed_dataset_metadata.py under unit tests.
  • Addition of two integration tests under test_seed_dataset_provider_integration.py in tests.integration.datasets to account for local and remote metadata population.

```python
invalid_categories = {
    cat for cat in harm_categories if cat not in self.HARM_CATEGORIES
}
if invalid_categories:
    raise ValueError(
```
Contributor

I think we likely still want to load these; should we use a default harm category here?

Contributor Author

I agree. The current implementation will just crash the dataset loading if any invalid categories are detected though, so I think we have three options if we find invalid categories:

Option 1: Load them anyway, along with whatever else is in the dataset.
Option 2: Don't load them. Just pass them over and return whatever is in an accepted harm category.
Option 3: Set a default harm category, and if we find anything invalid, only return what matches the default harm category.

I feel like option 3 is the best one and the one you were describing, so I'm going to implement that unless you have any objections.
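A minimal sketch of what option 3 could look like, replacing the `raise ValueError` path above. The default category value and the helper name are placeholders I chose for illustration, not decisions made in this thread.

```python
# Hypothetical default; the actual choice would be made in the PR.
DEFAULT_HARM_CATEGORY = "harmful"
HARM_CATEGORIES = {"cybercrime", "illegal", "harmful", "chemical_biological", "harassment"}


def resolve_harm_categories(requested: list[str]) -> list[str]:
    """Option 3: if any requested category is invalid, fall back to the
    default harm category instead of crashing dataset loading."""
    invalid = {cat for cat in requested if cat not in HARM_CATEGORIES}
    if invalid:
        return [DEFAULT_HARM_CATEGORY]
    return requested
```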

@ValbuenaVC ValbuenaVC requested a review from rlundeen2 March 13, 2026 20:02
@ValbuenaVC ValbuenaVC marked this pull request as ready for review March 13, 2026 20:02
@ValbuenaVC ValbuenaVC changed the title [DRAFT] FEAT: Dataset Loading Changes FEAT: Dataset Loading Changes Mar 13, 2026
```python
modalities: list[SeedDatasetModality] = [SeedDatasetModality.TEXT]
size: SeedDatasetSize = SeedDatasetSize.LARGE  # 504 seeds
# "default" means included in curated set
tags: set[str] = {"default", "safety"}
```
Contributor

I don't like mixing the pieces with tags. Would "default" actually be part of "ranks"?

Contributor Author

Agreed, this needs a refactor.

```python
class SeedDatasetSize(Enum):
    """Ordinal size (by bucket) of the dataset."""

    TINY = "tiny"  # < 10
```
Contributor

I recommend adjusting these values; some are truly gigantic. One thing to note, though, is that it's not necessarily correlated with the number of datasets. WDYT if we write a script that can base this on time? That way we can bucket these on the time it takes to load.

So we could change this to SeedDatasetApproximateLoadTime
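A rough sketch of what that timing script could do, assuming a SeedDatasetApproximateLoadTime enum as suggested. The bucket names, thresholds, and `bucket_load_time` helper are all illustrative assumptions, not anything agreed on in this thread.

```python
import time
from enum import Enum


class SeedDatasetApproximateLoadTime(Enum):
    """Hypothetical buckets; thresholds below are illustrative."""

    FAST = "fast"      # < 1 s
    MEDIUM = "medium"  # 1-10 s
    SLOW = "slow"      # >= 10 s


def bucket_load_time(loader, *, medium_threshold=1.0, slow_threshold=10.0):
    """Time a dataset-loading callable and map wall-clock duration to a bucket."""
    start = time.perf_counter()
    loader()
    elapsed = time.perf_counter() - start
    if elapsed < medium_threshold:
        return SeedDatasetApproximateLoadTime.FAST
    if elapsed < slow_threshold:
        return SeedDatasetApproximateLoadTime.MEDIUM
    return SeedDatasetApproximateLoadTime.SLOW
```

Running this once per provider and writing the resulting bucket back into the class attribute would keep the metadata static while grounding it in measured behavior.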

Contributor Author

There are two concerns I think we should separate in line with my other comments: the literal size, which is more useful for dataset filtering and for setting up AtomicAttacks, etc., and the performance, which is more useful for runtime concerns.

I actually think reviving my original idea to have a SeedDatasetMetadataUtilities class that holds logic for performance and metadata population would be better than a script, but I think it would be better to have as a follow-up because there are a lot of other design choices that aren't relevant to the metadata structuring or filtering logic.

```python
    HUGE = "huge"  # >= 5000


class SeedDatasetLoadingRank(Enum):
```
Contributor

What do you think of this being the one class attribute that isn't optional?

Contributor Author

I think it makes a lot of sense to me, but I also want to make sure new datasets can be added easily without users spending too much time adding metadata. Suggestion: loading_rank: SeedDatasetLoadingRank is included in the SeedDatasetProvider base class with the slowest/lowest default value?
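That suggestion could look roughly like this. The rank names, the ordering convention, and the IntEnum choice are my assumptions for illustration; only the idea of a non-optional `loading_rank` with the slowest default comes from the comment above.

```python
from enum import IntEnum


class SeedDatasetLoadingRank(IntEnum):
    """Hypothetical ordering: lower value = slower to load."""

    SLOW = 0
    MEDIUM = 1
    FAST = 2


class SeedDatasetProvider:
    # Non-optional, with the slowest value as the default, so new datasets
    # can be added without authors having to benchmark anything up front.
    loading_rank: SeedDatasetLoadingRank = SeedDatasetLoadingRank.SLOW


class MyNewDataset(SeedDatasetProvider):
    pass  # inherits SLOW until someone measures it
```

This keeps the field mandatory at the schema level while making the authoring cost zero, which matches the goal of not burdening contributors with metadata work.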

Contributor Author

Also a thought that just came to me. The suggestion I just made lets us make rank non-optional while scoping the actual determination of which datasets are faster/slower for a follow-up PR.

```python
# Metadata
harm_categories: list[str] = ["cybercrime", "illegal", "harmful", "chemical_biological", "harassment"]
modalities: list[SeedDatasetModality] = [SeedDatasetModality.TEXT]
size: SeedDatasetSize = SeedDatasetSize.LARGE  # 504 seeds
```
Contributor

I mention this in another comment, but harmbench actually loads super fast, which makes me think we may want a better measure.

Contributor Author

This is the role I imagined for SeedDatasetLoadingRank, since that's more of a performance measurement than a literal description of the dataset. I'll change it a bit to make that more obvious. What do you think of using SeedDatasetLoadingRank?
