
Convert hybrid HDF5+Parquet data into wds #57

Draft
thompsonmj wants to merge 10 commits into main from hybrid2wds
@thompsonmj

Contributor

Provides the scripts needed to generate WebDataset (wds) shards from the hybrid data format, which relies on HDF5 for random access to images and on Parquet for taxonomic and other metadata.

Also addresses #26 with usage instructions to accomplish the above.

thompsonmj and others added 10 commits June 30, 2026 17:50
Reads WebP images from the HDF5 store joined with resolved-taxonomy and
uuid->h5 lookup parquet, writing .tar shards whose samples carry the ten
text-prompt sidecars used for BioCLIP-style training (scientific, taxonomic,
and common-name variants). When no vernacular common name exists, the common
name falls back to the scientific name without author citation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
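
For readers unfamiliar with the shard layout, the sketch below illustrates the idea described in the commit message above: each sample is a group of tar members sharing one key, with the image and its text sidecars distinguished by extension, and the common name falling back to the scientific name when no vernacular name exists. It is not the PR's code. The record fields (`uuid`, `image_bytes`, `scientific_name`, `common_name`, `taxonomy`), the helper names, and the three sidecar extensions are all hypothetical; the actual scripts write ten sidecars whose names are not listed here. The sketch also assumes `scientific_name` already has the author citation removed, and it takes image bytes directly instead of reading them from HDF5.

```python
import io
import tarfile


def build_prompts(record):
    """Return a {sidecar_extension: text} mapping for one sample.

    Hypothetical subset of sidecars; the real scripts write ten variants.
    Assumes record["scientific_name"] carries no author citation.
    """
    sci = record["scientific_name"]
    # Fall back to the scientific name when no vernacular name is available.
    com = record.get("common_name") or sci
    taxon = " ".join(record["taxonomy"])
    return {"sci.txt": sci, "com.txt": com, "taxon.txt": taxon}


def add_member(tar, name, data):
    """Add an in-memory bytes payload to an open tar archive."""
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def write_shard(fileobj, records):
    """Write one .tar shard; all members of a sample share the same key."""
    with tarfile.open(fileobj=fileobj, mode="w") as tar:
        for record in records:
            key = record["uuid"]
            add_member(tar, f"{key}.webp", record["image_bytes"])
            for ext, text in build_prompts(record).items():
                add_member(tar, f"{key}.{ext}", text.encode("utf-8"))


if __name__ == "__main__":
    sample = {
        "uuid": "a1",
        "image_bytes": b"placeholder image bytes",  # stand-in, not real WebP data
        "scientific_name": "Danaus plexippus",
        "common_name": None,
        "taxonomy": ["Animalia", "Arthropoda", "Insecta"],
    }
    with open("shard-000000.tar", "wb") as f:
        write_shard(f, [sample])
```

Keeping every file of a sample under one key, and adjacent in the archive, is what lets a WebDataset-style reader regroup them into a single sample while streaming the tar sequentially.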
…during the build

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>