
feat: Add Hugging Face Datasets plugin #629

Closed
andreahlert wants to merge 16 commits into flyteorg:main from andreahlert:feat/huggingface-plugin

Conversation

@andreahlert
Contributor

@andreahlert andreahlert commented Feb 7, 2026

Summary

Port of the flytekit-huggingface plugin to flyte-sdk v2, enabling native support for datasets.Dataset as a Flyte DataFrame type.

  • DataFrameEncoder/Decoder for datasets.Dataset with Parquet serialization
  • Cloud storage support (S3, GCS, Azure) via fsspec-compatible storage options
  • Anonymous S3 fallback for public datasets (mirrors Polars plugin pattern)
  • Column filtering on both encode and decode via type annotations
  • Auto-registration via flyte.plugins.types entry point

Follows the same patterns as the existing Polars plugin.
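For reference, registration via the flyte.plugins.types entry point mentioned above would look roughly like this in the plugin's pyproject.toml; the group name comes from this PR, but the module path here is illustrative, not taken from the actual code:

```toml
[project.entry-points."flyte.plugins.types"]
huggingface = "flyteplugins.huggingface"
```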

Usage Example

import flyte
import datasets

env = flyte.TaskEnvironment("hf-example")

@env.task
async def create_dataset() -> datasets.Dataset:
    return datasets.Dataset.from_dict({
        "text": ["hello", "world"],
        "label": [0, 1],
    })

@env.task
async def process(ds: datasets.Dataset) -> int:
    return len(ds)

Test plan

  • Type recognition tests (Dataset, with columns, with format annotations)
  • Handler registration and property tests
  • Encode/decode roundtrip tests
  • DataFrame wrapper integration test
  • Raw task I/O test via flyte.TaskEnvironment
  • Column subsetting on decode
  • Various data types (int, float, str, bool)
  • Empty dataset roundtrip
  • Storage options tests (S3, GCS, Azure, anonymous, unknown protocol)

@andreahlert andreahlert force-pushed the feat/huggingface-plugin branch from be91055 to 17d010a on February 7, 2026 at 19:37
@andreahlert
Contributor Author

@pingsutw @cosmicBboy @kumare3 could you take a look? This ports the huggingface plugin from flytekit to v2, same approach as the polars plugin.

Comment thread plugins/huggingface/src/flyteplugins/huggingface/df_transformer.py Outdated
@andreahlert andreahlert force-pushed the feat/huggingface-plugin branch from b5f2178 to da8a0a2 on February 8, 2026 at 07:02
Comment thread plugins/huggingface/src/flyteplugins/huggingface/df_transformer.py Outdated
@andreahlert andreahlert force-pushed the feat/huggingface-plugin branch from c5c1f84 to c73f62f on March 6, 2026 at 09:53
@andreahlert
Contributor Author

Hi Ketan! With 2.0 out I’ve rebased and addressed several of your comments: using the public storage API instead of get_storage, and public flyte.io imports where possible.

@andreahlert andreahlert requested a review from kumare3 March 6, 2026 09:57
Comment thread plugins/huggingface/src/flyteplugins/huggingface/df_transformer.py Outdated
@andreahlert andreahlert force-pushed the feat/huggingface-plugin branch 2 times, most recently from 826a958 to 633d250 on March 11, 2026 at 19:13
@andreahlert
Contributor Author

Addressed all comments. Switched both encode and decode to use pq.write_table/pq.read_table with the Flyte filesystem directly. This avoids HuggingFace's DownloadManager cache entirely and keeps credentials handled by Flyte's storage layer.

As a side effect, _get_storage_options is gone and the empty dataset test is no longer skipped since that was a from_parquet limitation. Also dropped column filtering from encode to match the Polars plugin pattern.

@andreahlert andreahlert requested a review from kumare3 March 11, 2026 19:54
@kumare3
Contributor

kumare3 commented Mar 31, 2026

Can you add an example?

I am also wondering whether we need an hf:// file system.
Cc @samhita-alla

Comment thread plugins/huggingface/README.md Outdated
@andreahlert andreahlert mentioned this pull request Mar 31, 2026
Add a new plugin that provides native support for HuggingFace
datasets.Dataset as a Flyte DataFrame type, enabling seamless
serialization/deserialization through Parquet format.

Features:
- DataFrameEncoder/Decoder for datasets.Dataset <-> Parquet
- Cloud storage support (S3, GCS, Azure) via fsspec storage options
- Anonymous S3 fallback for public datasets
- Column filtering on both encode and decode
- Auto-registration via flyte.plugins.types entry point

Signed-off-by: André Ahlert <andre@aex.partners>
Signed-off-by: André Ahlert <andre@aex.partners>
… infra

- Use storage.get_configured_fsspec_kwargs() instead of get_storage() (fix review)
- Add [tool.uv.sources] flyte editable for dev (match Anthropic/OpenAI)
- Conftest: use LocalDB._get_db_path and reset _conn (match Polars after main)
- Tests: patch flyte.storage._storage.get_storage; run.outputs()[0]; skip empty dataset to avoid CI flakiness

Signed-off-by: André Ahlert <andre@aex.partners>
Signed-off-by: André Ahlert <andre@aex.partners>
…, DataFrame

Signed-off-by: André Ahlert <andre@aex.partners>
Renamed get_hf_storage_options to _get_storage_options (no public API exposure).
Removed column filtering from encode to match the Polars plugin pattern.
Removed misleading backwards-compatibility comment on module-level registration.
Synced conftest cache isolation to use LocalDB._get_db_path.

Signed-off-by: André Ahlert <andre@aex.partners>
Replace datasets.Dataset.from_parquet with pq.read_table + datasets.Dataset(table).

from_parquet routes through HuggingFace's DownloadManager which caches files
locally under ~/.cache/huggingface/datasets/ before reading. For Flyte-managed
storage this is wasteful (double I/O) and bypasses the fsspec filesystem we
already have configured.

Using pq.read_table with the Flyte filesystem reads directly from storage with
no intermediate cache, removes the NoCredentialsError anonymous fallback, and
avoids relying on storage_options flowing through **kwargs in from_parquet.

Signed-off-by: André Ahlert <andre@aex.partners>
…lesystem

Both encode and decode now use pq.write_table/pq.read_table with the Flyte
filesystem directly, removing the asymmetry where encode went through
HuggingFace's to_parquet (fsspec.open) and decode used pyarrow.

Removes _get_storage_options entirely along with its 8 unit tests.
Enables the empty dataset test that was previously skipped due to
from_parquet not handling empty parquet files.

Signed-off-by: André Ahlert <andre@aex.partners>
Signed-off-by: André Ahlert <andre@aex.partners>
flyte-sdk 2.0 is stable, no need for pre-release flag.

Signed-off-by: Andre Ahlert <andreahlert@users.noreply.github.com>

Signed-off-by: André Ahlert <andre@aex.partners>
Signed-off-by: Andre Ahlert <andreahlert@users.noreply.github.com>

Signed-off-by: André Ahlert <andre@aex.partners>
lazy_module was added to flyte.extend in main, resolving the review
comment about avoiding _ imports.

Signed-off-by: Andre Ahlert <andreahlert@users.noreply.github.com>

Signed-off-by: André Ahlert <andre@aex.partners>
@andreahlert andreahlert force-pushed the feat/huggingface-plugin branch from 3453e93 to d40ba01 on March 31, 2026 at 04:13
@andreahlert
Contributor Author

Can you add an example?

I am also wondering whether we need an hf:// file system. Cc @samhita-alla

I think we do, at least for the read side. HF Hub auto-converts all datasets to Parquet now, so hf:// would let Flyte track what was actually consumed instead of users doing opaque load_dataset calls. The fsspec backend already exists.

Immutability is a concern since Hub data can change, but for pinned dataset revisions (hf://datasets/squad@v1.0) it works.

Write side is trickier and probably not needed for a first pass.

Open to whatever you think is best.

@andreahlert andreahlert requested a review from kumare3 March 31, 2026 04:28
@samhita-alla
Contributor

samhita-alla commented Apr 6, 2026

hey @andreahlert, thank you for creating this PR! sharing my thoughts on how we can improve the plugin from a UX perspective:


the two primary use cases i see are:

  • loading a dataset from HF Hub: to use in training or processing tasks
  • creating a dataset in a task: to pass between tasks for training and optionally publish to HF Hub

for loading, i think this should follow the same pattern as our HF model prefetch utility i.e. stream from the hub rather than fully materializing the dataset into memory. the output should be a reference (e.g. Dir) stored in cloud storage, which downstream tasks can then stream from.

for creating, when a task outputs a datasets.Dataset, we should:

  • stream-write it to remote storage
  • optionally support pushing to HF Hub

i don’t think a type transformer alone will be useful here.


i know the PR mirrors what’s in the v1 plugin, but i think this approach would make it genuinely useful for users. not sure if you have the bandwidth to work on this. if not, totally understandable. we can leave the PR open and i can take a pass at it later!

@andreahlert
Contributor Author

hey @andreahlert, thank you for creating this PR! sharing my thoughts on how we can improve the plugin from a UX perspective:


the two primary use cases i see are:

  • loading a dataset from HF Hub: to use in training or processing tasks

  • creating a dataset in a task: to pass between tasks for training and optionally publish to HF Hub

for loading, i think this should follow the same pattern as our HF model prefetch utility i.e. stream from the hub rather than fully materializing the dataset into memory. the output should be a reference (e.g. Dir) stored in cloud storage, which downstream tasks can then stream from.

for creating, when a task outputs a datasets.Dataset, we should:

  • stream-write it to remote storage

  • optionally support pushing to HF Hub

i don’t think a type transformer alone will be useful here.


i know the PR mirrors what’s in the v1 plugin, but i think this approach would make it genuinely useful for users. not sure if you have the bandwidth to work on this. if not, totally understandable. we can leave the PR open and i can take a pass at it later!

Thanks for reviewing this. I still have some time to work on this and will address it later this week.

…xample

Add hf_dataset() prefetch function that streams parquet files from HuggingFace
Hub directly to Flyte remote storage, following the same pattern as
flyte.prefetch.hf_model(). Includes fallback to snapshot download if streaming
fails.

Switch encoder to batched parquet write via pq.ParquetWriter to avoid
materializing the entire table in memory during write.

Add workflow example showing prefetch + dataset processing between tasks.

Signed-off-by: André Ahlert <andre@aex.partners>
- ParquetWriter now wrapped in try/finally to prevent resource leak on error
- Fix temp dir leak in _download_dataset_to_local (flat_dir now managed via
  TemporaryDirectory context manager)
- Remove unused commit SHA fetch from _stream_dataset_to_remote
- Add input validation for name and split params in hf_dataset()
- Narrow except clause from bare Exception to (OSError, FileNotFoundError)
- Rename store_hf_dataset_task to _store_hf_dataset_task (private convention)
- Add asyncio_mode = auto to pytest config

Signed-off-by: André Ahlert <andre@aex.partners>
- Fix multi-split filename collision: when split=None, parquet files from
  different splits now go into separate subdirectories (train/, test/, etc.)
  instead of all writing to the root and overwriting each other
- Rework example to use Dir.walk() + pq.read_table inside a task instead of
  calling data_dir.path directly, demonstrating the proper prefetch-to-task flow
- Add HF_HUB_ENABLE_HF_TRANSFER=1 env var to prefetch image (was installing
  hf-transfer but never activating it)
- Add HF_TOKEN warning when token is not set
- Rename _store_hf_dataset_task back to store_hf_dataset_task to match
  the hf_model prefetch convention
- Expand test coverage: streaming single/multi split, download fallback,
  store_hf_dataset_task integration
- Update README to document both prefetch and type transformer

Signed-off-by: André Ahlert <andre@aex.partners>
…multi-split collision

- Remove revision parameter from all signatures and HuggingFaceDatasetInfo.
  The plugin always uses refs/convert/parquet (HF's auto-converted branch),
  so the param was silently ignored.
- Fix _download_dataset_to_local to preserve relative directory structure
  instead of flattening, preventing filename collision when split=None
  downloads train/0000.parquet and test/0000.parquet.
- Add test_download_dataset_multi_split_no_collision to verify the fix.

Signed-off-by: André Ahlert <andre@aex.partners>
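The collision fix described in the commit above boils down to preserving the repo-relative path when choosing the destination; a minimal sketch (the helper name and paths are made up for illustration):

```python
from pathlib import PurePosixPath


def dest_path(out_root: str, repo_relative: str) -> str:
    """Keep the split subdirectory (train/, test/, ...) from the repo-relative
    path so files from different splits cannot collide (hypothetical helper)."""
    return str(PurePosixPath(out_root) / PurePosixPath(repo_relative))


# With the structure preserved, the two 0000.parquet files land in
# separate subdirectories instead of overwriting each other at the root.
train = dest_path("/data/squad", "train/0000.parquet")
test = dest_path("/data/squad", "test/0000.parquet")
```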
@andreahlert
Contributor Author

Hey everybody. Updated based on @samhita-alla's feedback.

Took a step back and rethought the approach. The plugin now has a hf_dataset() prefetch that streams parquet from HF Hub to Flyte storage (same pattern as flyte.prefetch.hf_model), batched writes in the encoder, and an example showing the full flow.

Not sure I got the prefetch design exactly right, so I'd appreciate your take on it.

Hub push is left out for now, figured that's better as a follow-up. WDYT?

@samhita-alla
Contributor

closing this in favor of #992

@andreahlert andreahlert deleted the feat/huggingface-plugin branch April 21, 2026 10:50