HF Buckets integration to Trainer#46386

Open
SunMarc wants to merge 10 commits into
mainfrom
buckets-trainer

@SunMarc
Member

@SunMarc SunMarc commented Jun 3, 2026

What does this PR do?

This PR adds support for HF Buckets in Trainer. Instead of storing checkpoints locally or in the Hub repository, we can store them in HF Buckets (S3-like storage with Xet deduplication).

With buckets (push_to_bucket=True), users no longer need to save checkpoints to the Hub as before (hub_strategy="checkpoint" or "all_checkpoints"), since all checkpoints are pushed to the bucket. Pushing to the Hub remains useful (hub_strategy="every_save"), as it uploads a version of the model that we can load with transformers.

Features:

  • Saving checkpoints to HF Buckets
  • Resuming training from HF Buckets

Right now, behavior should stay largely compatible with before, since we are just syncing a local directory to HF Buckets asynchronously. In the future, it would be much better to load from and save to HF Buckets directly without going through the disk, which might be useful for DeepSpeed and FSDP.
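
A minimal sketch of the "sync a local dir asynchronously" idea, not the PR's implementation. The `sync_dir` and `sync_async` helpers are hypothetical, the "bucket" is just another local directory standing in for remote storage, and the size-only change check is a simplification (it is not how Xet deduplication works).

```python
import shutil
from concurrent.futures import Future, ThreadPoolExecutor
from pathlib import Path


def sync_dir(local_dir: Path, remote_dir: Path) -> int:
    """Copy files that are missing or differ in size; return how many were copied.

    Hypothetical stand-in for a bucket sync: a real client would upload instead.
    """
    copied = 0
    for src in sorted(local_dir.rglob("*")):
        if not src.is_file():
            continue
        dst = remote_dir / src.relative_to(local_dir)
        if dst.exists() and dst.stat().st_size == src.stat().st_size:
            continue  # treated as already synced (size check only, for brevity)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied += 1
    return copied


def sync_async(executor: ThreadPoolExecutor, local_dir: Path, remote_dir: Path) -> Future:
    """Schedule the sync on a worker thread so the training loop is not blocked."""
    return executor.submit(sync_dir, local_dir, remote_dir)
```

Because the sync only reads from the local directory, saving and loading still go through the disk exactly as before; only the upload happens in the background.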

Usage

from transformers import Trainer, TrainingArguments

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="out",
        save_strategy="steps",
        push_to_bucket=True,
        bucket_id="my-org/my-run",
    ),
    train_dataset=train_dataset,
)
trainer.train()

# resume on a fresh machine: pulls the latest checkpoint from the bucket
trainer.train(resume_from_checkpoint="hf://buckets/my-org/my-run")
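
A rough sketch of what resolving such a path could involve, under stated assumptions: `parse_bucket_uri` and `latest_checkpoint` are hypothetical helpers, and the `hf://buckets/<namespace>/<name>` layout is inferred only from the example above. The `checkpoint-<step>` folder naming is the convention Trainer already uses for local checkpoints.

```python
import re
from typing import Iterable, Optional, Tuple

BUCKET_PREFIX = "hf://buckets/"
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def parse_bucket_uri(uri: str) -> Tuple[str, str]:
    """Split 'hf://buckets/<namespace>/<name>[/<path>]' into (bucket_id, path)."""
    if not uri.startswith(BUCKET_PREFIX):
        raise ValueError(f"not a bucket URI: {uri!r}")
    parts = uri[len(BUCKET_PREFIX):].strip("/").split("/", 2)
    if len(parts) < 2 or not all(parts[:2]):
        raise ValueError(f"expected <namespace>/<name> in {uri!r}")
    bucket_id = "/".join(parts[:2])
    sub_path = parts[2] if len(parts) > 2 else ""
    return bucket_id, sub_path


def latest_checkpoint(folder_names: Iterable[str]) -> Optional[str]:
    """Return the 'checkpoint-<step>' name with the highest step, or None."""
    best_step, best_name = -1, None
    for name in folder_names:
        match = _CHECKPOINT_RE.match(name)
        if match and int(match.group(1)) > best_step:
            best_step, best_name = int(match.group(1)), name
    return best_name
```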

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@stevhliu stevhliu left a comment

super nice! 🪣

Comment thread docs/source/en/trainer_recipes.md Outdated
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
@SunMarc SunMarc requested a review from qgallouedec June 4, 2026 12:26
Collaborator

@ArthurZucker ArthurZucker left a comment

Nice! I was wondering whether mounting the bucket and writing to it directly would make more sense?

  • good defaults?! since anyways

Comment thread docs/source/en/trainer_recipes.md Outdated
model=model,
args=TrainingArguments(
push_to_bucket=True,
bucket_id="my-org/my-run",
Collaborator

We could have a good default for this IMO; the fewer args the better!

Member Author

Yeah, I didn't specify it, but it defaults to hub_model_id if that is set, and to output_dir otherwise.

Member Author

so this is actually optional
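
The fallback described in this thread could look roughly like the sketch below. `resolve_bucket_id` is a hypothetical helper, not the PR's code, and using the directory name of output_dir as the bucket name is an assumption; the thread only says that it falls back to output_dir.

```python
from pathlib import Path
from typing import Optional


def resolve_bucket_id(
    bucket_id: Optional[str],
    hub_model_id: Optional[str],
    output_dir: str,
) -> str:
    """Pick the bucket id: explicit value, then hub_model_id, then output_dir."""
    if bucket_id:
        return bucket_id
    if hub_model_id:
        return hub_model_id
    # Assumption: only the final path component of output_dir is used as the name.
    return Path(output_dir).absolute().name
```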

self.push_in_progress = None # Tracks the in-flight repo push
if self.args.push_to_hub:
    self.init_hf_repo()
if self.args.push_to_bucket and self.is_world_process_zero():
Collaborator

we could have push_in_progress = True default to push to a bucket?

Member Author

This is a private attribute unrelated to pushing to a bucket. It tracks whether a checkpoint push is still in flight so that we don't interrupt it. If we want to push to a bucket by default, we would have to make push_to_bucket=True the default.
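
To illustrate the role of such an in-flight tracker, here is a small standalone sketch. `PushTracker` is a hypothetical class, not Trainer code; it only mirrors the idea that a new push is skipped while the previous one is still running, unless an "always push" option (analogous to hub_always_push) is set.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional


class PushTracker:
    """Tracks one in-flight push and decides whether a new one may start."""

    def __init__(self, always_push: bool = False) -> None:
        self.push_in_progress: Optional[Future] = None  # None means nothing was pushed yet
        self.always_push = always_push
        self._executor = ThreadPoolExecutor(max_workers=1)

    def maybe_push(self, push_fn: Callable[[], object]) -> bool:
        """Start push_fn in the background; return False if it was skipped."""
        job = self.push_in_progress
        if job is not None and not job.done() and not self.always_push:
            return False  # previous push still running, skip this one
        self.push_in_progress = self._executor.submit(push_fn)
        return True

    def wait(self) -> None:
        """Block until the most recent push has finished."""
        if self.push_in_progress is not None:
            self.push_in_progress.result()

    def close(self) -> None:
        self._executor.shutdown(wait=True)
```

The attribute holds a handle to a job rather than a boolean setting, which is why it cannot double as a "push to bucket by default" switch.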

Member

@qgallouedec qgallouedec left a comment

lgtm, just a few questions and edge cases

Comment thread src/transformers/training_args.py Outdated
revision=self.args.hub_revision,

# Full checkpoint -> repo (unchanged behavior, gated by hub_strategy).
if self.args.hub_strategy in [HubStrategy.CHECKPOINT, HubStrategy.ALL_CHECKPOINTS]:
Member

we don't need the modeling_files in this case?

Member Author

@SunMarc SunMarc Jun 4, 2026

There are three things happening sequentially:

  1. If push_to_hub is set: we update output_dir with the most recent model, tokenizer, and so on (these are the modeling_files), and upload that to the Hub.

  2. If push_to_hub is set and hub_strategy="checkpoint" or "all_checkpoints": we either update the last_checkpoint folder with the new checkpoint or upload the new checkpoint file.

  3. If push_to_bucket is set: we sync the output_dir folder with the bucket.

One thing I didn't add, but that we probably should: right now we don't update output_dir with the modeling_files when pushing to a bucket. We should probably do so, to keep parity with push to hub and avoid confusion.
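
The three steps can be sketched as a small planning function. This is an illustration of the ordering described above, not the Trainer code: `PushArgs` and `plan_push` are hypothetical, and each step is represented by a description string instead of a real upload.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PushArgs:
    """Hypothetical subset of the training arguments discussed in this thread."""
    push_to_hub: bool = False
    push_to_bucket: bool = False
    hub_strategy: str = "every_save"


def plan_push(args: PushArgs) -> List[str]:
    """Return, in order, the actions taken after a checkpoint is saved."""
    steps: List[str] = []
    if args.push_to_hub:
        # Step 1: refresh output_dir with the modeling files and upload them.
        steps.append("upload modeling files from output_dir to the Hub repo")
        # Step 2: full checkpoint to the repo, gated by hub_strategy.
        if args.hub_strategy in ("checkpoint", "all_checkpoints"):
            steps.append("upload the new checkpoint to the Hub repo")
    if args.push_to_bucket:
        # Step 3: independent of the Hub settings.
        steps.append("sync output_dir with the bucket")
    return steps
```

With only push_to_bucket=True, the plan contains just the sync step, which is the parity gap mentioned above: output_dir is not refreshed with the modeling files first.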

"""Push model files to the repo and/or sync the checkpoint to the bucket, from a checkpoint folder."""
if not self.is_world_process_zero() or self.args.hub_strategy == HubStrategy.END:
    return
# If we haven't finished the last push, we don't do this one unless args.hub_always_push=True.
Member

I think this is problematic: when push_to_bucket=True, push_to_hub=False, hub_strategy="end", this disables the bucket entirely. Would someone use push_to_hub=False and hub_strategy="end"? Maybe it could be

- self.args.hub_strategy == HubStrategy.END
+ (self.args.push_to_hub and self.args.hub_strategy == HubStrategy.END) and not self.args.push_to_bucket

something like this
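
To make the edge case concrete, the two guards can be compared as plain boolean functions. This is a sketch only: the function names are hypothetical, the strategy is a plain string instead of the HubStrategy enum, and the is_world_process_zero() part of the condition is left out.

```python
def skip_with_current_guard(push_to_hub: bool, push_to_bucket: bool, hub_strategy: str) -> bool:
    """Guard as it appears in the diff: any 'end' strategy returns early."""
    return hub_strategy == "end"


def skip_with_suggested_guard(push_to_hub: bool, push_to_bucket: bool, hub_strategy: str) -> bool:
    """Guard along the lines of the suggestion: 'end' only matters for Hub-only runs."""
    return (push_to_hub and hub_strategy == "end") and not push_to_bucket


# The problematic combination: bucket enabled, Hub disabled, strategy "end".
case = dict(push_to_hub=False, push_to_bucket=True, hub_strategy="end")
print("current guard skips:", skip_with_current_guard(**case))
print("suggested guard skips:", skip_with_suggested_guard(**case))
```

Note that the suggested form also stops skipping when push_to_hub and push_to_bucket are both enabled with hub_strategy="end", so the Hub part would then need its own check further down.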

Member Author

Nice, thanks for noticing this. I will update it!

@github-actions
Contributor

github-actions Bot commented Jun 5, 2026

CI Dashboard: View test results in Grafana

@github-actions
Contributor

github-actions Bot commented Jun 5, 2026

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46386&sha=58a878
