HF Buckets integration to Trainer#46386
Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
ArthurZucker
left a comment
There was a problem hiding this comment.
Nice! was wondering if mounting the bucket and just writing to it would not make more sense?
- good defaults?! since anyways
| model=model, | ||
| args=TrainingArguments( | ||
| push_to_bucket=True, | ||
| bucket_id="my-org/my-run", |
There was a problem hiding this comment.
we could have a good default for this IMO, the least amount of args the better!
There was a problem hiding this comment.
Yeah, I didn't precise but it will default to hub_model_id if it is set or default to output_dir otherwise.
There was a problem hiding this comment.
so this is actually optional
| self.push_in_progress = None # Tracks the in-flight repo push | ||
| if self.args.push_to_hub: | ||
| self.init_hf_repo() | ||
| if self.args.push_to_bucket and self.is_world_process_zero(): |
There was a problem hiding this comment.
we could have push_in_progress = True default to push to a bucket?
There was a problem hiding this comment.
this is a private attribute not related to push to a bucket, more an arg to track that we are still pushing a checkpoint so we shouldn't interrupt it. If we want to force pushing to a bucket, we would have to set push_to_bucket = True as the default
qgallouedec
left a comment
There was a problem hiding this comment.
lgtm, just a few question and edge cases
| revision=self.args.hub_revision, | ||
|
|
||
| # Full checkpoint -> repo (unchanged behavior, gated by hub_strategy). | ||
| if self.args.hub_strategy in [HubStrategy.CHECKPOINT, HubStrategy.ALL_CHECKPOINTS]: |
There was a problem hiding this comment.
we don't need the modeling_files in this case?
There was a problem hiding this comment.
There are three things happening sequentially:
-
if
push to hub: we update the output_dir to have the most recent model + tokenizer and so on (this is the modeling_files) and we upload that to the hub -
if
push to hub+`hub_strategy="checkpoint" or "all_checkpoints", we either update last_checkpoint folder with the new checkpoint or upload the new checkpoint file -
if
push_to_bucket, we sync the output_dir folder with the bucket
One thing that I didn't add but we should maybe is that right now we don't update the output_dir with modeling_files when pushing to bucket and this is actually something we should maybe to do have parity with push to hub and not create confusion.
| """Push model files to the repo and/or sync the checkpoint to the bucket, from a checkpoint folder.""" | ||
| if not self.is_world_process_zero() or self.args.hub_strategy == HubStrategy.END: | ||
| return | ||
| # If we haven't finished the last push, we don't do this one unless args.hub_always_push=True. |
There was a problem hiding this comment.
I think this is problematic: when push_to_bucket=True, push_to_hub=False, hub_strategy="end", this disables the bucket entirely. Would someone use push_to_hub=False and hub_strategy="end"? Maybe it could be
- self.args.hub_strategy == HubStrategy.END
+ (self.args.push_to_hub and self.args.hub_strategy == HubStrategy.END) and not push_to_bucketsomething like this
There was a problem hiding this comment.
Nice thanks for noticing this. I will update this !
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
|
CI Dashboard: View test results in Grafana |
|
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46386&sha=58a878 |
What does this PR do?
This PR adds support for HF Buckets in Trainer. Instead of storing checkpoints locally or on the hub repository, we can leverage HF Buckets for that (S3 like storage + XET dedup).
With buckets (
push_to_buckets=True), users don't need to save the checkpoints on the hub anymore like before (hub_strategy="checkpoint" or "all_checkpoints") as we push all checkpoints to the bucket. Still, pushing to the hub is still useful (hub_strategy="every_save") as it will upload a version of a model that can we load with transformers.Features:
Right now, everything should be pretty much compatible like before since we are just synching a local dir to the HF Buckets async but in the future, it would be a lot better if we are able load / save directly from / to HF Buckets without going through the disk, might be useful for deepspeed and fsdp.
Usage