feat: add progress_format option for machine-readable JSON output #1921
Merged
ArthurZucker merged 2 commits into huggingface:main on Mar 26, 2026
Add ProgressFormat enum to control how training progress is reported:
- Indicatif (default): Interactive terminal progress bars
- JsonLines: Machine-readable JSON lines to stderr
- Silent: No progress output
Changes:
- Add ProgressFormat enum to tokenizers/src/utils/progress.rs
- Add progress_format field and builder method to BpeTrainer
- Modify setup_progress() to respect progress_format
- Add emit_json_progress() helper for JSON output
- Expose progress_format getter/setter in Python bindings
- Add get_word_count() method to BpeTrainer
JSON output format:
{"stage":"Tokenize words","current":1000,"total":5000000}
This enables programmatic consumption of training progress for web UIs,
logging systems, and other non-TTY environments where indicatif progress
bars are not visible.
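A consumer of this output might look like the following minimal sketch. It assumes only the three fields shown in the example line above (stage, current, total); the helper names parse_progress_line and percent_done are hypothetical and are not part of this PR.

```python
import json
from typing import Optional


def parse_progress_line(line: str) -> Optional[dict]:
    """Parse one stderr line; return None if it is not a progress record."""
    line = line.strip()
    if not line.startswith("{"):
        return None  # ordinary log output, not a JSON object
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    # Accept only records carrying the three fields from the example line.
    if not {"stage", "current", "total"} <= record.keys():
        return None
    return record


def percent_done(record: dict) -> float:
    """Completion percentage; 0.0 when the total is zero."""
    total = record["total"]
    return 100.0 * record["current"] / total if total else 0.0


sample = '{"stage":"Tokenize words","current":1000,"total":5000000}'
record = parse_progress_line(sample)
print(record["stage"], f"{percent_done(record):.2f}%")  # Tokenize words 0.02%
```

Returning None for anything that is not a well-formed record lets a caller keep reading stderr even when other log lines are interleaved with the progress output.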
Contributor
Author
podarok added a commit to podarok/datasets that referenced this pull request on Dec 28, 2025
Similar to huggingface/tokenizers#1921, adds machine-readable JSON progress output.
- Add set_progress_format() and get_progress_format() functions
- Support 'tqdm' (default), 'json', and 'silent' formats
- Emit JSON progress every 5% when format='json'
- Export new functions from datasets.utils
Cross-reference: huggingface/tokenizers#1921
6 tasks
podarok added a commit to podarok/huggingface_hub that referenced this pull request on Dec 30, 2025
Add set_progress_format() and get_progress_format() functions to control
progress output format:
- "tqdm" (default): Interactive progress bars
- "json": Machine-readable JSON lines to stderr
- "silent": No progress output
When format is "json", emits progress every 5% as:
{"stage":"Downloading file","current":1024,"total":4096,"percent":25.0}
Similar to huggingface/tokenizers#1921 and huggingface/datasets#7920
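The "every 5%" throttling described in this commit message could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the referenced commit: the class name ThrottledJsonProgress, its parameters, and the rounding of percent to one decimal are all hypothetical.

```python
import io
import json
import sys
from typing import TextIO


class ThrottledJsonProgress:
    """Hypothetical helper: write a JSON line each time progress enters a new 5% bucket."""

    def __init__(self, stage: str, total: int, step_percent: float = 5.0,
                 stream: TextIO = sys.stderr) -> None:
        self.stage = stage
        self.total = total
        self.step_percent = step_percent
        self.stream = stream
        self._last_step = -1  # index of the last bucket that was reported

    def update(self, current: int) -> bool:
        """Report progress; return True if a line was written."""
        percent = 100.0 * current / self.total if self.total else 0.0
        step = int(percent // self.step_percent)
        if step <= self._last_step:
            return False  # still inside an already-reported bucket
        self._last_step = step
        record = {"stage": self.stage, "current": current,
                  "total": self.total, "percent": round(percent, 1)}
        # Compact separators give one object per line, as in the example above.
        self.stream.write(json.dumps(record, separators=(",", ":")) + "\n")
        self.stream.flush()
        return True


# Demo: write to an in-memory buffer instead of stderr.
buffer = io.StringIO()
progress = ThrottledJsonProgress("Downloading file", total=4096, stream=buffer)
for done in range(0, 4097, 256):
    progress.update(done)
print(buffer.getvalue(), end="")
```

Tracking the last reported bucket rather than the last reported percentage keeps the output bounded (at most 21 lines for a 5% step) no matter how often update() is called.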
ArthurZucker approved these changes on Mar 26, 2026
ArthurZucker (Collaborator) left a comment
Yeah, this should help!
It's breaking for Rust, but we'll push a major (0.23) to have it in! 🤗
  #[pyo3(
      signature = (**kwargs),
-     text_signature = "(self, vocab_size=30000, min_frequency=0, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix=None, end_of_word_suffix=None, max_token_length=None, words={})"
+     text_signature = "(self, vocab_size=30000, min_frequency=0, show_progress=True, progress_format=\"indicatif\", special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix=None, end_of_word_suffix=None, max_token_length=None, words={})"
Collaborator
let's move the arg to the end of the signature

Summary
Add a progress_format option to BpeTrainer that allows choosing between different progress output formats:
- Indicatif (default): Interactive terminal progress bars (current behavior, unchanged)
- JsonLines: Machine-readable JSON lines to stderr
- Silent: No progress output
This enables programmatic consumption of training progress for web UIs, logging systems, and other non-TTY environments where indicatif progress bars are not visible.
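As a rough sketch of such programmatic consumption, a backend could run training in a child process and read its stderr line by line. The child below is only a stand-in that simulates JSON-lines output, so the example does not depend on any particular tokenizers version; collect_progress and the CHILD script are hypothetical names used for illustration.

```python
import json
import subprocess
import sys

# Stand-in child process: it only simulates a trainer writing JSON lines
# to stderr, mixed with one ordinary log line.
CHILD = '''
import json, sys
updates = [("Tokenize words", 1000, 5000), ("Tokenize words", 5000, 5000),
           ("Count pairs", 500, 5000)]
for stage, current, total in updates:
    record = {"stage": stage, "current": current, "total": total}
    print(json.dumps(record), file=sys.stderr, flush=True)
print("plain log line", file=sys.stderr)
'''


def collect_progress(cmd: list[str]) -> dict[str, tuple[int, int]]:
    """Run cmd and return the latest (current, total) seen for each stage."""
    latest: dict[str, tuple[int, int]] = {}
    with subprocess.Popen(cmd, stderr=subprocess.PIPE, text=True) as proc:
        assert proc.stderr is not None
        for line in proc.stderr:
            try:
                record = json.loads(line)
                latest[record["stage"]] = (record["current"], record["total"])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # skip stderr lines that are not progress records
    return latest


state = collect_progress([sys.executable, "-c", CHILD])
print(state)
```

In a web UI the loop body would typically push each record to the client instead of storing it in a dictionary; keeping only the latest value per stage is just the simplest state to show here.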
Motivation
When running tokenizer training in a web application backend or logging environment, the indicatif progress bars:
This PR adds an opt-in JSON output mode that emits structured progress data:
{"stage":"Tokenize words","current":1000,"total":5000000}
{"stage":"Count pairs","current":500,"total":5000000}
{"stage":"Compute merges","current":30000,"total":65536}
Changes
Rust Core
- Add ProgressFormat enum to tokenizers/src/utils/progress.rs
- Export ProgressFormat from tokenizers/src/utils/mod.rs and tokenizers/src/lib.rs
- Add progress_format field and .progress_format() builder method to BpeTrainer
- Modify setup_progress() to only create indicatif bar when format is Indicatif
- Add emit_json_progress() helper that outputs JSON when format is JsonLines
- Add get_word_count() method to BpeTrainer for progress estimation
Python Bindings
- Add progress_format parameter to BpeTrainer constructor (accepts "indicatif", "json", "silent")
- Add progress_format getter/setter properties
- Add get_word_count() method
Usage
Backward Compatibility
- Indicatif remains the default: identical to current behavior
Test Plan
- get_word_count() returns correct count after feeding