CLI to stream JSONL or Parquet datasets, count tokens with a base-model tokenizer (default: `Qwen/Qwen3-1.7B-Base`), and write a Markdown summary report.
```bash
pip install -r requirements.txt
pip install -e .
```
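After installing, a quick smoke test (assuming the CLI exposes the conventional `--help` flag, as Python CLIs typically do):

```bash
python -m token_counter.cli --help
```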
Parquet:

```bash
python -m token_counter.cli --input data/your_file_dataset.parquet --format parquet
```

JSONL (limit to first 500 rows):

```bash
python -m token_counter.cli --input data/your_file_dataset.jsonl --format jsonl --max-docs 500
```

Hugging Face sharded Parquet (all matching files):

```bash
python -m token_counter.cli --input "hf://datasets/g4me/corpus-carolina-v2@main/data/corpus/part-*.parquet" --format parquet
```
The console script alias is also available after installation: `token-counter --input ...`
You can stream files directly from the Hugging Face Hub by passing a remote path or glob pattern to `--input`.
Public dataset shards (glob):

```bash
python -m token_counter.cli --input "hf://datasets/<org>/<dataset>@main/<folder>/part-*.parquet" --format parquet
```

Private dataset shards (PowerShell):

```powershell
$env:HF_TOKEN="hf_xxx"
python -m token_counter.cli --input "hf://datasets/<org>/<dataset>@main/<folder>/part-*.parquet" --format parquet
```
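In POSIX shells, export the token instead; `HF_TOKEN` is the standard environment variable that `huggingface_hub` reads for authentication (this assumes the CLI delegates Hub access to that library):

```bash
export HF_TOKEN="hf_xxx"
python -m token_counter.cli --input "hf://datasets/<org>/<dataset>@main/<folder>/part-*.parquet" --format parquet
```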
If the text field is not named `text`, pass `--field <column_name>`.
- `--input` (required): dataset path, URL, or glob pattern.
- `--format` (`jsonl` | `parquet`, default `parquet`): dataset format.
- `--model` (default `Qwen/Qwen3-1.7B-Base`): tokenizer to load via `transformers.AutoTokenizer`.
- `--field` (default `text`): record field containing the text to tokenize.
- `--add-special-tokens` (default `False`): include special tokens in the token count.
- `--max-docs` (int, optional): stop after N documents.
- `--trust-remote-code` (flag): allow custom tokenizer code from the model repo.
- `--report` (default `reports/token_count_report.md`): Markdown report path. Pass an empty string to skip writing the report.
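A combined example (the column name and report path below are placeholders, not defaults):

```bash
python -m token_counter.cli \
  --input data/your_file_dataset.parquet \
  --format parquet \
  --field content \
  --max-docs 1000 \
  --report reports/sample_run.md
```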
A Markdown file is generated with summary and performance metrics:
- Documents processed, total tokens, min/avg/max tokens per document.
- Wall-clock time, tokens per second, documents per second.
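For illustration only, a report might look like the excerpt below. The numbers are made up and the exact headings may differ, but the throughput figures follow directly from the totals (tokens per second = total tokens / wall-clock seconds):

```markdown
# Token Count Report

- Documents processed: 10,000
- Total tokens: 4,200,000
- Tokens per document (min / avg / max): 12 / 420 / 8,931
- Wall-clock time: 84.0 s
- Throughput: 50,000 tokens/s, 119 docs/s
```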
- Module: `python -m token_counter.cli ...`
- Console script: `token-counter ...`
- Wrapper script: `python scripts/count_tokens.py ...`
- Uses streaming (`datasets.load_dataset(..., streaming=True)`) to avoid loading full datasets into memory.
- Parquet and JSONL files can be local files, remote URLs, or remote glob patterns.
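A minimal sketch of this streaming approach (illustrative only, not the CLI's actual internals; assumes `datasets` and `transformers` are installed):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

def count_tokens(path: str, model: str = "Qwen/Qwen3-1.7B-Base",
                 field: str = "text", max_docs: int | None = None) -> dict:
    """Stream a Parquet dataset and count tokens without materializing it."""
    tokenizer = AutoTokenizer.from_pretrained(model)
    # streaming=True yields rows lazily, so memory use stays flat
    # regardless of dataset size.
    stream = load_dataset("parquet", data_files=path, split="train", streaming=True)

    docs = total_tokens = 0
    for row in stream:
        ids = tokenizer(row[field], add_special_tokens=False)["input_ids"]
        total_tokens += len(ids)
        docs += 1
        if max_docs is not None and docs >= max_docs:
            break
    return {"documents": docs, "total_tokens": total_tokens}

if __name__ == "__main__":
    print(count_tokens("data/your_file_dataset.parquet", max_docs=100))
```

The same `data_files` argument also accepts `hf://` URLs and glob patterns, which is what allows remote shards to be streamed without downloading them first.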