Add dataset type olmo_grain for AI2 OLMo numpy pretrain mixes #3749
Conversation
This Pull Request introduces a high-quality, Grain-based input pipeline for AI2's OLMo numpy datasets. The implementation is robust, well-documented, and includes a particularly clean approach to stateless resumption by deriving the data offset from the model checkpoint step.
🔍 General Feedback
- Stateless Resume: The `initial_step` logic in the sampler is an excellent design choice that avoids the complexities of Grain iterator-state serialization.
- N-gram Filtering: The integration of OLMo-core's repetition filter via a custom transform that masks instances in the loss is both efficient and sharding-friendly.
- Testing and Validation: The inclusion of unit tests, smoke scripts, and end-to-end resume tests provides great confidence in the stability of the new pipeline.
- Performance: While the in-memory permutation for shuffling is currently manageable, it's worth monitoring as dataset sizes scale further.
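The stateless-resume contract described above can be sketched in a few lines. This is an illustrative sketch, not the PR's actual code: the function name, the per-epoch seeding scheme, and the omission of epoch-boundary handling are all assumptions made here for clarity.

```python
import numpy as np

def batch_indices(step, *, seed, shard_index, shard_count,
                  batch_size, total_instances):
    # Hypothetical sketch: the records consumed at `step` are a pure
    # function of (seed, shard, step), so resuming only needs the restored
    # checkpoint step; no Grain iterator state has to be serialized.
    pos = (step * shard_count + shard_index) * batch_size
    epoch, offset = divmod(pos, total_instances)
    # Re-seed the permutation per epoch; batches straddling an epoch
    # boundary are elided here for brevity.
    perm = np.random.default_rng([seed, epoch]).permutation(total_instances)
    return perm[offset:offset + batch_size]
```

Because the mapping is pure, "resume" is just calling it with the restored step; no per-step replay is needed.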
    total_instances: ``index.total_instances`` from the OLMo index.
    seed: Base seed for the shuffle.
    shard_index: Zero-based index of this data-loading host. Typically
      ``jax.process_index()``.
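A minimal sketch of how these documented arguments could fit together: one seeded epoch permutation, then a strided slice per host. The function name is hypothetical; the real sampler does more.

```python
import numpy as np

def host_shard_indices(total_instances, seed, shard_index, shard_count):
    # Hypothetical sketch: a single seeded epoch permutation, sliced with
    # a stride so each data-loading host reads a disjoint subset.
    # shard_index would typically be jax.process_index().
    perm = np.random.default_rng(seed).permutation(total_instances)
    return perm[shard_index::shard_count]
```

The strided slices of a permutation partition the record space, so no record is read twice across hosts.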
🟡 For very large datasets (e.g., the 724M instance mix mentioned), allocating the full permutation in host memory (~5.8 GB) can be a significant spike, especially if many hosts are doing it simultaneously at an epoch boundary.
While acceptable for the current scope, consider implementing a lazy or on-disk permutation scheme if the dataset size grows further or if host memory becomes a constraint.
At this scale it's fine; we can add a chunked / Philox-keyed shuffle as a follow-up if we need something larger.
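A cycle-walking Feistel permutation is one way such a keyed, O(1)-memory shuffle could look. This is a generic sketch, not OLMo or Grain code; the mixer and round count are arbitrary choices for illustration.

```python
def _mix(x):
    # Deterministic 32-bit integer mixer (xorshift-multiply).
    x &= 0xFFFFFFFF
    x = ((x >> 16) ^ x) * 0x45D9F3B & 0xFFFFFFFF
    x = ((x >> 16) ^ x) * 0x45D9F3B & 0xFFFFFFFF
    return (x >> 16) ^ x

def feistel_permute(i, n, key, rounds=4):
    # Hypothetical keyed index permutation via a cycle-walking balanced
    # Feistel network: O(1) memory per lookup instead of materializing a
    # multi-GB permutation array on every host.
    bits = max(2, (n - 1).bit_length())
    half = (bits + 1) // 2
    mask = (1 << half) - 1
    while True:
        left, right = i >> half, i & mask
        for rnd in range(rounds):
            left, right = right, left ^ (_mix(right ^ _mix(key ^ rnd)) & mask)
        i = (left << half) | right
        if i < n:  # cycle-walk back into [0, n)
            return i
```

The Feistel rounds are a bijection on the padded power-of-two domain, and cycle-walking restricts that bijection to [0, n), so every index maps to a unique shuffled index with no table in memory.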
RissyRan left a comment
Thanks for the change!
Have you had a chance to chat with @aireenmei about this yet? I'm wondering if we could design these features to directly leverage the existing `grain_data_processing.py`. For example, we could add things like n-gram filtering and pre-tokenized numpy mixes as features there to improve code reusability.
The main benefit I see is reducing maintenance overhead. We currently maintain both tfds and c4_tfds_mlperf, but the latter is rarely used and has some maintenance issues. Since Grain will be heavily used moving forward, it makes sense to build on top of it directly. Let me know your thoughts—happy to discuss!
I haven't yet chatted with @aireenmei; I will follow up. The main reason I kept this separate is the stateless-resume contract — the record at step k must be a pure function of `(seed, shard, k)`. The general pieces, e.g. the n-gram mask, probably can be lifted into `grain_data_processing.py`.
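For reference, a generic repeated-n-gram check, a simplified stand-in for OLMo-core's actual filter (function name and semantics here are assumptions, not the library's API), might look like:

```python
def repeated_ngram_count(tokens, ngram=3):
    # Hypothetical sketch: longest run of immediately repeating n-grams,
    # detected by comparing each token with the one `ngram` positions back.
    best = run = 0
    for i in range(ngram, len(tokens)):
        run = run + 1 if tokens[i] == tokens[i - ngram] else 0
        best = max(best, run)
    # `ngram` consecutive period-`ngram` matches equal one full repetition.
    return best // ngram
```

A transform along these lines could then zero the loss mask for any instance whose repetition count exceeds a threshold, which keeps the filter sharding-friendly: no records are dropped, so batch shapes stay static.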
Description
- `dataset_type=olmo_grain`, a Grain-based input pipeline for AI2's pre-tokenized OLMo numpy mixes (e.g. `OLMo-mix-0925-official.txt`). Reads headerless `.npy` token streams from a gcsfuse mount, applies OLMo-core's repeated-n-gram filter, and yields the shapes the MaxText pretrain trainer expects.
- Stateless resume: the record at step k is a pure function of `(seed, shard, k)`. Resume reads the latest step from `config.checkpoint_dir` and shifts the sampler — no Grain iterator state in the checkpoint.
- Tooling: `download_olmo_data_to_gcs.py` with HTTP-Range resume; `build_olmo_npy_index.py` for header-scan indexing; and two launchers (`run_olmo3_7b_grain_smoke.sh`, `run_olmo3_7b_grain_resume_test.sh`).

Tests
- Unit tests (`tests/unit/input_pipeline/olmo_*`)

Checklist
Before submitting this PR, please make sure (put X in square brackets):
`gemini-review` label.

authored-by: @aireenmei
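For reference, indexing a headerless token stream as described above can be as simple as deriving counts from file size. This is a hedged sketch: the function name, the `uint32` dtype, and the 4096-token sequence length are all illustrative assumptions, not the PR's actual settings.

```python
import os
import numpy as np

def index_entry(path, dtype=np.uint32, seq_len=4096):
    # Hypothetical sketch: for a headerless token stream, token and
    # instance counts follow from file size alone, so indexing never
    # has to read the data itself.
    n_tokens = os.path.getsize(path) // np.dtype(dtype).itemsize
    return {"path": path, "tokens": n_tokens, "instances": n_tokens // seq_len}
```

Because the entry depends only on file metadata, rebuilding the index is cheap and needs no GPU or tokenizer.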