Added SFT Pre-Processing for Grain Input Pipeline #3437
ajkv-google wants to merge 4 commits into main from
Conversation
Codecov Report ❌ Patch coverage is
vlad-karp
left a comment
It would also be great to test not only with the MaxText general SFT pipeline but with the distillation SFT pipeline as well.
    grain_per_worker_buffer_size,
):
  """Use grain pipeline to pre-process the dataset and return iterators for sft fine-tuning"""
  if config.grain_file_type == "arrayrecord":
There is exactly this block in pretrain_preprocessing_pipeline(); since the code there is almost identical, it would be great to reuse it.
Yes, there are many common operations between pretrain and SFT. I think it's a good idea to extract the common pattern into util functions.
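One way to factor the shared file-type handling out of both pipelines is a small dispatch helper that both pretrain and SFT preprocessing call. This is a hypothetical sketch, not MaxText's actual code: the function name, registry, and loader callables are assumptions for illustration.

```python
def get_source_loader(grain_file_type, loaders):
  """Look up the source loader for a Grain file type from a shared registry."""
  try:
    return loaders[grain_file_type]
  except KeyError:
    raise ValueError(f"Unsupported grain_file_type: {grain_file_type!r}")


# Example registry; in MaxText these would be the real ArrayRecord/Parquet loaders.
LOADERS = {
    "arrayrecord": lambda paths: ("arrayrecord", paths),
    "parquet": lambda paths: ("parquet", paths),
}
```

Both pipelines could then call `get_source_loader(config.grain_file_type, LOADERS)` instead of duplicating the `if config.grain_file_type == ...` branches.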
  else:
    dataset = dataset.map(input_pipeline_utils.KeepFeatures(feature_names=data_columns))
  tokenizer_model = tokenizer.build_tokenizer(
  if getattr(config, "chat_template", None) and hasattr(tokenizer_model, "chat_template"):
    tokenizer_model.chat_template = config.chat_template
  supported_columns = [["prompt", "completion"], ["messages"], ["question", "answer"]]
This again repeats logic from the HF pipeline; one could extract utility routines to avoid the code repetition.
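A shared column-validation utility could capture the supported-columns check that both the HF and Grain SFT pipelines repeat. This is a minimal sketch with assumed names, not MaxText's existing API:

```python
# Column sets the SFT pipelines accept, as shown in the diff above.
SUPPORTED_COLUMNS = [["prompt", "completion"], ["messages"], ["question", "answer"]]


def validate_sft_columns(data_columns, supported=SUPPORTED_COLUMNS):
  """Return True if data_columns matches one of the supported column sets."""
  return any(set(data_columns) == set(cols) for cols in supported)
```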
    messages = [{"role": "user", "content": element["prompt"]}, {"role": "assistant", "content": element["completion"]}]
  elif set(data_columns) == {"question", "answer"}:
    messages = [{"role": "user", "content": element["question"]}, {"role": "assistant", "content": element["answer"]}]
  else:
The HF pipeline asserts that SFT is running on a conversational format.
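The conversion the diff performs can be summarized as a pure function mapping a row to chat-style messages. The function name here is hypothetical; the field names and message structure follow the diff:

```python
def to_messages(element, data_columns):
  """Map a prompt/completion or question/answer row to chat-style messages."""
  cols = set(data_columns)
  if cols == {"prompt", "completion"}:
    return [{"role": "user", "content": element["prompt"]},
            {"role": "assistant", "content": element["completion"]}]
  if cols == {"question", "answer"}:
    return [{"role": "user", "content": element["question"]},
            {"role": "assistant", "content": element["answer"]}]
  if cols == {"messages"}:
    # Already in conversational format.
    return element["messages"]
  raise ValueError(f"Unsupported columns: {data_columns}")
```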
| ) | ||
| ) | ||
|
|
||
  dataset = dataset.map(
Any pros/cons of doing it via Grain operations like in the HF pipeline?
dataset.map is the newer and recommended way of using Grain
  # global_batch_size_to_load has been expanded in pyconfig.py when expansion_factor_real_data > 1.
  batch_size = int(batch_size // config.expansion_factor_real_data)
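The arithmetic in that comment can be illustrated with made-up numbers: pyconfig expands the batch size to load by `expansion_factor_real_data`, and the pipeline divides it back out before batching.

```python
# Illustrative values only; the real numbers come from the MaxText config.
expansion_factor_real_data = 2
global_batch_size_to_load = 64  # already expanded in pyconfig.py
batch_size = int(global_batch_size_to_load // expansion_factor_real_data)
```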
  if config.packing:
This block is again identical to the one in pretrain_preprocessing_pipeline(). Also, it would be great to check whether it needs some SFT-related modifications.
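As a rough illustration of what the packing block does conceptually (not MaxText's implementation, which handles segment IDs and positions), here is a greedy first-fit packer that concatenates token sequences up to a maximum length:

```python
def pack_sequences(sequences, max_len):
  """Greedily pack token sequences into bins of at most max_len tokens."""
  bins = []
  for seq in sequences:
    for b in bins:
      if len(b) + len(seq) <= max_len:
        b.extend(seq)  # fits in an existing bin
        break
    else:
      bins.append(list(seq))  # start a new bin
  return bins
```

For SFT specifically, packing also has to keep prompt/completion masking consistent across packed segments, which is one thing to check against the pretrain version.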
    dataset = dataset.batch(batch_size, batch_fn=batch_fn)
  # Shift inputs for teacher-forced training
  dataset = dataset.map(
Should it always be executed in a generic sft_preprocessing_pipeline()?
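The shift the comment describes is the standard teacher-forcing transform: targets are the inputs shifted left by one token. This is a minimal stand-in with an assumed function name and padding choice, not MaxText's exact implementation:

```python
def shift_for_teacher_forcing(tokens, pad_id=0):
  """Produce inputs/targets where targets[i] = tokens[i + 1] (padded at the end)."""
  return {"inputs": list(tokens), "targets": list(tokens[1:]) + [pad_id]}
```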
| ) | ||
| ) | ||
|
|
||
  dataset = dataset.map(
dataset.map is the newer and recommended way of using Grain
  ), f"Dataset column names mismatch. Expected columns to match one of {supported_columns}, but got {data_columns}"

  dataset = dataset.map(
      functools.partial(_format_chat_template_grain, data_columns=data_columns, tokenizer_model=tokenizer_model)
The hf pipeline calls instruction_data_processing.convert_to_conversational_format, do we support the same conversion here? https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/input_pipeline/hf_data_processing.py#L261
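Per the linked source, the HF pipeline's convert_to_conversational_format rewrites prompt/completion-style rows into a single "messages" column before templating. A minimal sketch of that idea (field names assumed, not the actual MaxText function body):

```python
def convert_to_conversational_format(element):
  """Rewrite a prompt/completion row into a single 'messages' column."""
  if "messages" in element:
    return element  # already conversational
  return {"messages": [
      {"role": "user", "content": element["prompt"]},
      {"role": "assistant", "content": element["completion"]},
  ]}
```

Supporting the same conversion in the Grain path would let both pipelines feed identical conversational rows into the chat-template step.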
Description
This PR introduces SFT support to the Grain input pipeline by adding a separate
sft_preprocessing_pipelinefunction. Rather than cluttering the existing pretrain code, it uses simple conditionals inside the train and eval iterators to route to this new SFT logic. I followed the existing Hugging Face SFT implementation and adapted its logic to be compatible with Grain's element-wise datasets.Tests
I added a unit test to verify end-to-end functionality to make sure the Grain SFT pipeline formats the data and outputs correctly. Ran this command to execute the unit test:
pytest tests/unit/grain_data_processing_test.py::GrainSFTParquetProcessingTest -v
This is the output of the test: Test Passed Output
Also, ran the training pipeline in Maxtext with sft enabled using a grain dataset with this command:
python3 -m maxtext.trainers.post_train.sft.train_sft src/maxtext/configs/post_train/sft.yml run_name=test_grain_sft dataset_type=grain grain_file_type=parquet grain_train_files=gs://maxtext-dataset/hf/ultrachat_200k/train_sft-*.parquet steps=10 tokenizer_type=huggingface tokenizer_path=HuggingFaceH4/zephyr-7b-beta
Verified that the SFT processing changes worked and trained successfully: Logs
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-review label.