Skip to content

Add configurable train_on_eos for conversation data preparation#535

Closed
jlamypoirier wants to merge 1 commit into
mainfrom
jlp_train-on-eos
Closed

Add configurable train_on_eos for conversation data preparation#535
jlamypoirier wants to merge 1 commit into
mainfrom
jlp_train-on-eos

Conversation

@jlamypoirier
Copy link
Copy Markdown
Collaborator

Authored by Claude Opus 4.8 (with @jlamypoirier).

Splits out — as a standalone, opt-in flag — the loss-masking change that was bundled into #473.

What

Add train_on_eos: bool = False to ConversationSourceConfig. It controls whether the end-of-sequence token that tokenize_chat appends after the final message is included in the training loss:

  • false (default): the appended EOS is masked from the loss — unchanged behavior.
  • true: the appended EOS becomes a training target.

Threaded through Tokenizer.tokenize_chat as a train_on_eos parameter. Per-turn terminators emitted by the chat template (e.g. ChatML <|im_end|>) are unaffected — they remain governed by the template's {% generation %} markers. This flag only touches the single sequence-terminating EOS that the tokenizer appends when no EOS already appears in the conversation.

Why

Masking the terminal EOS means the model never gets a loss signal to emit end-of-sequence — a well-known cause of models that don't stop generating at inference. Training on the final/assistant EOS is the common recommendation, and frameworks expose it as a knob (Axolotl train_on_eos: turn|all|last; TRL assistant_only_loss, which includes the assistant turn's EOS). It's also an asymmetry within Fast-LLM today: the document path already trains on its appended EOS (unmasked tokens all contribute to the loss), while the conversation path masks it. This flag lets conversation prep opt into the same behavior.

Kept off by default so existing datasets' loss masking is unchanged.

Testing

tests/data/test_tokenizer.py (incl. the new test_tokenize_chat_train_on_eos) and test_preparator.py pass (41). The new test asserts that train_on_eos changes only the appended EOS's loss mask, not the tokens.

Note

Touches the same tokenize_chat call site as #534 (no-BOS prep); whichever merges first, the other needs a one-line rebase.

🤖 Generated with Claude Code

Add a `train_on_eos` flag (default `False`) to `ConversationSourceConfig`
controlling whether the end-of-sequence token appended after the final
message is included in the training loss. When disabled (the default,
unchanged behavior) that token is masked from the loss; when enabled it
becomes a training target. Threaded through `tokenize_chat` as a
`train_on_eos` parameter.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jlamypoirier
Copy link
Copy Markdown
Collaborator Author

Claude Opus 4.8 note: Folded into #534 along with the no-BOS data-prep changes (one PR instead of two, per maintainer preference). The train_on_eos flag is unchanged. Closing in favor of #534.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant