feat: add util for tokenizer pad id #310

Merged
stephantul merged 3 commits into main from fix-tokenizer-padding
Mar 12, 2026

Conversation

@stephantul
Contributor

The padding token in our classifier module defaulted to [PAD], which works for our own models (those based on bert-base-uncased) but not for, e.g., e5-based models. Users of such models had to look up their padding token manually and set it.

Because we now use skeletoken, we actually store the model's pad token, so we can just look it up. I introduced a new utility that tries to guess the pad token and shows a warning if we default to 0.

pad_token is still an accepted argument but now defaults to None, so this is a backwards-compatible change. Users who never passed this argument are unaffected, because [PAD] is one of the tokens we guess as a padding token.
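
For readers who want a feel for the lookup order described above, here is a minimal sketch. It is not the code from this PR: the function name `guess_pad_token_id`, the `stored_pad_token` parameter, the candidate list, and the choice to raise on an unknown explicit token are all assumptions made for illustration, and the vocabularies are toy data rather than real tokenizer contents.

```python
from __future__ import annotations

import warnings
from collections.abc import Mapping, Sequence


def guess_pad_token_id(
    vocab: Mapping[str, int],
    pad_token: str | None = None,
    stored_pad_token: str | None = None,
    candidates: Sequence[str] = ("[PAD]", "<pad>"),
) -> int:
    """Hypothetical helper: resolve a padding token id, warning on a fallback to 0."""
    # 1. An explicit argument wins, so callers who already pass pad_token
    #    keep their current behaviour. Raising here is an assumption.
    if pad_token is not None:
        if pad_token not in vocab:
            raise ValueError(f"pad_token {pad_token!r} is not in the vocabulary")
        return vocab[pad_token]

    # 2. Otherwise prefer the pad token stored alongside the tokenizer, if any.
    if stored_pad_token is not None and stored_pad_token in vocab:
        return vocab[stored_pad_token]

    # 3. Otherwise try common padding token names (assumed list).
    for candidate in candidates:
        if candidate in vocab:
            return vocab[candidate]

    # 4. Nothing matched: default to 0, but tell the user about it.
    warnings.warn(
        "Could not determine a padding token; defaulting to id 0.",
        stacklevel=2,
    )
    return 0


# Toy vocabularies for illustration only.
bert_style_vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 2}
other_vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "hello": 3}

print(guess_pad_token_id(bert_style_vocab))                       # 0
print(guess_pad_token_id(other_vocab, stored_pad_token="<pad>"))  # 1
```

Checking the explicit argument first is what keeps the change backwards compatible in this sketch: a caller who passes pad_token gets exactly that token, and a caller who passes nothing on a [PAD]-based vocabulary still ends up with the id of [PAD].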

@codecov

codecov bot commented Mar 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Files with missing lines Coverage Δ
model2vec/train/base.py 97.97% <100.00%> (+0.08%) ⬆️
model2vec/train/utils.py 100.00% <100.00%> (ø)

@stephantul stephantul requested a review from Pringled March 12, 2026 15:45
Member

@Pringled Pringled left a comment

LGTM

@stephantul stephantul merged commit e707803 into main Mar 12, 2026
5 checks passed
@stephantul stephantul deleted the fix-tokenizer-padding branch March 12, 2026 16:57
