# TensorFlow-Text 2.20.0 FastWordpieceTokenizer Bug Analysis

## Bug Summary
`FastWordpieceTokenizer` fails with `"Cannot find unk_token in the vocab!"` when:

- Vocabulary size ≥ 7, AND
- The `unknown_token` is NOT the last element in the vocabulary

This happens despite the `unknown_token` being present in the vocabulary.
## Reproduction

```python
import tensorflow_text as tf_text

# ✓ Works: 6 tokens
vocab = ['[UNK]', 'token1', 'token2', 'token3', 'token4', 'token5']
tokenizer = tf_text.FastWordpieceTokenizer(
    vocab=vocab, unknown_token='[UNK]', no_pretokenization=True)

# ✗ Fails: 7 tokens, unknown_token at position 0
vocab = ['[UNK]', 'token1', 'token2', 'token3', 'token4', 'token5', 'token6']
tokenizer = tf_text.FastWordpieceTokenizer(
    vocab=vocab, unknown_token='[UNK]', no_pretokenization=True)
# RuntimeError: Cannot find unk_token in the vocab!

# ✓ Works: 7 tokens, unknown_token at LAST position
vocab = ['token1', 'token2', 'token3', 'token4', 'token5', 'token6', '[UNK]']
tokenizer = tf_text.FastWordpieceTokenizer(
    vocab=vocab, unknown_token='[UNK]', no_pretokenization=True)
```
## Root Cause

File: `tensorflow_text/core/kernels/string_vocab.cc`
Function: `StringVocab::StringVocab()` constructor
Lines: 20-25

```cpp
StringVocab::StringVocab(const std::vector<std::string>& vocab)
    : vocab_(vocab) {
  for (int i = 0; i < vocab.size(); ++i) {
    index_map_[vocab_[i]] = i;  // BUG: no reserve() before the loop
  }
}
```
### The Bug

- The constructor builds an `absl::flat_hash_map<absl::string_view, int>` by inserting vocabulary tokens one by one
- No `index_map_.reserve(vocab.size())` is called before the loop
- When the hash map reaches its maximum load factor (~0.875), it triggers a rehash/resize
- For vocabularies of size ≥ 7, this rehash occurs during insertion
- There appears to be a bug in how `absl::flat_hash_map` handles `string_view` keys during rehashing
### Why Position Matters

- **Last position works**: When `unknown_token` is at the last position, it is inserted AFTER any rehashing occurs
- **Earlier positions fail**: When `unknown_token` is inserted before the rehash point, something goes wrong during rehashing that causes subsequent lookups to fail
## Evidence

Testing shows the exact pattern:

| Vocab Size | unknown_token Position | Result    |
|------------|------------------------|-----------|
| 6          | first/middle/last      | ✓ SUCCESS |
| 7          | first/middle           | ✗ FAILED  |
| 7          | last                   | ✓ SUCCESS |
| 8          | first/middle           | ✗ FAILED  |
| 8          | last                   | ✓ SUCCESS |

The failure occurs precisely when:

- Vocabulary size ≥ 7 (triggers hash map rehashing)
- `unknown_token` is inserted before the rehash occurs
## Why It Works with `no_pretokenization=True`

Setting `no_pretokenization=True` disables the punctuation-skipping logic, allowing `"unk_token"` to be added to the trie.
## Why Vocabulary Size Matters

Note: the exact reason for the vocabulary size threshold (≥ 7) is unclear without deeper C++ debugging, but there are several hypotheses:

- **Different code paths**: The C++ implementation may use different algorithms or optimizations based on vocabulary size, potentially affecting how token lookups are performed.
- **Hash map vs linear search**: Smaller vocabularies might use linear search while larger ones use hash-based lookups, where string comparison or hashing behavior differs.
- **Memory layout effects**: With larger vocabularies, memory allocation patterns or string storage mechanisms might change, affecting how `"unk_token"` is stored or compared.
- **Trie construction dependencies**: The issue occurs during `vocab_->LookupId(unk_token_)`, which should query the original `StringVocab`, not the trie. However, there may be interdependencies where trie construction side effects influence the original vocabulary lookup mechanism.
- **Internal optimizations**: The FastWordpiece implementation may have size-based optimizations that trigger different behavior at the 7-token threshold.
### Current Evidence

- Works: vocabularies with ≤ 6 tokens
- Fails: vocabularies with ≥ 7 tokens containing `"unk_token"`
- The error occurs in `StringVocab::LookupId()`, not during trie operations
- Setting `no_pretokenization=True` bypasses the issue entirely
## Impact

- **KerasHub**: Breaks `WordPieceTokenizer` for BERT/DistilBERT models when using tensorflow-text 2.20+
- **LiteRT export**: Causes export tests to fail
- **Workaround**: Either:
  - Use `no_pretokenization=True` (not always appropriate)
  - Use a different token name without punctuation (e.g., `"<unk>"` instead of `"unk_token"`)
  - Skip tests gracefully on affected systems (current KerasHub approach)
## Regression

This appears to be a regression introduced in tensorflow-text 2.20.0. Earlier versions did not exhibit this behavior.

## Recommendation

Submit a bug report to the tensorflow-text repository with the minimal reproduction script.