Remove Google.Protobuf dependency from Microsoft.ML.Tokenizers #7587
stephentoub wants to merge 3 commits into dotnet:main
Conversation
Instead of an ~4700 line SentencepieceModel.cs and the Google.Protobuf dependency, we can use an ~400 line minimal parser implementation.
cc: @crickman
Pull request overview
This PR removes the Google.Protobuf dependency from Microsoft.ML.Tokenizers by replacing the large generated SentencepieceModel.cs with a small, purpose-built protobuf wire-format reader/parser for the subset of SentencePiece fields the tokenizer consumes.
Changes:
- Replace the generated SentencePiece protobuf model types with a minimal internal parser (SentencepieceModel.cs).
- Remove the Google.Protobuf package reference and adjust warning suppression to project-level.
- Add/extend tests covering SentencePieceTokenizer.Create(...) / LlamaTokenizer.Create(...) null/invalid stream scenarios.
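For context on what "a minimal wire-format reader" entails: the protobuf wire format boils down to varint decoding and tag dispatch. A hypothetical Python sketch of the two primitives (illustrative only, not the PR's C# code; the function names are my own):

```python
def read_varint(data: bytes, pos: int) -> tuple[int, int]:
    """Decode a base-128 varint; returns (value, next position)."""
    result = shift = 0
    while True:
        b = data[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):  # high bit clear marks the last byte
            return result, pos
        shift += 7

def read_tag(data: bytes, pos: int) -> tuple[int, int, int]:
    """A field tag is itself a varint: (field_number << 3) | wire_type."""
    tag, pos = read_varint(data, pos)
    return tag >> 3, tag & 0x7, pos

# 0x96 0x01 decodes to 150; tag byte 0x0A is field 1, wire type 2.
value, _ = read_varint(b"\x96\x01", 0)  # → 150
```

Every message field in a SentencePiece model file is reachable with just these two operations plus a length read for wire type 2.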
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| test/Microsoft.ML.Tokenizers.Tests/SentencePieceTests.cs | Adds basic factory/stream validation tests for SentencePieceTokenizer. |
| test/Microsoft.ML.Tokenizers.Tests/LlamaTests.cs | Adds a null-stream guard test for LlamaTokenizer.Create. |
| src/Microsoft.ML.Tokenizers/SentencepieceModel.cs | Replaces generated protobuf code with a minimal wire-format reader and parsers for ModelProto/TrainerSpec/NormalizerSpec. |
| src/Microsoft.ML.Tokenizers/Model/WordPieceOptions.cs | Removes local warning pragmas now suppressed at project level. |
| src/Microsoft.ML.Tokenizers/Model/BertOptions.cs | Removes local warning pragmas now suppressed at project level. |
| src/Microsoft.ML.Tokenizers/Microsoft.ML.Tokenizers.csproj | Drops Google.Protobuf reference and adds NoWarn for MSML_NoInstanceInitializers. |
/// <summary>Lightweight replacement for Google.Protobuf.ByteString with a Span property.</summary>
internal readonly struct SentencePieceByteString(byte[] data, int offset, int length)
{
    internal ReadOnlySpan<byte> Span => data.AsSpan(offset, length);
SentencePieceByteString.Span will throw a NullReferenceException when PrecompiledCharsmap is left at its default struct value (e.g., empty/invalid model or missing normalizer_spec/precompiled_charsmap). Normalizer construction expects an empty span to be safe, so SentencePieceByteString should handle a null backing array (or NormalizerSpec should initialize PrecompiledCharsmap to an explicit empty value).
- internal ReadOnlySpan<byte> Span => data.AsSpan(offset, length);
+ internal ReadOnlySpan<byte> Span => data is null ? ReadOnlySpan<byte>.Empty : data.AsSpan(offset, length);
[Automated Copilot response]
Addressed. The Span property now null-guards the backing array, returning ReadOnlySpan<byte>.Empty for the default struct case (e.g. when NormalizerSpec has no precompiled_charsmap). The SentencePieceNormalizer constructor handles empty spans safely via its IsEmpty early-return. Covered by Synthetic_BpeModel_DefaultProperties and Synthetic_BpeModel_EmptyNormalizerSpecUsesDefaults tests.
@ericstj / @tarekgh, the nuget config looks broken.

The feed was deleted -- we can remove it since the packages are now public.
internal TrainerSpec TrainerSpec { get; private set; } = new();
internal NormalizerSpec NormalizerSpec { get; private set; } = new();

internal static ModelProto Parse(byte[] data, int start, int end)
Do we have a gamut of tests that give this thing every version of sentencepiece.bpe.model that was ever shipped? Will this be resilient to versioning in the same way as the old implementation? I love the reduction in dependencies and deletion of the generated code, but we are adding a parser (albeit for trusted code) and it needs to meet compatibility and versioning requirements.
[Automated Copilot response]
Great question. Here's how compatibility and versioning are handled:
Forward compatibility by design. The parser skips unknown field numbers (via the default case in every switch), and unknown wire types are handled by SkipField. This mirrors protobuf's core forward-compat guarantee: a parser can read data from a newer proto version without crashing; it just ignores the fields it doesn't recognize. The sentencepiece_model.proto also defines `extensions 200 to max;` on several messages, which our parser handles the same way.
Wire type mismatch tolerance. The when wireType == 2 guards on known fields mean that if a future proto version changes a field's wire type, it falls through to skip instead of misinterpreting the data.
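The skip-and-tolerate behavior described above is easy to show in miniature. A hypothetical Python sketch (not the PR's C# code; names are my own) of a parser that reads one known field and skips everything else, including a known field number arriving with an unexpected wire type:

```python
def read_varint(data, pos):
    result = shift = 0
    while True:
        b = data[pos]; pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7

def skip_field(data, pos, wire_type):
    """Advance past a field payload without interpreting it."""
    if wire_type == 0:                       # varint
        _, pos = read_varint(data, pos)
        return pos
    if wire_type == 1:                       # fixed64
        return pos + 8
    if wire_type == 2:                       # length-delimited
        length, pos = read_varint(data, pos)
        return pos + length
    if wire_type == 5:                       # fixed32
        return pos + 4
    raise ValueError(f"unsupported wire type {wire_type}")

def parse(data):
    """Read field 1 only when it arrives as a varint; skip all else."""
    pos, known = 0, None
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field, wire = tag >> 3, tag & 0x7
        if field == 1 and wire == 0:         # known field, expected wire type
            known, pos = read_varint(data, pos)
        else:                                # unknown field or wire-type mismatch
            pos = skip_field(data, pos, wire)
    return known
```

Feeding it a message with an unknown field 2, or field 1 re-encoded with a different wire type, leaves the parse intact instead of misreading bytes as values.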
Test coverage for real model files. The existing tests exercise three real model files from the Microsoft.ML.TestTokenizers package: Llama (BPE), Mistral (BPE), and Paraphrase-MiniLM (Unigram). These cover both model types and different normalizer configurations. Adding more models (e.g. T5, Gemma) would require updating that external package — worth doing as a follow-up.
Synthetic protobuf tests. In addition to real files, the latest push adds 49 tests that construct model files from raw wire-format bytes, exercising every parser code path: all field types, all wire types (read and skip), non-canonical varints, negative int32 (10-byte varint), extension-range field numbers, forward-compat patterns (wire type mismatch, unknown fields at every message level), submessage boundary handling, MemoryStream position offsets, and error handling for truncated/malformed data.
One known semantic difference from Google.Protobuf. For non-repeated message fields (TrainerSpec, NormalizerSpec), if the same field number appears twice, our parser uses the last occurrence (last-wins) rather than merging. This is documented in a code comment and validated by tests (MultipleTrainerSpecLastWins, MultipleTrainerSpecFieldsNotMerged). SentencePiece's own serializer never produces duplicate top-level messages, so this difference is moot in practice.
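The last-wins vs. merge distinction can be made concrete with a toy outer message whose non-repeated message field 1 appears twice. A hypothetical Python sketch (illustrative only; Google.Protobuf's actual behavior for duplicate non-repeated message fields is to merge, which the `merge=True` branch imitates):

```python
def read_varint(data, pos):
    result = shift = 0
    while True:
        b = data[pos]; pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7

def parse_spec(data):
    """Inner message: varint-valued fields keyed by field number."""
    pos, fields = 0, {}
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        value, pos = read_varint(data, pos)  # all varint for brevity
        fields[tag >> 3] = value
    return fields

def parse_model(data, merge=False):
    """Outer message; assumes every field is field 1, wire type 2."""
    pos, spec = 0, {}
    while pos < len(data):
        _, pos = read_varint(data, pos)      # tag (ignored in this sketch)
        length, pos = read_varint(data, pos)
        sub = parse_spec(data[pos:pos + length])
        pos += length
        spec = {**spec, **sub} if merge else sub  # merge vs. last-wins
    return spec

# Field 1 twice: first submessage sets inner field 1, second sets inner field 2.
msg = b"\x0a\x02\x08\x01" + b"\x0a\x02\x10\x02"
```

With last-wins only the second submessage survives; with merge the two are combined field-by-field. As the response above notes, SentencePiece never emits duplicate top-level messages, so the difference does not surface with real model files.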
We have good tests using the actual tokenizer data for models like Llama, Mistral, and Paraphrase-multilingual-MiniLM-L12. These cover the SentencePiece Bpe and Unigram models we support too.
Regarding versions, protobuf serialized data depends only on field numbers and wire types, which are agnostic across versions (like proto2 vs. proto3). Maybe we only need to ensure the data we need is read, like the Pieces table for instance.
Codecov Report: ❌ Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
## main #7587 +/- ##
==========================================
+ Coverage 69.07% 69.54% +0.47%
==========================================
Files 1483 1484 +1
Lines 274513 273206 -1307
Branches 28285 27919 -366
==========================================
+ Hits 189625 190011 +386
+ Misses 77503 75831 -1672
+ Partials 7385 7364 -21
- Handle null backing array in SentencePieceByteString.Span (default struct)
- Add big-endian guard to ReadFloat
- Make ModelProto.Parser readonly
- Optimize ParseFrom with MemoryStream.TryGetBuffer fast-path and pre-sized fallback
- Document last-wins vs merge semantics for non-repeated message fields
- Add comprehensive synthetic protobuf tests (49 SentencePiece tests total) covering wire format edge cases, forward compatibility, error handling, all field types, and stream path variations

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>