
Remove Google.Protobuf dependency from Microsoft.ML.Tokenizers#7587

Open

stephentoub wants to merge 3 commits into dotnet:main from stephentoub:protobufdep

Conversation

@stephentoub
Member

Instead of an ~4700 line SentencepieceModel.cs and the Google.Protobuf dependency, we can use an ~400 line minimal parser implementation.

@stephentoub
Member Author

cc: @crickman

Contributor

Copilot AI left a comment


Pull request overview

This PR removes the Google.Protobuf dependency from Microsoft.ML.Tokenizers by replacing the large generated SentencepieceModel.cs with a small, purpose-built protobuf wire-format reader/parser for the subset of SentencePiece fields the tokenizer consumes.

Changes:

  • Replace the generated SentencePiece protobuf model types with a minimal internal parser (SentencepieceModel.cs).
  • Remove the Google.Protobuf package reference and adjust warning suppression to project-level.
  • Add/extend tests covering SentencePieceTokenizer.Create(...) / LlamaTokenizer.Create(...) null/invalid stream scenarios.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.

Summary per file:

  • test/Microsoft.ML.Tokenizers.Tests/SentencePieceTests.cs: Adds basic factory/stream validation tests for SentencePieceTokenizer.
  • test/Microsoft.ML.Tokenizers.Tests/LlamaTests.cs: Adds a null-stream guard test for LlamaTokenizer.Create.
  • src/Microsoft.ML.Tokenizers/SentencepieceModel.cs: Replaces generated protobuf code with a minimal wire-format reader and parsers for ModelProto/TrainerSpec/NormalizerSpec.
  • src/Microsoft.ML.Tokenizers/Model/WordPieceOptions.cs: Removes local warning pragmas now suppressed at project level.
  • src/Microsoft.ML.Tokenizers/Model/BertOptions.cs: Removes local warning pragmas now suppressed at project level.
  • src/Microsoft.ML.Tokenizers/Microsoft.ML.Tokenizers.csproj: Drops Google.Protobuf reference and adds NoWarn for MSML_NoInstanceInitializers.


/// <summary>Lightweight replacement for Google.Protobuf.ByteString with a Span property.</summary>
internal readonly struct SentencePieceByteString(byte[] data, int offset, int length)
{
    internal ReadOnlySpan<byte> Span => data.AsSpan(offset, length);
}

Copilot AI Mar 10, 2026


SentencePieceByteString.Span will throw a NullReferenceException when PrecompiledCharsmap is left at its default struct value (e.g., empty/invalid model or missing normalizer_spec/precompiled_charsmap). Normalizer construction expects an empty span to be safe, so SentencePieceByteString should handle a null backing array (or NormalizerSpec should initialize PrecompiledCharsmap to an explicit empty value).

Suggested change
internal ReadOnlySpan<byte> Span => data.AsSpan(offset, length);
internal ReadOnlySpan<byte> Span => data is null ? ReadOnlySpan<byte>.Empty : data.AsSpan(offset, length);

Member Author


[Automated Copilot response]

Addressed. The Span property now null-guards the backing array, returning ReadOnlySpan<byte>.Empty for the default struct case (e.g. when NormalizerSpec has no precompiled_charsmap). The SentencePieceNormalizer constructor handles empty spans safely via its IsEmpty early-return. Covered by Synthetic_BpeModel_DefaultProperties and Synthetic_BpeModel_EmptyNormalizerSpecUsesDefaults tests.
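For reference, a minimal sketch of the guarded struct (matching the names in the diff above; the primary-constructor syntax assumes C# 12):

```csharp
using System;

/// <summary>Lightweight replacement for Google.Protobuf.ByteString with a Span property.</summary>
internal readonly struct SentencePieceByteString(byte[] data, int offset, int length)
{
    // For default(SentencePieceByteString) -- e.g. a model with no
    // precompiled_charsmap -- 'data' is null, so return an empty span
    // rather than dereferencing the null array.
    internal ReadOnlySpan<byte> Span =>
        data is null ? ReadOnlySpan<byte>.Empty : data.AsSpan(offset, length);
}
```

With this guard, callers like the normalizer can treat "no charsmap" and "empty charsmap" identically via an `IsEmpty` check.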

@stephentoub
Member Author

@ericstj / @tarekgh, the nuget config looks broken:

/home/cloudtest_azpcontainer/.nuget/packages/microsoft.dotnet.arcade.sdk/11.0.0-beta.25626.7/tools/Tools.proj : error NU1301: Unable to load the service index for source https://dnceng.pkgs.visualstudio.com/public/_packaging/darc-pub-dotnet-maintenance-packages-ab95a1f1/nuget/v3/index.json.

@ericstj
Member

ericstj commented Mar 10, 2026

The feed was deleted -- we can remove it since the packages are now public.

internal TrainerSpec TrainerSpec { get; private set; } = new();
internal NormalizerSpec NormalizerSpec { get; private set; } = new();

internal static ModelProto Parse(byte[] data, int start, int end)
Member


Do we have a gamut of tests that give this thing every version of sentencepiece.bpe.model that was ever shipped? Will this be resilient to versioning in the same way as the old implementation? I love the reduction in dependencies and the deletion of the generated code, but we are adding a parser (albeit for trusted code) and it needs to meet compatibility and versioning requirements.

Member Author


[Automated Copilot response]

Great question. Here's how compatibility and versioning are handled:

Forward compatibility by design. The parser skips unknown field numbers (via the default case in every switch) and unknown wire types are handled by SkipField. This mirrors protobuf's core forward-compat guarantee: a parser can read data from a newer proto version without crashing, it just ignores the fields it doesn't recognize. The sentencepiece_model.proto also defines extensions 200 to max; on several messages, which our parser handles the same way.

Wire type mismatch tolerance. The when wireType == 2 guards on known fields mean that if a future proto version changes a field's wire type, it falls through to skip instead of misinterpreting the data.

Test coverage for real model files. The existing tests exercise three real model files from the Microsoft.ML.TestTokenizers package: Llama (BPE), Mistral (BPE), and Paraphrase-MiniLM (Unigram). These cover both model types and different normalizer configurations. Adding more models (e.g. T5, Gemma) would require updating that external package — worth doing as a follow-up.

Synthetic protobuf tests. In addition to real files, the latest push adds 49 tests that construct model files from raw wire-format bytes, exercising every parser code path: all field types, all wire types (read and skip), non-canonical varints, negative int32 (10-byte varint), extension-range field numbers, forward-compat patterns (wire type mismatch, unknown fields at every message level), submessage boundary handling, MemoryStream position offsets, and error handling for truncated/malformed data.

One known semantic difference from Google.Protobuf. For non-repeated message fields (TrainerSpec, NormalizerSpec), if the same field number appears twice, our parser uses the last occurrence (last-wins) rather than merging. This is documented in a code comment and validated by tests (MultipleTrainerSpecLastWins, MultipleTrainerSpecFieldsNotMerged). SentencePiece's own serializer never produces duplicate top-level messages, so this difference is moot in practice.
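As an illustration of the skip-unknown pattern described above, a wire-format reader only needs varint decoding plus a per-wire-type skip. This is a hypothetical sketch, not the parser from this PR; the names are illustrative:

```csharp
using System;

internal static class WireFormatSketch
{
    // Decode a base-128 varint: 7 payload bits per byte, high bit set on
    // all but the final byte.
    internal static ulong ReadVarint(byte[] data, ref int pos)
    {
        ulong value = 0;
        int shift = 0;
        while (true)
        {
            byte b = data[pos++];
            value |= (ulong)(b & 0x7F) << shift;
            if ((b & 0x80) == 0) return value;
            shift += 7;
            if (shift >= 64) throw new FormatException("varint too long");
        }
    }

    // Advance past a field body based on its wire type; this is what makes
    // unknown (future) fields harmless to an old parser.
    internal static void SkipField(byte[] data, ref int pos, int wireType)
    {
        switch (wireType)
        {
            case 0: ReadVarint(data, ref pos); break;              // varint
            case 1: pos += 8; break;                               // fixed64
            case 2: pos += (int)ReadVarint(data, ref pos); break;  // length-delimited
            case 5: pos += 4; break;                               // fixed32
            default: throw new FormatException($"unsupported wire type {wireType}");
        }
    }
}
```

A parse loop then reads a tag varint, computes fieldNumber = tag >> 3 and wireType = tag & 7, handles the fields it knows, and routes everything else through SkipField.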

Member


We have good tests using the actual tokenizer data for models like Llama, Mistral, and Paraphrase-multilingual-MiniLM-L12. These cover the SentencePiece BPE and Unigram models we support too.

// Phi3 is using the same tokenizer.model used by Llama. https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/tree/main

using Stream remoteStream = File.OpenRead(Path.Combine(@"Mistral", "tokenizer.model"));

using Stream remoteStream = File.OpenRead(Path.Combine(@"Paraphrase-multilingual-MiniLM-L12-v2", "sentencepiece.bpe.model"));

Regarding the versions, protobuf serialized data depends only on field numbers and wire types, which are agnostic across different proto versions (like proto2 and proto3). Maybe we only need to ensure the data we need is read, like the Pieces table for instance.
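To illustrate why the wire format is version-agnostic: each serialized field is identified solely by its field number and wire type, packed into one varint tag, so a reader never needs to know whether the schema was proto2 or proto3. A small demo (assuming, for illustration, a length-delimited field number 1 such as ModelProto's pieces):

```csharp
using System;

class TagDemo
{
    static void Main()
    {
        // tag = (fieldNumber << 3) | wireType
        // e.g. a length-delimited (wire type 2) field number 1 -> tag 0x0A
        int fieldNumber = 1, wireType = 2;
        int tag = (fieldNumber << 3) | wireType;
        Console.WriteLine($"tag=0x{tag:X2}");                     // prints "tag=0x0A"

        // Decoding reverses it; the proto language version never enters in:
        Console.WriteLine($"field={tag >> 3}, wire={tag & 0x7}"); // prints "field=1, wire=2"
    }
}
```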

@codecov

codecov bot commented Mar 10, 2026

Codecov Report

❌ Patch coverage is 96.65552% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.54%. Comparing base (70d7603) to head (5b2f4e7).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...icrosoft.ML.Tokenizers.Tests/SentencePieceTests.cs 96.63% 17 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7587      +/-   ##
==========================================
+ Coverage   69.07%   69.54%   +0.47%     
==========================================
  Files        1483     1484       +1     
  Lines      274513   273206    -1307     
  Branches    28285    27919     -366     
==========================================
+ Hits       189625   190011     +386     
+ Misses      77503    75831    -1672     
+ Partials     7385     7364      -21     
Flag Coverage Δ
Debug 69.54% <96.65%> (+0.47%) ⬆️
production 63.81% <ø> (+0.47%) ⬆️
test 89.59% <96.65%> (+0.07%) ⬆️


Files with missing lines Coverage Δ
src/Microsoft.ML.Tokenizers/Model/BertOptions.cs 100.00% <ø> (ø)
.../Microsoft.ML.Tokenizers/Model/WordPieceOptions.cs 100.00% <ø> (ø)
src/Microsoft.ML.Tokenizers/SentencepieceModel.cs 91.04% <ø> (+70.19%) ⬆️
test/Microsoft.ML.Tokenizers.Tests/LlamaTests.cs 99.85% <100.00%> (+<0.01%) ⬆️
...icrosoft.ML.Tokenizers.Tests/SentencePieceTests.cs 96.63% <96.63%> (ø)

... and 7 files with indirect coverage changes


- Handle null backing array in SentencePieceByteString.Span (default struct)
- Add big-endian guard to ReadFloat
- Make ModelProto.Parser readonly
- Optimize ParseFrom with MemoryStream.TryGetBuffer fast-path and pre-sized fallback
- Document last-wins vs merge semantics for non-repeated message fields
- Add comprehensive synthetic protobuf tests (49 SentencePiece tests total)
  covering wire format edge cases, forward compatibility, error handling,
  all field types, and stream path variations

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>