.Net: fix: count text chunker orphan glue by tokens#14002

Open
he-yufeng wants to merge 1 commit into microsoft:main from he-yufeng:fix/textchunker-paragraph-token-count

@he-yufeng

Summary

  • use the configured token counter when deciding whether to glue the final short paragraph into the previous one
  • keep the existing whitespace normalization behavior for glued orphan paragraphs
  • add a regression test where word-count based gluing would exceed the requested token limit
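The difference between the two gluing decisions can be illustrated with a small, language-neutral sketch. This is not the C# code from the PR: the function names, the character-based counter, and the sample strings are all hypothetical, chosen only to show how a word-count check can accept a merge that a custom token counter would reject.

```python
from typing import Callable

TokenCounter = Callable[[str], int]


def should_glue_by_words(previous: str, last: str, max_tokens: int) -> bool:
    # Word-count heuristic: every whitespace-separated word counts as one token.
    return len(previous.split()) + len(last.split()) <= max_tokens


def should_glue_by_tokens(previous: str, last: str, max_tokens: int,
                          count_tokens: TokenCounter) -> bool:
    # Token-count heuristic: measure the combined text with the configured counter.
    combined = " ".join(previous.split() + last.split())
    return count_tokens(combined) <= max_tokens


def char_counter(text: str) -> int:
    # Hypothetical custom counter (an assumption for this sketch): one token per character.
    return len(text)


previous = "alpha beta gamma"  # 3 words, 16 characters
last = "delta"                 # 1 word, 5 characters
limit = 20

print(should_glue_by_words(previous, last, limit))                 # True
print(should_glue_by_tokens(previous, last, limit, char_counter))  # False
```

With the character-based counter the merged paragraph would be 22 tokens long, so gluing it would overshoot the limit of 20 even though the word count (4) looks safe.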

Fixes #13713.

To verify

  • dotnet test .\src\SemanticKernel.UnitTests\SemanticKernel.UnitTests.csproj -f net10.0 --filter "FullyQualifiedName~SplitPlainTextParagraphsDoesNotGlueLastParagraphPastTokenLimit"
  • dotnet test .\src\SemanticKernel.UnitTests\SemanticKernel.UnitTests.csproj -f net10.0 --filter "FullyQualifiedName~TextChunkerTests" --no-restore
  • dotnet build .\src\SemanticKernel.Core\SemanticKernel.Core.csproj -f net10.0 --no-restore
  • dotnet format .\src\SemanticKernel.Core\SemanticKernel.Core.csproj --verify-no-changes --no-restore
  • dotnet format .\src\SemanticKernel.UnitTests\SemanticKernel.UnitTests.csproj --verify-no-changes --no-restore

Copilot AI review requested due to automatic review settings May 13, 2026 06:22
@he-yufeng requested a review from a team as a code owner May 13, 2026 06:22
@moonbox3 added the .NET and kernel labels May 13, 2026
@github-actions (bot) changed the title from "fix: count text chunker orphan glue by tokens" to ".Net: fix: count text chunker orphan glue by tokens" May 13, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes an issue in TextChunker.ProcessParagraphs where the “orphan paragraph” gluing heuristic could exceed the configured token budget because it used word-count instead of the configured token counter. The change ensures gluing decisions honor the same token-counting strategy used throughout the chunking pipeline, and adds a regression test to prevent reintroduction.

Changes:

  • Update orphan-paragraph gluing to use GetTokenCount(..., tokenCounter) on the combined paragraph rather than adding word counts.
  • Preserve the existing whitespace normalization behavior for glued paragraphs by continuing to split/join on spaces.
  • Add a regression unit test covering a case where word-count-based gluing would exceed the token limit when using a custom token counter.
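A minimal sketch of the combined behavior described in this list, written in Python for illustration rather than taken from the C# change. The names `split_on_spaces` and `glue_orphan` are hypothetical, and the sketch leaves out whatever threshold the real code uses to decide that the last paragraph is "short"; it only shows the split/join normalization and the token check on the combined text.

```python
from typing import Callable, List


def split_on_spaces(text: str) -> List[str]:
    # Split on single spaces and drop the empty strings produced by runs of spaces.
    return [word for word in text.split(" ") if word]


def glue_orphan(paragraphs: List[str], max_tokens: int,
                count_tokens: Callable[[str], int]) -> List[str]:
    """Merge the final paragraph into the previous one only if the result fits."""
    if len(paragraphs) < 2:
        return list(paragraphs)
    previous, last = paragraphs[-2], paragraphs[-1]
    # Re-joining with single spaces normalizes whitespace in the glued paragraph.
    combined = " ".join(split_on_spaces(previous) + split_on_spaces(last))
    if count_tokens(combined) <= max_tokens:
        return paragraphs[:-2] + [combined]
    # Over budget: leave both paragraphs exactly as they were.
    return list(paragraphs)


def word_counter(text: str) -> int:
    # Hypothetical counter used only for this example: one token per word.
    return len(text.split())


paragraphs = ["one  two   three", "four"]
print(glue_orphan(paragraphs, 4, word_counter))  # ['one two three four']
print(glue_orphan(paragraphs, 3, word_counter))  # ['one  two   three', 'four']
```

When the merge is rejected, the original paragraphs keep their irregular spacing, which matches the idea that normalization applies only to glued paragraphs.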

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| dotnet/src/SemanticKernel.Core/Text/TextChunker.cs | Fixes orphan-paragraph gluing to respect the configured token counter when checking max token limits. |
| dotnet/src/SemanticKernel.UnitTests/Text/TextChunkerTests.cs | Adds a regression test ensuring the last paragraph is not glued if it would exceed the requested token limit. |

Contributor

@github-actions (bot) left a comment

Automated Code Review

Reviewers: 4 | Confidence: 93% | Result: All clear

Reviewed: Correctness, Security Reliability, Test Coverage, Design Approach


Automated review by he-yufeng's agents

@he-yufeng force-pushed the fix/textchunker-paragraph-token-count branch from d7c77b4 to 37f7f45 May 13, 2026 10:39

Labels

kernel: Issues or pull requests impacting the core kernel
.NET: Issue or Pull requests regarding .NET code

Development

Successfully merging this pull request may close these issues.

Bug: the TextChunker.SplitPlainTextParagraphs sometimes overcount the chunk sizes

3 participants