fix decode token #7102

Open
zhuangzhuang12 wants to merge 4 commits into PaddlePaddle:develop from zhuangzhuang12:accumulate-garbled-tokens

@zhuangzhuang12
Contributor

Title: [Engine][DataProcessor] Simplify force decode logic in _decode_token and add unit tests

Body:

Motivation

When a stream ends with undecoded tokens (e.g., partial UTF-8 byte-level tokens),
_decode_token needs to return the remaining token IDs. The original force decode
logic used a two-level lookup involving prefix_offset, prev_cum_len, and start_idx,
which was unnecessarily complex: cum_tokens[read_offset:] is sufficient to capture
all unreturned tokens in every case.
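
For readers unfamiliar with why tokens can be left undecoded, here is a minimal standalone sketch using only the Python standard library. It assumes a byte-level tokenizer that may emit one token per UTF-8 byte; the helper names `decode_stepwise` and `is_strictly_decodable` are hypothetical and are not part of this PR.

```python
import codecs


def decode_stepwise(data: bytes) -> list:
    """Feed bytes one at a time to an incremental UTF-8 decoder.

    Each list entry is the text that became available after that byte.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    return [decoder.decode(bytes([b])) for b in data]


def is_strictly_decodable(data: bytes) -> bool:
    """Return True only if the bytes form complete, valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


ni = "你".encode("utf-8")  # three bytes: E4 BD A0

# The first two bytes yield no text; the third completes the character.
print(decode_stepwise(ni))            # ['', '', '你']

# If the stream stops after two bytes, nothing can be decoded cleanly,
# so those tokens are still "unreturned" when the stream ends.
print(is_strictly_decodable(ni[:2]))  # False
```

This is the situation the force decode path has to handle: the stream is over, but some trailing token IDs never produced text.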

Modifications

  • engine/common_engine.py: Simplified the force decode path in _decode_token.
    Replaced the two-level remaining token lookup (cum_tokens[start_idx:read_offset]
    with fallback to cum_tokens[read_offset:]) with a single cum_tokens[read_offset:].
  • test_decode_token.py: Added unit tests for _decode_token covering:
    • Empty end (no tokens, is_end=True)
    • Incremental decoding with normal Chinese characters
    • Force decode of undecoded byte-level tokens at stream end
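
The three test scenarios above can be sketched against a toy model. This is not the code in engine/common_engine.py: `ToyStreamDecoder`, its byte vocabulary, and the `(text, token_ids)` return shape are assumptions made purely for illustration. Only the names cum_tokens, read_offset, and is_end come from the PR description. The toy also decodes all pending bytes at once, which is cruder than a real incremental detokenizer.

```python
from typing import Dict, List, Tuple


class ToyStreamDecoder:
    """Hypothetical stand-in for a streaming detokenizer (illustration only)."""

    def __init__(self, vocab: Dict[int, bytes]) -> None:
        self.vocab = vocab              # token id -> raw bytes (assumed layout)
        self.cum_tokens: List[int] = []  # every token id seen so far
        self.read_offset = 0             # index of the first unreturned token

    def decode_token(self, token_ids: List[int], is_end: bool = False) -> Tuple[str, List[int]]:
        self.cum_tokens.extend(token_ids)
        # One slice captures everything that has not been returned yet.
        pending = self.cum_tokens[self.read_offset:]
        raw = b"".join(self.vocab[t] for t in pending)
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            text = ""  # incomplete character: hold the tokens back
        if text:
            self.read_offset = len(self.cum_tokens)
            return text, pending
        if is_end and pending:
            # Force decode at stream end: flush whatever is left.
            self.read_offset = len(self.cum_tokens)
            return raw.decode("utf-8", errors="replace"), pending
        return "", []


vocab = {1: "你".encode("utf-8"), 2: "好".encode("utf-8"), 3: b"\xe4", 4: b"\xbd"}

# Empty end: no tokens, is_end=True.
assert ToyStreamDecoder(vocab).decode_token([], is_end=True) == ("", [])

# Incremental decoding with normal Chinese characters.
dec = ToyStreamDecoder(vocab)
assert dec.decode_token([1]) == ("你", [1])
assert dec.decode_token([2]) == ("好", [2])

# Force decode of undecoded byte-level tokens at stream end.
assert dec.decode_token([3]) == ("", [])
text, ids = dec.decode_token([4], is_end=True)
assert ids == [3, 4]
```

In this toy, the final call returns both held-back token IDs through the single `cum_tokens[read_offset:]` slice, which is the behavior the simplification relies on.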

Usage or Command

python test_decode_token.py

@paddle-bot

paddle-bot bot commented Mar 31, 2026

Thanks for your contribution!

@codecov-commenter

codecov-commenter commented Mar 31, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@4425142).

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7102   +/-   ##
==========================================
  Coverage           ?   73.93%           
==========================================
  Files              ?      402           
  Lines              ?    56576           
  Branches           ?     8942           
==========================================
  Hits               ?    41831           
  Misses             ?    11810           
  Partials           ?     2935           
Flag Coverage Δ
GPU 73.93% <ø> (?)

Flags with carried forward coverage won't be shown.


@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


root does not seem to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.

3 participants