zstdgpu: clamp forward bit-buffer over-read on degenerate tiny frames#111
Closed
dannychenmsft wants to merge 1 commit into
Closed
zstdgpu: clamp forward bit-buffer over-read on degenerate tiny frames#111dannychenmsft wants to merge 1 commit into
dannychenmsft wants to merge 1 commit into
Conversation
The forward bit-buffer Refill and Skip both fetch one dword *ahead* of the bytes actually consumed (a documented design that relies on the unused tail bits never being read; see the ZSTDGPU_ASSERT(offset <= (bytesz>>2)-1) just above each fetch and the disabled #if 0 explanatory block). Release builds do not evaluate that assert, so on degenerate/tiny decodecorpus frames the running dword offset can advance one past the last dword of the compressed-input SRV. Because the compressed-input SRV is bound as a (static) root descriptor with no bounds, that one-dword over-read is an out-of-bounds fetch that the driver faults (observed as a fast DEVICE_REMOVED / page-fault with VA=0 on both NVIDIA hardware and the WARP software adapter). GPU-based Validation confirms the fix collapses the resulting OOB cascade in the decode path (799 -> 28 reported OOBs). Clamp the *read index* of both fetches to the last valid dword. For all valid, in-bounds positions offset is already <= (bytesz>>2)-1, so the clamp is a no-op and decoded output is byte-identical (T2 Quick corpus: 0 MISMATCH, 0 new DEVICE_REMOVED); it only ever changes the unused tail bits on the degenerate path. Partial hardening: this removes a real, GBV-confirmed OOB over-read but does not by itself clear the tiny-frame device-removal (a separate downstream fault gated by the PrefixBlockSizes cumulative offsets remains under investigation). Pre-existing in origin/development (the forward bit buffer is byte-identical across baseline and the optimization branches). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
pm4rtx
added a commit
that referenced
this pull request
Jun 16, 2026
This PR adds support for minimal frame/block info constants to make sure all buffers and views are always created irrespectively of whether decompression workload needs all buffers/views or not. This is the case because we switched to GPU-driven submission, therefore all buffers must be present (created) to submit barriers and created correct descriptors. This PR supersedes #112 and #111 and addresses issues described in them.
Collaborator
|
The issue this PR tries to address was fixed by #113 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The forward bit-buffer Refill and Skip both (by design) fetch one dword ahead of the bytes actually consumed (see the ZSTDGPU_ASSERT(offset <= (bytesz>>2)-1) just above each fetch and the disabled #if 0 explanatory block). Release builds do not evaluate that assert, so on degenerate/tiny decodecorpus frames the running dword offset can advance one past the last dword of the compressed-input SRV.
Because the compressed-input SRV is bound as a (static) root descriptor with no bounds, that one-dword over-read is an out-of-bounds fetch that the driver faults (observed as a fast DEVICE_REMOVED / page-fault with VA=0 on both AMD/NVIDIA hardware and the WARP software adapter).
Clamp the read index of both fetches to the last valid dword. For all valid, in-bounds positions offset is already <= (bytesz>>2)-1, so the clamp is a no-op and decoded output is byte-identical; it only ever changes the unused tail bits on the degenerate path.