Skip to content

zstdgpu: clamp forward bit-buffer over-read on degenerate tiny frames#111

Closed
dannychenmsft wants to merge 1 commit into
microsoft:developmentfrom
dannychenmsft:zstdgpu-fix-bitbuffer-overread
Closed

zstdgpu: clamp forward bit-buffer over-read on degenerate tiny frames#111
dannychenmsft wants to merge 1 commit into
microsoft:developmentfrom
dannychenmsft:zstdgpu-fix-bitbuffer-overread

Conversation

@dannychenmsft

Copy link
Copy Markdown

The forward bit-buffer Refill and Skip both (by design) fetch one dword ahead of the bytes actually consumed (see the ZSTDGPU_ASSERT(offset <= (bytesz>>2)-1) just above each fetch and the disabled #if 0 explanatory block). Release builds do not evaluate that assert, so on degenerate/tiny decodecorpus frames the running dword offset can advance one past the last dword of the compressed-input SRV.

Because the compressed-input SRV is bound as a (static) root descriptor with no bounds, that one-dword over-read is an out-of-bounds fetch that the driver faults (observed as a fast DEVICE_REMOVED / page-fault with VA=0 on both AMD/NVIDIA hardware and the WARP software adapter).

Clamp the read index of both fetches to the last valid dword. For all valid, in-bounds positions offset is already <= (bytesz>>2)-1, so the clamp is a no-op and decoded output is byte-identical; it only ever changes the unused tail bits on the degenerate path.

The forward bit-buffer Refill and Skip both fetch one dword *ahead* of the
bytes actually consumed (a documented design that relies on the unused tail
bits never being read; see the ZSTDGPU_ASSERT(offset <= (bytesz>>2)-1) just
above each fetch and the disabled #if 0 explanatory block). Release builds do
not evaluate that assert, so on degenerate/tiny decodecorpus frames the running
dword offset can advance one past the last dword of the compressed-input SRV.

Because the compressed-input SRV is bound as a (static) root descriptor with no
bounds, that one-dword over-read is an out-of-bounds fetch that the driver
faults (observed as a fast DEVICE_REMOVED / page-fault with VA=0 on both NVIDIA
hardware and the WARP software adapter). GPU-based Validation confirms the fix
collapses the resulting OOB cascade in the decode path (799 -> 28 reported OOBs).

Clamp the *read index* of both fetches to the last valid dword. For all valid,
in-bounds positions offset is already <= (bytesz>>2)-1, so the clamp is a no-op
and decoded output is byte-identical (T2 Quick corpus: 0 MISMATCH, 0 new
DEVICE_REMOVED); it only ever changes the unused tail bits on the degenerate path.

Partial hardening: this removes a real, GBV-confirmed OOB over-read but does not
by itself clear the tiny-frame device-removal (a separate downstream fault gated
by the PrefixBlockSizes cumulative offsets remains under investigation). Pre-existing
in origin/development (the forward bit buffer is byte-identical across baseline and
the optimization branches).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
pm4rtx added a commit that referenced this pull request Jun 16, 2026
This PR adds support for minimal frame/block info constants to make sure all buffers and views are always created irrespectively of whether decompression workload needs all buffers/views or not. This is the case because we switched to GPU-driven submission, therefore all buffers must be present (created) to submit barriers and created correct descriptors. This PR supersedes #112 and #111 and addresses issues described in them.
@pm4rtx

pm4rtx commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

The issue this PR tries to address was fixed by #113

@pm4rtx pm4rtx closed this Jun 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants