
[cDAC] Handle VirtualReadException in StackWalk_1.IsManaged()#124844

Draft
max-charlamb wants to merge 1 commit into dotnet:main from max-charlamb:cdac-stackwalk-crash-fix

Conversation

@max-charlamb
Member

Problem

Three cDAC CI tests are failing on windows_x64_Release:

  • SOS.DualRuntimes
  • SOS.GCTests
  • SOS.VarargPInvokeInteropMD

The SOS.DualRuntimes failure is caused by an unhandled VirtualReadException (CORDBG_E_READVIRTUAL_FAILURE, 0x80131c49) during stack walking. When the stack walker encounters a frame whose instruction pointer is in unmanaged code (e.g. ntdll), IsManaged() calls ExecutionManager.GetCodeBlockHandle(), which traverses the multi-level RangeSectionMap. The map can find a RangeSectionFragment for the address range, but the fragment's RangeSection pointer may reference memory that is unreadable in the dump, causing VirtualReadException to propagate out of the stack walk.

The native DAC handles this scenario because ExecutionManager::FindCodeRange has a NOTHROW contract and SUPPORTS_DAC — access violations during DAC memory reads are caught by SEH (DAC_ENTER/DAC_LEAVE) and translated to a NULL return. The cDAC has no equivalent protection, so the exception escapes.
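The failure path described above can be modeled with a small sketch — written in Python rather than the cDAC's C#, with a deliberately simplified one-hop lookup; `dump_memory`, `read`, and both `find_code_range_*` functions are hypothetical stand-ins, not cDAC source:

```python
# Toy model (NOT cDAC source): a range-section fragment is readable in the
# dump, but the RangeSection it points at was never captured. The native DAC
# translates the failed read into a NULL result; the pre-fix cDAC lets the
# exception escape the stack walk.

class VirtualReadException(Exception):
    """Raised when a target-memory read fails (address missing from the dump)."""

# "Target memory": only addresses captured in the dump are readable.
dump_memory = {
    0x1000: {"kind": "fragment", "range_section_ptr": 0x2000},
    # 0x2000 (the RangeSection itself) was NOT captured in the dump.
}

def read(address):
    if address not in dump_memory:
        raise VirtualReadException(f"cannot read {address:#x}")
    return dump_memory[address]

def find_code_range_native_dac(fragment_addr):
    # Native DAC: FindCodeRange is NOTHROW; SEH around DAC reads turns
    # access violations into a NULL return, so the walk continues.
    try:
        fragment = read(fragment_addr)
        return read(fragment["range_section_ptr"])
    except VirtualReadException:
        return None

def find_code_range_cdac_before_fix(fragment_addr):
    # cDAC before this PR: no equivalent protection, the exception
    # propagates out of the stack walk.
    fragment = read(fragment_addr)
    return read(fragment["range_section_ptr"])

print(find_code_range_native_dac(0x1000))        # None: read failure swallowed
try:
    find_code_range_cdac_before_fix(0x1000)
except VirtualReadException as e:
    print(f"escaped the stack walk: {e}")
```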

Fix

Wrap the GetCodeBlockHandle call in StackWalk_1.IsManaged() with a try/catch for VirtualReadException. When a memory read fails during the code block lookup, we treat the IP as not being in managed code (return false) and allow the stack walk to continue. This matches the native DAC's behavior, where FindCodeRange returns NULL on read failures.

The catch is intentionally narrow — only VirtualReadException is caught, not all exceptions. A memory read failure is a definitive "this IP is not resolvable as managed code" answer, not a masked error.
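The shape of the fix can be sketched as follows — a Python stand-in for the C# change, where `get_code_block_handle` and the code-range table are hypothetical simplifications:

```python
# Toy model of the fix: IsManaged catches only VirtualReadException around
# the code-block lookup and treats the IP as unmanaged; any other exception
# still propagates out of the walk.

class VirtualReadException(Exception):
    pass

def get_code_block_handle(ip, code_ranges):
    # Stand-in for ExecutionManager.GetCodeBlockHandle. The "corrupt"
    # sentinel models an IP whose RangeSection is unreadable in the dump.
    if ip == "corrupt":
        raise VirtualReadException("RangeSection unreadable")
    return code_ranges.get(ip)  # None when the IP is not in managed code

def is_managed(ip, code_ranges):
    try:
        handle = get_code_block_handle(ip, code_ranges)
    except VirtualReadException:
        # Read failure during the lookup: treat the IP as not managed
        # and let the stack walk continue.
        return False
    return handle is not None

ranges = {0x7FF0: "Program.Main"}
print(is_managed(0x7FF0, ranges))     # True: managed frame
print(is_managed(0xDEAD, ranges))     # False: ntdll-style unmanaged IP
print(is_managed("corrupt", ranges))  # False: unreadable RangeSection
```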

Validation

  • Reproduced the SOS.DualRuntimes failure using an ad-hoc dump diagnostic tool that loads the CI dump files via ClrMD and drives the cDAC stack walker.
  • Verified the fix against all 29 dump files from the failing CI run — all threads walk successfully with 0 failures.
  • All 832 existing cDAC unit tests pass.

@dotnet-policy-service
Contributor

Tagging subscribers to this area: @steveisok, @tommcdon, @dotnet/dotnet-diag
See info in area-owners.md if you want to be subscribed.

Copilot AI left a comment

Pull request overview

This PR updates the cDAC stack walker to tolerate unreadable memory during managed-code detection, aligning behavior with the native DAC (treating read failures during code-range lookup as “not managed” so the walk can continue).

Changes:

  • Wrap ExecutionManager.GetCodeBlockHandle in StackWalk_1.IsManaged() with a narrow try/catch (VirtualReadException).
  • On VirtualReadException, return false (unmanaged) and continue the stack walk rather than failing the entire walk.

@max-charlamb max-charlamb added the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Feb 25, 2026
@max-charlamb max-charlamb marked this pull request as draft February 25, 2026 05:03
```csharp
}
catch (VirtualReadException)
{
    // If we fail to read memory while looking up the code block, the IP is not in managed code
```
A reviewer (Member) commented on this line:
> If we fail to read memory while looking up the code block, the IP is not in managed code

How can this happen? I do not think we should ever fail to read the memory (unless the data are corrupted or missing - but then we cannot really tell whether the IP is in managed code or not).

@noahfalk
Member

noahfalk commented Feb 25, 2026

A memory read failure is a definitive "this IP is not resolvable as managed code" answer, not a masked error.

I'm skeptical :) As a thought experiment, start with a valid dump, pick an arbitrary RangeSection that points to a managed frame on the stack, and delete it from the dump contents. Now we've got a dump that will trigger a memory read failure during the stack walk, and by construction it's masking managed code. Even if we assume there are zero bugs in our dump memory enumeration, customers could have created dumps in other ways.

I'd put my wager on the opposite conclusion - DAC was masking bugs. This looks like one (the conditional is flipped).

if (level[i].IsNull())
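The thought experiment above can be made concrete with a toy model (Python, all names hypothetical): delete the backing memory for a genuinely managed IP, and the catch-and-return-false policy classifies it as unmanaged, indistinguishable from a real native frame.

```python
# Toy model of the masking concern: the same IP answers True in a valid
# dump and False once its RangeSection is deleted, so a read failure is
# not a definitive "not managed" signal.

class VirtualReadException(Exception):
    pass

def is_managed(ip, dump):
    try:
        if ip not in dump:
            raise VirtualReadException(ip)  # backing memory missing
        return dump[ip]                      # True when the IP is managed
    except VirtualReadException:
        return False                         # the policy under discussion

valid_dump = {0x7FF0: True}                  # IP is managed code
print(is_managed(0x7FF0, valid_dump))        # True

corrupted_dump = {}                          # same dump, RangeSection deleted
print(is_managed(0x7FF0, corrupted_dump))    # False: managed frame masked
```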

@noahfalk noahfalk left a comment

I think we've likely got a memory enumeration bug, and ignoring the read exception is only serving to mask it. If it's really the case that, by design, this memory shouldn't be in the dump, it would be helpful to have some analysis of why that is.

@max-charlamb
Member Author

> A memory read failure is a definitive "this IP is not resolvable as managed code" answer, not a masked error.
>
> I'm skeptical :) As a thought experiment, start with a valid dump, pick an arbitrary RangeSection that points to a managed frame on the stack, and delete it from the dump contents. Now we've got a dump that will trigger a memory read failure during the stack walk, and by construction it's masking managed code. Even if we assume there are zero bugs in our dump memory enumeration, customers could have created dumps in other ways.
>
> I'd put my wager on the opposite conclusion - DAC was masking bugs. This looks like one (the conditional is flipped).
>
> if (level[i].IsNull())

Here's a PR to fix that bug: #124862


Labels

  • area-Diagnostics-coreclr
  • NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons)
