Skip to content

[Wasm RyuJIT] Block Stores pt. 2#124846

Draft
kg wants to merge 7 commits intodotnet:mainfrom
kg:wasm-blockstore-2
Draft

[Wasm RyuJIT] Block Stores pt. 2#124846
kg wants to merge 7 commits intodotnet:mainfrom
kg:wasm-blockstore-2

Conversation

@kg
Copy link
Member

@kg kg commented Feb 25, 2026

  • Implement genCodeForCpObj
  • Fix two NIY scenarios in the stackifier
  • Fix generated stores not being lowered after RewriteLocalStackStore
  • Expand and add comments to ease development

This is sufficient to compile the following to valid WASM (It crashes when attempting to call the write barrier helper):

    [StructLayout(LayoutKind.Explicit)]
    public struct S2 {
        [FieldOffset(0)]
        public string A;
        [FieldOffset(8)]
        public int b;
        [FieldOffset(16)]
        public string C;
        [FieldOffset(24)]
        public int d;
    }

    [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.NoOptimization)]
    static unsafe void copyStructWithRefs (ref S2 a, ref S2 b) {
        a = b;
    }

And the following just plain works, since copies to the stack don't need a write barrier:

    [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.NoOptimization)]
    static unsafe void loadStructWithRefs (ref S2 src) {
        S2 local = src;
    }

The following doesn't work yet due to containable memops (will be addressed separately):

    [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.NoOptimization)]
    static unsafe void fillStructFromLocal (ref S2 dest) {
        S2 local = default;
        local.b = 37;
        dest = local;
    }

@kg kg added the arch-wasm WebAssembly architecture label Feb 25, 2026
Copilot AI review requested due to automatic review settings February 25, 2026 05:52
@kg kg added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 25, 2026
@dotnet-policy-service
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Reorder invariant nodes in simple scenarios in stackifier

Jitdump when moving nodes in stackifier

When regallocwasm creates a new store node, lower it
Apply regallocwasm fix from andy

Checkpoint

Checkpoint

Add comment

Speculatively implement the dstOnStack optimization (code that hits it doesn't compile yet)
@kg kg force-pushed the wasm-blockstore-2 branch from 867f769 to 4c8682d Compare February 25, 2026 05:59
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR advances the WASM RyuJIT backend’s block-store support by adding cpobj codegen, addressing a stackifier NIY case, and ensuring newly generated stores are lowered post-rewrite to keep the pipeline consistent.

Changes:

  • Implement CodeGen::genCodeForCpObj for GT_STORE_BLK cpobj unrolling on WASM.
  • Extend WASM regalloc to track multi-use operands for GT_STORE_BLK and to re-lower stores created by RewriteLocalStackStore.
  • Improve lowering/stackifier behavior: mark cpobj operands as multiply-used and relax one stackifier NIY by moving invariant nodes.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/coreclr/jit/regallocwasm.h Adds CollectReferencesForBlockStore declaration for block store multi-use tracking.
src/coreclr/jit/regallocwasm.cpp Implements block-store reference collection and re-lowers stores created during local-store rewrite.
src/coreclr/jit/lowerwasm.cpp Marks cpobj operands as MultiplyUsed and adjusts stackifier handling for invariant nodes.
src/coreclr/jit/gentree.cpp Improves WASM dump text for native block-store opcodes (memory.copy vs memory.fill).
src/coreclr/jit/compiler.h Adds WasmRegAlloc friend access for invoking lowering post-rewrite.
src/coreclr/jit/codegenwasm.cpp Implements cpobj unrolled copying sequence (load/store vs helper call) and adds pointer-sized instruction aliases.
Comments suppressed due to low confidence (5)

src/coreclr/jit/codegenwasm.cpp:2368

  • The call to genEmitHelperCall(CORINFO_HELP_ASSIGN_BYREF, ...) is missing the required SP argument. CodeGen::genEmitHelperCall explicitly notes that for WASM helper calls, the stack-pointer argument must be first on the value stack (below any other args). Not pushing SP here would mismatch the helper signature and can explain the reported crash when calling the write barrier helper.

Suggestion: push GetStackPointerReg() (via local.get) as the first argument before pushing the destination/source byrefs, for each helper call site (or refactor to a small helper that emits the correct argument sequence).

        case PackOperAndType(GT_LE, TYP_DOUBLE):
            ins = INS_f64_le;
            break;
        case PackOperAndType(GT_GE, TYP_FLOAT):

src/coreclr/jit/codegenwasm.cpp:2324

  • genCodeForCpObj calls genConsumeRegs(cpObjNode), which only updates liveness for the GT_STORE_BLK node itself and does not consume/update liveness for its operands. This differs from patterns like genCodeForStoreInd, and can lead to incorrect liveness (and thus wrong code) for Addr()/Data().

Suggestion: consume the destination address and source address/value explicitly (e.g., using genConsumeAddress(dstAddr) and genConsumeRegs(...) in the correct execution order, or genConsumeOperands if appropriate) before emitting the copy sequence, and then do the usual life update for the store node.

        // So we can re-express say GT_GE (UN) as !GT_LT
        //

src/coreclr/jit/codegenwasm.cpp:2373

  • gcPtrCount is initialized to the total GC pointer slot count, but it is only decremented in the write-barrier path. When dstOnStack is true, GC pointer slots are copied via the non-WB load/store path and gcPtrCount will never reach 0, causing the final assert(gcPtrCount == 0) to fire in debug builds.

Suggestion: either decrement gcPtrCount whenever layout->IsGCPtr(i) (regardless of dstOnStack), or move the gcPtrCount accounting + assert under the !dstOnStack branch (similar to other target implementations).

    instruction ins;
    switch (PackOperAndType(op, treeNode->gtOp1->TypeGet()))
    {
        case PackOperAndType(GT_EQ, TYP_FLOAT):
            ins = INS_f32_eq;
            break;
        case PackOperAndType(GT_EQ, TYP_DOUBLE):
            ins = INS_f64_eq;
            break;
        case PackOperAndType(GT_NE, TYP_FLOAT):
            ins = INS_f32_ne;
            break;
        case PackOperAndType(GT_NE, TYP_DOUBLE):
            ins = INS_f64_ne;
            break;
        case PackOperAndType(GT_LT, TYP_FLOAT):
            ins = INS_f32_lt;
            break;
        case PackOperAndType(GT_LT, TYP_DOUBLE):
            ins = INS_f64_lt;
            break;
        case PackOperAndType(GT_LE, TYP_FLOAT):
            ins = INS_f32_le;
            break;
        case PackOperAndType(GT_LE, TYP_DOUBLE):
            ins = INS_f64_le;
            break;
        case PackOperAndType(GT_GE, TYP_FLOAT):
            ins = INS_f32_ge;
            break;
        case PackOperAndType(GT_GE, TYP_DOUBLE):
            ins = INS_f64_ge;
            break;

src/coreclr/jit/codegenwasm.cpp:2364

  • The offset computation for the write-barrier helper path hard-codes INS_i32_const/INS_i32_add, but the address locals may be i64 when TARGET_64BIT (and you already abstracted other pointer-sized ops via INS_I_*). This will produce invalid wasm or incorrect values on 64-bit.

Suggestion: use the pointer-sized const/add instructions (INS_I_const/INS_I_add) and the appropriate emitAttr for the constant (or select i32 vs i64 based on TARGET_64BIT) so the address arithmetic matches the address type.

        case PackOperAndType(GT_LT, TYP_DOUBLE):
            ins = INS_f64_lt;
            break;
        case PackOperAndType(GT_LE, TYP_FLOAT):
            ins = INS_f32_le;
            break;

src/coreclr/jit/codegenwasm.cpp:2310

  • noway_assert(source->IsLocal()) / noway_assert(dstAddr->IsLocal()) will hard-fail compilation in non-DEBUG builds if either operand is not a local. Nothing in LowerBlockStore guarantees Addr() is a local, and other targets handle arbitrary address expressions here.

Suggestion: remove these noway_asserts and rely on GetMultiUseOperandReg (with MultiplyUsed marking where needed) to support non-local address expressions; if some shapes are truly unsupported for now, prefer an explicit NYI/IMPL_LIMITATION gate instead of a noway_assert in release builds.


    genTreeOps op          = treeNode->OperGet();

Comment on lines 249 to 254
{
// Try to use bulk copy helper
if (TryLowerBlockStoreAsGcBulkCopyCall(blkNode))
{
return;
}
Copy link

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LowerBlockStore marks the block-store source src as Contained unconditionally, but in the doCpObj path it also marks src itself as MultiplyUsed when src is not GT_IND. A contained node won't be emitted as a standalone value, so marking it MultiplyUsed doesn't ensure a local.tee happens, and it can leave codegen (e.g., GetMultiUseOperandReg) reading from an uninitialized temp local.

Suggestion: ensure the MultiplyUsed flag is applied to the actual address/value node that will be emitted (e.g., the GT_IND address child), or uncontain/reshape non-GT_IND sources (such as local struct sources) into an address form that codegen can evaluate once and tee into a temp.

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings February 25, 2026 15:30
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comment on lines +2499 to +2508
emit->emitIns_I(INS_local_get, attrDstAddr, WasmRegToIndex(dstReg));
emit->emitIns_I(INS_I_const, attrDstAddr, offset);
emit->emitIns(INS_I_add);
emit->emitIns_I(INS_local_get, attrSrcAddr, WasmRegToIndex(srcReg));
emit->emitIns_I(INS_I_const, attrSrcAddr, offset);
emit->emitIns(INS_I_add);
// Call the byref assign helper. On other targets this updates the dst/src regs but here it won't,
// so we have to do the local.get+i32.const+i32.add dance every time.
genEmitHelperCall(CORINFO_HELP_ASSIGN_BYREF, 0, EA_PTRSIZE);
gcPtrCount--;
Copy link

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In genCodeForCpObj, the CORINFO_HELP_ASSIGN_BYREF helper call is emitted without pushing the implicit SP argument required by the WASM managed calling convention (see the contract documented in genEmitHelperCall). This will shift the helper arguments and can easily explain the current runtime crash. Push the stack pointer local first (below dst/src) before calling genEmitHelperCall so the helper receives (sp, dst, src) in the expected order.

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings February 25, 2026 18:36
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Comment on lines +548 to +577
/*
else
{
// To resolve this scenario we have two options:
// 1. We try moving the whole tree rooted at `node`.
// To avoid quadratic behavior, we first stackify it and collect all the side effects
// from it. Then we check for interference of those side effects with nodes between
// 'node' and 'prev'.
// 2. Failing that, we insert a temporary ('ReplaceWithLclVar') for 'node'.
// To avoid explosion of temporaries, we maintain a busy/free set of them.
// For now, for simplicity we are implementing #2 only.

LIR::Use nodeUse;
// FIXME-WASM: TryGetUse is inefficient here, replace it with something more optimal
if (!m_lower->BlockRange().TryGetUse(node, &nodeUse))
{
JITDUMP("node==[%06u] prev==[%06u]\n", Compiler::dspTreeID(node),
Compiler::dspTreeID(prev)); NYI_WASM("Could not get a LIR::Use for the node to be moved by the
stackifier");
}

unsigned lclNum = nodeUse.ReplaceWithLclVar(m_lower->m_compiler);
GenTree* newNode = nodeUse.Def();
JITDUMP("Stackifier replaced node [%06u] with lcl var %u\n", Compiler::dspTreeID(node), lclNum);
m_lower->BlockRange().Remove(newNode);
m_lower->BlockRange().InsertAfter(prev, newNode);
JITDUMP("Stackifier moved new node [%06u] after [%06u]\n", Compiler::dspTreeID(newNode),
Compiler::dspTreeID(prev)); break;
}
*/
Copy link

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This large block of commented-out code (lines 548-577) should either be removed or converted to a proper TODO comment with a tracking issue. Leaving this much implementation detail in comments makes the code harder to maintain and review. If this is experimental code for future work, consider moving it to a design document or tracking issue instead.

Copilot uses AI. Check for mistakes.
{
// Copy the pointer-sized non-gc-pointer slots one at a time (and GC pointer slots if the destination is stack)
// using regular I-sized load/store pairs.
if (dstOnStack || !layout->IsGCPtr(i))
Copy link

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition dstOnStack at line 2491 is unreachable because of the assertion assert(!dstOnStack) at line 2454. The assertion ensures that genCodeForCpObj is only called when the destination is on the heap (dstOnStack is false). Either remove the dstOnStack || part of this condition since it will always be false, or reconsider whether the assertion at line 2454 is correct. Looking at the lowering code (lowerwasm.cpp:249-250), destinations on the stack should use BlkOpKindNativeOpcode (memory.copy) instead of BlkOpKindCpObjUnroll, so the assertion appears correct.

Copilot uses AI. Check for mistakes.
Comment on lines +2489 to +2490
// Copy the pointer-sized non-gc-pointer slots one at a time (and GC pointer slots if the destination is stack)
// using regular I-sized load/store pairs.
Copy link

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment mentions "and GC pointer slots if the destination is stack" but this scenario is unreachable due to the assertion at line 2454 which ensures dstOnStack is always false in this function. Update the comment to accurately reflect the actual behavior: only non-GC-pointer slots are copied using regular load/store pairs.

Suggested change
// Copy the pointer-sized non-gc-pointer slots one at a time (and GC pointer slots if the destination is stack)
// using regular I-sized load/store pairs.
// Copy the pointer-sized non-GC-pointer slots one at a time using regular I-sized load/store pairs.
// GC pointer slots are handled via the byref assign helper in the else-branch below.

Copilot uses AI. Check for mistakes.
offset += TARGET_POINTER_SIZE;
}

assert(dstOnStack || (gcPtrCount == 0));
Copy link

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dstOnStack || part of this assertion is always false due to the assertion at line 2454 (assert(!dstOnStack)). Simplify this to assert(gcPtrCount == 0) since we know dstOnStack is always false here.

Suggested change
assert(dstOnStack || (gcPtrCount == 0));
assert(gcPtrCount == 0);

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arch-wasm WebAssembly architecture area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants