Fix safepoint stack slot reuse #13469
Backport of bytecodealliance#13469 (fixes bytecodealliance#13461).

The safepoint spiller walks instructions backwards and can free a stack slot for a value defined by a safepoint instruction before assigning stack-map slots for values live across that same safepoint. If the freed slot is reused, the stack map can point at a slot that contains the instruction result rather than the value that must remain live across the call. This surfaced as a `null reference` trap on a provably non-null GC ref carried across a call safepoint.

Reorder the rewrite so safepoint stack-map entries are assigned before result slots for that instruction are freed for reuse.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
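To make the ordering problem concrete, here is a minimal sketch in Rust of a backward walk over calls with a slot free list. It is a toy model, not Cranelift's `SafepointSpiller`: `SlotPool`, `Call`, `Order`, `stack_map_slots`, and `has_conflict` are hypothetical names invented for illustration, and every instruction is assumed to be a call with exactly one result.

```rust
use std::collections::HashMap;

type Value = u32;
type Slot = u32;

/// Toy slot allocator with a free list (hypothetical, not Cranelift's type).
#[derive(Default)]
struct SlotPool {
    assigned: HashMap<Value, Slot>,
    free_list: Vec<Slot>,
    next_slot: Slot,
}

impl SlotPool {
    /// Return the value's slot, reusing a freed slot before creating a new one.
    fn get_or_create(&mut self, value: Value) -> Slot {
        if let Some(&slot) = self.assigned.get(&value) {
            return slot;
        }
        let slot = match self.free_list.pop() {
            Some(slot) => slot,
            None => {
                let slot = self.next_slot;
                self.next_slot += 1;
                slot
            }
        };
        self.assigned.insert(value, slot);
        slot
    }

    /// Return the value's slot to the free list, if it has one.
    fn release(&mut self, value: Value) -> Option<Slot> {
        let slot = self.assigned.remove(&value)?;
        self.free_list.push(slot);
        Some(slot)
    }
}

/// A call safepoint: one result plus the values live across the call.
struct Call {
    result: Value,
    live_across: Vec<Value>,
}

#[derive(Clone, Copy)]
enum Order {
    /// Free the result's slot, then assign stack-map slots (the buggy order).
    ResultsFirst,
    /// Assign stack-map slots, then free the result's slot (the fixed order).
    SafepointFirst,
}

/// Assign a slot to every value live across `call`.
fn stack_map_slots(pool: &mut SlotPool, call: &Call) -> Vec<Slot> {
    call.live_across.iter().map(|&v| pool.get_or_create(v)).collect()
}

/// Walk the calls backwards and report whether any call's result slot
/// also appears in that same call's stack map.
fn has_conflict(calls: &[Call], order: Order) -> bool {
    let mut pool = SlotPool::default();
    let mut conflict = false;
    for call in calls.iter().rev() {
        let (result_slot, stack_map) = match order {
            Order::ResultsFirst => {
                let freed = pool.release(call.result);
                (freed, stack_map_slots(&mut pool, call))
            }
            Order::SafepointFirst => {
                let map = stack_map_slots(&mut pool, call);
                (pool.release(call.result), map)
            }
        };
        if let Some(slot) = result_slot {
            conflict |= stack_map.contains(&slot);
        }
    }
    conflict
}
```

In this model, take two calls where value 0 is live across the call that defines value 1, and value 1 is live across the following call. `has_conflict` then reports a shared slot under `Order::ResultsFirst` and none under `Order::SafepointFirst`.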
|
Confirmed — this fixes #13461 on our end. We originally hit the What I did:
Thanks for the quick turnaround. |
Bump the vendor/wasmtime fork (wado-lang/wasmtime, gfx/wasmtime-45) to pick up the backport of bytecodealliance/wasmtime#13469, which fixes the Cranelift safepoint-spiller regression (#13461) that miscompiled GC refs live across call safepoints into null reads. This unblocks the wasmtime 44->45 upgrade: the 7 previously failing e2e tests (serde_json_* and value_copy_nested_array_helper_chain) now pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Hi @angelnereira, thanks for the PR. However, it seems like the write-up is AI-generated text. Please review https://github.com/bytecodealliance/governance/blob/main/AI_TOOL_POLICY.md. In particular, regardless of whether you are using AI as a tool yourself, you must review and own that output, and you must not simply use AI output for comments, PR descriptions, etc. You must fully own your contributions and take responsibility for them, not foist the work of understanding what the LLM did on project maintainers. |
This test is too large and effectively useless. Maintainers cannot understand it, and if it ever regressed in the future due to a reintroduction of this bug or one like it, we wouldn't be able to diagnose what is happening. Please make a test case that is no more than ~100 lines long, which you should be able to do as the contributor taking responsibility for your own pull request and understanding the bug it is fixing.
Thanks for pointing this out. My native language is Spanish, so I sometimes use tools to help translate or synthesize comments in English. That said, I understand the concern and the policy: I am responsible for fully reviewing, understanding, and owning anything I post. I’ll be more careful going forward and make sure my comments and PR descriptions are written and reviewed by me, and only posted when I can fully stand behind them. Sorry for the noise, and thanks for the clarification. |
To be clear, using an LLM to translate a human-written comment from Spanish to English is perfectly fine. Having the LLM write an English comment based on a Spanish prompt is not. Thanks! Appreciate that you are receptive to this feedback. |
4c52f1d to 748f399
|
Thanks for the review. I removed the large WAST regression The new test directly checks the slot-reuse condition fixed here: a safepoint result's stack slot must not be reused for another I verified it with: |
The safepoint spiller walks instructions backwards. Before this change, it could free a stack slot for the value defined by a safepoint instruction before reserving stack-map slots for values live across that same safepoint. That made it possible to reuse the same slot for both values.

Rewrite safepoints before rewriting the instruction results, so live-across values reserve their stack-map slots before result slots are returned to the free list.

Replace the large WAST regression test with a focused safepoint-spiller unit test that directly checks this slot-reuse condition.

Testing:
- cargo test -p cranelift-frontend safepoint_reserves_live_slots_before_freeing_result_slots
- cargo test -p cranelift-frontend safepoints
748f399 to 33c80eb
let mut spiller = SafepointSpiller::default();
spiller.liveness.post_order.push(block0);
spiller.liveness.live_across_any_safepoint.insert(live);
spiller
    .liveness
    .safepoints
    .insert(call, [live].into_iter().collect());

let result_slot = spiller
    .stack_slots
    .get_or_create_stack_slot(&mut func, result);
spiller.rewrite(&mut func);

let live_slot = spiller.stack_slots.get(live).unwrap();
assert_ne!(
    result_slot, live_slot,
    "the safepoint result slot must not be reused for a value live across that same safepoint"
);
This test is a little too low-level to really be very useful, because the bug is not in the low-level get-or-create-stack-slot APIs, it is in the order that those APIs are called when rewriting the whole function based on the liveness analysis.
It should be possible to make a test at the whole CLIF function level which asserts the expected output CLIF after running the safepoint spiller via assert_eq_output!(...), similar to e.g. the needs_stack_map_and_loop test in this test module, but which exercises this bug and checks for regressions. If I understand correctly, what is needed is something like this:
block0(v0: i64):
v1 = call f(v0)
;; v1 needs inclusion in stack maps
v2 = call f(v1)
;; v2 needs inclusion in stack maps
v3 = call f(v2)
return v3
That is, we have two values that need inclusion in stack maps, have the same type and non-overlapping live ranges and therefore could possibly reuse the same stack slot, and the live range for one ends at a safepoint.
This might not be the exact shape necessary to trigger the bug. It might require another call or that the values have longer live ranges across additional safepoints. I'm not exactly sure, but you should be able to come up with something based off this initial starting point. Basically just look at the low-level API call sequence you're currently making and craft a CLIF function that will trigger that same low-level API call sequence.
Please make sure that the invalid stack slot reuse is present in this test without the fix, and then that the invalid stack slot reuse goes away after the fix is reapplied.
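The question of which CLIF shape is needed can be explored with a small backward liveness model before writing the real test. The sketch below is an assumption-laden simplification, not Cranelift's liveness analysis: `Inst` and `live_across` are hypothetical names, the function is straight-line code, and a value counts as live across a safepoint only if it is used after that instruction.

```rust
use std::collections::BTreeSet;

type Value = u32;

/// One straight-line instruction in the toy model (hypothetical type).
struct Inst {
    defs: Vec<Value>,
    uses: Vec<Value>,
    is_safepoint: bool,
}

/// For each instruction, in program order, return the tracked values that
/// are live across it. Non-safepoint instructions get an empty set.
fn live_across(insts: &[Inst], tracked: &BTreeSet<Value>) -> Vec<BTreeSet<Value>> {
    let mut live: BTreeSet<Value> = BTreeSet::new();
    let mut out = vec![BTreeSet::new(); insts.len()];
    // Walk backwards: kill defs, record the safepoint's live set, then add uses.
    for (i, inst) in insts.iter().enumerate().rev() {
        for def in &inst.defs {
            live.remove(def);
        }
        if inst.is_safepoint {
            out[i] = live.clone();
        }
        for used in &inst.uses {
            // Only values that need stack-map entries are tracked.
            if tracked.contains(used) {
                live.insert(*used);
            }
        }
    }
    out
}
```

Under these assumptions, the three-call chain sketched above yields empty live-across sets, because each result is consumed as an argument of the next call and never used afterwards. A variant in which `v1` is still used after the call defining `v2`, and `v2` is used after a third call, makes `v1` live across the call that defines `v2`. Whether that variant reproduces the bug in the real spiller is not confirmed here and would need to be checked against the actual test output.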
|
I think this is superseded by #13480. The root cause is that loop-invariant values aren't tracked properly in the rewrite walk. Reordering |
|
Closing in favor of #13498 but if you can create a test case that still fails and isn't fixed by that PR, then please open a new PR/issue! Thanks |
Summary
Fixes #13461.
The safepoint spiller walks instructions backwards and can free a stack slot for a value defined by a safepoint instruction before assigning stack-map slots for values live across that same safepoint. If the freed slot is reused, the stack map can point at a slot that contains the instruction result rather than the value that must remain live across the call.
This changes the rewrite order so safepoint stack-map entries are assigned before result slots for that instruction are freed for reuse. It also adds the GC regression test from the issue to cover the null-reference trap that exposed this.
Testing