Configurable native stack size for Stylus with auto-retry on overflow#4538
Configurable native stack size for Stylus with auto-retry on overflow#4538
Conversation
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #4538 +/- ##
==========================================
- Coverage 35.08% 33.90% -1.19%
==========================================
Files 498 494 -4
Lines 59096 59400 +304
==========================================
- Hits 20735 20139 -596
- Misses 34651 35708 +1057
+ Partials 3710 3553 -157 |
❌ 16 Tests Failed:
View the top 3 failed tests by shortest run time
📣 Thoughts on this report? Let Codecov know! | Powered by Codecov |
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
| if status == userNativeStackOverflow { | ||
| return nil, fmt.Errorf("%w (program=%v, module=%v, depth=%d, allowFallback=%v, onChain=%v)", | ||
| ErrNativeStackOverflow, address, moduleHash, depth, GetAllowFallback(), runCtx.IsExecutedOnChain()) | ||
| } |
There was a problem hiding this comment.
do we want to return error on all userNativeStackOverflow or only when status == userNativeStackOverflow && (!GetAllowFallback() || !runCtx.IsExecutedOnChain())? Or do we want to panic here?
| "program", address, "module", moduleHash) | ||
| return userNativeStackOverflow, nil | ||
| } | ||
| if !runCtx.IsExecutedOnChain() { |
There was a problem hiding this comment.
IsExecutedOnChain is true for when the call is executed in:
messageCommitMode- new synced/sequenced blockmessageRecordingMode- block re-executed to record preimages required to create ValidationInputmessageReplayMode- block/transaction is executed i.e. in trace RPC call
IsExecutedOnChain is means:
these message modes are executed onchain so cannot make any gas shortcuts
I am not sure if we want to use fallback in any other mode then messageCommitMode. Do we need to support fallback in messageRecordingMode @tsahee ?
| // doubling path. No corruption or crash results. | ||
| baseStackSize := configuredNativeStackSize.Load() | ||
| defer func() { | ||
| SetNativeStackSize(baseStackSize) |
There was a problem hiding this comment.
Not sure yet, but I think that there might be an issue with setting stack size globally.
Because IsExecutedOnChain is true at this point the concurrent execution of trace calls can possibly mess with the head block execution.
If we decide that we want to allow fallback only when IsCommitMode is true, then I think we should have some guarantee that only one thread will set the stack size. That still will mess with stack size used by block recording thread and RPC calls - not sure how serious is that.
There was a problem hiding this comment.
Let me give some more context.
SetNativeStackSize is process-wide — it modifies a global AtomicUsize (DEFAULT_STACK_SIZE) in Wasmer shared across all threads.
DrainStackPool is best-effort (lock-free queue). So yes, when retryOnStackOverflow doubles the stack, every concurrent Stylus execution in the process is affected.
Concurrent execution paths that share this global:
┌──────────────────────────────────┬───────────────────┬──────────────┬───────────────────────────────────────┐
│ Context │ IsExecutedOnChain │ IsCommitMode │ Concurrent? │
├──────────────────────────────────┼───────────────────┼──────────────┼───────────────────────────────────────┤
│ Sequencer (block production) │ true │ true │ Yes, with all below │
├──────────────────────────────────┼───────────────────┼──────────────┼───────────────────────────────────────┤
│ Block validation (non-sequencer) │ true │ true │ Yes │
├──────────────────────────────────┼───────────────────┼──────────────┼───────────────────────────────────────┤
│ Debug trace (debug_traceBlock) │ true │ false │ Highly — runtime.NumCPU() goroutines │
├──────────────────────────────────┼───────────────────┼──────────────┼───────────────────────────────────────┤
│ Block recording (proofs) │ true │ false │ Yes │
├──────────────────────────────────┼───────────────────┼──────────────┼───────────────────────────────────────┤
│ eth_call / gas estimation │ false │ false │ Yes, but fallback is already disabled │
└──────────────────────────────────┴───────────────────┴──────────────┴───────────────────────────────────────┘
The key issue: debug trace calls run with IsExecutedOnChain=true and spawn multiple goroutines sharing one runCtx. If the sequencer doubles the stack during block production, concurrent trace goroutines may see the doubled size (benign — they just get more stack than needed). If the sequencer's defer restores the base size while a trace goroutine is about to allocate a coroutine, that goroutine gets the smaller stack and may overflow again — but that's equivalent to the trace never seeing the doubled value. No corruption, just a duplicate overflow.
Lets take 2 threads example in a happy path:
┌──────┬──────────────────────────────────┬─────────────────────────────────────┬──────────────────────┐
│ Step │ G1 │ G2 │ DEFAULT_STACK_SIZE │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 1 │ SetNativeStackSize(2MB) │ │ 2MB │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 2 │ DrainStackPool() │ │ 2MB │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 3 │ doStylusCall() → enters Rust │ │ 2MB │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 4 │ catch_traps: reads 2MB, allocs │ │ 2MB │
│ │ 2MB stack │ │ │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 5 │ executing wasm... │ SetNativeStackSize(2MB) │ 2MB (no-op, same │
│ │ │ │ value) │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 6 │ executing wasm... │ DrainStackPool() │ 2MB │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 7 │ executing wasm... │ doStylusCall() → enters Rust │ 2MB │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 8 │ executing wasm... │ catch_traps: reads 2MB, │ 2MB │
│ │ │ DefaultStack::new(2MB) │ │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 9 │ returns from Rust │ executing wasm with 2MB stack... │ 2MB │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 10 │ defer: SetNativeStackSize(1MB) │ executing wasm with 2MB stack... │ 1MB │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 11 │ defer: DrainStackPool() │ executing wasm with 2MB stack... │ 1MB │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 12 │ done │ returns successfully │ 1MB │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 13 │ │ defer: SetNativeStackSize(1MB) │ 1MB │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 14 │ │ defer: DrainStackPool() │ 1MB │
└──────┴──────────────────────────────────┴─────────────────────────────────────┴──────────────────────┘
And now the same threads in an unhappy path:
┌──────┬─────────────────────────────────┬─────────────────────────────────────────┬────────────────────┐
│ Step │ G1 │ G2 │ DEFAULT_STACK_SIZE │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 1 │ SetNativeStackSize(2MB) │ │ 2MB │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 2 │ DrainStackPool() │ │ 2MB │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 3 │ doStylusCall() → enters Rust │ │ 2MB │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 4 │ catch_traps: reads 2MB, allocs │ │ 2MB │
│ │ 2MB │ │ │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 5 │ executing wasm... │ SetNativeStackSize(2MB) │ 2MB │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 6 │ executing wasm... │ DrainStackPool() │ 2MB │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 7 │ returns from Rust │ │ 2MB │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 8 │ defer: SetNativeStackSize(1MB) │ │ 1MB │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 9 │ defer: DrainStackPool() │ │ 1MB │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 10 │ done │ doStylusCall() → enters Rust │ 1MB │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 11 │ │ catch_traps: reads 1MB, │ 1MB │
│ │ │ DefaultStack::new(1MB) │ │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 12 │ │ overflows again on 1MB stack │ 1MB │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 13 │ │ returns NativeStackOverflow → tx │ 1MB │
│ │ │ reverts │ │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 14 │ │ defer: SetNativeStackSize(1MB) │ 1MB │
└──────┴─────────────────────────────────┴─────────────────────────────────────────┴────────────────────┘
the worst case in any race scenario is that a concurrent thread overflows again (same outcome as if the doubling never happened) which is the last scenario above. Does that help?
arbos/programs/native.go
Outdated
|
|
||
| evmApi := newApi(evm, tracingInfo, scope, memoryModel) | ||
| defer evmApi.drop() | ||
| savedGas := scope.Contract.Gas |
There was a problem hiding this comment.
Is this only field that we need to save for possible later revert?
I think that at least Contract.UseMultiGas and Contract.RetainedMultiGas might need to also be reverted.
Also what about the rest of vm.ScopeContext?
I haven't read into that deep enough, but I think that we might need to save whole vm.ScopeContext as it includes Stack and Memory context of the contract and we will need to revert those also.
There was a problem hiding this comment.
Here's a quick analysis by Claude and was able to confirm by looking further at the code:
scope.Stack— This is the EVM operand stack (used by the interpreter forPUSH,POP,ADD, etc.). Stylus programs run inside Wasmer with their own WASM stack. The CGo stylus_call never touches scope.Stack. In fact, when Stylus is invoked, the EVM interpreter isn't running — callProgram is called directly from the precompile path, so the EVM stack is dormant.scope.Memory— Same story. This is EVM memory (used byMLOAD,MSTORE, etc.). Stylus programs have their own WASM linear memory managed by Wasmer. The stylus_call FFI doesn't read or writescope.Memory.scope.Contract(other fields) — The immutable fields (caller, address, code, codehash, value) don't change during execution.RetainedMultiGasis only modified by EVM instructions, never by Stylus host I/O.
I'll save UsedMultiGas as well. Does the above make sense, or did I miss something and we need to save other fields?
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
arbos/programs/native.go
Outdated
|
|
||
| if compiled { | ||
| // Freshly compiled means localAsm was singlepass, so cranelift is different | ||
| // code worth retrying. If !compiled, the cranelift ASM was already in the |
There was a problem hiding this comment.
not necessarily true. If database has both we have previously singlepass.
I think it's worthwhile to retry with cranelift anywat
There was a problem hiding this comment.
I was thinking about the very first time this gets compiled, but for subsequent calls, when db has both, it'll have been singlepass, so yeah you're right
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
| return nil, fmt.Errorf("failed to get wasm for cranelift compilation: %w", err) | ||
| } | ||
|
|
||
| asm, err := compileNative(wasm, version, debug, rawdb.LocalTarget(), true, 15*time.Second) |
There was a problem hiding this comment.
I might be wrong here, but if a debug vs. non-debug node run this they will compile different cranelift asm's right? Is that a problem? Do we want to encode whether or not the asm was in debug mode in the wasm store?
There was a problem hiding this comment.
If that's a problem it seems to be a pre-existing one. Correct if I'm wrong but in practice, debug mode is a per-node setting and the wasm store is local (not shared between nodes), so a single node would always use the same debug value consistently. A debug node would only ever read its own debug-compiled ASM. The only risk would be if a node changed its debug flag between restarts without clearing the wasm store? Not sure if that's realistic. Deferring to @tsahee :)
| // 3. Retry with cranelift ASM. | ||
| // 4. If cranelift also overflows, double the stack size once and retry with cranelift. | ||
| // 5. If still overflowing, give up. | ||
| func retryOnStackOverflow( |
There was a problem hiding this comment.
Should we "remember" via some LRU cache that a program failed singlepass so that we don't have to do the expensive operation again? That is, go straight for cranelift next time if re-attempted?
There was a problem hiding this comment.
Good point, if that problem keeps being submitted we'll keep overflowing. The good thing is that cranelift compilation will be remembered second time around but we haven't prevented the original overflow. Maybe an in-memory cache here with single pass overflows wouldn't be that bad. What you think @tsahee ?
| return nil, fmt.Errorf("failed to get wasm for cranelift compilation: %w", err) | ||
| } | ||
|
|
||
| asm, err := compileNative(wasm, version, debug, rawdb.LocalTarget(), true, 15*time.Second) |
There was a problem hiding this comment.
This could also return an empty asm, right? Should we handle that case in any special way?
There was a problem hiding this comment.
Looking at the code, a successful compilation that produces zero bytes would be a Wasmer bug. So in practice no; if compileNative succeeds (returns nil error), the ASM shouldn't be empty. The Rust stylus_compile either returns a non-zero status (mapped to error) or writes the compiled module bytes to output. But I guess we can never be too careful? Maybe we can add something like the following after if err != nil {?
if asm == nil {
return nil, fmt.Errorf("asm shouldn't have been nil, something is wrong!")
}| // Get or compile cranelift ASM (persisted to wasm store on compilation). | ||
| craneliftAsm, err := getCraneliftAsm(moduleHash, address, db, code, params, program.version, debug) | ||
| if err != nil || len(craneliftAsm) == 0 { | ||
| log.Error("native stack overflow, cranelift ASM unavailable", |
There was a problem hiding this comment.
Why don't we revert snapshot always when this function errors? It seems safer than doing it after these lines
There was a problem hiding this comment.
When retryOnStackOverflow returns userNativeStackOverflow, the caller returns an error. That error bubbles up through the EVM, which reverts using its own snapshot from the outer call frame. Doing an explicit RevertToSnapshot here would be redundant; and if we reverted but then some future code change made the caller not return an error, we'd have silently discarded state for no reason. This is the call stack
evm.Call() / evm.CallCode() / evm.DelegateCall() / evm.StaticCall()
│
│ snapshot := evm.StateDB.Snapshot() ← EVM takes its own snapshot
│
├─► EVMInterpreter.Run()
│
│ if IsStylusDeployableProgramPrefix(contract.Code) {
│ ret, err = evm.ProcessingHook.ExecuteWASM(callContext, input, evm)
│ return ← returns (ret, err) directly to evm.Call()
│ }
│
│ ├─► TxProcessor.ExecuteWASM()
│ │
│ │ └─► Programs.CallProgram()
│ │
│ │ └─► callProgram()
│ │
│ │ snapshot := db.Snapshot() ← inner snapshot for retry
│ │ status, output := doStylusCall(localAsm, ...)
│ │
│ │ if status == userNativeStackOverflow {
│ │ └─► retryOnStackOverflow()
│ │
│ │ // Early returns without revert <<<<<<<<<<<
│ │ }
│ │
│ │ if status == userNativeStackOverflow {
│ │ return nil, ErrNativeStackOverflow ← error returned
│ │ }
│ │
│ │ ◄── returns (nil, ErrNativeStackOverflow)
│ │
│ │ ◄── returns (nil, ErrNativeStackOverflow)
│ │
│ ◄── returns (nil, ErrNativeStackOverflow)
│
◄── returns (ret=nil, err=ErrNativeStackOverflow)
│
if err != nil {
evm.StateDB.RevertToSnapshot(snapshot) ← EVM reverts everything
}
let me know if I missed something
| DrainStackPool() | ||
|
|
||
| // Revert any partial state changes from the cranelift attempt before retrying. | ||
| db.RevertToSnapshot(snapshot) |
There was a problem hiding this comment.
Just a question: is the HostIO AddStylusPages journaled? That is, will it be reverted when we do RevertToSnapshot?
There was a problem hiding this comment.
I'm not too familiar with journaling but it doesn't seem like AddStylusPages is journaled. It directly mutates s.arbExtraData.openWasmPages and s.arbExtraData.everWasmPages without writing a journal entry. But I guess your concern is concerning using more pages in case we have to retry because we didn't reset them? The sequence of events would be:
CallProgram()adds the program's initial footprint: statedb.AddStylusPages(program.footprint)- During the first attempt, if the WASM calls
memory.grow,addPagescallsdb.AddStylusPages(pages), further incrementingopenWasmPages - On overflow,
retryOnStackOverflowcallsdb.RevertToSnapshot(snapshot): this reverts storage, balance, etc., but notopenWasmPages GasandusedMultiGasare correctly restored from saved values- On retry, when
memory.growfires again,addPagesreads the inflated open fromdb.AddStylusPages(), andGasCost(pages, open, ever)could compute a slightly higher cost since open is higher than it should be
The retry path only runs on-chain and only on native stack overflow — a rare, non-deterministic event. The consequence is slightly wrong gas charging for memory.grow on retry. Since this path is inherently non-deterministic (whether a node overflows depends on its native stack size), different nodes already can't rely on getting identical gas accounting here. That being said, it's probably safer to save and store openWasmPages to avoid spurious gas charges?
There was a problem hiding this comment.
Added the following changes so we also remember openPages and everPages:
- nitro: e8d60b3
- go-ethereum: OffchainLabs/go-ethereum@323a65c
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
| asm, err := compileNative(wasm, stylusVersion, debug, target, cranelift, timeout) | ||
| if err != nil { | ||
| if useFallback { | ||
| if craneliftTarget, ctErr := rawdb.CraneliftTarget(target); useFallback && ctErr == nil { |
There was a problem hiding this comment.
let's have fallback be applied both ways. So user could by default use cranelift and then fallback would be to singlpass.
There was a problem hiding this comment.
right now we always try singlepass first as that's hardcoded as cranelift := false. Would we introduce a new CLI for it or would you want to update the usage of the existing CLI AllowFallback flag? Meaning, if AllowFallback is true then we try cranelift target first, in that case it would be best to rename it?
|
|
||
| // selectLocalAsm returns the best available ASM for the local target from an asmMap. | ||
| // Singlepass (non-cranelift) takes precedence over cranelift. | ||
| func selectLocalAsm(asmMap map[rawdb.WasmTarget][]byte) ([]byte, bool) { |
There was a problem hiding this comment.
let's do it like that:
If cranelift version exists and GetAllowFallbcak() - return cranelift version.
Otherwise - return singlepass.
That way next time should of a fallbak-using contract will start with cranelift
| // 3. Retry with cranelift ASM. | ||
| // 4. If cranelift also overflows, double the stack size once and retry with cranelift. | ||
| // 5. If still overflowing, give up. | ||
| func retryOnStackOverflow( |
| craneliftTarget, err := rawdb.CraneliftTarget(rawdb.LocalTarget()) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to determine cranelift target: %w", err) | ||
| } |
There was a problem hiding this comment.
start with statedb.ActivatedAsmMap({craneliftTarget}, moduleHash) as it might already exist in the statedb entry
--stylus-target.native-stack-sizenode config to set the initial Wasmer coroutine stack size for Stylus execution (default: 0 = Wasmer's 1 MB default)NativeStackOverflowvariant toUserOutcome/UserOutcomeKindto distinguish retriable native overflows from deterministic OutOfStack (DepthChecker)--stylus-target.allow-fallback=true(default):a. Recompile the program with Cranelift and retry
b. Persist the Cranelift-compiled ASM to the wasm store so subsequent overflows skip recompilation
c. If Cranelift also overflows, double the stack size once (capped at 100 MB) and retry with Cranelift
d. If still overflowing, fail gracefully as out-of-stack
--stylus-target.allow-fallback=false, no retry is attempted on overfloweth_call, gas estimation) do not trigger retries or Cranelift compilationpulls in OffchainLabs/go-ethereum#645
pulls in OffchainLabs/wasmer#37
closes NIT-4686