
Configurable native stack size for Stylus with auto-retry on overflow#4538

Open
bragaigor wants to merge 34 commits into master from nm004-config-stylus-stack-size

Conversation

@bragaigor
Contributor

@bragaigor bragaigor commented Mar 20, 2026

  • Add --stylus-target.native-stack-size node config to set the initial Wasmer coroutine stack size for Stylus execution (default: 0 = Wasmer's 1 MB default)
  • Add NativeStackOverflow variant to UserOutcome/UserOutcomeKind to distinguish retriable native overflows from deterministic OutOfStack (DepthChecker)
  • On native stack overflow with --stylus-target.allow-fallback=true (default):
    a. Recompile the program with Cranelift and retry
    b. Persist the Cranelift-compiled ASM to the wasm store so subsequent overflows skip recompilation
    c. If Cranelift also overflows, double the stack size once (capped at 100 MB) and retry with Cranelift
    d. If still overflowing, fail gracefully as out-of-stack
  • With --stylus-target.allow-fallback=false, no retry is attempted on overflow
  • Off-chain calls (eth_call, gas estimation) do not trigger retries or Cranelift compilation
  • Add new WasmTarget variants for Cranelift ASM storage (arm64-cranelift, amd64-cranelift, host-cranelift) in go-ethereum

pulls in OffchainLabs/go-ethereum#645
pulls in OffchainLabs/wasmer#37
closes NIT-4686
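The retry policy above can be sketched as a self-contained toy. All names here are hypothetical stand-ins for the PR's retry path, and the overflow condition is simulated; the real code recompiles with Cranelift and re-enters Wasmer rather than calling a pure function:

```go
package main

import "fmt"

type status int

const (
	statusSuccess status = iota
	statusNativeStackOverflow
)

const maxStackSize = 100 << 20 // 100 MB cap from the description above

// call simulates a Stylus execution that overflows whenever the
// coroutine stack is smaller than `need` bytes (hypothetical stand-in).
func call(stackSize, need int) status {
	if stackSize < need {
		return statusNativeStackOverflow
	}
	return statusSuccess
}

// retryOnOverflow mirrors the documented policy: retry with the
// (recompiled) cranelift ASM, then double the stack once, then give up.
func retryOnOverflow(base, need int, allowFallback bool) status {
	st := call(base, need)
	if st != statusNativeStackOverflow || !allowFallback {
		return st
	}
	// a. + b. recompile with Cranelift and retry at the same stack size
	if st = call(base, need); st != statusNativeStackOverflow {
		return st
	}
	// c. double the stack size once, capped at 100 MB
	doubled := min(base*2, maxStackSize)
	// d. if still overflowing, this surfaces as out-of-stack
	return call(doubled, need)
}

func main() {
	fmt.Println(retryOnOverflow(1<<20, 2<<20, true) == statusSuccess)
	fmt.Println(retryOnOverflow(1<<20, 2<<20, false) == statusNativeStackOverflow)
}
```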

Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
@codecov

codecov bot commented Mar 20, 2026

Codecov Report

❌ Patch coverage is 8.65385% with 475 lines in your changes missing coverage. Please review.
✅ Project coverage is 33.90%. Comparing base (620aa42) to head (e8d60b3).
⚠️ Report is 44 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4538      +/-   ##
==========================================
- Coverage   35.08%   33.90%   -1.19%     
==========================================
  Files         498      494       -4     
  Lines       59096    59400     +304     
==========================================
- Hits        20735    20139     -596     
- Misses      34651    35708    +1057     
+ Partials     3710     3553     -157     

@github-actions
Contributor

github-actions bot commented Mar 20, 2026

❌ 16 Tests Failed:

Tests completed Failed Passed Skipped
4824 16 4808 0
View the top 3 failed tests by shortest run time
TestAliasingFlaky
Stack Traces | -0.000s run time
=== RUN   TestAliasingFlaky
=== PAUSE TestAliasingFlaky
=== CONT  TestAliasingFlaky
    common_test.go:777: BuildL1 deployConfig: DeployBold=true, DeployReferenceDAContracts=false
INFO [04-03|23:31:46.739] Starting peer-to-peer node               instance=test-stack-name/linux-amd64/go1.25.8
WARN [04-03|23:31:46.739] P2P server will be useless, neither dialing nor listening
INFO [04-03|23:31:46.740] Started log indexer
WARN [04-03|23:31:46.740] Getting file info                        dir= error="stat : no such file or directory"
TestPruningDBSizeReduction
Stack Traces | 0.000s run time
=== RUN   TestPruningDBSizeReduction
--- FAIL: TestPruningDBSizeReduction (0.00s)
TestBatchPosterL1SurplusMatchesBatchGasFlaky
Stack Traces | 0.580s run time
... [CONTENT TRUNCATED: Keeping last 20 lines]
panic: runtime error: invalid memory address or nil pointer dereference [recovered, repanicked]
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x2080b12]

goroutine 42 [running]:
testing.tRunner.func1.2({0x37eb220, 0x62099e0})
	/opt/hostedtoolcache/go/1.25.8/x64/src/testing/testing.go:1872 +0x237
testing.tRunner.func1()
	/opt/hostedtoolcache/go/1.25.8/x64/src/testing/testing.go:1875 +0x35b
panic({0x37eb220?, 0x62099e0?})
	/opt/hostedtoolcache/go/1.25.8/x64/src/runtime/panic.go:783 +0x132
github.com/offchainlabs/nitro/arbnode.(*InboxTracker).GetBatchCount(0x1d31900?)
	/home/runner/work/nitro/nitro/arbnode/inbox_tracker.go:210 +0x12
github.com/offchainlabs/nitro/arbnode.(*InboxTracker).FindInboxBatchContainingMessage(0x0, 0x7)
	/home/runner/work/nitro/nitro/arbnode/inbox_tracker.go:225 +0x2f
github.com/offchainlabs/nitro/system_tests.TestBatchPosterL1SurplusMatchesBatchGasFlaky(0xc0002c28c0)
	/home/runner/work/nitro/nitro/system_tests/batch_poster_test.go:839 +0x725
testing.tRunner(0xc0002c28c0, 0x41be9e8)
	/opt/hostedtoolcache/go/1.25.8/x64/src/testing/testing.go:1934 +0xea
created by testing.(*T).Run in goroutine 1
	/opt/hostedtoolcache/go/1.25.8/x64/src/testing/testing.go:1997 +0x465


@bragaigor bragaigor marked this pull request as ready for review March 30, 2026 22:22
@bragaigor bragaigor marked this pull request as draft March 31, 2026 00:46
Comment on lines +496 to +499
if status == userNativeStackOverflow {
return nil, fmt.Errorf("%w (program=%v, module=%v, depth=%d, allowFallback=%v, onChain=%v)",
ErrNativeStackOverflow, address, moduleHash, depth, GetAllowFallback(), runCtx.IsExecutedOnChain())
}
Contributor Author

@bragaigor bragaigor Apr 1, 2026


do we want to return error on all userNativeStackOverflow or only when status == userNativeStackOverflow && (!GetAllowFallback() || !runCtx.IsExecutedOnChain())? Or do we want to panic here?

Contributor


let's panic here

@bragaigor bragaigor marked this pull request as ready for review April 1, 2026 13:26
@bragaigor bragaigor requested a review from magicxyyz April 1, 2026 13:27
"program", address, "module", moduleHash)
return userNativeStackOverflow, nil
}
if !runCtx.IsExecutedOnChain() {
Contributor


IsExecutedOnChain is true for when the call is executed in:

  • messageCommitMode - new synced/sequenced block
  • messageRecordingMode - block re-executed to record preimages required to create ValidationInput
  • messageReplayMode - block/transaction is executed i.e. in trace RPC call

IsExecutedOnChain means:

these message modes are executed onchain so cannot make any gas shortcuts

https://github.com/OffchainLabs/go-ethereum/blob/461e5177cdbb8d237702dbb889ed2db19515f8c7/core/state_transition.go#L296-L299

I am not sure if we want to use fallback in any mode other than messageCommitMode. Do we need to support fallback in messageRecordingMode @tsahee ?

// doubling path. No corruption or crash results.
baseStackSize := configuredNativeStackSize.Load()
defer func() {
SetNativeStackSize(baseStackSize)
Contributor


Not sure yet, but I think that there might be an issue with setting stack size globally.

Because IsExecutedOnChain is true at this point the concurrent execution of trace calls can possibly mess with the head block execution.

If we decide that we want to allow fallback only when IsCommitMode is true, then I think we should have some guarantee that only one thread will set the stack size. That still will mess with stack size used by block recording thread and RPC calls - not sure how serious is that.

Contributor Author


Let me give some more context.

SetNativeStackSize is process-wide — it modifies a global AtomicUsize (DEFAULT_STACK_SIZE) in Wasmer shared across all threads.
DrainStackPool is best-effort (lock-free queue). So yes, when retryOnStackOverflow doubles the stack, every concurrent Stylus execution in the process is affected.

Concurrent execution paths that share this global:

  ┌──────────────────────────────────┬───────────────────┬──────────────┬───────────────────────────────────────┐
  │             Context              │ IsExecutedOnChain │ IsCommitMode │              Concurrent?              │
  ├──────────────────────────────────┼───────────────────┼──────────────┼───────────────────────────────────────┤
  │ Sequencer (block production)     │ true              │ true         │ Yes, with all below                   │
  ├──────────────────────────────────┼───────────────────┼──────────────┼───────────────────────────────────────┤
  │ Block validation (non-sequencer) │ true              │ true         │ Yes                                   │
  ├──────────────────────────────────┼───────────────────┼──────────────┼───────────────────────────────────────┤
  │ Debug trace (debug_traceBlock)   │ true              │ false        │ Highly — runtime.NumCPU() goroutines  │
  ├──────────────────────────────────┼───────────────────┼──────────────┼───────────────────────────────────────┤
  │ Block recording (proofs)         │ true              │ false        │ Yes                                   │
  ├──────────────────────────────────┼───────────────────┼──────────────┼───────────────────────────────────────┤
  │ eth_call / gas estimation        │ false             │ false        │ Yes, but fallback is already disabled │
  └──────────────────────────────────┴───────────────────┴──────────────┴───────────────────────────────────────┘

The key issue: debug trace calls run with IsExecutedOnChain=true and spawn multiple goroutines sharing one runCtx. If the sequencer doubles the stack during block production, concurrent trace goroutines may see the doubled size (benign — they just get more stack than needed). If the sequencer's defer restores the base size while a trace goroutine is about to allocate a coroutine, that goroutine gets the smaller stack and may overflow again — but that's equivalent to the trace never seeing the doubled value. No corruption, just a duplicate overflow.

Let's take a two-thread example on the happy path:

  ┌──────┬──────────────────────────────────┬─────────────────────────────────────┬──────────────────────┐    
  │ Step │                G1                │                 G2                  │  DEFAULT_STACK_SIZE  │    
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤    
  │ 1    │ SetNativeStackSize(2MB)          │                                     │ 2MB                  │    
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤    
  │ 2    │ DrainStackPool()                 │                                     │ 2MB                  │    
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
  │ 3    │ doStylusCall() → enters Rust     │                                     │ 2MB                  │    
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤    
  │ 4    │ catch_traps: reads 2MB, allocs   │                                     │ 2MB                  │
  │      │ 2MB stack                        │                                     │                      │    
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
  │ 5    │ executing wasm...                │ SetNativeStackSize(2MB)             │ 2MB (no-op, same     │    
  │      │                                  │                                     │ value)               │
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤    
  │ 6    │ executing wasm...                │ DrainStackPool()                    │ 2MB                  │
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤    
  │ 7    │ executing wasm...                │ doStylusCall() → enters Rust        │ 2MB                  │
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
  │ 8    │ executing wasm...                │ catch_traps: reads 2MB,             │ 2MB                  │
  │      │                                  │ DefaultStack::new(2MB)              │                      │
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤    
  │ 9    │ returns from Rust                │ executing wasm with 2MB stack...    │ 2MB                  │
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤    
  │ 10   │ defer: SetNativeStackSize(1MB)   │ executing wasm with 2MB stack...    │ 1MB                  │
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤    
  │ 11   │ defer: DrainStackPool()          │ executing wasm with 2MB stack...    │ 1MB                  │
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤    
  │ 12   │ done                             │ returns successfully                │ 1MB                  │
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤    
  │ 13   │                                  │ defer: SetNativeStackSize(1MB)      │ 1MB                  │
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤    
  │ 14   │                                  │ defer: DrainStackPool()             │ 1MB                  │
  └──────┴──────────────────────────────────┴─────────────────────────────────────┴──────────────────────┘

And now the same threads in an unhappy path:

  ┌──────┬─────────────────────────────────┬─────────────────────────────────────────┬────────────────────┐
  │ Step │               G1                │                   G2                    │ DEFAULT_STACK_SIZE │   
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
  │ 1    │ SetNativeStackSize(2MB)         │                                         │ 2MB                │   
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
  │ 2    │ DrainStackPool()                │                                         │ 2MB                │
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤   
  │ 3    │ doStylusCall() → enters Rust    │                                         │ 2MB                │
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤   
  │ 4    │ catch_traps: reads 2MB, allocs  │                                         │ 2MB                │
  │      │ 2MB                             │                                         │                    │   
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
  │ 5    │ executing wasm...               │ SetNativeStackSize(2MB)                 │ 2MB                │   
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
  │ 6    │ executing wasm...               │ DrainStackPool()                        │ 2MB                │
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
  │ 7    │ returns from Rust               │                                         │ 2MB                │   
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
  │ 8    │ defer: SetNativeStackSize(1MB)  │                                         │ 1MB                │   
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
  │ 9    │ defer: DrainStackPool()         │                                         │ 1MB                │   
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
  │ 10   │ done                            │ doStylusCall() → enters Rust            │ 1MB                │   
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
  │ 11   │                                 │ catch_traps: reads 1MB,                 │ 1MB                │   
  │      │                                 │ DefaultStack::new(1MB)                  │                    │
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤   
  │ 12   │                                 │ overflows again on 1MB stack            │ 1MB                │
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
  │ 13   │                                 │ returns NativeStackOverflow → tx        │ 1MB                │
  │      │                                 │ reverts                                 │                    │   
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
  │ 14   │                                 │ defer: SetNativeStackSize(1MB)          │ 1MB                │   
  └──────┴─────────────────────────────────┴─────────────────────────────────────────┴────────────────────┘

The worst case in any race scenario is that a concurrent thread overflows again (the same outcome as if the doubling never happened), which is the last scenario above. Does that help?
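A minimal sketch of the set/restore pattern being discussed, assuming the process-global is modeled as a Go atomic. In the real code the global is Wasmer's DEFAULT_STACK_SIZE behind FFI, and the DrainStackPool calls are omitted here:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// configuredNativeStackSize stands in for Wasmer's process-global
// DEFAULT_STACK_SIZE (an AtomicUsize shared by all threads).
var configuredNativeStackSize atomic.Uint64

func SetNativeStackSize(n uint64) { configuredNativeStackSize.Store(n) }

// withDoubledStack doubles the global stack size for the duration of fn
// and restores the configured base afterwards. A concurrent goroutine
// reading the global during this window simply sees the doubled value,
// which is benign; reading the restored value can at worst reproduce
// the original overflow, as in the unhappy-path table above.
func withDoubledStack(fn func()) {
	base := configuredNativeStackSize.Load()
	SetNativeStackSize(base * 2)
	defer SetNativeStackSize(base)
	fn()
}

func main() {
	SetNativeStackSize(1 << 20)
	withDoubledStack(func() {
		fmt.Println(configuredNativeStackSize.Load() == 2<<20) // doubled inside
	})
	fmt.Println(configuredNativeStackSize.Load() == 1<<20) // restored after
}
```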


evmApi := newApi(evm, tracingInfo, scope, memoryModel)
defer evmApi.drop()
savedGas := scope.Contract.Gas
Contributor


Is this the only field that we need to save for a possible later revert?
I think that at least Contract.UsedMultiGas and Contract.RetainedMultiGas might also need to be reverted.

Also what about the rest of vm.ScopeContext?

I haven't read into that deep enough, but I think that we might need to save whole vm.ScopeContext as it includes Stack and Memory context of the contract and we will need to revert those also.

Contributor Author

@bragaigor bragaigor Apr 1, 2026


Here's a quick analysis by Claude, which I was able to confirm by looking further at the code:

  • scope.Stack — This is the EVM operand stack (used by the interpreter for PUSH, POP, ADD, etc.). Stylus programs run inside Wasmer with their own WASM stack. The CGo stylus_call never touches scope.Stack. In fact, when Stylus is invoked, the EVM interpreter isn't running — callProgram is called directly from the precompile path, so the EVM stack is dormant.
  • scope.Memory — Same story. This is EVM memory (used by MLOAD, MSTORE, etc.). Stylus programs have their own WASM linear memory managed by Wasmer. The stylus_call FFI doesn't read or write scope.Memory.
  • scope.Contract (other fields) — The immutable fields (caller, address, code, codehash, value) don't change during execution. RetainedMultiGas is only modified by EVM instructions, never by Stylus host I/O.

I'll save UsedMultiGas as well. Does the above make sense, or did I miss something and we need to save other fields?
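A toy sketch of the save/restore being discussed. The contract type and the withGasSnapshot helper are hypothetical stand-ins for the vm.Contract fields named in this thread, not the PR's actual code:

```go
package main

import "fmt"

// contract is a stand-in for the vm.Contract fields the retry path
// restores per this thread: Gas and UsedMultiGas.
type contract struct {
	Gas          uint64
	UsedMultiGas uint64
}

// withGasSnapshot runs attempt and, if it reports failure (e.g. a
// native stack overflow), restores the saved gas fields so the retry
// is charged from a clean slate.
func withGasSnapshot(c *contract, attempt func(*contract) bool) bool {
	savedGas, savedUsed := c.Gas, c.UsedMultiGas
	ok := attempt(c)
	if !ok {
		c.Gas, c.UsedMultiGas = savedGas, savedUsed
	}
	return ok
}

func main() {
	c := &contract{Gas: 1_000_000}
	withGasSnapshot(c, func(c *contract) bool {
		c.Gas -= 400_000 // first attempt burns gas...
		c.UsedMultiGas += 400_000
		return false // ...then hits a native stack overflow
	})
	fmt.Println(c.Gas == 1_000_000 && c.UsedMultiGas == 0)
}
```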


if compiled {
// Freshly compiled means localAsm was singlepass, so cranelift is different
// code worth retrying. If !compiled, the cranelift ASM was already in the
Contributor


Not necessarily true. If the database has both, we previously had singlepass.
I think it's worthwhile to retry with cranelift anyway.

Contributor Author


I was thinking about the very first time this gets compiled, but for subsequent calls, when the db has both, it'll have been singlepass, so yeah, you're right.

@bragaigor bragaigor self-assigned this Apr 2, 2026
return nil, fmt.Errorf("failed to get wasm for cranelift compilation: %w", err)
}

asm, err := compileNative(wasm, version, debug, rawdb.LocalTarget(), true, 15*time.Second)
Contributor


I might be wrong here, but if a debug and a non-debug node run this, they will compile different cranelift ASMs, right? Is that a problem? Do we want to encode whether or not the ASM was compiled in debug mode in the wasm store?

Contributor Author


If that's a problem, it seems to be a pre-existing one. Correct me if I'm wrong, but in practice debug mode is a per-node setting and the wasm store is local (not shared between nodes), so a single node would always use the same debug value consistently. A debug node would only ever read its own debug-compiled ASM. The only risk would be if a node changed its debug flag between restarts without clearing the wasm store? Not sure if that's realistic. Deferring to @tsahee :)

// 3. Retry with cranelift ASM.
// 4. If cranelift also overflows, double the stack size once and retry with cranelift.
// 5. If still overflowing, give up.
func retryOnStackOverflow(
Contributor


Should we "remember" via some LRU cache that a program failed singlepass so that we don't have to do the expensive operation again? That is, go straight for cranelift next time if re-attempted?

Contributor Author


Good point; if that program keeps being submitted we'll keep overflowing. The good thing is that the cranelift compilation will be remembered the second time around, but we haven't prevented the original overflow. Maybe an in-memory cache here of singlepass overflows wouldn't be that bad. What do you think @tsahee ?
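The in-memory cache floated here could be a small LRU keyed by module hash. This is a hypothetical sketch, not part of the diff:

```go
package main

import (
	"container/list"
	"fmt"
)

// overflowLRU remembers module hashes whose singlepass ASM overflowed,
// so a later call can go straight to cranelift (hypothetical sketch;
// the real module hash would be a common.Hash, not a string).
type overflowLRU struct {
	capacity int
	order    *list.List
	items    map[string]*list.Element
}

func newOverflowLRU(capacity int) *overflowLRU {
	return &overflowLRU{
		capacity: capacity,
		order:    list.New(),
		items:    map[string]*list.Element{},
	}
}

// Remember records an overflow, evicting the least recently seen entry
// once capacity is exceeded.
func (l *overflowLRU) Remember(moduleHash string) {
	if el, ok := l.items[moduleHash]; ok {
		l.order.MoveToFront(el)
		return
	}
	l.items[moduleHash] = l.order.PushFront(moduleHash)
	if l.order.Len() > l.capacity {
		old := l.order.Back()
		l.order.Remove(old)
		delete(l.items, old.Value.(string))
	}
}

// Overflowed reports whether this module previously overflowed singlepass.
func (l *overflowLRU) Overflowed(moduleHash string) bool {
	_, ok := l.items[moduleHash]
	return ok
}

func main() {
	lru := newOverflowLRU(2)
	lru.Remember("modA")
	lru.Remember("modB")
	lru.Remember("modC") // evicts modA
	fmt.Println(lru.Overflowed("modA"), lru.Overflowed("modC"))
}
```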

Contributor


answered in selectLocalAsm

return nil, fmt.Errorf("failed to get wasm for cranelift compilation: %w", err)
}

asm, err := compileNative(wasm, version, debug, rawdb.LocalTarget(), true, 15*time.Second)
Contributor


This could also return an empty asm, right? Should we handle that case in any special way?

Contributor Author


Looking at the code, a successful compilation that produces zero bytes would be a Wasmer bug. So in practice, no; if compileNative succeeds (returns a nil error), the ASM shouldn't be empty. The Rust stylus_compile either returns a non-zero status (mapped to an error) or writes the compiled module bytes to output. But I guess we can never be too careful? Maybe we can add something like the following after if err != nil {?

	if len(asm) == 0 {
		return nil, fmt.Errorf("compileNative returned empty asm without an error")
	}

// Get or compile cranelift ASM (persisted to wasm store on compilation).
craneliftAsm, err := getCraneliftAsm(moduleHash, address, db, code, params, program.version, debug)
if err != nil || len(craneliftAsm) == 0 {
log.Error("native stack overflow, cranelift ASM unavailable",
Contributor


Why don't we revert snapshot always when this function errors? It seems safer than doing it after these lines

Contributor Author


When retryOnStackOverflow returns userNativeStackOverflow, the caller returns an error. That error bubbles up through the EVM, which reverts using its own snapshot from the outer call frame. Doing an explicit RevertToSnapshot here would be redundant; and if we reverted but then some future code change made the caller not return an error, we'd have silently discarded state for no reason. This is the call stack:

  evm.Call() / evm.CallCode() / evm.DelegateCall() / evm.StaticCall()
    │
    │  snapshot := evm.StateDB.Snapshot()    ← EVM takes its own snapshot
    │
    ├─► EVMInterpreter.Run()
    │
    │     if IsStylusDeployableProgramPrefix(contract.Code) {
    │         ret, err = evm.ProcessingHook.ExecuteWASM(callContext, input, evm)
    │         return    ← returns (ret, err) directly to evm.Call()
    │     }
    │
    │   ├─► TxProcessor.ExecuteWASM()
    │   │
    │   │   └─► Programs.CallProgram()
    │   │
    │   │       └─► callProgram()
    │   │
    │   │             snapshot := db.Snapshot()   ← inner snapshot for retry
    │   │             status, output := doStylusCall(localAsm, ...)
    │   │
    │   │             if status == userNativeStackOverflow {
    │   │               └─► retryOnStackOverflow()
    │   │
    │   │                     // Early returns without revert <<<<<<<<<<<
    │   │             }
    │   │
    │   │             if status == userNativeStackOverflow {
    │   │                 return nil, ErrNativeStackOverflow  ← error returned
    │   │             }
    │   │
    │   │       ◄── returns (nil, ErrNativeStackOverflow)
    │   │
    │   │     ◄── returns (nil, ErrNativeStackOverflow)
    │   │
    │   ◄── returns (nil, ErrNativeStackOverflow)
    │
    ◄── returns (ret=nil, err=ErrNativeStackOverflow)
    │
    if err != nil {
        evm.StateDB.RevertToSnapshot(snapshot)   ← EVM reverts everything
    }

Let me know if I missed something.

DrainStackPool()

// Revert any partial state changes from the cranelift attempt before retrying.
db.RevertToSnapshot(snapshot)
Contributor


Just a question: is the HostIO AddStylusPages journaled? That is, will it be reverted when we do RevertToSnapshot?

Contributor Author


I'm not too familiar with journaling, but it doesn't seem like AddStylusPages is journaled. It directly mutates s.arbExtraData.openWasmPages and s.arbExtraData.everWasmPages without writing a journal entry. But I guess your concern is about using more pages in case we have to retry, because we didn't reset them? The sequence of events would be:

  1. CallProgram() adds the program's initial footprint: statedb.AddStylusPages(program.footprint)
  2. During the first attempt, if the WASM calls memory.grow, addPages calls db.AddStylusPages(pages), further incrementing openWasmPages
  3. On overflow, retryOnStackOverflow calls db.RevertToSnapshot(snapshot): this reverts storage, balance, etc., but not openWasmPages
  4. Gas and usedMultiGas are correctly restored from saved values
  5. On retry, when memory.grow fires again, addPages reads the inflated open from db.AddStylusPages(), and GasCost(pages, open, ever) could compute a slightly higher cost since open is higher than it should be

The retry path only runs on-chain and only on native stack overflow — a rare, non-deterministic event. The consequence is slightly wrong gas charging for memory.grow on retry. Since this path is inherently non-deterministic (whether a node overflows depends on its native stack size), different nodes already can't rely on getting identical gas accounting here. That being said, it's probably safer to save and restore openWasmPages to avoid spurious gas charges?
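Saving the page counters manually, as proposed, could look roughly like this. pageState and snapshotPages are hypothetical stand-ins for the statedb fields named above, not the PR's code:

```go
package main

import "fmt"

// pageState stands in for the statedb's arbExtraData page counters,
// which RevertToSnapshot does not cover per this thread.
type pageState struct {
	openWasmPages uint16
	everWasmPages uint16
}

// snapshotPages captures the current counters and returns a func that
// restores them, for use before a retry.
func snapshotPages(db *pageState) func() {
	saved := *db
	return func() { *db = saved }
}

func main() {
	db := &pageState{openWasmPages: 4, everWasmPages: 4}
	restore := snapshotPages(db)

	// first attempt: memory.grow inflates the counters
	db.openWasmPages += 8
	db.everWasmPages += 8

	restore() // on overflow, reset manually before the retry
	fmt.Println(db.openWasmPages == 4 && db.everWasmPages == 4)
}
```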

Contributor Author


Added changes so we also remember openPages and everPages.

asm, err := compileNative(wasm, stylusVersion, debug, target, cranelift, timeout)
if err != nil {
if useFallback {
if craneliftTarget, ctErr := rawdb.CraneliftTarget(target); useFallback && ctErr == nil {
Contributor


Let's have fallback be applied both ways, so a user could use cranelift by default and then fall back to singlepass.

Contributor Author


Right now we always try singlepass first, as that's hardcoded with cranelift := false. Would we introduce a new CLI flag for it, or would you want to change the usage of the existing AllowFallback CLI flag? Meaning, if AllowFallback is true then we try the cranelift target first; in that case it would be best to rename it?


// selectLocalAsm returns the best available ASM for the local target from an asmMap.
// Singlepass (non-cranelift) takes precedence over cranelift.
func selectLocalAsm(asmMap map[rawdb.WasmTarget][]byte) ([]byte, bool) {
Contributor


Let's do it like that:
If the cranelift version exists and GetAllowFallback() - return the cranelift version.
Otherwise - return singlepass.

That way the next use of a fallback-using contract will start with cranelift.
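The reviewer's selection rule could be sketched like this. The types are hypothetical, allowFallback stands in for GetAllowFallback(), and the target constants are illustrative stand-ins for the rawdb.WasmTarget variants:

```go
package main

import "fmt"

type wasmTarget string

// Illustrative stand-ins for the local singlepass and cranelift targets.
const (
	targetAmd64          wasmTarget = "amd64"
	targetAmd64Cranelift wasmTarget = "amd64-cranelift"
)

// selectLocalAsm prefers the cranelift ASM when it exists and fallback
// is allowed, otherwise returns the singlepass ASM. The bool reports
// whether the cranelift version was chosen.
func selectLocalAsm(asmMap map[wasmTarget][]byte, allowFallback bool) ([]byte, bool) {
	if asm, ok := asmMap[targetAmd64Cranelift]; ok && allowFallback {
		return asm, true
	}
	return asmMap[targetAmd64], false
}

func main() {
	m := map[wasmTarget][]byte{
		targetAmd64:          []byte("sp"),
		targetAmd64Cranelift: []byte("cl"),
	}
	asm, isCranelift := selectLocalAsm(m, true)
	fmt.Println(string(asm), isCranelift)
}
```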

// 3. Retry with cranelift ASM.
// 4. If cranelift also overflows, double the stack size once and retry with cranelift.
// 5. If still overflowing, give up.
func retryOnStackOverflow(
Contributor


answered in selectLocalAsm

craneliftTarget, err := rawdb.CraneliftTarget(rawdb.LocalTarget())
if err != nil {
return nil, fmt.Errorf("failed to determine cranelift target: %w", err)
}
Contributor


start with statedb.ActivatedAsmMap({craneliftTarget}, moduleHash) as it might already exist in the statedb entry
