
Reduce peak memory during prove by releasing witness shared memory early #204

Open
hczphn wants to merge 4 commits into hc/export_device_memories_unchecked from hc/witness_ack_memory_optimization

Conversation

@hczphn
Collaborator

@hczphn commented Feb 22, 2026

Summary

Based on # (hc/export_device_memories_unchecked).

Adds a witness ACK shared memory signaling mechanism to release witness memory as early as possible during the prove phase:

  • Client resets a 1-byte witness_ack shared memory signal before writing witness data
  • Server sets witness_ack to 1 after reading witness into MPI shared memory
  • Client polls witness_ack asynchronously; once confirmed, immediately drops the witness shared memory and calls malloc_trim(0) to return memory to OS
  • Prove HTTP request runs concurrently via tokio, so witness memory is freed while proving is in progress
  • Skip reading PCS setup from shared memory after setup (client doesn't need it; return default to save memory)

Memory flow (before → after)

Before: witness shared memory lives for the entire prove duration

write_witness → request_prove (blocks) → read_proof
                ^--- witness memory held here ---^

After: witness shared memory released as soon as server reads it

write_witness → [async: request_prove + poll witness_ack] → release witness → ... → read_proof
                                                            ^-- memory freed early

🤖 Generated with Claude Code

Add witness_ack shared memory signaling between client and server:
- Client resets a 1-byte ack signal before writing witness
- Server signals ack after reading witness into MPI shared memory
- Client polls for ack, then immediately releases witness shared memory
  and calls malloc_trim to return memory to OS
- Prove request runs concurrently via tokio async, so witness memory is
  freed while proving is in progress
- Skip reading PCS setup from shared memory (return default) since the
  client does not need it after setup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @hczphn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly optimizes memory usage during the proving phase by implementing a sophisticated shared memory signaling mechanism. By allowing the client to release witness data as soon as the server has processed it, and by making the prove request asynchronous, the peak memory consumption is substantially reduced. Additionally, unnecessary memory allocations for PCS setup on the client side are avoided, contributing to overall memory efficiency.

Highlights

  • Witness Acknowledgment (ACK) Mechanism: Introduced a 1-byte witness_ack shared memory signal to coordinate witness data transfer between client and server.
  • Early Witness Memory Release: Enabled the client to asynchronously poll the witness_ack signal and, once confirmed, immediately drop the witness shared memory and call malloc_trim(0) to return memory to the OS.
  • Asynchronous Prove Request: Implemented concurrent execution of the prove HTTP request via tokio, allowing witness memory to be freed while the proving process is still ongoing.
  • Optimized PCS Setup Reading: Modified the client to skip reading PCS setup from shared memory after initial setup, returning default values instead to further reduce memory footprint.


Changelog
  • expander_compiler/src/zkcuda/proving_system/expander_parallelized/client_utils.rs
    • Updated a timer name for setup logging.
    • Modified the PCS setup reading logic to return default setup objects, preventing the client from holding unnecessary PCS setup data in memory.
    • Refactored the client_send_witness_and_prove function to include resetting the witness_ack signal, calling malloc_trim(0) early, and orchestrating asynchronous prove requests and witness memory cleanup using tokio.
  • expander_compiler/src/zkcuda/proving_system/expander_parallelized/server_ctrl.rs
    • Added a call to SharedMemoryEngine::signal_witness_read_complete() within the server's prove handler, signaling to the client that witness data has been successfully read.
  • expander_compiler/src/zkcuda/proving_system/expander_parallelized/shared_memory_utils.rs
    • Introduced a new witness_ack field to the SharedMemory struct to manage the 1-byte acknowledgment signal.
    • Implemented reset_witness_ack to initialize the signal before witness writing.
    • Implemented signal_witness_read_complete for the server to set the signal after reading witness data.
    • Implemented wait_for_witness_read_complete for the client to poll the signal until the server confirms witness data has been read.
Activity
  • The pull request introduces a new witness ACK shared memory signaling mechanism.
  • The client now resets the witness_ack signal before writing witness data.
  • The server sets the witness_ack signal to 1 after reading witness data into MPI shared memory.
  • The client asynchronously polls witness_ack and, once confirmed, immediately drops the witness shared memory and calls malloc_trim(0).
  • The prove HTTP request runs concurrently via tokio, allowing witness memory to be freed while proving is in progress.
  • The client skips reading PCS setup from shared memory after setup, returning default values to save memory.

Contributor

gemini-code-assist bot left a comment

Code Review

The pull request introduces an early release mechanism for witness shared memory to reduce peak memory usage during the proving phase. This is a valuable optimization. However, there are several issues to address: the use of malloc_trim is not portable and will break builds on non-glibc Linux systems or other OSs; returning default setups in client_launch_server_and_setup breaks the verify functionality; and the 500ms polling interval for witness acknowledgment is likely too high for performance-critical code.

Comment on lines 143 to 147
    // Skip reading PCS setup from shared memory; return default to reduce memory
    (
        ExpanderProverSetup::default(),
        ExpanderVerifierSetup::default(),
    )

high

Returning ExpanderProverSetup::default() and ExpanderVerifierSetup::default() here will break the verify method in ParallelizedExpander. The verify implementation (line 52 in api_parallel.rs) relies on the verification keys stored in the setup. If the client is intended to perform verification, it must have access to the actual setup data. If memory saving is the priority, consider implementing a lazy-loading mechanism for the setup or documenting that verification is not supported when using this parallelized client.

Comment on lines 163 to 165
    extern "C" {
        fn malloc_trim(pad: usize) -> i32;
    }

medium

malloc_trim is a glibc-specific extension. Declaring it as an extern "C" function without platform guards will cause linker errors on non-glibc systems (e.g., macOS, Windows, or musl-based Linux).

    #[cfg(all(target_os = "linux", target_env = "gnu"))]
    extern "C" {
        fn malloc_trim(pad: usize) -> i32;
    }

Comment on lines 166 to 168
    unsafe {
        malloc_trim(0);
    }

medium

The call to malloc_trim should be guarded by the same platform configuration as its declaration to avoid runtime or compilation errors on unsupported platforms.

    #[cfg(all(target_os = "linux", target_env = "gnu"))]
    unsafe {
        malloc_trim(0);
    }

}

// Async: send prove request + poll for witness ack to release shared memory early
let rt = tokio::runtime::Runtime::new().unwrap();

medium

Creating a new Runtime and calling block_on inside a function can cause a panic if this code is executed within an existing Tokio runtime. It is generally better to use the current handle if available or structure the API to be async.

    SharedMemoryEngine::wait_for_witness_read_complete();
    unsafe {
        super::shared_memory_utils::SHARED_MEMORY.witness = None;
        malloc_trim(0);

medium

This call to malloc_trim also needs a platform guard to ensure portability.

                #[cfg(all(target_os = "linux", target_env = "gnu"))]
                malloc_trim(0);

    if std::ptr::read_volatile(ptr) != 0 {
        break;
    }
    std::thread::sleep(std::time::Duration::from_millis(500));

medium

A polling interval of 500ms is quite high and may introduce significant latency between the server finishing its read and the client releasing the memory. Reducing this to 10-50ms would make the system more responsive.

                std::thread::sleep(std::time::Duration::from_millis(10));

…nterval

- Wrap malloc_trim calls with #[cfg(all(target_os = "linux", target_env = "gnu"))]
  to avoid linker errors on non-glibc platforms
- Reduce witness_ack polling interval from 500ms to 10ms for faster response

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@hczphn
Collaborator Author

hczphn commented Feb 22, 2026

@gemini review

Contributor

gemini-code-assist bot left a comment

Code Review

The pull request introduces a signaling mechanism to release witness shared memory early during the proving phase, which is a significant optimization for peak memory usage. The implementation correctly uses shared memory polling and asynchronous tasks to achieve this. However, there are a few areas for improvement: a missing timeout in the polling loop could lead to deadlocks if the server crashes, there is some redundant code regarding FFI declarations, and the creation of new Tokio runtimes introduces unnecessary overhead. Additionally, returning default setups in the client might break verification workflows.

Comment on lines 144 to 149
    loop {
        if std::ptr::read_volatile(ptr) != 0 {
            break;
        }
        std::thread::sleep(std::time::Duration::from_millis(10));
    }

high

The polling loop lacks a timeout. If the server process crashes or fails to signal the witness_ack for any reason, the client will hang indefinitely. Adding a timeout (e.g., 5 minutes) would make the system more robust.

            let start = std::time::Instant::now();
            loop {
                if std::ptr::read_volatile(ptr) != 0 {
                    break;
                }
                if start.elapsed().as_secs() > 300 {
                    panic!("Timeout waiting for witness ack from server");
                }
                std::thread::sleep(std::time::Duration::from_millis(10));
            }

    ECCConfig: Config<FieldConfig = C::FieldConfig>,
{
    let setup_timer = Timer::new("setup", true);
    let setup_timer = Timer::new("new setup", true);

medium

The timer name has been changed from "setup" to "new setup". This appears to be a debugging leftover and should be reverted to maintain consistent output unless there is a specific reason for the name change.

Suggested change
let setup_timer = Timer::new("new setup", true);
let setup_timer = Timer::new("setup", true);

Comment on lines 144 to 147
    (
        ExpanderProverSetup::default(),
        ExpanderVerifierSetup::default(),
    )

medium

Returning default() setups instead of reading the actual data from shared memory will cause any subsequent verification attempts on the client side to fail (as the verification keys will be empty). If the client is intended to be able to verify proofs locally, this optimization should be made optional or the setups should be preserved.

Comment on lines +163 to +171
    #[cfg(all(target_os = "linux", target_env = "gnu"))]
    {
        extern "C" {
            fn malloc_trim(pad: usize) -> i32;
        }
        unsafe {
            malloc_trim(0);
        }
    }

medium

The extern "C" declaration for malloc_trim is repeated twice in this function (here and at line 188). It is better to declare it once at the beginning of the function scope. Additionally, the second call to malloc_trim (after dropping the shared memory handle) may have negligible impact compared to the first one, as shared memory is typically managed via mmap rather than the malloc heap.

}

// Async: send prove request + poll for witness ack to release shared memory early
let rt = tokio::runtime::Runtime::new().unwrap();

medium

Creating a new tokio::runtime::Runtime on every call to client_send_witness_and_prove is expensive and introduces significant overhead. Consider using a shared runtime or obtaining a handle to the current runtime if one exists.

…dant malloc_trim

- Add 5-minute timeout to wait_for_witness_read_complete to prevent
  indefinite hang if the server crashes
- Revert timer label from "new setup" back to "setup"
- Remove duplicate malloc_trim inside spawn_blocking (shared memory
  is mmap-managed, not glibc heap)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@hczphn
Collaborator Author

hczphn commented Feb 22, 2026

@gemini review

Contributor

gemini-code-assist bot left a comment

Code Review

This pull request introduces an intelligent mechanism to reduce peak memory usage by releasing witness shared memory as early as possible. The use of an asynchronous acknowledgment signal is a good approach. The implementation is mostly solid, but I've identified a high-severity issue with the placement of malloc_trim which prevents the intended memory optimization from working correctly. I've also included a couple of medium-severity suggestions to improve the robustness of error handling and the efficiency of the asynchronous runtime usage.

Comment on lines +181 to +186
    tokio::task::spawn_blocking(|| {
        SharedMemoryEngine::wait_for_witness_read_complete();
        unsafe {
            super::shared_memory_utils::SHARED_MEMORY.witness = None;
        }
    })

high

The malloc_trim(0) call should be executed after the witness shared memory is released to be effective. This suggestion moves it to the correct place. Please also remove the original malloc_trim block from lines 163-171.

        tokio::task::spawn_blocking(|| {
            SharedMemoryEngine::wait_for_witness_read_complete();
            unsafe {
                super::shared_memory_utils::SHARED_MEMORY.witness = None;
            }
            #[cfg(all(target_os = "linux", target_env = "gnu"))]
            {
                extern "C" {
                    fn malloc_trim(pad: usize) -> i32;
                }
                unsafe {
                    malloc_trim(0);
                }
            }
        })

}

// Async: send prove request + poll for witness ack to release shared memory early
let rt = tokio::runtime::Runtime::new().unwrap();

medium

Creating a new Tokio runtime with tokio::runtime::Runtime::new().unwrap() inside this function can be inefficient if it's called multiple times. This pattern is also repeated in the wait_async helper function. It's generally better to manage the runtime at a higher level in the application, for instance by creating it once and reusing it across calls.

Consider using once_cell::sync::Lazy to initialize a static runtime that can be shared.

Example:

use once_cell::sync::Lazy;

static RUNTIME: Lazy<tokio::runtime::Runtime> = Lazy::new(|| tokio::runtime::Runtime::new().expect("Failed to create Tokio runtime"));

// Then in your function:
// RUNTIME.block_on(async { ... });

Additionally, .unwrap() will panic on failure. Using .expect() with a descriptive message would be more robust.

Suggested change
let rt = tokio::runtime::Runtime::new().unwrap();
let rt = tokio::runtime::Runtime::new().expect("Failed to create Tokio runtime");

Comment on lines +138 to +160
    pub fn wait_for_witness_read_complete() {
        const TIMEOUT: std::time::Duration = std::time::Duration::from_secs(300);
        let start = std::time::Instant::now();
        unsafe {
            let ptr = SHARED_MEMORY
                .witness_ack
                .as_ref()
                .expect("witness_ack not initialized, call reset_witness_ack first")
                .as_ptr() as *const u8;
            loop {
                if std::ptr::read_volatile(ptr) != 0 {
                    break;
                }
                if start.elapsed() > TIMEOUT {
                    panic!(
                        "Timed out waiting for server to read witness ({}s)",
                        TIMEOUT.as_secs()
                    );
                }
                std::thread::sleep(std::time::Duration::from_millis(10));
            }
        }
    }

medium

This function panics on timeout. It's more idiomatic in Rust to return a Result to allow the caller to handle the error gracefully. This makes the function more reusable and robust. This change will require updating the call site in client_utils.rs to handle the Result, for example by using .expect().

    pub fn wait_for_witness_read_complete() -> Result<(), String> {
        const TIMEOUT: std::time::Duration = std::time::Duration::from_secs(300);
        let start = std::time::Instant::now();
        unsafe {
            let ptr = SHARED_MEMORY
                .witness_ack
                .as_ref()
                .expect("witness_ack not initialized, call reset_witness_ack first")
                .as_ptr() as *const u8;
            loop {
                if std::ptr::read_volatile(ptr) != 0 {
                    return Ok(());
                }
                if start.elapsed() > TIMEOUT {
                    return Err(format!(
                        "Timed out waiting for server to read witness ({}s)",
                        TIMEOUT.as_secs()
                    ));
                }
                std::thread::sleep(std::time::Duration::from_millis(10));
            }
        }
    }

The previous optimization skipped reading PCS setup from shared memory
and returned empty defaults, which caused verify to panic on v_keys
lookup (unwrap on None).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>