
Reduce peak memory during prove by releasing witness shared memory early #204

Open
hczphn wants to merge 4 commits into hc/export_device_memories_unchecked from hc/witness_ack_memory_optimization

Conversation

@hczphn
Collaborator

@hczphn commented Feb 22, 2026

Summary

Based on # (hc/export_device_memories_unchecked).

Adds a witness ACK shared memory signaling mechanism to release witness memory as early as possible during the prove phase:

  • Client resets a 1-byte witness_ack shared memory signal before writing witness data
  • Server sets witness_ack to 1 after reading witness into MPI shared memory
  • Client polls witness_ack asynchronously; once confirmed, immediately drops the witness shared memory and calls malloc_trim(0) to return memory to OS
  • Prove HTTP request runs concurrently via tokio, so witness memory is freed while proving is in progress
  • Skip reading PCS setup from shared memory after setup (client doesn't need it; return default to save memory)

Memory flow (before → after)

Before: witness shared memory lives for the entire prove duration

write_witness → request_prove (blocks) → read_proof
                ^--- witness memory held here ---^

After: witness shared memory released as soon as server reads it

write_witness → [async: request_prove + poll witness_ack] → release witness → ... → read_proof
                                                            ^-- memory freed early

🤖 Generated with Claude Code

Add witness_ack shared memory signaling between client and server:
- Client resets a 1-byte ack signal before writing witness
- Server signals ack after reading witness into MPI shared memory
- Client polls for ack, then immediately releases witness shared memory
  and calls malloc_trim to return memory to OS
- Prove request runs concurrently via tokio async, so witness memory is
  freed while proving is in progress
- Skip reading PCS setup from shared memory (return default) since the
  client does not need it after setup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @hczphn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly optimizes memory usage during the proving phase by implementing a sophisticated shared memory signaling mechanism. By allowing the client to release witness data as soon as the server has processed it, and by making the prove request asynchronous, the peak memory consumption is substantially reduced. Additionally, unnecessary memory allocations for PCS setup on the client side are avoided, contributing to overall memory efficiency.

Highlights

  • Witness Acknowledgment (ACK) Mechanism: Introduced a 1-byte witness_ack shared memory signal to coordinate witness data transfer between client and server.
  • Early Witness Memory Release: Enabled the client to asynchronously poll the witness_ack signal and, once confirmed, immediately drop the witness shared memory and call malloc_trim(0) to return memory to the OS.
  • Asynchronous Prove Request: Implemented concurrent execution of the prove HTTP request via tokio, allowing witness memory to be freed while the proving process is still ongoing.
  • Optimized PCS Setup Reading: Modified the client to skip reading PCS setup from shared memory after initial setup, returning default values instead to further reduce memory footprint.


Changelog
  • expander_compiler/src/zkcuda/proving_system/expander_parallelized/client_utils.rs
    • Updated a timer name for setup logging.
    • Modified the PCS setup reading logic to return default setup objects, preventing the client from holding unnecessary PCS setup data in memory.
    • Refactored the client_send_witness_and_prove function to include resetting the witness_ack signal, calling malloc_trim(0) early, and orchestrating asynchronous prove requests and witness memory cleanup using tokio.
  • expander_compiler/src/zkcuda/proving_system/expander_parallelized/server_ctrl.rs
    • Added a call to SharedMemoryEngine::signal_witness_read_complete() within the server's prove handler, signaling to the client that witness data has been successfully read.
  • expander_compiler/src/zkcuda/proving_system/expander_parallelized/shared_memory_utils.rs
    • Introduced a new witness_ack field to the SharedMemory struct to manage the 1-byte acknowledgment signal.
    • Implemented reset_witness_ack to initialize the signal before witness writing.
    • Implemented signal_witness_read_complete for the server to set the signal after reading witness data.
    • Implemented wait_for_witness_read_complete for the client to poll the signal until the server confirms witness data has been read.
Activity
  • The pull request introduces a new witness ACK shared memory signaling mechanism.
  • The client now resets the witness_ack signal before writing witness data.
  • The server sets the witness_ack signal to 1 after reading witness data into MPI shared memory.
  • The client asynchronously polls witness_ack and, once confirmed, immediately drops the witness shared memory and calls malloc_trim(0).
  • The prove HTTP request runs concurrently via tokio, allowing witness memory to be freed while proving is in progress.
  • The client skips reading PCS setup from shared memory after setup, returning default values to save memory.

Contributor

gemini-code-assist bot left a comment

Code Review

The pull request introduces an early release mechanism for witness shared memory to reduce peak memory usage during the proving phase. This is a valuable optimization. However, there are several issues to address: the use of malloc_trim is not portable and will break builds on non-glibc Linux systems or other OSs; returning default setups in client_launch_server_and_setup breaks the verify functionality; and the 500ms polling interval for witness acknowledgment is likely too high for performance-critical code.

Comment on lines 143 to 147
    // Skip reading PCS setup from shared memory; return default to reduce memory
    (
        ExpanderProverSetup::default(),
        ExpanderVerifierSetup::default(),
    )

high

Returning ExpanderProverSetup::default() and ExpanderVerifierSetup::default() here will break the verify method in ParallelizedExpander. The verify implementation (line 52 in api_parallel.rs) relies on the verification keys stored in the setup. If the client is intended to perform verification, it must have access to the actual setup data. If memory saving is the priority, consider implementing a lazy-loading mechanism for the setup or documenting that verification is not supported when using this parallelized client.

Comment on lines 163 to 165
    extern "C" {
        fn malloc_trim(pad: usize) -> i32;
    }

medium

malloc_trim is a glibc-specific extension. Declaring it as an extern "C" function without platform guards will cause linker errors on non-glibc systems (e.g., macOS, Windows, or musl-based Linux).

    #[cfg(all(target_os = "linux", target_env = "gnu"))]
    extern "C" {
        fn malloc_trim(pad: usize) -> i32;
    }

Comment on lines 166 to 168
    unsafe {
        malloc_trim(0);
    }

medium

The call to malloc_trim should be guarded by the same platform configuration as its declaration to avoid runtime or compilation errors on unsupported platforms.

    #[cfg(all(target_os = "linux", target_env = "gnu"))]
    unsafe {
        malloc_trim(0);
    }

}

// Async: send prove request + poll for witness ack to release shared memory early
let rt = tokio::runtime::Runtime::new().unwrap();

medium

Creating a new Runtime and calling block_on inside a function can cause a panic if this code is executed within an existing Tokio runtime. It is generally better to use the current handle if available or structure the API to be async.

    SharedMemoryEngine::wait_for_witness_read_complete();
    unsafe {
        super::shared_memory_utils::SHARED_MEMORY.witness = None;
        malloc_trim(0);

medium

This call to malloc_trim also needs a platform guard to ensure portability.

                #[cfg(all(target_os = "linux", target_env = "gnu"))]
                malloc_trim(0);

    if std::ptr::read_volatile(ptr) != 0 {
        break;
    }
    std::thread::sleep(std::time::Duration::from_millis(500));

medium

A polling interval of 500ms is quite high and may introduce significant latency between the server finishing its read and the client releasing the memory. Reducing this to 10-50ms would make the system more responsive.

                std::thread::sleep(std::time::Duration::from_millis(10));

…nterval

- Wrap malloc_trim calls with #[cfg(all(target_os = "linux", target_env = "gnu"))]
  to avoid linker errors on non-glibc platforms
- Reduce witness_ack polling interval from 500ms to 10ms for faster response

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@hczphn
Collaborator Author

hczphn commented Feb 22, 2026

@gemini review

Contributor

gemini-code-assist bot left a comment

Code Review

The pull request introduces a signaling mechanism to release witness shared memory early during the proving phase, which is a significant optimization for peak memory usage. The implementation correctly uses shared memory polling and asynchronous tasks to achieve this. However, there are a few areas for improvement: a missing timeout in the polling loop could lead to deadlocks if the server crashes, there is some redundant code regarding FFI declarations, and the creation of new Tokio runtimes introduces unnecessary overhead. Additionally, returning default setups in the client might break verification workflows.

Comment on lines 144 to 149
    loop {
        if std::ptr::read_volatile(ptr) != 0 {
            break;
        }
        std::thread::sleep(std::time::Duration::from_millis(10));
    }

high

The polling loop lacks a timeout. If the server process crashes or fails to signal the witness_ack for any reason, the client will hang indefinitely. Adding a timeout (e.g., 5 minutes) would make the system more robust.

            let start = std::time::Instant::now();
            loop {
                if std::ptr::read_volatile(ptr) != 0 {
                    break;
                }
                if start.elapsed().as_secs() > 300 {
                    panic!("Timeout waiting for witness ack from server");
                }
                std::thread::sleep(std::time::Duration::from_millis(10));
            }

    ECCConfig: Config<FieldConfig = C::FieldConfig>,
{
    let setup_timer = Timer::new("setup", true);
    let setup_timer = Timer::new("new setup", true);

medium

The timer name has been changed from "setup" to "new setup". This appears to be a debugging leftover and should be reverted to maintain consistent output unless there is a specific reason for the name change.

Suggested change
let setup_timer = Timer::new("new setup", true);
let setup_timer = Timer::new("setup", true);

Comment on lines 144 to 147
    (
        ExpanderProverSetup::default(),
        ExpanderVerifierSetup::default(),
    )

medium

Returning default() setups instead of reading the actual data from shared memory will cause any subsequent verification attempts on the client side to fail (as the verification keys will be empty). If the client is intended to be able to verify proofs locally, this optimization should be made optional or the setups should be preserved.

Comment on lines +163 to +171
    #[cfg(all(target_os = "linux", target_env = "gnu"))]
    {
        extern "C" {
            fn malloc_trim(pad: usize) -> i32;
        }
        unsafe {
            malloc_trim(0);
        }
    }

medium

The extern "C" declaration for malloc_trim is repeated twice in this function (here and at line 188). It is better to declare it once at the beginning of the function scope. Additionally, the second call to malloc_trim (after dropping the shared memory handle) may have negligible impact compared to the first one, as shared memory is typically managed via mmap rather than the malloc heap.

}

// Async: send prove request + poll for witness ack to release shared memory early
let rt = tokio::runtime::Runtime::new().unwrap();

medium

Creating a new tokio::runtime::Runtime on every call to client_send_witness_and_prove is expensive and introduces significant overhead. Consider using a shared runtime or obtaining a handle to the current runtime if one exists.

…dant malloc_trim

- Add 5-minute timeout to wait_for_witness_read_complete to prevent
  indefinite hang if the server crashes
- Revert timer label from "new setup" back to "setup"
- Remove duplicate malloc_trim inside spawn_blocking (shared memory
  is mmap-managed, not glibc heap)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@hczphn
Collaborator Author

hczphn commented Feb 22, 2026

@gemini review

Contributor

gemini-code-assist bot left a comment

Code Review

This pull request introduces an intelligent mechanism to reduce peak memory usage by releasing witness shared memory as early as possible. The use of an asynchronous acknowledgment signal is a good approach. The implementation is mostly solid, but I've identified a high-severity issue with the placement of malloc_trim which prevents the intended memory optimization from working correctly. I've also included a couple of medium-severity suggestions to improve the robustness of error handling and the efficiency of the asynchronous runtime usage.

Comment on lines +181 to +186
    tokio::task::spawn_blocking(|| {
        SharedMemoryEngine::wait_for_witness_read_complete();
        unsafe {
            super::shared_memory_utils::SHARED_MEMORY.witness = None;
        }
    })

high

The malloc_trim(0) call should be executed after the witness shared memory is released to be effective. This suggestion moves it to the correct place. Please also remove the original malloc_trim block from lines 163-171.

        tokio::task::spawn_blocking(|| {
            SharedMemoryEngine::wait_for_witness_read_complete();
            unsafe {
                super::shared_memory_utils::SHARED_MEMORY.witness = None;
            }
            #[cfg(all(target_os = "linux", target_env = "gnu"))]
            {
                extern "C" {
                    fn malloc_trim(pad: usize) -> i32;
                }
                unsafe {
                    malloc_trim(0);
                }
            }
        })

}

// Async: send prove request + poll for witness ack to release shared memory early
let rt = tokio::runtime::Runtime::new().unwrap();

medium

Creating a new Tokio runtime with tokio::runtime::Runtime::new().unwrap() inside this function can be inefficient if it's called multiple times. This pattern is also repeated in the wait_async helper function. It's generally better to manage the runtime at a higher level in the application, for instance by creating it once and reusing it across calls.

Consider using once_cell::sync::Lazy to initialize a static runtime that can be shared.

Example:

use once_cell::sync::Lazy;

static RUNTIME: Lazy<tokio::runtime::Runtime> = Lazy::new(|| tokio::runtime::Runtime::new().expect("Failed to create Tokio runtime"));

// Then in your function:
// RUNTIME.block_on(async { ... });

Additionally, .unwrap() will panic on failure. Using .expect() with a descriptive message would be more robust.

Suggested change
let rt = tokio::runtime::Runtime::new().unwrap();
let rt = tokio::runtime::Runtime::new().expect("Failed to create Tokio runtime");

Comment on lines +138 to +160
    pub fn wait_for_witness_read_complete() {
        const TIMEOUT: std::time::Duration = std::time::Duration::from_secs(300);
        let start = std::time::Instant::now();
        unsafe {
            let ptr = SHARED_MEMORY
                .witness_ack
                .as_ref()
                .expect("witness_ack not initialized, call reset_witness_ack first")
                .as_ptr() as *const u8;
            loop {
                if std::ptr::read_volatile(ptr) != 0 {
                    break;
                }
                if start.elapsed() > TIMEOUT {
                    panic!(
                        "Timed out waiting for server to read witness ({}s)",
                        TIMEOUT.as_secs()
                    );
                }
                std::thread::sleep(std::time::Duration::from_millis(10));
            }
        }
    }

medium

This function panics on timeout. It's more idiomatic in Rust to return a Result to allow the caller to handle the error gracefully. This makes the function more reusable and robust. This change will require updating the call site in client_utils.rs to handle the Result, for example by using .expect().

    pub fn wait_for_witness_read_complete() -> Result<(), String> {
        const TIMEOUT: std::time::Duration = std::time::Duration::from_secs(300);
        let start = std::time::Instant::now();
        unsafe {
            let ptr = SHARED_MEMORY
                .witness_ack
                .as_ref()
                .expect("witness_ack not initialized, call reset_witness_ack first")
                .as_ptr() as *const u8;
            loop {
                if std::ptr::read_volatile(ptr) != 0 {
                    return Ok(());
                }
                if start.elapsed() > TIMEOUT {
                    return Err(format!(
                        "Timed out waiting for server to read witness ({}s)",
                        TIMEOUT.as_secs()
                    ));
                }
                std::thread::sleep(std::time::Duration::from_millis(10));
            }
        }
    }

The previous optimization skipped reading PCS setup from shared memory
and returned empty defaults, which caused verify to panic on v_keys
lookup (unwrap on None).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>