Reduce peak memory during prove by releasing witness shared memory early #204

hczphn wants to merge 4 commits into hc/export_device_memories_unchecked from
Conversation
Add witness_ack shared memory signaling between client and server:
- Client resets a 1-byte ack signal before writing witness
- Server signals ack after reading witness into MPI shared memory
- Client polls for ack, then immediately releases witness shared memory and calls malloc_trim to return memory to OS
- Prove request runs concurrently via tokio async, so witness memory is freed while proving is in progress
- Skip reading PCS setup from shared memory (return default) since the client does not need it after setup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
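The handshake this commit describes can be sketched with a plain atomic byte standing in for the 1-byte `witness_ack` slot in shared memory. The names, timings, and `handshake` function below are illustrative stand-ins, not the PR's actual API:

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

// Stand-in for the 1-byte witness_ack slot in shared memory.
fn handshake() -> bool {
    let ack = Arc::new(AtomicU8::new(1)); // stale value from a previous run

    // Client: reset the ack before writing the witness.
    ack.store(0, Ordering::SeqCst);
    // ... client writes the witness into shared memory here ...

    // Server: reads the witness, then signals the ack.
    let server_ack = Arc::clone(&ack);
    let server = thread::spawn(move || {
        thread::sleep(Duration::from_millis(30)); // simulate reading the witness
        server_ack.store(1, Ordering::SeqCst);
    });

    // Client: poll for the ack, then it is safe to release witness memory.
    let start = Instant::now();
    let ok = loop {
        if ack.load(Ordering::SeqCst) != 0 {
            break true; // server has read the witness; safe to drop the buffer
        }
        if start.elapsed() > Duration::from_secs(5) {
            break false; // timed out
        }
        thread::sleep(Duration::from_millis(10));
    };
    server.join().unwrap();
    ok
}

fn main() {
    assert!(handshake());
    println!("witness can be released early");
}
```

In the real PR the byte lives in OS shared memory between two processes rather than an `AtomicU8` between two threads, but the reset/poll/signal ordering is the same.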
Summary of Changes

Hello @hczphn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly optimizes memory usage during the proving phase by implementing a sophisticated shared memory signaling mechanism. By allowing the client to release witness data as soon as the server has processed it, and by making the prove request asynchronous, peak memory consumption is substantially reduced. Additionally, unnecessary memory allocations for PCS setup on the client side are avoided, contributing to overall memory efficiency.
Activity
Code Review
The pull request introduces an early release mechanism for witness shared memory to reduce peak memory usage during the proving phase. This is a valuable optimization. However, there are several issues to address: the use of malloc_trim is not portable and will break builds on non-glibc Linux systems or other OSs; returning default setups in client_launch_server_and_setup breaks the verify functionality; and the 500ms polling interval for witness acknowledgment is likely too high for performance-critical code.
```rust
// Skip reading PCS setup from shared memory; return default to reduce memory
(
    ExpanderProverSetup::default(),
    ExpanderVerifierSetup::default(),
)
```
Returning ExpanderProverSetup::default() and ExpanderVerifierSetup::default() here will break the verify method in ParallelizedExpander. The verify implementation (line 52 in api_parallel.rs) relies on the verification keys stored in the setup. If the client is intended to perform verification, it must have access to the actual setup data. If memory saving is the priority, consider implementing a lazy-loading mechanism for the setup or documenting that verification is not supported when using this parallelized client.
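One shape the reviewer's lazy-loading suggestion could take is `std::sync::OnceLock`, deferring the setup read until `verify` first needs the keys, so `prove` never pays for it. The `VerifierSetup` type and `load_verifier_setup` loader below are hypothetical stand-ins for the crate's real types (in the PR the loader would read from shared memory):

```rust
use std::sync::OnceLock;

// Stand-in for the real ExpanderVerifierSetup; the actual type lives in the crate.
#[derive(Debug)]
struct VerifierSetup {
    v_keys: Vec<u8>,
}

static VERIFIER_SETUP: OnceLock<VerifierSetup> = OnceLock::new();

// Hypothetical loader: in the PR this would deserialize from shared memory.
fn load_verifier_setup() -> VerifierSetup {
    VerifierSetup { v_keys: vec![1, 2, 3] }
}

// verify() gets real keys on first use; prove() never triggers the load,
// so peak memory during proving stays low.
fn verifier_setup() -> &'static VerifierSetup {
    VERIFIER_SETUP.get_or_init(load_verifier_setup)
}

fn main() {
    assert_eq!(verifier_setup().v_keys, vec![1, 2, 3]);
    println!("setup loaded lazily, exactly once");
}
```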
```rust
extern "C" {
    fn malloc_trim(pad: usize) -> i32;
}
```
```rust
unsafe {
    malloc_trim(0);
}
```
```rust
// Async: send prove request + poll for witness ack to release shared memory early
let rt = tokio::runtime::Runtime::new().unwrap();
```
```rust
SharedMemoryEngine::wait_for_witness_read_complete();
unsafe {
    super::shared_memory_utils::SHARED_MEMORY.witness = None;
    malloc_trim(0);
```

```rust
if std::ptr::read_volatile(ptr) != 0 {
    break;
}
std::thread::sleep(std::time::Duration::from_millis(500));
```
…nterval
- Wrap malloc_trim calls with #[cfg(all(target_os = "linux", target_env = "gnu"))] to avoid linker errors on non-glibc platforms
- Reduce witness_ack polling interval from 500ms to 10ms for faster response

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gemini review
Code Review
The pull request introduces a signaling mechanism to release witness shared memory early during the proving phase, which is a significant optimization for peak memory usage. The implementation correctly uses shared memory polling and asynchronous tasks to achieve this. However, there are a few areas for improvement: a missing timeout in the polling loop could lead to deadlocks if the server crashes, there is some redundant code regarding FFI declarations, and the creation of new Tokio runtimes introduces unnecessary overhead. Additionally, returning default setups in the client might break verification workflows.
```rust
loop {
    if std::ptr::read_volatile(ptr) != 0 {
        break;
    }
    std::thread::sleep(std::time::Duration::from_millis(10));
}
```
The polling loop lacks a timeout. If the server process crashes or fails to signal the witness_ack for any reason, the client will hang indefinitely. Adding a timeout (e.g., 5 minutes) would make the system more robust.
```rust
let start = std::time::Instant::now();
loop {
    if std::ptr::read_volatile(ptr) != 0 {
        break;
    }
    if start.elapsed().as_secs() > 300 {
        panic!("Timeout waiting for witness ack from server");
    }
    std::thread::sleep(std::time::Duration::from_millis(10));
}
```

```diff
     ECCConfig: Config<FieldConfig = C::FieldConfig>,
 {
-    let setup_timer = Timer::new("setup", true);
+    let setup_timer = Timer::new("new setup", true);
```
The timer name has been changed from "setup" to "new setup". This appears to be a debugging leftover and should be reverted to maintain consistent output unless there is a specific reason for the name change.
```diff
-    let setup_timer = Timer::new("new setup", true);
+    let setup_timer = Timer::new("setup", true);
```
```rust
(
    ExpanderProverSetup::default(),
    ExpanderVerifierSetup::default(),
)
```
Returning default() setups instead of reading the actual data from shared memory will cause any subsequent verification attempts on the client side to fail (as the verification keys will be empty). If the client is intended to be able to verify proofs locally, this optimization should be made optional or the setups should be preserved.
```rust
#[cfg(all(target_os = "linux", target_env = "gnu"))]
{
    extern "C" {
        fn malloc_trim(pad: usize) -> i32;
    }
    unsafe {
        malloc_trim(0);
    }
}
```
The extern "C" declaration for malloc_trim is repeated twice in this function (here and at line 188). It is better to declare it once at the beginning of the function scope. Additionally, the second call to malloc_trim (after dropping the shared memory handle) may have negligible impact compared to the first one, as shared memory is typically managed via mmap rather than the malloc heap.
```rust
// Async: send prove request + poll for witness ack to release shared memory early
let rt = tokio::runtime::Runtime::new().unwrap();
```
…dant malloc_trim
- Add 5-minute timeout to wait_for_witness_read_complete to prevent indefinite hang if the server crashes
- Revert timer label from "new setup" back to "setup"
- Remove duplicate malloc_trim inside spawn_blocking (shared memory is mmap-managed, not glibc heap)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gemini review
Code Review
This pull request introduces an intelligent mechanism to reduce peak memory usage by releasing witness shared memory as early as possible. The use of an asynchronous acknowledgment signal is a good approach. The implementation is mostly solid, but I've identified a high-severity issue with the placement of malloc_trim which prevents the intended memory optimization from working correctly. I've also included a couple of medium-severity suggestions to improve the robustness of error handling and the efficiency of the asynchronous runtime usage.
```rust
tokio::task::spawn_blocking(|| {
    SharedMemoryEngine::wait_for_witness_read_complete();
    unsafe {
        super::shared_memory_utils::SHARED_MEMORY.witness = None;
    }
})
```
The malloc_trim(0) call should be executed after the witness shared memory is released to be effective. This suggestion moves it to the correct place. Please also remove the original malloc_trim block from lines 163-171.
```rust
tokio::task::spawn_blocking(|| {
    SharedMemoryEngine::wait_for_witness_read_complete();
    unsafe {
        super::shared_memory_utils::SHARED_MEMORY.witness = None;
    }
    #[cfg(all(target_os = "linux", target_env = "gnu"))]
    {
        extern "C" {
            fn malloc_trim(pad: usize) -> i32;
        }
        unsafe {
            malloc_trim(0);
        }
    }
})
```

```rust
// Async: send prove request + poll for witness ack to release shared memory early
let rt = tokio::runtime::Runtime::new().unwrap();
```
Creating a new Tokio runtime with tokio::runtime::Runtime::new().unwrap() inside this function can be inefficient if it's called multiple times. This pattern is also repeated in the wait_async helper function. It's generally better to manage the runtime at a higher level in the application, for instance by creating it once and reusing it across calls.
Consider using once_cell::sync::Lazy to initialize a static runtime that can be shared.
Example:

```rust
use once_cell::sync::Lazy;

static RUNTIME: Lazy<tokio::runtime::Runtime> =
    Lazy::new(|| tokio::runtime::Runtime::new().expect("Failed to create Tokio runtime"));

// Then in your function:
// RUNTIME.block_on(async { ... });
```

Additionally, .unwrap() will panic on failure. Using .expect() with a descriptive message would be more robust.

```diff
-    let rt = tokio::runtime::Runtime::new().unwrap();
+    let rt = tokio::runtime::Runtime::new().expect("Failed to create Tokio runtime");
```
```rust
pub fn wait_for_witness_read_complete() {
    const TIMEOUT: std::time::Duration = std::time::Duration::from_secs(300);
    let start = std::time::Instant::now();
    unsafe {
        let ptr = SHARED_MEMORY
            .witness_ack
            .as_ref()
            .expect("witness_ack not initialized, call reset_witness_ack first")
            .as_ptr() as *const u8;
        loop {
            if std::ptr::read_volatile(ptr) != 0 {
                break;
            }
            if start.elapsed() > TIMEOUT {
                panic!(
                    "Timed out waiting for server to read witness ({}s)",
                    TIMEOUT.as_secs()
                );
            }
            std::thread::sleep(std::time::Duration::from_millis(10));
        }
    }
}
```
This function panics on timeout. It's more idiomatic in Rust to return a Result to allow the caller to handle the error gracefully. This makes the function more reusable and robust. This change will require updating the call site in client_utils.rs to handle the Result, for example by using .expect().
```rust
pub fn wait_for_witness_read_complete() -> Result<(), String> {
    const TIMEOUT: std::time::Duration = std::time::Duration::from_secs(300);
    let start = std::time::Instant::now();
    unsafe {
        let ptr = SHARED_MEMORY
            .witness_ack
            .as_ref()
            .expect("witness_ack not initialized, call reset_witness_ack first")
            .as_ptr() as *const u8;
        loop {
            if std::ptr::read_volatile(ptr) != 0 {
                return Ok(());
            }
            if start.elapsed() > TIMEOUT {
                return Err(format!(
                    "Timed out waiting for server to read witness ({}s)",
                    TIMEOUT.as_secs()
                ));
            }
            std::thread::sleep(std::time::Duration::from_millis(10));
        }
    }
}
```

The previous optimization skipped reading PCS setup from shared memory and returned empty defaults, which caused verify to panic on v_keys lookup (unwrap on None).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Based on # (hc/export_device_memories_unchecked). Adds a witness ACK shared memory signaling mechanism to release witness memory as early as possible during the prove phase:

- Client resets the `witness_ack` shared memory signal before writing witness data
- Server sets `witness_ack` to 1 after reading the witness into MPI shared memory
- Client polls `witness_ack` asynchronously; once confirmed, it immediately drops the witness shared memory and calls `malloc_trim(0)` to return memory to the OS

Memory flow (before → after)
Before: witness shared memory lives for the entire prove duration
After: witness shared memory released as soon as server reads it
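The before/after difference can be sketched with std threads, using a channel in place of the shared-memory ack byte. Everything here (`release_early`, the sleeps, the buffer size) is illustrative, not the PR's actual code:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Returns true if the witness buffer was freed before the proof arrived.
fn release_early() -> bool {
    let (ack_tx, ack_rx) = mpsc::channel::<()>();

    // "Server": read the witness, ack, then keep proving.
    let server = thread::spawn(move || {
        thread::sleep(Duration::from_millis(20)); // simulate reading the witness
        ack_tx.send(()).unwrap();                 // witness_ack = 1
        thread::sleep(Duration::from_millis(50)); // proving continues
        "proof"
    });

    // "Client": a large witness buffer that should be dropped early.
    let mut witness: Option<Vec<u8>> = Some(vec![0u8; 1 << 20]);
    assert_eq!(witness.as_ref().map(|w| w.len()), Some(1 << 20));

    // Wait for the ack, then free the witness while proving is still running.
    ack_rx.recv().unwrap();
    witness = None; // before this PR, the buffer lived until the proof returned
    let freed_before_proof = witness.is_none();

    let proof = server.join().unwrap();
    proof == "proof" && freed_before_proof
}

fn main() {
    assert!(release_early());
    println!("witness freed while proving was still in progress");
}
```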
🤖 Generated with Claude Code