fix: use cgroup memory limits for multipart upload concurrency in containers #447
ProbstenHias wants to merge 2 commits into durch:master
Conversation
calculate_max_concurrent_chunks() uses sysinfo::System::available_memory(), which reads /proc/meminfo on Linux. In containerized environments (Docker, Kubernetes), /proc/meminfo reports the host node's total memory, not the container's cgroup-enforced limit. This causes the function to massively overestimate available memory and spawn too many concurrent chunk uploads, leading to OOM kills.

Example: a pod with a 512 MiB memory limit running on a 64 GiB node would see ~64 GiB available, calculate 100 concurrent 8 MiB chunks (800 MiB total), and get OOMKilled.

The fix adds cgroup-aware memory detection that runs before the sysinfo fallback:

- Parse /proc/self/cgroup to resolve the process's cgroup path; handle both nested paths (e.g. /kubepods.slice/.../cri-containerd-xxx.scope) and the root path "/", which is the common case with private cgroup namespaces (Docker default, Kubernetes default since 1.25)
- cgroup v2: read memory.max and memory.current from /sys/fs/cgroup/<cgroup-path>/; treat "max" as unlimited
- cgroup v1: read memory.limit_in_bytes and memory.usage_in_bytes from /sys/fs/cgroup/memory/<cgroup-path>/; treat values near u64::MAX (PAGE_COUNTER_MAX) as unlimited
- Available memory = limit - current usage (saturating_sub)
- Return None when not under cgroup constraints, so the caller falls back to the existing sysinfo logic unchanged
- All cgroup functions are gated on #[cfg(target_os = "linux")] and the async feature flags to avoid dead-code warnings in sync builds or on non-Linux platforms
- Add unit tests covering cgroup v2 with limit, v2 unlimited, v1 with limit, v1 unlimited, path parsing, and concurrency clamping math
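The v2 arithmetic described above reduces to a small pure function. A minimal sketch, with the file reading factored out so it runs anywhere (the function name and I/O split are illustrative, not the PR's actual code):

```rust
/// Compute cgroup v2 headroom from the contents of `memory.max` and
/// `memory.current`. Returns None when the cgroup is unlimited ("max").
fn v2_available_memory(max_contents: &str, current_contents: &str) -> Option<u64> {
    let max = max_contents.trim();
    if max == "max" {
        return None; // no limit: caller should fall back to sysinfo
    }
    let limit: u64 = max.parse().ok()?;
    let current: u64 = current_contents.trim().parse().ok()?;
    // saturating_sub: usage can momentarily exceed the limit
    Some(limit.saturating_sub(current))
}

fn main() {
    // 512 MiB limit, 96 MiB in use -> 416 MiB headroom
    assert_eq!(v2_available_memory("536870912\n", "100663296\n"), Some(436207616));
    // "max" means unlimited
    assert_eq!(v2_available_memory("max\n", "0\n"), None);
}
```

The real helpers read these values from /sys/fs/cgroup/<cgroup-path>/ before doing this computation.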
📝 Walkthrough

Adds Linux cgroup-based memory detection to Bucket::calculate_max_concurrent_chunks: try cgroup v2, fall back to cgroup v1, then sysinfo; clamps the computed chunk count to [2, 100], returns 3 when available memory is zero, and adds Linux-only helpers and unit tests.
Sequence Diagram(s)

sequenceDiagram
participant Caller
participant Bucket
participant CgroupFS
participant Sysinfo
Caller->>Bucket: calculate_max_concurrent_chunks()
Bucket->>CgroupFS: read /proc/self/cgroup -> determine cgroup path
alt cgroup v2 found
Bucket->>CgroupFS: read memory.max and memory.current
CgroupFS-->>Bucket: limit / current (or "max")
else cgroup v1 found
Bucket->>CgroupFS: read memory.limit_in_bytes and memory.usage_in_bytes
CgroupFS-->>Bucket: limit / usage
else no cgroup info
Bucket->>Sysinfo: refresh and read available memory
Sysinfo-->>Bucket: total/available
end
Bucket->>Bucket: compute available = limit - usage (saturating)
Bucket->>Bucket: derive chunk count, clamp to [2,100], if available==0 -> return 3
Bucket-->>Caller: chunk count
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks | ✅ Passed checks (3 passed)
Brownian Motion (Brass)

Recommendation: Proceed
Summary: PR addresses OOM kills by improving memory limit detection in containers.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@s3/src/bucket.rs`:
- Around line 4344-4346: The test function
test_cgroup_available_memory_returns_none_when_no_cgroup inside the cgroup_tests
module is dead because the module is #[cfg(target_os = "linux")] while the test
body is under #[cfg(not(target_os = "linux"))]; either remove the unreachable
#[cfg(not(target_os = "linux"))] block or restructure the tests so they actually
run on Linux: move the assertion into the linux-configured module and call
cgroup_available_memory() (or the real function under test) and assert the
expected Some(...) or None value, or split into two explicit tests with opposite
cfg attributes if you need OS-specific behavior; update or delete the empty test
body for test_cgroup_available_memory_returns_none_when_no_cgroup accordingly.
🧹 Nitpick comments (4)
s3/src/bucket.rs (4)
3036-3066: read_cgroup_path("v2") returns Some("/") for the root cgroup, but the corresponding test uses stricter logic.

test_read_cgroup_path_v2 (line 4362) checks !path.is_empty() && path != "/", but the actual v2 branch at line 3050 only checks !path.is_empty(). While this doesn't cause a test failure (since the test input isn't "/"), the duplicated-and-divergent logic is a maintenance hazard.

Consider calling the actual read_cgroup_path function from the tests instead of re-implementing the parsing inline. This would catch regressions in the real code and eliminate the risk of test/implementation drift.
3008-3018: Silent failure with no observability when cgroup detection is skipped.

All cgroup helper functions silently return None on any failure (file not found, parse error, unlimited). Consider adding log::debug! or log::trace! messages when cgroup detection falls through, so operators can understand which memory source was used. This is especially valuable when diagnosing OOM issues in containers.

Example logging addition:

```diff
 fn cgroup_available_memory() -> Option<u64> {
-    cgroup_v2_available_memory().or_else(cgroup_v1_available_memory)
+    let result = cgroup_v2_available_memory().or_else(cgroup_v1_available_memory);
+    if let Some(mem) = result {
+        log::debug!("cgroup memory detection: available={} bytes", mem);
+    } else {
+        log::debug!("cgroup memory detection: no cgroup limits found, falling back to sysinfo");
+    }
+    result
 }
```
4349-4409: Tests re-implement parsing logic inline instead of calling read_cgroup_path.

test_read_cgroup_path_v2, test_read_cgroup_path_v1, and test_read_cgroup_path_v2_root_returns_root all duplicate the parsing logic from read_cgroup_path rather than invoking it. This means the tests don't actually validate the real implementation — a bug in read_cgroup_path wouldn't be caught.

Since read_cgroup_path reads /proc/self/cgroup (which may not have the expected entries in CI), consider either:

- Refactoring read_cgroup_path to accept the file content as a parameter (making it testable), or
- At minimum, calling the actual function in the tests and asserting it doesn't panic, even if the exact result is environment-dependent.

Suggested refactor: extract the parsing logic into a testable function:

```diff
+/// Parse a cgroup path from the content of /proc/self/cgroup.
+#[cfg(all(
+    target_os = "linux",
+    any(feature = "with-tokio", feature = "with-async-std")
+))]
+fn parse_cgroup_path(content: &str, controller: &str) -> Option<String> {
+    for line in content.lines() {
+        let parts: Vec<&str> = line.splitn(3, ':').collect();
+        if parts.len() != 3 {
+            continue;
+        }
+        if controller == "v2" {
+            if parts[0] == "0" && parts[1].is_empty() {
+                let path = parts[2].trim();
+                if !path.is_empty() {
+                    return Some(path.to_string());
+                }
+            }
+        } else {
+            let controllers = parts[1];
+            if controllers.split(',').any(|c| c == controller) {
+                let path = parts[2].trim();
+                if !path.is_empty() && path != "/" {
+                    return Some(path.to_string());
+                }
+            }
+        }
+    }
+    None
+}
+
 #[cfg(all(
     target_os = "linux",
     any(feature = "with-tokio", feature = "with-async-std")
 ))]
 fn read_cgroup_path(controller: &str) -> Option<String> {
     let content = std::fs::read_to_string("/proc/self/cgroup").ok()?;
-    for line in content.lines() {
-        ...
-    }
-    None
+    parse_cgroup_path(&content, controller)
 }
```

Then tests can call parse_cgroup_path directly with known input.
4411-4435: Temp directory tests use fixed paths and may race under parallel execution.

test_cgroup_v2_available_memory_with_limit and test_cgroup_v1_available_memory_with_limit use hardcoded temp directory names (rust_s3_cgroup_test_v2, rust_s3_cgroup_test_v1). If cargo test runs these in parallel (the default), they could interfere with each other on repeated runs. Consider using tempfile::tempdir() for unique, auto-cleaned directories, or adding unique suffixes.

Also applies to: 4446-4463
What’s the status of this PR? We are having issues with uploads taking too much memory and the containers are getting OOMKilled |
- Extract parse_cgroup_path() from read_cgroup_path() for testability
- Remove dead test gated behind impossible cfg combination
- Fix test/impl drift on v2 root path handling
- Add log::debug! for cgroup detection fallthrough observability
- Use PID-scoped temp dirs to prevent parallel test races
Brownian Motion (Brass)

Recommendation: Proceed
Summary: PR effectively addresses memory overestimation in containerized environments.
Actionable comments posted: 2
🧹 Nitpick comments (1)
s3/src/bucket.rs (1)
4411-4475: Tests don't actually exercise cgroup_v{1,2}_available_memory — they re-implement the parse inline.

test_cgroup_v2_available_memory_with_limit and test_cgroup_v1_available_memory_with_limit create temp dirs with memory.max/memory.current (and v1 equivalents), then manually fs::read_to_string + parse + saturating_sub. The real cgroup_v2_available_memory() / cgroup_v1_available_memory() functions are never invoked, because they hardcode /sys/fs/cgroup and /sys/fs/cgroup/memory. A regression in the real functions (e.g., a wrong filename, broken trim_end_matches, incorrect saturating semantics) would not be caught by these tests.

Consider refactoring the two functions to accept a base-path argument (or splitting the path construction into a testable helper), so the tests can drive them end-to-end against the tmpdir fixture:

♻️ Suggested shape

```rust
fn cgroup_v2_available_memory_at(base: &Path) -> Option<u64> {
    let max_str = std::fs::read_to_string(base.join("memory.max")).ok()?;
    let max_str = max_str.trim();
    if max_str == "max" {
        return None;
    }
    let limit: u64 = max_str.parse().ok()?;
    let current: u64 = std::fs::read_to_string(base.join("memory.current"))
        .ok()?
        .trim()
        .parse()
        .ok()?;
    Some(limit.saturating_sub(current))
}

fn cgroup_v2_available_memory() -> Option<u64> {
    let cgroup_path = read_cgroup_path("v2")?;
    if !cgroup_path.starts_with('/') {
        return None;
    }
    let base = format!("/sys/fs/cgroup{}", cgroup_path.trim_end_matches('/'));
    cgroup_v2_available_memory_at(Path::new(&base))
}
```

Then the tests can call cgroup_v2_available_memory_at(&cgroup_dir) directly and assert the expected Some(77164544) / None results.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@s3/src/bucket.rs` around lines 4411 - 4475, The tests recreate cgroup files but never call cgroup_v2_available_memory or cgroup_v1_available_memory, so regressions aren’t caught; refactor by adding testable helpers (e.g., cgroup_v2_available_memory_at(&Path) and cgroup_v1_available_memory_at(&Path)) that read memory.max/current (or memory.limit_in_bytes/usage_in_bytes) from a given base path and compute saturating_sub, then have the original cgroup_v2_available_memory and cgroup_v1_available_memory call those helpers after constructing the real /sys/fs/cgroup path; update tests to call the _at helpers with the tmpdir to exercise the real parsing and semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@s3/src/bucket.rs`:
- Around line 3119-3142: The current cgroup_v1_available_memory() uses
memory.usage_in_bytes (and v2 memory.current) which counts reclaimable page
cache and therefore underestimates usable memory; update
cgroup_v1_available_memory (and the v2 counterpart if present) to read and parse
cgroup memory.stat and subtract reclaimable fields (for v1: "cache" and
"inactive_file" or for v2: "file" and "inactive_file") from the usage/current
value before computing limit - usage so the result excludes reclaimable pages,
and ensure you still handle missing keys and fall back to the existing
conservative behavior (or a config flag) if memory.stat parsing fails.
- Around line 1606-1622: The change in calculate_max_concurrent_chunks()
increased the clamp from 10 to 100 which allows up to ~2.4 GiB of in-flight
chunk buffers (CHUNK_SIZE * safety_factor * 100) on hosts without cgroup limits;
revert or make the upper bound explicit: restore the clamp to 10 (or introduce a
named constant MAX_CONCURRENT_CHUNKS and use it in
calculate_max_concurrent_chunks()) and add a brief comment documenting the
rationale/impact for uncontained environments and any config toggle so
reviewers/operators can reason about S3/client pool pressure.
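For reference, the clamp math from that comment can be sketched with a named constant. SAFETY_FACTOR = 3 here is an assumption inferred from the ~2.4 GiB worst-case figure (100 × 3 × 8 MiB) and may not match bucket.rs exactly:

```rust
const CHUNK_SIZE: u64 = 8 * 1024 * 1024; // 8 MiB per chunk, per the PR description
const SAFETY_FACTOR: u64 = 3; // assumed; inferred from the ~2.4 GiB worst case
const MIN_CONCURRENT_CHUNKS: u64 = 2;
const MAX_CONCURRENT_CHUNKS: u64 = 100; // the bound this comment asks to document

/// Derive a concurrent-chunk count from available memory and clamp it,
/// mirroring the behavior the walkthrough describes.
fn max_concurrent_chunks(available_memory: u64) -> u64 {
    if available_memory == 0 {
        return 3; // fallback when detection reports nothing usable
    }
    (available_memory / (CHUNK_SIZE * SAFETY_FACTOR))
        .clamp(MIN_CONCURRENT_CHUNKS, MAX_CONCURRENT_CHUNKS)
}

fn main() {
    assert_eq!(max_concurrent_chunks(512 * 1024 * 1024), 21); // 512 MiB pod
    assert_eq!(max_concurrent_chunks(64 * 1024 * 1024 * 1024), 100); // big host hits the cap
    assert_eq!(max_concurrent_chunks(0), 3);
}
```

With the cap hit on any uncontained host, naming and documenting MAX_CONCURRENT_CHUNKS makes the worst-case in-flight buffer size explicit for reviewers.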
---
Nitpick comments:
In `@s3/src/bucket.rs`:
- Around line 4411-4475: The tests recreate cgroup files but never call
cgroup_v2_available_memory or cgroup_v1_available_memory, so regressions aren’t
caught; refactor by adding testable helpers (e.g.,
cgroup_v2_available_memory_at(&Path) and cgroup_v1_available_memory_at(&Path))
that read memory.max/current (or memory.limit_in_bytes/usage_in_bytes) from a
given base path and compute saturating_sub, then have the original
cgroup_v2_available_memory and cgroup_v1_available_memory call those helpers
after constructing the real /sys/fs/cgroup path; update tests to call the _at
helpers with the tmpdir to exercise the real parsing and semantics.
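A self-contained sketch of the `_at` helper shape the prompt describes, exercised against a PID-scoped temp dir (the helper name follows the review suggestion; the real code's paths and cfg gating differ):

```rust
use std::fs;
use std::path::Path;

/// Path-parameterized v1 helper: read the limit and usage files from `base`
/// and return the headroom, or None when the limit is effectively unlimited
/// (values near PAGE_COUNTER_MAX * page size).
fn cgroup_v1_available_memory_at(base: &Path) -> Option<u64> {
    let limit: u64 = fs::read_to_string(base.join("memory.limit_in_bytes"))
        .ok()?
        .trim()
        .parse()
        .ok()?;
    if limit >= 0x7FFF_FFFF_FFFF_F000 {
        return None; // unlimited, fall back to sysinfo
    }
    let usage: u64 = fs::read_to_string(base.join("memory.usage_in_bytes"))
        .ok()?
        .trim()
        .parse()
        .ok()?;
    Some(limit.saturating_sub(usage))
}

fn main() {
    // PID-scoped temp dir so parallel test runs don't race
    let dir = std::env::temp_dir().join(format!("cgroup_v1_demo_{}", std::process::id()));
    fs::create_dir_all(&dir).unwrap();

    // 512 MiB limit, 128 MiB used -> 384 MiB available
    fs::write(dir.join("memory.limit_in_bytes"), "536870912\n").unwrap();
    fs::write(dir.join("memory.usage_in_bytes"), "134217728\n").unwrap();
    assert_eq!(cgroup_v1_available_memory_at(&dir), Some(402653184));

    // A near-u64::MAX limit means "unlimited" and yields None
    fs::write(dir.join("memory.limit_in_bytes"), "9223372036854771712\n").unwrap();
    assert_eq!(cgroup_v1_available_memory_at(&dir), None);

    fs::remove_dir_all(&dir).unwrap();
}
```

The production wrapper would build `base` from /sys/fs/cgroup/memory plus the parsed cgroup path and delegate to this helper, so tests and production share one code path.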
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
```rust
fn cgroup_v1_available_memory() -> Option<u64> {
    let cgroup_path = read_cgroup_path("memory")?;
    let base = format!(
        "/sys/fs/cgroup/memory{}",
        if cgroup_path.starts_with('/') {
            &cgroup_path
        } else {
            return None;
        }
    );

    let limit_str = std::fs::read_to_string(format!("{}/memory.limit_in_bytes", base)).ok()?;
    let limit: u64 = limit_str.trim().parse().ok()?;

    // Values near u64::MAX indicate no limit (PAGE_COUNTER_MAX * page_size)
    if limit >= 0x7FFF_FFFF_FFFF_F000 {
        return None; // unlimited, fall back to sysinfo
    }

    let usage_str = std::fs::read_to_string(format!("{}/memory.usage_in_bytes", base)).ok()?;
    let usage: u64 = usage_str.trim().parse().ok()?;

    Some(limit.saturating_sub(usage))
}
```
Consider if memory.current / memory.usage_in_bytes is the right signal.
memory.current (v2) and memory.usage_in_bytes (v1) include reclaimable page cache and other evictable memory. Computing limit - current will therefore under-estimate memory available to the upload whenever the container has warm page cache — which, e.g., a long-running service writing to disk easily does. The effect is that concurrency gets clamped toward the lower bound (2) even when plenty of memory could actually be reclaimed for new allocations.
That's a safe conservative direction (favours "too little" over OOMKill), so this isn't a blocker. If you want a tighter number, cgroup v2 exposes memory.stat with file / inactive_file lines you could subtract from current; v1 has memory.stat with cache / inactive_file similarly. Worth considering, otherwise the current conservative behaviour is fine — just flagging so it's an informed choice.
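That refinement can be sketched as a pure function over the memory.stat text for v2. Subtracting only inactive_file is one of the options the comment mentions; the function name and choice of fields are illustrative:

```rust
/// Subtract reclaimable page cache (the `inactive_file` line of cgroup v2
/// `memory.stat`) from `memory.current` before computing headroom, so warm
/// page cache doesn't depress the concurrency estimate.
fn effective_usage(current: u64, memory_stat: &str) -> u64 {
    let inactive_file = memory_stat
        .lines()
        .find_map(|line| {
            let mut it = line.split_whitespace();
            match (it.next(), it.next()) {
                (Some("inactive_file"), Some(v)) => v.parse::<u64>().ok(),
                _ => None,
            }
        })
        .unwrap_or(0); // missing key: fall back to raw usage unchanged
    current.saturating_sub(inactive_file)
}

fn main() {
    let stat = "anon 1048576\nfile 8388608\ninactive_file 4194304\n";
    // 16 MiB current with 4 MiB reclaimable cache -> 12 MiB effective usage
    assert_eq!(effective_usage(16777216, stat), 12582912);
    // No inactive_file line: behave like the current conservative code
    assert_eq!(effective_usage(16777216, "anon 1048576\n"), 16777216);
}
```

The unwrap_or(0) fallback preserves the PR's conservative behavior whenever memory.stat is missing or unparseable.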
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@s3/src/bucket.rs` around lines 3119 - 3142, The current
cgroup_v1_available_memory() uses memory.usage_in_bytes (and v2 memory.current)
which counts reclaimable page cache and therefore underestimates usable memory;
update cgroup_v1_available_memory (and the v2 counterpart if present) to read
and parse cgroup memory.stat and subtract reclaimable fields (for v1: "cache"
and "inactive_file" or for v2: "file" and "inactive_file") from the
usage/current value before computing limit - usage so the result excludes
reclaimable pages, and ensure you still handle missing keys and fall back to the
existing conservative behavior (or a config flag) if memory.stat parsing fails.
Agreed this is worth knowing, but the conservative direction (underestimating available memory) is the right trade-off for an OOM-fix PR. Parsing memory.stat would add complexity for a marginal improvement. If it becomes a problem in practice (concurrency too low on cache-heavy workloads), it can be revisited separately.
@ProbstenHias, that's a sound call — for an OOM-fix PR, erring on the side of conservative concurrency is clearly the right priority. The memory.stat path can always be layered on later as an opt-in refinement if real-world workloads show the concurrency floor being hit too aggressively. Happy to close this out.
✏️ Learnings added
Learnt from: ProbstenHias
Repo: durch/rust-s3 PR: 447
File: s3/src/bucket.rs:3119-3142
Timestamp: 2026-04-18T19:49:49.392Z
Learning: In `s3/src/bucket.rs` (rust-s3 crate), `cgroup_v2_available_memory()` and `cgroup_v1_available_memory()` intentionally use `memory.current` / `memory.usage_in_bytes` (which include reclaimable page cache) rather than subtracting reclaimable fields from `memory.stat`. This deliberately underestimates available memory to be conservative and avoid OOMKills during multipart uploads. Parsing `memory.stat` for a tighter estimate was considered but deferred as an explicit design decision by the author (PR `#447`).
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.