
fix: use cgroup memory limits for multipart upload concurrency in containers #447

Open

ProbstenHias wants to merge 2 commits into durch:master from ProbstenHias:fix/cgroup-memory-detection

Conversation

@ProbstenHias commented Feb 12, 2026

calculate_max_concurrent_chunks() uses sysinfo::System::available_memory() which reads /proc/meminfo on Linux. In containerized environments (Docker, Kubernetes), /proc/meminfo reports the host node's total memory, not the container's cgroup-enforced limit. This causes the function to massively overestimate available memory and spawn too many concurrent chunk uploads, leading to OOM kills.

Example: a pod with 512 MiB memory limit running on a 64 GiB node would see ~64 GiB available, calculate 100 concurrent 8 MiB chunks (800 MiB total), and get OOMKilled.

The fix adds cgroup-aware memory detection that runs before the sysinfo fallback:

  • Parse /proc/self/cgroup to resolve the process's cgroup path; handle both nested paths (e.g. /kubepods.slice/.../cri-containerd-xxx.scope) and root path "/" which is the common case with private cgroup namespaces (Docker default, Kubernetes default since 1.25)
  • cgroup v2: read memory.max and memory.current from /sys/fs/cgroup/<cgroup-path>/; treat "max" as unlimited
  • cgroup v1: read memory.limit_in_bytes and memory.usage_in_bytes from /sys/fs/cgroup/memory/<cgroup-path>/; treat values near u64::MAX (PAGE_COUNTER_MAX) as unlimited
  • Available memory = limit - current usage (saturating_sub)
  • Return None when not under cgroup constraints, so the caller falls back to the existing sysinfo logic unchanged
  • All cgroup functions gated on #[cfg(target_os = "linux")] and the async feature flags to avoid dead code warnings in sync builds or on non-Linux platforms
  • Add unit tests covering cgroup v2 with limit, v2 unlimited, v1 with limit, v1 unlimited, path parsing, and concurrency clamping math
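The v2 arithmetic described above can be sketched as a pure function over the two file contents; the function name and exact parsing here are illustrative, not the PR's actual code:

```rust
// Hypothetical sketch of the cgroup v2 available-memory computation:
// "max" means unlimited (fall back to sysinfo), otherwise
// available = limit - current, saturating at zero.
fn v2_available(memory_max: &str, memory_current: &str) -> Option<u64> {
    let max = memory_max.trim();
    if max == "max" {
        return None; // unlimited: caller falls back to sysinfo
    }
    let limit: u64 = max.parse().ok()?;
    let current: u64 = memory_current.trim().parse().ok()?;
    Some(limit.saturating_sub(current))
}

fn main() {
    // 512 MiB limit with 80 MiB currently charged to the cgroup.
    assert_eq!(v2_available("536870912\n", "83886080\n"), Some(452984832));
    assert_eq!(v2_available("max\n", "83886080\n"), None);
}
```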


Summary by CodeRabbit

  • Improvements
    • Enhanced Linux memory detection (including container environments) to better size concurrent uploads; increases max parallel uploads to 100 while keeping safe minimum concurrency for low/unknown memory.
  • Tests
    • Added Linux-only tests to verify memory-detection and concurrency-calculation behavior.


coderabbitai bot commented Feb 12, 2026

📝 Walkthrough

Adds Linux cgroup-based memory detection to Bucket::calculate_max_concurrent_chunks: try cgroup v2, fall back to cgroup v1, then sysinfo; clamps computed chunk count to [2,100], returns 3 when available memory is zero, and adds Linux-only helpers and unit tests.
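The clamping behavior summarized above can be sketched as follows; the real calculate_max_concurrent_chunks() likely applies a per-chunk safety factor that this simplified version omits, and CHUNK_SIZE is assumed from the 8 MiB figure in the PR description:

```rust
const CHUNK_SIZE: u64 = 8 * 1024 * 1024; // assumed 8 MiB multipart chunks

// Illustrative concurrency sizing: returns 3 for zero/unknown memory,
// otherwise divides available memory by the chunk size and clamps to [2, 100].
fn max_concurrent_chunks(available: u64) -> usize {
    if available == 0 {
        return 3; // conservative default when memory is unknown
    }
    ((available / CHUNK_SIZE) as usize).clamp(2, 100)
}

fn main() {
    assert_eq!(max_concurrent_chunks(0), 3);
    assert_eq!(max_concurrent_chunks(512 * 1024 * 1024), 64); // 512 MiB pod
    assert_eq!(max_concurrent_chunks(64 * 1024 * 1024 * 1024), 100); // 64 GiB host, clamped
    assert_eq!(max_concurrent_chunks(1024 * 1024), 2); // tiny limit, floor of 2
}
```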

Changes

  • Bucket logic & cgroup helpers (s3/src/bucket.rs): Introduce Linux-only helpers (cgroup_available_memory, parse_cgroup_path, read_cgroup_path, cgroup_v2_available_memory, cgroup_v1_available_memory) and platform-gated stubs. Update calculate_max_concurrent_chunks() to prefer cgroup v2 → cgroup v1 → sysinfo, refresh sysinfo only on fallback, treat "unlimited" as None, return 3 when available memory is 0, and clamp the result to [2, 100].
  • Tests, Linux-only (s3/src/... — tests in the same module): Add unit tests covering cgroup path parsing, v2/v1 memory computations, handling of "unlimited", and clamp/edge-case behavior for the concurrency calculation.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant Bucket
    participant CgroupFS
    participant Sysinfo
    Caller->>Bucket: calculate_max_concurrent_chunks()
    Bucket->>CgroupFS: read /proc/self/cgroup -> determine cgroup path
    alt cgroup v2 found
        Bucket->>CgroupFS: read memory.max and memory.current
        CgroupFS-->>Bucket: limit / current (or "max")
    else cgroup v1 found
        Bucket->>CgroupFS: read memory.limit_in_bytes and memory.usage_in_bytes
        CgroupFS-->>Bucket: limit / usage
    else no cgroup info
        Bucket->>Sysinfo: refresh and read available memory
        Sysinfo-->>Bucket: total/available
    end
    Bucket->>Bucket: compute available = limit - usage (saturating)
    Bucket->>Bucket: derive chunk count, clamp to [2,100], if available==0 -> return 3
    Bucket-->>Caller: chunk count
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I sniffed the cgroup paths tonight,
v2 then v1 by lantern-light,
If none reply, sysinfo's near,
Chunks rise to a hundred, never fear,
Hop—uploads steady, tails held tight! 🥕

🚥 Pre-merge checks | ✅ 3 passed

  • Description check: ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled)
  • Title check: ✅ Passed. The title accurately describes the main change: using cgroup memory limits for multipart upload concurrency in containers, which directly addresses the core problem of memory overestimation in containerized environments.
  • Docstring coverage: ✅ Passed. Coverage is 100.00%, above the required 80.00% threshold.



@brownian-motion-v0

Brownian Motion (Brass)

Recommendation: Proceed

Summary: PR addresses OOM kills by improving memory limit detection in containers.
Risk: Medium · Confidence: 80%

Highlights

  • Good test coverage
  • Clear commit history

Unknowns

  • No linked issue
  • Missing performance metrics

Next actions

  • Keep: cgroup memory limit detection logic
  • Drop: unnecessary comments in the code
  • Add: documentation for new functions

Reflection questions

  • What core assumption underpins this PR's approach?
  • How does this change align with the project's longer-term goals?
  • Could there be a simpler way to achieve the primary objective here?


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@s3/src/bucket.rs`:
- Around line 4344-4346: The test function
test_cgroup_available_memory_returns_none_when_no_cgroup inside the cgroup_tests
module is dead because the module is #[cfg(target_os = "linux")] while the test
body is under #[cfg(not(target_os = "linux"))]; either remove the unreachable
#[cfg(not(target_os = "linux"))] block or restructure the tests so they actually
run on Linux: move the assertion into the linux-configured module and call
cgroup_available_memory() (or the real function under test) and assert the
expected Some(...) or None value, or split into two explicit tests with opposite
cfg attributes if you need OS-specific behavior; update or delete the empty test
body for test_cgroup_available_memory_returns_none_when_no_cgroup accordingly.
🧹 Nitpick comments (4)
s3/src/bucket.rs (4)

3036-3066: read_cgroup_path("v2") returns Some("/") for root cgroup, but the corresponding test uses stricter logic.

test_read_cgroup_path_v2 (line 4362) checks !path.is_empty() && path != "/", but the actual v2 branch at line 3050 only checks !path.is_empty(). While this doesn't cause a test failure (since the test input isn't "/"), the duplicated-and-divergent logic is a maintenance hazard.

Consider calling the actual read_cgroup_path function from tests instead of re-implementing the parsing inline. This would catch regressions in the real code and eliminate the risk of test/implementation drift.


3008-3018: Silent failure with no observability when cgroup detection is skipped.

All cgroup helper functions silently return None on any failure (file not found, parse error, unlimited). Consider adding log::debug! or log::trace! messages when cgroup detection falls through, so operators can understand which memory source was used. This is especially valuable when diagnosing OOM issues in containers.

Example logging addition
```diff
 fn cgroup_available_memory() -> Option<u64> {
-    cgroup_v2_available_memory().or_else(cgroup_v1_available_memory)
+    let result = cgroup_v2_available_memory().or_else(cgroup_v1_available_memory);
+    if let Some(mem) = result {
+        log::debug!("cgroup memory detection: available={} bytes", mem);
+    } else {
+        log::debug!("cgroup memory detection: no cgroup limits found, falling back to sysinfo");
+    }
+    result
 }
```

4349-4409: Tests re-implement parsing logic inline instead of calling read_cgroup_path.

test_read_cgroup_path_v2, test_read_cgroup_path_v1, and test_read_cgroup_path_v2_root_returns_root all duplicate the parsing logic from read_cgroup_path rather than invoking it. This means the tests don't actually validate the real implementation — a bug in read_cgroup_path wouldn't be caught.

Since read_cgroup_path reads /proc/self/cgroup (which may not have the expected entries in CI), consider either:

  1. Refactoring read_cgroup_path to accept the file content as a parameter (making it testable), or
  2. At minimum, calling the actual function in the tests and asserting it doesn't panic, even if the exact result is environment-dependent.
Suggested refactor: extract parsing logic into a testable function
```diff
+/// Parse a cgroup path from the content of /proc/self/cgroup.
+#[cfg(all(
+    target_os = "linux",
+    any(feature = "with-tokio", feature = "with-async-std")
+))]
+fn parse_cgroup_path(content: &str, controller: &str) -> Option<String> {
+    for line in content.lines() {
+        let parts: Vec<&str> = line.splitn(3, ':').collect();
+        if parts.len() != 3 {
+            continue;
+        }
+        if controller == "v2" {
+            if parts[0] == "0" && parts[1].is_empty() {
+                let path = parts[2].trim();
+                if !path.is_empty() {
+                    return Some(path.to_string());
+                }
+            }
+        } else {
+            let controllers = parts[1];
+            if controllers.split(',').any(|c| c == controller) {
+                let path = parts[2].trim();
+                if !path.is_empty() && path != "/" {
+                    return Some(path.to_string());
+                }
+            }
+        }
+    }
+    None
+}
+
 #[cfg(all(
     target_os = "linux",
     any(feature = "with-tokio", feature = "with-async-std")
 ))]
 fn read_cgroup_path(controller: &str) -> Option<String> {
     let content = std::fs::read_to_string("/proc/self/cgroup").ok()?;
-    for line in content.lines() {
-        ...
-    }
-    None
+    parse_cgroup_path(&content, controller)
 }
```

Then tests can call parse_cgroup_path directly with known input.


4411-4435: Temp directory tests use fixed paths and may race under parallel execution.

test_cgroup_v2_available_memory_with_limit and test_cgroup_v1_available_memory_with_limit use hardcoded temp directory names (rust_s3_cgroup_test_v2, rust_s3_cgroup_test_v1). If cargo test runs these in parallel (default), they could interfere with each other on repeated runs. Consider using tempfile::tempdir() for unique, auto-cleaned directories, or adding unique suffixes.

Also applies to: 4446-4463
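One stdlib-only way to get collision-free fixture directories, along the lines of the PID-scoped approach mentioned in the follow-up commit (the helper name and directory layout here are illustrative):

```rust
use std::fs;
use std::path::PathBuf;

// Scope the fixture directory to the current process id so parallel
// `cargo test` runs of the same test binary don't share a path.
fn unique_test_dir(name: &str) -> PathBuf {
    let dir = std::env::temp_dir().join(format!("{}_{}", name, std::process::id()));
    fs::create_dir_all(&dir).expect("create test dir");
    dir
}

fn main() {
    let dir = unique_test_dir("rust_s3_cgroup_test_v2");
    fs::write(dir.join("memory.max"), "536870912\n").unwrap();
    let limit = fs::read_to_string(dir.join("memory.max")).unwrap();
    assert_eq!(limit.trim(), "536870912");
    fs::remove_dir_all(&dir).unwrap(); // clean up the fixture
}
```

Note this still races if two independent test processes pick the same name, which is why tempfile::tempdir() (unique, auto-cleaned) remains the more robust option.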

@swermin

swermin commented Mar 27, 2026

What’s the status of this PR? We are having issues with uploads taking too much memory and the containers are getting OOMKilled

The follow-up commit addresses the review feedback above:

- Extract parse_cgroup_path() from read_cgroup_path() for testability
- Remove dead test gated behind impossible cfg combination
- Fix test/impl drift on v2 root path handling
- Add log::debug! for cgroup detection fallthrough observability
- Use PID-scoped temp dirs to prevent parallel test races
@brownian-motion-v0

Brownian Motion (Brass)

Recommendation: Proceed

Summary: PR effectively addresses memory overestimation in containerized environments.
Risk: Medium · Confidence: 80%

Highlights

  • Good test coverage
  • Clear commit history

Unknowns

  • No linked issue
  • Missing performance metrics

Next actions

  • Keep: cgroup-aware memory detection logic
  • Drop: unnecessary complexity in error handling
  • Add: documentation for new functions

Reflection questions

  • What core assumption underpins this PR's approach?
  • How does this change align with the project's longer-term goals?
  • Could there be a simpler way to achieve the primary objective here?


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
s3/src/bucket.rs (1)

4411-4475: Tests don't actually exercise cgroup_v{1,2}_available_memory — they re-implement the parse inline.

test_cgroup_v2_available_memory_with_limit and test_cgroup_v1_available_memory_with_limit create temp dirs with memory.max/memory.current (and v1 equivalents), then manually fs::read_to_string + parse + saturating_sub. The real cgroup_v2_available_memory() / cgroup_v1_available_memory() functions are never invoked, because they hardcode /sys/fs/cgroup and /sys/fs/cgroup/memory. A regression in the real functions (e.g., wrong filename, broken trim_end_matches, incorrect saturating semantics) would not be caught by these tests.

Consider refactoring the two functions to accept a base-path argument (or splitting the path construction into a testable helper), so the tests can drive them end-to-end against the tmpdir fixture:

♻️ Suggested shape
```rust
fn cgroup_v2_available_memory_at(base: &Path) -> Option<u64> {
    let max_str = std::fs::read_to_string(base.join("memory.max")).ok()?;
    let max_str = max_str.trim();
    if max_str == "max" { return None; }
    let limit: u64 = max_str.parse().ok()?;
    let current: u64 = std::fs::read_to_string(base.join("memory.current")).ok()?
        .trim().parse().ok()?;
    Some(limit.saturating_sub(current))
}

fn cgroup_v2_available_memory() -> Option<u64> {
    let cgroup_path = read_cgroup_path("v2")?;
    if !cgroup_path.starts_with('/') { return None; }
    let base = format!("/sys/fs/cgroup{}", cgroup_path.trim_end_matches('/'));
    cgroup_v2_available_memory_at(Path::new(&base))
}
```

Then the tests can call cgroup_v2_available_memory_at(&cgroup_dir) directly and assert the expected Some(77164544) / None results.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@s3/src/bucket.rs` around lines 4411 - 4475, The tests recreate cgroup files
but never call cgroup_v2_available_memory or cgroup_v1_available_memory, so
regressions aren’t caught; refactor by adding testable helpers (e.g.,
cgroup_v2_available_memory_at(&Path) and cgroup_v1_available_memory_at(&Path))
that read memory.max/current (or memory.limit_in_bytes/usage_in_bytes) from a
given base path and compute saturating_sub, then have the original
cgroup_v2_available_memory and cgroup_v1_available_memory call those helpers
after constructing the real /sys/fs/cgroup path; update tests to call the _at
helpers with the tmpdir to exercise the real parsing and semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@s3/src/bucket.rs`:
- Around line 3119-3142: The current cgroup_v1_available_memory() uses
memory.usage_in_bytes (and v2 memory.current) which counts reclaimable page
cache and therefore underestimates usable memory; update
cgroup_v1_available_memory (and the v2 counterpart if present) to read and parse
cgroup memory.stat and subtract reclaimable fields (for v1: "cache" and
"inactive_file" or for v2: "file" and "inactive_file") from the usage/current
value before computing limit - usage so the result excludes reclaimable pages,
and ensure you still handle missing keys and fall back to the existing
conservative behavior (or a config flag) if memory.stat parsing fails.
- Around line 1606-1622: The change in calculate_max_concurrent_chunks()
increased the clamp from 10 to 100 which allows up to ~2.4 GiB of in-flight
chunk buffers (CHUNK_SIZE * safety_factor * 100) on hosts without cgroup limits;
revert or make the upper bound explicit: restore the clamp to 10 (or introduce a
named constant MAX_CONCURRENT_CHUNKS and use it in
calculate_max_concurrent_chunks()) and add a brief comment documenting the
rationale/impact for uncontained environments and any config toggle so
reviewers/operators can reason about S3/client pool pressure.

---

Nitpick comments:
In `@s3/src/bucket.rs`:
- Around line 4411-4475: The tests recreate cgroup files but never call
cgroup_v2_available_memory or cgroup_v1_available_memory, so regressions aren’t
caught; refactor by adding testable helpers (e.g.,
cgroup_v2_available_memory_at(&Path) and cgroup_v1_available_memory_at(&Path))
that read memory.max/current (or memory.limit_in_bytes/usage_in_bytes) from a
given base path and compute saturating_sub, then have the original
cgroup_v2_available_memory and cgroup_v1_available_memory call those helpers
after constructing the real /sys/fs/cgroup path; update tests to call the _at
helpers with the tmpdir to exercise the real parsing and semantics.
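The named-constant suggestion from the second inline comment can be sketched as below; CHUNK_SIZE and the safety factor of 3 are assumptions inferred from the ~2.4 GiB worst-case figure quoted there, not values confirmed in the PR:

```rust
const CHUNK_SIZE: u64 = 8 * 1024 * 1024; // assumed 8 MiB multipart chunks
const SAFETY_FACTOR: u64 = 3; // assumed from the ~2.4 GiB figure in the comment

/// Upper bound on concurrent chunk uploads. Named so the worst-case
/// in-flight buffer budget on uncontained hosts is explicit and reviewable.
const MAX_CONCURRENT_CHUNKS: u64 = 100;
const MIN_CONCURRENT_CHUNKS: u64 = 2;

fn clamp_chunks(n: u64) -> u64 {
    n.clamp(MIN_CONCURRENT_CHUNKS, MAX_CONCURRENT_CHUNKS)
}

fn main() {
    // 8 MiB * 3 * 100 = 2400 MiB, i.e. roughly the 2.4 GiB the reviewer cites.
    let worst_case = CHUNK_SIZE * SAFETY_FACTOR * MAX_CONCURRENT_CHUNKS;
    assert_eq!(worst_case, 2400 * 1024 * 1024);
    assert_eq!(clamp_chunks(8192), 100);
    assert_eq!(clamp_chunks(1), 2);
}
```

Documenting the bound this way lets operators reason about S3/client pool pressure without reading the sizing logic.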
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8a498e3c-9d44-4961-a3d8-1ac7aee19b8c

📥 Commits

Reviewing files that changed from the base of the PR and between b7583a9 and e5225f0.

📒 Files selected for processing (1)
  • s3/src/bucket.rs

Comment on lines +3119 to +3142
```rust
fn cgroup_v1_available_memory() -> Option<u64> {
    let cgroup_path = read_cgroup_path("memory")?;
    let base = format!(
        "/sys/fs/cgroup/memory{}",
        if cgroup_path.starts_with('/') {
            &cgroup_path
        } else {
            return None;
        }
    );

    let limit_str = std::fs::read_to_string(format!("{}/memory.limit_in_bytes", base)).ok()?;
    let limit: u64 = limit_str.trim().parse().ok()?;

    // Values near u64::MAX indicate no limit (PAGE_COUNTER_MAX * page_size)
    if limit >= 0x7FFF_FFFF_FFFF_F000 {
        return None; // unlimited, fall back to sysinfo
    }

    let usage_str = std::fs::read_to_string(format!("{}/memory.usage_in_bytes", base)).ok()?;
    let usage: u64 = usage_str.trim().parse().ok()?;

    Some(limit.saturating_sub(usage))
}
```

@coderabbitai coderabbitai bot Apr 18, 2026


⚠️ Potential issue | 🟡 Minor

Consider if memory.current / memory.usage_in_bytes is the right signal.

memory.current (v2) and memory.usage_in_bytes (v1) include reclaimable page cache and other evictable memory. Computing limit - current will therefore under-estimate memory available to the upload whenever the container has warm page cache — which, e.g., a long-running service writing to disk easily does. The effect is that concurrency gets clamped toward the lower bound (2) even when plenty of memory could actually be reclaimed for new allocations.

That's a safe conservative direction (favours "too little" over OOMKill), so this isn't a blocker. If you want a tighter number, cgroup v2 exposes memory.stat with file / inactive_file lines you could subtract from current; v1 has memory.stat with cache / inactive_file similarly. Worth considering, otherwise the current conservative behaviour is fine — just flagging so it's an informed choice.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@s3/src/bucket.rs` around lines 3119 - 3142, The current
cgroup_v1_available_memory() uses memory.usage_in_bytes (and v2 memory.current)
which counts reclaimable page cache and therefore underestimates usable memory;
update cgroup_v1_available_memory (and the v2 counterpart if present) to read
and parse cgroup memory.stat and subtract reclaimable fields (for v1: "cache"
and "inactive_file" or for v2: "file" and "inactive_file") from the
usage/current value before computing limit - usage so the result excludes
reclaimable pages, and ensure you still handle missing keys and fall back to the
existing conservative behavior (or a config flag) if memory.stat parsing fails.
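A hedged sketch of the memory.stat refinement the reviewer describes (not what the PR ships): subtract the reclaimable inactive_file pages from the charged usage before computing limit minus usage. Field names follow cgroup v2's memory.stat; the function names are illustrative:

```rust
// Find the inactive_file value (reclaimable page cache) in memory.stat content.
fn reclaimable_from_stat(stat: &str) -> u64 {
    for line in stat.lines() {
        if let Some(("inactive_file", val)) = line.split_once(' ') {
            return val.trim().parse().unwrap_or(0);
        }
    }
    0 // missing key: fall back to the conservative (no-subtraction) estimate
}

// Tighter estimate: available = limit - (current - reclaimable), saturating.
fn tighter_available(limit: u64, current: u64, stat: &str) -> u64 {
    let effective = current.saturating_sub(reclaimable_from_stat(stat));
    limit.saturating_sub(effective)
}

fn main() {
    let stat = "anon 41943040\nfile 33554432\ninactive_file 25165824\n";
    // 512 MiB limit, 80 MiB charged, of which 24 MiB is reclaimable file cache.
    assert_eq!(
        tighter_available(536_870_912, 83_886_080, stat),
        536_870_912 - (83_886_080 - 25_165_824)
    );
}
```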

ProbstenHias (author) replied:

Agreed this is worth knowing, but the conservative direction (underestimating available memory) is the right trade-off for an OOM-fix PR. Parsing memory.stat would add complexity for a marginal improvement. If it becomes a problem in practice (concurrency too low on cache-heavy workloads), it can be revisited separately.

coderabbitai bot replied:

@ProbstenHias, that's a sound call — for an OOM-fix PR, erring on the side of conservative concurrency is clearly the right priority. The memory.stat path can always be layered on later as an opt-in refinement if real-world workloads show the concurrency floor being hit too aggressively. Happy to close this out.


✏️ Learnings added
Learnt from: ProbstenHias
Repo: durch/rust-s3 PR: 447
File: s3/src/bucket.rs:3119-3142
Timestamp: 2026-04-18T19:49:49.392Z
Learning: In `s3/src/bucket.rs` (rust-s3 crate), `cgroup_v2_available_memory()` and `cgroup_v1_available_memory()` intentionally use `memory.current` / `memory.usage_in_bytes` (which include reclaimable page cache) rather than subtracting reclaimable fields from `memory.stat`. This deliberately underestimates available memory to be conservative and avoid OOMKills during multipart uploads. Parsing `memory.stat` for a tighter estimate was considered but deferred as an explicit design decision by the author (PR `#447`).

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

