Force a major GC and compact before measuring RSS (#481)
This will hopefully remove any variance caused by one benchmark run having just hit a GC and one not.
Aren't all benchmarks run in their own process, so this shouldn't matter, at least the part about one benchmark affecting another?
```ruby
# Full GC then compact before measuring RSS so fragmentation doesn't inflate the number.
GC.start(full_mark: true, immediate_sweep: true)
GC.compact if GC.respond_to?(:compact)
```
I think this is non-trivial overhead; maybe we should just report maxrss, which doesn't need this.
In the worst case, the extra GC might even increase memory usage.
GC.compact is also somewhat brittle, since it can break if a single extension doesn't do the right thing, so it feels rather dangerous to do.
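For context, "maxrss" here refers to the peak resident set size, i.e. `ru_maxrss` from getrusage(2) (exposed on Linux as the `VmHWM` field of `/proc/self/status`). Because it is a high-water mark, it does not depend on whether a GC happened to run just before sampling. A minimal sketch of reading peak vs. current RSS on Linux; the `/proc` field names are Linux-specific, and this is an illustration, not what the harness currently does:

```ruby
# Sketch: read current RSS (VmRSS) and peak RSS (VmHWM) from /proc/self/status.
# VmHWM is a high-water mark, so it needs no GC.start/GC.compact before sampling.
def read_status_kb(field)
  File.foreach("/proc/self/status") do |line|
    # Lines look like "VmRSS:      12345 kB"; the second column is the value in kB.
    return line.split[1].to_i if line.start_with?("#{field}:")
  end
  nil
end

current_rss_kb = read_status_kb("VmRSS")
peak_rss_kb    = read_status_kb("VmHWM")
```

Since the peak never decreases, `peak_rss_kb` is always at least the RSS observed at any earlier point in the process's life.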
I disagree that GC.compact makes this brittle. An extension that reports compaction support and breaks compaction is a correctness issue, and we shouldn't be benchmarking against broken code.
I am amenable to reporting maxrss, but that fundamentally changes the semantics of the benchmark results so I think that is a change we'd have to be more careful about. Or at least communicate effectively on rubybench.github.io or speed.ruby-lang.org.
> An extension that reports compaction support and breaks compaction is a correctness issue, and we shouldn't be benchmarking against broken code.
It can be more subtle than that. For example, holding a Ruby object in a VALUE struct field or a global variable without marking it, while also holding it from a Ruby object directly, works without GC.compact but breaks with it.
Such an extension would not report compaction support, yet it would still break.
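One way a harness could hedge against this class of bug: Ruby ships `GC.verify_compaction_references` (2.7+), which moves every movable object and then re-verifies references, so a stale unmarked VALUE held by an extension tends to fail loudly there rather than silently corrupting a run. A sketch, with the caveat that availability and keyword arguments vary by Ruby version and platform:

```ruby
# Sketch: smoke-test compaction safety before relying on GC.compact.
# GC.verify_compaction_references moves movable objects and checks that all
# references were updated, so broken extensions tend to fail here instead.
compaction_ok =
  if GC.respond_to?(:verify_compaction_references)
    begin
      GC.verify_compaction_references
      true
    rescue NotImplementedError
      # This build/platform does not support compaction at all.
      false
    end
  else
    false
  end

puts(compaction_ok ? "compaction looks safe" : "skipping GC.compact")
```

This would not catch every broken extension (only ones loaded and exercised before the check), but it turns the "silently moved out from under a struct field" failure mode into an explicit crash during verification.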
Did you notice any improvement in practice with this change?
Anecdotally, I have noticed smaller variance in RSS results when repeatedly benchmarking my current branch on an AWS bare metal instance set up specifically for benchmarking, though I haven't prepared any concrete documentation yet. I propose that we let it run for a few days and measure the effects on the memory benchmarks in https://github.com/rubybench/rubybench-data/ before deciding whether to revert. This is minimal risk: if we do decide to revert, the data from our testing date range can easily be backfilled.
I'd expect this will also change the reported RSS value a lot. Before, it was meant to be measured right after the last iteration, but now, with the GC beforehand, it measures which objects are globally reachable. That should be much smaller, and probably not interesting, since it would be far off from the actual usage during benchmarking.
Following this up: I went through the memory benchmark results from the last few days of runs, and tried to separate the harness change from unrelated changes landing in the same window.

The net effect on RSS magnitude is small: the median change across our ~200 benchmark pairs is about −0.3%. 45 benchmark pairs showed a significant movement in RSS; of those, 30 decreased and 15 increased. So, a slight bias toward lower RSS, but nothing dramatic. I don't really understand why collecting and compacting would increase RSS when measured like this, so I don't fully trust that this movement isn't due to other changes landing in the same window.

The variance story is more interesting. 43 benchmarks (~20%) had high run-to-run variance, where the stddev was more than 2% of the mean. Of those, 31 (>70%) improved (e.g. psych-load no_jit went from 13% CV to 1.7%). This was the issue I was trying to target with this PR: the benchmarks were oscillating between two discrete RSS values depending on whether a GC happened to run near the end of the run, and the explicit GC.start removes that oscillation.

I think you're right that this changes the semantics of the measurement. My read of the data is that GC.start alone gets us most of the benefit (deterministic collection removes the oscillation), and that adding GC.compact changes what we measure without buying much extra stability.

I'll put a PR together that removes the GC.compact call.
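For anyone reproducing the variance numbers: the "stddev more than 2% of the mean" cut is a coefficient-of-variation check. A sketch with made-up sample data; the benchmark names and RSS values below are illustrative, not real results:

```ruby
# Sketch: classify benchmarks as "noisy" when CV (stddev / mean) exceeds 2%.
def cv(samples)
  mean = samples.sum.to_f / samples.size
  var  = samples.sum { |s| (s - mean)**2 } / (samples.size - 1) # sample variance
  Math.sqrt(var) / mean
end

# Hypothetical RSS readings (kB) across five runs of two benchmarks.
rss_runs = {
  "psych-load" => [101_200, 130_500, 100_900, 131_000, 101_400], # bimodal: GC ran or not
  "steady"     => [100_100, 100_300, 99_900, 100_200, 100_000],
}

noisy = rss_runs.select { |_, runs| cv(runs) > 0.02 }
# "psych-load" oscillates between two discrete RSS levels, so its CV is high;
# "steady" stays under the 2% threshold.
```

The bimodal series is the pattern described above: two discrete RSS levels depending on whether a GC landed near the end of the run, which a mean/stddev summary reports as large variance.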