perf: benchmark zstd GRPC compression #749
Conversation
Signed-off-by: Anthony Petrov <anthony@swirldslabs.com>
JUnit Test Report: 78 files ±0, 78 suites ±0, 3m 44s ⏱️ +5s. Results for commit f3b1981; comparison against base commit 348053f. This pull request removes 6 and adds 6 tests. Note that renamed tests count towards both.
Integration Test Report: 418 files +3, 418 suites +3, 17m 52s ⏱️ -1m 14s. Results for commit f3b1981; comparison against base commit 348053f. This pull request removes 3 and adds 91 tests. Note that renamed tests count towards both.
Concerns / Issues
As an alternative to `Thread.sleep()`: for nanosecond-precision delays, a busy-wait spin loop is the standard approach in benchmarking:

```java
private void sleep(long bytes) {
    final long nanos = nanosPerByte * bytes;
    final long deadline = System.nanoTime() + nanos;
    while (System.nanoTime() < deadline) {
        Thread.onSpinWait(); // hint to the CPU (JDK 9+)
    }
}
```
Why this works for benchmarks: the spin loop avoids the scheduler-granularity inaccuracy of `Thread.sleep()`, which cannot reliably deliver sub-millisecond delays. The tradeoff is that busy-waiting consumes a full CPU core, but that's expected and acceptable in a JMH context.
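As a self-contained sketch of how such a spin-wait might be calibrated and sanity-checked (the `nanosPerByte` value and the 1 Gbps figure below are assumptions for illustration, not values from this PR):

```java
// SpinWaitSketch.java -- illustrative only; nanosPerByte and the
// 1 Gbps figure are assumptions, not values from this PR.
public class SpinWaitSketch {
    // 1 Gbps ~= 125_000_000 bytes/s => 8 ns per byte.
    static final long nanosPerByte = 8;

    static void sleep(long bytes) {
        final long nanos = nanosPerByte * bytes;
        final long deadline = System.nanoTime() + nanos;
        while (System.nanoTime() < deadline) {
            Thread.onSpinWait(); // CPU hint (JDK 9+)
        }
    }

    public static void main(String[] args) {
        final long bytes = 250_000; // simulate sending 250 KB
        final long start = System.nanoTime();
        sleep(bytes);
        final long elapsed = System.nanoTime() - start;
        // The spin loop can only overshoot, never return early,
        // because System.nanoTime() is monotonic.
        System.out.println(elapsed >= nanosPerByte * bytes);
    }
}
```

Unlike `Thread.sleep()`, the only error mode here is a small overshoot, which is the right bias for simulating a bandwidth floor.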
@jasperpotts: an update regarding `Level.Invocation`/`Level.Trial` - I think I found the cause of that at #758, but it will be a separate fix. Once that is merged, I'll update the benchmarks to use `Level.Trial` again, in a separate future PR. For this PR, we go with the same approach currently used in the existing PbjGrpcBench and use `Level.Invocation`. As mentioned above, the setup of the state happens outside of the measurement, and therefore it doesn't affect the measurement itself. So it's not a critical issue by any means: it doesn't affect the benchmark results, other than making the run just a tiny bit longer.
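To illustrate why per-invocation setup doesn't skew the score, here is a plain-Java sketch of what JMH does with `@Setup(Level.Invocation)` (the class and method names are made up; JMH itself handles this via annotations):

```java
// InvocationSetupSketch.java -- a hand-rolled analogue of JMH's
// @Setup(Level.Invocation): setup runs before each invocation but
// outside the timed window, so it never counts toward the score.
public class InvocationSetupSketch {
    static byte[] payload; // state prepared fresh per invocation

    static void setup() { // analogous to @Setup(Level.Invocation)
        payload = new byte[1 << 20]; // expensive, but not measured
    }

    static long benchmark() { // only this region is timed
        long sum = 0;
        for (byte b : payload) sum += b;
        return sum;
    }

    public static void main(String[] args) {
        long timedNanos = 0;
        for (int i = 0; i < 5; i++) {
            setup(); // outside the timing window
            long t0 = System.nanoTime();
            long r = benchmark();
            timedNanos += System.nanoTime() - t0;
            if (r != 0) throw new AssertionError();
        }
        // Setup cost inflates wall-clock time of the run, not the score.
        System.out.println("timed invocations: 5, setup excluded");
    }
}
```

The setup cost only lengthens the overall run, exactly as described above.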
Description:
Update on 3/12/26: introduced `TestBlockGrpcBench` to test actual block data.

Adding a zstd GRPC compression benchmark in PBJ integration tests. To that end:

- `GrpcCompression.register*()` APIs are added to support custom encodings.
- `PayloadWeight` is updated to precompute the payloads, making them semi-repeatable (to see some effects of compression), and a `SUPER` payload is added with 2M bytes.
- `PbjGrpcCall.setNetworkBytesInspector(PbjGrpcNetworkBytesInspector)` API is added to support network latency simulation.

Only `unary` benchmarks test the various compression methods, because the streaming benchmarks are already very slow on their own, and compression shouldn't have a different effect between unary and streaming cases since every message is processed individually anyway.

See results below. The main outcomes: `zstd` is faster with the larger payloads. However, it's a lot slower than `gzip` with smaller payloads. In turn, `gzip` is slower than `identity` with those smaller payloads.

Ultimately, compression is only beneficial with large requests/replies (>>8KB), assuming they contain enough compressible data. And if network speeds are very fast (>1Gbps), then compression becomes irrelevant.
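The "enough compressible data" caveat can be illustrated with the JDK's built-in gzip (a sketch only; zstd is not in the JDK, so gzip stands in, and the 2M-byte size mirrors the `SUPER` payload mentioned above):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

// CompressibilitySketch.java -- shows that compression only pays off
// when the payload actually contains redundancy.
public class CompressibilitySketch {
    static int gzipSize(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.size();
    }

    public static void main(String[] args) throws IOException {
        byte[] repetitive = new byte[2_000_000]; // all zeros: highly compressible
        byte[] random = new byte[2_000_000];
        new Random(42).nextBytes(random); // incompressible noise

        int rep = gzipSize(repetitive);
        int rnd = gzipSize(random);
        // Repetitive data shrinks by orders of magnitude; random data
        // stays essentially the same size (plus framing overhead).
        System.out.println(rep < 2_000_000 / 100);
        System.out.println(rnd > 2_000_000 * 99L / 100);
    }
}
```

The same reasoning explains why the benchmark results only show a win for large, compressible payloads: for incompressible or tiny messages, the CPU spent compressing buys no reduction in bytes on the wire.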
Related issue(s):
Fixes #746
Notes for reviewer:
Updated results on 3/17/2026:
Updated results:
The bench has been rewritten to use real block data. The items in the sample are:
So we get the following numbers of items for various block sizes:
Based on the above math, here's the results table with a manually added IPS (items per second) column:
Below are older results with the regular Greeter bench.
Comments in the issue show more varied results, but below is the final run:
Checklist