
perf: benchmark zstd GRPC compression #749

Merged
anthony-swirldslabs merged 8 commits into main from 746-zstdBench
Mar 24, 2026

Conversation

@anthony-swirldslabs
Contributor

anthony-swirldslabs commented Mar 10, 2026

Description:
Update on 3/12/26: introduced TestBlockGrpcBench to test actual block data.

Adding a zstd GRPC compression benchmark in PBJ integration tests. To that end:

  • GrpcCompression.register*() APIs are added to support custom encodings.
  • PayloadWeight is updated to precompute the payloads, to make them semi-repeatable (to see some effects of compression), and a SUPER payload is added with 2M bytes.
  • An unofficial PbjGrpcCall.setNetworkBytesInspector(PbjGrpcNetworkBytesInspector) API is added to support network latency simulation.
  • A 1Gbps network is simulated in the benchmarks to show at least some benefit of the compression.
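The 1Gbps simulation boils down to delaying each send/receive by the payload's wire time. A minimal sketch of the idea follows; the callback shape is illustrative only, not the actual PbjGrpcNetworkBytesInspector interface:

```java
// Sketch only: simulate a 1 Gbps link by delaying in proportion to the
// number of bytes that crossed the wire. The method name and shape are
// illustrative, not the actual PBJ inspector API.
public final class OneGbpsSimulator {
    // 1 Gbps = 10^9 bits/s, i.e. 8 ns of "wire time" per byte.
    private static final long NANOS_PER_BYTE = 8;

    // Busy-wait for the simulated transmission time of the given bytes.
    public static void inspect(long bytes) {
        final long deadline = System.nanoTime() + NANOS_PER_BYTE * bytes;
        while (System.nanoTime() < deadline) {
            Thread.onSpinWait(); // precise sub-millisecond delays
        }
    }
}
```

At 1Gbps a byte occupies 8ns on the wire, so the 2M-byte SUPER payload adds roughly 16ms of simulated latency per direction.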

Only the unary benchmarks exercise the various compression methods: the streaming benchmarks are already very slow on their own, and compression shouldn't affect unary and streaming cases differently because every message is processed individually in both anyway.

See results below. The main outcomes:

  • A fast network negates the benefits of compression because it's faster to send the bytes than to spend CPU compressing them. Slowing the simulated network down (e.g. to 100Mbps or less) could artificially inflate the benefits of compression, but that doesn't seem fair.
  • Small(er) payloads don't benefit from compression. Only the 2M payload seems to benefit, and only sometimes. It's questionable whether this payload size is truly typical, though.
  • zstd is faster with those larger payloads, but a lot slower than gzip with smaller payloads. In turn, gzip is slower than identity with those smaller payloads.

Ultimately, compression is only beneficial for large requests/replies (>>8KB), assuming they contain enough compressible data. And if network speeds are very fast (>1Gbps), compression becomes irrelevant.

Related issue(s):

Fixes #746

Notes for reviewer:
Updated results on 3/17/2026:

Benchmark                              (encodings)  (maxBlockSize)   Mode  Cnt     Score     Error  Units  IPS
TestBlockGrpcBench.benchBidiStreaming     identity          102400  thrpt    3   900.908 ± 101.119  ops/s  315K
TestBlockGrpcBench.benchBidiStreaming     identity          524288  thrpt    3   176.079 ±  10.660  ops/s  317K
TestBlockGrpcBench.benchBidiStreaming     identity         2048000  thrpt    3    46.429 ±   3.484  ops/s  325K
TestBlockGrpcBench.benchBidiStreaming         gzip          102400  thrpt    3   573.893 ±  60.821  ops/s  201K
TestBlockGrpcBench.benchBidiStreaming         gzip          524288  thrpt    3   103.225 ±   3.426  ops/s  186K
TestBlockGrpcBench.benchBidiStreaming         gzip         2048000  thrpt    3    27.067 ±   1.305  ops/s  189K
TestBlockGrpcBench.benchBidiStreaming         zstd          102400  thrpt    3  1735.377 ± 789.987  ops/s  607K
TestBlockGrpcBench.benchBidiStreaming         zstd          524288  thrpt    3   323.607 ±  21.641  ops/s  582K
TestBlockGrpcBench.benchBidiStreaming         zstd         2048000  thrpt    3    91.564 ±   4.128  ops/s  641K
TestBlockGrpcBench.benchBidiStreaming       zstd10          102400  thrpt    3   781.124 ±  77.676  ops/s  273K
TestBlockGrpcBench.benchBidiStreaming       zstd10          524288  thrpt    3   148.500 ±  16.553  ops/s  267K
TestBlockGrpcBench.benchBidiStreaming       zstd10         2048000  thrpt    3    37.497 ±   1.823  ops/s  262K
TestBlockGrpcBench.benchBidiStreaming       zstd-5          102400  thrpt    3  1606.577 ± 246.504  ops/s  562K
TestBlockGrpcBench.benchBidiStreaming       zstd-5          524288  thrpt    3   280.603 ±  20.727  ops/s  505K
TestBlockGrpcBench.benchBidiStreaming       zstd-5         2048000  thrpt    3    73.976 ±  45.961  ops/s  518K
TestBlockGrpcBench.benchUnary             identity          102400  thrpt    3   726.609 ± 345.864  ops/s  254K
TestBlockGrpcBench.benchUnary             identity          524288  thrpt    3   166.334 ±  13.599  ops/s  299K
TestBlockGrpcBench.benchUnary             identity         2048000  thrpt    3    44.689 ±   3.642  ops/s  313K
TestBlockGrpcBench.benchUnary                 gzip          102400  thrpt    3   341.977 ±   7.401  ops/s  120K
TestBlockGrpcBench.benchUnary                 gzip          524288  thrpt    3    70.284 ±   5.135  ops/s  127K
TestBlockGrpcBench.benchUnary                 gzip         2048000  thrpt    3    18.919 ±   0.144  ops/s  132K
TestBlockGrpcBench.benchUnary                 zstd          102400  thrpt    3   708.404 ±  76.297  ops/s  248K
TestBlockGrpcBench.benchUnary                 zstd          524288  thrpt    3   186.305 ±  16.827  ops/s  335K
TestBlockGrpcBench.benchUnary                 zstd         2048000  thrpt    3    57.511 ±   2.269  ops/s  403K
TestBlockGrpcBench.benchUnary               zstd10          102400  thrpt    3   430.910 ±  28.807  ops/s  151K
TestBlockGrpcBench.benchUnary               zstd10          524288  thrpt    3   104.016 ±  10.584  ops/s  187K
TestBlockGrpcBench.benchUnary               zstd10         2048000  thrpt    3    28.662 ±   2.335  ops/s  201K
TestBlockGrpcBench.benchUnary               zstd-5          102400  thrpt    3   848.160 ± 112.154  ops/s  297K
TestBlockGrpcBench.benchUnary               zstd-5          524288  thrpt    3   218.547 ±  41.507  ops/s  393K
TestBlockGrpcBench.benchUnary               zstd-5         2048000  thrpt    3    61.696 ±   3.951  ops/s  432K

Earlier results (from the 3/12/26 update):

The bench has been rewritten to use real block data. The items in the sample are:

Test blocks of maxBlockSize 2048000:
   0: 7234 items, 2047972 bytes total, with average item at 283 bytes
   1: 6776 items, 2037785 bytes total, with average item at 300 bytes
   2: 7157 items, 2047982 bytes total, with average item at 286 bytes
...

So we get approximately the following item counts for the various block sizes:

maxBlockSize  items
102400          350
524288         1800
2048000        7000
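The IPS column is derived by multiplying each benchmark's ops/s score by the approximate item count for its block size. A quick sanity check of that arithmetic (IpsCalc is a hypothetical helper for illustration, not part of the PR):

```java
// Hypothetical helper: IPS ≈ benchmark ops/s × items per block.
public final class IpsCalc {
    public static long ips(double opsPerSec, long itemsPerBlock) {
        return Math.round(opsPerSec * itemsPerBlock);
    }

    public static void main(String[] args) {
        // identity at maxBlockSize 2048000: 35.170 ops/s × ~7000 items/block
        System.out.println(ips(35.170, 7000)); // 246190, i.e. ~246K items/s
    }
}
```

This lands close to the ~245K shown for that row, the small difference coming from the rounded item count.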

Based on the above math, here are the results with a manually added IPS (items per second) column:

Benchmark                              (encodings)  (maxBlockSize)   Mode  Cnt     Score     Error  Units  IPS
TestBlockGrpcBench.benchBidiStreaming     identity          102400  thrpt    3   571.208 ± 391.871  ops/s  199K
TestBlockGrpcBench.benchBidiStreaming     identity          524288  thrpt    3   108.676 ±  92.651  ops/s  194K
TestBlockGrpcBench.benchBidiStreaming     identity         2048000  thrpt    3    35.170 ±   1.578  ops/s  245K
TestBlockGrpcBench.benchBidiStreaming         gzip          102400  thrpt    3   576.079 ±  58.788  ops/s  201K
TestBlockGrpcBench.benchBidiStreaming         gzip          524288  thrpt    3   103.364 ±   0.980  ops/s  185K
TestBlockGrpcBench.benchBidiStreaming         gzip         2048000  thrpt    3    27.226 ±   1.710  ops/s  189K
TestBlockGrpcBench.benchBidiStreaming         zstd          102400  thrpt    3  1303.327 ± 150.537  ops/s  456K
TestBlockGrpcBench.benchBidiStreaming         zstd          524288  thrpt    3   262.027 ±  10.394  ops/s  471K
TestBlockGrpcBench.benchBidiStreaming         zstd         2048000  thrpt    3    73.482 ±   3.155  ops/s  511K
TestBlockGrpcBench.benchBidiStreaming        zstd0          102400  thrpt    3  1291.796 ± 107.170  ops/s  452K
TestBlockGrpcBench.benchBidiStreaming        zstd0          524288  thrpt    3   252.795 ±  14.902  ops/s  455K
TestBlockGrpcBench.benchBidiStreaming        zstd0         2048000  thrpt    3    72.517 ±   1.303  ops/s  507K
TestBlockGrpcBench.benchBidiStreaming       zstd-5          102400  thrpt    3  1170.695 ±  58.590  ops/s  409K
TestBlockGrpcBench.benchBidiStreaming       zstd-5          524288  thrpt    3   211.161 ±  69.433  ops/s  380K
TestBlockGrpcBench.benchBidiStreaming       zstd-5         2048000  thrpt    3    65.638 ±   2.561  ops/s  459K
TestBlockGrpcBench.benchUnary             identity          102400  thrpt    3   447.294 ± 357.286  ops/s  156K
TestBlockGrpcBench.benchUnary             identity          524288  thrpt    3   102.628 ±  49.036  ops/s  184K
TestBlockGrpcBench.benchUnary             identity         2048000  thrpt    3    30.458 ±   3.870  ops/s  213K
TestBlockGrpcBench.benchUnary                 gzip          102400  thrpt    3   298.344 ±  77.744  ops/s  104K
TestBlockGrpcBench.benchUnary                 gzip          524288  thrpt    3    66.626 ±   4.734  ops/s  120K
TestBlockGrpcBench.benchUnary                 gzip         2048000  thrpt    3    17.656 ±   2.364  ops/s  123K
TestBlockGrpcBench.benchUnary                 zstd          102400  thrpt    3   603.907 ±  21.967  ops/s  211K
TestBlockGrpcBench.benchUnary                 zstd          524288  thrpt    3   146.474 ±  18.600  ops/s  263K
TestBlockGrpcBench.benchUnary                 zstd         2048000  thrpt    3    48.994 ±   1.385  ops/s  343K
TestBlockGrpcBench.benchUnary                zstd0          102400  thrpt    3   604.090 ±  62.128  ops/s  244K
TestBlockGrpcBench.benchUnary                zstd0          524288  thrpt    3   146.230 ±  41.440  ops/s  263K
TestBlockGrpcBench.benchUnary                zstd0         2048000  thrpt    3    50.207 ±  31.698  ops/s  351K
TestBlockGrpcBench.benchUnary               zstd-5          102400  thrpt    3   691.045 ± 258.739  ops/s  242K
TestBlockGrpcBench.benchUnary               zstd-5          524288  thrpt    3   158.872 ±  74.167  ops/s  286K
TestBlockGrpcBench.benchUnary               zstd-5         2048000  thrpt    3    54.725 ±   2.770  ops/s  383K

Below are older results with the regular Greeter bench.

Comments in the issue show a wider variety of results, but below is the final run:

Benchmark                          (encodings)  (streamCount)  (weight)   Mode  Cnt     Score      Error  Units
PbjGrpcBench.benchUnary               identity            N/A     LIGHT  thrpt    3  7685.976 ± 1340.012  ops/s
PbjGrpcBench.benchUnary               identity            N/A    NORMAL  thrpt    3  6737.614 ±  576.778  ops/s
PbjGrpcBench.benchUnary               identity            N/A     HEAVY  thrpt    3  2730.245 ±  284.283  ops/s
PbjGrpcBench.benchUnary               identity            N/A     SUPER  thrpt    3    15.300 ±    6.864  ops/s
PbjGrpcBench.benchUnary                   gzip            N/A     LIGHT  thrpt    3  5635.441 ±  533.935  ops/s
PbjGrpcBench.benchUnary                   gzip            N/A    NORMAL  thrpt    3  5035.319 ±  748.946  ops/s
PbjGrpcBench.benchUnary                   gzip            N/A     HEAVY  thrpt    3  1995.138 ±  350.897  ops/s
PbjGrpcBench.benchUnary                   gzip            N/A     SUPER  thrpt    3     9.364 ±    4.476  ops/s
PbjGrpcBench.benchUnary                   zstd            N/A     LIGHT  thrpt    3  4267.069 ± 3984.615  ops/s
PbjGrpcBench.benchUnary                   zstd            N/A    NORMAL  thrpt    3  3922.068 ±  877.392  ops/s
PbjGrpcBench.benchUnary                   zstd            N/A     HEAVY  thrpt    3  2782.024 ±  332.568  ops/s
PbjGrpcBench.benchUnary                   zstd            N/A     SUPER  thrpt    3    23.687 ±   17.162  ops/s

Checklist

  • Documented (Code comments, README, etc.)
  • Tested (unit, integration, etc.)

Signed-off-by: Anthony Petrov <anthony@swirldslabs.com>
@github-actions

github-actions bot commented Mar 10, 2026

JUnit Test Report

   78 files  ±0     78 suites  ±0   3m 44s ⏱️ +5s
1 352 tests ±0  1 348 ✅ ±0   4 💤 ±0  0 ❌ ±0 
7 234 runs  ±0  7 214 ✅ ±0  20 💤 ±0  0 ❌ ±0 

Results for commit f3b1981. ± Comparison against base commit 348053f.

This pull request removes 6 and adds 6 tests. Note that renamed tests count towards both.
com.hedera.pbj.runtime.ProtoWriterToolsTest ‑ [1] FLOAT, com.hedera.pbj.runtime.ProtoWriterToolsTest$$Lambda/0x00000007c048ceb0@3b28ab9b, [0.1, 0.5, 100.0], 12, com.hedera.pbj.runtime.ProtoWriterToolsTest$$Lambda/0x00000007c048d0c0@16c1345b
com.hedera.pbj.runtime.ProtoWriterToolsTest ‑ [1] STRING, com.hedera.pbj.runtime.ProtoWriterToolsTest$$Lambda/0x00000007c04972a0@6ce8bf64, [string 1, testing here, testing there], com.hedera.pbj.runtime.ProtoWriterToolsTest$$Lambda/0x00000007c04974b0@6413eeb7
com.hedera.pbj.runtime.ProtoWriterToolsTest ‑ [2] BYTES, com.hedera.pbj.runtime.ProtoWriterToolsTest$$Lambda/0x00000007c04976c0@4c678a1f, [010203, ff7f0f, 42da07370bff], com.hedera.pbj.runtime.ProtoWriterToolsTest$$Lambda/0x00000007c04978d0@217009bd
com.hedera.pbj.runtime.ProtoWriterToolsTest ‑ [2] DOUBLE, com.hedera.pbj.runtime.ProtoWriterToolsTest$$Lambda/0x00000007c048d2d0@1443539, [0.1, 0.5, 100.0, 1.7653472635472653E240], 32, com.hedera.pbj.runtime.ProtoWriterToolsTest$$Lambda/0x00000007c048d4e0@5b160208
com.hedera.pbj.runtime.ProtoWriterToolsTest ‑ [3] BOOL, com.hedera.pbj.runtime.ProtoWriterToolsTest$$Lambda/0x00000007c048d6f0@16a15261, [true, false, false, true, true, true], 6, com.hedera.pbj.runtime.ProtoWriterToolsTest$$Lambda/0x00000007c048d900@36ec4071
com.hedera.pbj.runtime.ProtoWriterToolsTest ‑ [4] ENUM, com.hedera.pbj.runtime.ProtoWriterToolsTest$$Lambda/0x00000007c048db10@20d92f1e, [0, 2, 1], 3, com.hedera.pbj.runtime.ProtoWriterToolsTest$$Lambda/0x00000007c048dd20@3cf7433e
com.hedera.pbj.runtime.ProtoWriterToolsTest ‑ [1] FLOAT, com.hedera.pbj.runtime.ProtoWriterToolsTest$$Lambda/0x00000007c048d370@62cb977a, [0.1, 0.5, 100.0], 12, com.hedera.pbj.runtime.ProtoWriterToolsTest$$Lambda/0x00000007c048d580@7db70494
com.hedera.pbj.runtime.ProtoWriterToolsTest ‑ [1] STRING, com.hedera.pbj.runtime.ProtoWriterToolsTest$$Lambda/0x00000007c04976c0@58189132, [string 1, testing here, testing there], com.hedera.pbj.runtime.ProtoWriterToolsTest$$Lambda/0x00000007c04978d0@2305aad0
com.hedera.pbj.runtime.ProtoWriterToolsTest ‑ [2] BYTES, com.hedera.pbj.runtime.ProtoWriterToolsTest$$Lambda/0x00000007c0497ae0@54cce500, [010203, ff7f0f, 42da07370bff], com.hedera.pbj.runtime.ProtoWriterToolsTest$$Lambda/0x00000007c0497cf0@755033c5
com.hedera.pbj.runtime.ProtoWriterToolsTest ‑ [2] DOUBLE, com.hedera.pbj.runtime.ProtoWriterToolsTest$$Lambda/0x00000007c048d790@36ec4071, [0.1, 0.5, 100.0, 1.7653472635472653E240], 32, com.hedera.pbj.runtime.ProtoWriterToolsTest$$Lambda/0x00000007c048d9a0@5d8112e6
com.hedera.pbj.runtime.ProtoWriterToolsTest ‑ [3] BOOL, com.hedera.pbj.runtime.ProtoWriterToolsTest$$Lambda/0x00000007c048dbb0@3cf7433e, [true, false, false, true, true, true], 6, com.hedera.pbj.runtime.ProtoWriterToolsTest$$Lambda/0x00000007c048ddc0@68cc6319
com.hedera.pbj.runtime.ProtoWriterToolsTest ‑ [4] ENUM, com.hedera.pbj.runtime.ProtoWriterToolsTest$$Lambda/0x00000007c048dfd0@544733a4, [0, 2, 1], 3, com.hedera.pbj.runtime.ProtoWriterToolsTest$$Lambda/0x00000007c048e1e0@522f74a1

♻️ This comment has been updated with latest results.

@github-actions

github-actions bot commented Mar 10, 2026

Integration Test Report

    418 files  + 3      418 suites  +3   17m 52s ⏱️ - 1m 14s
114 977 tests +88  114 977 ✅ +88  0 💤 ±0  0 ❌ ±0 
115 219 runs  +88  115 219 ✅ +88  0 💤 ±0  0 ❌ ±0 

Results for commit f3b1981. ± Comparison against base commit 348053f.

This pull request removes 3 and adds 91 tests. Note that renamed tests count towards both.
com.hedera.pbj.integration.test.ParserNeverWrapsTest ‑ [1] com.hedera.pbj.integration.test.ParserNeverWrapsTest$$Lambda/0x0000713344c36bb0@1f368b6a
com.hedera.pbj.integration.test.ParserNeverWrapsTest ‑ [2] com.hedera.pbj.integration.test.ParserNeverWrapsTest$$Lambda/0x0000713344c36de0@5baff897
com.hedera.pbj.integration.test.ParserNeverWrapsTest ‑ [3] com.hedera.pbj.integration.test.ParserNeverWrapsTest$$Lambda/0x0000713344c37010@3f7f9015
com.hedera.pbj.integration.test.ParserNeverWrapsTest ‑ [1] com.hedera.pbj.integration.test.ParserNeverWrapsTest$$Lambda/0x0000799d27c431f0@41b9662c
com.hedera.pbj.integration.test.ParserNeverWrapsTest ‑ [2] com.hedera.pbj.integration.test.ParserNeverWrapsTest$$Lambda/0x0000799d27c43420@579f4595
com.hedera.pbj.integration.test.ParserNeverWrapsTest ‑ [3] com.hedera.pbj.integration.test.ParserNeverWrapsTest$$Lambda/0x0000799d27c43650@4465b66d
pbj.integration.tests.pbj.integration.tests.tests.TestBlockItemTest ‑ [10] NoToStringWrapper{pbj.integration.tests.pbj.integration.tests.TestBlockItem}
pbj.integration.tests.pbj.integration.tests.tests.TestBlockItemTest ‑ [11] NoToStringWrapper{pbj.integration.tests.pbj.integration.tests.TestBlockItem}
pbj.integration.tests.pbj.integration.tests.tests.TestBlockItemTest ‑ [12] NoToStringWrapper{pbj.integration.tests.pbj.integration.tests.TestBlockItem}
pbj.integration.tests.pbj.integration.tests.tests.TestBlockItemTest ‑ [13] NoToStringWrapper{pbj.integration.tests.pbj.integration.tests.TestBlockItem}
pbj.integration.tests.pbj.integration.tests.tests.TestBlockItemTest ‑ [14] NoToStringWrapper{pbj.integration.tests.pbj.integration.tests.TestBlockItem}
pbj.integration.tests.pbj.integration.tests.tests.TestBlockItemTest ‑ [15] NoToStringWrapper{pbj.integration.tests.pbj.integration.tests.TestBlockItem}
pbj.integration.tests.pbj.integration.tests.tests.TestBlockItemTest ‑ [16] NoToStringWrapper{pbj.integration.tests.pbj.integration.tests.TestBlockItem}
…

♻️ This comment has been updated with latest results.

@jasperpotts
Member

Concerns / Issues

  1. @Setup(Level.Invocation) in TestBlockGrpcBench.BenchState — Starting and stopping the server + client on every JMH invocation is very expensive and will dominate the measurement for smaller payloads. This is a significant benchmark methodology issue. Level.Trial or Level.Iteration would be more appropriate. The existing PbjGrpcBench uses Level.Trial for the server and Level.Iteration for the client — TestBlockGrpcBench should follow the same pattern.

  2. Network latency simulator is a global static side effect — NetworkLatencySimulator.simulate() installs a Thread.sleep()-based inspector via a global static field on PbjGrpcCall. Both PbjGrpcBench and TestBlockGrpcBench call this in their static {} blocks. If both benchmarks run in the same JVM fork, the second one overwrites the first's inspector (one has printSizes=false, the other printSizes=true). Since JMH uses separate forks by default (@Fork(1)), this is probably fine in practice, but it's fragile.

  3. Thread.sleep for network simulation is coarse — Thread.sleep(millis, nanos) typically has millisecond granularity on most JVMs/OSes. For a 1Gbps network and a 100-byte payload, the calculated sleep is ~800ns, which will round to 0ms. The simulator effectively does nothing for small payloads and only kicks in for large ones (>~125KB). This means the "1Gbps simulation" is mostly a no-op for the 102K block size. The PR description acknowledges this implicitly ("fast network negates benefits"), but it's worth noting the simulation is imprecise.

  4. GrpcCompression maps changed from immutable to mutable HashMap — The COMPRESSOR_MAP and DECOMPRESSOR_MAP are now plain HashMap with no synchronization. This is fine for benchmarks (registered once at startup), but since this is in pbj-runtime (production code), concurrent reads during registration could cause issues. A ConcurrentHashMap would be safer.

  5. Error swallowing in benchmarks — Both benchUnary and benchBidiStreaming catch Exception, print the stack trace, and continue. This means if compression is broken or the server errors out, the benchmark silently produces incorrect results with fewer actual operations than INVOCATIONS. The @OperationsPerInvocation(INVOCATIONS) will then report inflated throughput.

  6. Socket closed handling in PbjGrpcCall — The new UncheckedIOException / SocketException catch block with string matching (se.getMessage().contains("Socket closed")) is fragile. This is production code being changed to support a benchmark edge case. String-matching on exception messages is locale/JVM-dependent.
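The thread-safety concern in point 4 can be sketched as follows; the class and method names are illustrative, not the actual GrpcCompression API in pbj-runtime:

```java
import java.io.OutputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative registry only, not the real pbj-runtime class: shows the
// ConcurrentHashMap pattern suggested in point 4 for a register-once,
// read-often encoding map.
final class EncodingRegistry {
    // ConcurrentHashMap gives lock-free reads and safe concurrent writes.
    private static final Map<String, Function<OutputStream, OutputStream>> COMPRESSORS =
            new ConcurrentHashMap<>();

    static void register(String encoding, Function<OutputStream, OutputStream> factory) {
        COMPRESSORS.put(encoding, factory);
    }

    static Function<OutputStream, OutputStream> lookup(String encoding) {
        return COMPRESSORS.get(encoding);
    }
}
```

Reads stay as cheap as a plain HashMap.get() in the common case, while registration during startup cannot corrupt the map under concurrent access.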

An alternative to Thread.sleep()

For nanosecond-precision delays, a busy-wait spin loop is the standard approach in benchmarking:

private void sleep(long bytes) {
    final long nanos = nanosPerByte * bytes;
    final long deadline = System.nanoTime() + nanos;
    while (System.nanoTime() < deadline) {
        Thread.onSpinWait(); // hint to the CPU (JDK 9+)
    }
}

Thread.onSpinWait() emits a PAUSE instruction on x86 (or equivalent on ARM), which reduces power consumption and avoids starving sibling hyperthreads while spinning.

Why this works for benchmarks:

  • System.nanoTime() has sub-microsecond resolution on modern OSes
  • Thread.sleep(0, 800) typically sleeps for ~1ms due to OS scheduling granularity, which is 1000x too long for an 800ns target
  • In a JMH benchmark, burning CPU on a spin-wait is acceptable — you're already dedicating cores to the benchmark

Why Thread.sleep is wrong here:

  • At 1Gbps, 100KB = ~800µs. Thread.sleep can handle that, but barely.
  • At 1Gbps, 1KB = ~8µs. Thread.sleep will overshoot by 100x+.
  • At 1Gbps, 100B = ~800ns. Thread.sleep rounds to 0 or ~1ms — either a no-op or 1000x too much.

The tradeoff is that busy-wait consumes a full CPU core, but that's expected and acceptable in a JMH context.
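The granularity gap described above can be observed directly with a small standalone demo (a sketch, not benchmark code; absolute timings are OS-dependent, only the relative difference matters):

```java
// Compare coarse Thread.sleep with a spin-wait for an ~800 ns target delay.
public final class DelayGranularity {
    // Spin until targetNanos have elapsed; returns the actual elapsed time.
    static long spinNanos(long targetNanos) {
        final long start = System.nanoTime();
        final long deadline = start + targetNanos;
        while (System.nanoTime() < deadline) {
            Thread.onSpinWait();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        final long t0 = System.nanoTime();
        Thread.sleep(0, 800); // requested 800 ns; typically rounds up to ~1 ms
        final long slept = System.nanoTime() - t0;
        final long spun = spinNanos(800); // typically within a few microseconds
        System.out.println("sleep(0, 800): " + slept + " ns; spin(800): " + spun + " ns");
    }
}
```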

@anthony-swirldslabs
Contributor Author

anthony-swirldslabs commented Mar 17, 2026

@jasperpotts :

  1. @Setup(Level.Invocation): this does not "dominate the measurement" for any payload size, because the benchmark state is initialized outside of the measured method, so there is no "benchmark methodology issue" here; the benchmark uses an idiomatic approach. I agree that using Level.Trial may be a bit more efficient in terms of the overall time it takes to run the benchmark. However, there's a flow-control-related issue described at Benchmark zstd-jni #746 (comment) / GRPC streaming server may die #758, and because of it the benchmark has to restart the server for every invocation in the bidi streaming case specifically (it may or may not affect the unary case as well, but I haven't observed it there). That issue is unclear and outside the scope of this benchmark, so I'm keeping this part as is.
  2. The NetworkLatencySimulator is a utility class exposing a static method to activate it. This is because the network inspector in PbjGrpcCall is static, which is by design: applications never have direct access to the Call object. I suppose we could move it into PbjGrpcClient as a mutable instance member, but that would have a slight negative performance impact because PbjGrpcCall would have to fetch that member from another object after sending/receiving every datagram. Also, this class is not designed to be used by several benchmarks running in parallel. Our current JMH setup does not run our benchmarks in parallel in the same JVM, and we have no plans to change that because it could spoil the measurement results. So this isn't an issue. However, I'll add a note to its javadoc to mention this.
  3. Thread.sleep(): sounds good, busy-waiting with onSpinWait() works for me. Updated. However, "The PR description acknowledges this implicitly ('fast network negates benefits'), but it's worth noting the simulation is imprecise" doesn't quite follow: the statement in the PR description is still true and doesn't depend on the precision of the sleep implementation.
  4. Compressor maps: good point. However, concurrent reads shouldn't cause any failures, and we don't want to synchronize reads for performance reasons. Writes to the maps do need to be synchronized, though. Adding synchronization.
  5. Errors: the errors are swallowed by design because these benchmarks rely on a real networking stack (albeit only over the loopback interface). Errors may occur sporadically, and we don't want to fail a long benchmark run because of a single broken connection. We have numerous integration tests verifying that the GRPC client and server work and don't throw exceptions randomly, so when running the benchmark we can be reasonably sure nothing is broken. And we don't want to fail a run or omit an iteration because of a random failure in the OS networking stack. So this part isn't changing.
  6. Socket closed: this is a weird one. It's not really a "benchmark edge case"; it's a real problem. However, I'm unsure how much it hurts real applications, and I agree that the current solution isn't very elegant (although I hardly see any alternative). To address the comment, I moved this logic over to the benchmark itself for now.

@anthony-swirldslabs
Contributor Author

@jasperpotts : an update regarding Level.Invocation vs. Level.Trial: I think I found the cause of that at #758, but it will be a separate fix. Once that's merged, I'll update the benchmarks to use Level.Trial, again in a separate future PR. For this PR, we go with the same approach currently used in the existing PbjGrpcBench and use Level.Invocation. As mentioned above, the state setup happens outside of the measurement and therefore doesn't affect the measurement itself. So it's not a critical issue by any means, as it doesn't affect the benchmark other than making it run a tiny bit longer.

anthony-swirldslabs merged commit 8f8b634 into main on Mar 24, 2026
15 checks passed
anthony-swirldslabs deleted the 746-zstdBench branch on March 24, 2026 at 00:29


Development

Successfully merging this pull request may close these issues.

Benchmark zstd-jni
