service/builder: use grpc.NumStreamWorkers for fixed goroutine pool #725
Open
Jura-Z wants to merge 2 commits into
Each Is() call rebuilds a rego.PreparedEvalQuery from the runtime's
compiler and store, including ast.ParseBody / ast.ParseRef for the
query body. Under concurrent load this fights the OPA compiler's
internal locks and burns CPU on duplicate parsing.
Memoize the prepared query keyed on (policy_context.path,
policy_context.decisions). The compiler/store/policy bundle are stable
between bundle reloads, so the prepared query is reusable. Invalidation
is wired up via plugins.Manager.RegisterCompilerTrigger, which fires
on bundle activation / discovery update / any compiler rotation.
Concurrency-safe: sync.Map for read-mostly access, golang.org/x/sync
singleflight to collapse concurrent misses on the same key into one
PrepareForEval call.
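
The memoize-and-dedupe pattern described above can be sketched roughly as follows. This is a stdlib-only illustration, not the code in this PR: `prepared` stands in for rego.PreparedEvalQuery, `cacheKey`, `queryCache`, `getOrPrepare` and `invalidate` are hypothetical names, and a mutex-guarded in-flight map is swapped in for golang.org/x/sync singleflight so the example has no external dependencies.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
	"sync/atomic"
)

// prepared is a hypothetical placeholder for rego.PreparedEvalQuery.
type prepared struct{ query string }

// cacheKey is an assumed key derivation: the path followed by the
// decisions in their given order, so reordering decisions changes the key.
func cacheKey(path string, decisions []string) string {
	return path + "\x00" + strings.Join(decisions, "\x00")
}

// call tracks one in-flight factory invocation that other callers can wait on.
type call struct {
	done chan struct{}
	val  *prepared
	err  error
}

type queryCache struct {
	entries  sync.Map // key -> *prepared; lock-free on the hit path
	mu       sync.Mutex
	inflight map[string]*call
}

func newQueryCache() *queryCache {
	return &queryCache{inflight: make(map[string]*call)}
}

// getOrPrepare returns the cached value for key, or runs factory once and
// shares the result with every caller that missed concurrently.
func (c *queryCache) getOrPrepare(key string, factory func() (*prepared, error)) (*prepared, error) {
	if v, ok := c.entries.Load(key); ok {
		return v.(*prepared), nil
	}
	c.mu.Lock()
	// Re-check under the lock: another goroutine may have just stored it.
	if v, ok := c.entries.Load(key); ok {
		c.mu.Unlock()
		return v.(*prepared), nil
	}
	if cl, ok := c.inflight[key]; ok {
		c.mu.Unlock()
		<-cl.done // wait for the goroutine that is already preparing
		return cl.val, cl.err
	}
	cl := &call{done: make(chan struct{})}
	c.inflight[key] = cl
	c.mu.Unlock()

	cl.val, cl.err = factory()

	c.mu.Lock()
	if cl.err == nil {
		c.entries.Store(key, cl.val) // errors are deliberately not cached
	}
	delete(c.inflight, key)
	c.mu.Unlock()
	close(cl.done)
	return cl.val, cl.err
}

// invalidate drops every entry; a compiler trigger would call something like this.
func (c *queryCache) invalidate() {
	c.entries.Range(func(k, _ any) bool {
		c.entries.Delete(k)
		return true
	})
}

func main() {
	c := newQueryCache()
	key := cacheKey("iap.egress.http", []string{"allowed"})
	var calls int32
	var wg sync.WaitGroup
	for i := 0; i < 200; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.getOrPrepare(key, func() (*prepared, error) {
				atomic.AddInt32(&calls, 1)
				return &prepared{query: key}, nil
			})
		}()
	}
	wg.Wait()
	fmt.Println("factory calls:", atomic.LoadInt32(&calls))
}
```

Callers that arrive while a prepare is in flight block on the same `call` instead of starting their own, and a failed prepare leaves nothing behind, so the next request retries.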
Measured against a stub iap.egress.http policy, native Go gRPC client
(port 8282), Apple M-series Mac. Median of 3 back-to-back 3-second runs
after a 100-call warmup:
conc | before rps | after rps | gain
-----+------------+-----------+------
1 | 7,803 | 11,517 | +48%
4 | 18,474 | 28,831 | +56%
8 | 23,880 | 37,572 | +57%
16 | 29,929 | 46,980 | +57%
p50 latency drops correspondingly (e.g. conc=1: 113 µs -> 77 µs;
conc=16: 522 µs -> 301 µs). Decisions are byte-identical to the
unmodified path.
Tests:
- TestCacheKey: key derivation is order-sensitive on the decisions
list and stable across runs.
- TestGetOrPrepare_CachesAndDedupes: factory runs at most once across
200 concurrent goroutines for the same key; subsequent calls hit
the cache without invoking the factory.
- TestGetOrPrepare_FactoryError: factory errors are propagated and
not cached, so transient failures don't poison the cache.
The default grpc-go server spawns a fresh goroutine for every incoming
stream. Under sustained ~50k rps load this is the dominant CPU cost on
top of the actual policy work, as measured via pprof (Go scheduler 34%,
runtime.morestack 41%, GC mark 16%; OPA Eval itself only 9%).
`grpc.NumStreamWorkers(N)` switches the server to a fixed pool of N
worker goroutines that process incoming streams. Stacks are reused (no
morestack), creation cost is amortized, and GC has less per-request
garbage to chase.
Setting N to GOMAXPROCS-equivalent (`runtime.NumCPU`) follows the
recommendation in grpc-go's own benchmarks (server.go:609 comment).
Measured stacking on top of aserto-dev#724 (the prepared-query cache
PR), gRPC client, M-series Mac, stub iap.egress.http policy, median of
2 runs:
conc | cache only | + StreamWorkers | gain
-----+------------+-----------------+------
1 | 11,517 | 12,900 | +12%
4 | 28,831 | 31,854 | +10%
8 | 37,572 | 40,985 | +9%
16 | 46,980 | 50,261 | +7%
The win is modest because the cache PR already removed the largest
lever; what remains is real Go-runtime overhead.
NumStreamWorkers is marked Experimental in grpc-go but has been stable
since 2022 and is widely used in production gRPC servers under high RPS.
Summary
The default grpc-go server spawns a fresh goroutine for every incoming
stream. Under sustained ~50k rps load this is the dominant CPU cost on
top of the actual policy work, as measured via pprof on a local stub
iap.egress.http policy (Go scheduler 34%, runtime.morestack 41%,
GC mark 16%; OPA Eval itself only 9%).
grpc.NumStreamWorkers(N) switches the server to a fixed pool of N
worker goroutines that process incoming streams. Stacks are reused (no
morestack), creation cost is amortized, and GC has less per-request
garbage to chase.
Setting N to runtime.NumCPU() follows the recommendation in grpc-go's
own benchmarks (see the comment in server.go:609).
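
To make the pooling idea concrete without pulling in gRPC, here is a minimal stdlib-only sketch of a fixed worker pool sized to runtime.NumCPU(). It is not grpc-go's implementation: `workerPool`, `newWorkerPool`, `submit` and `close` are hypothetical names, and this version simply blocks the submitter while every worker is busy, which may differ from what grpc-go does internally.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// workerPool is a hypothetical fixed pool: n long-lived goroutines drain
// a shared channel, so their stacks are reused across tasks instead of
// being allocated and grown once per request.
type workerPool struct {
	tasks chan func()
	wg    sync.WaitGroup
}

func newWorkerPool(n int) *workerPool {
	p := &workerPool{tasks: make(chan func())}
	p.wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer p.wg.Done()
			for task := range p.tasks {
				task()
			}
		}()
	}
	return p
}

// submit hands a task to an idle worker, blocking until one is free.
func (p *workerPool) submit(task func()) { p.tasks <- task }

// close stops accepting tasks and waits for the workers to finish.
func (p *workerPool) close() {
	close(p.tasks)
	p.wg.Wait()
}

func main() {
	pool := newWorkerPool(runtime.NumCPU())
	var mu sync.Mutex
	handled := 0
	for i := 0; i < 1000; i++ {
		pool.submit(func() {
			mu.Lock()
			handled++
			mu.Unlock()
		})
	}
	pool.close()
	fmt.Println("handled:", handled)
}
```

In the server itself none of this is hand-written; the pool is enabled by the grpc.NumStreamWorkers server option described above.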
Bench
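Stacking on top of #724 (the prepared-query cache PR), gRPC client,
M-series Mac, stub iap.egress.http policy, median of 2 runs:
conc | cache only | + StreamWorkers | gain
-----+------------+-----------------+------
1 | 11,517 | 12,900 | +12%
4 | 28,831 | 31,854 | +10%
8 | 37,572 | 40,985 | +9%
16 | 46,980 | 50,261 | +7%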
Why merge order matters
This stacks on #724. The branch izakipnyi/grpc-stream-workers is
based on izakipnyi/cache-prepared-eval-query, so the diff against
main includes both commits. If #724 lands first, this becomes a
trivial 14-line diff.
Compatibility
- grpc.NumStreamWorkers is part of the existing
  google.golang.org/grpc dep (just a previously-unused option).
- Marked Experimental in grpc-go but stable since 2022 and used by
  production gRPC servers under high RPS.
Why this is the architectural ceiling
Even with this fix, single-process Topaz still scales sub-linearly
(4.4× going from 1-way to 16-way concurrency vs an ideal 12×). The
remaining cost is fundamentally Go-runtime overhead: even with stack
reuse, each request still pays for protobuf alloc + GC +
middleware-chain dispatch. To go further you'd need to pool the
IsRequest/IsResponse proto objects with sync.Pool, or scale topazd
horizontally (multiple processes behind a load balancer; verified
empirically as 2 processes ≈ 1.5× the total throughput of 1).
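
As a rough sketch of the sync.Pool option, the snippet below pools a request object and clears it before reuse. `isRequest` is a hypothetical plain struct standing in for the generated IsRequest message, and `acquireRequest`, `releaseRequest` and `handle` are illustrative names; real proto messages would need their own reset step, and sync.Pool gives no guarantee that a given object is actually reused.

```go
package main

import (
	"fmt"
	"sync"
)

// isRequest is a hypothetical stand-in for the generated IsRequest message.
type isRequest struct {
	path      string
	decisions []string
}

// reset clears the fields but keeps the slice's backing array, so a
// reused object does not allocate again for a similarly sized request.
func (r *isRequest) reset() {
	r.path = ""
	r.decisions = r.decisions[:0]
}

var requestPool = sync.Pool{
	New: func() any { return new(isRequest) },
}

// acquireRequest returns a cleared request, either reused or freshly allocated.
func acquireRequest() *isRequest { return requestPool.Get().(*isRequest) }

// releaseRequest clears the request and makes it available for reuse.
// The caller must not touch r after this call.
func releaseRequest(r *isRequest) {
	r.reset()
	requestPool.Put(r)
}

// handle shows the acquire/use/release cycle for a single request.
func handle(path string, decisions ...string) int {
	req := acquireRequest()
	defer releaseRequest(req)
	req.path = path
	req.decisions = append(req.decisions, decisions...)
	return len(req.decisions)
}

func main() {
	fmt.Println(handle("iap.egress.http", "allowed"))
}
```

Whether this pays off depends on how much of the per-request garbage really comes from these two messages, so it is worth confirming with an allocation profile first.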