perf: cache repeated long string rendering #857

Draft
He-Pin wants to merge 12 commits into databricks:master from He-Pin:perf/long-string-render-cache

Conversation

He-Pin (Contributor) commented on May 14, 2026

Motivation:

This is stacked after #856 (perf/simple-format-ascii-safe). The benchmark comparison below is
against origin/perf/simple-format-ascii-safe@f2a2e3ba, where this PR adds one commit:
7656943f perf: cache repeated long string rendering.

kube-prometheus renders many repeated medium/long description strings. Without caching, each repeated value pays the full quoted-string rendering cost again: UTF-8 encoding, JSON escape scanning, and byte copying. This showed up as a remaining Native renderer gap after larger parser/import and inline-object sort work had already been addressed.

Key Design Decision:

Cache only fully rendered quoted bytes for repeated strings in the 128..4096 character range, and cap the cache by both entry count and rendered byte size. This keeps the optimization targeted at repeated Kubernetes-style descriptions while leaving huge unique strings, such as large_string_template, on the existing renderer path.

Modification:

  • Add a per-renderer HashMap[String, Array[Byte]] in BaseByteRenderer.
  • Cache rendered quoted bytes only when the source string has 128..4096 characters.
  • Cap cached rendered entries at 16 KiB each and at 2048 entries per renderer.
  • Copy cached bytes directly into the renderer byte buffer on repeated hits.
  • Add a renderer regression test covering repeated long escaped strings.
  • Update the performance ledgers with fresh JMH and Native hyperfine evidence.
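
The caching policy described above can be sketched roughly as follows. All names here (RenderedStringCache, getOrRender) are illustrative, not the actual BaseByteRenderer internals; the real renderer writes into its byte buffer rather than returning arrays:

```scala
import scala.collection.mutable

// Illustrative sketch of a bounded rendered-string cache. Strings outside
// the 128..4096 character window bypass the cache entirely; rendered
// entries above the per-entry byte cap, or arriving after the entry-count
// cap is reached, are returned without being cached.
final class RenderedStringCache(
    minChars: Int = 128,
    maxChars: Int = 4096,
    maxEntryBytes: Int = 16 * 1024,
    maxEntries: Int = 2048) {

  private val cache = mutable.HashMap.empty[String, Array[Byte]]

  /** Returns cached quoted bytes, rendering (and caching, when within
    * bounds) on first sight. */
  def getOrRender(s: String, render: String => Array[Byte]): Array[Byte] = {
    if (s.length < minChars || s.length > maxChars) return render(s)
    cache.get(s) match {
      case Some(bytes) => bytes
      case None =>
        val bytes = render(s)
        if (bytes.length <= maxEntryBytes && cache.size < maxEntries)
          cache.put(s, bytes)
        bytes
    }
  }
}
```

On a repeated hit the cached bytes can be copied straight into the output buffer, skipping UTF-8 encoding and escape scanning for that value.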

Benchmark Results:

Environment:

  • Branch base: origin/perf/simple-format-ascii-safe@f2a2e3ba
  • Candidate: perf/long-string-render-cache@7656943f
  • JVM: JDK 21.0.10, JMH 1.37
  • Native binaries:
    • base: /tmp/sjsonnet-native-long-cache-pr-base
    • candidate: /tmp/sjsonnet-native-long-cache-pr-candidate
  • Benchmark rule: single benchmark process; no concurrent Mill/JMH/hyperfine.

Output equality:

workload                                                  check
kube-prometheus realworld                                 base output equals candidate output
bench/resources/cpp_suite/large_string_template.jsonnet   base output equals candidate output

JMH guard:

Command:

./mill --no-server --ticker false --color false bench.runRegressions bench/resources/cpp_suite/large_string_template.jsonnet
variant     score
base        0.666 ms/op
candidate   0.666 ms/op

Native hyperfine: kube-prometheus, forward order, 12 runs:

variant     mean        stddev    median      min         max
base        132.889 ms  2.539 ms  132.431 ms  128.896 ms  137.725 ms
candidate   132.280 ms  2.789 ms  131.247 ms  129.262 ms  138.730 ms

Native hyperfine: kube-prometheus, reverse order, 12 runs:

variant     mean        stddev    median      min         max
candidate   130.720 ms  1.562 ms  130.347 ms  128.133 ms  133.569 ms
base        132.262 ms  1.457 ms  132.585 ms  130.240 ms  134.842 ms

Native hyperfine: large_string_template, forward order, 30 runs:

variant     mean       stddev    median     min        max
base        11.074 ms  1.293 ms  11.126 ms  8.941 ms   13.997 ms
candidate   10.822 ms  0.931 ms  10.738 ms  9.237 ms   13.076 ms

Native hyperfine: large_string_template, reverse order, 30 runs:

variant     mean       stddev    median     min        max
candidate   11.266 ms  1.065 ms  10.977 ms  9.805 ms   14.192 ms
base        11.781 ms  1.040 ms  11.805 ms  10.201 ms  14.117 ms

Validation:

./mill --no-server --ticker false --color false __.checkFormat
./mill --no-server --ticker false --color false -j 1 __.test

Result:

Tests: 445, Passed: 445, Failed: 0

Analysis:

The kube-prometheus target improves in both command orders, which is the expected workload for this cache because it contains repeated rendered description strings. large_string_template remains neutral-to-positive in both command orders because its dominant string is huge and unique: it is intentionally above the cache cap and stays on the existing fast path. The JMH guard is exactly neutral at 0.666 ms/op, so the change does not trade a Native improvement for a JVM regression.

References:

  • bench/reports/sjsonnet-vs-jrsonnet-gaps.md
  • bench/reports/sync-points.md

Result:

Repeated medium/long strings render faster on kube-prometheus with bounded memory overhead, unchanged output, neutral JVM guard behavior, and full test-suite coverage.

He-Pin and others added 12 commits May 13, 2026 15:43

Motivation:
PR databricks#840 introduced a strict JSON fast path for .json imports but still
forces a full UTF-8 string decode for every cached file before handing
the text to ujson.StringParser. Real-world workloads (e.g. kube-prometheus)
import many .json files; decoding each one twice (once into String for
parsing, again as cache content) is pure overhead.

Key Design Decision:
ujson 4.4.3 ships ByteArrayParser, which parses UTF-8 JSON directly from
a byte array without an intermediate String. Cache small resolved files
as raw bytes (already what we read from disk) and lazily decode text
only when the importstr/parser-input path actually needs it. Preserve
parse-cache content identity by hashing the cached bytes with SHA-256
(length + hex digest) so external ParseCache implementations keep the
same collision resistance as the old full-string key.

Modification:
* Importer.scala: CachedResolver.parseJsonImport now calls
  ujson.ByteArrayParser.transform(content.readRawBytes(), visitor)
  instead of decoding the whole file to String first.
* CachedResolvedFile.scala (JVM/Native): small files are cached as
  Array[Byte]; getParserInput / readString materialize the String
  lazily; readRawBytes returns the cached bytes directly; contentHash
  is length + SHA-256 over the cached bytes; binary imports still use
  StaticBinaryResolvedFile.
* PreloaderTests.scala: tighten the strict-JSON fast-path coverage so
  it fails if the fast path ever falls back to readString().
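
The length + SHA-256 content key described above can be sketched as follows (the helper name contentKey is hypothetical, not the actual sjsonnet API):

```scala
import java.security.MessageDigest

// Illustrative sketch of a parse-cache content key: byte length plus the
// SHA-256 hex digest of the cached bytes, giving external ParseCache
// implementations the same collision resistance as a full-string key
// without decoding the bytes to a String.
def contentKey(bytes: Array[Byte]): String = {
  val digest = MessageDigest.getInstance("SHA-256").digest(bytes)
  val hex = digest.map(b => f"$b%02x").mkString
  s"${bytes.length}-$hex"
}
```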

Result:
* Output equality vs upstream sjsonnet and jrsonnet preserved on
  kube-prometheus and large_string_template.
* Native kube-prometheus hyperfine A/B (forward & reverse):
  clean 139.4 +/- 2.8 ms -> candidate 132.7 +/- 1.9 ms (forward)
  candidate 132.1 +/- 1.9 ms vs clean 140.3 +/- 2.6 ms (reverse)
* Full ./mill __.test green.

References:
Follow-up to databricks#840

Motivation:
Large inline objects produced by strict JSON imports can exceed the small-object shape that computeSortedInlineOrder was originally tuned for. Native sampling on kube-prometheus showed sorted inline-order computation as a materialization hotspot, and insertion sort becomes quadratic on those wider objects.

Modification:
Keep insertion sort for small inline objects, and use an in-place quicksort with insertion-sort cleanup for larger visible field sets. Record the accepted benchmark result and rejected parser/key-render micro-routes in the performance ledgers.
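
The small/large split can be sketched as below. The threshold value and function name are placeholders; the real computeSortedInlineOrder tuning and comparison logic may differ:

```scala
// Illustrative hybrid sort over visible field names: insertion sort for
// small slices (cheap, cache-friendly), quicksort for larger ones so wide
// objects avoid insertion sort's quadratic worst case.
def hybridSort(a: Array[String]): Unit = {
  val SmallThreshold = 16 // placeholder cutoff

  def insertionSort(lo: Int, hi: Int): Unit = {
    var i = lo + 1
    while (i <= hi) {
      val v = a(i); var j = i - 1
      while (j >= lo && a(j).compareTo(v) > 0) { a(j + 1) = a(j); j -= 1 }
      a(j + 1) = v; i += 1
    }
  }

  def quicksort(lo: Int, hi: Int): Unit = {
    if (hi - lo < SmallThreshold) { insertionSort(lo, hi); return }
    val pivot = a((lo + hi) >>> 1)
    var i = lo; var j = hi
    while (i <= j) {
      while (a(i).compareTo(pivot) < 0) i += 1
      while (a(j).compareTo(pivot) > 0) j -= 1
      if (i <= j) { val t = a(i); a(i) = a(j); a(j) = t; i += 1; j -= 1 }
    }
    if (lo < j) quicksort(lo, j)
    if (i < hi) quicksort(i, hi)
  }

  if (a.length > 1) quicksort(0, a.length - 1)
}
```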

Result:
Kube-prometheus Native A/B improved on top of strict JSON byte imports, with forward mean 145.3 ms -> 140.0 ms and reverse mean 151.6 ms -> 148.9 ms. Formatting and the full test suite pass.

References:
Upstream-base: databricks/sjsonnet@cedc083
Prior optimization: 883fca5 perf: parse strict JSON imports from bytes

Motivation:
Keep the performance exploration ledger current so future optimization work does not repeat Native-negative or build-invalid routes.

Modification:
Record rejected short-string, ASCII-safe, inline sort-cache, path-only parse-cache, and Native GC configuration probes with the validation evidence that ruled them out.

Result:
No runtime code changes are retained; the branch documents the failed hypotheses and preserves the current accepted optimization stack.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Motivation:
large_string_template still spent time re-encoding and re-scanning the huge string produced by simple named format interpolation, even when the final result was known to be JSON-string ASCII-safe.

Modification:
Track whether compiled format literals are ASCII-safe and return Val.Str.asciiSafe from PartialApplyFmt when every simple named dynamic value is also safe. Add regression coverage for safe numeric values, unsafe string values, unsafe static literals, and mixed-key safety.
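
A JSON-string ASCII-safety check along these lines might look as follows. The predicate name is illustrative, and the real Val.Str.asciiSafe flag handling in sjsonnet may differ:

```scala
// Illustrative predicate: a string is "JSON-string ASCII-safe" when every
// character is printable ASCII needing no JSON escaping, so the renderer
// can skip both escape scanning and multi-byte UTF-8 encoding work.
def isJsonAsciiSafe(s: String): Boolean = {
  var i = 0
  while (i < s.length) {
    val c = s.charAt(i)
    if (c < 0x20 || c > 0x7e || c == '"' || c == '\\') return false
    i += 1
  }
  true
}
```

Under this scheme, a format result is marked safe only when the static literal parts and every simple named dynamic value are themselves safe.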

Result:
Native large_string_template improved in both command orders (8.64 -> 8.01 ms forward, 8.65 -> 8.17 ms reverse); JVM JMH stayed neutral-positive (0.683 -> 0.677 ms/op); full __.test and checkFormat pass.

References:
bench/reports/sjsonnet-vs-jrsonnet-gaps.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Motivation:
A follow-up probe reused parser ASCII-safe metadata to skip static format literal safety scanning after PR databricks#856, but internal debug counter gains did not translate into Native whole-process speed.

Modification:
Record the rejected parser `_asciiSafe` format hint experiment in the gap ledger and sync-points file so it is not repeated without materially new evidence.

Result:
The accepted simple-format ASCII-safe optimization remains unchanged, and the rejected hint path is documented with forward/reverse Native benchmark evidence.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Motivation:
A Native-only ASCII-safe byte-copy probe tested whether a manual low-byte loop could improve large formatted string rendering after PR databricks#856.

Modification:
Record the rejected manual copy-loop experiment in the gap ledger and sync-points file with forward/reverse Native benchmark evidence.

Result:
The optimized implementation remains on the faster platform `String.getBytes(0, len, dst, dstPos)` path, and the slower manual copy route is documented to avoid repeated work.
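
For strings already known ASCII-safe, the retained platform path relies on the JDK's deprecated low-byte copy String.getBytes(srcBegin, srcEnd, dst, dstBegin); a minimal illustration (the variable names are ours, and the deprecation is harmless here because high bytes are only dropped for non-ASCII input):

```scala
// For an ASCII-safe string, the deprecated low-byte copy produces the same
// bytes as UTF-8 encoding, while copying directly into an existing buffer
// instead of allocating a fresh array.
val s = "ascii-only payload"
val dst = new Array[Byte](s.length)
s.getBytes(0, s.length, dst, 0)
assert(java.util.Arrays.equals(dst, s.getBytes("UTF-8")))
```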

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Motivation:
A simple-format loop probe tested whether single-character dynamic values should use `StringBuilder.append(Char)` instead of appending the one-character String.

Modification:
Record the rejected single-character append experiment in the gap ledger and sync-points file with forward/reverse Native benchmark evidence.

Result:
The accepted simple-format implementation remains on the faster `append(String)` path, and the slower single-character branch is documented to avoid repeated work.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Motivation:

Avoid repeating a ByteRenderer object-rendering route that looked promising on kube but failed guard benchmarks.

Modification:

Record the minified direct-object comma/empty-state specialization in the gap ledger and sync-points rejection table.

Result:

The implementation remains reverted; future work should not trade a weak kube gain for a large_string_template regression.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Motivation:

Document a Native-only long-string renderer probe so future optimization work does not repeat the same regression.

Modification:

Add the failed direct charAt escaped-ASCII renderer result to the gap ledger and sync-points table.

Result:

The code remains reverted because large_string_template regressed in both Native command orders.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Motivation:

Keep the performance ledger complete after testing an allocation-saving cycle-tracking variant.

Modification:

Record the inline small-stack cycle tracking probe and its Native benchmark result in both ledgers.

Result:

The code remains reverted because kube was noise-level and large_string_template regressed in both command orders.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Motivation:

Preserve the outcome of quoted-key cache experiments so future kube work does not repeat unstable variants.

Modification:

Record HashMap, direct-mapped, and capped ByteRenderer key-cache probes in the performance ledgers.

Result:

The implementation remains reverted because kube reverse A/B was not stable-positive and some variants regressed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Motivation:
Kube-prometheus output contains many repeated medium/long description strings. Re-rendering each repeated value repeats UTF-8 encoding and JSON escape scanning, while large_string_template's dominant string is huge and unique.

Modification:
Cache fully rendered quoted bytes for repeated strings between 128 and 4096 characters, with bounded entry count and per-entry byte size. Add a regression test covering repeated escaped strings and record Native/JVM benchmark evidence in the performance ledgers.

Result:
Native kube-prometheus improves in both command orders: 132.89 -> 132.28 ms forward and 132.26 -> 130.72 ms reverse. large_string_template remains neutral-to-positive because its huge unique string is above the cache cap; focused JVM JMH guard is neutral at 0.666 -> 0.666 ms/op.

References:
bench/reports/sjsonnet-vs-jrsonnet-gaps.md