Presize array copy consumers #823

Draft
He-Pin wants to merge 3 commits into databricks:master from He-Pin:perf/presized-array-copy-consumers

He-Pin commented May 5, 2026

Motivation

#822 gives consumers a cheap Eval copy API, but std.flattenArrays and array-separator std.join can still pay ArrayBuilder growth/copy costs when the outer array has a modest number of large child arrays.

This PR adds a guarded two-pass pre-size path for those consumers. The goal is to remove avoidable intermediate allocation in few-large-array workloads without regressing many-small-array workloads.

Constraints:

  • do not force element values
  • avoid many-small-array regressions
  • guard total length before allocation
  • keep hot paths as straight indexed loops
  • keep this PR narrowly stacked on #822 (Add array eval copy API)

Modification

Stacked on #822.

Use Arr.copyEvalTo to presize high-volume array-copy consumers:

  • std.flattenArrays
  • array-separator std.join

The pre-sized path uses two linear scans only when the outer part count is modest (<= 1024). Large outer arrays fall back to the one-pass ArrayBuilder + copyEvalTo path from #822.
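A minimal sketch of the guarded two-pass idea, written against plain `Array[AnyRef]` rather than sjsonnet's `Arr`/`Eval` types. The object name, `MaxParts`, and the use of `System.arraycopy` in place of `Arr.copyEvalTo` are assumptions for illustration; this is not the code in the PR.

```scala
object PresizedFlattenSketch {
  // Hypothetical stand-in for the PR's part-count threshold (<= 1024).
  val MaxParts = 1024

  /** Flattens `parts` by copying references only; elements are never inspected. */
  def flatten(parts: Array[Array[AnyRef]]): Array[AnyRef] = {
    if (parts.length <= MaxParts) {
      // Pass 1: count the output length as Long so the sum cannot wrap.
      var total = 0L
      var i = 0
      while (i < parts.length) {
        total += parts(i).length
        i += 1
      }
      if (total > Int.MaxValue)
        throw new IllegalArgumentException("flattened array too large")
      // Pass 2: copy every part into one exactly sized array.
      val out = new Array[AnyRef](total.toInt)
      var pos = 0
      i = 0
      while (i < parts.length) {
        val p = parts(i)
        System.arraycopy(p, 0, out, pos, p.length)
        pos += p.length
        i += 1
      }
      out
    } else {
      // Fallback for many parts: a single pass through a growable builder.
      val b = Array.newBuilder[AnyRef]
      var i = 0
      while (i < parts.length) {
        b ++= parts(i)
        i += 1
      }
      b.result()
    }
  }
}
```

With few large parts, the first branch allocates the result once; with more than `MaxParts` parts, the extra counting scan is skipped and the builder absorbs the growth cost instead.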

Result

Verification passed:

  • ./mill --no-server 'sjsonnet.jvm[3.3.7].compile'
  • ./mill --no-server 'sjsonnet.jvm[2.13.18].compile'
  • ./mill --no-server 'sjsonnet.jvm[2.12.21].compile'
  • ./mill --no-server 'sjsonnet.jvm[3.3.7].test.testOnly' sjsonnet.Std0150FunctionsTests sjsonnet.ValArrayViewTests
  • ./mill --no-server 'sjsonnet.jvm[3.3.7].test'
  • ./mill --no-server 'sjsonnet.jvm[3.3.7].checkFormat'
  • ./mill --no-server 'sjsonnet.native[3.3.7].nativeLink'
  • git diff --check

JMH, JVM harness, compared with #822 copy-api baseline:

| Benchmark | Before | After |
| --- | --- | --- |
| array_copy_views | 13.002 ms/op | 8.454 ms/op |
| realistic2 | see Native data | see Native data |

Scala Native hyperfine, compared with #822 copy-api baseline, using Scala Native binaries, not JVM jars:

| Benchmark | Before | After |
| --- | --- | --- |
| array_copy_views | 11.9 ms +/- 1.2 ms | 10.5 ms +/- 1.0 ms |
| many-small fallback | 7.0 ms +/- 0.7 ms | 6.6 ms +/- 0.5 ms |
| realistic2 | 82.6 ms +/- 0.8 ms | 82.5 ms +/- 0.7 ms |

External performance diff, against jrsonnet built from source at 80cd36a with cargo build --release -p jrsonnet (jrsonnet 0.5.0-pre98):

| Benchmark | sjsonnet Scala Native (#823) | source-built jrsonnet | Result |
| --- | --- | --- | --- |
| array_copy_views | 9.3 ms +/- 0.2 ms | 14.3 ms +/- 0.4 ms | sjsonnet 1.53 +/- 0.06x faster |
| realistic2 | 79.9 ms +/- 2.2 ms | 92.9 ms +/- 1.9 ms | sjsonnet 1.16 +/- 0.04x faster |

JIT / GC review:

  • The second pass copies Eval references into one preallocated Array[Eval]; it does not force element values.
  • totalLen is accumulated as Long and checked before allocating the final Array[Eval].
  • PresizedCopyMaxParts = 1024 avoids turning many-small arrays into an always-two-pass workload.
  • The fallback path preserves the behavior of #822 (Add array eval copy API) for large outer arrays.
  • The hot path is simple counted while-loops plus copyEvalTo, so it stays friendly to JIT inlining and Scala Native codegen.
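Two of the points above, copying references without forcing them and accumulating the length as `Long`, can be illustrated in isolation. The `Thunk` class below is a hypothetical lazy cell standing in for `Eval`, and `checkedTotal` and `copyAll` are illustrative names, not sjsonnet APIs.

```scala
object CopyWithoutForcingSketch {
  // Hypothetical lazy cell; forcedCount records how often `value` was computed.
  final class Thunk(compute: () => Int) {
    var forcedCount = 0
    lazy val value: Int = { forcedCount += 1; compute() }
  }

  // Sums lengths as Long; an Int sum of Int.MaxValue + 1 would wrap negative.
  def checkedTotal(lengths: Array[Int]): Int = {
    var total = 0L
    var i = 0
    while (i < lengths.length) {
      total += lengths(i)
      i += 1
    }
    if (total > Int.MaxValue)
      throw new IllegalArgumentException(s"total length $total exceeds Int.MaxValue")
    total.toInt
  }

  // Copies thunk references into one array; no thunk is forced along the way.
  def copyAll(parts: Array[Array[Thunk]]): Array[Thunk] = {
    val lengths = new Array[Int](parts.length)
    var i = 0
    while (i < parts.length) {
      lengths(i) = parts(i).length
      i += 1
    }
    // The guard runs before the allocation, so an oversized request fails early.
    val out = new Array[Thunk](checkedTotal(lengths))
    var pos = 0
    i = 0
    while (i < parts.length) {
      System.arraycopy(parts(i), 0, out, pos, parts(i).length)
      pos += parts(i).length
      i += 1
    }
    out
  }
}
```

After `copyAll`, every `forcedCount` is still zero; a value is computed only when a later consumer reads `value`.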

Rollback boundary:

  • This PR only changes fully-consumed array-copy consumers.
  • It does not change string join, renderer, sort, callback invocation, or global array view semantics.
  • If a workload shows a many-small regression, the threshold can be lowered or the affected consumer can fall back to the one-pass path from #822 (Add array eval copy API).

References

He-Pin added 3 commits May 5, 2026 16:16
Motivation:

Avoid copying large array slices and remove/removeAt intermediates after the lazy-array work. This follows jrsonnet's indexed slice-view idea while keeping JVM retention under control for small sub-slices.

Modifications:

- add Val.Arr.sliced and SliceArr for large or compact-source slices
- route array slicing and std.remove/removeAt through slice/concat views
- let large concat decisions use total length, with overflow protection
- add correctness coverage and a slice/remove benchmark resource
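As a rough illustration of the slice/concat view idea, the sketch below expresses removeAt as a concatenation of two slice views over the same backing array, so no elements are copied. `ArrView`, `Backed`, `Sliced`, and `Concat` are hypothetical names, not the `Val.Arr.sliced`/`SliceArr` types from this commit, and the sketch ignores the retention heuristics for small sub-slices.

```scala
object SliceViewSketch {
  // Hypothetical read-only array view.
  trait ArrView {
    def length: Int
    def apply(i: Int): AnyRef
  }

  // A view backed directly by an array.
  final class Backed(xs: Array[AnyRef]) extends ArrView {
    def length: Int = xs.length
    def apply(i: Int): AnyRef = xs(i)
  }

  // A window [from, until) over another view; indexing just adds an offset.
  final class Sliced(base: ArrView, from: Int, until: Int) extends ArrView {
    def length: Int = until - from
    def apply(i: Int): AnyRef = {
      if (i < 0 || i >= length) throw new IndexOutOfBoundsException(i.toString)
      base(from + i)
    }
  }

  // Two views presented back to back.
  final class Concat(left: ArrView, right: ArrView) extends ArrView {
    def length: Int = left.length + right.length
    def apply(i: Int): AnyRef =
      if (i < left.length) left(i) else right(i - left.length)
  }

  // Everything before idx followed by everything after it, with no copying.
  def removeAt(a: ArrView, idx: Int): ArrView =
    new Concat(new Sliced(a, 0, idx), new Sliced(a, idx + 1, a.length))

  // Helper for inspecting a view in examples and tests.
  def toList(a: ArrView): List[AnyRef] = List.tabulate(a.length)(i => a(i))
}
```

A view like this keeps the whole backing array reachable, which is the retention concern the commit message raises for small sub-slices of large arrays.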

Results:

- ./mill --no-server 'sjsonnet.jvm[3.3.7].compile'
- ./mill --no-server 'sjsonnet.jvm[2.13.18].compile'
- ./mill --no-server 'sjsonnet.jvm[2.12.21].compile'
- ./mill --no-server 'sjsonnet.jvm[3.3.7].test.testOnly' sjsonnet.ValArrayViewTests sjsonnet.Std0150FunctionsTests
- ./mill --no-server 'sjsonnet.jvm[3.3.7].test'
- ./mill --no-server 'sjsonnet.jvm[3.3.7].checkFormat'
- ./mill --no-server bench.checkFormat
- JMH runRegressions: lazy_array_slice_remove 5.890 -> 1.089 ms/op
- hyperfine macro slice/remove: 498.6 ms -> 335.5 ms

Motivation:

Several stdlib consumers fully copy array elements after the lazy-array work. Centralizing that path avoids repeated directBackingArray/range/view branches and lets concat, repeat, slice, range, and byte arrays expose cheap bulk Eval copies without forcing Val values.

Modifications:

- add Arr.copyEvalTo overloads for ArrayBuilder and preallocated Array[Eval]
- teach concat materialization/eager concat to copy through the new API
- add specialized copy implementations for repeat, slice, reversed lazy views, range, and byte arrays
- route std.flattenArrays, array flatMap, and array-separator std.join through the API
- add correctness coverage and an array_copy_views regression benchmark
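One way to picture a centralized bulk-copy path is a single method that each view kind implements in its cheapest form. The sketch below is an assumption-laden simplification: `View.copyTo` only loosely mirrors the preallocated-array overload of `Arr.copyEvalTo`, all class names are hypothetical, and length overflow is not handled.

```scala
object ViewCopySketch {
  // Hypothetical view interface with one bulk-copy entry point.
  trait View {
    def length: Int
    // Writes all elements into dest from destPos; returns the next free position.
    def copyTo(dest: Array[AnyRef], destPos: Int): Int
  }

  final class Flat(xs: Array[AnyRef]) extends View {
    def length: Int = xs.length
    def copyTo(dest: Array[AnyRef], destPos: Int): Int = {
      System.arraycopy(xs, 0, dest, destPos, xs.length)
      destPos + xs.length
    }
  }

  final class Slice(xs: Array[AnyRef], from: Int, until: Int) extends View {
    def length: Int = until - from
    def copyTo(dest: Array[AnyRef], destPos: Int): Int = {
      // One block copy of the window; no per-element indexing.
      System.arraycopy(xs, from, dest, destPos, length)
      destPos + length
    }
  }

  final class Repeat(base: View, times: Int) extends View {
    def length: Int = base.length * times
    def copyTo(dest: Array[AnyRef], destPos: Int): Int = {
      var pos = destPos
      var i = 0
      while (i < times) {
        pos = base.copyTo(dest, pos)
        i += 1
      }
      pos
    }
  }

  final class Concat(left: View, right: View) extends View {
    def length: Int = left.length + right.length
    def copyTo(dest: Array[AnyRef], destPos: Int): Int =
      right.copyTo(dest, left.copyTo(dest, destPos))
  }

  // Consumers call one method regardless of which view they were handed.
  def materialize(v: View): Array[AnyRef] = {
    val out = new Array[AnyRef](v.length)
    v.copyTo(out, 0)
    out
  }
}
```

Consumers then need no per-view branching: they size a destination and delegate, which is roughly the role the commit message assigns to the new API.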

Results:

- ./mill --no-server 'sjsonnet.jvm[3.3.7].compile'
- ./mill --no-server 'sjsonnet.jvm[2.13.18].compile'
- ./mill --no-server 'sjsonnet.jvm[2.12.21].compile'
- ./mill --no-server 'sjsonnet.jvm[3.3.7].test.testOnly' sjsonnet.ValArrayViewTests sjsonnet.Std0150FunctionsTests
- ./mill --no-server 'sjsonnet.jvm[3.3.7].test'
- ./mill --no-server 'sjsonnet.jvm[3.3.7].checkFormat'
- ./mill --no-server bench.checkFormat
- ./mill --no-server 'sjsonnet.native[3.3.7].nativeLink'
- JMH runRegressions vs slice baseline: array_copy_views 16.871 -> 13.937 ms/op
- Scala Native hyperfine vs slice baseline: array_copy_views 26.1 ms -> 10.9 ms, 2.39x faster

Motivation:

After adding Arr.copyEvalTo, high-volume consumers can avoid ArrayBuilder growth by counting output length first and copying into a single Array[Eval]. This targets small outer arrays that contain large view-backed subarrays, while preserving the one-pass builder path for many-small-array workloads.

Modifications:

- presize std.flattenArrays when the outer part count is modest
- presize array-separator std.join when the outer part count is modest
- keep the one-pass ArrayBuilder + copyEvalTo fallback for large part counts
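For the array-separator join case, the presized length is the sum of the part lengths plus one separator copy between each adjacent pair. The sketch below shows only that presized branch over plain arrays; the object and method names are hypothetical, and the part-count threshold, the builder fallback, and any special handling of individual elements are omitted.

```scala
object PresizedJoinSketch {
  /** Joins `parts` with `sep` between adjacent parts, allocating the result once. */
  def join(sep: Array[AnyRef], parts: Array[Array[AnyRef]]): Array[AnyRef] = {
    if (parts.length == 0) new Array[AnyRef](0)
    else {
      // Pass 1: separators appear (n - 1) times; accumulate as Long to avoid wrap.
      var total = sep.length.toLong * (parts.length - 1)
      var i = 0
      while (i < parts.length) {
        total += parts(i).length
        i += 1
      }
      if (total > Int.MaxValue)
        throw new IllegalArgumentException("joined array too large")
      // Pass 2: alternate separator and part copies into the presized array.
      val out = new Array[AnyRef](total.toInt)
      var pos = 0
      i = 0
      while (i < parts.length) {
        if (i > 0) {
          System.arraycopy(sep, 0, out, pos, sep.length)
          pos += sep.length
        }
        val p = parts(i)
        System.arraycopy(p, 0, out, pos, p.length)
        pos += p.length
        i += 1
      }
      out
    }
  }
}
```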

Results:

- ./mill --no-server 'sjsonnet.jvm[3.3.7].compile'
- ./mill --no-server 'sjsonnet.jvm[2.13.18].compile'
- ./mill --no-server 'sjsonnet.jvm[2.12.21].compile'
- ./mill --no-server 'sjsonnet.jvm[3.3.7].test.testOnly' sjsonnet.Std0150FunctionsTests sjsonnet.ValArrayViewTests
- ./mill --no-server 'sjsonnet.jvm[3.3.7].test'
- ./mill --no-server 'sjsonnet.jvm[3.3.7].checkFormat'
- ./mill --no-server 'sjsonnet.native[3.3.7].nativeLink'
- JMH runRegressions vs copy-api baseline: array_copy_views 13.002 -> 8.454 ms/op
- Scala Native hyperfine vs copy-api baseline: array_copy_views 11.9 ms -> 10.5 ms
- Scala Native hyperfine many-small fallback: 7.0 ms -> 6.6 ms
- Scala Native hyperfine realistic2: 82.6 ms -> 82.5 ms

He-Pin marked this pull request as draft May 5, 2026 09:15