perf: comprehensive Scala Native render pipeline optimization by He-Pin · Pull Request #776 · databricks/sjsonnet

He-Pin · 2026-04-13T07:16:41Z

Status: split in progress; do not merge this large PR as-is.

This PR is now an experiment/archive branch for the broader Scala Native render-pipeline work. Current master has moved to the JDK17 compilation level, so each optimization split should be independently justified as JDK17/JIT/GC-friendly and verified with focused JMH, GC, and Native CLI hyperfine data.

First Focused Split:

#809: perf: chunk long string byte escaping
Scope: only the JMH-positive byte-renderer long-string escape chunking piece.
Intentionally excluded from #809: compareStrings, char materializer, stdlib asciiSafe/substr/join, and format char-array assembly changes.

#809 Benchmark Summary:

#809 now contains the full 36-case JMH+GC sweep, focused rechecks for suspicious rows, Native hyperfine data, and a repeated correctness review. Keep the detailed numbers in #809 as the source of truth; the short summary here is only the target/no-regression view.

Focused target JMH + GC, lower is better:

| Benchmark | master ms/op | #809 ms/op | Delta | master alloc B/op | #809 alloc B/op | GC note |
| --- | ---: | ---: | ---: | ---: | ---: | --- |
| `large_string_template` | 1.686 ± 0.027 | 1.398 ± 0.464 | -17.1% | 7,775,106 | 7,774,803 | allocation neutral/slightly lower |
| `large_string_join` | 0.637 ± 0.075 | 0.646 ± 0.025 | neutral | 1,530,343 | 1,530,269 | clean path neutral |

Scala Native hyperfine, each run loops 20 CLI invocations and divides back to per-invocation milliseconds:

| Benchmark | master mean ms | #809 mean ms | Delta | master median ms | #809 median ms | Median delta |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| `large_string_template` | 11.60 ± 0.98 | 10.30 ± 0.82 | -11.3% | 11.32 | 9.95 | -12.1% |
| `large_string_join` | 6.01 ± 0.12 | 6.02 ± 0.16 | neutral | 5.98 | 5.98 | neutral |

Rejected From the First Split Batch:

Format.scala char-array assembly: not JMH-positive on current master.
length/substr/asciiSafe/join group: substr regressed, so it should not be split out as-is.
std.join exact-capacity builder: allocation improved in one run, but no-prof JMH regressed.
compareStrings/SWAR group: too broad and not GC-proven for a focused first split.
JVM String.indexOf escape scan: tiny signal only, not enough for a separate PR.

Next Split Bar:

Keep each PR small enough to explain its JIT shape.
Prefer straight loops, System.arraycopy, public JDK17 APIs, and existing buffers.
Avoid new per-call temporary arrays, ThreadLocal caches, or extra object graphs unless GC-profiler data pays for them.
Include both positive target data and at least one nearby no-regression guard benchmark.

Original Context:

Original head: ef97244e7536e31089be8410a80b19b3ae144880
Current first split base: 3a9a492899420456070fb84eaa5b89f8b7dfe1bf
First split head: ed9af56139ad6a066910483104bae165cef53d16

He-Pin · 2026-04-14T15:09:45Z

I think string join can be improved with an AST rewrite, but I want to do that after this gets merged.

He-Pin · 2026-04-14T15:28:35Z

Benchmark results summary

Environment: Apple Silicon, macOS
Tool: hyperfine --warmup 5 --min-runs 20 -N
sjsonnet: Scala Native (current branch, including the PR #776 optimizations)
jrsonnet: 0.5.0-pre98 (built from source)

Reliable benchmarks (>20 ms runtime; startup overhead does not dominate)

| Benchmark | sjsonnet (ms) | jrsonnet (ms) | Ratio | Winner |
| --- | ---: | ---: | --- | --- |
| comparsion_for_primitives | 37.6 | 214.5 | sjsonnet 5.71x faster | sjsonnet |
| inheritance_recursion | 60.7 | 120.2 | sjsonnet 1.98x faster | sjsonnet |
| simple_recursive_call | 28.8 | 52.6 | sjsonnet 1.83x faster | sjsonnet |
| realistic_2 | 89.4 | 101.7 | sjsonnet 1.14x faster | sjsonnet |
| std_reverse | 21.6 | 23.5 | tie (1.09x) | tie |

Medium scale (10-20 ms)

| Benchmark | sjsonnet (ms) | jrsonnet (ms) | Ratio | Winner |
| --- | ---: | ---: | --- | --- |
| std_base64_byte_array | 9.8 | 18.2 | sjsonnet 1.86x faster | sjsonnet |
| std_base64decodebytes | 14.1 | 20.5 | sjsonnet 1.45x faster | sjsonnet |
| big_object | 10.5 | 11.6 | sjsonnet 1.10x faster | sjsonnet |
| realistic_1 | 9.3 | 11.9 | sjsonnet 1.27x faster | sjsonnet |

Small scale (<10 ms; startup overhead dominates)

| Benchmark | sjsonnet (ms) | jrsonnet (ms) | Ratio | Winner |
| --- | ---: | ---: | --- | --- |
| comparsion_for_array | 6.3 | 12.8 | sjsonnet 2.02x faster | sjsonnet |
| foldl_string_concat | 5.4 | 8.6 | sjsonnet 1.59x faster | sjsonnet |
| std_foldl | 6.2 | 7.4 | sjsonnet 1.19x faster | sjsonnet |
| large_string_join | 6.8 | 5.4 | jrsonnet 1.26x faster | jrsonnet |
| array_sorts | 8.2 | 5.5 | jrsonnet 1.49x faster | jrsonnet |
| std_base64 | 7.8 | 4.2 | jrsonnet 1.86x faster | jrsonnet |
| std_base64decode | 7.3 | 5.3 | jrsonnet 1.36x faster | jrsonnet |
| std_manifestjsonex | 6.4 | 4.1 | jrsonnet 1.54x faster | jrsonnet |
| std_manifesttomlex | 6.5 | 3.6 | jrsonnet 1.82x faster | jrsonnet |
| std_parseint | 6.1 | 3.6 | jrsonnet 1.70x faster | jrsonnet |
| std_substr | 6.2 | 4.2 | jrsonnet 1.45x faster | jrsonnet |
| string_strips | 5.7 | 3.9 | jrsonnet 1.48x faster | jrsonnet |
| tail_call | 5.9 | 3.7 | jrsonnet 1.57x faster | jrsonnet |
| inheritance_function_recursion | 5.0 | 2.9 | jrsonnet 1.74x faster | jrsonnet |

Motivation: Combined review of PR databricks#776 + databricks#778 identified ~130 lines of duplicated SWAR string rendering and long-to-char conversion code, plus two missing overflow checks in StringModule.

Modification:
- Extract renderQuotedStringSWAR as protected method in BaseCharRenderer, delegate from MaterializeJsonRenderer (removes ~60 lines duplication)
- Make escapeCharInline protected, remove duplicate in Renderer
- Consolidate Renderer.visitFloat64 onto inherited writeLongDirect, remove standalone RenderUtils.appendLong (~40 lines)
- Add totalLen > Int.MaxValue guard in Join pre-sized allocation
- Add Long overflow detection in parseDigits
- Leverage _asciiSafe flag in Substr/Join to skip redundant scans

Result: Net -132 lines. All tests pass across JVM/JS/Native/WASM.

He-Pin · 2026-04-26T11:36:17Z

Reviewed and keeping this as a follow-up, not part of this PR. #776 is scoped to the render/materialization pipeline and Scala Native-friendly SWAR/direct rendering paths. An AST rewrite for string join would be a separate optimization because it changes the optimization boundary earlier in the pipeline and should get its own focused benchmark/compatibility review.

Motivation: PR databricks#776 already propagates _asciiSafe through parser literals, base64, joins, and substrings, but MaterializeJsonRenderer still sent those known-safe strings through the chunked char renderer, allocating a temporary char array and scanning for escapes. The hand-written parseInt path also rejected Long.MinValue, which the previous Long.parseLong-based implementation accepted.

Modification: Add a char-renderer fast path for known ASCII-safe strings and use it in fused MaterializeJsonRenderer. Let std.length trust _asciiSafe before scanning, and switch parseDigits to negative accumulation so Long.MinValue is accepted while positive overflow remains rejected.

Result: Known ASCII-safe strings skip allocation and escape scanning in char materialization and std.length. parseInt keeps the overflow guard without regressing the Long.MinValue boundary.
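The negative-accumulation idea can be illustrated with a short Java sketch. This is not the sjsonnet `parseDigits` code: the class and method names are hypothetical, only base 10 with an optional leading `-` is handled, and `NumberFormatException` is an assumed error convention. It follows the same approach as the JDK's `Long.parseLong`: accumulate in the negative range, where `Long.MIN_VALUE` has a representation, and negate at the end for positive input.

```java
final class ParseDigitsSketch {
    private ParseDigitsSketch() {}

    /**
     * Parses a decimal string with an optional leading '-' into a long.
     * Hypothetical helper for illustration; throws NumberFormatException on
     * empty input, non-digit characters, or overflow in either direction.
     */
    static long parseDecimal(String s) {
        int len = s.length();
        if (len == 0) throw new NumberFormatException("empty input");
        boolean negative = s.charAt(0) == '-';
        int i = negative ? 1 : 0;
        if (i == len) throw new NumberFormatException("sign without digits");

        // The negative range is one larger than the positive range, so the
        // running value is kept negative and the limit depends on the sign.
        long limit = negative ? Long.MIN_VALUE : -Long.MAX_VALUE;
        long multMin = limit / 10; // smallest value that can still be multiplied by 10
        long result = 0;
        while (i < len) {
            int digit = s.charAt(i) - '0';
            if (digit < 0 || digit > 9) throw new NumberFormatException("not a digit: " + s);
            if (result < multMin) throw new NumberFormatException("overflow: " + s);
            result *= 10;
            if (result < limit + digit) throw new NumberFormatException("overflow: " + s);
            result -= digit;
            i++;
        }
        return negative ? result : -result;
    }
}
```

With this shape, `parseDecimal("-9223372036854775808")` returns `Long.MIN_VALUE`, while `parseDecimal("9223372036854775808")` is rejected as overflow, and no exception is constructed on the success path.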

He-Pin · 2026-04-26T20:24:34Z

Small follow-up pushed in 76d7bc4c:

Reuse _asciiSafe in MaterializeJsonRenderer, so known-safe strings skip the temporary char[] allocation and escape scan in the fused char rendering path.
Let std.length trust _asciiSafe before running the ASCII scan.
Fix the hand-written parseInt overflow path to preserve the previous Long.MinValue boundary while still rejecting positive overflow.

Validation run locally:

./mill --no-server 'sjsonnet.jvm[3.3.7].reformat'
./mill --no-server 'sjsonnet.jvm[3.3.7].checkFormat'
./mill --no-server 'sjsonnet.jvm[3.3.7].test.testOnly' sjsonnet.Std0150FunctionsTests
./mill --no-server 'sjsonnet.jvm[3.3.7].test.testOnly' sjsonnet.RendererTests
./mill --no-server 'sjsonnet.jvm[3.3.7].test'
./mill --no-server 'sjsonnet.js[3.3.7].test'
./mill --no-server 'sjsonnet.native[3.3.7].compile'
git diff --check

Motivation: String comparison (compareStringsByCodepoint) and long string rendering are hot paths in sort-heavy and render-heavy Jsonnet workloads. The comparison used per-char charAt() virtual dispatch preventing JIT vectorization. Long string rendering used a binary scan (clean→bulk copy, dirty→full reprocess from position 0).

Modification:
1. compareStrings: bulk getChars() + tight array loop enabling JIT auto-vectorization (AVX2/SSE). Surrogate check deferred to mismatch point only (O(1) vs O(n)). ThreadLocal buffers on JVM, local alloc on Native, scalar fallback on JS.
2. findFirstEscapeChar: SWAR scan returning position (not boolean).
3. visitLongString: chunked rendering: find escape position, arraycopy clean prefix, escape inline, repeat. Avoids re-processing entire string when only a few chars need escaping.

Result: All tests pass across JVM (Scala 3.3.7, 2.13.18) and JS. All benchmark regressions pass. Endian-safe (SWAR operates on independent byte lanes).
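A position-returning SWAR scan of the kind described in item 2 might look like the following Java sketch. It is an illustration under stated assumptions, not the `CharSWAR` source: the class name, the `-1` "not found" convention, and the escape set (bytes below 0x20, `"`, and `\`) are assumptions. It reads eight bytes at a time through the public `MethodHandles.byteArrayViewVarHandle` API and uses a zero-lane test that never carries between lanes, so each byte is judged independently.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

final class EscapeScanSketch {
    private EscapeScanSketch() {}

    // Reads 8 bytes as one little-endian long, so array byte k lands in bits 8k..8k+7.
    private static final VarHandle LONG_LE =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    private static final long LOW7 = 0x7F7F7F7F7F7F7F7FL;
    private static final long QUOTES = 0x2222222222222222L;       // '"' in every lane
    private static final long BACKSLASHES = 0x5C5C5C5C5C5C5C5CL;  // '\\' in every lane
    private static final long CONTROL_BITS = 0xE0E0E0E0E0E0E0E0L; // byte < 0x20 iff these bits are zero

    /** Sets bit 7 of every lane of v that is exactly zero; additions never carry across lanes. */
    private static long zeroLanes(long v) {
        long t = (v & LOW7) + LOW7;
        return ~(t | v | LOW7);
    }

    /** Returns the index of the first byte in [from, to) that needs a JSON escape, or -1 if none. */
    static int findFirstEscape(byte[] bytes, int from, int to) {
        int i = from;
        while (i + 8 <= to) {
            long w = (long) LONG_LE.get(bytes, i);
            long hits = zeroLanes(w & CONTROL_BITS)
                      | zeroLanes(w ^ QUOTES)
                      | zeroLanes(w ^ BACKSLASHES);
            if (hits != 0) {
                // The lowest set bit marks the first matching lane in array order.
                return i + (Long.numberOfTrailingZeros(hits) >>> 3);
            }
            i += 8;
        }
        // Scalar tail for the last 0..7 bytes.
        for (; i < to; i++) {
            byte b = bytes[i];
            if ((b >= 0 && b < 0x20) || b == '"' || b == '\\') return i;
        }
        return -1;
    }
}
```

Returning a position rather than a boolean is what lets a caller copy the clean prefix in bulk and resume scanning after the escaped byte. UTF-8 continuation bytes have the high bit set, so they never match any of the three tests.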

Replace per-call `new Array[Char](n)` allocation with module-level pre-allocated buffers in Scala Native's compareStrings. Safe because Scala Native is single-threaded (mirrors the JVM ThreadLocal approach).
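For context, a code-point comparison over bulk-copied char arrays might look like the following Java sketch. It is not the sjsonnet implementation: the class and method names are hypothetical, the buffers are allocated per call for brevity rather than reused as this commit describes, and the surrogate fix-up assumes well-formed UTF-16 input.

```java
final class CompareStringsSketch {
    private CompareStringsSketch() {}

    /**
     * Compares two strings by Unicode code point order.
     * UTF-16 code unit order differs from code point order only when a
     * surrogate meets a char in U+E000..U+FFFF, so the surrogate handling
     * is applied once, at the first mismatching position.
     */
    static int compareByCodePoint(String a, String b) {
        int la = a.length();
        int lb = b.length();
        char[] ca = new char[la];
        char[] cb = new char[lb];
        a.getChars(0, la, ca, 0); // bulk copy instead of per-char charAt()
        b.getChars(0, lb, cb, 0);

        int n = Math.min(la, lb);
        int i = 0;
        while (i < n && ca[i] == cb[i]) i++; // tight loop over plain arrays

        if (i == n) return Integer.compare(la, lb); // one string is a prefix of the other
        return Integer.compare(fixUp(ca[i]), fixUp(cb[i]));
    }

    /** Remaps code units so that surrogates sort above U+E000..U+FFFF. */
    private static int fixUp(char c) {
        if (c < 0xD800) return c;
        return c <= 0xDFFF ? c + 0x2000 : c - 0x800;
    }
}
```

The difference from `String.compareTo` shows up for inputs such as `"\uFFFF"` and the surrogate pair for U+1F600: code unit order puts the pair first, while code point order puts U+FFFF first.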

Motivation: manifestJsonEx/manifestTomlEx used the generic Visitor interface for char-based rendering, missing the fused direct-walk optimization that ByteRenderer already had. Additionally, char-based string rendering (BaseCharRenderer, MaterializeJsonRenderer) did binary hasEscapeChar check → char-by-char RenderUtils.escapeChar fallback, while ByteRenderer had proper chunked SWAR scanning → bulk arraycopy → inline escape.

Modification:
- Add materializeDirect(Val) to MaterializeJsonRenderer, mirroring ByteRenderer's fused materializer with valTag-based switch dispatch
- Replace visitNonNullString in BaseCharRenderer with chunked rendering: findFirstEscapeCharChar → bulk arraycopy clean segments → escapeCharInline
- Add renderQuotedString to MaterializeJsonRenderer with same chunked pattern
- Add findFirstEscapeCharChar(char[]) to all 3 CharSWAR platform impls
- Wire ManifestModule to use renderer.materializeDirect instead of Materializer.apply0 + Visitor interface

Result: manifestJsonEx gap reduced from 2.15x to ~1.4x slower vs jrsonnet. realistic_2 flipped from 1.62x slower to 1.12x faster.

…afe propagation

Motivation: String-heavy stdlib operations (substr, length, join, parseInt) had unnecessary overhead on Scala Native: codePointCount/offsetByCodePoints O(n) scans for ASCII strings, StringBuilder resize churn for join, exception-based parseInt via Long.parseLong.

Modification:
- Add ASCII fast path to Length and Substr using CharSWAR.isAllAscii: skip codePointCount/offsetByCodePoints for ASCII-only strings (99% case)
- Pre-sized char[] assembly for std.join: two-pass approach calculates exact output length, then copies with getChars, so there is zero resize overhead
- Hand-written parseDigits loop for parseInt/parseOctal/parseHex: no exception setup, no intermediate allocation, single pass
- Propagate _asciiSafe flag: parser sets it on ASCII string literals, Val.Str.concat preserves it when both children are ASCII-safe, join propagates it through all elements

Result: substr gap reduced from 2.03x to ~1.07x. parseint from 1.80x to ~1.0x. large_string_join from 1.81x to ~1.27x. realistic_2 benefits from combined improvements.
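The two-pass join can be sketched in Java as follows. This is a minimal illustration, not the sjsonnet `Join` code: the class name, the `String[]` input, and the exception type are assumptions. The first pass sums the lengths in a `long` so an oversized result is caught before allocating, and the second pass copies each piece with `String.getChars` into an exactly sized array.

```java
final class JoinSketch {
    private JoinSketch() {}

    /** Joins parts with sep using one exactly sized char[] instead of a growing builder. */
    static String join(String sep, String[] parts) {
        if (parts.length == 0) return "";

        // Pass 1: compute the exact output length; use long so the sum cannot wrap.
        long total = (long) sep.length() * (parts.length - 1);
        for (String p : parts) total += p.length();
        if (total > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("joined string too long: " + total);
        }

        // Pass 2: copy separators and parts straight into the final buffer.
        char[] out = new char[(int) total];
        int pos = 0;
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                sep.getChars(0, sep.length(), out, pos);
                pos += sep.length();
            }
            String p = parts[i];
            p.getChars(0, p.length(), out, pos);
            pos += p.length();
        }
        return new String(out); // note: the String constructor copies the array once more
    }
}
```

For example, `JoinSketch.join(", ", new String[] {"a", "b", "c"})` produces `a, b, c`. Whether this beats a `StringBuilder` on a given runtime is a measurement question; the "Rejected From the First Split Batch" notes in this thread report that the exact-capacity builder did not hold up under no-prof JMH on current master.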

Motivation: Format.format() used StringBuilder which starts small and resizes multiple times for large output. The large_string_template benchmark (591KB template, 256 interpolations) showed 2.78x gap vs jrsonnet.

Modification:
- Three-pass approach: compute formatted values into String array, calculate exact total output length, allocate char[] and copy with getChars, which eliminates StringBuilder resize/copy overhead
- Add direct Val dispatch in format loop: skip Materializer for common types (Str, Num, Bool, Null) to avoid ujson.Value roundtrip

Result: large_string_template gap reduced from 2.78x to ~1.88x. Remaining gap is dominated by Scala Native startup overhead (~7ms vs Rust ~1ms); pure computation time is within ~1ms of jrsonnet.

Motivation: CI fails on two issues: (1) unused `alwaysinline` import in Native CharSWAR.scala, (2) `\uXXXX` sequences in comments are parsed as unicode escapes in Scala 2.12, causing compilation errors.

Modification:
- Remove unused `scala.scalanative.annotation.alwaysinline` import
- Escape backslash-u sequences in comments across BaseByteRenderer and Renderer

Result: Full test suite passes across all platforms and Scala versions

Motivation: Split the JMH-positive, JDK17/JIT/GC-friendly long-string rendering piece out of #776. Keep this PR focused on byte rendering for long strings that contain JSON escapes; this does not include the broader format, stdlib, compareStrings, or Scala Native experiments from #776.

Modification:
- Add `CharSWAR.findFirstEscapeChar(byte[], from, to)` on JVM, Scala.js, and Scala Native.
- In `BaseByteRenderer`, keep the existing UTF-8 byte array for long strings, locate escape bytes, bulk-copy clean chunks with `System.arraycopy`, and escape only matching bytes inline.
- Precompute the exact escaped output length, reserve `ByteBuilder` once, then write directly to the backing byte array. This removes repeated `ensureLength`/`appendUnsafeC` calls from the dirty long-string loop.
- Use a static byte hex table for `\u00XX` control escapes.

JIT / GC shape:
- Hot code stays in simple `while` loops, `System.arraycopy`, and small private helpers.
- No reflection, no internal JDK APIs, no closures/iterators in the rendering loop.
- No per-chunk or per-escape objects are allocated by this follow-up; the existing per-long-string UTF-8 byte array remains the only temporary for this path.
- I tested a no-allocation ASCII scalar path, but rejected it because it regressed `large_string_template` and `large_string_join` JMH.

Notable results only:

JMH target run, same machine, same command shape on `upstream/master` and this branch: `./mill -i bench.runRegressions bench/resources/cpp_suite/large_string_template.jsonnet bench/resources/cpp_suite/large_string_join.jsonnet`

| Benchmark | upstream/master | PR | Delta |
| --- | ---: | ---: | ---: |
| `large_string_template` | 1.552 ms/op | 1.154 ms/op | -25.6% / 1.34x faster |

Scala Native hyperfine, release-full native binary, 20 runs:

| Benchmark | upstream/master | PR | Delta |
| --- | ---: | ---: | ---: |
| `large_string_template` | 10.5 +/- 0.2 ms | 9.6 +/- 0.3 ms | -8.6% / 1.09x faster |

`large_string_join` was rechecked as a guardrail and stayed neutral, so it is intentionally omitted from the result tables.

Verification:
- `./mill -i 'sjsonnet.jvm[3.3.7].compile'`
- `./mill -i 'sjsonnet.jvm[3.3.7].test'`
- `./mill -i 'sjsonnet.js[3.3.7].compile' 'sjsonnet.native[3.3.7].compile'`
- `./mill -i 'sjsonnet.native[3.3.7].nativeLink'`
- `./mill -i __.checkFormat`
- `git diff --check`
- Focused JMH and Native hyperfine commands above

References:
- Split from #776
- Base: `b4c667d55d82d7c50c2103db967c33bebb0c2c98`
- Head: `ff70b63e`
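The render shape described in the Modification list (exact-length precompute, one allocation, bulk copy of clean chunks, inline escapes, static hex table) might be sketched in Java as below. This is an illustration under stated assumptions, not the `BaseByteRenderer` code: the class and method names are hypothetical, a plain `byte[]` stands in for `ByteBuilder`, the escape scan is scalar rather than SWAR, and the set of two-character escapes is the usual JSON one rather than anything confirmed from the PR.

```java
import java.nio.charset.StandardCharsets;

final class ChunkedEscapeSketch {
    private ChunkedEscapeSketch() {}

    // Static lookup table for the two hex digits of a control escape.
    private static final byte[] HEX = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);

    /** Output size of one input byte: 1 if clean, 2 for short escapes, 6 for a hex control escape. */
    private static int escapedLength(byte b) {
        switch (b) {
            case '"': case '\\': case '\b': case '\f': case '\n': case '\r': case '\t':
                return 2;
            default:
                return (b >= 0 && b < 0x20) ? 6 : 1;
        }
    }

    /** Renders s as a quoted JSON string in UTF-8, sizing the output exactly before writing. */
    static byte[] renderQuoted(String s) {
        byte[] in = s.getBytes(StandardCharsets.UTF_8);
        int outLen = 2; // opening and closing quote
        for (byte b : in) outLen += escapedLength(b);

        byte[] out = new byte[outLen];
        int o = 0;
        out[o++] = '"';
        int chunkStart = 0;
        for (int i = 0; i < in.length; i++) {
            byte b = in[i];
            if (escapedLength(b) == 1) continue; // still inside a clean chunk
            int n = i - chunkStart;
            System.arraycopy(in, chunkStart, out, o, n); // bulk-copy the clean chunk
            o += n;
            out[o++] = '\\';
            switch (b) {
                case '"':  out[o++] = '"';  break;
                case '\\': out[o++] = '\\'; break;
                case '\b': out[o++] = 'b';  break;
                case '\f': out[o++] = 'f';  break;
                case '\n': out[o++] = 'n';  break;
                case '\r': out[o++] = 'r';  break;
                case '\t': out[o++] = 't';  break;
                default:
                    out[o++] = 'u';
                    out[o++] = '0';
                    out[o++] = '0';
                    out[o++] = HEX[(b >> 4) & 0xF];
                    out[o++] = HEX[b & 0xF];
            }
            chunkStart = i + 1;
        }
        int tail = in.length - chunkStart;
        System.arraycopy(in, chunkStart, out, o, tail); // trailing clean chunk
        o += tail;
        out[o] = '"';
        return out;
    }
}
```

Because the output length is known up front, the write loop needs no capacity checks, which is the effect the PR attributes to reserving `ByteBuilder` once. Apart from the input and output arrays, the sketch allocates nothing per chunk or per escape.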

He-Pin · 2026-05-08T05:17:43Z

Closing obsolete broad draft. The useful render work has been or should be split into smaller focused PRs with current docs-aligned data; this branch is now conflicting and too broad to carry forward as-is.

He-Pin · 2026-05-08T05:21:39Z

Reopened. This broad branch still conflicts heavily with current renderer/SWAR code and overlaps later split PRs. Keep as draft/source material for extracting smaller PRs rather than closing it as negative.

He-Pin · 2026-05-08T05:35:53Z

Rebase retry against current upstream/master still conflicts at the first renderer/SWAR commit: sjsonnet/src-js/sjsonnet/CharSWAR.scala, sjsonnet/src-native/sjsonnet/CharSWAR.scala, and sjsonnet/src/sjsonnet/BaseByteRenderer.scala. Keeping this as draft/source material; not closing because this is not a negative benchmark result.

He-Pin force-pushed the renderOpt-clean branch from 5512f52 to 3042124 Compare April 13, 2026 07:28

He-Pin marked this pull request as draft April 13, 2026 07:50

He-Pin mentioned this pull request Apr 13, 2026

perf: SIMD-accelerated FastBase64 for Scala Native via C FFI #749

Merged

He-Pin force-pushed the renderOpt-clean branch from 3042124 to 3ac67a1 Compare April 14, 2026 14:16

He-Pin changed the title ~~perf: SWAR string comparison and chunked escape rendering~~ perf: comprehensive Scala Native render pipeline optimization Apr 14, 2026

He-Pin marked this pull request as ready for review April 14, 2026 14:19

He-Pin commented Apr 14, 2026

Comment thread sjsonnet/src-js/sjsonnet/CharSWAR.scala

He-Pin marked this pull request as draft April 14, 2026 17:32

He-Pin force-pushed the renderOpt-clean branch from a4dde27 to e38e8c4 Compare April 18, 2026 09:59

He-Pin force-pushed the renderOpt-clean branch from bf0e393 to 58759aa Compare April 25, 2026 08:41

He-Pin force-pushed the renderOpt-clean branch from 58759aa to 2e42f76 Compare April 26, 2026 10:47

He-Pin closed this Apr 26, 2026

He-Pin reopened this Apr 26, 2026

He-Pin marked this pull request as ready for review April 26, 2026 11:05

He-Pin marked this pull request as draft April 26, 2026 11:10

He-Pin marked this pull request as ready for review April 26, 2026 11:19

He-Pin marked this pull request as draft April 26, 2026 20:22

He-Pin and others added 5 commits April 29, 2026 04:58

perf: use pre-allocated char buffers for Native compareStrings

2b10a20

style: apply scalafmt to CharSWAR Scala sources

84c596c

He-Pin added 6 commits April 29, 2026 04:59

test: drop stale parseInt overflow expectation

de4d61d

perf: avoid temp char arrays for clean strings

ef97244

He-Pin force-pushed the renderOpt-clean branch from 76d7bc4 to ef97244 Compare April 28, 2026 22:10

He-Pin mentioned this pull request Apr 30, 2026

perf: chunk long string byte escaping #809

Merged

He-Pin closed this May 8, 2026

He-Pin reopened this May 8, 2026

He-Pin wants to merge 11 commits into databricks:master from He-Pin:renderOpt-clean
