perf: comprehensive Scala Native render pipeline optimization#776
He-Pin wants to merge 11 commits into databricks:master
I think the string join can be improved with an AST rewrite, but I want to do that after this is merged. |
● Benchmark results summary
Environment: Apple Silicon, macOS | Tool: hyperfine --warmup 5 --min-runs 20 -N | sjsonnet: Scala Native (current branch, including the PR #776 optimizations) | jrsonnet: 0.5.0-pre98 (built from source)
Reliable benchmarks (>20 ms runtime, startup overhead does not dominate)
Benchmark                      │ sjsonnet (ms) │ jrsonnet (ms) │ Ratio                 │ Winner
───────────────────────────────┼───────────────┼───────────────┼───────────────────────┼──────────
comparsion_for_primitives      │ 37.6          │ 214.5         │ sjsonnet 5.71x faster │ sjsonnet
inheritance_recursion          │ 60.7          │ 120.2         │ sjsonnet 1.98x faster │ sjsonnet
simple_recursive_call          │ 28.8          │ 52.6          │ sjsonnet 1.83x faster │ sjsonnet
realistic_2                    │ 89.4          │ 101.7         │ sjsonnet 1.14x faster │ sjsonnet
std_reverse                    │ 21.6          │ 23.5          │ tie (1.09x)           │ tie
Medium scale (10-20 ms)
Benchmark                      │ sjsonnet (ms) │ jrsonnet (ms) │ Ratio                 │ Winner
───────────────────────────────┼───────────────┼───────────────┼───────────────────────┼──────────
std_base64_byte_array          │ 9.8           │ 18.2          │ sjsonnet 1.86x faster │ sjsonnet
std_base64decodebytes          │ 14.1          │ 20.5          │ sjsonnet 1.45x faster │ sjsonnet
big_object                     │ 10.5          │ 11.6          │ sjsonnet 1.10x faster │ sjsonnet
realistic_1                    │ 9.3           │ 11.9          │ sjsonnet 1.27x faster │ sjsonnet
Small scale (<10 ms, startup overhead dominates)
Benchmark                      │ sjsonnet (ms) │ jrsonnet (ms) │ Ratio                 │ Winner
───────────────────────────────┼───────────────┼───────────────┼───────────────────────┼──────────
comparsion_for_array           │ 6.3           │ 12.8          │ sjsonnet 2.02x faster │ sjsonnet
foldl_string_concat            │ 5.4           │ 8.6           │ sjsonnet 1.59x faster │ sjsonnet
std_foldl                      │ 6.2           │ 7.4           │ sjsonnet 1.19x faster │ sjsonnet
large_string_join              │ 6.8           │ 5.4           │ jrsonnet 1.26x faster │ jrsonnet
array_sorts                    │ 8.2           │ 5.5           │ jrsonnet 1.49x faster │ jrsonnet
std_base64                     │ 7.8           │ 4.2           │ jrsonnet 1.86x faster │ jrsonnet
std_base64decode               │ 7.3           │ 5.3           │ jrsonnet 1.36x faster │ jrsonnet
std_manifestjsonex             │ 6.4           │ 4.1           │ jrsonnet 1.54x faster │ jrsonnet
std_manifesttomlex             │ 6.5           │ 3.6           │ jrsonnet 1.82x faster │ jrsonnet
std_parseint                   │ 6.1           │ 3.6           │ jrsonnet 1.70x faster │ jrsonnet
std_substr                     │ 6.2           │ 4.2           │ jrsonnet 1.45x faster │ jrsonnet
string_strips                  │ 5.7           │ 3.9           │ jrsonnet 1.48x faster │ jrsonnet
tail_call                      │ 5.9           │ 3.7           │ jrsonnet 1.57x faster │ jrsonnet
inheritance_function_recursion │ 5.0           │ 2.9           │ jrsonnet 1.74x faster │ jrsonnet |
Motivation:
Combined review of PR databricks#776 + databricks#778 identified ~130 lines of duplicated SWAR string rendering and long-to-char conversion code, plus two missing overflow checks in StringModule.

Modification:
- Extract renderQuotedStringSWAR as a protected method in BaseCharRenderer and delegate to it from MaterializeJsonRenderer (removes ~60 lines of duplication)
- Make escapeCharInline protected and remove the duplicate in Renderer
- Consolidate Renderer.visitFloat64 onto the inherited writeLongDirect and remove the standalone RenderUtils.appendLong (~40 lines)
- Add a totalLen > Int.MaxValue guard in the Join pre-sized allocation
- Add Long overflow detection in parseDigits
- Leverage the _asciiSafe flag in Substr/Join to skip redundant scans

Result:
Net -132 lines. All tests pass across JVM/JS/Native/WASM.
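The idea of trusting a cached ASCII flag to skip code-point scans in `std.length` and `std.substr` can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `AsciiFastPath`, `Str`, and the `asciiSafe` field are hypothetical stand-ins, the flag here only means "every char is below 0x80" (the real `_asciiSafe` may carry a stronger guarantee), and `from` and `len` are assumed to be non-negative.

```scala
object AsciiFastPath {
  // Hypothetical string value carrying a cached "known ASCII" flag.
  final case class Str(value: String, asciiSafe: Boolean)

  // Scalar fallback scan, used only when the flag is not already set.
  def isAllAscii(s: String): Boolean = {
    var i = 0
    while (i < s.length) {
      if (s.charAt(i) >= 0x80) return false
      i += 1
    }
    true
  }

  // For ASCII strings the code-point length equals the UTF-16 length.
  def length(s: Str): Int =
    if (s.asciiSafe || isAllAscii(s.value)) s.value.length
    else s.value.codePointCount(0, s.value.length)

  // Code-point based substring; assumes from >= 0 and len >= 0.
  def substr(s: Str, from: Int, len: Int): Str = {
    val v = s.value
    if (s.asciiSafe || isAllAscii(v)) {
      // Fast path: char index == code-point index, no offset scan needed.
      val start = math.min(from, v.length)
      val end = math.min(start.toLong + len, v.length.toLong).toInt
      Str(v.substring(start, end), asciiSafe = true)
    } else {
      val total = v.codePointCount(0, v.length)
      val startCp = math.min(from, total)
      val endCp = math.min(startCp.toLong + len, total.toLong).toInt
      val start = v.offsetByCodePoints(0, startCp)
      val end = v.offsetByCodePoints(start, endCp - startCp)
      // Conservatively leave the flag unset on the slow path.
      Str(v.substring(start, end), asciiSafe = false)
    }
  }
}
```

A substring of a flagged string stays flagged, so later operations on the result can keep skipping the scan.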
|
Reviewed and keeping this as a follow-up, not part of this PR. #776 is scoped to the render/materialization pipeline and Scala Native-friendly SWAR/direct rendering paths. An AST rewrite for string join would be a separate optimization because it changes the optimization boundary earlier in the pipeline and should get its own focused benchmark/compatibility review. |
Motivation:
PR databricks#776 already propagates _asciiSafe through parser literals, base64, joins, and substrings, but MaterializeJsonRenderer still sent those known-safe strings through the chunked char renderer, allocating a temporary char array and scanning for escapes. The hand-written parseInt path also rejected Long.MinValue, which the previous Long.parseLong-based implementation accepted.

Modification:
Add a char-renderer fast path for known ASCII-safe strings and use it in the fused MaterializeJsonRenderer. Let std.length trust _asciiSafe before scanning, and switch parseDigits to negative accumulation so Long.MinValue is accepted while positive overflow remains rejected.

Result:
Known ASCII-safe strings skip allocation and escape scanning in char materialization and std.length. parseInt keeps the overflow guard without regressing the Long.MinValue boundary.
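Negative accumulation works because the negative range of `Long` is one larger than the positive range, so accumulating toward `Long.MinValue` and negating at the end covers both boundaries with a single overflow check. A minimal sketch of that technique, modelled on the approach `java.lang.Long.parseLong` takes; the object name, the `Option` return type, and the signature are assumptions for illustration, not the PR's `parseDigits`:

```scala
object ParseDigits {
  // Parses an optionally '-'-prefixed digit string in the given radix.
  // Returns None on empty input, invalid digits, or overflow (hypothetical API).
  def parseLong(s: String, radix: Int): Option[Long] = {
    if (s.isEmpty) return None
    val negative = s.charAt(0) == '-'
    var i = if (negative) 1 else 0
    if (i == s.length) return None // a lone "-" is not a number

    // Accumulate as a negative number; the limit depends on the sign.
    val limit = if (negative) Long.MinValue else -Long.MaxValue
    val multMin = limit / radix
    var acc = 0L
    while (i < s.length) {
      val d = Character.digit(s.charAt(i), radix)
      if (d < 0 || acc < multMin) return None // bad digit or multiply overflow
      acc *= radix
      if (acc < limit + d) return None // subtracting d would overflow
      acc -= d
      i += 1
    }
    Some(if (negative) acc else -acc)
  }
}
```

With this shape, `"-9223372036854775808"` parses to `Long.MinValue`, while `"9223372036854775808"` is rejected as positive overflow.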
|
Small follow-up pushed in
Validation run locally:
|
Motivation:
String comparison (compareStringsByCodepoint) and long string rendering are hot paths in sort-heavy and render-heavy Jsonnet workloads. The comparison used per-char charAt() virtual dispatch, preventing JIT vectorization. Long string rendering used a binary scan (clean→bulk copy, dirty→full reprocess from position 0).

Modification:
1. compareStrings: bulk getChars() + tight array loop enabling JIT auto-vectorization (AVX2/SSE). Surrogate check deferred to the mismatch point only (O(1) vs O(n)). ThreadLocal buffers on JVM, local alloc on Native, scalar fallback on JS.
2. findFirstEscapeChar: SWAR scan returning the position (not a boolean).
3. visitLongString: chunked rendering: find the escape position, arraycopy the clean prefix, escape inline, repeat. Avoids re-processing the entire string when only a few chars need escaping.

Result:
All tests pass across JVM (Scala 3.3.7, 2.13.18) and JS. All benchmark regressions pass. Endian-safe (SWAR operates on independent byte lanes).
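A position-returning SWAR scan of the kind described in item 2 might look like the sketch below. It is an illustration under assumptions, not the PR's `CharSWAR` code: `EscapeScan` and `findFirstEscape` are hypothetical names, the word is assembled with shifts instead of a platform intrinsic, and the exact position inside a flagged word is found with a short scalar loop rather than a bit trick. The bytes treated as needing escapes are the JSON ones: control bytes below 0x20, the double quote, and the backslash.

```scala
object EscapeScan {
  private final val Ones  = 0x0101010101010101L
  private final val Highs = 0x8080808080808080L

  // Packs 8 bytes starting at index i into one Long (byte i in the lowest lane).
  private def load8(a: Array[Byte], i: Int): Long = {
    var w = 0L
    var k = 7
    while (k >= 0) { w = (w << 8) | (a(i + k) & 0xffL); k -= 1 }
    w
  }

  // True iff at least one byte lane of w is zero.
  private def hasZeroLane(w: Long): Boolean = ((w - Ones) & ~w & Highs) != 0L

  // True iff some lane is below 0x20, or equals the quote or the backslash.
  private def wordNeedsEscape(w: Long): Boolean = {
    val control = ((w - Ones * 0x20L) & ~w & Highs) != 0L
    control || hasZeroLane(w ^ (Ones * 0x22L)) || hasZeroLane(w ^ (Ones * 0x5cL))
  }

  private def needsEscape(b: Byte): Boolean = {
    val u = b & 0xff
    u < 0x20 || u == 0x22 || u == 0x5c
  }

  // Returns the index of the first byte in [from, to) that needs escaping, or -1.
  def findFirstEscape(a: Array[Byte], from: Int, to: Int): Int = {
    var i = from
    while (i + 8 <= to) {
      if (wordNeedsEscape(load8(a, i))) {
        var j = i
        while (j < i + 8) { if (needsEscape(a(j))) return j; j += 1 }
      }
      i += 8
    }
    // Scalar tail for the last fewer-than-8 bytes.
    while (i < to) { if (needsEscape(a(i))) return i; i += 1 }
    -1
  }
}
```

Returning a position instead of a boolean is what lets the renderer copy the clean prefix in bulk and resume scanning right after the escaped byte.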
Replace per-call `new Array[Char](n)` allocation with module-level pre-allocated buffers in Scala Native's compareStrings. Safe because Scala Native is single-threaded (mirrors the JVM ThreadLocal approach).
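For reference, the compareStrings shape these commits describe (bulk `getChars`, a tight array loop, and a surrogate check only at the first mismatch) can be sketched as below. This is a hedged illustration: `CodepointCompare` is a hypothetical name, and the sketch allocates fresh arrays per call rather than reusing ThreadLocal or module-level buffers as the commits do. It relies on the fact that, at the first differing UTF-16 unit of two well-formed strings, a surrogate unit always belongs to a code point above U+FFFF and therefore sorts after any non-surrogate unit.

```scala
object CodepointCompare {
  // Compares two strings by Unicode code point order.
  def compare(a: String, b: String): Int = {
    val n = math.min(a.length, b.length)
    // Per-call buffers for simplicity; the commits reuse buffers instead.
    val ca = new Array[Char](n)
    val cb = new Array[Char](n)
    a.getChars(0, n, ca, 0)
    b.getChars(0, n, cb, 0)

    // Tight loop over plain arrays: no per-char method dispatch.
    var i = 0
    while (i < n && ca(i) == cb(i)) i += 1

    if (i == n) Integer.compare(a.length, b.length) // one is a prefix of the other
    else {
      val x = ca(i)
      val y = cb(i)
      val sx = Character.isSurrogate(x)
      val sy = Character.isSurrogate(y)
      if (sx == sy) Integer.compare(x.toInt, y.toInt) // same class: unit order is code point order
      else if (sx) 1 // surrogate side is a supplementary code point
      else -1
    }
  }
}
```

The surrogate check runs once per comparison, at the mismatch, instead of once per character.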
Motivation:
manifestJsonEx/manifestTomlEx used the generic Visitor interface for char-based rendering, missing the fused direct-walk optimization that ByteRenderer already had. Additionally, char-based string rendering (BaseCharRenderer, MaterializeJsonRenderer) did a binary hasEscapeChar check → char-by-char RenderUtils.escapeChar fallback, while ByteRenderer had proper chunked SWAR scanning → bulk arraycopy → inline escape.

Modification:
- Add materializeDirect(Val) to MaterializeJsonRenderer, mirroring ByteRenderer's fused materializer with valTag-based switch dispatch
- Replace visitNonNullString in BaseCharRenderer with chunked rendering: findFirstEscapeCharChar → bulk arraycopy of clean segments → escapeCharInline
- Add renderQuotedString to MaterializeJsonRenderer with the same chunked pattern
- Add findFirstEscapeCharChar(char[]) to all 3 CharSWAR platform impls
- Wire ManifestModule to use renderer.materializeDirect instead of Materializer.apply0 + the Visitor interface

Result:
manifestJsonEx gap reduced from 2.15x to ~1.4x slower vs jrsonnet. realistic_2 flipped from 1.62x slower to 1.12x faster.
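The chunked pattern named above (find the next escape position, bulk-copy the clean segment, escape one char inline, repeat) can be sketched for the char-based path as follows. `ChunkedQuote`, `firstEscape`, and `renderQuoted` are hypothetical names, the scan is scalar rather than SWAR, and output goes to a `java.lang.StringBuilder` instead of the renderer's own buffer, so treat this as an outline of the control flow only.

```scala
object ChunkedQuote {
  private val Hex = "0123456789abcdef"

  // Scalar stand-in for a position-returning escape scan.
  private def firstEscape(cs: Array[Char], from: Int): Int = {
    var i = from
    while (i < cs.length) {
      val c = cs(i)
      if (c < ' ' || c == '"' || c == '\\') return i
      i += 1
    }
    -1
  }

  // Writes the JSON escape for a single char that needs one.
  private def escapeInline(c: Char, out: java.lang.StringBuilder): Unit = {
    out.append('\\')
    c match {
      case '"'  => out.append('"')
      case '\\' => out.append('\\')
      case '\n' => out.append('n')
      case '\r' => out.append('r')
      case '\t' => out.append('t')
      case '\b' => out.append('b')
      case '\f' => out.append('f')
      case _ =>
        // Remaining control chars use the six-character hex form.
        out.append('u').append("00")
          .append(Hex.charAt((c >> 4) & 0xf)).append(Hex.charAt(c & 0xf))
    }
    ()
  }

  def renderQuoted(s: String): String = {
    val cs = s.toCharArray
    val out = new java.lang.StringBuilder(cs.length + 2)
    out.append('"')
    var start = 0
    var pos = firstEscape(cs, start)
    while (pos >= 0) {
      out.append(cs, start, pos - start) // bulk copy of the clean segment
      escapeInline(cs(pos), out)
      start = pos + 1
      pos = firstEscape(cs, start)
    }
    out.append(cs, start, cs.length - start) // trailing clean segment
    out.append('"')
    out.toString
  }
}
```

Each char is visited once by the scan and once by the copy, regardless of how many escapes the string contains.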
…afe propagation

Motivation:
String-heavy stdlib operations (substr, length, join, parseInt) had unnecessary overhead on Scala Native: codePointCount/offsetByCodePoints O(n) scans for ASCII strings, StringBuilder resize churn for join, and exception-based parseInt via Long.parseLong.

Modification:
- Add an ASCII fast path to Length and Substr using CharSWAR.isAllAscii: skip codePointCount/offsetByCodePoints for ASCII-only strings (99% case)
- Pre-sized char[] assembly for std.join: a two-pass approach calculates the exact output length, then copies with getChars, with zero resize overhead
- Hand-written parseDigits loop for parseInt/parseOctal/parseHex: no exception setup, no intermediate allocation, single pass
- Propagate the _asciiSafe flag: the parser sets it on ASCII string literals, Val.Str.concat preserves it when both children are ASCII-safe, and join propagates it through all elements

Result:
substr gap reduced from 2.03x to ~1.07x. parseint from 1.80x to ~1.0x. large_string_join from 1.81x to ~1.27x. realistic_2 benefits from the combined improvements.
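The two-pass join can be illustrated with a small sketch: the first pass sums the exact output length in a `Long`, the second copies each piece with `getChars` into a single pre-sized array. `PresizedJoin` and its signature are hypothetical, and the `Int.MaxValue` guard shown here follows the overflow check mentioned in a later commit on this branch rather than this commit itself.

```scala
object PresizedJoin {
  // Joins parts with sep using one exact-size allocation.
  def join(sep: String, parts: Array[String]): String = {
    if (parts.length == 0) return ""

    // Pass 1: exact output length, summed as Long so it cannot wrap.
    var total = sep.length.toLong * (parts.length - 1)
    var i = 0
    while (i < parts.length) { total += parts(i).length; i += 1 }
    if (total > Int.MaxValue)
      throw new IllegalArgumentException(s"joined string too long: $total chars")

    // Pass 2: copy every piece directly into its final position.
    val out = new Array[Char](total.toInt)
    var pos = 0
    i = 0
    while (i < parts.length) {
      if (i > 0) { sep.getChars(0, sep.length, out, pos); pos += sep.length }
      val p = parts(i)
      p.getChars(0, p.length, out, pos)
      pos += p.length
      i += 1
    }
    new String(out)
  }
}
```

Compared with a growing `StringBuilder`, the array is allocated once and never resized, at the cost of walking the input twice.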
Motivation:
Format.format() used StringBuilder, which starts small and resizes multiple times for large output. The large_string_template benchmark (591KB template, 256 interpolations) showed a 2.78x gap vs jrsonnet.

Modification:
- Three-pass approach: compute formatted values into a String array, calculate the exact total output length, then allocate a char[] and copy with getChars, which eliminates StringBuilder resize/copy overhead
- Add direct Val dispatch in the format loop: skip Materializer for common types (Str, Num, Bool, Null) to avoid a ujson.Value roundtrip

Result:
large_string_template gap reduced from 2.78x to ~1.88x. The remaining gap is dominated by Scala Native startup overhead (~7ms vs Rust ~1ms); pure computation time is within ~1ms of jrsonnet.
Motivation:
CI fails on two issues: (1) an unused `alwaysinline` import in Native CharSWAR.scala, (2) `\uXXXX` sequences in comments are parsed as unicode escapes in Scala 2.12, causing compilation errors.

Modification:
- Remove the unused `scala.scalanative.annotation.alwaysinline` import
- Escape backslash-u sequences in comments across BaseByteRenderer and Renderer

Result:
Full test suite passes across all platforms and Scala versions.
Motivation:
Split the JMH-positive, JDK17/JIT/GC-friendly long-string rendering piece out of #776. Keep this PR focused on byte rendering for long strings that contain JSON escapes; this does not include the broader format, stdlib, compareStrings, or Scala Native experiments from #776.

Modification:
- Add `CharSWAR.findFirstEscapeChar(byte[], from, to)` on JVM, Scala.js, and Scala Native.
- In `BaseByteRenderer`, keep the existing UTF-8 byte array for long strings, locate escape bytes, bulk-copy clean chunks with `System.arraycopy`, and escape only matching bytes inline.
- Precompute the exact escaped output length, reserve `ByteBuilder` once, then write directly to the backing byte array. This removes repeated `ensureLength`/`appendUnsafeC` calls from the dirty long-string loop.
- Use a static byte hex table for `\u00XX` control escapes.

JIT / GC shape:
- Hot code stays in simple `while` loops, `System.arraycopy`, and small private helpers.
- No reflection, no internal JDK APIs, no closures/iterators in the rendering loop.
- No per-chunk or per-escape objects are allocated by this follow-up; the existing per-long-string UTF-8 byte array remains the only temporary for this path.
- I tested a no-allocation ASCII scalar path, but rejected it because it regressed `large_string_template` and `large_string_join` JMH.

Notable results only:
JMH target run, same machine, same command shape on `upstream/master` and this branch:
`./mill -i bench.runRegressions bench/resources/cpp_suite/large_string_template.jsonnet bench/resources/cpp_suite/large_string_join.jsonnet`

| Benchmark | upstream/master | PR | Delta |
| --- | ---: | ---: | ---: |
| `large_string_template` | 1.552 ms/op | 1.154 ms/op | -25.6% / 1.34x faster |

Scala Native hyperfine, release-full native binary, 20 runs:

| Benchmark | upstream/master | PR | Delta |
| --- | ---: | ---: | ---: |
| `large_string_template` | 10.5 +/- 0.2 ms | 9.6 +/- 0.3 ms | -8.6% / 1.09x faster |

`large_string_join` was rechecked as a guardrail and stayed neutral, so it is intentionally omitted from the result tables.

Verification:
- `./mill -i 'sjsonnet.jvm[3.3.7].compile'`
- `./mill -i 'sjsonnet.jvm[3.3.7].test'`
- `./mill -i 'sjsonnet.js[3.3.7].compile' 'sjsonnet.native[3.3.7].compile'`
- `./mill -i 'sjsonnet.native[3.3.7].nativeLink'`
- `./mill -i __.checkFormat`
- `git diff --check`
- Focused JMH and Native hyperfine commands above

References:
- Split from #776
- Base: `b4c667d55d82d7c50c2103db967c33bebb0c2c98`
- Head: `ff70b63e`
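The "precompute the exact escaped length, allocate once, write directly" step described in this split can be sketched on plain byte arrays. This is an illustration under assumptions: `ExactEscape` and its helpers are hypothetical, a fresh `Array[Byte]` stands in for the reserved `ByteBuilder` region, the scan is scalar, and the length is accumulated in an `Int` without an overflow guard.

```scala
object ExactEscape {
  // Static lookup table for the two hex digits of a control escape.
  private val HexBytes: Array[Byte] = "0123456789abcdef".getBytes("US-ASCII")

  // Output size of one input byte: 1 (clean), 2 (short escape) or 6 (hex escape).
  private def escapedSize(b: Byte): Int = {
    val u = b & 0xff
    if (u == 0x22 || u == 0x5c) 2
    else if (u >= 0x20) 1
    else u match {
      case 0x08 | 0x09 | 0x0a | 0x0c | 0x0d => 2
      case _                                => 6
    }
  }

  // Writes the escape for byte value u at pos and returns the new position.
  private def writeEscape(u: Int, out: Array[Byte], pos: Int): Int = {
    out(pos) = '\\'.toByte
    val code: Int = u match {
      case 0x22 => '"'.toInt
      case 0x5c => '\\'.toInt
      case 0x08 => 'b'.toInt
      case 0x09 => 't'.toInt
      case 0x0a => 'n'.toInt
      case 0x0c => 'f'.toInt
      case 0x0d => 'r'.toInt
      case _    => 0
    }
    if (code != 0) { out(pos + 1) = code.toByte; pos + 2 }
    else {
      out(pos + 1) = 'u'.toByte
      out(pos + 2) = '0'.toByte
      out(pos + 3) = '0'.toByte
      out(pos + 4) = HexBytes(u >> 4)
      out(pos + 5) = HexBytes(u & 0xf)
      pos + 6
    }
  }

  // Returns the quoted, escaped form of UTF-8 bytes in one exact-size array.
  def escape(src: Array[Byte]): Array[Byte] = {
    var total = 2 // opening and closing quotes
    var i = 0
    while (i < src.length) { total += escapedSize(src(i)); i += 1 }

    val out = new Array[Byte](total)
    out(0) = '"'.toByte
    var pos = 1
    var start = 0
    i = 0
    while (i < src.length) {
      if (escapedSize(src(i)) != 1) {
        System.arraycopy(src, start, out, pos, i - start) // clean chunk
        pos += i - start
        pos = writeEscape(src(i) & 0xff, out, pos)
        start = i + 1
      }
      i += 1
    }
    System.arraycopy(src, start, out, pos, src.length - start)
    pos += src.length - start
    out(pos) = '"'.toByte
    out
  }
}
```

Because the output size is known before any byte is written, the write loop needs no capacity checks, which is the property the split relies on when it removes the repeated `ensureLength` calls.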
|
Closing obsolete broad draft. The useful render work has been or should be split into smaller focused PRs with current docs-aligned data; this branch is now conflicting and too broad to carry forward as-is. |
|
Reopened. This broad branch still conflicts heavily with current renderer/SWAR code and overlaps later split PRs. Keep as draft/source material for extracting smaller PRs rather than closing it as negative. |
|
Rebase retry against current upstream/master still conflicts at the first renderer/SWAR commit: sjsonnet/src-js/sjsonnet/CharSWAR.scala, sjsonnet/src-native/sjsonnet/CharSWAR.scala, and sjsonnet/src/sjsonnet/BaseByteRenderer.scala. Keeping this as draft/source material; not closing because this is not a negative benchmark result. |
Status: split in progress; do not merge this large PR as-is.
This PR is now an experiment/archive branch for the broader Scala Native render-pipeline work. Current master has moved to the JDK17 compilation level, so each optimization split should be independently justified as JDK17/JIT/GC-friendly and verified with focused JMH, GC, and Native CLI hyperfine data.
First Focused Split:
perf: chunk long string byte escaping #809

Benchmark Summary:
#809 now contains the full 36-case JMH+GC sweep, focused rechecks for suspicious rows, Native hyperfine data, and a repeated correctness review. Keep the detailed numbers in #809 as the source of truth; the short summary here is only the target/no-regression view.

Focused target JMH + GC, lower is better:
- large_string_template
- large_string_join

Scala Native hyperfine, each run loops 20 CLI invocations and divides back to per-invocation milliseconds:
- large_string_template
- large_string_join

Rejected From the First Split Batch:
- Format.scala char-array assembly: not JMH-positive on current master.
- length/substr/asciiSafe/join group: substr regressed, so it should not be split out as-is.
- std.join exact-capacity builder: allocation improved in one run, but no-prof JMH regressed.
- String.indexOf escape scan: tiny signal only, not enough for a separate PR.

Next Split Bar:
System.arraycopy, public JDK17 APIs, and existing buffers.

Original Context:
ef97244e7536e31089be8410a80b19b3ae1448803a9a492899420456070fb84eaa5b89f8b7dfe1bfed9af56139ad6a066910483104bae165cef53d16