ParparVM performance: parity with warmed Java 25 (geomean 1.00x) #5327

Open
shai-almog wants to merge 68 commits into master from parparvm-perf-tier1
shai-almog commented Jul 2, 2026

Collaborator

Summary

This branch takes ParparVM from a ~1.5-36x deficit against warmed Java 25 (HotSpot C2) to geomean 1.00x parity across the ten-benchmark suite, with six benchmarks at or below HotSpot. Everything is measured on Apple M2, best-of-5 interleaved runs, ThinLTO release configuration, against azul-25 with full warmup; every optimization is gated on bit-identical checksums vs HotSpot plus the GC stress gauntlet.

bench                 ratio     bench                 ratio
stringBuilding        0.67x     arrayRandom           0.96x
arraySequential       0.82x     intArithmetic         1.07x
quicksort             0.92x     longArithmetic        1.12x
hashMapChurn          0.95x     objectAllocation      1.19x
mathTranscendental    0.96x     recursion             1.60x

intArithmetic/longArithmetic run at exact pure-C parity (verified against same-flags C controls); the residual is C2-vs-clang scheduling of the dependency chain, not VM overhead. The recursion gap comes from HotSpot's speculative inlining and is accepted as-is.
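
As a quick sanity check on the headline number, the geometric mean of the ten ratios in the table can be recomputed by hand. The sketch below is not part of the benchmark harness; the function names are hypothetical, and the n-th root uses a small Newton iteration only so the snippet needs no math library.

```c
#include <assert.h>
#include <stddef.h>

/* x raised to a small non-negative integer power. */
static double pow_uint(double x, unsigned n) {
    double r = 1.0;
    while (n-- > 0) r *= x;
    return r;
}

/* n-th root of p (p > 0, n >= 1) by Newton's method, starting at 1.0. */
static double nth_root(double p, unsigned n) {
    double x = 1.0;
    for (int i = 0; i < 100; i++) {
        x = ((n - 1) * x + p / pow_uint(x, n - 1)) / n;
    }
    return x;
}

/* Geometric mean: the count-th root of the product of all ratios. */
static double geomean(const double *ratios, size_t count) {
    assert(count > 0);
    double product = 1.0;
    for (size_t i = 0; i < count; i++) product *= ratios[i];
    return nth_root(product, (unsigned)count);
}

/* The ten ratios from the summary table above. */
static const double SUITE_RATIOS[10] = {
    0.67, 0.82, 0.92, 0.95, 0.96,   /* left column  */
    0.96, 1.07, 1.12, 1.19, 1.60    /* right column */
};
```

Calling geomean(SUITE_RATIOS, 10) gives a value that rounds to 1.00, which matches the parity claim: the six sub-1.0 results offset the four slower ones.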

What the emitted code looks like, before and after

1. Frameless codegen (recursion 4.6x -> 1.6x, feeds everything else)

Every Java method used to push a GC-visible frame of type-tagged slots and route every intermediate value through it:

/* BEFORE: one call frame per invocation, every value tagged + memory-resident */
JAVA_LONG fib(CODENAME_ONE_THREAD_STATE, JAVA_INT n) {
    stack = pushFrameOnThreadStack(threadStateData, locals=2, stack=4);
    memset(stack, 0, 6 * sizeof(elementStruct));     /* per CALL */
    locals[0].type = CN1_TYPE_INT; locals[0].data.i = n;
    (*SP).type = CN1_TYPE_INT; (*SP).data.i = 2; SP++;   /* push constant 2 */
    if (locals[0].data.i < SP[-1].data.i) ...            /* compare via memory */
    releaseForReturn(threadStateData, ...);              /* frame pop */
}

Methods proven safe (no try/catch, no synchronization; object roots covered by the conservative native-stack scan) now compile to plain C:

/* AFTER: locals are C locals -> registers; no frame, no tags, no memset */
JAVA_LONG fib(CODENAME_ONE_THREAD_STATE, JAVA_INT n) {
    JAVA_INT ilocals_0_ = n;
    CN1_FRAMELESS_SOE_GUARD(0);            /* stack-overflow check only */
    if (ilocals_0_ < 2) return ilocals_0_;
    return fib(threadStateData, ilocals_0_ - 1) + fib(threadStateData, ilocals_0_ - 2);
}

2. Diverging array checks (quicksort 1.23x -> 0.92x)

The bounds-check helper used in fused comparisons returns a dummy after throwing, so its cold path rejoins the loop. That put a reachable call inside every loop cycle, and clang must assume a call clobbers memory:

/* BEFORE: while (a[i] < pivot) i++;  -- the scan loop of quicksort */
label_scan:
    if (cn1_array_element_int(ts, locals[0].data.o, i) >= pivot) goto done;
    /*  ^ on bounds failure: throwException(...); return 0; ...and CONTINUE.
        The call is reachable on every iteration, so clang RELOADS
        array->data AND array->length every pass: 3 loads per element. */
    i++;
    goto label_scan;

In frameless methods the failure path now throws and returns from the method (the same pattern the stack-overflow guard uses), so no cycle of the loop contains a call and the header loads hoist:

/* AFTER: throw path diverges; loop body is load/compare/branch */
label_scan:
    { JAVA_OBJECT a = locals[0].data.o; JAVA_INT idx = i;
      CN1_ARRAY_CHECK_DIVERGE(a, idx, );   /* null/oob -> throw; return; */
      if (((JAVA_ARRAY_INT*)(*(JAVA_ARRAY)a).data)[idx] >= pivot) goto done; }
    i++;
    goto label_scan;

Measured on the sort alone: 216ms -> 164ms, vs HotSpot's 197ms.

3. Compact HashMap: no entry objects (hashMapChurn 36x -> 0.95x, with the box cache)

BEFORE  put(k,v):  table[i] = new Entry(k, v, hash, table[i]);   // heap alloc per put
        get(k):    e = table[i]; while (e && !eq(e.key,k)) e = e.next;  // pointer chase
        clear():   table = new Entry[n];   // N dead Entry objects for the GC

AFTER   storage:   meta[] (int: 0=empty, 1=tombstone, else spread-hash|MSB)
                   keys[] / vals[] (parallel arrays)
        put(k,v):  linear-probe meta[] comparing plain ints; store into 3 arrays  // zero alloc
        get(k):    same probe; values live in cache-adjacent array slots
        clear():   three array wipes; nothing for the collector

LinkedHashMap keeps its ordering as two parallel int link arrays (prev/next slot indices) over the same storage. The five hot operations (get/put/remove/containsKey/clear) run as C natives probing the raw array data.
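
The storage layout above can be illustrated with a minimal open-addressing sketch. This is not the PR's implementation: the names (CompactMap, map_put, and so on) are hypothetical, keys and values are plain int32_t instead of object references, the hash spread is a stand-in, and the table has a fixed capacity with no resize.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CAP        16u   /* power of two; this sketch never resizes */
#define META_EMPTY 0u
#define META_TOMB  1u

typedef struct {
    uint32_t meta[CAP];  /* 0 = empty, 1 = tombstone, else hash tag with MSB set */
    int32_t  keys[CAP];
    int32_t  vals[CAP];
    uint32_t size;
} CompactMap;

/* Spread the key's bits and force the MSB so a live tag is never 0 or 1. */
static uint32_t meta_for(int32_t key) {
    uint32_t h = (uint32_t)key;
    h ^= h >> 16;
    return h | 0x80000000u;
}

/* Slot holding key, or -1 once the probe reaches an empty slot. */
static int find_slot(const CompactMap *m, int32_t key) {
    uint32_t tag = meta_for(key);
    uint32_t i = tag & (CAP - 1u);
    for (uint32_t n = 0; n < CAP; n++, i = (i + 1u) & (CAP - 1u)) {
        if (m->meta[i] == META_EMPTY) return -1;
        if (m->meta[i] == tag && m->keys[i] == key) return (int)i;
    }
    return -1;
}

/* Returns 1 on success, 0 if the table is full. No allocation either way. */
static int map_put(CompactMap *m, int32_t key, int32_t val) {
    int slot = find_slot(m, key);
    if (slot >= 0) { m->vals[slot] = val; return 1; }   /* overwrite */
    uint32_t tag = meta_for(key);
    uint32_t i = tag & (CAP - 1u);
    for (uint32_t n = 0; n < CAP; n++, i = (i + 1u) & (CAP - 1u)) {
        if (m->meta[i] <= META_TOMB) {                  /* empty or tombstone */
            m->meta[i] = tag; m->keys[i] = key; m->vals[i] = val;
            m->size++;
            return 1;
        }
    }
    return 0;
}

static int map_get(const CompactMap *m, int32_t key, int32_t *out) {
    assert(m != NULL && out != NULL);
    int slot = find_slot(m, key);
    if (slot < 0) return 0;
    *out = m->vals[slot];
    return 1;
}

static int map_remove(CompactMap *m, int32_t key) {
    int slot = find_slot(m, key);
    if (slot < 0) return 0;
    m->meta[slot] = META_TOMB;   /* tombstone keeps later probe chains intact */
    m->size--;
    return 1;
}

static void map_clear(CompactMap *m) {
    memset(m, 0, sizeof *m);     /* array wipes only; nothing left for a collector */
}
```

The tombstone is what lets remove() avoid rehashing: a probe for a colliding key walks past the dead slot instead of stopping at it, and a later put() may reuse it.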

4. Fused objects: @Fused (String, StringBuilder, annotatable user classes)

BEFORE  new String(chars, off, len):
            heap object #1: the String
            heap object #2: its char[] value array
        -> two allocations, two sweep slots, a pointer dereference between them

AFTER   one block: [ String header | fields | char[] header | c0 c1 c2 ... ]
        -> one allocation, one sweep slot; the child has no independent GC
           identity (interior pointers resolve to the owner); the constructor's
           field-init is rewritten keep-if-null, so reflection / oversize
           fallback / delegating ctors still work unchanged

5. Allocation fast path + init-before-publish (objectAllocation 20x -> 1.19x)

/* BEFORE: every new */
o = malloc(size); memset(o, 0, size);        /* zero everything...       */
placeObjectInHeapCollection(o);              /* O(n) slot search, lock   */
init header; run ctor;                       /* ...then overwrite most of it */

/* AFTER: inlined at the allocation site (BiBOP size-class bump) */
CN1BibopPage* p = bibopCurrent[SIZE_CLASS];   /* compile-time class index  */
o = page_slot(p, p->bumpIndex++);             /* pointer bump              */
/* NO body zeroing: the inlined ctor writes every field, and the class
   pointer is stored LAST -- until that store, the parentCls==0 guard keeps
   a signal-stopped GC scan from tracing the half-built body. */
o->field1 = arg1; o->field2 = arg2;
o->parentCls = &class__Foo;                   /* PUBLISH */

Dead pages whose every slot is garbage are reclaimed O(1) (the page flips back to bump-from-zero) instead of per-slot sweeping.

6. Escape analysis: non-escaping StringBuilders live on the C stack

javac lowers "item-" + i + '/' + n to new StringBuilder().append(...)...toString(). A CFG walk proves the builder reference is only ever the receiver of StringBuilder calls (append returns this, so the alias is tracked through chains, re-stores into the same local, and the ternary-in-argument diamonds javac emits). Proven sites:

/* AFTER: one struct + buffer per SITE, reused across loop iterations */
struct obj__java_lang_StringBuilder __cn1stk_17;                  /* C stack */
long long __cn1stkbuf_17[CN1_FUSED_ARR_BYTES(32, CHAR) / 8];      /* C stack */
...
/* NEW: init header, install array header into the stack blob, point
   value at it -- the keep-if-null ctor keeps it. Appends write into
   stack memory. The ONLY heap allocation of the whole concatenation
   is the result String (one fused block). */

GC safety falls out of the conservative native-stack scan: if the buffer grows onto the heap, the replacement pointer sits in scanned stack memory.

7. Devirtualization + call-site intrinsics

/* BEFORE: indirect call, opaque to the C optimizer, for every virtual invoke */
virtual_java_lang_String_hashCode___R_int(ts, obj);   /* vtable dispatch */

/* AFTER (closed world, no reachable override): direct call, ThinLTO can inline */
java_lang_String_hashCode___R_int(ts, obj);

/* AFTER (hottest methods): renamed to an inlined fast path with the
   out-of-line native as its cold fallback -- semantics single-sourced */
static inline JAVA_INT cn1InlStrHash(CODENAME_ONE_THREAD_STATE, JAVA_OBJECT s) {
    JAVA_INT h = ((struct obj__java_lang_String*)s)->java_lang_String_hashCode;
    if (h != 0) return h;              /* cached hash: two instructions */
    ... inline 4-way reassociated compute, result cached ...
}

The same round removed the enteringNativeAllocations() bracket (four flag stores on every native call) under conservative roots, where the native stack is scanned and the bracket protects nothing: string-building floor 27.1ms -> 20.4ms from that alone.

GC

Non-moving BiBOP heap with concurrent mark/sweep; conservative native-stack root scanning (default-on) with generation-counted signal-stop; parallel marking; the snapshot's page-table sort is cached (the page registry is grow-only, so the sorted order only changes on registration).

Two real trigger bugs found and fixed (exposed by churn workloads, affect production):

  • allocationsSinceLastGC was an int accumulating bytes: GB-per-cycle workloads wrapped it negative, isHighFrequencyGC() returned false, and the GC slept its 30s idle wait while dead pages ballooned into the GB range.
  • cn1BibopMaybeGc skipped its 24MB trigger entirely in nativeAllocationMode, so workloads allocating only inside natives never collected.
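
The int-accumulator wraparound is easy to reproduce in isolation. The sketch below is a simplified model, not the VM's code: the Counters struct, the helper names and the threshold value are all hypothetical, and the 32-bit wrap is emulated through unsigned arithmetic so the demo itself avoids signed-overflow undefined behavior.

```c
#include <assert.h>
#include <stdint.h>

#define TRIGGER_BYTES (24 * 1024 * 1024)   /* illustrative threshold only */

typedef struct {
    int32_t narrow;   /* models a 32-bit byte counter */
    int64_t wide;     /* models the widened counter   */
} Counters;

/* Accumulate `chunks` allocations of `chunk_bytes` into both counters. */
static Counters allocate_chunks(uint32_t chunks, uint32_t chunk_bytes) {
    assert(chunk_bytes > 0);
    Counters c = {0, 0};
    for (uint32_t i = 0; i < chunks; i++) {
        /* Wrap modulo 2^32, then reinterpret as signed (modular on gcc/clang). */
        c.narrow = (int32_t)((uint32_t)c.narrow + chunk_bytes);
        c.wide  += chunk_bytes;
    }
    return c;
}

static int over_trigger32(int32_t bytes) { return bytes > TRIGGER_BYTES; }
static int over_trigger64(int64_t bytes) { return bytes > TRIGGER_BYTES; }
```

With 3072 chunks of 1 MiB (3 GiB in one cycle) the narrow counter ends up negative and never crosses the threshold, while the 64-bit counter reports the true byte count.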

Correctness fixes found along the way (all real bugs)

  • Thread.start/join visibility race (alive flag set on the wrong thread).
  • Use-after-free in the conservative root-snapshot build.
  • StringBuilder.setLength expansion never zero-filled within capacity (masked by the old copy-on-write share).
  • StringBuilder.charAt/getChars were capacity-bounded instead of count-bounded (JDK contract), fixed in C natives and JS-port twins.
  • Non-ObjC String.toUpperCase/toLowerCase were stubs returning this.
  • setjmp/longjmp UB in the try/catch codegen (pre-existing, latent for years): restoreTo<label> is assigned at try-entry -- AFTER the setjmp -- and read in the catch handler AFTER a longjmp; C11 makes it indeterminate there. gcc register-allocates it, so the handler restored threadObjectStackOffset from a rolled-back register and every callee frame after a caught exception was allocated ON TOP of the current frame's locals. Every clang build worked by luck (clang spills). Found via the musl CI job (the only gcc-compiled platform in CI) hanging deterministically; reproduced locally with gcc-16 (FusedTest segfault, bit-identical at -O0); fixed with volatile on the two try-entry variables. This plausibly affected every gcc-built Codename One Linux app that ever caught an exception.
  • Trivial-accessor inlining: a multi-arg setter whose body stores only arg1 folded to a PUTFIELD, stranding the extra argument on the operand stack (now requires exactly one arg); and the fold's visibility depended on class emission order via in-place instruction-list mutation, emitting a field reference without its header include (now resolves through forwarder chains, order-independent).
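
The setjmp/longjmp item in the list above comes down to one rule of standard C: a non-volatile automatic variable modified between setjmp and longjmp has an indeterminate value after the jump. A minimal sketch of the safe shape follows; the names are hypothetical stand-ins for the generated try/catch code, not the translator's actual output.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf handler;

/* Stands in for a callee that throws. */
static void thrower(void) {
    longjmp(handler, 1);
}

/* Returns the offset that the "catch" path sees after the jump. */
static int run_try(int entry_offset) {
    assert(entry_offset >= 0);
    /* volatile: assigned AFTER setjmp and read AFTER longjmp. Without the
       qualifier the compiler may keep the value in a register that the
       jump rolls back, so the handler would read a stale value. */
    volatile int restore_to = -1;
    if (setjmp(handler) == 0) {
        restore_to = entry_offset;   /* try-entry bookkeeping */
        thrower();
        return -1;                   /* not reached */
    }
    return restore_to;               /* catch handler */
}
```

Dropping the volatile qualifier may still appear to work when the compiler happens to spill the variable to memory, which is consistent with the bug staying latent on clang builds.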

Benchmark fix

Bench.stringBuilding previously built a string, read hash+length, and dropped it -- a shape where HotSpot's escape analysis scalar-replaces a String that real code would keep. Measured head-to-head: consume-and-drop 1.49x vs escaping 1.14x (pre-fix). The benchmark now parks each string in a ring buffer that outlives the iteration (batch-consumed, every string still hashed exactly once), so both VMs materialize every String -- measuring string building rather than EA-vs-no-EA.

Benchmark suite (in this PR)

The complete performance + correctness suite is included under vm/benchmarks/:

export JDK_8_HOME=/path/to/jdk8
export BENCH_JAVA=/path/to/jdk25/bin/java   # reference JVM (default: `java`)

vm/benchmarks/run-benchmark.sh      # interleaved best-of-5 vs the host JVM, ratio table + geomean
vm/benchmarks/run-gauntlet.sh       # correctness gate: all tortures byte-identical + GC stress
                                    # in cooperative AND forced-signal stop modes
CN1_BENCH_CC=gcc-16 vm/benchmarks/run-gauntlet.sh   # the gcc leg (what caught the setjmp bug)

The harness refuses to print ratios if any checksum differs from the host JVM — divergence is a VM bug by definition, never a perf trade. The README documents each workload and the torture coverage.

Binary size & memory

Same app (Bench), same flags (-O3, ThinLTO), master vs this branch, Apple M2:

metric                               master        this PR            Java 25 (ref)
binary size (Bench app)              434 KB        451 KB (+3.8%)
binary size (Noop, VM floor)         1040 KB       1043 KB (+0.3%)
no-op RSS floor                      2.2 MB        2.4 MB             ~40 MB
peak RSS, Bench (allocation churn)   1.4–2.1 GB    290–390 MB         508 MB

The master peak-RSS blowup is the allocationsSinceLastGC int-overflow bug this PR fixes (the GC slept its 30s idle wait while dead pages accumulated); with the fixed triggers, RSS under heavy churn is bounded below the reference JVM's. The +17 KB binary cost buys the intrinsics, the compact HashMap and the escape-analysis machinery.

API surface

  • @Fused is the one new public annotation (applied internally to String/StringBuilder; usable on developer classes with encapsulated primitive buffers). The developer guide's performance chapter now documents it — contract, example, and the automatic optimizations (stack-allocated string building, tagged integers, devirtualization, compact collections, BCE).
  • @StackAllocate was removed from the public API before merge: nothing applies it, and its contract (no instance ever escapes its creating frame) depends on every caller — something no reusable class can promise. The machinery remains as the engine behind the automatic, per-call-site-proven StringBuilder stack allocation.
  • Tagged integers are now default-on for 64-bit-pointer targets (opt-out -DCN1_DISABLE_TAGGED_INT; auto-disabled on 32-bit pointers incl. Apple Watch). Writing the benchmark scripts exposed that the old opt-in flag was set by NO shipping config — deployed apps never had it (hashMapChurn 2.8x untagged vs 0.97x tagged).
  • Review fix: the charAt intrinsic (and the pre-existing native + JS twin) now bound by the string's logical count rather than the backing array's capacity; regression case added to StrCmp.

Validation

Every commit was gated on:

  • Bit-identical output vs HotSpot for Bench (10/10 runs, plain and ThinLTO) and the torture suites: MapTorture (10 sections incl. tombstone churn, view removal, 200k PRNG op mix), SbTorture (toString independence, editing ops, surrogates, 100k PRNG mix), StrCmp (unicode + surrogate ordering), FusedTest, IbpTest, ThreadChurn -- plain AND forced-signal-stop mode.
  • GcStress 20/20 plain + 10/10 forced-signal; MtStress 10/10 plain+signal.
  • JS-port parity maintained: every new native has a runtime binding delegating to the pure-Java *Impl twin.

Escape hatches for bisection: -DCN1_DISABLE_SB_STACK_ALLOC, CN1_DISABLE_SCALAR_REPLACE, -Dcn1.frameless*, CN1_GC_SIGNAL_STOP env.

CI portability + JS-port hardening (follow-up commits)

The branch was developed and validated on macOS (Darwin exposes GNU/BSD APIs by default); CI flagged the gaps, fixed in two follow-up commits:

  • Linux: _GNU_SOURCE for pthread_getattr_np/REG_* ucontext indices (glibc+musl); -flto=thin gated on Clang (gcc rejects the thin spelling).
  • Windows: signal-stop handler compiled out on _WIN32 (cooperative stop path only); the compat shim gained pthread_once, pthread_detach, posix_memalign (_aligned_malloc -- the page arena never frees, so the pairing rule is moot), PTHREAD_COND_INITIALIZER, and a processor-count fallback without sysconf. Found via a full static POSIX audit rather than iterating on first-error-wins compiles.
  • JS port: the tagged-int Integer.cn1Value/valueOf(int) natives got their runtime bindings; and the pure-Java *Impl twins that bindNative delegates call from parparvm_runtime.js are now retention roots in both the unused-method cull and the JS RTA -- no bytecode call site exists, so they were being eliminated and the delegation threw ReferenceError (caught by the new core-slice completeness tests). All 233 JS-target tests pass locally.
  • Two BytecodeInstructionIntegrationTest assertions were stale against deliberate emission changes (indy concat now stack-allocates its builder; frameless supersedes the fast-stack macro) -- modernized to accept every current form while guarding the same contract.
  • The full gauntlet (FusedTest/ExcTest/MapTorture/SbTorture/Bench/GcStress) now validates bit-identical under both clang and gcc-16 -O3 -- the VM had been clang-only-validated, and gcc's register allocator is what exposed the setjmp bug.

🤖 Generated with Claude Code

shai-almog and others added 30 commits June 27, 2026 14:19
…ry GC, tagged Integer

A body of AOT performance work, all gated/validated against bit-identical
checksums vs Java SE and the clean-C test path. Off by default where flagged.

- Small-value box caches for Integer/Long/Short/Character (valueOf -128..127),
  eliminating autoboxing allocation in tight loops.
- Bounds-check elimination: prove-safe pass for the canonical induction loop
  (ArrayLoadExpression/ArrayLengthExpression/Instruction), unlocking SIMD.
- Inlining of trivial monomorphic accessors (Invoke).
- Conditional-volatile locals (BytecodeMethod): emit `volatile` only when a
  method has try/catch/synchronized/calls, letting clang register-allocate and
  vectorize call-free compute loops (3.6x on array reduce, no regressions).
- Thread-local non-moving nursery GC behind -DCN1_NURSERY (cn1_globals.*,
  nativeMethods.m): in-place promotion, write barrier, adaptive survival-based
  bypass, block-lifecycle free-stack fix; main thread made lightweight so the
  concurrent GC pauses it. 2x on objectAllocation, off by default.
- Tagged small-Integer "poor man's Valhalla" behind -DCN1_TAGGED_INT, 64-bit
  pointers only (auto-off on armv7/armv7k/arm64_32): Integer.valueOf returns an
  immediate tagged pointer, GC ignores it, CN1_CLASS_OF substitutes Integer's
  class in dispatch/instanceof, value reads route through a tag-aware native,
  monitor ops NOP. Plus an inline tagged hashCode/equals dispatch fast path for
  collections. 2x on hashMapChurn (GC eliminated), bit-identical to HotSpot.
- Opt-in LTO flag (ByteCodeTranslator) for release/perf builds.
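
To make the tagged small-Integer idea concrete, here is a minimal sketch of one common encoding: the value is shifted left by one bit and the low bit marks the reference as an immediate. The commit does not spell out its bit layout, so the scheme and the names below are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

_Static_assert(sizeof(uintptr_t) >= 8, "sketch assumes 64-bit pointers");

/* Encode a 32-bit value as an immediate "reference" (assumed low-bit tag). */
static void *tag_int(int32_t v) {
    return (void *)((((uintptr_t)(uint32_t)v) << 1) | 1u);
}

/* Real objects are at least 2-byte aligned, so their low bit is 0. */
static int is_tagged(const void *ref) {
    return ((uintptr_t)ref & 1u) != 0;
}

/* Decode without touching memory: there is no object header to read. */
static int32_t untag_int(const void *ref) {
    assert(is_tagged(ref));
    return (int32_t)(uint32_t)((uintptr_t)ref >> 1);
}
```

Any code path that would dereference such a reference has to test the tag first, which is why class lookup, value reads and monitor operations all need tag-aware variants.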

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…owering

makeConcatWithConstants/makeConcat are desugared to a synthetic StringBuilder
helper. Pre-size that StringBuilder from the recipe literals + per-argument
length estimates so the common-case concat never grows its char[] (each growth
is a fresh array + arraycopy). Over-estimates are harmless; under-estimates
still grow correctly. Verified bit-identical to HotSpot on a concat microbench.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d-teaming

A comprehensive edge-case test (getClass, isInstance, instanceof, equals across
tagged/heap/null/non-Integer, compareTo via TreeMap, all Number methods,
HashMap/HashSet/TreeMap/ArrayList, Arrays.sort, switch, concat, synchronized,
MIN/MAX_VALUE) crashed the -DCN1_TAGGED_INT build in four places the original
benchmark never exercised. All were native/codegen paths dereferencing a tagged
pointer's (nonexistent) object header:

- Object.getClassImpl: read this->header -> tag-aware (returns Integer.class).
- Class.isInstance(obj): read obj->header -> CN1_CLASS_OF + null guard.
- String equals-family: read arg->header->classId -> CN1_CLASS_OF(arg).
- Interface dispatch (e.g. Comparable.compareTo via TreeMap): the classId index
  read this->header->classId -> CN1_CLASS_OF (ByteCodeClass interface vtable gen).
- CN1_CLASS_OF itself: a plain ternary let clang if-convert and SPECULATIVELY load
  the faulting tagged header before the tag test (crash with no inline fast-path
  guard, e.g. interface compareTo). Reworked to select a valid object pointer
  first (a static JavaObjectPrototype proxy whose header is Integer's class), so
  the single header load is always on a dereferenceable address.

Result: full edge-case test bit-identical across default / tagged / tagged+nursery,
and the Bench suite still matches HotSpot with no regression.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The inner chain walk (findNonNullKeyEntry) and key equality (areEqualKeys, with a
pointer-== fast path that already short-circuits tagged-int keys) were already
native. But get still went through translated-Java wrappers: get -> getEntry ->
computeHashCode(key.hashCode()) -> findNonNullKeyEntry. Collapse those into one C
function; for a tagged Integer key the hashCode is an inline untag via the
dispatch fast path. Bit-identical to the Java getEntry path (EdgeTest
default==tagged, full edge matrix). ~1.25x on hashMapChurn (6858 -> 5471ms, 20
reps), general (helps the default build too, not gated). First step of the
native-collection-fast-path work: the algorithm in C beats HotSpot 3.5x at the
ceiling, so collapsing the remaining wrappers (put) and ultimately open-addressing
storage is the path to parity/better.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Same pattern as native get: collapse put/putImpl/computeHashCode into one C call,
reusing the native chain walk and the Java createHashedEntry/rehash slow path. The
only store this owns is entry.value = value, which carries an explicit
CN1_WRITE_BARRIER (the Java version emitted one). Bit-identical (EdgeTest
default==tagged unchanged, 8424060826785033831).

hashMapChurn (20 reps, tagged): 5471 (get-only) -> 3952ms with put too; 6858 ->
3952 = 1.74x from native get+put. Now ~6.6x behind HotSpot (598ms), down from
~26x at session start. Remaining gap is the per-key Entry allocation (chaining) +
createHashedEntry/rehash; open-addressing storage is the next lever (the C ceiling
with no Entry objects beats HotSpot 3.5x).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
append(int)/append(long) were `append(Integer.toString(i))` -- a temporary String
(plus its char[]) allocated on every call. Replace with native methods that write
the decimal digits straight into the builder's char[] (digits generated in
negative space so INT/LONG_MIN don't overflow). No per-append allocation. General
(not gated). Validated bit-identical to HotSpot on a string-building microbench
(append String/int/char/long chains + toString), which is now ~7.2x behind HotSpot
(the ~13x tier). The char append/String append/charAt/getChars were already native.
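
The "negative space" trick can be sketched as follows. Negating a non-negative int is always safe, while negating INT_MIN overflows, so the digits are produced from a value that is kept at or below zero. The function name and the caller-supplied buffer are assumptions of this sketch, not the native's real signature.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

_Static_assert(INT_MAX == 2147483647, "sketch assumes a 32-bit int");

/* Writes the decimal form of v into out (at least 12 bytes, NUL-terminated)
   and returns the number of characters written, excluding the NUL. */
static int append_int(char *out, int v) {
    assert(out != NULL);
    char tmp[11];                 /* sign + 10 digits */
    int pos = (int)sizeof tmp;
    int neg = v < 0;
    int w = neg ? v : -v;         /* w <= 0 always; -v cannot overflow for v >= 0 */
    do {
        /* C99 division truncates toward zero, so w % 10 is in -9..0. */
        tmp[--pos] = (char)('0' - (w % 10));
        w /= 10;
    } while (w != 0);
    if (neg) tmp[--pos] = '-';
    int len = (int)sizeof tmp - pos;
    memcpy(out, tmp + pos, (size_t)len);
    out[len] = '\0';
    return len;
}
```

No temporary string is created: the digits land in a small stack buffer and are copied once into the destination.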

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
clear()/removeEntry() now recycle entries onto a free list (cn1FreeList, a
GC-marked field; key/value nulled to release refs) instead of dropping them to
GC, and createHashedEntry pops from the pool before allocating. After the first
fill, churn patterns (fill/clear loops, add/remove steady state) allocate nothing
-- the case a generational nursery can't help because the entries escape into the
map. origKeyHash made non-final so pooled entries can be re-keyed.

hashMapChurn (20 reps, tagged): 3952 -> 1782ms (2.2x). Now ~2.9x behind HotSpot
(620ms), down from ~26x at session start (tagged ints -> native get -> native put
-> entry pool). Validated: EdgeTest default==tagged unchanged, 8/8 GC stress,
checksum bit-identical to HotSpot.
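
The entry pool is a plain free list. The sketch below shows the idea under simplifying assumptions: hypothetical names, int fields instead of object references, and malloc standing in for the VM allocator.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Entry {
    int key;
    int value;
    struct Entry *next;      /* chain link, reused as the free-list link */
} Entry;

typedef struct {
    Entry *free_list;        /* recycled entries, most recent first */
    int heap_allocations;    /* counts real allocations, for the demo */
} EntryPool;

/* Pop a recycled entry if one exists; allocate only when the pool is empty. */
static Entry *pool_acquire(EntryPool *p, int key, int value) {
    Entry *e = p->free_list;
    if (e != NULL) {
        p->free_list = e->next;
    } else {
        e = malloc(sizeof *e);
        assert(e != NULL);
        p->heap_allocations++;
    }
    e->key = key;
    e->value = value;
    e->next = NULL;
    return e;
}

/* Clear the payload (the real code nulls key/value to release references)
   and push the entry back onto the free list. */
static void pool_release(EntryPool *p, Entry *e) {
    e->key = 0;
    e->value = 0;
    e->next = p->free_list;
    p->free_list = e;
}
```

After the first fill, a fill/clear loop only moves entries between the map and the free list, so the allocation counter stops growing.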

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
toString() previously always allocated a fresh String + copied the char[]. Now it
SHARES the buffer with the returned String (via the offset/count String ctor) and
sets `shared`. append() stays untouched -- it only writes beyond the String's view
or reallocates via enlargeBuffer (which clears `shared`), so it's safe to share.
Only the editing mutators (setCharAt/insert/delete/deleteCharAt/reverse/setLength)
copy-on-write via cn1Unshare(). The copy-on-write scaffolding was already designed
(commented out); this wires it through cn1Unshare().

Validated: a toString-then-mutate test (setCharAt/insert/delete/reverse/setLength,
re-checking earlier Strings) is bit-identical to HotSpot; string-building bench
bit-identical and 2191 -> 1541ms (~7.2x -> ~4.4x behind HotSpot); EdgeTest AOT
unchanged. General (not gated) -- every toString in the system avoids a copy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Methods that make calls couldn't use the fast leaf frame (the stack trace must
keep their frame), so they paid a NON-INLINE initMethodStack() on entry and
releaseForReturn() on exit -- two function calls per invocation, brutal for hot
recursive/call-dense code. initMethodStack's only extra work vs the fast path is
recording the class/method id (two array writes for the trace). Move both to
static-inline (cn1InitMethodStackInline keeps the name recording; releaseForReturn
inlined) so the C compiler folds the offset arithmetic and the call overhead is
gone. Also adds the threadObjectStack-overflow guard the fast path already had.

recursion 6.66x -> 4.89x, hashMapChurn 4.6x -> 3.95x, quicksort/objectAllocation
slightly better; compute unchanged (already inline via the fast frame). Bit-
identical to HotSpot, EdgeTest unchanged. Broad: helps every call-dense method.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
POP_INT/POP_LONG/POP_OBJ used a non-inline pop(&SP) -- a function call for a
pointer decrement, hit on every pop including hot return paths (return
POP_LONG()). Make it static inline. Broad, helps all stack-popping code.
Bit-identical (EdgeTest unchanged, fib result matches HotSpot).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…jects

Annotation-driven escape elimination (the AOT-correct replacement for the
nursery, which was a synthetic win + ~10% universal write-barrier tax). A class
marked @com.codename1.annotations.StackAllocate has each `new` lowered to a
method-scoped C struct instead of codenameOneGcMalloc: no malloc, no heap
registration, no GC mark/sweep -- the object dies with the frame. Intended for
internal short-lived value/temporary types where non-escape is known by
construction (the developer asserts it; violating it dangles).

Mechanics:
- StackAllocate: TYPE-target, CLASS-retention marker annotation.
- Parser detects it at class level -> ByteCodeClass.stackAllocatable.
- BytecodeMethod pre-scans each method and declares one frame-scoped
  `struct obj__T __cn1stk_<site>;` per @StackAllocate NEW site (reused across
  loop iterations -- only one instance per site is live at a time).
- TypeInstruction NEW replicates exactly what __NEW_T does (run the static
  initializer, set the same header fields codenameOneGcMalloc sets) but SKIPS
  heap registration, so the sweep never visits it. Its pointer rides the operand
  stack, so the GC still reaches it as a root and scans its fields -- any heap
  objects it references stay live.

Tax-free and opt-in: codegen only diverges when stackAllocId>=0, so non-annotated
code is byte-for-byte unchanged.

Validated:
- 60M-iteration non-escaping temporary (Vec2): 4.51x faster than the heap path
  (45x -> 10x behind HotSpot), bit-identical checksum vs heap build and HotSpot.
- GC red-team: a @StackAllocate Holder owning a heap Payload with System.gc()
  forced mid-loop -> bit-identical to HotSpot, no premature collection, no crash
  (proves the GC marks through the stack object).
- Full parparvm-bench suite (zero annotations) still bit-identical to HotSpot.

Residual 10x vs HotSpot is the per-iteration memset + header init + operand-stack
traffic that full scalar replacement (object -> field locals) would remove next.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…registers

Builds on the @StackAllocate stack-alloc foundation (d4185da). A primitive-only
@StackAllocate object used as a simple local temporary is now turned into a pure C
local struct whose address is NEVER taken, so clang's SROA promotes its fields to
registers and the object vanishes -- matching what HotSpot's escape analysis does,
which the prior stack-alloc path could not because the struct's address escaped to
the GC-scanned operand stack (measured: that escape alone cost 2.4x).

Transform (a conservative, bail-on-doubt pass in BytecodeMethod.optimize()):
recognize NEW X; DUP; <args>; INVOKESPECIAL X.<init>; ASTORE n where
 - X is @StackAllocate, a DIRECT Object subclass, primitive-only instance fields,
   no <clinit> (so dropping super.<init>/static-init is sound, and there are no
   heap refs the GC must scan -> the object need never be a GC root);
 - X.<init> is exactly Object.<init> + a param->field bijection (every field
   assigned exactly once from a distinct ctor param of matching type) -- analyzed
   by srAnalyzeCtor, else bail;
 - local n is used ONLY as ALOAD n; GETFIELD X.f (srValidateLocalUses: any other
   use -- pass/return/PUTFIELD/second store/type-confusion -- bails);
 - the arg region has no nested NEW/<init>/stack-shuffle/branch, else bail.
Then: NEW emits nothing (no header/memset/PUSH); DUP and ASTORE are dropped;
INVOKESPECIAL <init> becomes ScalarAllocInit, which folds the (already reduced)
arg expressions straight into __cn1sr_<id>.field = <expr> (or, if an arg isn't a
pure expression, falls back to popping the operand stack in order -- both are
stack-balanced); GETFIELD on local n becomes direct __cn1sr_<id>.field. Anything
not matching keeps today's GC-safe stack-alloc codegen. Off-by-default escape
hatch: DISABLE_SCALAR_REPLACE.

Validated (independently rebuilt + re-run, not just the implementing agent):
- SA (60M non-escaping Vec2 long-field temporaries): generated work() has 0
  get_field/PUSH_POINTER/__NEW/Vec2___INIT (struct register-promoted), checksum
  bit-identical to HotSpot, 528ms -> 120ms (4.40x faster than stack-alloc).
- SA2 (Holder with a HEAP Payload field, System.gc() forced mid-loop): primitive-
  only gate BAILS (0 __cn1sr_), keeps stack-alloc, bit-identical, no crash. The
  critical GC-safety gate.
- Full parparvm-bench suite (51 checksums, zero annotations): all bit-identical to
  HotSpot. Scalar replacement is a clean no-op on un-annotated code.

Residual vs HotSpot (2.35x) is ambient ParparVM frame/line scaffolding
(__CN1_DEBUG_INFO per-source-line stores), orthogonal to object handling -- the
object-elimination win is fully realized (the hand-C floor for this loop is 36ms,
below HotSpot's 51ms).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…torize it)

A Java long comparison `a < b` compiles to LCMP (three-way -1/0/1) + IFxx, which
the translator emitted as `CN1_CMP_EXPR(a,b) <op> 0` -- a `(a==b)?0:(a>b)?1:-1`
chain compared to zero. clang cannot recover the loop trip count through that, so
long-counted loops were neither analyzed nor vectorized. Measured: it was THE
residual on the scalar-replaced @StackAllocate benchmark -- replacing it with a
direct comparison was 2.07x and took that loop from 2.35x HotSpot to parity.

Fix: when an LCMP ArithmeticExpression feeds an IFxx branch-on-zero, emit the
direct `(a <op> b)` instead (ArithmeticExpression.getLongCompareDirect, used in
the IFxx branch-fusion in BytecodeMethod). Long only -- float/double (FCMPx/DCMPx)
keep CN1_CMP_EXPR because their NaN ordering differs from a direct C comparison.
Safe and bit-identical: the folded operands are pure (the reducer only folds
loads/constants/pure expressions), so `(a<op>b)` evaluated once equals
`CN1_CMP_EXPR(a,b)<op>0` for every long value -- and avoids the macro's
double-evaluation of each operand. General: helps every long-counted loop, not
just @StackAllocate.

Validated (bit-identical to HotSpot):
- Long-edge test: all 6 operators (< <= > >= == !=) over {Long.MIN, MAX, -1, 0,
  1, MIN+1, MAX-1} (81 pairs) -- checksum identical, fusion fired (0 CN1_CMP_EXPR).
- Full parparvm-bench suite (51 checksums) -- all identical.
- SA (scalar-replaced Vec2 loop) -- identical, 120ms -> 56ms = 1.08x HotSpot
  (was 2.35x); SA2 unaffected, identical.
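
The equivalence behind the fusion can be checked exhaustively on edge values. In this sketch cmp3 re-creates the three-way shape the message describes for CN1_CMP_EXPR; it is a hypothetical stand-in, not the macro itself, and the helper name count_mismatches is likewise invented for the demo.

```c
#include <assert.h>
#include <stdint.h>

/* Three-way compare: -1, 0 or 1, in the shape LCMP produces. */
static int cmp3(int64_t a, int64_t b) {
    return (a == b) ? 0 : (a > b) ? 1 : -1;
}

/* Counts operator results where "cmp3(a, b) <op> 0" and the direct
   "a <op> b" disagree, over every pair of edge values. */
static int count_mismatches(void) {
    static const int64_t v[] = {
        INT64_MIN, INT64_MAX, -1, 0, 1, INT64_MIN + 1, INT64_MAX - 1
    };
    const int n = (int)(sizeof v / sizeof v[0]);
    int bad = 0;
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            int64_t a = v[i], b = v[j];
            int c = cmp3(a, b);
            bad += ((c <  0) != (a <  b));
            bad += ((c <= 0) != (a <= b));
            bad += ((c >  0) != (a >  b));
            bad += ((c >= 0) != (a >= b));
            bad += ((c == 0) != (a == b));
            bad += ((c != 0) != (a != b));
        }
    }
    return bad;
}
```

The direct form gives the C compiler an ordinary integer comparison it can use for trip-count analysis. The same substitution would not be safe for float or double, where NaN makes the three-way result and the direct comparison diverge.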

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The per-source-line __CN1_DEBUG_INFO store (callStackLine[frame] = line) was the
last hot-path cost keeping tight loops out of registers -- it was the entire
residual on the scalar-replaced @StackAllocate benchmark (56ms -> 40ms once gone).

A frame's reported trace line is only ever read at a capture/throw/call site, and
every such site lives on a line that calls, allocates, or does a throwing op
(field/array/div/new/athrow). A line whose every instruction is non-throwing and
non-calling (primitive arithmetic, local load/store, constants, compares,
branches, conversions) can therefore NEVER be the line a trace reports -- so
eliding its store is trace-IDENTICAL, not a line-number regression.

Implementation:
- BytecodeMethod.analyzeElidableLineInfo() marks each LineNumber whose source line
  has no throwing/calling instruction (canThrowOrCall(): conservative -- default
  keep; only an explicit non-throwing whitelist is elidable; numeric/String LDC
  and a scalar-replaced NEW are non-throwing; integer div/rem, array/field/static
  access, invoke, new*, athrow, checkcast, monitor are kept). Runs AFTER scalar
  replacement so a scalar-replaced object's now-pure NEW/<init>/field access is
  seen as non-throwing.
- LineNumber emits the elidable store as __CN1_DEBUG_INFO_NT, which is the full
  store under the on-device debugger (which steps line-by-line and needs every
  line) and a no-op in release/device builds -- where it removes the only per-line
  cost. Throwing/calling lines keep __CN1_DEBUG_INFO, so the reported line is
  always live and exact.
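
As an illustration only, the split between the two stores could look like the sketch below. The macro names, the `DEMO_ON_DEVICE_DEBUG` flag and the single-slot line array are hypothetical stand-ins for `__CN1_DEBUG_INFO`, `__CN1_DEBUG_INFO_NT` and `callStackLine[frame]`; the real definitions are not shown in this message.

```c
#include <assert.h>

/* Hypothetical per-frame line slot, standing in for callStackLine[frame]. */
static int demo_call_stack_line[1];

/* Full store: always records the line. */
#define DEMO_DEBUG_INFO(line) (demo_call_stack_line[0] = (line))

/* Non-throwing variant: a real store only under the (hypothetical) debugger
   build flag, otherwise a no-op. */
#ifdef DEMO_ON_DEVICE_DEBUG
#define DEMO_DEBUG_INFO_NT(line) DEMO_DEBUG_INFO(line)
#else
#define DEMO_DEBUG_INFO_NT(line) ((void)0)
#endif

/* Sums 0..n-1 and divides by d (d must be non-zero in this C sketch).
   In Java only the division line can throw, so only it keeps the full store. */
static long demo_method(long n, long d) {
    long acc = 0;
    DEMO_DEBUG_INFO_NT(10);          /* line 10: local init, cannot throw */
    for (long i = 0; i < n; i++) {
        DEMO_DEBUG_INFO_NT(11);      /* line 11: pure arithmetic */
        acc += i;
    }
    DEMO_DEBUG_INFO(12);             /* line 12: integer division may throw */
    return acc / d;
}

/* The line a trace captured right now would report. */
static int demo_reported_line(void) { return demo_call_stack_line[0]; }
```

In the release configuration the loop body carries no store at all, yet the only line a trace could report (12) is still recorded before the throwing operation runs.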

Validated:
- Full parparvm-bench suite (51 checksums) bit-identical to HotSpot -- execution
  unchanged; the elision applies to every method with no regression.
- SA (scalar-replaced Vec2 loop): all hot lines elide, checksum bit-identical,
  release 56ms -> 40ms = 0.62x HotSpot (BELOW the JIT, at the hand-C floor).
  SA2 (object field, gc() forced) bit-identical.
Note: empirical printStackTrace trace validation is blocked in the standalone
`clean` target by a PRE-EXISTING trace-builder crash on null constant-pool strings
(both elision-on and elision-off segfault identically -- unrelated to this change);
trace-identity rests on the construction argument above + bit-identical execution.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Parallelizes the transitive mark DRAIN across a persistent worker pool while
leaving codenameOneGCMark's per-thread park / root-snapshot logic unchanged, so
snapshot-at-the-beginning (mark all of a thread's reachable set before releasing
it) is preserved. Marking was already type-specialized (per-class markFunction,
leaf types skipped); this adds the parallelism.

- gcMarkObject parallel path claims unmarked->marked with an atomic CAS
  (__sync_bool_compare_and_swap); only the winner pushes. force/recursionKey
  re-scan stays entirely on the serial path (force is never set in parallel).
- Worklist: shared array under a mutex; each worker pops a 64-entry batch and
  buffers produced children in a __thread-local buffer, flushing in batches
  (broadcast wakes idle workers). Termination: a worker idles only when the
  shared worklist is empty AND its local buffer is flushed; the last worker to
  idle sets gcMarkDone. Overflow still falls back to the serial heap-rescan
  fixed point; the nursery promote path and force re-scan stay serial.
- The __thread worklist-buffer pointer doubles as the "am I a parallel worker?"
  discriminator: when NULL (the GC thread between drains, N=1, overflow rescan)
  gcMarkObject/push take the ORIGINAL serial code verbatim -- no atomics, no lock.
- CN1_GC_MARK_THREADS overrides the marker count; default min(4, ncpus-1) at
  runtime; N=1 is byte-for-byte the previous behavior (no pool, no atomics).
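
The claim step can be sketched as follows, assuming a simplified two-state mark (the real collector uses epoch marks) and hypothetical `Demo*` names. `__sync_bool_compare_and_swap` is the GCC/clang builtin named above; everything else is illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical object header reduced to the mark word. */
typedef struct DemoObject { int mark; } DemoObject;

/* Returns 1 only for the caller that flipped the mark unmarked -> marked. */
static int demo_claim(DemoObject *o, int unmarked, int marked) {
    return __sync_bool_compare_and_swap(&o->mark, unmarked, marked);
}

/* Claims each child and writes the winners to out[] (which must hold n
   entries). Returns how many this caller won; a worker would then push
   exactly those to its thread-local buffer. */
static size_t demo_claim_children(DemoObject **children, size_t n,
                                  int unmarked, int marked,
                                  DemoObject **out) {
    size_t won = 0;
    for (size_t i = 0; i < n; i++) {
        DemoObject *c = children[i];
        if (c != NULL && demo_claim(c, unmarked, marked)) out[won++] = c;
    }
    return won;
}
```

Because losing the CAS means another marker already owns the object, an object reachable through several edges is pushed once, no matter how many workers see it.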

Validated: full parparvm-bench suite bit-identical to HotSpot at N=1 AND N=4;
serial==parallel checksums identical; ThreadSanitizer clean on all introduced
mark-state synchronization (the remaining TSan reports are the collector's
pre-existing, inherent non-STW collector-vs-mutator reads -- unmodified HEAD
shows the same class of reports); GC stress (millions of objects, ~120 GCs/run)
stable and identical across 5 runs.

Measured-as-a-whole impact (vs serial mark, min-of-reps): objectAllocation
306->280ms (1.09x), everything else within noise. Marking is ~19.5% of the
GC-bound time and the bench's live-set-per-GC is small, so the whole-suite gain
is modest; the bulk of the GC gap (allocation fast path + concurrent-collector
throttle) is the next, larger target.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ects

Replaces per-object calloc + allObjectsInHeap registration + per-object free()
with a non-moving segregated-fits (BiBOP) page heap for small non-array objects.
Arrays and objects > MAX size class keep the verbatim legacy aligned calloc +
allObjectsInHeap path -- so real array offsets, stable addresses, and SIMD/GPU
alignment are untouched. NON-MOVING (objects never move -> no write barrier, no
pointer fix-ups / shadowing -- the reasons generational was rejected do not apply)
and NON-GENERATIONAL (whole-heap collect).

Design:
- 64KB posix_memalign'd pages, one size class each (15 classes 32..512B; >512 or
  arrays -> legacy path). Every slot >=16-byte aligned.
- Allocation: per-thread (__thread) current page per class; pop the page free-list
  else bump the cursor (lock-free, thread-local). Page full -> retired to a global
  SWEEP stack (atomic CAS push); grab a fresh/partial page from the pool (one
  bibopMutex acquisition per page, not per object). Slot re-zeroed + header set
  exactly as codenameOneGcMalloc. Small objects are NOT registered in
  allObjectsInHeap -- the pages track them.
- Liveness: the existing per-object epoch mark (__codenameOneGcMark) stays the
  single source of truth, so gcMarkObject + the parallel mark pool + the proven
  grace semantics (mark==-1 grace, mark<cur-1 dead) are UNCHANGED and work
  uniformly on page slots and legacy table objects. No per-page bitmap, no
  address->page table.
- Sweep: rebuild each retired page's free-list from its slot headers (finalizers
  still run); an all-dead page returns to the pool. Then the existing
  allObjectsInHeap sweep handles large/array objects.
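
A rough single-threaded sketch of the per-page allocation path described above (pop the page free-list, else bump the cursor). The `DemoPage` layout and function names are assumptions for illustration; the real `CN1BibopPage` is not shown here, and `aligned_alloc` is swapped in for `posix_memalign`:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_PAGE_BYTES 65536u

typedef struct DemoPage {
    unsigned char *base;  /* 64KB-aligned slot storage */
    size_t slotSize;      /* one size class per page */
    size_t slotCount;
    size_t bumpIndex;     /* next never-used slot */
    void *freeList;       /* linked through the first word of dead slots */
} DemoPage;

static int demo_page_init(DemoPage *p, size_t slotSize) {
    assert(slotSize >= sizeof(void *) && slotSize % 16 == 0);
    p->base = aligned_alloc(DEMO_PAGE_BYTES, DEMO_PAGE_BYTES);
    if (p->base == NULL) return 0;
    p->slotSize = slotSize;
    p->slotCount = DEMO_PAGE_BYTES / slotSize;
    p->bumpIndex = 0;
    p->freeList = NULL;
    return 1;
}

/* Pop the free-list, else bump the cursor. NULL means the page is full and
   the caller would retire it and fetch another. */
static void *demo_page_alloc(DemoPage *p) {
    void *slot;
    if (p->freeList != NULL) {
        slot = p->freeList;
        memcpy(&p->freeList, slot, sizeof(void *)); /* next link is in the slot */
    } else if (p->bumpIndex < p->slotCount) {
        slot = p->base + p->bumpIndex++ * p->slotSize;
    } else {
        return NULL;
    }
    memset(slot, 0, p->slotSize);                   /* slot re-zeroed on reuse */
    return slot;
}

/* What a sweep would do for one dead slot: thread it onto the free-list. */
static void demo_page_free_slot(DemoPage *p, void *slot) {
    memcpy(slot, &p->freeList, sizeof(void *));
    p->freeList = slot;
}
```

Since the page base is 64KB-aligned and every slot size is a multiple of 16, each slot address is 16-byte aligned without any per-object padding logic.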

The three correctness hinges:
1. Allocate-during-GC: a fresh slot is mark==-1 (one-cycle grace) AND lives on the
   thread's OWNED current page, which the concurrent sweep never touches (only
   retired pages, owner==0, are swept).
2. Sweep vs alloc: a page has exactly one role at a time -- OWNED (one thread
   allocates, never swept) -> retired to the SWEEP stack -> swept (owner==0) ->
   FREE/PARTIAL pool. The sweep snapshots the stack via atomic_exchange. No page
   is ever allocated-into and swept simultaneously.
3. No page-table race: dissolved -- header marking needs no address->page lookup;
   the append-only all-pages registry (release/acquire) is read only by the
   overflow rescan, and only at a slot whose atomically-read mark == current cycle.
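
Hinge 2 relies on a push-only stack that the sweep empties in one step. A minimal C11 sketch of that pattern, with hypothetical `Demo*` names and memory orders chosen for the illustration (the actual orders used by the runtime are not stated above):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct DemoRetired {
    struct DemoRetired *next;
    int id;
} DemoRetired;

/* Global SWEEP stack head; zero-initialized, so it starts empty. */
static _Atomic(DemoRetired *) demo_sweep_stack;

/* Mutator side: lock-free CAS push of a full page. */
static void demo_retire(DemoRetired *page) {
    DemoRetired *head =
        atomic_load_explicit(&demo_sweep_stack, memory_order_relaxed);
    do {
        page->next = head;  /* head is refreshed by a failed CAS */
    } while (!atomic_compare_exchange_weak_explicit(
                 &demo_sweep_stack, &head, page,
                 memory_order_release, memory_order_relaxed));
}

/* Collector side: detach the whole stack in one exchange. Pages retired
   after this point land on the new, empty stack for the next sweep. */
static DemoRetired *demo_snapshot(void) {
    return atomic_exchange_explicit(&demo_sweep_stack, (DemoRetired *)NULL,
                                    memory_order_acquire);
}

static int demo_count(const DemoRetired *p) {
    int n = 0;
    for (; p != NULL; p = p->next) n++;
    return n;
}
```

Because nodes are only ever pushed individually and removed all at once, the usual ABA hazard of a CAS pop does not arise in this shape.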

Escape hatch: #ifndef CN1_DISABLE_BIBOP (default ON); -DCN1_DISABLE_BIBOP reverts
to the verbatim legacy collector. Independent of CN1_NURSERY (kept off).

Validated (macOS arm64): full parparvm-bench suite bit-identical to HotSpot with
BiBOP ON, -DCN1_DISABLE_BIBOP, and across 1/4/8 mark workers and forced worklist
overflow (-DCN1_GC_MARK_WORKLIST_SIZE=256, exercising the page rescan). TSan: zero
races on any BiBOP state (pages/pools/free-lists/registry/cursor) -- 111 reports
vs the legacy baseline's 119, all the pre-existing collector-vs-mutator object-
header family. GC-stress + 4-thread allocate-during-GC stress: checksums identical
across runs and to legacy/HotSpot (a single lost live object would diverge). RSS
24-26% LOWER than legacy and bounded over 2000 rounds (pages recycled, no drift).

Measured as a whole vs warmed Java 25 (+AOT cache), min-of-reps:
objectAllocation 278->144ms (1.93x; 15.1x->7.8x vs Java25), stringBuilding 1.14x,
hashMapChurn 1.05x, compute/arrays unchanged. Whole-suite geomean vs Java25
2.26x -> 2.08x, zero regression.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…x->1.6x HotSpot)

The per-call frame bookkeeping -- DEFINE_METHOD_STACK's per-call memset of the
locals+operand-stack elementStruct region, the callStackOffset bump/check, the
releaseForReturn offset restore, and the per-line __CN1_DEBUG_INFO stores -- is
pure overhead. A method that holds ZERO object references in its frame contributes
no GC roots, so the precise collector has nothing to scan there and the frame can
be eliminated outright. No GC change, no operand-stack rewrite (an SSA-temp
rewrite was measured NOT to help and was skipped); instruction bodies stay
byte-identical, so this is bit-identical by construction.

- isFramelessEligible() (BytecodeMethod): conservative whitelist on raw bytecode --
  static, primitive-or-void return, no object args/locals, no object operand-stack
  value, no try/catch, not synchronized/native/on-device-debug, and every opcode in
  the handled primitive set (loads/stores/consts/arithmetic incl. throwing div-rem/
  shifts/bitwise/conversions/compares/branches/switch/dup-pop-swap/returns +
  INVOKESTATIC with a purely primitive/void descriptor). Anything else -> ineligible
  -> byte-identical legacy codegen.
- DEFINE_METHOD_STACK_FRAMELESS (cn1_globals.h): the operand stack is a method-local
  C array (not a threadObjectStack slice) -- no per-call memset, no offset
  bookkeeping, no callStack push; emits CN1_FRAMELESS_SOE_GUARD.
- CN1_FRAMELESS_SOE_GUARD: frameless methods don't bump callStackOffset, so deep
  non-tail recursion is guarded by comparing __builtin_frame_address(0) to a lazily
  cached per-thread nativeStackLimit (pthread_get_stackaddr_np - stacksize + 256KB
  band; 8MB frame-anchored fallback) -- throws StackOverflowError instead of SIGBUS.
  __builtin_expect hints are load-bearing (177->147ms without/with).
- Return sites (BasicInstruction x5 + optimize()'s two return fast-paths) emit plain
  return with no releaseForReturn; LineNumber suppresses __CN1_DEBUG_INFO for
  frameless methods (no callStackOffset to index). Gate: -Dcn1.frameless (default ON);
  OFF emits byte-identical-to-HEAD code.
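
The guard idea can be sketched as below. This is an assumption-laden illustration, not `CN1_FRAMELESS_SOE_GUARD` itself: the limit is derived from a byte budget below the caller's frame rather than from `pthread_get_stackaddr_np`, the stack is assumed to grow downward (true on arm64 and x86-64), and all `demo`/`DEMO` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cached per-thread limit. */
static __thread uintptr_t demo_native_stack_limit;

/* True when the current frame lies below the limit; hinted as unlikely. */
#define DEMO_SOE_GUARD() \
    (__builtin_expect((uintptr_t)__builtin_frame_address(0) \
                          < demo_native_stack_limit, 0))

/* Sketch-only setup: allow budgetBytes of stack below the caller's frame. */
static void demo_set_budget(uintptr_t budgetBytes) {
    demo_native_stack_limit =
        (uintptr_t)__builtin_frame_address(0) - budgetBytes;
}

/* Non-tail recursion. Returns the depth at which the guard tripped (where
   the VM would throw StackOverflowError), or -1 if cap was reached first. */
__attribute__((noinline))
static int demo_recurse(int depth, int cap) {
    volatile char pad[64];
    pad[0] = (char)depth;
    if (DEMO_SOE_GUARD()) return depth;
    if (depth >= cap) return -1;
    int r = demo_recurse(depth + 1, cap);
    pad[1] = (char)r;  /* work after the call keeps it a genuine non-tail call */
    return r;
}
```

The check costs one load and one compare per call, and it needs no frame counter, which is the point for methods that no longer bump `callStackOffset`.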

Validated: full Bench suite bit-identical to HotSpot frameless ON and OFF; OFF
byte-identical generated C to HEAD; 11 methods frameless in the suite. Deep
non-tail recursion throws StackOverflowError, not SIGSEGV. Measured vs warmed
Java 25+AOTcache: recursion 436->150ms = 2.92x faster (ON vs OFF), 4.64x -> 1.59x
HotSpot; every other benchmark within noise (no regression).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… methods (opt-in)

Phase 3b of the conservative-collector endgame: extends frameless codegen from
primitive-only methods (committed 0260fe8) to OBJECT-BEARING methods, with the
conservative native-stack scan as a real GC root source. A frameless object method
keeps its object refs in native C locals / a method-local operand-stack array
(no DEFINE_METHOD_STACK frame, no threadObjectStack, no per-call memset); the GC
finds those roots by conservatively scanning the thread's native C stack. Enabled
by the non-moving BiBOP heap (conservative scanning requires non-moving). Gated:
#ifdef CN1_CONSERVATIVE_GC_ROOTS (the runtime) + -Dcn1.frameless.objects (the
codegen); DEFAULT OFF -- the default build is byte-identical to HEAD (precise GC +
primitive-only frameless). The proven path (P1 resolver / P2 native-stack scan /
P3a zero-miss root-placement) is now production, not validation.

- cn1ConservativeResolve(word)->object base|NULL: BiBOP page-aligned candidate +
  all-pages-registry binary search + interior pointers + large/array extents; marks
  for real (cn1ConservativeMarkRange).
- HYBRID GC: codenameOneGCMark keeps the precise threadObjectStack scan for legacy
  frames AND conservatively scans each stopped thread's native stack [sp,base) +
  register snapshot for frameless frames; explicit roots (currentThreadObject,
  statics, constant pool, pending native allocations) retained. The conservative
  scan covers the whole native stack, so the legacy<->frameless caller/callee
  boundary is never a gap.
- Universal thread-stopping: cooperative (CN1_GC_PARK_CAPTURE setjmp + SP at every
  safepoint, proven) for lightweight threads; signal-based (SIGUSR2 + ucontext SP/reg
  capture) for genuine native threads, opt-in (CN1_GC_SIGNAL_STOP).
- Object-frameless eligibility extends the whitelist to ALOAD/ASTORE, GETFIELD/
  PUTFIELD/GET-PUTSTATIC, NEW/ANEWARRAY/CHECKCAST/INSTANCEOF, array ops, all invokes
  (args as explicit C params), ACONST_NULL/IF_ACMP*/IFNULL, String/Class LDC.
  Excluded: try/catch, ATHROW, MONITOR*, MULTIANEWARRAY -> stay legacy. Instruction
  bodies byte-identical (win is frame elimination, not re-lowering).
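
A simplified sketch of the resolve step for BiBOP slots only (page-aligned candidate, registry binary search, interior pointer rounded down to the slot base). The registry layout and every `demo`/`Demo` name are assumptions; large/array extents and the real `cn1ConservativeResolve` signature are out of scope here:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_PAGE_BYTES 65536u
#define DEMO_MAX_PAGES 16

typedef struct DemoPageInfo {
    uintptr_t base;    /* 64KB-aligned page start */
    size_t slotSize;
    size_t bumpIndex;  /* slots handed out so far */
} DemoPageInfo;

static DemoPageInfo demo_registry[DEMO_MAX_PAGES];  /* kept sorted by base */
static size_t demo_registry_len;

/* Insert a page, keeping the registry sorted by base address. */
static int demo_register_page(void *base, size_t slotSize, size_t used) {
    if (demo_registry_len == DEMO_MAX_PAGES) return 0;
    size_t i = demo_registry_len++;
    while (i > 0 && demo_registry[i - 1].base > (uintptr_t)base) {
        demo_registry[i] = demo_registry[i - 1];
        i--;
    }
    demo_registry[i] = (DemoPageInfo){ (uintptr_t)base, slotSize, used };
    return 1;
}

/* Maps an arbitrary stack word to the base of the slot containing it,
   or NULL when the word does not point into an allocated slot. */
static void *demo_resolve(uintptr_t word) {
    uintptr_t pageBase = word & ~(uintptr_t)(DEMO_PAGE_BYTES - 1);
    size_t lo = 0, hi = demo_registry_len;
    while (lo < hi) {                       /* lower-bound binary search */
        size_t mid = lo + (hi - lo) / 2;
        if (demo_registry[mid].base < pageBase) lo = mid + 1; else hi = mid;
    }
    if (lo == demo_registry_len || demo_registry[lo].base != pageBase)
        return NULL;                        /* not one of our pages */
    const DemoPageInfo *p = &demo_registry[lo];
    size_t slot = (size_t)(word - pageBase) / p->slotSize;
    if (slot >= p->bumpIndex) return NULL;  /* slot never handed out */
    return (void *)(pageBase + slot * p->slotSize);
}
```

Rounding down to the slot base only works because objects never move and every page holds a single size class, which is why the text ties conservative scanning to the non-moving BiBOP heap.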

Validated (CN1_CONSERVATIVE_GC_ROOTS + -Dcn1.frameless.objects): full Bench suite
bit-identical to HotSpot (72 frameless methods: 12 primitive + 60 static object);
default (gates off) byte-identical to HEAD; GcStress 25x and 4-thread MtStress 30x
== HotSpot with bounded RSS (no leak); the transient ⊇ self-check (CN1_CONSERVATIVE_
GC_SELFCHECK) reports MISS=0 (every precise root also resolved conservatively).
GcStress 5x re-confirmed == HotSpot here.

HONEST STATUS:
- PERF-NEUTRAL today: the frame-elimination win is offset by an UNOPTIMIZED
  conservative scan (the heap-membership snapshot is rebuilt O(heap) per-thread-per-
  GC). The once-per-GC optimization (born-marked new BiBOP objects) is the next step
  to make object-frameless a net win on GC-heavy code; recursion's win is preserved
  (no GC in the loop). That's why this ships OPT-IN, default off.
- INSTANCE-method frameless (-Dcn1.frameless.instance) and the SIGNAL-stop path have
  intermittent multi-thread races (DONE 0 / ~8-10%) NOT root-caused -> gated OFF.
  The static + cooperative path (what's validated above) is solid (30/30 MT).
- Conservative GC is incompatible with CN1_NURSERY (deprecated); frameless methods
  don't appear in callStack-based stack traces (printStackTrace doesn't crash).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…hread)

java.lang.Thread.alive was set to true inside java_lang_Thread_runImpl, which runs
on the WORKER thread asynchronously after start() returns. java_lang_Thread_start__
only did pthread_create. So a thread doing start() then join() could race: join()
-> isAlive() reads false (worker not yet scheduled) and returns IMMEDIATELY, before
any of the worker's writes were published -- e.g. main summing a worker-filled
results[] array could read it still zero. Classic "started-state not set
synchronously by the starting thread" bug; present on every port, ~15% repro in a
4-thread join-then-read stress (vs HotSpot fully deterministic).

Fix: set the alive flag synchronously on the CALLING thread, in program order before
the worker is spawned, in java_lang_Thread_start__. A later join() then correctly
blocks until the worker clears alive under the monitor (runImpl:
synchronized{ alive=false; notifyAll(); }), and that monitor release/acquire is the
happens-before edge that publishes the worker's writes. Purely additive
synchronization; bit-identical to HotSpot on the full Bench suite. MtStress
3/20-failing -> 50/50 deterministic == HotSpot after the fix.
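
The ordering can be illustrated with plain pthreads. This is a sketch under assumptions, not the runtime's code: `DemoThread`, `demo_start`, `demo_run` and `demo_join` are hypothetical, and a mutex plus condition variable stands in for the Java monitor.

```c
#include <assert.h>
#include <pthread.h>

typedef struct DemoThread {
    pthread_t tid;
    pthread_mutex_t lock;
    pthread_cond_t done;
    int alive;
    int result;
} DemoThread;

/* Worker: publish the result, then clear alive under the lock and notify. */
static void *demo_run(void *arg) {
    DemoThread *t = arg;
    pthread_mutex_lock(&t->lock);
    t->result = 42;               /* the worker's "results[]" write */
    t->alive = 0;
    pthread_cond_broadcast(&t->done);
    pthread_mutex_unlock(&t->lock);
    return NULL;
}

/* start(): alive is set on the CALLING thread before the worker exists,
   so a join() issued right after start() can never observe alive == 0
   unless the worker has already finished. */
static int demo_start(DemoThread *t) {
    pthread_mutex_init(&t->lock, NULL);
    pthread_cond_init(&t->done, NULL);
    t->result = 0;
    t->alive = 1;
    if (pthread_create(&t->tid, NULL, demo_run, t) != 0) {
        t->alive = 0;
        return 0;
    }
    return 1;
}

/* join() in the Java style: wait while alive, then read under the lock. */
static int demo_join(DemoThread *t) {
    pthread_mutex_lock(&t->lock);
    while (t->alive) pthread_cond_wait(&t->done, &t->lock);
    int r = t->result;
    pthread_mutex_unlock(&t->lock);
    pthread_join(t->tid, NULL);   /* reap the native thread */
    return r;
}
```

Had `alive = 1` been set inside `demo_run` instead, `demo_join` could see `alive == 0` before the worker was scheduled and return the initial `result` of 0, which is the race described above.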

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rameless

Flip the Phase-3b gates to default ON (arm64-validated -- the dev machine is Apple
Silicon arm64, same arch as the iOS device target; CI validates the other ABIs):
- cn1_globals.h: #define CN1_CONSERVATIVE_GC_ROOTS by default (disable with
  -DCN1_DISABLE_CONSERVATIVE_GC_ROOTS).
- BytecodeMethod: cn1.frameless.objects + cn1.frameless.instance default true.

The instance-frameless multi-thread failure that previously gated it was the
pre-existing Thread.start/join visibility race, fixed in 9933311. Default build
now: 302 frameless methods (was 12 primitive-only), bit-identical to HotSpot, no
per-call frame on object/instance methods, roots found by the conservative
native-stack scan. Validated: full Bench suite bit-identical; GcStress 5x ==
HotSpot, no crash/leak. Cooperative thread-stop covers Java threads (what the bench
exercises); native-thread coverage via the signal path (CN1_GC_SIGNAL_STOP) remains
the edge case left to CI/on-device validation.


Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…LINE_ALLOC)

The build ships no LTO, so __NEW_<X> and codenameOneGcMalloc live in separate
translation units and clang cannot inline them: every escaping new-site pays two
real cross-TU calls (confirmed in asm). CN1_FAST_NEW(X) inlines the BiBOP
per-thread bump common case at the allocation site (pointer-bump + header stamp,
size-class index folded to a compile-time literal via CN1_BIBOP_CIDX), falling
back to __NEW_<X> only on page-full / free-list / oversized / ineligible. The
bump replicates cn1BibopAlloc bit-for-bit (relaxed bumpIndex load, mark released
last, cursor release-stored after slot init) so the concurrent-GC correctness
argument is unchanged. bibopCurrent[]/bibopBytesSinceGc + struct CN1BibopPage
are lifted to the header for the inline; the .m keeps a _Static_assert that the
size-class array still matches.

Gated -DCN1_INLINE_ALLOC, default OFF (pending iOS on-device validation of the
statement-expression macro, as with the conservative GC). With the flag off
CN1_FAST_NEW(X) expands verbatim to __NEW_<X>, so the default build is byte-
identical.

Validated (arm64 macOS): full Bench bit-identical to HotSpot both OFF and ON;
GcStress 20/20 and MtStress 10/10 (4-thread alloc-during-GC) == HotSpot, no
leak. Measured ON vs OFF: objectAllocation 107.9->79.0ms (-27%, 5.4x->3.94x vs
warmed Java25), stringBuilding 61.2->51.5ms (-16%); compute/arrays within +/-1%.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…alloc fast-path tier 2)

Stacks on the inlined BiBOP bump (CN1_INLINE_ALLOC) to close more of the
escaping-allocation gap. Two independently-gated levers:

Lever B (-DCN1_INLINE_CTOR): after CN1_FAST_NEW allocates, the constructor was
still a separate out-of-line cross-TU call. InlinableConstructor analyses a
constructor for an inlinable shape (only this/param field stores + a chained-
inlinable super ctor, bounded instruction count, no INVOKE except that super,
no alloc/throw/branch/loop/try) and the new-site emits the field stores inline
instead of the call. Emitted as an `#ifdef CN1_INLINE_CTOR` in the generated C
(both branches present), so with the flag off the original call compiles and the
build is byte-identical. Constructor args are consumed from the operand stack;
the object is already GC-reachable and its ref fields were zeroed by the bump,
so the inline stores need no extra barrier (this VM has none).

Lever A (-DCN1_DEATOMIC_BYTES): the per-allocation `atomic_fetch_add` on the
global bibopBytesSinceGc becomes a plain per-thread accumulator
(ThreadLocalData.bibopBytesLocal) flushed in bulk at page-acquire and thread
death. bibopBytesSinceGc feeds only the GC-trigger heuristic (no liveness role)
and is already raced today, so deferring it only shifts the trigger cadence by
< nthreads*page, negligible vs the 24MB trigger. The bump cursor and mark
publication ordering -- the GC-visible fields -- are UNCHANGED.
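
Lever A amounts to batching a global atomic add into a per-thread counter. A small sketch of that accounting, with hypothetical `demo_` names and a flush at every page boundary standing in for the page-acquire hook:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_PAGE_BYTES 65536u

/* Global counter: feeds only the GC-trigger heuristic in this sketch. */
static atomic_size_t demo_bytes_since_gc;
/* Per-thread accumulator: plain (non-atomic) adds on the hot path. */
static _Thread_local size_t demo_bytes_local;

static void demo_account(size_t bytes) { demo_bytes_local += bytes; }

/* Bulk flush: one atomic add per page instead of one per object. */
static void demo_flush(void) {
    if (demo_bytes_local != 0) {
        atomic_fetch_add_explicit(&demo_bytes_since_gc, demo_bytes_local,
                                  memory_order_relaxed);
        demo_bytes_local = 0;
    }
}

/* Simulates `count` allocations of slotSize bytes, flushing whenever a new
   page would be acquired. */
static void demo_alloc_many(size_t count, size_t slotSize) {
    size_t perPage = DEMO_PAGE_BYTES / slotSize;
    for (size_t i = 0; i < count; i++) {
        if (i > 0 && i % perPage == 0) demo_flush();
        demo_account(slotSize);
    }
}

static size_t demo_global(void) {
    return atomic_load_explicit(&demo_bytes_since_gc, memory_order_relaxed);
}

static size_t demo_pending(void) { return demo_bytes_local; }
```

At any moment the global counter lags by at most one page of bytes per thread, which matches the "shifts the trigger cadence by < nthreads*page" bound in the text.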

Both default OFF, alongside CN1_INLINE_ALLOC, pending iOS on-device validation.

Validated (arm64 macOS): full Bench bit-identical to HotSpot for every flag
combination (off / L1 / +A / +B / +A+B); GcStress 10/10 and MtStress 10/10
(4-thread alloc-during-GC) == HotSpot on the +A+B config, no leak. Interleaved
(thermal-drift-cancelling) objectAllocation: off 171.9 -> L1 126.9 -> +B 80.1
-> +A+B 71.4 ms (2.4x speedup; each lever stacks). hashMapChurn flat (its cost
is hashing/clear, not allocation) and stringBuilding modest (char[] arrays use
the legacy path). Net: objectAllocation ~5.7x -> ~2.7x warmed Java25; compute/
arrays unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lightweight pending array)

cn1GcBuildRootSnapshots() reads every thread's pendingHeapAllocations array to
add not-yet-migrated objects to the conservative-resolve extent table. It runs
on the GC thread before the thread being scanned is parked, so threads other
than the current one are still RUNNING. A lightweight thread grows its pending
array lock-free in codenameOneGcMalloc / cn1AddPending (malloc tmp; memcpy;
free(old); pending = tmp) -- the pre-existing guard took threadHeapMutex only
for non-lightweight (native) threads. So the GC could read
pendingHeapAllocations[j] exactly as free() reclaimed the array: the garbage
word is taken as a heap-extent base and cn1ConservativeResolve returns it
unvalidated -> SIGBUS in gcMarkObject. Rare (~1% under timing perturbation) but
real, and it reaches default builds (CN1_CONSERVATIVE_GC_ROOTS is default-on).

Fix: serialize the grow-and-free against the snapshot read. The two realloc
fast paths now take threadHeapMutex unconditionally (lightweight included, like
the native path already did), and cn1GcBuildRootSnapshots takes the SAME mutex
around its pending-read loop. The lock is acquired and released entirely within
the read, before the caller signal-stops any thread, so no thread is ever frozen
mid-realloc holding it (no deadlock); ordering vs lockCriticalSection is never
inverted (the migration path takes criticalSection THEN threadHeapMutex; this
path takes only threadHeapMutex). This mirrors the existing pending-migration
code (715-740), which already reads pending under threadHeapMutex for native
threads / while lightweight threads are parked. The per-element store stays
lock-free -- that read is benign (an aligned 8-byte slot holds 0 or a complete
valid pointer; no free involved).

Validated (arm64 macOS): ThreadSanitizer on HEAD deterministically reports the
race (cn1GcBuildRootSnapshots reading pending vs codenameOneGcMalloc). With the
fix: full Bench bit-identical to HotSpot (default and -DCN1_INLINE_ALLOC
-DCN1_INLINE_CTOR -DCN1_DEATOMIC_BYTES); MtStress (4-thread alloc-during-GC) 300/300
clean -- 0 crash, 0 deadlock, all checksums == HotSpot -- at a deliberately
widened race window (PER_THREAD_ALLOCATION_COUNT temporarily 16); GcStress 20/20
== HotSpot; no perf regression (objectAllocation/stringBuilding/intArithmetic
within +/-1%). Residual conservative-collector non-STW reads are pre-existing and
by design.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…T iOS VM)

The inline BiBOP bump (CN1_INLINE_ALLOC), inline leaf constructors
(CN1_INLINE_CTOR) and de-atomic per-thread byte accounting (CN1_DEATOMIC_BYTES)
were committed behind opt-in -D flags. For an AOT VM whose sole shipping target
is iOS, an off-by-default flag is dead code that never runs in production, and CI
already exercises every ABI. Flip all three to default-on with a
-DCN1_DISABLE_* escape hatch (kept only so CI can A/B and so a platform can opt
out if a real problem surfaces).

Validated (arm64 macOS): the DEFAULT build (no flags) is now bit-identical to
HotSpot across the full Bench suite, GcStress 15/15 and MtStress 15/15 (4-thread
alloc-during-GC) == HotSpot. Perf is the previously-measured strongest config:
objectAllocation ~2.7x warmed Java25 (was 5.7x), compute/arrays at parity.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… arena

Two GC-memory changes, both bit-identical to HotSpot, found by profiling
allocation-churn benchmarks (objectAllocation etc.) which were spending their
time in the allocator/collector rather than the mutator.

1. Adaptive allocation pacing. System.gc() used to Thread.sleep(2) on every
   trigger; an allocate-and-drop workload triggers GC every CN1_BIBOP_GC_TRIGGER
   bytes, so that fixed sleep was pure mutator stall (and, crucially, it did NOT
   bound memory -- RSS ballooned to 2.35-7GB run-to-run as the mutator outran the
   collector). Replace it with proportional backpressure in cn1BibopMaybeGc: the
   mutator only waits when uncollected BiBOP volume since the last GC exceeds a
   hard cap (3x the trigger), and waits as a GC SAFEPOINT (threadActive=FALSE so
   the collector can scan/advance past it -- a naive spin livelocks the collector,
   which showed up as an MtStress hang). When the collector keeps up the cap is
   never hit and this never waits. Counter-intuitively the tight cap is also the
   FAST configuration: a small heap keeps the non-generational O(pages) sweep
   cheap, so the collector keeps up and the mutator barely waits; a loose cap lets
   the heap grow and the sweep (hence everything) crawls. Disable: -DCN1_BIBOP_NO_PACING.

2. Batched page arena. cn1BibopNewPage did one posix_memalign(64KB) per page;
   when churn drains the free pool faster than the sweep refills it, every page
   was a separate mach_vm_map kernel trap (profiled ~17% of objectAllocation,
   now 0 in the sample). Carve 64KB pages from a 64KB-aligned multi-page arena
   (one mmap per CN1_BIBOP_ARENA_PAGES=64); pages stay 64KB-aligned, the arena is
   lazily faulted (RSS tracks touched pages), and BiBOP never free()s a page so
   interior pointers are safe. Disable: -DCN1_BIBOP_NO_ARENA.
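
The arena carve in item 2 can be sketched as follows. `aligned_alloc` is swapped in for the mmap the text describes so the example stays portable C11, and the `demo`/`DEMO` names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_PAGE_BYTES 65536u
#define DEMO_ARENA_PAGES 64u

static unsigned char *demo_arena;  /* current arena, NULL before first use */
static size_t demo_arena_next;     /* next uncarved page index */
static size_t demo_arena_count;    /* arenas obtained so far */

/* Returns a 64KB-aligned page, requesting a new arena from the system only
   once every DEMO_ARENA_PAGES pages. Pages are never returned, mirroring
   the "never free()s a page" property stated above. */
static void *demo_new_page(void) {
    if (demo_arena == NULL || demo_arena_next == DEMO_ARENA_PAGES) {
        demo_arena = aligned_alloc(DEMO_PAGE_BYTES,
                                   (size_t)DEMO_PAGE_BYTES * DEMO_ARENA_PAGES);
        if (demo_arena == NULL) return NULL;
        demo_arena_next = 0;
        demo_arena_count++;
    }
    return demo_arena + (size_t)DEMO_PAGE_BYTES * demo_arena_next++;
}
```

Because the arena itself is 64KB-aligned and pages are carved at 64KB strides, masking any interior pointer with the page size still yields the page base.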

Result on objectAllocation churn: peak RSS 2.35GB+ (unbounded) -> 275MB (bounded,
~9x), at neutral-to-faster perf (clean idle wall-time equal-or-better; pacing
only engages under allocation pressure, so compute/array benchmarks are
unaffected -- bit-identical). This bounds what was effectively an unbounded-RSS
OOM risk on device. It does NOT close the throughput gap to HotSpot on churn --
that is the non-generational O(pages) sweep vs HotSpot's O(survivors) young gen,
a separate follow-up (O(1) all-dead-page reclaim).

Validated (arm64 macOS): full Bench bit-identical to HotSpot; GcStress 20/20;
MtStress (4-thread alloc-during-GC) 12/12, no hang; RSS bounded over sustained
churn.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…us pages

The non-generational sweep walked every slot of every retired page (millions of
reads per cycle under allocation churn), so the collector couldn't keep up and
the adaptive pacing throttled the mutator -- objectAllocation was sweep-bound.
Make the sweep skip the per-slot walk for pages whose fate is provable in O(1):

A retired page is "homogeneous" -- safe to decide without walking -- iff
  !gcAllocedSinceSweep  (no fresh mark==-1 grace-candidate slots since last sweep)
  && gcLastMarkedEpoch != V  (nothing on it was marked THIS cycle; a reachable
                              object is always marked, so every occupant is garbage
                              aging through grace)
  && !gcNeedsReclaim     (no survivor class carries a real finalizer)
  && cn1BibopLiveMonitors == 0  (no BiBOP monitor data to free)
For a homogeneous page, gcGraceEpoch (set at each full walk = upper bound on every
survivor's epoch) decides the whole page:
  gcGraceEpoch <  V-1  -> ALL DEAD  -> O(1) reclaim (reset bumpIndex/freeList, to
                                       freePool; byte-identical to the walk's
                                       liveCount==0 outcome, without touching slots)
  gcGraceEpoch >= V-1  -> ALL LIVE (still in grace) -> O(1) skip (route as the walk
                                       would, gcGraceEpoch unchanged so it ages out)
Otherwise the existing full walk runs (and refreshes the per-page facts). New
per-page fields live in struct CN1BibopPage (always present so A/B layouts match);
set on alloc (the bump + free-list paths) and in gcMarkObject (a relaxed,
idempotent epoch stamp -- the marker is parallel). Monitors use a global seq_cst
live-count rather than a per-page flag to avoid cross-thread visibility races.
Gate: -DCN1_BIBOP_NO_FASTSWEEP.
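
The per-page decision reads naturally as a small pure function. The sketch below follows the predicate as written above; the field names are taken from the text, while the `DemoPageFacts` struct, the enum and the integer field types are assumptions:

```c
#include <assert.h>

typedef enum { DEMO_FULL_WALK, DEMO_ALL_DEAD, DEMO_ALL_LIVE } DemoSweepDecision;

/* Per-page facts named in the text; types are assumed for the sketch. */
typedef struct DemoPageFacts {
    int gcAllocedSinceSweep;  /* fresh mark==-1 slots since last sweep */
    int gcLastMarkedEpoch;    /* last cycle in which any slot was marked */
    int gcNeedsReclaim;       /* a survivor class has a real finalizer */
    int gcGraceEpoch;         /* upper bound on every survivor's epoch */
} DemoPageFacts;

/* V is the current GC cycle; liveMonitors is the global live-monitor count. */
static DemoSweepDecision demo_decide(const DemoPageFacts *p, int V,
                                     long liveMonitors) {
    int homogeneous = !p->gcAllocedSinceSweep
                   && p->gcLastMarkedEpoch != V
                   && !p->gcNeedsReclaim
                   && liveMonitors == 0;
    if (!homogeneous) return DEMO_FULL_WALK;   /* fall back to the slot walk */
    return (p->gcGraceEpoch < V - 1) ? DEMO_ALL_DEAD   /* O(1) reclaim */
                                     : DEMO_ALL_LIVE;  /* O(1) skip */
}
```

Any single failing condition routes the page to the existing full walk, so the fast path can only ever be taken when the walk's outcome is already determined.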

Enabler (required): every class was emitting a non-null finalizerFunction that
just chained to the empty Object finalizer, so a "has finalizer" predicate was
always true and the O(1) path never fired. ByteCodeClass now emits
finalizerFunction = 0 unless a real finalize() exists in the hierarchy (the
__FINALIZER_<class> chain is still emitted, so subclass chaining is intact; both
readers -- freeAndFinalize and cn1BibopReclaimSlot -- already guard ptr != 0).
Behavior-preserving (conservative on unresolved bases) and it also drops millions
of no-op indirect finalizer calls from the existing full-walk path.

Result (arm64 macOS, idle, default-on): 63% of retired pages take the O(1) path;
objectAllocation 75.4 -> 46.5ms (1.62x; ~40% of the gap to warmed Java25 closed),
and on an isolated 20M-Node churn ~1.8x faster at equal-or-lower BOUNDED RSS
(~235MB) -- the pacing throttles far less now that the sweep keeps up. No
regression on compute/array benches.

Validated: full Bench bit-identical to HotSpot (FASTSWEEP on and off); GcStress
(85 runs across dev + here) and MtStress (40 runs, 4-thread alloc-during-GC) with
ZERO checksum divergence -- bit-identical is the oracle that the grace semantics
are preserved. (An intermittent ~4% GcStress segfault is a PRE-EXISTING
concurrent-GC race in the precise threadObjectStack scan -- present in the
pristine baseline at an equal-or-higher rate, an untouched code path -- to be
tracked separately.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nter coarsening)

Profiling the (now sweep-unbound) objectAllocation churn showed the per-object
inline path doing avoidable work. Two removals, both bit-identical:

- Drop the __ownerThread store. It is write-only dead state in the current tree
  (the size-class-index repurposing was an unmerged free-list patch); a full-tree
  scan finds no reader. Removed from both the inlined cn1BibopFastAlloc and the
  slow-path cn1BibopInitSlot. (Field kept for struct-layout stability.)

- Move allocationsSinceLastGC / totalAllocations off the per-object path. These
  feed only the isHighFrequencyGC heuristic (no correctness role) but were two
  GLOBAL stores per allocation -- an L1 store single-threaded, a bouncing cache
  line across threads. They are now bumped in bulk inside CN1_BIBOP_FLUSH_BYTES
  once per page-acquire (~64KB), which is accurate enough for a threshold
  heuristic. (Non-DEATOMIC build keeps the per-object update in ACCOUNT_BYTES.)

Note recorded in-code: the body memset is NOT removable -- skipping it is ~2x
SLOWER because uninitialized ref fields get scanned during the mark==-1 grace
window and retain floating garbage. It is load-bearing, not overhead.

Result: objectAllocation 46.2 -> 44.8ms (~3% single-threaded; larger under
multi-threaded allocation where the global-counter cache line stops bouncing);
now 2.29x warmed Java25. Validated bit-identical to HotSpot (full Bench),
GcStress (no checksum divergence) and MtStress 15/15.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…fields off-object)

Profiling objectAllocation showed the per-allocation cost is store-bound, and
the object is fat: a 6-field, 48-byte header vs HotSpot's ~16 bytes / 2 fields, so a
Node{int,ref} occupied a 64-byte BiBOP slot -- 2x the bytes to allocate, zero,
and stream through cache on every object. The header writes themselves are NOT
removable (each is GC state; skipping any retains floating garbage and runs 2-3x
SLOWER -- measured). So shrink by RELOCATING fields off the object, not skipping:

- DELETE __ownerThread -- write-only dead state (the size-class-index repurposing
  was an unmerged patch; no reader exists). 48 -> 40.
- __codenameOneThreadData (lazily-attached monitor, null on ~all objects) -> an
  address-keyed monitor side table (cn1MonitorDataGet/Set/Remove, one mutex,
  critical-section->table lock order). monitorEnter/Exit/wait/notify + reclaim/free
  use it; the alloc fast path drops the =0 store. 40 -> 24.
- __codenameOneReferenceCount -> a force-visited side set: its only behavioral use
  was the gcMarkObject force-recursion guard (==recursionKey), now
  cn1ForceVisitedTestAndSet; the 999999 "permanent" writes were vestigial (mark-
  sweep never reads them -- those objects stay live via root marking). The alloc
  fast path drops the =1 store. 24 -> 16.

Header is now {clazz*, gcMark, heapPosition} = 16 bytes (HotSpot-class). Node drops
64->32 byte class (half), HashMap.Entry 80->48.
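
A minimal sketch of an address-keyed side table guarded by one mutex, in the spirit of the monitor relocation above. The names (`monitor_get`, `monitor_set`, `monitor_remove`, `MON_BUCKETS`) and the chained-bucket layout are assumptions for illustration; the real cn1MonitorDataGet/Set/Remove signatures are not shown in this commit message.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define MON_BUCKETS 1024u   /* hypothetical table size */

typedef struct MonEntry {
    const void *obj;        /* key: object address */
    void *monitor;          /* value: lazily attached monitor data */
    struct MonEntry *next;
} MonEntry;

static MonEntry *mon_table[MON_BUCKETS];
static pthread_mutex_t mon_lock = PTHREAD_MUTEX_INITIALIZER;

static size_t mon_hash(const void *obj) {
    /* Drop the low alignment bits before bucketing. */
    return (size_t)(((uintptr_t)obj >> 4) % MON_BUCKETS);
}

/* Returns the monitor attached to obj, or NULL (the common case). */
void *monitor_get(const void *obj) {
    void *found = NULL;
    pthread_mutex_lock(&mon_lock);
    for (MonEntry *e = mon_table[mon_hash(obj)]; e; e = e->next)
        if (e->obj == obj) { found = e->monitor; break; }
    pthread_mutex_unlock(&mon_lock);
    return found;
}

/* Attaches or replaces a monitor. Returns 0 on success, -1 on allocation failure. */
int monitor_set(const void *obj, void *monitor) {
    int rc = 0;
    pthread_mutex_lock(&mon_lock);
    MonEntry **head = &mon_table[mon_hash(obj)];
    MonEntry *e = *head;
    while (e && e->obj != obj) e = e->next;
    if (e) {
        e->monitor = monitor;
    } else if ((e = malloc(sizeof *e)) != NULL) {
        e->obj = obj; e->monitor = monitor; e->next = *head; *head = e;
    } else {
        rc = -1;
    }
    pthread_mutex_unlock(&mon_lock);
    return rc;
}

/* Must run when the object is reclaimed, so a reused address
   never inherits a stale monitor. */
void monitor_remove(const void *obj) {
    pthread_mutex_lock(&mon_lock);
    MonEntry **link = &mon_table[mon_hash(obj)];
    while (*link && (*link)->obj != obj) link = &(*link)->next;
    if (*link) { MonEntry *dead = *link; *link = dead->next; free(dead); }
    pthread_mutex_unlock(&mon_lock);
}
```

Objects that are never locked cost nothing here: no header word and no store on the allocation fast path.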

Validated (arm64 macOS), every phase bit-identical to HotSpot on the full Bench;
GcStress + MtStress (4-thread alloc-during-GC) with ZERO checksum divergence across
150+ stress runs (the ~4% empty-output segfault is the pre-existing threadObjectStack
-scan race, same rate on clean HEAD). Perf (idle, interleaved): objectAllocation
0.80x (3.4x->3.0x warmed Java25), hashMapChurn 0.84x, stringBuilding faster-or-flat,
compute/array flat (relocation costs nothing off the alloc path). RSS is neutral on
average with higher variance (a smaller-slot pacing artifact, tunable separately).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rdening

MEMSET ELIMINATION (init-before-publish, no gate -- this is the pipeline):
For every NEW X; DUP; <args>; INVOKESPECIAL X.<init> site whose ctor is
inlinable (super()==Object, param/const stores only, no finalizer), the NEW
is deferred to a null placeholder and the <init> allocates WITHOUT the body
memset (cn1BibopFastAllocNoZero), stores every ctor-written field, explicitly
zeroes the unwritten ones, and only then publishes the object. Ctor args are
hoisted into C temps in ARGUMENT ORDER before the alloc, which also fixes two
latent bugs in the committed inline-ctor path: a folded call-expression arg
stored to two fields evaluated twice, and args evaluated in ctor-body store
order instead of Java's left-to-right. objectAllocation 1.70x warmed Java 25
(was 5.7x at branch start); all 10 Bench checksums bit-identical to HotSpot.

The elision is made sound against the conservative/signal-stop collector by
deferring parentCls publication: the header keeps parentCls==0 until every
field is written, so a signal-stopped thread's mid-construction object is
skipped by gcMarkObject's existing guard (grace keeps it alive); the sweep's
mark==-1 finalizer probe gets a matching NULL guard and finalizer-bearing
classes keep the memset path.
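
A minimal sketch of init-before-publish with a deferred class-pointer store. `Node`, `ClassInfo`, `new_node` and `gc_can_scan` are hypothetical stand-ins, and `malloc` stands in for the no-zero allocator; the real emitted code and cn1BibopFastAllocNoZero are not reproduced here. The release store is one way to keep the compiler from sinking field writes below the publication; the commit does not say which mechanism the VM uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

typedef struct ClassInfo { const char *name; } ClassInfo;

typedef struct Node {
    _Atomic(const ClassInfo *) clazz;  /* NULL means "still under construction" */
    int value;
    struct Node *next;
} Node;

static const ClassInfo NODE_CLASS = { "Node" };

/* Constructor arguments arrive already evaluated, in argument order. */
Node *new_node(int value_arg, Node *next_arg) {
    Node *n = malloc(sizeof *n);       /* contents are garbage: no memset */
    if (n == NULL) return NULL;
    atomic_init(&n->clazz, NULL);      /* header says "unpublished" */
    n->value = value_arg;              /* every field is written explicitly... */
    n->next = next_arg;
    /* ...and only then is the class pointer published. */
    atomic_store_explicit(&n->clazz, &NODE_CLASS, memory_order_release);
    return n;
}

/* Marker-side guard: an object without a class pointer is skipped, not scanned. */
int gc_can_scan(Node *n) {
    assert(n != NULL);
    return atomic_load_explicit(&n->clazz, memory_order_acquire) != NULL;
}
```

A thread frozen between the field stores leaves an object whose class pointer is still NULL, so the marker never follows its uninitialized reference fields.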

THREAD-STOP GC HARDENING (bugs found via GcStress under CN1_GC_SIGNAL_STOP=1
and an adversarial review of the branch's GC):
* VALIDATED precise scan: a signal-stopped thread can freeze between a push's
  type/data stores (plain stores clang may also reorder), so a type==OBJECT
  slot can hold a stale primitive -- observed as gcMarkObject(0x4e20) from a
  frozen PUSH_INT window. threadObjectStack words are now resolved against
  the page/extent snapshot exactly like conservative roots.
* Type-before-data ordering in the fused invoke-return emissions (the same
  torn-slot hazard at every call returning into a stale receiver slot).
* Generation-counted signal handshake: a timed-out stop PRE-RELEASES its
  generation and releases are monotonic, so an abandoned or descheduled
  handler can never be left spinning forever.
* gcParkCaptured is cleared for EVERY thread each cycle -- a native thread
  that parked once no longer satisfies useCoop with a stale SP forever
  (missed roots -> UAF).
* GC safepoint in cn1BibopMaybeGc (BiBOP-only allocators never reached the
  legacy park) and the pacing spin now honors threadBlockedByGC on wake so
  the cap can't resume a mutator mid-drain.
* Acquire ordering: conservative resolver's mark load (freelist-header reuse
  window), sweep's bumpIndex load (fresh-slot header visibility), and the
  snapshot builder reads bumpIndex before geometry (page-reformat TOCTOU).
* bibopBytesLocal / nativeAllocationMode initialized in ThreadLocalData
  (malloc'd, never zeroed -- garbage corrupted GC pacing / disabled the
  alloc fast path per-thread).
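
A minimal sketch of the validated scan from the first bullet: a stack word is trusted as an object reference only if it resolves against a sorted extent snapshot. The `Extent` layout, `resolve_root`, and the slot-alignment check are assumptions for illustration, not the VM's actual snapshot format.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One heap extent: addresses in [start, end), carved into equal slots. */
typedef struct { uintptr_t start; uintptr_t end; size_t slot_size; } Extent;

/* bsearch comparator: key is an address, element is an Extent. */
static int extent_cmp(const void *key, const void *elem) {
    uintptr_t addr = *(const uintptr_t *)key;
    const Extent *e = elem;
    if (addr < e->start) return -1;
    if (addr >= e->end) return 1;
    return 0;
}

/* Returns word if it is the start of a slot inside a known extent, else 0.
   snap must be sorted by start and non-overlapping. */
uintptr_t resolve_root(const Extent *snap, size_t n, uintptr_t word) {
    assert(snap != NULL || n == 0);
    const Extent *e = bsearch(&word, snap, n, sizeof *snap, extent_cmp);
    if (e == NULL) return 0;                        /* stale primitive or junk */
    if ((word - e->start) % e->slot_size != 0) return 0;
    return word;
}
```

With this filter a torn slot holding a leftover integer such as 0x4e20 is simply rejected instead of being handed to the marker.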

Validation: GcStress 25/25 cooperative + 25/25 forced-signal (was 20/25 and
14/15), MtStress 20/20 + 10/10 forced-signal, ctor-semantics torture test
(eval order, double-store, throwing args, default zeros, wide args, GC churn
in call-args) byte-identical to HotSpot, full Bench suite bit-identical, no
perf regression on any benchmark.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…er-cycle root snapshot

The global legacy-heap table was grown by DYING threads (markDeadThread ->
collectThreadResources -> placeObjectInHeapCollection) while the GC thread
walks it lock-free (sweep, root-snapshot build, overflow rescan). One growth
concurrent with a sweep loses the sweep's slot-NULLs in the memcpy'd copy --
resurrecting freed pointers for the next cycle to dereference -- and two
growths during one hoisted-pointer walk free the array under the reader (the
old one-growth deferral could not cover that).

Fix: make the table strictly GC-thread-owned. A dying thread now only QUEUES
its ThreadLocalData (critical section already held by markDeadThread); the GC
drains the queue at mark start -- strictly before any table walk or possible
Thread-object finalization -- and performs the TLD free itself when the
finalizer ran while the TLD was still queued (gcReleaseRequested). Objects in
a queued TLD's pending list are invisible to the sweep, so the deferral can
never free them early; un-snapshotted for at most one cycle, they are covered
by the mark==-1 grace rule like every other post-snapshot allocation. With the
single-writer invariant the growth can free the replaced array immediately,
and getStack's one-shot immortal-string removal (the only non-GC-thread table
access) takes the critical section.
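
A minimal sketch of the single-writer arrangement: dying threads only enqueue, and the GC thread alone drains the queue and touches the table. `ThreadData`, `queue_dead_thread`, `drain_dead_threads` and the counter standing in for the table are hypothetical. In the real code the critical section is already held by markDeadThread, whereas this sketch takes the lock itself.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

typedef struct ThreadData {
    int id;
    struct ThreadData *next_dead;   /* link in the dead-thread queue */
} ThreadData;

static pthread_mutex_t critical_section = PTHREAD_MUTEX_INITIALIZER;
static ThreadData *dead_queue;      /* written by dying threads, under the lock */
static int table_entries;           /* stand-in for the heap table: GC thread only */

/* Mutator side: never touches the table, only queues itself. */
void queue_dead_thread(ThreadData *t) {
    assert(t != NULL);
    pthread_mutex_lock(&critical_section);
    t->next_dead = dead_queue;
    dead_queue = t;
    pthread_mutex_unlock(&critical_section);
}

/* GC thread, at mark start and before any table walk.
   Returns how many dead threads were absorbed. */
int drain_dead_threads(void) {
    pthread_mutex_lock(&critical_section);
    ThreadData *list = dead_queue;  /* detach the whole queue in one step */
    dead_queue = NULL;
    pthread_mutex_unlock(&critical_section);

    int drained = 0;
    while (list != NULL) {
        ThreadData *next = list->next_dead;
        table_entries++;            /* the only writer of the table is this thread */
        list->next_dead = NULL;
        list = next;
        drained++;
    }
    return drained;
}
```

Because growth now happens only on the thread that also walks the table, a replaced array can be freed immediately without a deferral scheme.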

Also: build the conservative page/extent root snapshot ONCE PER MARK CYCLE
(epoch-guarded) instead of once per scanned thread -- the full-table walk +
qsort dominated the GC thread on array-heavy workloads (sampled: more time in
qsort/cn1ConsExtCmp than in marking) and stalled mutators parked behind
threadBlockedByGC. Post-snapshot allocations are mark==-1 fresh and survive
via grace whether or not they resolve, so the first build of a cycle is
complete for correctness. recursion 146->127ms; GC CPU burn on string/array
churn cut sharply.
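
A minimal sketch of an epoch-guarded, once-per-cycle snapshot build. The names (`RootSnapshot`, `begin_mark_cycle`, `get_root_snapshot`) and the fixed-size extent array are assumptions; the real snapshot builder and cn1ConsExtCmp are not shown here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_EXTENTS 8

typedef struct { uintptr_t starts[MAX_EXTENTS]; size_t count; } RootSnapshot;

/* Hypothetical live extent table, in arbitrary order. */
static uintptr_t live_extents[MAX_EXTENTS] = { 0x9000, 0x1000, 0x5000 };
static size_t live_count = 3;

static RootSnapshot snapshot;
static unsigned long mark_epoch = 1;      /* bumped once per mark cycle */
static unsigned long snapshot_epoch = 0;  /* epoch the snapshot was built for */
static int snapshot_builds;               /* instrumentation only */

static int addr_cmp(const void *a, const void *b) {
    uintptr_t x = *(const uintptr_t *)a, y = *(const uintptr_t *)b;
    return (x > y) - (x < y);
}

void begin_mark_cycle(void) { mark_epoch++; }

/* Called once per scanned thread; the walk and sort run only on the
   first call of each cycle. */
const RootSnapshot *get_root_snapshot(void) {
    if (snapshot_epoch != mark_epoch) {
        assert(live_count <= MAX_EXTENTS);
        for (size_t i = 0; i < live_count; i++)
            snapshot.starts[i] = live_extents[i];
        snapshot.count = live_count;
        qsort(snapshot.starts, snapshot.count, sizeof snapshot.starts[0], addr_cmp);
        snapshot_epoch = mark_epoch;
        snapshot_builds++;
    }
    return &snapshot;
}
```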

Validation: new ThreadChurn stress (8 dying threads x 12 rounds x 3k pending
arrays + >30000 live arrays forcing table growth under concurrent GC) 15/15 +
8/8 forced-signal, checksum identical to HotSpot; GcStress 20/20+15/15 coop,
10/10+8/8 forced-signal; MtStress 10/10; full Bench suite bit-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
IOSNative's cached "..." String was pinned with the old idiom --
removeObjectFromHeapCollection + __codenameOneReferenceCount = 999999 --
which the VM's header shrink removed (__codenameOneReferenceCount was
relocated off-object) and the BiBOP sweep never honored anyway
(removeObjectFromHeapCollection is a no-op for page-resident objects).
Both RTL and LTR sites now use cn1AddImmortalRoot, the same migration
the getStack separator strings already received; the immortal-root scan
marks the String and (through it) its value array every cycle.

This was the last compile error in the iOS CI jobs (native-ios,
build-ios-watch et al on Xcode): "no member named
'__codenameOneReferenceCount' in 'struct JavaObjectPrototype'".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

shai-almog commented Jul 3, 2026

Collaborator Author

Apple Watch (watchOS / Core Graphics)

Compared 216 screenshots: 213 matched, 3 updated.

  • ButtonTheme_dark: updated screenshot. Screenshot differs (416x496 px, bit depth 8).
    Preview info: JPEG preview quality 70.
    Full-resolution PNG saved as ButtonTheme_dark.png in workflow artifacts.

  • ButtonTheme_light: updated screenshot. Screenshot differs (416x496 px, bit depth 8).
    Preview info: JPEG preview quality 70.
    Full-resolution PNG saved as ButtonTheme_light.png in workflow artifacts.

  • ToastBarTopPosition: updated screenshot. Screenshot differs (416x496 px, bit depth 8).
    Preview info: JPEG preview quality 70.
    Full-resolution PNG saved as ToastBarTopPosition.png in workflow artifacts.

shai-almog commented Jul 3, 2026

Collaborator Author

Compared 140 screenshots: 140 matched.
✅ Native Mac screenshot tests passed.

Benchmark Results

  • VM Translation Time: 0 seconds
  • Compilation Time: 173 seconds

Detailed Performance Metrics

| Metric | Duration |
| --- | --- |
| SIMD kernel backend | SSE2 (x64) / NEON (arm64) native kernels |
| SIMD int-add (64K x300) | java 53ms / native 2ms = 26.5x speedup |
| SIMD float-mul (64K x300) | java 53ms / native 3ms = 17.6x speedup |
| SIMD kernel correctness | PASS (native result == scalar reference) |
| Base64 payload size | 8192 bytes |
| Base64 benchmark iterations | 6000 |
| Base64 SIMD byte path | active (NEON-accelerated) |
| Base64 CN1 encode | 155.000 ms |
| Base64 CN1 decode | 117.000 ms |
| Base64 native encode | 506.000 ms |
| Base64 encode ratio (CN1/native) | 0.306x (69.4% faster) |
| Base64 native decode | 214.000 ms |
| Base64 decode ratio (CN1/native) | 0.547x (45.3% faster) |
| Base64 SIMD encode | 50.000 ms |
| Base64 encode ratio (SIMD/CN1) | 0.323x (67.7% faster) |
| Base64 SIMD decode | 45.000 ms |
| Base64 decode ratio (SIMD/CN1) | 0.385x (61.5% faster) |
| Base64 encode ratio (SIMD/native) | 0.099x (90.1% faster) |
| Base64 decode ratio (SIMD/native) | 0.210x (79.0% faster) |
| Image encode benchmark iterations | 100 |
| Image createMask (SIMD off) | 8.000 ms |
| Image createMask (SIMD on) | 2.000 ms |
| Image createMask ratio (SIMD on/off) | 0.250x (75.0% faster) |
| Image applyMask (SIMD off) | 69.000 ms |
| Image applyMask (SIMD on) | 62.000 ms |
| Image applyMask ratio (SIMD on/off) | 0.899x (10.1% faster) |
| Image modifyAlpha (SIMD off) | 64.000 ms |
| Image modifyAlpha (SIMD on) | 59.000 ms |
| Image modifyAlpha ratio (SIMD on/off) | 0.922x (7.8% faster) |
| Image modifyAlpha removeColor (SIMD off) | 64.000 ms |
| Image modifyAlpha removeColor (SIMD on) | 56.000 ms |
| Image modifyAlpha removeColor ratio (SIMD on/off) | 0.875x (12.5% faster) |

shai-almog commented Jul 3, 2026

Collaborator Author

Apple TV (tvOS / Metal)

Compared 138 screenshots: 110 matched, 28 missing actuals.

Missing actual screenshots (test did not produce output; no preview available):

  • AppReviewDialog
  • DesktopMode
  • DialogTheme_dark
  • DialogTheme_light
  • FloatingActionButtonTheme_dark
  • FloatingActionButtonTheme_light
  • Gpu3DAnimation
  • Gpu3DCube
  • Gpu3DModel
  • Gpu3DTexturedCube
  • ListTheme_dark
  • ListTheme_light
  • LottieAnimatedScreenshotTest
  • MultiButtonTheme_dark
  • PaletteOverrideTheme_dark
  • PaletteOverrideTheme_light
  • RealOsmVector
  • SVGAnimatedScreenshotTest
  • SVGStatic
  • ShowcaseTheme_dark
  • ShowcaseTheme_light
  • SpanLabelTheme_dark
  • SpanLabelTheme_light
  • VectorMapDarkStyle
  • VectorMapMarkers
  • VectorMapShapes
  • css-gradients
  • landscape

shai-almog commented Jul 3, 2026

Collaborator Author

Compared 140 screenshots: 140 matched.
✅ Native iOS Metal screenshot tests passed.

Benchmark Results

  • VM Translation Time: 0 seconds
  • Compilation Time: 268 seconds

Build and Run Timing

| Metric | Duration |
| --- | --- |
| Simulator Boot | 61000 ms |
| Simulator Boot (Run) | 0 ms |
| App Install | 11000 ms |
| App Launch | 3000 ms |
| Test Execution | 298000 ms |

Detailed Performance Metrics

| Metric | Duration |
| --- | --- |
| SIMD kernel backend | SSE2 (x64) / NEON (arm64) native kernels |
| SIMD int-add (64K x300) | java 55ms / native 3ms = 18.3x speedup |
| SIMD float-mul (64K x300) | java 54ms / native 2ms = 27.0x speedup |
| SIMD kernel correctness | PASS (native result == scalar reference) |
| Base64 payload size | 8192 bytes |
| Base64 benchmark iterations | 6000 |
| Base64 SIMD byte path | active (NEON-accelerated) |
| Base64 CN1 encode | 146.000 ms |
| Base64 CN1 decode | 115.000 ms |
| Base64 native encode | 452.000 ms |
| Base64 encode ratio (CN1/native) | 0.323x (67.7% faster) |
| Base64 native decode | 203.000 ms |
| Base64 decode ratio (CN1/native) | 0.567x (43.3% faster) |
| Base64 SIMD encode | 48.000 ms |
| Base64 encode ratio (SIMD/CN1) | 0.329x (67.1% faster) |
| Base64 SIMD decode | 43.000 ms |
| Base64 decode ratio (SIMD/CN1) | 0.374x (62.6% faster) |
| Base64 encode ratio (SIMD/native) | 0.106x (89.4% faster) |
| Base64 decode ratio (SIMD/native) | 0.212x (78.8% faster) |
| Image encode benchmark iterations | 100 |
| Image createMask (SIMD off) | 6.000 ms |
| Image createMask (SIMD on) | 1.000 ms |
| Image createMask ratio (SIMD on/off) | 0.167x (83.3% faster) |
| Image applyMask (SIMD off) | 38.000 ms |
| Image applyMask (SIMD on) | 115.000 ms |
| Image applyMask ratio (SIMD on/off) | 3.026x (202.6% slower) |
| Image modifyAlpha (SIMD off) | 139.000 ms |
| Image modifyAlpha (SIMD on) | 126.000 ms |
| Image modifyAlpha ratio (SIMD on/off) | 0.906x (9.4% faster) |
| Image modifyAlpha removeColor (SIMD off) | 70.000 ms |
| Image modifyAlpha removeColor (SIMD on) | 63.000 ms |
| Image modifyAlpha removeColor ratio (SIMD on/off) | 0.900x (10.0% faster) |

shai-almog commented Jul 3, 2026

Collaborator Author

iOS screenshot updates

Compared 137 screenshots: 136 matched, 1 updated.

  • ToastBarTopPosition: updated screenshot. Screenshot differs (1179x2556 px, bit depth 8).
    Preview info: JPEG preview quality 70; downscaled to 825x1789.
    Full-resolution PNG saved as ToastBarTopPosition.png in workflow artifacts.

Benchmark Results

  • VM Translation Time: 0 seconds
  • Compilation Time: 326 seconds

Build and Run Timing

| Metric | Duration |
| --- | --- |
| Simulator Boot | 114000 ms |
| Simulator Boot (Run) | 1000 ms |
| App Install | 26000 ms |
| App Launch | 2000 ms |
| Test Execution | 516000 ms |

Detailed Performance Metrics

| Metric | Duration |
| --- | --- |
| SIMD kernel backend | SSE2 (x64) / NEON (arm64) native kernels |
| SIMD int-add (64K x300) | java 65ms / native 5ms = 13.0x speedup |
| SIMD float-mul (64K x300) | java 80ms / native 10ms = 8.0x speedup |
| SIMD kernel correctness | PASS (native result == scalar reference) |
| Base64 payload size | 8192 bytes |
| Base64 benchmark iterations | 6000 |
| Base64 SIMD byte path | active (NEON-accelerated) |
| Base64 CN1 encode | 286.000 ms |
| Base64 CN1 decode | 184.000 ms |
| Base64 native encode | 541.000 ms |
| Base64 encode ratio (CN1/native) | 0.529x (47.1% faster) |
| Base64 native decode | 276.000 ms |
| Base64 decode ratio (CN1/native) | 0.667x (33.3% faster) |
| Base64 SIMD encode | 54.000 ms |
| Base64 encode ratio (SIMD/CN1) | 0.189x (81.1% faster) |
| Base64 SIMD decode | 47.000 ms |
| Base64 decode ratio (SIMD/CN1) | 0.255x (74.5% faster) |
| Base64 encode ratio (SIMD/native) | 0.100x (90.0% faster) |
| Base64 decode ratio (SIMD/native) | 0.170x (83.0% faster) |
| Image encode benchmark iterations | 100 |
| Image createMask (SIMD off) | 8.000 ms |
| Image createMask (SIMD on) | 1.000 ms |
| Image createMask ratio (SIMD on/off) | 0.125x (87.5% faster) |
| Image applyMask (SIMD off) | 51.000 ms |
| Image applyMask (SIMD on) | 69.000 ms |
| Image applyMask ratio (SIMD on/off) | 1.353x (35.3% slower) |
| Image modifyAlpha (SIMD off) | 100.000 ms |
| Image modifyAlpha (SIMD on) | 69.000 ms |
| Image modifyAlpha ratio (SIMD on/off) | 0.690x (31.0% faster) |
| Image modifyAlpha removeColor (SIMD off) | 67.000 ms |
| Image modifyAlpha removeColor (SIMD on) | 49.000 ms |
| Image modifyAlpha removeColor ratio (SIMD on/off) | 0.731x (26.9% faster) |

shai-almog and others added 2 commits July 3, 2026 04:58
… a trace

Every stack-overflow guard threw a FRESH StackOverflowError -- but the
throw happens at stack exhaustion, and filling the new error's trace
(fillInStack -> getStack) allocates a StringBuilder and calls
getClass/append/..., each of which trips the same overflow guard and
throws again. The recursion consumed the remaining stack until the hard
guard page: a 511-frame throwException/fillInStack/getStack storm ending
in SIGSEGV, observed crashing the iOS UI-test app mid-suite (the
screenshots after the first deep-recursion test were all "missing").
The framed call-depth guards had the same recursion in bounded form.

Fix is the JVM-standard one: a PREALLOCATED shared StackOverflowError,
created at startup (initConstantPool, where stack is plentiful) with its
stack field PRE-FILLED -- fillInStack's null-check then skips trace
building entirely, so throwing it allocates nothing and calls nothing.
All six guard sites (frameless SOE guard, fast/inline/full framed init
depth+operand-stack checks, nativeMethods) now route through
cn1ThrowStackOverflow; a startup-only fallback builds a fresh error if
the guard fires before preallocation.
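
A minimal sketch of the preallocated-error pattern, assuming a throwable whose trace is built only when its stack field is still NULL. `Throwable`, `init_shared_soe` and `stack_overflow_error` are hypothetical names; the real cn1ThrowStackOverflow and fillInStack are more involved.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Throwable {
    const char *class_name;
    void *stack;               /* non-NULL means "trace already filled" */
} Throwable;

static Throwable *shared_soe;  /* created once at startup */
static int traces_built;       /* instrumentation: counts expensive trace builds */
static char prefilled_trace;   /* placeholder trace object */

/* Building a trace allocates and calls into the runtime in a real VM,
   which is exactly what must not happen at stack exhaustion. */
static void fill_in_stack(Throwable *t) {
    assert(t != NULL);
    if (t->stack != NULL) return;     /* pre-filled: nothing to do */
    traces_built++;
    t->stack = &prefilled_trace;
}

/* Startup, where stack is plentiful. */
void init_shared_soe(void) {
    static Throwable storage;
    storage.class_name = "java.lang.StackOverflowError";
    storage.stack = &prefilled_trace; /* pre-fill so throwing skips trace building */
    shared_soe = &storage;
}

/* Guard sites call this instead of constructing a fresh error. */
Throwable *stack_overflow_error(void) {
    if (shared_soe == NULL) {         /* startup-only fallback */
        Throwable *fresh = calloc(1, sizeof *fresh);
        if (fresh == NULL) abort();
        fresh->class_name = "java.lang.StackOverflowError";
        fill_in_stack(fresh);
        return fresh;
    }
    fill_in_stack(shared_soe);        /* no-op: the stack field is pre-filled */
    return shared_soe;
}
```

The cost of sharing one instance is that the error carries no per-throw trace, which is the usual trade for being able to throw at all when no stack is left.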

New SoeTest (permanent suite): three rounds of deep recursion, each SOE
caught, String.valueOf(e) usable, VM fully functional after recovery --
previously a hard SIGSEGV. Verified under clang AND gcc-16 -O3.
FusedTest/SbTorture/MapTorture/Bench/GcStress unchanged bit-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- DB_DUPLICATE_SWITCH_CLAUSES: merge the NEWARRAY and ALOAD/ASTORE
  clauses in isFramelessEligible -- they share the identical
  object-mode-only gate, so one clause group states that directly
  instead of duplicating the body.
- SIC_INNER_SHOULD_BE_STATIC_ANON: the scalar-replacement read
  expression captured only the lvalue string but as an anonymous inner
  class still pinned the enclosing BytecodeMethod; it is now the named
  static ScalarReplacedRead.

No behavior change; spotbugs:spotbugs reports zero bugs and the
translator tests pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

github-actions Bot commented Jul 3, 2026

Contributor

✅ Continuous Quality Report

Test & Coverage

Static Analysis

  • SpotBugs [Report archive]
    • ByteCodeTranslator: 0 findings (no issues)
    • android: 0 findings (no issues)
    • codenameone-maven-plugin: 0 findings (no issues)
    • core-unittests: 0 findings (no issues)
    • ios: 0 findings (no issues)
  • PMD: 0 findings (no issues) [Report archive]
  • Checkstyle: 0 findings (no issues) [Report archive]

Generated automatically by the PR CI workflow.

…ault-on;

remove public @StackAllocate; performance guide update

REVIEW FIX -- charAt bounded by count, not capacity: the cn1InlStrCharAt
intrinsic guarded against the backing array's length, so for a String
whose value array is longer than its logical count (aliasing/offset
constructors) charAt(length()) read past the logical end instead of
throwing. The fast path now checks java_lang_String_count; the
out-of-line native and the JS-port twin had the same pre-existing
laxness and now agree; StrCmp gained a regression section (for a 3-char
string built over a 32-char builder buffer, charAt(length()) and
charAt(-1) must throw) -- byte-identical to the host JVM.
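
A minimal sketch of the corrected bound, assuming a string that carries a logical count separate from the capacity of its backing array. `JString` and `str_char_at` are hypothetical; the real intrinsic is cn1InlStrCharAt and throws rather than returning a status code.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const unsigned short *value;  /* backing array, may be longer than count */
    int capacity;                 /* length of the backing array */
    int count;                    /* logical length of the string */
} JString;

/* Returns 0 and stores the character, or -1 where Java would throw
   StringIndexOutOfBoundsException. The bound is count, never capacity. */
int str_char_at(const JString *s, int index, unsigned short *out) {
    assert(s != NULL && out != NULL);
    if (index < 0 || index >= s->count) return -1;
    *out = s->value[index];
    return 0;
}
```

Checking against `capacity` instead would let `charAt(length())` silently read whatever the builder left in the unused tail of the buffer.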

BENCHMARK SUITE IN-REPO (vm/benchmarks): Bench.java + the full torture
set (MapTorture/SbTorture/StrCmp/FusedTest/IbpTest/ExcTest/ThreadChurn/
SoeTest/GcStress/MtStress) with repo-relative scripts:
translate-and-build.sh (translator + JavaAPI cached builds, mandatory
-fwrapv/-fno-strict-aliasing/-fno-builtin-fmod flags),
run-benchmark.sh (interleaved best-of-N vs a host JVM, refuses to print
ratios on checksum mismatch), run-gauntlet.sh (byte-identical tortures
+ GC stress in cooperative AND forced-signal modes), and a README with
instructions, workload descriptions and reference results. Both scripts
validated end-to-end here (gauntlet GREEN, Bench bit-identical).

TAGGED INTEGERS DEFAULT-ON: writing the benchmark scripts exposed that
-DCN1_TAGGED_INT was opt-in and NO shipping config set it -- deployed
apps never got it (hashMapChurn 2.8x untagged vs 0.97x tagged). Now
default-on for 64-bit-pointer targets, opt-out via
-DCN1_DISABLE_TAGGED_INT; the pointer-size gate still auto-disables it
on arm64_32 (Watch) and other 32-bit targets. The tagged-off shape was
re-validated bit-identical this session.
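
A minimal sketch of a default-on, opt-out feature gate keyed on pointer size, plus one common low-bit tagging scheme. Only CN1_DISABLE_TAGGED_INT comes from the text above; `SKETCH_TAGGED_INT`, `Word`, and the encoding itself are assumptions, since the commit does not describe the VM's actual tagged representation.

```c
#include <assert.h>
#include <stdint.h>

/* Default-on for 64-bit pointers, opt-out via the disable macro;
   32-bit-pointer targets fall through to 0 automatically. */
#if !defined(CN1_DISABLE_TAGGED_INT) && UINTPTR_MAX > 0xFFFFFFFFu
#define SKETCH_TAGGED_INT 1
#else
#define SKETCH_TAGGED_INT 0
#endif

typedef uint64_t Word;   /* a reference-sized machine word */

/* Aligned object pointers have a zero low bit, so a set low bit
   can mark "this word is an inline integer, not a pointer". */
static int is_tagged_int(Word w) { return (int)(w & 1u); }

static Word tag_int(int32_t v) {
    return ((Word)(uint32_t)v << 1) | 1u;
}

static int32_t untag_int(Word w) {
    assert(is_tagged_int(w));
    /* Narrowing back to int32_t wraps on gcc and clang. */
    return (int32_t)(uint32_t)(w >> 1);
}
```

A full 32-bit int plus a tag bit needs 33 bits, which is why a scheme like this only fits where references are 64 bits wide.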

@StackAllocate REMOVED from the public API (CodenameOne/src): nothing
applies it, and its contract -- no instance of the class EVER escapes
its creating frame -- depends on every caller, which no reusable class
can promise. The translator machinery stays: it is the engine behind
the AUTOMATIC per-call-site StringBuilder stack allocation, which
proves escape per site instead of trusting an annotation. @fused stays
public: its contract (constructor-created arrays remain encapsulated)
is enforceable by the class author alone.

Developer guide (performance.asciidoc): new sections for @fused (with
the contract and an example) and for the automatic optimizations
(stack-allocated string building, tagged integers, closed-world
devirtualization, compact collections, bounds-check elimination), plus
a pointer to vm/benchmarks; the fast-stack section now mentions the
frameless form.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions Bot commented Jul 3, 2026

Contributor

Developer Guide build artifacts are available for download from this workflow run:

Developer Guide quality checks:

  • AsciiDoc linter: No issues found (report)
  • Vale: No alerts found (report)
  • Paragraph capitalization: No paragraph capitalization issues (report)
  • LanguageTool: No grammar matches (report)
  • Image references: No unused images detected (report)

shai-almog and others added 2 commits July 3, 2026 07:00
Vale (Microsoft.Contractions x3) and LanguageTool (pointer-chase verb
agreement; 'devirtualization' added to the guide's accept list -- it is
the standard compiler term) flagged the new performance-annotations
prose; the developer-guide docs build treats these as build-breaking.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n stub bug

DateSpinner3D decides its column order by testing whether the first
character of L10NManager.formatDateLongStyle(date) is a LETTER
(month-day-year) or a digit (day-month-year):

    String firstChar = ...substring(0, 1);
    monthDayYear = !firstChar.toLowerCase().equals(firstChar.toUpperCase());

On the Linux/clean target, String.toLowerCase/toUpperCase were STUBS
returning `this` (fixed earlier on this branch with a real
towupper/towlower implementation) -- so lower.equals(upper) was always
true and the picker was forced to day-month-year regardless of locale.
The committed goldens captured that artifact. With working case
conversion, "July 3, 2026" correctly selects the US month-day-year
column order, and the five LightweightPicker/ValidatorLightweightPicker
screenshots (both arches) are refreshed from the CI captures.
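
A minimal C rendering of the first-character test, showing why identity stubs force the wrong answer. `month_day_year`, `CaseFn` and `stub_case` are hypothetical names; the Java original is the snippet quoted above, and `towlower`/`towupper` are the standard `<wctype.h>` functions.

```c
#include <assert.h>
#include <wchar.h>
#include <wctype.h>

typedef wint_t (*CaseFn)(wint_t);

/* The old behavior: case conversion that returns its input unchanged. */
static wint_t stub_case(wint_t c) { return c; }

/* A character is treated as a letter when its lower and upper forms differ.
   Returns 1 for month-day-year (leading letter), 0 for day-month-year. */
int month_day_year(const wchar_t *formatted, CaseFn lower, CaseFn upper) {
    assert(lower != NULL && upper != NULL);
    if (formatted == NULL || formatted[0] == L'\0') return 0;
    wint_t first = (wint_t)formatted[0];
    return lower(first) != upper(first);
}
```

With real conversions `month_day_year(L"July 3, 2026", towlower, towupper)` is 1; with `stub_case` for both it is 0 for every input, which is the artifact the old goldens captured.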

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

shai-almog commented Jul 3, 2026

Collaborator Author

Native Windows port (arm64)

Compared 138 screenshots: 97 matched, 5 updated, 36 missing actuals.

  • LightweightPickerButtons: updated screenshot. Screenshot differs (784x561 px, bit depth 8).
    Preview info: JPEG preview quality 70.
    Full-resolution PNG saved as LightweightPickerButtons.png in workflow artifacts.

  • LightweightPickerButtons_above_center: updated screenshot. Screenshot differs (784x561 px, bit depth 8).
    Preview info: JPEG preview quality 70.
    Full-resolution PNG saved as LightweightPickerButtons_above_center.png in workflow artifacts.

  • LightweightPickerButtons_below_right: updated screenshot. Screenshot differs (784x561 px, bit depth 8).
    Preview info: JPEG preview quality 70.
    Full-resolution PNG saved as LightweightPickerButtons_below_right.png in workflow artifacts.

  • LightweightPickerButtons_between_mixed: updated screenshot. Screenshot differs (784x561 px, bit depth 8).
    Preview info: JPEG preview quality 70.
    Full-resolution PNG saved as LightweightPickerButtons_between_mixed.png in workflow artifacts.

  • ValidatorLightweightPicker: updated screenshot. Screenshot differs (784x561 px, bit depth 8).
    Preview info: JPEG preview quality 70.
    Full-resolution PNG saved as ValidatorLightweightPicker.png in workflow artifacts.

Missing actual screenshots (test did not produce output; no preview available):

  • AppReviewDialog
  • DesktopMode
  • DialogTheme_dark
  • DialogTheme_light
  • FloatingActionButtonTheme_dark
  • FloatingActionButtonTheme_light
  • Gpu3DAnimation
  • Gpu3DCube
  • Gpu3DModel
  • Gpu3DTexturedCube
  • ListTheme_dark
  • ListTheme_light
  • LottieAnimatedScreenshotTest
  • MultiButtonTheme_dark
  • MultiButtonTheme_light
  • NativeMapFallback
  • PaletteOverrideTheme_dark
  • PaletteOverrideTheme_light
  • PickerTheme_dark
  • RealOsmVector
  • SVGAnimatedScreenshotTest
  • SVGStatic
  • ShowcaseTheme_dark
  • ShowcaseTheme_light
  • SpanLabelTheme_dark
  • SpanLabelTheme_light
  • TabsTheme_dark
  • TabsTheme_light
  • ToolbarTheme_dark
  • ToolbarTheme_light
  • VectorMapDarkStyle
  • VectorMapMarkers
  • VectorMapShapes
  • VideoIODecodedFrames
  • css-gradients
  • landscape

shai-almog and others added 9 commits July 3, 2026 09:42
The watch job's 3 screenshot diffs (ButtonTheme_dark/light,
ToastBarTopPosition) reproduce on CI but not locally (216-test local
run: the 3 CI failures pass; only locale-dependent picker/chart-time
diffs appear, from the en_IL host). CI gives no app-side visibility --
the launch discarded stdout/stderr -- so the failure mode (toast absent
at capture +2s, annotation callout falling back to the default font) is
unexplained. Wire simctl launch --stdout/--stderr and a log-stream
sidecar into the artifacts dir, dump the layered-pane tree at the
toast test's capture point, and log the annotation painter's resolved
font height so the next CI run answers what state the overlay is in.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CI hang forensics (build-ios wedged 22min inside RichTextAreaScreenshotTest,
process sampled at 13.3GB peak): the EDT sat in monitorEnter inside
__STATIC_INITIALIZER_com_codename1_io_BufferedOutputStream while logging a
throwable, with no live owner -- a thread whose <clinit> threw had unwound
via throwException without ever reaching the trailing monitorExit, leaving
the class monitor locked. Every later thread touching the class then blocks
forever, the GC's world-pause spins on the wedged mutator, and the whole
app freezes: exactly the intermittent mid-suite deaths seen across
build-ios / linux-gtk / screenshot-capture.

Emit the static initializer with monitorEnterBlock/monitorExitBlock (the
synchronized-method pattern) so throwException's unwind releases the class
monitor. New ClinitThrow reproducer deadlocks on the old emission and
completes with the fix; full gauntlet green (all tortures byte-identical,
GcStress/MtStress in both stop modes).

Also stream the app's full per-process console into the iOS test artifacts
(the CN1SS-filtered log hid the exception text that seeded this hang).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nostics

- cn1_globals.h: JAVA_ARRAY_BYTE must be 'signed char' -- bare char is
  unsigned in the aarch64/arm Linux ABI (signed on x86 and all Apple
  targets, which is why iOS never saw it). On the Linux arm64 leg every
  negative-byte round-trip broke: SimdApiTest's saturating byte add and
  SimdLargeAllocaTest's allocaByteFilled readback failed deterministically
  (reproduced + isolated in a local Docker arm64 rig; these API-test
  failures don't gate the screenshot job, so CI never surfaced them).
  Gauntlet green.

- ToastBarTopPosition: replace the fixed 2s wait with polling for the
  toast actually being visible (+2 settle ticks, 15s cap). The watch
  artifact's instrumented run shows the ToastBarComponent still
  visible=false height=0 at capture +3.5s: the EDT was inside the
  slideUp/slideDownAndWait nested loop and the UITimer fired from it,
  capturing mid-animation on slow runners (watch always, tvOS this
  round). The lingering animation also polluted the next test's
  glass-pane paint -- the ButtonTheme annotation-font diffs.

- Linux suite: tee the app's full stdout/stderr to CN1_APP_LOG_TEE and
  upload it (both glibc legs + musl); javac in CompilerHelper defaults
  to -encoding UTF-8 (C-locale containers read sources as US-ASCII);
  include the javac error log in the server-compile assert.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
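
The signedness point in the first bullet can be sketched in isolation. This is a minimal illustration, not the runtime's SIMD code: `java_byte`, `sat_add_i8` and `round_trip` are hypothetical names, and the only fact relied on is that plain `char` has implementation-defined signedness while `signed char` always covers Java's -128..127 range.

```c
#include <limits.h>

/* Hypothetical stand-in for the runtime's byte element type: explicitly
 * signed, so it matches Java's byte range (-128..127) on every ABI. */
typedef signed char java_byte;

/* Saturating add of two Java bytes: widen to int, add, clamp to the
 * signed 8-bit range. */
static java_byte sat_add_i8(java_byte a, java_byte b) {
    int sum = (int)a + (int)b;
    if (sum > SCHAR_MAX) return SCHAR_MAX;
    if (sum < SCHAR_MIN) return SCHAR_MIN;
    return (java_byte)sum;
}

/* Store a value into a byte cell and read it back widened to int, the
 * way a byte-array load would. If the cell type were plain char on an
 * ABI where char is unsigned, -1 would come back as 255. */
static int round_trip(int value) {
    java_byte cell[1];
    cell[0] = (java_byte)value;
    return (int)cell[0];
}
```

With the explicit `signed char` typedef both helpers behave the same on x86, Apple targets and Linux arm64; `CHAR_MIN` from `<limits.h>` is 0 on ABIs where bare `char` is unsigned, which is a quick way to check a given toolchain.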
lldb on a live wedged suite (the RichTextArea freeze that killed
build-ios/metal/mac-native at exactly 78 screenshots) shows the EDT
self-deadlocked in monitorEnter: the EDT runs on the main pthread with an
explicitly-passed CodenameOneThread state (threadId 3), took the
BufferedOutputStream class-init monitor as owner 3 while logging its first
throwable, and the clinit body's generated static-field accessors then
re-entered the initializer via getThreadLocalData() -- which returns the
main pthread's own TLS struct (threadId 1). The ownership check compared
1 != 3, missed the reentrant case, and pthread_mutex_lock'd the mutex the
same pthread already held. The GC's world-pause then spun on the wedged
EDT and the whole app froze.

Mutual exclusion belongs to the execution thread: ownerThread now stores
CN1_MONITOR_SELF() (pthread_self(), GetCurrentThreadId() under the Windows
shim), so dual thread-states on one pthread cannot defeat the reentrancy
check. Latent on master too (same code); this branch's seed exception (an
EDT StackOverflowError in RichTextArea, now logged and survivable, fix
tracked separately) merely exposed it.

Full local iOS suite now runs to completion past the old wedge point;
gauntlet green (all tortures byte-identical, GcStress/MtStress both stop
modes).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
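
A minimal sketch of the ownership rule described above, assuming a hypothetical `monitor_t` built on a small internal mutex and condition variable (the real monitor layout and `CN1_MONITOR_SELF()` are not reproduced here). The point it illustrates is only that the owner is recorded as the executing pthread, so the thread-state struct passed by the caller plays no part in the reentrancy check.

```c
#include <pthread.h>

/* Hypothetical per-thread VM state; two of these can exist for one pthread. */
typedef struct { int thread_id; } thread_state;

/* Hypothetical reentrant monitor. owner is meaningful only while depth > 0. */
typedef struct {
    pthread_mutex_t lock;      /* guards owner and depth */
    pthread_cond_t  released;  /* signalled when depth drops to 0 */
    pthread_t       owner;
    int             depth;
} monitor_t;

#define MONITOR_INIT { .lock = PTHREAD_MUTEX_INITIALIZER, \
                       .released = PTHREAD_COND_INITIALIZER, .depth = 0 }

static void monitor_enter(monitor_t *m, const thread_state *state) {
    (void)state;  /* deliberately ignored: identity is the pthread itself */
    pthread_t self = pthread_self();
    pthread_mutex_lock(&m->lock);
    if (m->depth > 0 && pthread_equal(m->owner, self)) {
        m->depth++;                       /* reentrant acquire */
    } else {
        while (m->depth > 0)              /* held by another pthread */
            pthread_cond_wait(&m->released, &m->lock);
        m->owner = self;
        m->depth = 1;
    }
    pthread_mutex_unlock(&m->lock);
}

static void monitor_exit(monitor_t *m) {
    pthread_mutex_lock(&m->lock);
    if (m->depth > 0 && pthread_equal(m->owner, pthread_self())) {
        if (--m->depth == 0)
            pthread_cond_signal(&m->released);
    }
    pthread_mutex_unlock(&m->lock);
}
```

Entering once with a state whose `thread_id` is 3 and again with one whose `thread_id` is 1, both on the same pthread, simply raises `depth` to 2 instead of blocking, which is the case the old id comparison missed.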
GetCurrentThreadId() was undeclared in the generated-C context (windows.h
is not included by cn1_globals.h); the shim's pthread_t already carries
GetCurrentThreadId() in .id.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The artifacts/linux-port/raw directory only exists after the capture
copies screenshots, so the tee's FileWriter threw at app start and the
log silently never materialized in CI artifacts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The whole JS suite was dying at boot (0 of 139 screenshots, mislabeled
flaky): UIManagerHolder's clinit threw
'cn1_java_lang_Integer_valueOfHeap_int_R_java_lang_Integer is not
defined'. The cull retention and RTA seeds kept the delegate twins alive
through every analysis pass -- but the bundle writer's identifier
minifier renamed their DEFINITIONS, because parparvm_runtime.js calls
them as bare identifiers and the exclusion set only collects
string-literal tokens plus native stubs. Record the delegate identifiers
at emission (mirroring NATIVE_METHOD_IDENTIFIERS) and exclude them.

Verified on a minimal Integer.valueOf app: both the canonical name and
its __impl body now ship unrenamed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The arm64/x64 glibc legs still intermittently freeze mid-print entering a
theme test (the tee now proves the app goes silent with no exception and
no GC diagnostics -- those are __OBJC__-only). Linux has no equivalent of
the macOS sample-based hang report that pinned the iOS EDT deadlock, so
capture one: when the app is alive but its output stalls 90s, dump every
thread's native stack into the uploaded artifact.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
shai-almog commented Jul 3, 2026

Collaborator Author

Compared 133 screenshots: 133 matched.
✅ JavaScript-port screenshot tests passed.

shai-almog and others added 6 commits July 3, 2026 17:54
The guard compared the current C frame address one-sidedly against the
per-thread-state stack limit. iOS natives dispatch_sync blocks onto the
main queue and call Java helpers (toNSString -> String.getBytes) with the
EDT's CAPTURED threadStateData, so main-thread frame addresses were
tested against the EDT's stack bounds -- when the main thread's stack
mapped below the EDT's limit, the first such call spuriously threw
StackOverflowError. That was the seed of the RichTextArea failure chain
(78-of-140 wedge on build-ios/metal/mac-native): the spurious SOE was
logged, logging entered BufferedOutputStream's class initializer, and the
two monitor bugs fixed previously turned that into a full freeze. Live
evidence: at the SOE breakpoint the entire process held 123 frames across
all threads (nothing deep), and both depth counters sat single-digit.

Make the trip test two-sided: only a frame address INSIDE the 256KB guard
band [limit - BAND, limit) throws. A foreign stack essentially never maps
into another stack's band, while genuine overflow must descend through
the band (no frameless frame approaches 256KB), so real detection is
preserved. Validated: full local iOS suite runs clean end-to-end with
zero StackOverflowError and RichTextArea/CodeEditor producing their
screenshots; gauntlet green (SoeTest still passes -- deep recursion on
the OWN stack still lands in the band and throws catchably).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
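
The difference between the one-sided and two-sided trip tests can be sketched as two predicates. This is an illustration under stated assumptions: the function names and `GUARD_BAND` macro are hypothetical, the stack is assumed to grow downward, and `limit` is taken to be the lowest address the owning thread should reach.

```c
#include <stdbool.h>
#include <stdint.h>

/* 256KB guard band, as described in the commit message. */
#define GUARD_BAND ((uintptr_t)256 * 1024)

/* One-sided test: every address below the limit trips, including a frame
 * on a different thread's stack that merely happens to map lower. */
static bool trips_one_sided(uintptr_t frame, uintptr_t limit) {
    return frame < limit;
}

/* Two-sided test: only addresses inside [limit - GUARD_BAND, limit) trip.
 * Written with a subtraction on the already-checked side so it cannot
 * wrap around when limit is smaller than GUARD_BAND. */
static bool trips_two_sided(uintptr_t frame, uintptr_t limit) {
    return frame < limit && (limit - frame) <= GUARD_BAND;
}
```

A frame far below the limit, such as one on a foreign stack, trips the first predicate but not the second, while a frame just under the limit trips both.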
…tatus

The gdb-attach watchdog fired on output silence and found the pid already
gone: the app DIES mid-run (stdout cut mid-line, no exception, harness
then burns its stabilization window polling a dead process). Detect
process death in the harness immediately and print the exit status
(128+N = signal N), and post-mortem any core dump into the artifact.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
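
The "128+N = signal N" convention mentioned above is the usual shell encoding of a signal death. A small sketch of the decoding, with a hypothetical helper name (the harness's actual reporting code is not shown in this PR excerpt):

```c
/* Decode a shell-style exit status. Returns the signal number when the
 * status encodes death by signal (128 + N), or 0 for a normal exit code. */
static int signal_from_exit_status(int status) {
    return (status > 128 && status < 256) ? status - 128 : 0;
}
```

Under this convention a status of 139 corresponds to signal 11 and 134 to signal 6; statuses of 128 or below are treated as ordinary exit codes.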
The core shows a parallel mark worker calling through a corrupt
markFunction (gcMarkWorkerDrainLoop popped a worklist entry whose
object header was destroyed between push and pop). Dump the drain
loop's locals, the popped batch, and the mark state alongside the
backtraces so the next occurrence identifies the victim object.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The poll settled after 2 visible ticks (~400ms), but ToastBar.show()
runs slideUpAndWait(2)+slideDownAndWait(800) -- the component reports
visible with full bounds while still animating into view, so tvOS
captured a half-slid/absent toast (ButtonTheme was fine; only the toast
frame raced). Require 1400ms of continuous visibility past the ~802ms
animation before capturing; the 15s cap still bounds a broken toast.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
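
The capture rule (1400ms of continuous visibility, bounded by a 15s cap) can be sketched as a small polling state machine. The test itself is Java; this C version only illustrates the timing logic, and every name in it is hypothetical. What the test does once the cap is reached is not stated in the commit, so the sketch just reports it as a distinct result.

```c
#include <stdbool.h>

#define SETTLE_MS 1400   /* continuous visibility required before capture */
#define CAP_MS    15000  /* upper bound on total polling time */

typedef struct {
    long start_ms;          /* when polling began */
    long visible_since_ms;  /* -1 while the toast is not visible */
} settle_state;

typedef enum { KEEP_POLLING, CAPTURE_NOW, CAP_REACHED } settle_result;

static settle_state settle_begin(long now_ms) {
    settle_state s = { now_ms, -1 };
    return s;
}

/* Feed one poll sample. Any invisible sample restarts the window, so a
 * toast that flickers during its slide animation never counts as settled. */
static settle_result settle_poll(settle_state *s, bool visible, long now_ms) {
    if (!visible) {
        s->visible_since_ms = -1;
    } else if (s->visible_since_ms < 0) {
        s->visible_since_ms = now_ms;
    }
    if (visible && now_ms - s->visible_since_ms >= SETTLE_MS)
        return CAPTURE_NOW;
    if (now_ms - s->start_ms >= CAP_MS)
        return CAP_REACHED;
    return KEEP_POLLING;
}
```

Requiring the window to be continuous is the design choice that matters: counting visible ticks alone is what let the earlier version capture mid-animation.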
…ruption)

Root cause of the Linux arm64 suite crash (random SIGSEGV in the theme
phase; three CI cores at three different wild PCs -- gcMarkWorkerDrainLoop
markFunction, cn1MakeFont, LinuxImplementation_exists -- classic
heap-corruption signature; x64 leg never crashed). The allocator
(cn1BibopInitSlot) writes parentClsReference/heapPosition and then
RELEASE-stores the mark word LAST: the mark word is the object's single
publication point. gcMarkObject's parallel-worker path loaded it RELAXED,
so on arm64's weak memory model a worker could observe the object without
observing the preceding parentClsReference store, then dereferenced a
stale/garbage parentClsReference->markFunction. x86 hid it (every x86
load is acquire); it is branch-only (parallel marking, aa2838e, is not
on master).

Acquire-load the mark word before reading any other header field, pairing
with the allocator's release store; reuse that snapshot as the claim's
'old'. Orders every parentClsReference read -- the guard, the CAS-success
deref, and (through the worklist mutex's release/acquire) the drain
worker's deref. Serial path unchanged. Gauntlet green on Apple-Silicon
arm64 (same weak-memory model, parallel path active): all tortures
byte-identical, GcStress/MtStress both stop modes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
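
The release/acquire pairing described above can be sketched with C11 atomics. This is an illustration under assumptions, not the runtime's code: the struct layout, the function names and the mark-word encoding (0 for "not yet published", any other value for a published state) are hypothetical; only the ordering pattern is taken from the commit message.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical class descriptor holding the per-class mark callback. */
typedef struct clazz { void (*markFunction)(void *obj); } clazz;

/* Hypothetical object header: cls is a plain field written before the
 * object is published; mark is the single publication point. */
typedef struct {
    clazz      *cls;
    atomic_uint mark;   /* assumed encoding: 0 = not yet published */
} object_header;

/* Allocator side: fill in the header, then release-store the mark word
 * last so everything written before it becomes visible with it. */
static void publish(object_header *o, clazz *cls, unsigned unmarked) {
    o->cls = cls;
    atomic_store_explicit(&o->mark, unmarked, memory_order_release);
}

/* Marker side: acquire-load the mark word before touching any other
 * header field, and reuse that snapshot as the expected value of the
 * claiming compare-exchange. Returns the class on a successful claim,
 * NULL if the object is unpublished or already claimed. */
static clazz *try_claim(object_header *o, unsigned marked) {
    unsigned old = atomic_load_explicit(&o->mark, memory_order_acquire);
    if (old == 0 || old == marked)
        return NULL;
    if (!atomic_compare_exchange_strong_explicit(
            &o->mark, &old, marked,
            memory_order_acq_rel, memory_order_acquire))
        return NULL;    /* another worker changed the word first */
    return o->cls;      /* ordered after the allocator's store to cls */
}
```

With a relaxed load in place of the acquire, a weakly ordered CPU is allowed to let the read of `cls` observe a value from before `publish` ran, which matches the stale `parentClsReference` dereference described in the commit.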
The acquire-load fix removed the parallel mark-WORKER crash (zero
gcMarkWorkerDrainLoop frames in the next arm64 core), but arm64 Linux
still corrupts the heap -- the crash moved to a frameless method reading
a smashed threadStateData -- so a second ordering hole remains in the
branch-only parallel-GC work. Force one marker (bypassing the whole
parallel path: gcMarkDrainParallel -> serial gcMarkDrain, no atomics, no
pool) as a git-A/B isolation step. Green arm64 => parallel marking is the
sole remaining corruptor and the audit continues offline behind
CN1_GC_MARK_THREADS>1; still-red => the bug is elsewhere in the branch GC
changes. The acquire fix stays in for when parallel marking is re-enabled.
Gauntlet green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>