perf: replace ArrayList consumer wheel with LongMap for O(1) keyed removal #3063
Open
He-Pin wants to merge 3 commits into
…moval

Motivation:
BroadcastHub's findAndRemoveConsumer used ArrayList.removeIf, which is O(k) per event with a lambda allocation on every call. In high-fan-out scenarios (thousands of consumers clustered in the same wheel slot), this creates a producer backpressure bottleneck: the head can only advance after the head slot is empty, and draining a large slot requires k linear scans, each of O(k) cost.

Modification:
Replace Array[java.util.ArrayList[Consumer]] with Array[LongMap[Consumer]] keyed by Consumer.id. Slots are lazily allocated (null = empty) and released to null when drained, eliminating baseline memory for empty slots and enabling GC of drained LongMaps. The hot path uses getOrNull + -= (two primitive hash lookups) instead of remove (which would allocate an Option), achieving zero heap allocation per add/remove cycle. There is no Long boxing, since LongMap stores primitive long keys. Adds null guards in the Advance/NeedWakeup event handlers to prevent a latent NPE when findAndRemoveConsumer returns null. Updates onUpstreamFailure and wakeupIdx to skip null (empty) slots.

Result:
Consumer add/remove is O(1) with zero Long boxing and zero Option allocation. High-consumer lockstep scenarios see dramatically reduced producer backpressure from wheel slot contention. Memory for empty wheel slots drops from ~40 bytes per ArrayList to 0 (null).

Tests:
- sbt "stream-tests/Test/testOnly *HubSpec" → 50 passed, 0 failed
- sbt "++3.3.8; stream/compile" → success
- sbt "stream/mimaReportBinaryIssues" → no issues
- sbt "bench-jmh/compile" → success

References:
Inspired by akkadotnet/akka.net#8264 (Dictionary-based consumer wheel). Pekko uses scala.collection.mutable.LongMap instead of HashMap for zero boxing on Long keys and contiguous open-addressing memory layout.
Adds BroadcastHubBenchRunner for direct measurement of consumer wheel throughput under high-fan-out scenarios, bypassing JMH infrastructure classpath issues in the bench-jmh module.

Measures lockstep broadcast throughput at 4 consumer counts (64, 256, 1000, 2000) across 2 buffer sizes (64, 256) with 2 warmup + 3 measured runs per configuration.

Results on Apple M-series (elements/sec, higher is better):

Buffer=64 (128 wheel slots, max clustering):
    64 consumers:    296,756 elem/s
   256 consumers:     76,075 elem/s
  1000 consumers:     19,737 elem/s
  2000 consumers:     10,223 elem/s

Buffer=256 (512 wheel slots, moderate clustering):
    64 consumers:  1,148,340 elem/s
   256 consumers:    271,505 elem/s
  1000 consumers:     70,727 elem/s
  2000 consumers:     33,717 elem/s

Throughput degrades gracefully with consumer count, demonstrating the O(1) LongMap removal holds up under high per-slot contention.

Tests:
- sbt "bench-jmh/compile" → success
- sbt "bench-jmh/runMain org.apache.pekko.stream.BroadcastHubBenchRunner" → completed

References:
Refs #3063
…sults

Run headerCreateAll for the new benchmark runner file.

Comparison benchmark results (old ArrayList vs new LongMap):

Buffer=64 (128 wheel slots):
  Consumers  ArrayList(elem/s)  LongMap(elem/s)  Speedup
         64            305,657          296,756    0.97x
        256             72,446           76,075    1.05x
       1000             13,070           19,737    1.51x
       2000              4,348           10,223    2.35x

Buffer=256 (512 wheel slots):
  Consumers  ArrayList(elem/s)  LongMap(elem/s)  Speedup
         64          1,099,345        1,148,340    1.04x
        256            197,676          271,505    1.37x
       1000             27,804           70,727    2.54x
       2000              7,943           33,717    4.24x

The LongMap optimization provides 2.35x-4.24x speedup at 2000 consumers, with the gap widening as consumer count increases, confirming the O(k) linear scan was the dominant bottleneck.

Tests:
- sbt "bench-jmh/headerCreateAll" → header created
- sbt "bench-jmh/compile" → success

References:
Refs #3063
Motivation

BroadcastHub's `findAndRemoveConsumer` used `ArrayList.removeIf`, which is O(k) per event with a lambda allocation on every call. In high-fan-out scenarios (thousands of consumers clustered in the same wheel slot), this creates a producer backpressure bottleneck: the head can only advance after the head slot is empty, and draining a large slot requires k linear scans, each of O(k) cost.

Inspired by akkadotnet/akka.net#8264, which replaced `ImmutableList<Consumer>[]` with `Dictionary<long, Consumer>[]`.

Modification

Replace `Array[java.util.ArrayList[Consumer]]` with `Array[LongMap[Consumer]]` keyed by `Consumer.id`:

- `LongMap.getOrNull` + `-=` (two primitive hash lookups) replaces `ArrayList.removeIf` (O(k) linear scan with lambda allocation)
- `getOrNull` returns raw `V` (not `Option[V]`), avoiding `Some`/`None` allocation on the hot path
- `LongMap` stores primitive `long` keys in contiguous open-addressing arrays, unlike `HashMap[Long, _]`, which boxes to `java.lang.Long`
- Empty slots are `null` (no backing map), reducing baseline memory from ~40 bytes × `bufferSize` × 2 to 0
- When a slot is drained, its `LongMap` reference is nulled so the map becomes eligible for GC immediately
- `Advance`/`NeedWakeup` handlers now guard against `findAndRemoveConsumer` returning null, preventing a latent NPE
- `onUpstreamFailure` and `wakeupIdx` skip null (empty) slots during iteration

Result

| | Old (`ArrayList`) | New (`LongMap`) |
|---|---|---|
| Consumer removal | O(k) linear scan (`removeIf`) | O(1) keyed removal (`getOrNull` + `-=`) |
| Allocation per removal | Lambda on every call | None (no `Long` boxing, no `Option`) |
| Memory per empty wheel slot | ~40 bytes | 0 (`null`) |

No public API changes. Binary compatibility preserved.
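The slot handling described under Modification can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Consumer` here is a stand-in for BroadcastHub's internal class, and `ConsumerWheel`, `add`, `findAndRemove`, and `isSlotEmpty` are hypothetical names chosen for the example.

```scala
import scala.collection.mutable.LongMap

// Hypothetical stand-in for BroadcastHub's internal consumer record.
final class Consumer(val id: Long)

// Sketch of a consumer wheel whose slots are lazily allocated LongMaps.
final class ConsumerWheel(slots: Int) {
  // A null entry means "empty slot": no backing map exists until first use.
  private val wheel = new Array[LongMap[Consumer]](slots)

  def add(slot: Int, consumer: Consumer): Unit = {
    var map = wheel(slot)
    if (map eq null) {
      map = LongMap.empty[Consumer]
      wheel(slot) = map
    }
    map.update(consumer.id, consumer)
  }

  // Returns the removed consumer, or null when the id is not in the slot.
  def findAndRemove(slot: Int, id: Long): Consumer = {
    val map = wheel(slot)
    if (map eq null) null
    else {
      val consumer = map.getOrNull(id) // raw value or null, no Option involved
      if (consumer ne null) {
        map -= id
        if (map.isEmpty) wheel(slot) = null // release the drained map
      }
      consumer
    }
  }

  def isSlotEmpty(slot: Int): Boolean = wheel(slot) eq null
}
```

Because `findAndRemove` can return null in this sketch, every caller has to check the result before using it, which is the role the null guards in the `Advance`/`NeedWakeup` handlers play.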
Benchmark Results
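Lockstep broadcast throughput (elements/sec, higher is better). 100K elements, 2 warmup + 3 measured runs, Apple M-series. Run with `BroadcastHubBenchRunner`.

Buffer=64 (128 wheel slots, maximum clustering):

| Consumers | ArrayList (elem/s) | LongMap (elem/s) | Speedup |
|---:|---:|---:|---:|
| 64 | 305,657 | 296,756 | 0.97x |
| 256 | 72,446 | 76,075 | 1.05x |
| 1000 | 13,070 | 19,737 | 1.51x |
| 2000 | 4,348 | 10,223 | 2.35x |

Buffer=256 (512 wheel slots, moderate clustering):

| Consumers | ArrayList (elem/s) | LongMap (elem/s) | Speedup |
|---:|---:|---:|---:|
| 64 | 1,099,345 | 1,148,340 | 1.04x |
| 256 | 197,676 | 271,505 | 1.37x |
| 1000 | 27,804 | 70,727 | 2.54x |
| 2000 | 7,943 | 33,717 | 4.24x |

Key observations:

- At 64 consumers the two implementations are within about 4% of each other (0.97x and 1.04x).
- At 2000 consumers `LongMap` is 2.35x to 4.24x faster, and the gap widens as consumer count increases, which points to the O(k) linear scan as the dominant bottleneck in the old code.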
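As a rough illustration of why the old removal path scales poorly with consumer count, the sketch below counts predicate evaluations when k entries sharing one slot are removed one at a time with `ArrayList.removeIf`. It is a cost model only, not the benchmark used for this PR; `Entry` and `DrainCost` are hypothetical names, and the worst case of all k consumers sitting in a single slot is an assumption.

```scala
import java.util.ArrayList

// Hypothetical stand-in for a consumer stored in a wheel slot.
final class Entry(val id: Long)

object DrainCost {
  // Counts predicate evaluations when k entries are removed one by one
  // with ArrayList.removeIf, which tests every remaining element per call.
  def removeIfEvaluations(k: Int): Long = {
    val list = new ArrayList[Entry](k)
    var i = 0
    while (i < k) { list.add(new Entry(i.toLong)); i += 1 }

    var evaluations = 0L
    var id = 0L
    while (id < k) {
      val target = id
      list.removeIf { (e: Entry) => evaluations += 1; e.id == target }
      id += 1
    }
    evaluations
  }
}
```

Each `removeIf` call tests every element still in the list, so draining k entries costs k + (k - 1) + ... + 1 = k(k + 1)/2 evaluations: 2,001,000 for k = 2000. A keyed map needs one lookup and one removal per entry instead, so its drain cost grows linearly in k.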
Tests
- `sbt "stream-tests/Test/testOnly *HubSpec"` → 50 passed, 0 failed (includes 2 new high-consumer tests)
- `sbt "++3.3.8; stream/compile"` → success
- `sbt "stream/mimaReportBinaryIssues"` → no issues
- `sbt "bench-jmh/compile"` → success
- `sbt "bench-jmh/runMain org.apache.pekko.stream.BroadcastHubBenchRunner"` → completed (results above, both old and new code)
- `sbt "bench-jmh/headerCreateAll"` → clean
- `scalafmt --mode diff-ref=origin/main` → clean
- `git diff --check` → no whitespace errors

References

Inspired by akkadotnet/akka.net#8264. Pekko uses `scala.collection.mutable.LongMap` instead of `Dictionary`/`HashMap` for zero boxing on `Long` keys and contiguous open-addressing memory layout.