Add: scope_stats collector for per-scope queue-fill peaks #858
Open
doraemonmj wants to merge 3 commits into
Code Review
This pull request introduces a decoupled scope-stats collector (PTO2_SCOPE_STATS) to report peak ring-buffer fill (heap bytes and tasks in-flight) and tensormap usage at each user scope's end. The feedback highlights critical thread-safety issues with global variables in multi-threaded orchestrator environments, recommending the use of thread_local storage and passing the orchestrator state pointer to the hooks for lazy binding. Additionally, defensive checks are recommended to prevent potential division-by-zero and null pointer dereferences.
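The review's three suggestions (per-thread collector state, lazy binding through a state pointer passed to each hook, and guards against a null pointer or zero capacity) could be sketched roughly as follows. Every type and function name here is hypothetical and not taken from the PR; it only illustrates the shape of the recommendation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the orchestrator state; field names are assumptions.
struct OrchStateSketch {
    std::uint64_t heap_capacity = 0;
    std::uint64_t heap_used = 0;
};

// Per-thread collector state, so two orchestrator threads never share peaks.
struct ThreadCollectorSketch {
    const OrchStateSketch* bound = nullptr;
    std::uint64_t peak_heap_used = 0;
};

inline ThreadCollectorSketch& tls_collector() {
    thread_local ThreadCollectorSketch collector;
    return collector;
}

// The hook receives the state pointer and binds lazily on first use, or
// rebinds (resetting the peak) when a different state shows up on this thread.
inline void on_state_changed_sketch(const OrchStateSketch* state) {
    if (state == nullptr) return;  // defensive: nothing to sample
    ThreadCollectorSketch& c = tls_collector();
    if (c.bound != state) {
        c.bound = state;
        c.peak_heap_used = 0;
    }
    if (state->heap_used > c.peak_heap_used) c.peak_heap_used = state->heap_used;
}

// Fill percentage with a zero-capacity guard instead of dividing blindly.
inline double fill_percent_sketch(std::uint64_t used, std::uint64_t capacity) {
    if (capacity == 0) return 0.0;
    return 100.0 * static_cast<double>(used) / static_cast<double>(capacity);
}
```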
7f7da20 to b7005c5
added 2 commits
May 26, 2026 22:12
Introduce PTO2_SCOPE_STATS, a compile-time gated, runtime-cheap probe that captures the peak ring-buffer fill (heap bytes + tasks in-flight) for all 4 rings within each user scope, plus a tensormap usage snapshot, emitted as one [ScopeStats] line at scope_end. Independent of PTO2_PROFILING: answers a sizing question (did this scope come close to filling the ring?) rather than a timing question.

- Add scope_stats_collector TU (a2a3 default ON, a5 default OFF) with begin/end/state_changed/bind probes; collector keeps per-(depth, ring) peak tables and emits a single log line at each user scope's end.
- Probe PTO2TaskAllocator at the two heap-state write points (alloc commit, heap_tail recompute) via scope_stats_on_allocator_state_changed. Heap usage and tasks in-flight move in lockstep, so one probe samples both.
- Wire scope_stats_bind in PTO2OrchestratorState::init_from_layout and on_begin/on_end in begin_scope/end_scope (orchestrator layer, so the implicit aicpu_executor root scope still pairs; suppressed from logs).
- Expose neutral const introspection (current_used/pool_capacity on PTO2TensorMap; friend-accessible heap_top_/heap_tail_/local_task_id_ on PTO2TaskAllocator) gated by PTO2_SCOPE_STATS so OFF mode adds nothing.
- Document the new gate, log format, and orthogonality in docs/profiling_levels.md.
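A per-(depth, ring) peak table of the kind described above might look like the sketch below. It is not the PR's collector: the class name, the depth constant, and the choice to fold a child scope's peaks into its parent at scope end are all assumptions made for illustration. Only the "4 rings" count and the heap-bytes / tasks-in-flight pair come from the commit message.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kMaxScopeDepthSketch = 8;  // hypothetical; real value not shown
constexpr std::size_t kNumRingsSketch = 4;       // "all 4 rings" per the commit message

struct RingPeakSketch {
    std::uint64_t heap_bytes = 0;
    std::uint32_t tasks_in_flight = 0;
};

using ScopeRowSketch = std::array<RingPeakSketch, kNumRingsSketch>;

class PeakTableSketch {
public:
    // Entering a scope clears that depth's row so peaks are per-scope.
    void on_begin() {
        if (depth_ < kMaxScopeDepthSketch) rows_[depth_] = ScopeRowSketch{};
        ++depth_;
    }

    // One sample updates both metrics for the innermost open scope.
    void on_state_changed(std::size_t ring, std::uint64_t heap_bytes, std::uint32_t tasks) {
        if (depth_ == 0 || depth_ > kMaxScopeDepthSketch || ring >= kNumRingsSketch) return;
        RingPeakSketch& p = rows_[depth_ - 1][ring];
        if (heap_bytes > p.heap_bytes) p.heap_bytes = heap_bytes;
        if (tasks > p.tasks_in_flight) p.tasks_in_flight = tasks;
    }

    // Leaving a scope returns its row and folds it into the parent (assumed policy),
    // so an outer scope's peak is never lower than any inner scope's.
    ScopeRowSketch on_end() {
        if (depth_ == 0) return ScopeRowSketch{};
        --depth_;
        if (depth_ >= kMaxScopeDepthSketch) return ScopeRowSketch{};  // untracked depth
        const ScopeRowSketch row = rows_[depth_];
        if (depth_ > 0) {
            ScopeRowSketch& parent = rows_[depth_ - 1];
            for (std::size_t r = 0; r < kNumRingsSketch; ++r) {
                if (row[r].heap_bytes > parent[r].heap_bytes) parent[r].heap_bytes = row[r].heap_bytes;
                if (row[r].tasks_in_flight > parent[r].tasks_in_flight)
                    parent[r].tasks_in_flight = row[r].tasks_in_flight;
            }
        }
        return row;
    }

private:
    std::array<ScopeRowSketch, kMaxScopeDepthSketch> rows_{};
    std::size_t depth_ = 0;
};
```

Updating only the innermost row keeps the sampling path O(1); the fold at scope end is where nesting is accounted for.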
Move scope_stats from a pure compile-time gate (PTO2_SCOPE_STATS, default OFF, rebuild to flip) to the same shape as l2_swimlane/pmu/dep_gen: compiled in under the PTO2_PROFILING umbrella and toggled per-run by `--enable-scope-stats` (default OFF). Probes early-return on a relaxed bool load when the flag is off.

- Add CallConfig::enable_scope_stats; bump wire size to 7*i32+1024; thread through nanobind, mailbox _CFG_FMT, run_prepared (host/sim x a2a3/a5), DeviceRunner setter + bitmask, and AICPU kernel.cpp dispatch.
- Add PROFILING_FLAG_SCOPE_STATS bit; widen kernel_args comment.
- Drop PTO2_SCOPE_STATS define and rewrite all #if sites to PTO2_PROFILING.
- Switch the [ScopeStats] emit from a raw unified_log_info_v call to LOG_INFO_V5 (adds the file:line prefix; emit semantics unchanged).
- Wire --enable-scope-stats through conftest.py + scene_test.py (pytest + standalone + child-cmd) without folding into diagnostics_any(), since scope_stats has no output_prefix dependency.
- Update profiling_levels.md to describe the runtime-flag form.

Known limitation: workloads with hot-loop scopes (e.g. paged_attention) can hit a flow-control deadlock when --enable-scope-stats is on, because the per-scope V5 emit IO slows the orchestrator below scheduler drain rate. Next commit will replace per-scope emit with an aggregated end-of-run summary so the hot path has no log IO.
3f3d0bb to e72ddd5
Per-scope_end LOG_INFO_V5 in scope_stats_on_end starved the orchestrator (8469 calls @ ~12us each in paged_attention), pushing ring 2 task window past its spin cap and tripping PTO2_ERROR_FLOW_CONTROL_DEADLOCK whenever --enable-scope-stats was on.

- Device side: collector now appends ScopeStatsRecord into a host-allocated ScopeStatsBuffer (shared header + ring of records); on_end / on_fatal do a struct store + counter bump, zero IO
- Host side: ScopeStatsHostBuffer (header-only) owns the buffer and dumps <output_prefix>/scope_stats.json next to the other DFX outputs (l2_perf_records.json / deps.json), with used/cap formatted as strings to match the prior log shape
- Transport: a2a3 maps the region host-visible via halHostRegister; a5 keeps a host shadow and refreshes it via rtMemcpy DEVICE_TO_HOST at dump time. set_platform_scope_stats_base mirrors the existing set_platform_*_base hooks; kernel_args carries the device pointer
- Wiring: DeviceRunner gains init_scope_stats() / dump call after stream sync / finalize, mirroring init_dep_gen; CallConfig and scene_test pull --enable-scope-stats into diagnostics_any/on so the output_prefix contract applies
e72ddd5 to 3a3e4e8
Summary
Add a scope_stats diagnostic that records per-scope peak occupancy of ring buffer / heap / tensormap / fanin / dep_pool, and dumps <output_prefix>/scope_stats.json at the end of each run. Aimed at capacity-class failures (PTO2_ERROR_FLOW_CONTROL_DEADLOCK, RING_BUFFER_BACKPRESSURE) — tells the user which scope, on which ring, exhausted which resource.
Details
Collector (device side). New scope_stats_collector.cpp under src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/. Probes wired into the orchestrator track a PTO2_MAX_SCOPE_DEPTH × PTO2_MAX_RING_DEPTH peak matrix; per-scope source site flows from PTO2ScopeGuard's ctor via scope_stats_set_pending_site. Compile-time gated by PTO2_PROFILING.
Runtime flag. New --enable-scope-stats CLI / CallConfig::enable_scope_stats / PROFILING_FLAG_SCOPE_STATS bit. AICPU entry calls set_scope_stats_enabled(...); when off, every probe collapses to one relaxed bool load.
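For illustration, the "one relaxed bool load" gate could be shaped like the sketch below. The bit position, the setter's signature, and the counter are assumptions; the PR only states that the AICPU entry calls set_scope_stats_enabled(...) and that disabled probes collapse to a relaxed load.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical bit position; the real PROFILING_FLAG_SCOPE_STATS value is not shown.
constexpr std::uint32_t kFlagScopeStatsSketch = 1u << 3;

std::atomic<bool> g_scope_stats_enabled_sketch{false};
std::uint64_t g_probe_hits_sketch = 0;  // stands in for the real sampling work

// Called once at kernel entry with the per-run profiling bitmask (assumed shape).
void set_scope_stats_enabled_sketch(std::uint32_t profiling_flags) {
    g_scope_stats_enabled_sketch.store((profiling_flags & kFlagScopeStatsSketch) != 0,
                                       std::memory_order_relaxed);
}

// Probe body: when the flag is off, the only cost is one relaxed load and a branch.
inline void probe_sketch() {
    if (!g_scope_stats_enabled_sketch.load(std::memory_order_relaxed)) return;
    ++g_probe_hits_sketch;
}
```

Relaxed ordering is enough here on the assumption that the flag is written before any probe runs and never guards other data.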
Emission off the hot path. An earlier draft called LOG_INFO_V5 synchronously at every scope exit (8469 calls per paged_attention run at ~12 µs each), which starved the orchestrator and tripped PTO2_ERROR_FLOW_CONTROL_DEADLOCK. The probe now just stores a record into a host-allocated ScopeStatsBuffer (header + ring of records): zero vsnprintf, zero IO. scope_stats_on_fatal only latches fatal_latched.
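A "struct store + counter bump" append could be sketched as below. The record fields, the fixed capacity, the single-writer assumption, and the overwrite-oldest wraparound policy are all guesses made for the example; the PR does not show ScopeStatsRecord or the header layout.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical record layout.
struct RecordSketch {
    std::uint32_t scope_depth;
    std::uint32_t ring;
    std::uint64_t peak_heap_bytes;
    std::uint32_t peak_tasks;
};

constexpr std::size_t kRecordCapSketch = 4;  // tiny on purpose, to show wraparound

// Header fields followed by the ring of records.
struct BufferSketch {
    std::atomic<std::uint64_t> total_written{0};
    std::atomic<bool> fatal_latched{false};
    RecordSketch records[kRecordCapSketch]{};
};

// Hot-path append: one struct store and one counter bump, no formatting, no IO.
// Assumes a single writer; the release store publishes the record to a reader
// that loads total_written with acquire ordering.
inline void append_record_sketch(BufferSketch& buf, const RecordSketch& rec) {
    const std::uint64_t n = buf.total_written.load(std::memory_order_relaxed);
    buf.records[n % kRecordCapSketch] = rec;  // assumed policy: overwrite oldest
    buf.total_written.store(n + 1, std::memory_order_release);
}

// Fatal path only latches a flag for the host to inspect later.
inline void on_fatal_sketch(BufferSketch& buf) {
    buf.fatal_latched.store(true, std::memory_order_relaxed);
}
```

The host side can detect overwritten entries by comparing total_written against the capacity.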
Output. ScopeStatsHostBuffer (header-only) dumps <output_prefix>/scope_stats.json next to l2_perf_records.json / deps.json, keeping the used/cap string shape (e.g. "task_window": ["0/16384","5/16384",...]).
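The used/cap string shape is simple to reproduce. A minimal sketch, with hypothetical helper names, that builds one field in the form shown above:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// "used/cap" as a string, e.g. 5 and 16384 become "5/16384".
inline std::string used_cap_sketch(std::uint64_t used, std::uint64_t cap) {
    return std::to_string(used) + "/" + std::to_string(cap);
}

// One JSON field with a per-ring array of used/cap strings.
// The key is assumed to need no escaping.
inline std::string json_field_sketch(
    const std::string& key,
    const std::vector<std::pair<std::uint64_t, std::uint64_t>>& per_ring) {
    std::string out = "\"" + key + "\": [";
    for (std::size_t i = 0; i < per_ring.size(); ++i) {
        if (i != 0) out += ",";
        out += "\"" + used_cap_sketch(per_ring[i].first, per_ring[i].second) + "\"";
    }
    out += "]";
    return out;
}
```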
Cross-platform transport. a2a3 maps the buffer host-visible via halHostRegister; a5 keeps a host shadow and refreshes it via rtMemcpy DEVICE_TO_HOST at dump time; sim is single-address-space. Mode is selected by which callbacks the user passes to ScopeStatsHostBuffer::init(...).
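Selecting the transport by which callbacks are supplied could be sketched roughly as below. The class, the callback signature, and the method names are hypothetical and do not reflect the real ScopeStatsHostBuffer::init(...) parameters. The idea shown is only that an empty refresh callback means the region is read in place (mapped or single-address-space), while a non-empty one means a host shadow is refreshed at dump time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <utility>
#include <vector>

// Copies `bytes` from the device region into the host shadow.
// Left empty when the region is already host-visible.
using RefreshFnSketch =
    std::function<void(void* host_dst, const void* dev_src, std::size_t bytes)>;

class HostBufferSketch {
public:
    void init(void* region, std::size_t bytes, RefreshFnSketch refresh) {
        region_ = region;
        bytes_ = bytes;
        refresh_ = std::move(refresh);
        if (refresh_) shadow_.assign(bytes, 0);  // shadow mode needs host storage
    }

    // Returns the bytes to serialize at dump time.
    const unsigned char* view_for_dump() {
        if (!refresh_) return static_cast<const unsigned char*>(region_);  // direct mode
        refresh_(shadow_.data(), region_, bytes_);                         // shadow mode
        return shadow_.data();
    }

private:
    void* region_ = nullptr;
    std::size_t bytes_ = 0;
    RefreshFnSketch refresh_;
    std::vector<unsigned char> shadow_;
};
```

In this shape the hot path never touches the callback; the copy cost is paid once, at dump.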
Wiring. kernel_args.scope_stats_data_base + set_platform_scope_stats_base(...) mirror the existing set_platform_*_base hooks. DeviceRunner gains a private init_scope_stats() / dump / finalize in the same shape as init_dep_gen. CallConfig::diagnostics_any() and scene_test.py::diagnostics_on include enable_scope_stats so the output_prefix requirement applies.