
Add: scope_stats collector for per-scope queue-fill peaks #858

Open
doraemonmj wants to merge 3 commits into hw-native-sys:main from doraemonmj:scopememoryuse

doraemonmj commented May 26, 2026

Summary
Add a scope_stats diagnostic that records the per-scope peak occupancy of the ring buffer, heap, tensormap, fanin, and dep_pool, and dumps <output_prefix>/scope_stats.json at the end of each run. It targets capacity-class failures (PTO2_ERROR_FLOW_CONTROL_DEADLOCK, RING_BUFFER_BACKPRESSURE) by telling the user which scope, on which ring, exhausted which resource.

Details

  • Collector (device side). New scope_stats_collector.cpp under src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/. Probes wired into the orchestrator track a PTO2_MAX_SCOPE_DEPTH × PTO2_MAX_RING_DEPTH peak matrix; per-scope source site flows from PTO2ScopeGuard's ctor via scope_stats_set_pending_site. Compile-time gated by PTO2_PROFILING.

  • Runtime flag. New --enable-scope-stats CLI / CallConfig::enable_scope_stats / PROFILING_FLAG_SCOPE_STATS bit. AICPU entry calls set_scope_stats_enabled(...); when off, every probe collapses to one relaxed bool load.

  • Emission off the hot path. Earlier draft called LOG_INFO_V5 synchronously at every scope exit (8469 calls per paged_attention run, ~12 µs apart), which starved the orchestrator and tripped PTO2_ERROR_FLOW_CONTROL_DEADLOCK. The probe now just stores a record into a host-allocated ScopeStatsBuffer (header + ring of records) — zero vsnprintf, zero IO. scope_stats_on_fatal only latches fatal_latched.

  • Output. ScopeStatsHostBuffer (header-only) dumps <output_prefix>/scope_stats.json next to l2_perf_records.json / deps.json, keeping the used/cap string shape (e.g. "task_window": ["0/16384","5/16384",...]).

  • Cross-platform transport. a2a3 maps the buffer host-visible via halHostRegister; a5 keeps a host shadow and refreshes it via rtMemcpy DEVICE_TO_HOST at dump time; sim is single-address-space. Mode is selected by which callbacks the user passes to ScopeStatsHostBuffer::init(...).

  • Wiring. kernel_args.scope_stats_data_base + set_platform_scope_stats_base(...) mirror the existing set_platform_*_base hooks. DeviceRunner gains a private init_scope_stats() / dump / finalize in the same shape as init_dep_gen. CallConfig::diagnostics_any() and scene_test.py::diagnostics_on include enable_scope_stats so the output_prefix requirement applies.
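The "header + ring of records" shape described in the emission bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's actual layout: `SketchRecord`, `SketchBuffer`, `kSketchSlots`, and the `sketch_*` functions are hypothetical names, the field set is assumed, and a single writer per buffer is assumed.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical record layout; the real ScopeStatsRecord fields are not shown here.
struct SketchRecord {
    uint32_t scope_depth;
    uint32_t ring_id;
    uint32_t peak_used;
    uint32_t capacity;
};

constexpr uint32_t kSketchSlots = 4;  // deliberately tiny so wrap-around is easy to see

// Header followed by a fixed ring of records.
struct SketchBuffer {
    std::atomic<uint32_t> write_count{0};    // total records ever appended
    std::atomic<uint32_t> fatal_latched{0};  // set once on a fatal error
    SketchRecord records[kSketchSlots];
};

// Scope-exit probe: one struct store and one counter bump, no formatting, no IO.
// Assumes a single writer (one orchestrator thread per buffer).
inline void sketch_on_end(SketchBuffer& buf, const SketchRecord& rec) {
    uint32_t n = buf.write_count.load(std::memory_order_relaxed);
    buf.records[n % kSketchSlots] = rec;  // on wrap, the oldest record is overwritten
    buf.write_count.store(n + 1, std::memory_order_release);
}

// Fatal path only latches a flag; the reader decides what to do with it.
inline void sketch_on_fatal(SketchBuffer& buf) {
    buf.fatal_latched.store(1, std::memory_order_relaxed);
}

// Reader side: number of slots that currently hold a valid record.
inline uint32_t sketch_valid_records(const SketchBuffer& buf) {
    uint32_t n = buf.write_count.load(std::memory_order_acquire);
    return n < kSketchSlots ? n : kSketchSlots;
}

// Reader side: bounds-checked access to a slot.
inline const SketchRecord& sketch_read(const SketchBuffer& buf, uint32_t slot) {
    assert(slot < sketch_valid_records(buf));
    return buf.records[slot];
}
```

Overwriting the oldest slot on wrap is one possible policy; whether the real collector wraps, drops, or counts overflow is not stated in this description.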


gemini-code-assist Bot left a comment

Code Review

This pull request introduces a decoupled scope-stats collector (PTO2_SCOPE_STATS) to report peak ring-buffer fill (heap bytes and tasks in-flight) and tensormap usage at each user scope's end. The feedback highlights critical thread-safety issues with global variables in multi-threaded orchestrator environments, recommending the use of thread_local storage and passing the orchestrator state pointer to the hooks for lazy binding. Additionally, defensive checks are recommended to prevent potential division-by-zero and null pointer dereferences.
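A rough sketch of what those recommendations could look like in practice, assuming nothing about the actual patch: `SketchOrchestrator`, `SketchCollectorState`, and the `sketch_*` functions are hypothetical stand-ins, and the real hook signatures may differ.

```cpp
#include <cassert>
#include <cstdint>

struct SketchOrchestrator { int id; };  // stand-in for the orchestrator state type

struct SketchCollectorState {
    const SketchOrchestrator* bound = nullptr;
    uint32_t peak_used = 0;
};

// One collector state per thread, so concurrent orchestrators do not share counters.
inline SketchCollectorState& sketch_state() {
    thread_local SketchCollectorState state;
    return state;
}

// The hook receives the orchestrator pointer and binds lazily on first use.
inline void sketch_on_begin(const SketchOrchestrator* orch) {
    if (orch == nullptr) return;  // defensive: nothing to bind to
    SketchCollectorState& s = sketch_state();
    if (s.bound != orch) {
        s.bound = orch;
        s.peak_used = 0;  // fresh counters for a newly bound orchestrator
    }
    assert(s.bound == orch);
}

// Fill percentage with a guard against a zero capacity.
inline uint32_t sketch_fill_percent(uint32_t used, uint32_t cap) {
    if (cap == 0) return 0;
    return static_cast<uint32_t>((static_cast<uint64_t>(used) * 100) / cap);
}
```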

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scope_stats_collector.cpp Outdated
Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/scope_stats_collector.cpp Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scope_stats_collector.cpp Outdated
Comment thread src/a2a3/platform/include/aicpu/scope_stats_collector.h
Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/scope_stats_collector.cpp Outdated
Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/scope_stats_collector.cpp Outdated
Comment thread src/a5/platform/include/aicpu/scope_stats_collector.h
Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp Outdated
Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp Outdated
majin0824 added 2 commits May 26, 2026 22:12
Introduce PTO2_SCOPE_STATS — a compile-time gated, runtime-cheap probe
that captures the peak ring-buffer fill (heap bytes + tasks in-flight)
for all 4 rings within each user scope, plus a tensormap usage snapshot,
emitted as one [ScopeStats] line at scope_end. Independent of
PTO2_PROFILING: answers a sizing question (did this scope come close to
filling the ring?) rather than a timing question.

- Add scope_stats_collector TU (a2a3 default ON, a5 default OFF) with
  begin/end/state_changed/bind probes; collector keeps per-(depth, ring)
  peak tables and emits a single log line at each user scope's end.
- Probe PTO2TaskAllocator at the two heap-state write points (alloc
  commit, heap_tail recompute) via scope_stats_on_allocator_state_changed
  — heap usage and tasks in-flight move in lockstep, so one probe samples
  both.
- Wire scope_stats_bind in PTO2OrchestratorState::init_from_layout and
  on_begin/on_end in begin_scope/end_scope (orchestrator layer, so the
  implicit aicpu_executor root scope still pairs; suppressed from logs).
- Expose neutral const introspection (current_used/pool_capacity on
  PTO2TensorMap; friend-accessible heap_top_/heap_tail_/local_task_id_
  on PTO2TaskAllocator) gated by PTO2_SCOPE_STATS so OFF mode adds
  nothing.
- Document the new gate, log format, and orthogonality in
  docs/profiling_levels.md.
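
The per-(depth, ring) peak tables with begin/end/state_changed probes might be structured along these lines. This is a sketch under assumptions: `PeakTable`, `kMaxDepth`, and `kMaxRings` are illustrative names, the depth limit is an assumed value, and folding a child's peaks into its parent at scope end is one possible design rather than the confirmed one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

constexpr int kMaxDepth = 8;  // assumed value, standing in for the real depth limit
constexpr int kMaxRings = 4;  // the commit message mentions 4 rings

struct PeakTable {
    uint64_t heap_peak[kMaxDepth][kMaxRings] = {};
    uint32_t task_peak[kMaxDepth][kMaxRings] = {};
    int depth = 0;  // number of currently open scopes

    // Opens a scope and clears its row of peaks.
    void on_begin() {
        assert(depth < kMaxDepth);
        for (int r = 0; r < kMaxRings; ++r) {
            heap_peak[depth][r] = 0;
            task_peak[depth][r] = 0;
        }
        ++depth;
    }

    // One probe samples both heap bytes and tasks in flight for the innermost scope.
    void on_state_changed(int ring, uint64_t heap_used, uint32_t tasks_in_flight) {
        assert(ring >= 0 && ring < kMaxRings);
        if (depth == 0) return;  // no scope open, nothing to attribute the sample to
        int d = depth - 1;
        heap_peak[d][ring] = std::max(heap_peak[d][ring], heap_used);
        task_peak[d][ring] = std::max(task_peak[d][ring], tasks_in_flight);
    }

    // Closes the innermost scope, folds its peaks into the parent row,
    // and returns the index of the closed row so the caller can read it.
    int on_end() {
        assert(depth > 0);
        int closed = --depth;
        if (closed > 0) {
            for (int r = 0; r < kMaxRings; ++r) {
                heap_peak[closed - 1][r] = std::max(heap_peak[closed - 1][r], heap_peak[closed][r]);
                task_peak[closed - 1][r] = std::max(task_peak[closed - 1][r], task_peak[closed][r]);
            }
        }
        return closed;
    }
};
```

Updating only the innermost row keeps the hot-path probe at constant cost regardless of nesting depth; the fold at scope end makes sure an outer scope still reports the peaks reached inside its children.
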
Move scope_stats from a pure compile-time gate (PTO2_SCOPE_STATS, default
OFF, rebuild to flip) to the same shape as l2_swimlane/pmu/dep_gen:
compiled in under the PTO2_PROFILING umbrella and toggled per-run by
`--enable-scope-stats` (default OFF). Probes early-return on a relaxed
bool load when the flag is off.

- Add CallConfig::enable_scope_stats; bump wire size to 7*i32+1024;
  thread through nanobind, mailbox _CFG_FMT, run_prepared (host/sim x
  a2a3/a5), DeviceRunner setter + bitmask, and AICPU kernel.cpp dispatch.
- Add PROFILING_FLAG_SCOPE_STATS bit; widen kernel_args comment.
- Drop PTO2_SCOPE_STATS define and rewrite all #if sites to PTO2_PROFILING.
- Switch the [ScopeStats] emit from a raw unified_log_info_v call to
  LOG_INFO_V5 (adds the file:line prefix; emit semantics unchanged).
- Wire --enable-scope-stats through conftest.py + scene_test.py
  (pytest + standalone + child-cmd) without folding into
  diagnostics_any() — scope_stats has no output_prefix dependency.
- Update profiling_levels.md to describe the runtime-flag form.

Known limitation: workloads with hot-loop scopes (e.g. paged_attention)
can hit a flow-control deadlock when --enable-scope-stats is on, because
the per-scope V5 emit IO slows the orchestrator below scheduler drain
rate. Next commit will replace per-scope emit with an aggregated
end-of-run summary so the hot path has no log IO.
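
The runtime gate described in this commit (a flag bit decoded once, then a relaxed bool load in every probe) could look roughly like the sketch below. The bit position, `kSketchFlagScopeStats`, and the `sketch_*` / `g_sketch_*` names are assumptions for illustration; only the overall pattern comes from the commit message.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>

// Bit position is an assumption; the commit names the flag but not its value.
constexpr uint32_t kSketchFlagScopeStats = 1u << 3;

std::atomic<bool> g_sketch_enabled{false};
uint32_t g_sketch_peak = 0;  // stands in for the collector's real bookkeeping

// Called once per run from the entry point, before any probe fires.
inline void sketch_set_enabled_from_flags(uint32_t profiling_flags) {
    g_sketch_enabled.store((profiling_flags & kSketchFlagScopeStats) != 0,
                           std::memory_order_relaxed);
}

// Hot-path probe: when the flag is off, the cost is a single relaxed load.
inline void sketch_probe(uint32_t used, uint32_t capacity) {
    if (!g_sketch_enabled.load(std::memory_order_relaxed)) return;
    assert(used <= capacity);  // occupancy cannot exceed the ring's capacity
    g_sketch_peak = std::max(g_sketch_peak, used);
}
```

A relaxed load is enough here on the assumption that the flag is written before the workload starts and never flipped mid-run.
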
doraemonmj force-pushed the scopememoryuse branch 2 times, most recently from 3f3d0bb to e72ddd5 May 26, 2026 14:57
Per-scope_end LOG_INFO_V5 in scope_stats_on_end starved the
orchestrator (8469 calls @ ~12us each in paged_attention), pushing
ring 2 task window past its spin cap and tripping
PTO2_ERROR_FLOW_CONTROL_DEADLOCK whenever --enable-scope-stats was on.

- Device side: collector now appends ScopeStatsRecord into a
  host-allocated ScopeStatsBuffer (shared header + ring of records);
  on_end / on_fatal do a struct store + counter bump, zero IO
- Host side: ScopeStatsHostBuffer (header-only) owns the buffer and
  dumps <output_prefix>/scope_stats.json next to the other DFX outputs
  (l2_perf_records.json / deps.json), with used/cap formatted as
  strings to match the prior log shape
- Transport: a2a3 maps the region host-visible via halHostRegister;
  a5 keeps a host shadow and refreshes it via rtMemcpy DEVICE_TO_HOST
  at dump time. set_platform_scope_stats_base mirrors the existing
  set_platform_*_base hooks; kernel_args carries the device pointer
- Wiring: DeviceRunner gains init_scope_stats() / dump call after
  stream sync / finalize, mirroring init_dep_gen; CallConfig and
  scene_test pull --enable-scope-stats into diagnostics_any/on so the
  output_prefix contract applies
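
On the host side, formatting used/cap pairs as strings could be done along these lines. This is a minimal sketch: `UsedCap` and `format_used_cap_array` are hypothetical helpers, and only the `"used/cap"` string shape is taken from the PR description; the surrounding JSON structure is not shown in the source and is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct UsedCap {
    uint32_t used;
    uint32_t cap;
};

// Builds one JSON member of the form "key": ["u0/c0","u1/c1",...].
// The key is assumed to need no escaping (plain identifier characters only).
inline std::string format_used_cap_array(const std::string& key,
                                         const std::vector<UsedCap>& values) {
    assert(!key.empty());
    std::string out = "\"" + key + "\": [";
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (i != 0) out += ",";
        out += "\"" + std::to_string(values[i].used) + "/" +
               std::to_string(values[i].cap) + "\"";
    }
    out += "]";
    return out;
}
```

All string work happens at dump time on the host, which is the point of the change: the device-side probes never format anything.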