Add: tiered perf_level (0-4) for --enable-l2-swimlane on a5 #841

Open
indigo1973 wants to merge 1 commit into hw-native-sys:main from indigo1973:swim_0521

indigo1973 commented May 21, 2026

Add: tiered perf_level (0-4) for --enable-l2-swimlane on a5

Port the tiered L2 swimlane perf_level feature from #782 (a2a3-only)
to the a5 platform, so a5 onboard and a5sim now honor the integer
perf_level (0-4) instead of treating --enable-l2-swimlane as a plain
boolean.
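
As a rough illustration of what an integer perf_level buys over a boolean, the level can be modeled as an ordered enum so that each code path is gated by a single comparison. This is a minimal sketch, not the code in this PR: the names `DISABLED` and `AICORE_TIMING`, the numeric value assigned to each tier, and the helpers `clamp_perf_level` and `level_enabled` are all assumptions made for the example. Only the 0-4 range and the threshold names AICPU_TIMING, SCHED_PHASES, and ORCH_PHASES come from the description.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical numbering: the description only states that levels run 0-4
// and names AICPU_TIMING, SCHED_PHASES and ORCH_PHASES as thresholds.
enum class L2PerfLevel : std::uint8_t {
    DISABLED      = 0,  // assumed name: no collection
    AICORE_TIMING = 1,  // assumed name and value
    AICPU_TIMING  = 2,  // assumed value
    SCHED_PHASES  = 3,  // assumed value
    ORCH_PHASES   = 4,  // assumed value
};

// Hypothetical helper: clamp an arbitrary integer into the 0-4 range.
// Whether the real CLI clamps or rejects out-of-range values is not stated.
inline L2PerfLevel clamp_perf_level(int raw) {
    if (raw < 0) return L2PerfLevel::DISABLED;
    if (raw > 4) return L2PerfLevel::ORCH_PHASES;
    return static_cast<L2PerfLevel>(raw);
}

// True when the configured level is at or above the tier a code path needs.
inline bool level_enabled(L2PerfLevel configured, L2PerfLevel required) {
    return static_cast<std::uint8_t>(configured) >=
           static_cast<std::uint8_t>(required);
}
```

With this shape, a scheduler phase record would be written only when `level_enabled(level, L2PerfLevel::SCHED_PHASES)` holds, and lower levels skip the work entirely.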

Mirror the a2a3 wiring on a5:

  • L2PerfDataHeader::l2_perf_level carries the level into shared memory;
    AICPU promotes it in l2_perf_aicpu_init and exposes it via
    get_l2_perf_level().
  • Host-side L2PerfCollector caches the level to gate JSON sections and
    stamps the JSON "version" field directly from perf_level.
  • Apply level gates throughout AICPU code paths: skip dispatch/finish
    timestamps and fanout copies below AICPU_TIMING, scheduler phase
    records below SCHED_PHASES, and orchestrator phase records below
    ORCH_PHASES.
  • Plumb perf_level through DeviceRunner / pto_runtime_c_api on both
    onboard and sim backends.
  • Move l2_perf_aicpu_init out of the dispatch one-time-init block into
    SchedulerContext::init() in scheduler_cold_path.cpp, matching a2a3
    so the orchestrator thread reads a promoted level when caching
    rt->orchestrator.l2_perf_level.
  • Align l2_perf_aicpu_record_phase to a2a3 byte-for-byte: remove the
    end-of-function wmb() and the 3 dropped-path wmbs (all introduced
    by #777, none present in a2a3), and unify the accounting comment
    and brace style. Measured ~1.1 ms reduction in L4 orch_cost on
    paged_attention_unroll Case1.
  • Align l2_perf_aicpu_complete_record with a2a3: add thread_idx
    parameter (routed from both host_build_graph and tensormap_and_ringbuffer
    callers), introduce an AICPU-private s_perf_records_buffers[] cache as
    the records-buffer SoT, rename switch_buffer -> switch_records_buffer
    and rotate after the write so the just-committed record is preserved,
    and surface ring/task_id mismatch as a dedicated LOG_ERROR
    (completion-before-dispatch invariant violation) separate from
    capacity drops. init / flush_buffers maintain s_perf_records_buffers[]
    in lockstep with state->current_buf_ptr so flush deterministically
    halts subsequent commits.
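
The rotate-after-write and flush behavior in the last bullet can be sketched roughly as follows. This is a simplified single-threaded model under stated assumptions, not the a5 implementation: `PerfThread`, `RecordsBuffer`, `Commit`, the two-buffer pool, and the capacity of 2 are hypothetical, and only task_id matching is modeled (the ring check is omitted). It shows three ideas from the description: the buffer is rotated only after the record is written, an ID mismatch is counted separately from capacity drops, and a flush stops later commits.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Record {
    std::uint32_t task_id;
    std::uint64_t finish_ts;
};

struct RecordsBuffer {
    static constexpr std::size_t kCapacity = 2;  // tiny, for illustration only
    std::array<Record, kCapacity> records{};
    std::size_t count = 0;
    bool ready = false;  // true once handed over to the (hypothetical) host reader
};

enum class Commit { OK, DROPPED, MISMATCH };

struct PerfThread {
    std::array<RecordsBuffer, 2> pool{};
    int current = 0;             // cached active buffer index; -1 after flush
    std::size_t drops = 0;       // capacity drops
    std::size_t mismatches = 0;  // completion does not match the dispatched task
};

inline Commit complete_record(PerfThread& t, std::uint32_t dispatched_id,
                              std::uint32_t completed_id, std::uint64_t ts) {
    // Invariant violation is reported separately from capacity drops.
    if (dispatched_id != completed_id) { ++t.mismatches; return Commit::MISMATCH; }
    // No active buffer (flushed or pool exhausted): count a capacity drop.
    if (t.current < 0) { ++t.drops; return Commit::DROPPED; }

    RecordsBuffer& buf = t.pool[t.current];
    buf.records[buf.count++] = Record{completed_id, ts};  // write first

    // Rotate after the write, so the record just committed is in the
    // buffer that gets handed over.
    if (buf.count == RecordsBuffer::kCapacity) {
        buf.ready = true;
        const int next = 1 - t.current;
        t.current = t.pool[next].ready ? -1 : next;
    }
    return Commit::OK;
}

// Flush hands over the active buffer and clears the cached index in the same
// step, so every later commit is dropped.
inline void flush_buffers(PerfThread& t) {
    if (t.current >= 0) t.pool[t.current].ready = true;
    t.current = -1;
}
```

Keeping the cached index and the handed-over flag updated together is what makes the post-flush behavior deterministic in this model.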

Update docs (l2-swimlane-profiling.md, profiling_levels.md,
testing.md) to drop the "a5 is boolean-only" caveat and document the
unified integer interface across a2a3 and a5.

gemini-code-assist (bot) left a comment

Code Review

This pull request implements tiered performance profiling (L2PerfLevel 0–4) for the a5 platform, replacing the binary toggle with granular collection levels that cover AICore timing, AICPU timing, scheduler phases, and orchestrator phases. The changes update the shared-memory handshake, gate timing and fanout collection in the AICPU executor and scheduler, and revise the documentation. One critical point: removing the memory barrier from the phase recording logic creates a memory-ordering risk, and the removed barrier should be replaced with a store barrier to ensure data consistency on architectures with weak memory models.
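
The ordering concern can be illustrated with standard C++ atomics rather than the project's wmb() macro. The sketch below is a swapped-in technique under assumptions: `PhaseBuffer`, `record_phase`, and `sum_durations` are hypothetical names, a single writer is assumed, and the real code uses shared memory between AICPU and host rather than one process. It shows the general pattern: write the record first, then publish the new count with release ordering, so a reader that acquire-loads the count never sees a slot that is not fully written.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct PhaseRecord {
    std::uint64_t start;
    std::uint64_t end;
};

struct PhaseBuffer {
    static constexpr std::uint32_t kCapacity = 64;  // illustrative size
    PhaseRecord records[kCapacity];
    std::atomic<std::uint32_t> count{0};
};

// Single writer: fill the slot, then publish it. The release store keeps the
// record write from being reordered after the count update.
inline bool record_phase(PhaseBuffer& buf, std::uint64_t start, std::uint64_t end) {
    const std::uint32_t idx = buf.count.load(std::memory_order_relaxed);
    if (idx >= PhaseBuffer::kCapacity) return false;  // full: drop the record
    buf.records[idx] = PhaseRecord{start, end};
    buf.count.store(idx + 1, std::memory_order_release);
    return true;
}

// Reader: the acquire load pairs with the release store above, so every
// record below the observed count is fully visible.
inline std::uint64_t sum_durations(const PhaseBuffer& buf) {
    const std::uint32_t n = buf.count.load(std::memory_order_acquire);
    std::uint64_t total = 0;
    for (std::uint32_t i = 0; i < n; ++i) {
        total += buf.records[i].end - buf.records[i].start;
    }
    return total;
}
```

On a weakly ordered CPU, replacing the release store with a relaxed one would allow the count to become visible before the record contents, which is the kind of inconsistency the review comment warns about.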

Comment thread src/a5/platform/src/aicpu/l2_perf_collector_aicpu.cpp
indigo1973 force-pushed the swim_0521 branch 2 times, most recently from 5440386 to 8338184 on May 25, 2026 06:43
Comment thread docs/dfx/l2-swimlane-profiling.md Outdated