Refactor: separate orch phase-stats and swim-lane gating #857

Merged
ChaoWao merged 1 commit into hw-native-sys:main from ChaoWao:refactor/separate-orch-phase-stats-and-swim-lane-g
May 26, 2026

@ChaoWao (Collaborator) commented May 26, 2026

Summary

Split the two profiling concerns in the PTO2_ORCH_PROFILING branch of
CYCLE_COUNT_LAP_RECORD so each is gated independently:

  • Phase-statistics accumulation (acc += _t1 - _t0) — gated only by
    PTO2_ORCH_PROFILING at compile time, unconditional at runtime.
  • Swim-lane GM writes — gated only by orch->l2_perf_level >= L2PerfLevel::ORCH_PHASES at runtime, mirroring the
    PTO2_PROFILING branch and the matching reads in
    l2_perf_collector.cpp / aicpu_executor.cpp.
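
The two gates described above can be sketched as a small simulation. This is a minimal illustration, not the runtime's macro: `ORCH_PROFILING_SKETCH`, `OrchSketch`, `read_counter`, `write_swim_lane`, `lap_record`, and `run_phases` are hypothetical names, the enum values are placeholders, the cycle counter is simulated, and the `t0 = t1` fallback when no write fires is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// All names below are illustrative stand-ins, not the runtime's definitions.
#define ORCH_PROFILING_SKETCH 1  // plays the role of PTO2_ORCH_PROFILING

enum class L2PerfLevel : int { NONE = 0, ORCH_PHASES = 1 };  // placeholder values

struct SwimLaneRecord { uint64_t start; uint64_t end; };

struct OrchSketch {
    L2PerfLevel l2_perf_level = L2PerfLevel::NONE;
    std::vector<SwimLaneRecord> swim_lane;  // stands in for the GM buffer
    uint64_t fake_cycles = 0;               // simulated cycle counter
    uint64_t write_cost = 0;                // simulated cost of one GM store
};

// Simulated counter read (the real code calls get_sys_cnt_aicpu()).
inline uint64_t read_counter(const OrchSketch& o) { return o.fake_cycles; }

// Simulated swim-lane write: stores a record and burns `write_cost` cycles.
inline void write_swim_lane(OrchSketch& o, uint64_t start, uint64_t end) {
    o.swim_lane.push_back({start, end});
    o.fake_cycles += o.write_cost;
}

// One phase boundary: accumulate, then optionally record and re-sample.
inline void lap_record(OrchSketch& o, uint64_t& t0, uint64_t& acc) {
#if ORCH_PROFILING_SKETCH
    const uint64_t t1 = read_counter(o);
    acc += t1 - t0;                                     // compile-time gate only
    if (o.l2_perf_level >= L2PerfLevel::ORCH_PHASES) {  // runtime gate only
        write_swim_lane(o, t0, t1);
        t0 = read_counter(o);  // fresh sample, taken after the write
    } else {
        t0 = t1;  // assumption: no write fired, so reuse the end timestamp
    }
#else
    (void)o; (void)t0; (void)acc;
#endif
}

struct PhaseResult { uint64_t total; std::size_t records; };

// Runs `phases` phases of `work` cycles each and reports the accumulated total.
inline PhaseResult run_phases(L2PerfLevel level, int phases,
                              uint64_t work, uint64_t write_cost) {
    assert(phases >= 0);
    OrchSketch o;
    o.l2_perf_level = level;
    o.write_cost = write_cost;
    uint64_t t0 = read_counter(o);
    uint64_t acc = 0;
    for (int i = 0; i < phases; ++i) {
        o.fake_cycles += work;  // the phase's own work
        lap_record(o, t0, acc);
    }
    return {acc, o.swim_lane.size()};
}
```

In this sketch, `run_phases(L2PerfLevel::NONE, 3, 100, 7)` and `run_phases(L2PerfLevel::ORCH_PHASES, 3, 100, 7)` report the same total; only the number of swim-lane records differs.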

Why

Previously the PTO2_ORCH_PROFILING branch bundled cycle accumulation
and swim-lane writes behind a single compile-time gate, with two
consequences:

  1. Callers could not get phase totals without paying GM-store cost on
    every phase boundary.
  2. The swim-lane write happened between the _t1 capture and the
    _t0 reassignment, so its cost leaked into the next phase's
    accumulator and distorted the totals the flag exists to measure.

After this change the swim-lane write is followed by a fresh
get_sys_cnt_aicpu() for _t0, so its GM-store cost is excluded from
the next phase's accumulator.
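
The leak and its fix can be worked through with a simulated clock. This is a rough sketch under stated assumptions: `FakeClock`, `per_phase_old`, and `per_phase_new` are hypothetical names, the old ordering is assumed to reassign `_t0` from `_t1`, and the write cost is an arbitrary constant rather than a measured GM-store latency.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simulated clock; `write_cost` below models the cycles one GM store takes.
struct FakeClock { uint64_t now = 0; };

// Old ordering (assumed): the write sits between the t1 capture and the
// t0 reassignment, and t0 reuses t1.
inline std::vector<uint64_t> per_phase_old(std::size_t phases, uint64_t work,
                                           uint64_t write_cost) {
    FakeClock clk;
    std::vector<uint64_t> acc(phases, 0);
    uint64_t t0 = clk.now;
    for (std::size_t i = 0; i < phases; ++i) {
        clk.now += work;            // the phase's own work
        const uint64_t t1 = clk.now;
        acc[i] += t1 - t0;
        clk.now += write_cost;      // swim-lane write
        t0 = t1;                    // write cost falls inside the next interval
    }
    return acc;
}

// New ordering: t0 is re-sampled after the write, so the write cost is
// outside every measured interval.
inline std::vector<uint64_t> per_phase_new(std::size_t phases, uint64_t work,
                                           uint64_t write_cost) {
    FakeClock clk;
    std::vector<uint64_t> acc(phases, 0);
    uint64_t t0 = clk.now;
    for (std::size_t i = 0; i < phases; ++i) {
        clk.now += work;
        const uint64_t t1 = clk.now;
        acc[i] += t1 - t0;
        clk.now += write_cost;      // swim-lane write
        t0 = clk.now;               // fresh sample after the write
    }
    return acc;
}
```

With three phases of 100 cycles and a 7-cycle write, the old ordering yields 100, 107, 107 while the new ordering yields 100, 100, 100: every phase after the first absorbed the previous boundary's write.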

Mirrored across a2a3 and a5 orchestrator implementations.

Testing

  • Default sim build (compile-time gate stays off — no behavior change)
  • PTO2_ORCH_PROFILING=1 build on hardware: phase totals stable
    across runs regardless of --enable-l2-swimlane level
  • --enable-l2-swimlane 4 with PTO2_ORCH_PROFILING=1: swim-lane
    records present, totals not inflated by GM-store cost

The PTO2_ORCH_PROFILING branch of CYCLE_COUNT_LAP_RECORD previously
bundled per-phase cycle accumulation and swim-lane GM writes behind a
single compile-time gate. This had two consequences:

- Callers could not get phase totals without paying GM-store cost.
- The swim-lane write happened between the _t1 capture and the _t0
  reassignment, so its cost leaked into the *next* phase's accumulator
  and distorted the totals the flag exists to measure.

Split the two concerns so each is gated independently:

- Phase-statistics accumulation (`acc += _t1 - _t0`) is gated only by
  PTO2_ORCH_PROFILING at compile time, unconditional at runtime.
- Swim-lane recording is gated only by `orch->l2_perf_level >=
  L2PerfLevel::ORCH_PHASES` at runtime, mirroring the PTO2_PROFILING
  branch and the matching reads in l2_perf_collector / aicpu_executor.
- When the swim-lane write fires, _t0 is re-sampled with a fresh
  get_sys_cnt_aicpu() *after* the write so the GM-store cost is
  excluded from the next phase's accumulator.

Mirrored across a2a3 and a5 orchestrator implementations.

@gemini-code-assist (Bot) left a comment

Code Review

This pull request updates the profiling macros in pto_orchestrator.cpp across both the a2a3 and a5 runtimes to conditionally gate swim-lane recording at runtime based on l2_perf_level and re-sample the cycle counter after recording to exclude write overhead. The review feedback points out a fragile implicit dependency in the CYCLE_COUNT_START macro, which relies on a local variable named orch being present in the scope. It is recommended to use this->l2_perf_level instead to make the macro self-contained and prevent potential compilation errors.
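
The implicit dependency the review describes can be illustrated with a pair of toy macros. This is only a sketch: `OrchState`, `SWIM_LANE_ENABLED_IMPLICIT`, and `SWIM_LANE_ENABLED` are hypothetical names, the enum values are placeholders, and passing the object as a macro argument is one possible alternative shown for illustration. It is neither the reviewer's `this->l2_perf_level` suggestion nor necessarily what the merged code does.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical types, for illustration only.
enum class L2PerfLevel : int { NONE = 0, ORCH_PHASES = 1 };  // placeholder values
struct OrchState { L2PerfLevel l2_perf_level; };

// Fragile form: requires a variable named `orch` at every expansion site.
#define SWIM_LANE_ENABLED_IMPLICIT() \
    (orch->l2_perf_level >= L2PerfLevel::ORCH_PHASES)

// Self-contained form: the caller names the object explicitly.
#define SWIM_LANE_ENABLED(state_ptr) \
    ((state_ptr)->l2_perf_level >= L2PerfLevel::ORCH_PHASES)

// Compiles only because the parameter happens to be called `orch`.
inline bool check_with_local_named_orch(const OrchState* orch) {
    assert(orch != nullptr);
    return SWIM_LANE_ENABLED_IMPLICIT();
}

// Works regardless of what the caller's variable is called.
inline bool check_with_any_name(const OrchState* state) {
    assert(state != nullptr);
    return SWIM_LANE_ENABLED(state);
}
```

Renaming the parameter of `check_with_local_named_orch` would break the build at the macro expansion, which is the kind of compilation error the review warns about.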

@ChaoWao ChaoWao merged commit 7035d2a into hw-native-sys:main May 26, 2026
15 checks passed
@ChaoWao ChaoWao deleted the refactor/separate-orch-phase-stats-and-swim-lane-g branch May 26, 2026 02:43