Refactor: separate orch phase-stats and swim-lane gating#857
Merged
ChaoWao merged 1 commit intoMay 26, 2026
Merged
Conversation
The PTO2_ORCH_PROFILING branch of CYCLE_COUNT_LAP_RECORD previously bundled per-phase cycle accumulation and swim-lane GM writes behind a single compile-time gate. This had two consequences: - Callers could not get phase totals without paying GM-store cost. - The swim-lane write happened between the _t1 capture and the _t0 reassignment, so its cost leaked into the *next* phase's accumulator and distorted the totals the flag exists to measure. Split the two concerns so each is gated independently: - Phase-statistics accumulation (`acc += _t1 - _t0`) is gated only by PTO2_ORCH_PROFILING at compile time, unconditional at runtime. - Swim-lane recording is gated only by `orch->l2_perf_level >= L2PerfLevel::ORCH_PHASES` at runtime, mirroring the PTO2_PROFILING branch and the matching reads in l2_perf_collector / aicpu_executor. - When the swim-lane write fires, _t0 is re-sampled with a fresh get_sys_cnt_aicpu() *after* the write so the GM-store cost is excluded from the next phase's accumulator. Mirrored across a2a3 and a5 orchestrator implementations.
There was a problem hiding this comment.
Code Review
This pull request updates the profiling macros in pto_orchestrator.cpp across both the a2a3 and a5 runtimes to conditionally gate swim-lane recording at runtime based on l2_perf_level and re-sample the cycle counter after recording to exclude write overhead. The review feedback points out a fragile implicit dependency in the CYCLE_COUNT_START macro, which relies on a local variable named orch being present in the scope. It is recommended to use this->l2_perf_level instead to make the macro self-contained and prevent potential compilation errors.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Split the two profiling concerns in the
PTO2_ORCH_PROFILINGbranch ofCYCLE_COUNT_LAP_RECORDso each is gated independently:acc += _t1 - _t0) — gated only byPTO2_ORCH_PROFILINGat compile time, unconditional at runtime.orch->l2_perf_level >= L2PerfLevel::ORCH_PHASESat runtime, mirroring thePTO2_PROFILINGbranch and the matching reads inl2_perf_collector.cpp/aicpu_executor.cpp.Why
Previously the
PTO2_ORCH_PROFILINGbranch bundled cycle accumulationand swim-lane writes behind a single compile-time gate, with two
consequences:
every phase boundary.
_t1capture and the_t0reassignment, so its cost leaked into the next phase'saccumulator and distorted the totals the flag exists to measure.
After this change the swim-lane write is followed by a fresh
get_sys_cnt_aicpu()for_t0, so its GM-store cost is excluded fromthe next phase's accumulator.
Mirrored across
a2a3anda5orchestrator implementations.Testing
PTO2_ORCH_PROFILING=1build on hardware: phase totals stableacross runs regardless of
--enable-l2-swimlanelevel--enable-l2-swimlane 4withPTO2_ORCH_PROFILING=1: swim-lanerecords present, totals not inflated by GM-store cost