Add: tiered perf_level (0-4) for --enable-l2-swimlane on a5 #841
Open
indigo1973 wants to merge 1 commit into
Code Review
This pull request implements a tiered performance profiling system (L2PerfLevel 0–4) for the a5 platform, replacing the binary toggle with granular collection levels that cover AICore timing, AICPU timing, scheduler phases, and orchestrator phases. The changes update the shared-memory handshake, gate timing and fanout collection in the AICPU executor and scheduler, and update the documentation. One critical review comment flags a memory-ordering risk in the phase recording logic: the removed memory barrier should be replaced with a store barrier so that record data stays consistent on architectures with a weak memory model.
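The memory-ordering concern can be illustrated with a minimal single-writer sketch. This is not the project's code: `PhaseRecord`, `PhaseBuffer`, `record_phase`, and `visible_records` are hypothetical names, and C++ release/acquire atomics stand in here for whatever store barrier (for example `wmb()`) the runtime actually uses. The point is only that the record payload must be written before the count that publishes it.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical record layout for illustration; not the runtime's real struct.
struct PhaseRecord {
    uint64_t start_ts;
    uint64_t end_ts;
    uint32_t phase_id;
};

constexpr uint32_t kCapacity = 4;  // small on purpose, to show the drop path

struct PhaseBuffer {
    PhaseRecord records[kCapacity];
    std::atomic<uint32_t> count{0};  // number of fully written records
    uint32_t dropped = 0;            // writer-private drop counter
};

// Single-writer append. Returns false when the buffer is full and the
// record is dropped.
inline bool record_phase(PhaseBuffer& buf, uint32_t phase_id,
                         uint64_t start_ts, uint64_t end_ts) {
    assert(end_ts >= start_ts);
    uint32_t idx = buf.count.load(std::memory_order_relaxed);
    if (idx >= kCapacity) {
        ++buf.dropped;  // nothing is published, so no ordering is needed here
        return false;
    }
    buf.records[idx] = PhaseRecord{start_ts, end_ts, phase_id};
    // Release store: the payload write above cannot be reordered after this
    // publication, which is the role a store barrier plays on weak memory models.
    buf.count.store(idx + 1, std::memory_order_release);
    return true;
}

// Reader side: the acquire load pairs with the release store, so every record
// below the returned count is fully visible.
inline uint32_t visible_records(const PhaseBuffer& buf) {
    return buf.count.load(std::memory_order_acquire);
}
```

In this sketch the drop path needs no barrier because nothing is published, while the commit path does. Whether the same reasoning applies to the real `l2_perf_aicpu_record_phase` depends on how the host reads the buffer, which the summary above does not describe.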
Port the tiered L2 swimlane perf_level feature from [hw-native-sys#782](hw-native-sys#782) (a2a3-only) to the a5 platform, so a5 onboard and a5sim now honor the integer perf_level (0-4) instead of treating --enable-l2-swimlane as a plain boolean.

Mirror the a2a3 wiring on a5:

- L2PerfDataHeader::l2_perf_level carries the level into shared memory; AICPU promotes it in l2_perf_aicpu_init and exposes it via get_l2_perf_level().
- Host-side L2PerfCollector caches the level to gate JSON sections and stamps the JSON "version" field directly from perf_level.
- Apply level gates throughout AICPU code paths: skip dispatch/finish timestamps and fanout copies below AICPU_TIMING, scheduler phase records below SCHED_PHASES, and orchestrator phase records below ORCH_PHASES.
- Plumb perf_level through DeviceRunner / pto_runtime_c_api on both onboard and sim backends.
- Move l2_perf_aicpu_init out of the dispatch one-time-init block into SchedulerContext::init() in scheduler_cold_path.cpp, matching a2a3 so the orchestrator thread reads a promoted level when caching rt->orchestrator.l2_perf_level.
- Align l2_perf_aicpu_record_phase to a2a3 byte-for-byte: remove the end-of-function wmb() and the 3 dropped-path wmbs (all introduced by [hw-native-sys#777](hw-native-sys#777), none present in a2a3), and unify the accounting comment + brace style. Measured ~1.1 ms reduction in L4 orch_cost on paged_attention_unroll Case1.
- Align l2_perf_aicpu_complete_record with a2a3: add thread_idx parameter (routed from both host_build_graph and tensormap_and_ringbuffer callers), introduce an AICPU-private s_perf_records_buffers[] cache as the records-buffer SoT, rename switch_buffer -> switch_records_buffer and rotate after the write so the just-committed record is preserved, and surface ring/task_id mismatch as a dedicated LOG_ERROR (completion-before-dispatch invariant violation) separate from capacity drops. init / flush_buffers maintain s_perf_records_buffers[] in lockstep with state->current_buf_ptr so flush deterministically halts subsequent commits.

Update docs (l2-swimlane-profiling.md, profiling_levels.md, testing.md) to drop the "a5 is boolean-only" caveat and document the unified integer interface across a2a3 and a5.
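The level gates described above can be sketched as a simple ordered comparison. The following is an illustrative sketch under stated assumptions, not the code in this PR: the numeric value assigned to each tier, the `DISABLED` and `AICORE_TIMING` names, and the helpers `to_level`, `level_enabled`, `GateCounters`, and `simulate_task` are all hypothetical. Only the gate names `AICPU_TIMING`, `SCHED_PHASES`, and `ORCH_PHASES` and the 0-4 range come from the description.

```cpp
#include <cassert>
#include <cstdint>

// Assumed mapping of the 0-4 range onto tiers; the real enum may differ.
enum class L2PerfLevel : uint32_t {
    DISABLED = 0,       // hypothetical name
    AICORE_TIMING = 1,  // hypothetical name
    AICPU_TIMING = 2,
    SCHED_PHASES = 3,
    ORCH_PHASES = 4,
};

// Convert the raw integer (for example, one read from a shared header)
// into the enum. The range check is an assumption of this sketch.
inline L2PerfLevel to_level(uint32_t raw) {
    assert(raw <= 4 && "perf_level out of range");
    return static_cast<L2PerfLevel>(raw);
}

// A collection step runs only when the current level reaches its tier.
constexpr bool level_enabled(L2PerfLevel current, L2PerfLevel required) {
    return static_cast<uint32_t>(current) >= static_cast<uint32_t>(required);
}

static_assert(level_enabled(L2PerfLevel::ORCH_PHASES, L2PerfLevel::SCHED_PHASES),
              "higher tiers include lower ones");

// Counters standing in for the real timestamp and phase-record writes.
struct GateCounters {
    uint32_t aicpu_timestamps = 0;
    uint32_t sched_phases = 0;
    uint32_t orch_phases = 0;
};

// Simulates one task passing through the gated code paths.
inline void simulate_task(L2PerfLevel level, GateCounters& c) {
    if (level_enabled(level, L2PerfLevel::AICPU_TIMING)) ++c.aicpu_timestamps;
    if (level_enabled(level, L2PerfLevel::SCHED_PHASES)) ++c.sched_phases;
    if (level_enabled(level, L2PerfLevel::ORCH_PHASES)) ++c.orch_phases;
}
```

Treating the tiers as a monotonic scale keeps each gate to a single integer comparison on the hot path, which matters for profiling code that should cost little when a tier is switched off.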