Add: port comm + deferred completion to a5 onboard #823
Open
jvjhfhg wants to merge 3 commits into
Code Review
This pull request implements the HCCL backend for distributed communication on the a5 platform, replacing previous stubs with a functional implementation using ACL IPC primitives. It introduces a symmetric memory pool, updates the DeviceRunner for ACL lifecycle management, and refactors the runtime scheduler to support both counter-based and SDMA event record completion types. Additionally, header guards are modernized to "#pragma once" across several files. Feedback identifies a high-severity issue in the scheduler where the async_ctx.completion_entries array lacks necessary cache invalidation before processing, potentially leading to stale data reads from Global Memory.
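For readers skimming the review, the ACL lifecycle gating in DeviceRunner can be pictured roughly as below. This is a minimal sketch under one reading of the commit message: the ACL teardown pair runs only when `acl_ready_` was set, and the raw runtime reset is kept for callers that never touched ACL. `DeviceRunnerSketch` and `TeardownHooks` are hypothetical names; the hooks stand in for `aclrtResetDevice`, `aclFinalize`, and `rtDeviceReset` without reproducing their real signatures.

```cpp
#include <functional>
#include <utility>

// Hypothetical hooks standing in for the real teardown calls.
struct TeardownHooks {
    std::function<void()> acl_reset_device;  // stands in for aclrtResetDevice
    std::function<void()> acl_finalize;      // stands in for aclFinalize
    std::function<void()> rt_device_reset;   // stands in for rtDeviceReset
};

// Hypothetical, simplified model of the gating; not the actual DeviceRunner.
class DeviceRunnerSketch {
public:
    explicit DeviceRunnerSketch(TeardownHooks hooks) : hooks_(std::move(hooks)) {}

    // Assumption: the real ensure_acl_ready() also initializes ACL and binds
    // the device; here it only records that ACL teardown will be needed.
    void ensure_acl_ready() { acl_ready_ = true; }

    void finalize() {
        if (acl_ready_) {
            // ACL was brought up by this runner, so tear it down in order.
            hooks_.acl_reset_device();
            hooks_.acl_finalize();
            acl_ready_ = false;
        } else {
            // Pure rt-layer caller: keep the raw runtime reset.
            hooks_.rt_device_reset();
        }
    }

private:
    TeardownHooks hooks_;
    bool acl_ready_ = false;
};
```

Injecting the teardown calls as hooks is only a device to make the gating observable on a host without CANN; the real code presumably calls the APIs directly.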
- Mirror comm_hccl.cpp from a2a3 onboard host (HCCL backend with DIY
IPC windows). SDMA workspace overlay is added in the follow-up
commit so this base alone does not depend on PTO_ISA_ROOT or
libnnopbase, and does not invoke aclnnShmemSdmaStarsQuery at
comm_init -- which keeps non-SDMA comm demos unaffected by the
current CANN-9.x SDMA-on-a5 gap.
- Graft ensure_acl_ready / create_comm_stream / destroy_comm_stream
into a5 DeviceRunner and gate aclrtResetDevice + aclFinalize on
acl_ready_ in finalize(); preserve raw rtDeviceReset for pure
rt-layer callers.
- Replace pto_runtime_c_api.cpp comm/ACL stubs with forwarding
implementations; comm_* C ABI now comes from comm_hccl.cpp.
- Upgrade a5 trb deferred-completion runtime from counter-only to
pluggable backend-ops design: CompletionCondition gains
completion_type/addr/retired fields, CompletionBackendOps table
routes COMPLETION_TYPE_{COUNTER,SDMA_EVENT_RECORD}, scheduler
invalidates counter cache lines before polling and retires
satisfied conditions.
- Copy backend/sdma/{kernel,scheduler}.h to a5 (kernel-side, dormant
until a kernel registers a SDMA condition; a5 pto-isa already
exposes SDMA via PTO_NPU_ARCH_A5).
- a5 onboard CMakeLists adds hcomm find_library (FATAL_ERROR on
miss).
- Fix Stride ambiguity in async_notify_demo kernels (pto:: qualifier
to disambiguate from bisheng's enum class Stride).
- Enable a5 in allreduce_distributed and test_platform_comm platform
marks; parametrize the latter via st_platform.
- Convert ported runtime headers to #pragma once on both arches so
aicore_completion_mailbox.h / pto_completion_token.h /
pto_async_{wait,kernel_api}.h / backend/sdma/*.h are now byte-
identical across a2a3 and a5.
Verified: a2a3 onboard, a5 onboard, a5 sim trb runtime builds all
clean. No hardware tests run.
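The pluggable backend-ops design described in the list above can be sketched as a small dispatch table. This is an illustrative sketch only: `CompletionCondition`, `CompletionBackendOps`, and the `COMPLETION_TYPE_*` names come from the commit message, while the field layout, the `expected` field, the satisfaction predicates, and `poll_and_retire` are assumptions made for the example. The invalidate hook is a host-side no-op standing in for the device cache-line invalidation.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

enum CompletionType : uint32_t {
    COMPLETION_TYPE_COUNTER = 0,
    COMPLETION_TYPE_SDMA_EVENT_RECORD = 1,
    COMPLETION_TYPE_COUNT  // hypothetical sentinel used to size the table
};

// Field names follow the commit message; types and `expected` are assumptions.
struct CompletionCondition {
    CompletionType completion_type;
    const volatile uint32_t* addr;  // location the backend polls
    uint32_t expected;              // hypothetical target value for counters
    bool retired;                   // set once the condition is satisfied
};

struct CompletionBackendOps {
    void (*invalidate)(const CompletionCondition&);
    bool (*is_satisfied)(const CompletionCondition&);
};

// Host stand-in: on device this would invalidate the cache line holding *addr.
inline void invalidate_noop(const CompletionCondition&) {}

// Assumed semantics: a counter completes once it reaches `expected`.
inline bool counter_satisfied(const CompletionCondition& c) {
    return *c.addr >= c.expected;
}

// Assumed semantics: an SDMA event record completes once it is non-zero.
inline bool sdma_event_satisfied(const CompletionCondition& c) {
    return *c.addr != 0;
}

inline const CompletionBackendOps& ops_for(CompletionType type) {
    static const std::array<CompletionBackendOps, COMPLETION_TYPE_COUNT> table = {{
        {invalidate_noop, counter_satisfied},
        {invalidate_noop, sdma_event_satisfied},
    }};
    return table[type];
}

// Invalidate, poll, and retire each live condition; returns how many remain.
inline std::size_t poll_and_retire(CompletionCondition* conds, std::size_t n) {
    std::size_t pending = 0;
    for (std::size_t i = 0; i < n; ++i) {
        CompletionCondition& c = conds[i];
        if (c.retired) continue;  // already satisfied on an earlier pass
        const CompletionBackendOps& ops = ops_for(c.completion_type);
        ops.invalidate(c);        // drop stale cached data before reading
        if (ops.is_satisfied(c)) {
            c.retired = true;
        } else {
            ++pending;
        }
    }
    return pending;
}
```

Keeping the invalidate step inside the ops table means each backend decides what has to be refreshed before its own poll, which is the same concern the review raises for reads from Global Memory.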
The orch-only allocate_domain path (hw-native-sys#817) dropped the comm_alloc_windows step, which was the only caller of aclrtDeviceEnablePeerAccess. domain_alloc_via_ipc still skipped enabling P2P on the now-false assumption that the base alloc had done it, so on a5/CANN-9.x the device-pair route is never opened. The IPC VA import still succeeds (host setup and the alloc/release UT pass), but kernel-level cross-chip writes never land, so peer TWAIT/notification waits spin until PTO2_ERROR_SCHEDULER_TIMEOUT (surfaced as ACL 507018).

Add the EnablePeerAccess + PeerAccessStatus poll loop (idempotent, per device-pair) to domain_alloc_via_ipc, mirroring alloc_windows_via_ipc. Applied to both a5 and a2a3 backends (kept byte-identical). a2a3 does not manifest the bug -- it enables the route implicitly -- so its hunk is a defensive safety net.

Verified on a5 (cards 2,3): allreduce_distributed[2] and async_notify_demo go from timeout to PASS; comm UTs still PASS. a2a3 (cards 8,9): allreduce[2] PASS pre- and post-fix.
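The enable-then-poll pattern this commit describes could look roughly like the sketch below. It is a host-side illustration, not the code in domain_alloc_via_ipc: `PeerAccessHooks` and `PeerRouteCache` are hypothetical names, the two hooks stand in for the EnablePeerAccess and PeerAccessStatus calls without reproducing their real signatures, and the poll count and interval are arbitrary.

```cpp
#include <chrono>
#include <cstdint>
#include <functional>
#include <set>
#include <thread>
#include <utility>

// Hypothetical stand-ins for the enable and status-query calls.
struct PeerAccessHooks {
    std::function<bool(int32_t local, int32_t peer)> enable;    // true on success
    std::function<bool(int32_t local, int32_t peer)> is_ready;  // true once the route is open
};

// Remembers which device pairs are already open so repeat calls are no-ops.
class PeerRouteCache {
public:
    bool ensure(int32_t local, int32_t peer, const PeerAccessHooks& hooks,
                int max_polls = 100,
                std::chrono::milliseconds interval = std::chrono::milliseconds(1)) {
        if (local == peer) return true;  // no route needed to self
        const std::pair<int32_t, int32_t> key{local, peer};
        if (enabled_.count(key) != 0) return true;  // idempotent per device pair
        if (!hooks.enable(local, peer)) return false;
        for (int i = 0; i < max_polls; ++i) {
            if (hooks.is_ready(local, peer)) {
                enabled_.insert(key);
                return true;
            }
            std::this_thread::sleep_for(interval);
        }
        return false;  // route never reported ready within the poll budget
    }

private:
    std::set<std::pair<int32_t, int32_t>> enabled_;
};
```

Recording a pair only after the status query reports ready means a timed-out attempt is retried on the next call rather than being treated as done.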
a5 onboard CI exposes only 2 NPUs. The 4-rank allreduce case (device_count(4)) trips the resource-phase pre-flight static check (parallel_scheduler.py), which aborts the entire phase -- taking the 2-rank case (and every other L3 example job) down with it, so a5 onboard got zero L3 example coverage.

Split the >2-rank case into its own function so a5 can be dropped via the function-level platforms mark (the harness deselects by that mark, not by per-param marks). The 2-rank case runs everywhere, including a5 onboard; the 4-rank case stays on a2a3 hardware and both sims.

Verified on a5 (cards 2,3): 4-rank deselected, 2-rank PASS, no abort.