Add: port comm + deferred completion to a5 onboard by jvjhfhg · Pull Request #823 · hw-native-sys/simpler

jvjhfhg · 2026-05-19T12:22:42Z

- Mirror comm_hccl.cpp from a2a3 onboard host (HCCL backend with DIY IPC windows). SDMA workspace overlay is added in the follow-up commit so this base alone does not depend on PTO_ISA_ROOT or libnnopbase, and does not invoke aclnnShmemSdmaStarsQuery at comm_init -- which keeps non-SDMA comm demos unaffected by the current CANN-9.x SDMA-on-a5 gap.
- Graft ensure_acl_ready / create_comm_stream / destroy_comm_stream into a5 DeviceRunner and gate aclrtResetDevice + aclFinalize on acl_ready_ in finalize(); preserve raw rtDeviceReset for pure rt-layer callers.
- Replace pto_runtime_c_api.cpp comm/ACL stubs with forwarding implementations; comm_* C ABI now comes from comm_hccl.cpp.
- Upgrade a5 trb deferred-completion runtime from counter-only to pluggable backend-ops design: CompletionCondition gains completion_type/addr/retired fields, CompletionBackendOps table routes COMPLETION_TYPE_{COUNTER,SDMA_EVENT_RECORD}, scheduler invalidates counter cache lines before polling and retires satisfied conditions.
- Copy backend/sdma/{kernel,scheduler}.h to a5 (kernel-side, dormant until a kernel registers an SDMA condition; a5 pto-isa already exposes SDMA via PTO_NPU_ARCH_A5).
- a5 onboard CMakeLists adds hcomm find_library (FATAL_ERROR on miss).
- Fix Stride ambiguity in async_notify_demo kernels (pto:: qualifier to disambiguate from bisheng's enum class Stride).
- Enable a5 in allreduce_distributed and test_platform_comm platform marks; parametrize the latter via st_platform.
- Convert ported runtime headers to #pragma once on both arches so aicore_completion_mailbox.h / pto_completion_token.h / pto_async_{wait,kernel_api}.h / backend/sdma/*.h are now byte-identical across a2a3 and a5.

Verified: a2a3 onboard, a5 onboard, a5 sim trb runtime builds all clean. No hardware tests run.
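
For readers unfamiliar with the pluggable backend-ops design described in the commit message, here is a minimal host-side sketch of how a per-type ops table could route polling and retirement. The type and enum names follow the commit message; the field layout, the `expected` field, the function signatures, and the no-op `invalidate_noop` hook (a stand-in for a real cache-line invalidation) are assumptions for illustration, not the code in this PR.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Enum values follow the names in the commit message; numbering is assumed.
enum CompletionType : uint32_t {
    COMPLETION_TYPE_COUNTER = 0,
    COMPLETION_TYPE_SDMA_EVENT_RECORD = 1,
    COMPLETION_TYPE_COUNT
};

struct CompletionCondition {
    CompletionType completion_type;  // selects the backend in the ops table
    const volatile uint64_t* addr;   // memory location the scheduler polls
    uint64_t expected;               // hypothetical: target value for counters
    bool retired;                    // set once the condition was satisfied
};

struct CompletionBackendOps {
    void (*invalidate)(const volatile void* addr);          // drop stale cache data
    bool (*is_satisfied)(const CompletionCondition& cond);  // backend-specific check
};

// Stand-in for a device cache invalidation; does nothing on the host.
inline void invalidate_noop(const volatile void*) {}

// Assumed semantics: a counter is done once it reaches the expected value.
inline bool counter_satisfied(const CompletionCondition& c) {
    return *c.addr >= c.expected;
}

// Assumed semantics: an SDMA event record is done once it becomes non-zero.
inline bool sdma_event_satisfied(const CompletionCondition& c) {
    return *c.addr != 0;
}

// One entry per completion type, indexed by the enum value.
static const std::array<CompletionBackendOps, COMPLETION_TYPE_COUNT> kBackendOps = {{
    {invalidate_noop, counter_satisfied},
    {invalidate_noop, sdma_event_satisfied},
}};

// Polls every non-retired condition and returns how many were newly retired.
inline std::size_t poll_and_retire(CompletionCondition* conds, std::size_t n) {
    std::size_t newly_retired = 0;
    for (std::size_t i = 0; i < n; ++i) {
        CompletionCondition& c = conds[i];
        if (c.retired) continue;  // already satisfied on an earlier pass
        const CompletionBackendOps& ops = kBackendOps[c.completion_type];
        ops.invalidate(c.addr);   // invalidate before reading, as the commit describes
        if (ops.is_satisfied(c)) {
            c.retired = true;
            ++newly_retired;
        }
    }
    return newly_retired;
}
```

Keeping the dispatch in a table means a new completion type only needs a new enum value and one more table row; the polling loop itself stays unchanged.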

gemini-code-assist

Code Review

This pull request implements the HCCL backend for distributed communication on the a5 platform, replacing previous stubs with a functional implementation using ACL IPC primitives. It introduces a symmetric memory pool, updates the DeviceRunner for ACL lifecycle management, and refactors the runtime scheduler to support both counter-based and SDMA event record completion types. Additionally, header guards are modernized to "#pragma once" across several files. Feedback identifies a high-severity issue in the scheduler where the async_ctx.completion_entries array lacks necessary cache invalidation before processing, potentially leading to stale data reads from Global Memory.
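
The ACL lifecycle handling mentioned in the summary can be illustrated with a small sketch. It assumes one reading of the commit message: `finalize()` runs the ACL teardown only if `ensure_acl_ready()` was called, and otherwise falls back to the raw runtime reset. `FakeDriver` and `DeviceRunnerSketch` are hypothetical stand-ins that only count calls; no real ACL or runtime API is invoked.

```cpp
#include <cassert>

// Hypothetical call counters standing in for aclInit, aclrtResetDevice,
// aclFinalize and rtDeviceReset. Nothing here touches a real device.
struct FakeDriver {
    int acl_init = 0;
    int acl_reset = 0;
    int acl_finalize = 0;
    int rt_reset = 0;
};

class DeviceRunnerSketch {
public:
    explicit DeviceRunnerSketch(FakeDriver& drv) : drv_(drv) {}

    // Lazily brings ACL up; repeated calls are no-ops.
    void ensure_acl_ready() {
        if (acl_ready_) return;
        ++drv_.acl_init;
        acl_ready_ = true;
    }

    // ACL teardown is gated on acl_ready_; pure rt-layer callers that never
    // asked for ACL keep the raw reset path (assumed interpretation).
    void finalize() {
        if (acl_ready_) {
            ++drv_.acl_reset;
            ++drv_.acl_finalize;
            acl_ready_ = false;
        } else {
            ++drv_.rt_reset;
        }
    }

    bool acl_ready() const { return acl_ready_; }

private:
    FakeDriver& drv_;
    bool acl_ready_ = false;
};
```

The point of the gate is that a caller which never initialized ACL does not trigger `aclFinalize`, while a caller which did gets a matching teardown exactly once.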


The orch-only allocate_domain path (hw-native-sys#817) dropped the comm_alloc_windows step, which was the only caller of aclrtDeviceEnablePeerAccess. domain_alloc_via_ipc still skipped enabling P2P on the now-false assumption the base alloc did it, so on a5/CANN-9.x the device-pair route is never opened. The IPC VA import still succeeds (host setup + the alloc/release UT pass), but kernel-level cross-chip writes never land, so peer TWAIT/notification waits spin until PTO2_ERROR_SCHEDULER_TIMEOUT (surfaced as ACL 507018).

Add the EnablePeerAccess + PeerAccessStatus poll loop (idempotent, per device-pair) to domain_alloc_via_ipc, mirroring alloc_windows_via_ipc. Applied to both a5 and a2a3 backends (kept byte-identical). a2a3 does not manifest the bug -- it enables the route implicitly -- so its hunk is a defensive safety net.

Verified on a5 (cards 2,3): allreduce_distributed[2] and async_notify_demo go from timeout to PASS; comm UTs still PASS. a2a3 (cards 8,9): allreduce[2] PASS pre- and post-fix.
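
The enable-then-poll pattern described in this commit can be sketched as follows. This is an illustration under stated assumptions: `PeerAccessApi` and `PeerRouteCache` are hypothetical names, the two callbacks stand in for the enable and status-query calls (the real ACL signatures are not reproduced here), and a production loop would likely sleep or back off between polls.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <utility>

// Hypothetical wrapper around the two driver calls; injected so the sketch
// runs without hardware.
struct PeerAccessApi {
    std::function<int(int local, int peer)> enable;       // returns 0 on success
    std::function<bool(int local, int peer)> is_enabled;  // status query
};

class PeerRouteCache {
public:
    // Opens the local->peer route at most once per device pair and polls the
    // status up to max_polls times. Returns true once the route is reported open.
    bool ensure_peer_access(const PeerAccessApi& api, int local, int peer,
                            int max_polls) {
        if (local == peer) return true;  // no route needed to self
        const std::pair<int, int> key(local, peer);
        if (opened_.count(key) != 0) return true;  // idempotent: already open
        if (api.enable(local, peer) != 0) return false;
        for (int i = 0; i < max_polls; ++i) {
            if (api.is_enabled(local, peer)) {
                opened_.insert(key);
                return true;
            }
        }
        return false;  // route never reported ready within the poll budget
    }

private:
    std::set<std::pair<int, int>> opened_;  // device pairs already opened
};
```

Tracking opened pairs makes repeated allocation calls cheap and safe, which matches the "idempotent, per device-pair" wording above.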

a5 onboard CI exposes only 2 NPUs. The 4-rank allreduce case (device_count(4)) trips the resource-phase pre-flight static check (parallel_scheduler.py), which aborts the entire phase -- taking the 2-rank case (and every other L3 example job) down with it, so a5 onboard got zero L3 example coverage.

Split the >2-rank case into its own function so a5 can be dropped via the function-level platforms mark (the harness deselects by that mark, not by per-param marks). The 2-rank case runs everywhere, including a5 onboard; the 4-rank case stays on a2a3 hardware and both sims.

Verified on a5 (cards 2,3): 4-rank deselected, 2-rank PASS, no abort.

gemini-code-assist Bot reviewed May 19, 2026

Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_async_wait.h

jvjhfhg force-pushed the feat/comm-a5 branch 11 times, most recently from f95d629 to e1acbb6 Compare May 21, 2026 06:50

jvjhfhg added 3 commits May 21, 2026 17:00

jvjhfhg force-pushed the feat/comm-a5 branch from 2e905bb to 2b35678 Compare May 21, 2026 09:01

jvjhfhg wants to merge 3 commits into hw-native-sys:main from jvjhfhg:feat/comm-a5
