Feat: AICPU launch via dispatcher upload + Mode B per-task #537
Open
puddingfjz wants to merge 1 commit into
Code Review
This pull request introduces an AicpuLoader abstraction to support both legacy and new CANN 7.0+ interfaces for launching AICPU kernels across the a2a3 and a5 platforms. The implementation includes build system updates, runtime JSON descriptor generation, and integration into the DeviceRunner. Feedback focuses on improving build portability by avoiding hardcoded architecture paths and enhancing the robustness of manual JSON construction. Additionally, the removal of a default parameter in the a2a3 platform's header is identified as a breaking change that violates cross-platform consistency. Suggestions were also made to reduce coupling in the kernel name mapping.
puddingfjz
added a commit
to puddingfjz/simpler
that referenced
this pull request
Apr 13, 2026
- Revert hardcoded aarch64-linux path in CMakeLists.txt, use portable paths
- Restore default parameter for launch_aicpu_num in device_runner.h
- Add documentation explaining JSON construction and name_mapping design
The JSON construction uses manual string concatenation without a
library. This is safe because kernel names are controlled strings
without special characters, matching pypto's approach for similar
AICPU op descriptors. The name_mapping from opType to functionName is
specific to the Ascend tile framework kernels and is unlikely to change.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
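The safety argument in this commit message rests on kernel names never containing characters that would need JSON escaping. A minimal sketch of how that assumption could be checked before concatenating, not the PR's actual code: the helper names `is_safe_kernel_name` and `make_descriptor_entry`, the allowed character set, and the flat two-field layout are all assumptions for illustration. Only the `opType` and `functionName` key names come from the message above.

```cpp
#include <stdexcept>
#include <string>

// Returns true if `name` is non-empty and uses only characters that never
// need JSON escaping: ASCII letters, digits, '_', '.', and '-'.
// (Assumed policy; the real naming rules are not shown in this thread.)
bool is_safe_kernel_name(const std::string& name) {
    if (name.empty()) return false;
    for (char c : name) {
        const bool ok = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                        (c >= '0' && c <= '9') || c == '_' || c == '.' || c == '-';
        if (!ok) return false;
    }
    return true;
}

// Builds one JSON object by plain string concatenation.
// Hypothetical layout: the real descriptor schema may have more fields.
std::string make_descriptor_entry(const std::string& op_type,
                                  const std::string& function_name) {
    if (!is_safe_kernel_name(op_type) || !is_safe_kernel_name(function_name)) {
        throw std::invalid_argument("kernel name needs JSON escaping");
    }
    return "{\"opType\":\"" + op_type + "\",\"functionName\":\"" +
           function_name + "\"}";
}
```

Rejecting unexpected characters up front keeps the concatenation approach honest: if a name ever stops being a "controlled string", the failure is loud instead of producing malformed JSON.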
5c35216 to
f30e69c
Compare
d4e918c to
3567417
Compare
ChaoWao
added a commit
to puddingfjz/simpler
that referenced
this pull request
May 21, 2026
…cher
Migrates host-side AICPU launches from Mode A
(rtAicpuKernelLaunchExWithArgs) to Mode B (rtsBinaryLoadFromFile +
rtsFuncGetByName + rtsLaunchCpuKernel), and removes the tar.gz / sudo
pre-deployment step for the AICPU SO.
Bootstrap (one Mode A call per DeviceRunner)
============================================
Host bundles dispatcher SO bytes + runtime SO bytes into a single
rtAicpuKernelLaunchExWithArgs targeting CANN's preinstalled
libaicpu_extend_kernels.so. libaicpu_extend_kernels writes the
dispatcher to its own private path, dlopens it, dlsym's the three CANN
contract symbols (Static + DynInit + Dyn) and invokes our DynInit. Our
dispatcher Init reads the runtime SO bytes from the extended DeviceArgs
(new fields inner_so_bin/inner_so_len at offsets 120/128, which
libaicpu_extend_kernels ignores) and writes them to
/usr/lib64/aicpu_kernels/0/aicpu_kernels_device/simpler_inner_<fp>.so
using sched-thread (HwHiAiUser) write permission. The dispatcher SO
itself is never persisted to disk: only its transient
libaicpu_extend_kernels dlopen.
Per-task launches (direct Mode B, no dispatcher hop)
====================================================
Host computes the same FNV-1a fingerprint locally, generates a JSON
descriptor with kernelSo=simpler_inner_<fp>.so and
functionName=simpler_aicpu_init / simpler_aicpu_exec (the runtime SO's
actual exports), and calls rtsBinaryLoadFromFile + rtsFuncGetByName.
LaunchBuiltInOp invokes the runtime SO's symbols directly via
rtsLaunchCpuKernel: there's no per-task dispatcher hop and the
dispatcher SO is never referenced again.
Multi-runtime in one host process: each DeviceRunner bootstraps with
the same dispatcher bytes + its own runtime SO bytes. The dispatcher
upload path hits libaicpu_extend_kernels' firstCreatSo_ one-shot latch
only once (subsequent calls reuse the cached dlopen, same content
fingerprint); each runtime gets its own JSON registration with a unique
opType (symbol_name + fingerprint suffix) so CANN's global op registry
doesn't collide.
Reference: PR hw-native-sys#537.
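The commit message above has host and device each compute the same FNV-1a fingerprint and derive the `simpler_inner_<fp>.so` basename from it. A minimal sketch of that step, under stated assumptions: the message does not say which FNV-1a width or hex formatting is used, so the 64-bit variant and the 16-digit lowercase hex encoding here are guesses, and the function names `fnv1a64` and `inner_so_basename` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Standard 64-bit FNV-1a over a byte buffer.
std::uint64_t fnv1a64(const unsigned char* data, std::size_t len) {
    std::uint64_t hash = 0xcbf29ce484222325ULL;  // FNV-1a 64-bit offset basis
    for (std::size_t i = 0; i < len; ++i) {
        hash ^= data[i];           // XOR the byte in first (the "1a" ordering)
        hash *= 0x100000001b3ULL;  // then multiply by the 64-bit FNV prime
    }
    return hash;
}

// Embeds the fingerprint in the runtime SO basename.
// Assumed encoding: 16 lowercase hex digits, zero-padded.
std::string inner_so_basename(const unsigned char* data, std::size_t len) {
    char hex[17];
    std::snprintf(hex, sizeof(hex), "%016llx",
                  static_cast<unsigned long long>(fnv1a64(data, len)));
    return std::string("simpler_inner_") + hex + ".so";
}
```

Because the name depends only on the SO's bytes, both sides can derive it independently with no extra handshake, provided they hash exactly the same byte range.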
3567417 to
90e71ed
Compare
90e71ed to
7b9e506
Compare
7b9e506 to
b4dd9b1
Compare
b4dd9b1 to
bb65c0c
Compare
bb65c0c to
f173a99
Compare
ChaoWao
added a commit
to puddingfjz/simpler
that referenced
this pull request
May 22, 2026
Two-phase architecture for loading AICPU kernels on CANN 9.0+ without
tar.gz / sudo pre-deployment, and without per-task indirection through
the dispatcher SO.
Bootstrap (per-DeviceRunner, idempotent across instances in a process)
======================================================================
Host bundles dispatcher SO bytes + runtime SO bytes into a single
rtAicpuKernelLaunchExWithArgs (kernel_type = KERNEL_TYPE_AICPU_KFC)
targeting CANN's preinstalled libaicpu_extend_kernels.so.
libaicpu_extend_kernels dlopens our dispatcher and invokes its Init;
the dispatcher reads the runtime SO bytes from extended DeviceArgs
(inner_so_bin/inner_so_len at offsets 120/128, which
libaicpu_extend_kernels ignores) and writes them to
/usr/lib64/aicpu_kernels/0/aicpu_kernels_device/simpler_inner_<fp>.so
using sched-thread (HwHiAiUser) write permission. The dispatcher SO
itself is never written to the preinstall directory.
The runtime SO basename embeds an FNV-1a content fingerprint, so two
host processes uploading the same runtime SO produce the same file
(idempotent writes via atomic tmp+rename, no truncation window
visible to concurrent aicpu_scheduler readers). A process-level
fingerprint cache in LoadAicpuOp skips redundant
libaicpu_extend_kernels invocations within a single host process —
each runtime is bootstrapped at most once per process.
Per-task launches (direct Mode A type 2, no dispatcher hop)
===========================================================
Host calls rtAicpuKernelLaunchExWithArgs with kernel_type =
KERNEL_TYPE_AICPU, so_name = "simpler_inner_<fp>.so",
kernel_name = "simpler_aicpu_init" / "simpler_aicpu_exec". The main
aicpu_scheduler dlopens the preinstall file on first invocation and
caches the handle; subsequent launches reuse it. No JSON descriptors,
no rtsBinaryLoadFromFile / rtsFuncGetByName lifecycle, no global op
registry, no per-launch handle bookkeeping.
Cleanup
=======
- Removes BUILD_WITH_NEW_CANN CMake option and all ifdef branches.
- Deletes the legacy AicpuLoader stub (src/{a2a3,a5}/platform/onboard/
host/aicpu_loader.{cpp,h}) — its only role was the OFF-path
fallback and nothing tested that path.
- Skips so_info_ allocation on the new path (the runtime SO no longer
reads device_args.aicpu_so_bin / aicpu_so_len). This saves roughly one
inner SO's worth of device memory per DeviceRunner; previously that
memory accumulated across many ChipWorker/DeviceRunner instances and
triggered AICORE OOM in long test sessions.
- Widens the aicpu_op_timeout regression test to accept the new error
codes surfaced by Mode A type 2 (the dispatcher / main aicpu_scheduler
path can race the STARS watchdog and return 507018/507000 before the
AICore stream sync emits 507046).
Reference: PR hw-native-sys#537.
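The "atomic tmp+rename" write described in this commit message can be sketched as follows. This is an illustration of the general POSIX pattern, not the dispatcher's code: the function name `write_file_atomically` and the `.tmp.<pid>` suffix are assumptions, and the sketch omits the `fsync` calls a crash-durable version would need.

```cpp
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>
#include <unistd.h>

// Writes `bytes` to a temporary sibling of `final_path`, then renames it into
// place. On POSIX, rename() replaces the target atomically, so a concurrent
// reader sees either the old file or the complete new one, never a partial one.
bool write_file_atomically(const std::string& final_path,
                           const std::vector<unsigned char>& bytes) {
    // Hypothetical temp-name scheme: same directory, suffixed with our pid.
    const std::string tmp_path =
        final_path + ".tmp." + std::to_string(static_cast<long long>(::getpid()));
    {
        std::ofstream out(tmp_path, std::ios::binary | std::ios::trunc);
        if (!out) return false;
        out.write(reinterpret_cast<const char*>(bytes.data()),
                  static_cast<std::streamsize>(bytes.size()));
        out.flush();
        if (!out) {
            std::remove(tmp_path.c_str());
            return false;
        }
    }  // stream closed here, before the rename
    if (std::rename(tmp_path.c_str(), final_path.c_str()) != 0) {
        std::remove(tmp_path.c_str());
        return false;
    }
    return true;
}
```

The temporary file has to live on the same filesystem as the target for the rename to be atomic, which is why the sketch places it next to the final path.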
f173a99 to
473d8f6
Compare
473d8f6 to
2c220d3
Compare
ChaoWao
previously approved these changes
May 22, 2026
2c220d3 to
d2e91bf
Compare
d2e91bf to
13abbd9
Compare
13abbd9 to
123ca62
Compare
ChaoWao
added a commit
to puddingfjz/simpler
that referenced
this pull request
May 22, 2026
Two-phase architecture for loading AICPU kernels on CANN 9.0+ without
tar.gz / sudo pre-deployment.
Bootstrap (per-DeviceRunner, idempotent across instances in a process)
======================================================================
Host bundles dispatcher SO bytes + runtime SO bytes into a single
rtAicpuKernelLaunchExWithArgs (kernel_type = KERNEL_TYPE_AICPU_KFC)
targeting CANN's preinstalled libaicpu_extend_kernels.so.
libaicpu_extend_kernels dlopens our dispatcher and invokes its Init;
the dispatcher reads the runtime SO bytes from extended DeviceArgs
(inner_so_bin/inner_so_len at offsets 120/128, which
libaicpu_extend_kernels ignores) and writes them to
/usr/lib64/aicpu_kernels/0/aicpu_kernels_device/simpler_inner_<fp>.so
using sched-thread (HwHiAiUser) write permission. The dispatcher SO
itself is never written to the preinstall directory; it is only loaded
transiently through libaicpu_extend_kernels' dlopen.
The runtime SO basename embeds an FNV-1a content fingerprint. Writes
go via atomic tmp+rename inside the dispatcher — no truncation window
visible to concurrent aicpu_scheduler readers. A process-level
fingerprint cache in LoadAicpuOp skips redundant
libaicpu_extend_kernels invocations within a single host process —
each runtime is bootstrapped at most once per process.
Per-task launches (Mode B, no dispatcher hop)
=============================================
LoadAicpuOp.Init() JSON-registers the runtime SO via
rtsBinaryLoadFromFile (cpuKernelMode=0, kernelSo points at the
preinstall basename), then resolves simpler_aicpu_init and
simpler_aicpu_exec to rtFuncHandles via rtsFuncGetByName. JSON is
per-process (/tmp/simpler_inner_<fp>_<pid>.json) so concurrent
multi-chip / multi-worker tests don't race on a shared file. opType
is suffixed with the runtime SO's fingerprint so multiple LoadAicpuOp
instances in the same process register non-colliding entries even
though the underlying symbol names are identical.
Per-task launches call rtsLaunchCpuKernel on the cached rtFuncHandles
— no per-call string marshalling, no global op registry lookups, no
dispatcher hop.
Cleanup
=======
- Removes BUILD_WITH_NEW_CANN CMake option and all ifdef branches.
Mode B requires CANN 7.0+, which all supported targets ship.
- Deletes the legacy AicpuLoader stub
(src/{a2a3,a5}/platform/onboard/host/aicpu_loader.{cpp,h}).
- Widens the aicpu_op_timeout regression test to accept the
Mode B-surfaced error codes in addition to the original 507046.
Reference: PR hw-native-sys#537.
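The two naming rules in this commit message (a per-process JSON path and a fingerprint-suffixed opType) can be sketched as small string helpers. The path pattern comes from the message; the function names `descriptor_path` and `unique_op_type` and the `_` separator in the opType are assumptions, since the message only says the opType is "suffixed with" the fingerprint.

```cpp
#include <string>
#include <sys/types.h>

// Per-process JSON descriptor path, following the
// /tmp/simpler_inner_<fp>_<pid>.json pattern quoted in the commit message.
std::string descriptor_path(const std::string& fp_hex, pid_t pid) {
    return "/tmp/simpler_inner_" + fp_hex + "_" +
           std::to_string(static_cast<long long>(pid)) + ".json";
}

// opType made unique per runtime SO by appending its fingerprint.
// The "_" separator is an assumption for illustration.
std::string unique_op_type(const std::string& symbol_name,
                           const std::string& fp_hex) {
    return symbol_name + "_" + fp_hex;
}
```

With this scheme, two runtime SOs that export the same symbol names still register distinct opType strings, and two processes loading the same SO write to different JSON files.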
123ca62 to
f6defdb
Compare
f6defdb to
7db123c
Compare
ChaoWao
added a commit
to puddingfjz/simpler
that referenced
this pull request
May 22, 2026
7db123c to
c2f96dd
Compare
hw-native-sys-bot
pushed a commit
to puddingfjz/simpler
that referenced
this pull request
May 25, 2026
521b4e1 to
832de93
Compare
Summary
Two-phase architecture for loading AICPU kernels on CANN 9.0+ without tar.gz / sudo pre-deployment.

Bootstrap (per-DeviceRunner, idempotent across instances in a process)
Host bundles dispatcher SO bytes + runtime SO bytes into a single rtAicpuKernelLaunchExWithArgs (kernel_type = KERNEL_TYPE_AICPU_KFC) targeting CANN's preinstalled libaicpu_extend_kernels.so. libaicpu_extend_kernels dlopens our dispatcher and invokes its Init; the dispatcher reads the runtime SO bytes from extended DeviceArgs (new inner_so_bin/inner_so_len fields at offsets 120/128, which libaicpu_extend_kernels ignores) and writes them to:
…
using sched-thread (HwHiAiUser) write permission. The dispatcher SO itself never lands at preinstall.
The runtime SO basename embeds an FNV-1a content fingerprint. Writes go via atomic tmp+rename inside the dispatcher, so no truncation window is visible to concurrent aicpu_scheduler readers. A process-level fingerprint cache in LoadAicpuOp skips redundant libaicpu_extend_kernels invocations within a single host process: each runtime is bootstrapped at most once per process.

Per-task launches (Mode B, no dispatcher hop)
LoadAicpuOp::Init() JSON-registers the runtime SO via rtsBinaryLoadFromFile (cpuKernelMode=0, kernelSo points at the preinstall basename), then resolves simpler_aicpu_init and simpler_aicpu_exec to rtFuncHandles via rtsFuncGetByName. JSON is per-process (/tmp/simpler_inner_<fp>_<pid>.json) so concurrent multi-chip / multi-worker tests don't race on a shared file. opType is suffixed with the runtime SO's fingerprint so multiple LoadAicpuOp instances in the same process register non-colliding entries even though the underlying symbol names are identical.
Per-task launches call rtsLaunchCpuKernel on the cached rtFuncHandles: no per-call string marshalling, no global op registry lookups, no dispatcher hop.

Cleanup
- Removes BUILD_WITH_NEW_CANN CMake option and all ifdef branches. Mode B requires CANN 7.0+, which all supported targets ship.
- Deletes the legacy AicpuLoader stub (src/{a2a3,a5}/platform/onboard/host/aicpu_loader.{cpp,h}).
- Widens the aicpu_op_timeout regression test to accept the Mode B-surfaced error codes in addition to the original 507046.

Fixes #356.