Refactor: host-build trb runtime arena (a2a3 only) #846

Open
poursoul wants to merge 3 commits into
hw-native-sys:main from
poursoul:refactor-defer-slot-state-bind-to-prepare-task

Conversation

@poursoul
Collaborator

@poursoul poursoul commented May 22, 2026

Summary

Two-commit refactor on the trb runtime, both authored by @poursoul. The PR
bundles them because the second is built on top of the first; squash-merge
gives a single coherent landing.

  1. fe5d662 — Refactor: defer slot_state payload/task bind to orch::prepare_task

    • Lifts the O(task_window_size) slot bind loop out of RingSchedState::init
      into per-submit prepare_task, making startup independent of window size.
    • Mirrored across both a2a3 and a5 trb runtimes (touches
      pto_orchestrator.cpp, pto_runtime2_types.h, pto_scheduler.cpp on
      each arch).
  2. d33daa5 — Refactor: host-build trb runtime arena, AICPU does only wire + SM reset (a2a3 only)

    • Moves the entire trb runtime arena layout + data init from AICPU's
      runtime_create_from_sm onto the host. AICPU boot becomes a cheap
      arena-internal pointer wire pass + the SM reset that can't run off-device.
    • Pooled prebuilt image lives in the same DeviceRunner static_arena as
      gm_heap and SM (one rtMalloc per worker), reused across all subsequent
      runs via a single rtMemcpy.
    • Scope is intentionally a2a3-only: src/a5/** is untouched in this
      commit (a5 keeps its current AICPU-side runtime_create_from_sm path).
      The plan is to mirror to a5 in a follow-up PR after this lands and
      stabilizes on a2a3 hardware/sim.
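
The bind split in commit 1 can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `SlotState`, `Ring`, `TaskPayload`, `TaskDescriptor`, and the modulo slot selection are hypothetical stand-ins, and only the `bind_ring` / `bind_buffers` / `prepare_task` names come from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins; the real PTO2 structs carry many more fields.
struct TaskPayload { std::uint64_t args[4]; };
struct TaskDescriptor { std::int32_t task_id; };

struct SlotState {
    std::int32_t ring_id = -1;
    TaskPayload* payload = nullptr;
    TaskDescriptor* task = nullptr;

    // Init-time half: the ring id is a per-ring constant.
    void bind_ring(std::int32_t ring) { ring_id = ring; }

    // Per-submit half: the pointers are per-slot constants, rewritten on
    // every submit while the slot is being touched anyway.
    void bind_buffers(TaskPayload* p, TaskDescriptor* t) {
        payload = p;
        task = t;
    }
};

struct Ring {
    std::vector<SlotState> slots;
    std::vector<TaskPayload> payloads;
    std::vector<TaskDescriptor> descriptors;

    Ring(std::int32_t ring_id, std::size_t window)
        : slots(window), payloads(window), descriptors(window) {
        // Init writes only the ring id; no buffer pointers are set here.
        for (SlotState& s : slots) s.bind_ring(ring_id);
    }

    // Sketch of the submit path: bind the slot's buffers, then fill the
    // descriptor. Slot selection by modulo is an assumption of this sketch.
    SlotState& prepare_task(std::int32_t task_id) {
        assert(task_id >= 0);
        const std::size_t slot = static_cast<std::size_t>(task_id) % slots.size();
        SlotState& s = slots[slot];
        s.bind_buffers(&payloads[slot], &descriptors[slot]);
        s.task->task_id = task_id;
        return s;
    }
};
```

In this sketch a slot's `payload` and `task` stay null until the first `prepare_task` call that lands on it, which is the behavior change a reviewer would want to keep in mind for any code that reads slot state before submit.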

Mechanism (commit 2 / d33daa5)

  • DeviceArena::attach() wraps an externally-owned buffer; re-attach is
    permitted so each AICPU boot can reuse the pooled image.
  • runtime_create_from_sm split into reserve_layout / init_data_from_layout
    / wire_arena_pointers / finalize_after_wire; orchestrator / scheduler /
    tensor_map / ready_queue / spsc gain matching data+wire pairs.
    finalize_after_wire stays AICPU-only since it binds s_runtime_ops.
  • pto2_sm_layout helper computes SM device-side field addresses by pure
    offset arithmetic so host init never dereferences SM.
  • Per-slot SM-side reset moved from RingSchedState::init into
    PTO2SharedMemoryHandle::init_header_per_ring so the AICPU still owns it.
  • New file runtime/shared/pto_runtime2_init.cpp holds the host-able pieces
    lifted out of pto_runtime2.cpp / pto_orchestrator.cpp /
    pto_scheduler.cpp. AICPU-only ops table / submit_task / dispatch stay put.
  • DeviceRunner::setup_static_arena now takes a third runtime_arena_size
    region (hbg passes 0 — hbg has no prebuilt runtime arena).
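
As a rough illustration of the phase split, the sketch below builds a pointer-free image on the "host", copies it, and wires arena-internal pointers against the new base. It assumes a toy two-object arena: `Layout`, `Queue`, `RuntimeImage`, and `build_host_image` are hypothetical, and only the phase names `reserve_layout`, `init_data_from_layout`, and `wire_arena_pointers` are taken from the PR (their real signatures may differ).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>
#include <vector>

// Offsets from the arena base; an image holding only offsets can be copied
// to any address and still be wired correctly.
struct Layout {
    std::size_t off_runtime;
    std::size_t off_queue;
    std::size_t total;
};

struct Queue {
    std::uint32_t capacity;
    std::uint32_t head;
};

struct RuntimeImage {
    Layout layout;  // carried inside the image so the wire pass can use it
    Queue* queue;   // arena-internal pointer, valid only after wiring
};

inline std::size_t align_up(std::size_t v, std::size_t a) { return (v + a - 1) / a * a; }

// Phase 1: decide where each object lives.
inline Layout reserve_layout() {
    Layout l{};
    l.off_runtime = 0;
    l.off_queue = align_up(sizeof(RuntimeImage), alignof(Queue));
    l.total = l.off_queue + sizeof(Queue);
    return l;
}

// Phase 2: write all pointer-free data; pointers are left null.
inline void init_data_from_layout(unsigned char* base, const Layout& l, std::uint32_t capacity) {
    new (base + l.off_runtime) RuntimeImage{l, nullptr};
    new (base + l.off_queue) Queue{capacity, 0};
}

// Phase 3: runs wherever the image finally lives and fixes up pointers
// relative to that base.
inline RuntimeImage* wire_arena_pointers(unsigned char* base, std::size_t off_runtime) {
    auto* rt = reinterpret_cast<RuntimeImage*>(base + off_runtime);
    assert(rt->layout.off_queue + sizeof(Queue) <= rt->layout.total);
    rt->queue = reinterpret_cast<Queue*>(base + rt->layout.off_queue);
    return rt;
}

// Host side of the sketch: layout plus data init, no wiring.
inline std::vector<unsigned char> build_host_image(std::uint32_t capacity) {
    const Layout l = reserve_layout();
    std::vector<unsigned char> image(l.total);
    init_data_from_layout(image.data(), l, capacity);
    return image;
}
```

A plain `std::memcpy` into a second buffer stands in for the `rtMemcpy` to device here; calling `wire_arena_pointers` on the copy is the analogue of the AICPU boot step. The device-only `finalize_after_wire` phase is omitted from the sketch.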

Why a5 is deliberately not touched in this PR

The host-build refactor is a non-trivial reshape of the runtime arena init
path. Keeping a5 on the old AICPU-side path until a2a3 has time on real
hardware lets us validate the new contract (layout/init/wire/finalize phases,
pooled image lifecycle, SM-reset boundary) without making a5 a moving target.
Once stable, the a5 mirror is a mechanical follow-up.

Test plan

  • cpput: 25/25 pass — ready_queue / spsc_queue / scheduler_state /
    task_state / wiring / tensormap UTs migrated to the data+wire API.
    task_allocator.init grew an optional initial_local_task_id (default
    0) so the near-INT32_MAX corner case is still exercised without an SM
    dereference.
  • a2a3sim trb: standalone (dynamic_register variants, L3
    group/dependency) + L2 tensormap_and_ringbuffer 29 tests all pass.
  • a2a3sim host_build_graph: 9/9 pass — verifies the shared HostApi
    changes (3-arg setup_static_arena, new acquire_pooled_runtime_arena
    field) don't break hbg.
  • a2a3 hardware: tests/st/.../paged_attention_unroll passes on
    device 9 (--build, pto-isa commit pinned to CI).

Post-review hardening (commit 75f2562)

Addresses feedback from two independent review passes:

  • pto2_sm_layout::ring_task_descriptors_addr: now takes a per-ring
    task_window_sizes[] array (mirroring the SM API) instead of a single
    uniform value; adds a ring_id range assert. Structurally prevents the
    host-built image from silently disagreeing with the SM layout if anyone
    later introduces per-ring window sizes.
  • DeviceRunner::acquire_pooled_runtime_arena (onboard + sim): now
    returns nullptr when runtime_arena_region_off_ == SIZE_MAX so a stray
    hbg-path call cannot resolve to base + SIZE_MAX.
  • DeviceArena::attach(): documentation rewritten to match real
    behavior (region table is not repopulated; reserve() cannot replay;
    region_size() returns 0); the pre-alignment / non-null / power-of-two
    checks now std::abort() unconditionally instead of relying on
    assert() (which is stripped in release builds).
  • PTO2TensorMap::orch: dead back-pointer field removed (a2a3 never
    dereferences it); wire_arena_pointers loses its parent_orch
    parameter; forward declaration of PTO2OrchestratorState removed.
  • PTO2Runtime::prebuilt_arena_base: dead mirror field removed. The
    host Runtime::prebuilt_arena_base_ is the real source of truth (AICPU
    reads it to locate the pooled buffer before it can dereference the
    image); the image still carries prebuilt_layout, which is consumed.
  • PTO2SchedulerState::init_data_from_layout /
    RingSchedState::init_data_from_layout: the unused task_window_size /
    dep_pool_capacity parameters are dropped (scheduler only needs SM base
    + ring index, both window-size-independent). All 4 cpput callsites
    updated.
  • PTO2RingFlowControl::init(): comment added pointing back at
    PTO2TaskAllocator::init's initial_local_task_id default, so future
    changes to fc initial value / boot ordering are flagged in the same
    edit.
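
The `SIZE_MAX` guard on `acquire_pooled_runtime_arena` might look something like the sketch below. `PooledArena` is a hypothetical, trimmed-down stand-in for `DeviceRunner`, and the region ordering (gm_heap, then SM, then runtime arena) is an assumption made only for the offset arithmetic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the pooled-arena bookkeeping in DeviceRunner.
class PooledArena {
public:
    // A runtime_arena_size of 0 (the hbg case) leaves the region
    // unprovisioned, marked by the SIZE_MAX sentinel.
    void setup(unsigned char* base, std::size_t gm_heap_size, std::size_t sm_size,
               std::size_t runtime_arena_size) {
        assert(base != nullptr);
        base_ = base;
        runtime_arena_region_off_ =
            (runtime_arena_size == 0) ? SIZE_MAX : gm_heap_size + sm_size;
    }

    // Returns nullptr instead of computing base + SIZE_MAX when the region
    // was never provisioned.
    unsigned char* acquire_pooled_runtime_arena() const {
        if (base_ == nullptr || runtime_arena_region_off_ == SIZE_MAX) {
            return nullptr;
        }
        return base_ + runtime_arena_region_off_;
    }

private:
    unsigned char* base_ = nullptr;
    std::size_t runtime_arena_region_off_ = SIZE_MAX;
};
```

Callers on the trb path would then treat a null return as a hard error, while an accidental hbg-path call fails at the acquire boundary rather than producing a wild pointer.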

Test plan (post-hardening)

  • cpput: 25/25 pass.
  • a2a3sim trb: dummy_task + dynamic_register + L2 trb suite
    pass with --build (forces host + AICPU recompile).

poursoul added 2 commits May 22, 2026 12:22

fe5d662: Refactor: defer slot_state payload/task bind to orch::prepare_task

Move the per-slot payload/task pointer assignments out of the
RingSchedState::init() O(task_window_size) loop and into orch::prepare_task.
Their value is per-slot constant (&task_payloads[slot] /
&task_descriptors[slot]) but writing them at submit time, on the same 64B
slot_state cache line prepare_task is already dirtying, is essentially
free — while removing the only "scale-dependent" pointer assignments from
the init path. ring_id stays in init (its value is per-ring constant, so
rewriting it each submit would only add noise without removing a loop).

Split PTO2TaskSlotState::bind() into bind_ring() (init-time) and
bind_buffers() (per-submit) to make the two call-site shapes explicit.

Mirrored across both a2a3 and a5 trb runtimes.

d33daa5: Refactor: host-build trb runtime arena, AICPU does only wire + SM reset

Previously the AICPU rebuilt the entire trb runtime arena (PTO2Runtime,
orchestrator/scheduler/tensor_map sub-regions, sm_handle wrapper,
mailbox) on every device boot via runtime_create_from_sm. This commit
moves layout + data init onto the host so the AICPU only does a cheap
arena-internal pointer wire pass plus the SM reset that can't run
off-device. Multi-run boots reuse the pooled prebuilt image with a
single rtMemcpy.

Mechanism
- DeviceArena::attach() wraps an externally-owned buffer; re-attach is
  permitted so each AICPU boot can reuse the pooled image.
- runtime_create_from_sm split into reserve_layout / init_data_from_layout
  / wire_arena_pointers / finalize_after_wire. orchestrator / scheduler /
  tensor_map / ready_queue / spsc gain matching data+wire pairs;
  finalize_after_wire stays AICPU-only since it binds s_runtime_ops.
- pto2_sm_layout helper computes SM field device addresses by pure
  offset arithmetic so host init never dereferences SM.
- Per-slot SM-side reset (bind_ring + reset_for_reuse + active_mask)
  moved from RingSchedState::init into
  PTO2SharedMemoryHandle::init_header_per_ring so the AICPU still owns
  it after the split.
- runtime/shared/pto_runtime2_init.cpp — new file holding the host-able
  pieces lifted out of pto_runtime2.cpp / pto_orchestrator.cpp /
  pto_scheduler.cpp. AICPU-only ops table / submit_task / dispatch
  stay in place.
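
To illustrate the "pure offset arithmetic" idea behind pto2_sm_layout, here is a small sketch that computes a ring's descriptor-array address from the SM base treated as an integer. The constants `kNumRings`, `kHeaderBytes`, and `kDescriptorBytes` and the assumed layout (header followed by per-ring descriptor arrays) are illustrative only; the per-ring `task_window_sizes[]` parameter follows the form adopted in the later hardening commit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// All three constants are assumptions for this sketch, not the real SM layout.
constexpr std::size_t kNumRings = 4;
constexpr std::uint64_t kHeaderBytes = 256;
constexpr std::uint64_t kDescriptorBytes = 64;

// Device address of ring `ring_id`'s descriptor array. `sm_base` is only
// ever used as a number, so this is safe to run on the host, where the
// device address cannot be dereferenced.
inline std::uint64_t ring_task_descriptors_addr(
    std::uint64_t sm_base,
    const std::uint64_t (&task_window_sizes)[kNumRings],
    std::size_t ring_id) {
    assert(ring_id < kNumRings);
    std::uint64_t offset = kHeaderBytes;
    // Skip the descriptor arrays of all lower-numbered rings.
    for (std::size_t r = 0; r < ring_id; ++r) {
        offset += task_window_sizes[r] * kDescriptorBytes;
    }
    return sm_base + offset;
}
```

Because each ring's size is read from the array, a non-uniform window configuration shifts later rings' addresses automatically instead of silently assuming one shared size.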

Host wiring (runtime_maker.cpp)
- DeviceRunner::setup_static_arena gains a third runtime_arena_size
  region (hbg passes 0). The prebuilt image lives in the same pooled
  backing allocation as gm_heap and SM, keeping worker lifetime to one
  rtMalloc.
- bind_prepared_to_runtime_impl reserves layout on a host arena, sizes
  the pooled regions, runs init_data + wire, stashes prebuilt metadata
  into the rt image, rtMemcpys to device, and records base/offset on
  Runtime so the AICPU boot can find it.

AICPU boot (aicpu_executor.cpp)
- attach the runtime arena to the pooled buffer, take rt from
  base+off_runtime, wire arena-internal pointers, sm_handle->init
  (SM reset including the per-slot fields above), mailbox reset,
  finalize_after_wire (ops table + cluster/aiv counts).

Tests
- cpput: 25/25 pass. ready_queue / spsc_queue / scheduler_state /
  task_state / wiring / tensormap UTs migrated to the data+wire API.
  task_allocator.init grew an optional initial_local_task_id (default
  0) so UTs can still exercise task_id near INT32_MAX without reading
  the SM.
- a2a3sim trb: standalone (dynamic_register variants, L3
  group/dependency) + L2 tensormap_and_ringbuffer 29 tests all pass.
- a2a3sim host_build_graph: 9/9 pass (verifies the shared HostApi
  changes don't break hbg).
- a2a3 hardware: tests/st/.../paged_attention_unroll PASS on device 9
  (--build with pto-isa commit pinned to CI).

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements a prebuilt-arena fast path for the PTO2 runtime, allowing the host to pre-compute the runtime arena image and upload it to the device. This optimization reduces AICPU boot time by replacing full initialization with a simple attachment and pointer "wiring" phase. Key changes include refactoring the initialization logic for the runtime, orchestrator, and scheduler into separate data-population and pointer-wiring stages, extending the DeviceRunner to manage a pooled runtime arena, and adding an attach method to DeviceArena for externally-owned buffers. Review feedback correctly identified potential undefined behavior in the new acquire_pooled_runtime_arena methods when the arena is not provisioned, suggesting defensive checks against SIZE_MAX offsets.

Comment thread src/a2a3/platform/onboard/host/device_runner.cpp
Comment thread src/a2a3/platform/sim/host/device_runner.cpp
Address review feedback from PR hw-native-sys#846:

- pto2_sm_layout::ring_task_descriptors_addr: take per-ring task_window_sizes[]
  array (mirroring PTO2SharedMemoryHandle's SM API) and assert ring_id range,
  so a future per-ring SM layout cannot silently disagree with the addresses
  the host bakes into the prebuilt image.
- DeviceRunner::acquire_pooled_runtime_arena (onboard + sim): return nullptr
  when runtime_arena_region_off_ == SIZE_MAX so a stray hbg-path call cannot
  resolve to base + SIZE_MAX. Failure is now loud and contained at the
  acquire boundary.
- DeviceArena::attach(): rewrite doc to match real behavior (region table is
  not repopulated after attach, reserve() asserts !committed_ so cannot
  replay, region_size() returns 0); promote the pre-alignment / non-null /
  power-of-two checks from plain assert() to an unconditional abort() so
  release builds still trap on contract violations.
- PTO2TensorMap: drop the dead `orch` back-pointer field (a2a3 never
  dereferences it), strip parent_orch parameter from wire_arena_pointers,
  and remove the now-unused PTO2OrchestratorState forward declaration.
- PTO2RingFlowControl::init(): add a coupling comment so future fc-initial-
  value or boot-order changes flag PTO2TaskAllocator::init's
  initial_local_task_id default in the same edit.
- PTO2SchedulerState::init_data_from_layout / RingSchedState::
  init_data_from_layout: drop the task_window_size / dep_pool_capacity
  parameters that were never consumed (scheduler only needs SM base + ring
  index, both window-size-independent; orchestrator counterpart still takes
  task_window_size for ring_task_descriptors arithmetic). Updated all
  callsites (pto_runtime2_init.cpp + 4 cpput suites).
- PTO2Runtime::prebuilt_arena_base: removed the dead mirror field. The host
  Runtime's prebuilt_arena_base_ is the real source of truth (AICPU reads it
  to locate the pooled buffer *before* dereferencing the image); the
  PTO2Runtime image still carries prebuilt_layout, which the AICPU does
  consume.

cpput: 25/25 pass. a2a3sim trb: dummy_task / dynamic_register / L2 trb
suite pass with --build.
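
The assert()-to-abort() promotion for DeviceArena::attach() can be pictured with the sketch below. `Arena`, `attach_contract_ok`, and the `attach` signature are hypothetical; the three checked conditions (non-null base, power-of-two alignment, pre-aligned base) are the ones named in the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

inline bool is_power_of_two(std::size_t v) { return v != 0 && (v & (v - 1)) == 0; }

// The attach preconditions as a pure predicate. The power-of-two test runs
// before the modulo so a zero alignment never reaches the division.
inline bool attach_contract_ok(const void* base, std::size_t alignment) {
    return base != nullptr && is_power_of_two(alignment) &&
           reinterpret_cast<std::uintptr_t>(base) % alignment == 0;
}

// Hypothetical arena wrapper around an externally-owned buffer.
class Arena {
public:
    void attach(void* base, std::size_t size, std::size_t alignment) {
        // Unconditional check: still traps when NDEBUG is defined.
        if (!attach_contract_ok(base, alignment)) {
            std::fprintf(stderr, "Arena::attach: contract violation\n");
            std::abort();
        }
        // Debug-only hint for contrast: compiled out under NDEBUG.
        assert(size > 0);
        base_ = static_cast<unsigned char*>(base);
        size_ = size;
    }

    unsigned char* base() const { return base_; }
    std::size_t size() const { return size_; }

private:
    unsigned char* base_ = nullptr;
    std::size_t size_ = 0;
};
```

Re-attaching simply overwrites `base_` and `size_` in this sketch, which loosely mirrors the "re-attach is permitted" behavior described earlier; the real class's region-table semantics are not modeled.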