
Update: auto-resolve CallConfig.block_dim to max stream capacity #850

Merged
ChaoWao merged 1 commit into hw-native-sys:main from ChaoWao:feat/auto-resolve-block-dim-from-stream-limit
May 25, 2026


@ChaoWao
Collaborator

@ChaoWao ChaoWao commented May 25, 2026

Summary

  • CallConfig::block_dim default flips from 24 to 0 (sentinel for "auto"); DeviceRunner::run resolves 0 at launch time to the max the AICore stream can host.
  • Onboard runners query aclrtGetStreamResLimit(CUBE_CORE / VECTOR_CORE) and pick min(cube/AIC_per_bd, vector/AIV_per_bd), falling back to PLATFORM_MAX_BLOCKDIM if the ACL call fails. Sim runners short-circuit to PLATFORM_MAX_BLOCKDIM since there is no per-stream resource query.
  • scene_test.py's implicit block_dim fallback (1) is dropped in favour of 0 so cases that omit block_dim exercise the new auto path.
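
The resolution rule described in the bullets above can be modelled in a few lines of Python. This is only an illustrative sketch, not the code in device_runner.cpp: `resolve_block_dim`, the `query_limits` callable (standing in for the two aclrtGetStreamResLimit queries), the per-block core counts, and the cap of 24 are all assumptions chosen for the example. The platform-cap clamp discussed under "Behavior change" is left out here to keep the sketch small.

```python
from typing import Callable, Optional, Tuple

# Hypothetical values for illustration; the real constants live in the platform headers.
PLATFORM_MAX_BLOCKDIM = 24
AIC_PER_BLOCKDIM = 1  # assumed cube (AIC) cores consumed per block
AIV_PER_BLOCKDIM = 2  # assumed vector (AIV) cores consumed per block

# Stand-in for the ACL query: returns (cube_cores, vector_cores), or None on failure.
LimitQuery = Callable[[], Optional[Tuple[int, int]]]


def query_max_block_dim(query_limits: LimitQuery) -> int:
    """Largest block_dim the stream can host, or the platform cap if the query fails."""
    limits = query_limits()
    if limits is None:
        return PLATFORM_MAX_BLOCKDIM
    cube_cores, vector_cores = limits
    # A block needs both kinds of core, so the scarcer resource decides.
    return min(cube_cores // AIC_PER_BLOCKDIM, vector_cores // AIV_PER_BLOCKDIM)


def resolve_block_dim(block_dim: int, query_limits: LimitQuery) -> int:
    """Treat 0 as the "auto" sentinel; pass explicit values through untouched."""
    if block_dim == 0:
        return query_max_block_dim(query_limits)
    return block_dim
```

Under these assumed core counts, a stream limited to 8 cube cores resolves to a block_dim of 8 even if 48 vector cores are available, because the cube cores run out first.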

Behavior change

  • Any caller constructing CallConfig() without setting block_dim previously got 24. They now get the stream's max — 36 on a full-die a5 stream, 24 on a2a3, possibly smaller on resource-limited streams.
  • validate_block_dim now applies the PLATFORM_MAX_BLOCKDIM clamp on both the ACL-success path and the ACL-fallback path (previously only the fallback was clamped). This is a defensive narrowing: the runtime handshake/scheduler arrays are statically sized to RUNTIME_MAX_WORKER = PLATFORM_MAX_BLOCKDIM * PLATFORM_CORES_PER_BLOCKDIM, so accepting an ACL-reported value larger than the platform cap would overrun those arrays. No impact has been observed on current hardware, but explicit values above the platform cap now fail early instead of corrupting memory later.
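
To make the clamp concrete, here is a rough Python model of the validation described above. It is a sketch under stated assumptions rather than the actual implementation: `effective_limit` is a hypothetical helper, and the values 24 and 3 for the two platform constants are picked only for illustration.

```python
from typing import Optional

# Hypothetical values for illustration only.
PLATFORM_MAX_BLOCKDIM = 24
PLATFORM_CORES_PER_BLOCKDIM = 3
# Size of the statically allocated handshake/scheduler arrays.
RUNTIME_MAX_WORKER = PLATFORM_MAX_BLOCKDIM * PLATFORM_CORES_PER_BLOCKDIM


def effective_limit(acl_reported: Optional[int]) -> int:
    """Upper bound for block_dim; acl_reported is None when the ACL query failed."""
    stream_limit = PLATFORM_MAX_BLOCKDIM if acl_reported is None else acl_reported
    # The clamp applies on both paths, so the bound never exceeds the platform cap.
    return min(stream_limit, PLATFORM_MAX_BLOCKDIM)


def validate_block_dim(block_dim: int, acl_reported: Optional[int]) -> int:
    """Reject explicit values outside 1..limit before any worker array is indexed."""
    limit = effective_limit(acl_reported)
    if not 1 <= block_dim <= limit:
        raise ValueError(f"block_dim {block_dim} is outside the valid range 1..{limit}")
    return block_dim
```

With the clamp in place, any accepted block_dim multiplied by PLATFORM_CORES_PER_BLOCKDIM stays within RUNTIME_MAX_WORKER, even when the stream reports more cores than the platform cap allows.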

Why

Most callers want "use everything the stream allows" and were hand-picking 24 arbitrarily. Centralizing the policy in DeviceRunner removes the magic number from user code and adapts automatically when stream capacity is constrained (CPU partitioning, model-shared streams, etc.).

Test plan

  • Local sim build (pip install --no-build-isolation -e .) passes
  • tests/ut/py/test_chip_worker.py::TestCallConfig updated for new default; pre-commit hooks (clang-tidy / cpplint / pyright) clean
  • spmd_basic ST gains a Case2_AutoBlockDim on both a5 and a2a3 that omits block_dim, so onboard CI exercises query_max_block_dim end-to-end
  • simpler_setup/scene_test.py fallback flipped from 1 to 0 so future cases that omit block_dim also exercise the auto path


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces an "auto" resolution feature for block_dim by changing its default value from 24 to 0. When set to 0, the DeviceRunner dynamically determines the maximum allowable block_dim based on stream resource limits or platform constants. Feedback from the reviewer highlights a loss of diagnostic logging information regarding specific core limits during validation and identifies performance redundancies where resource limits are queried multiple times during the launch path.

Comment thread src/a2a3/platform/onboard/host/device_runner.cpp Outdated
Comment thread src/a2a3/platform/onboard/host/device_runner.cpp Outdated
Comment thread src/a5/platform/onboard/host/device_runner.cpp Outdated
Comment thread src/a5/platform/onboard/host/device_runner.cpp Outdated
@ChaoWao ChaoWao force-pushed the feat/auto-resolve-block-dim-from-stream-limit branch 2 times, most recently from d81ca56 to a3cf9ad May 25, 2026 02:14
Flip CallConfig::block_dim default from 24 to 0, and treat 0 as a
sentinel that DeviceRunner resolves at run() time. Onboard runners ask
aclrtGetStreamResLimit (CUBE_CORE / VECTOR_CORE) for the per-stream
cap and pick min(cube/AIC_per_bd, vector/AIV_per_bd); sim runners use
the static PLATFORM_MAX_BLOCKDIM. Existing explicit positive values
still go through the same validation path unchanged.

- Refactor onboard validate_block_dim to share query_max_block_dim
  with the auto-resolution path (a5 + a2a3)
- Sim runners short-circuit block_dim == 0 to PLATFORM_MAX_BLOCKDIM
  before the range check (a5 + a2a3)
- Update getting-started and chip-level-arch examples to show the
  auto default; test_chip_worker now asserts block_dim defaults to 0
- Document the sentinel in CallConfig header and ChipWorker.run docstring
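
The ordering in the sim-runner bullet matters: a range check of 1..PLATFORM_MAX_BLOCKDIM would reject the sentinel 0, so the substitution has to happen first. A minimal Python sketch of that ordering follows, assuming a cap of 24 and a hypothetical function name `sim_resolve_block_dim` (neither is taken from the PR's code).

```python
PLATFORM_MAX_BLOCKDIM = 24  # hypothetical value for illustration


def sim_resolve_block_dim(block_dim: int) -> int:
    """Model of the sim path: resolve the "auto" sentinel, then range-check."""
    # Step 1: 0 means "auto"; sim has no per-stream query, so use the static cap.
    if block_dim == 0:
        block_dim = PLATFORM_MAX_BLOCKDIM
    # Step 2: the usual range check, which now only sees resolved values.
    if not 1 <= block_dim <= PLATFORM_MAX_BLOCKDIM:
        raise ValueError(f"block_dim {block_dim} is outside 1..{PLATFORM_MAX_BLOCKDIM}")
    return block_dim
```

Explicit positive values take the same range check as before, so only callers that pass 0 (or omit block_dim) see different behavior.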
@ChaoWao ChaoWao force-pushed the feat/auto-resolve-block-dim-from-stream-limit branch from a3cf9ad to 08f3fba May 25, 2026 04:34
@ChaoWao ChaoWao merged commit b72df25 into hw-native-sys:main May 25, 2026
15 checks passed
@ChaoWao ChaoWao deleted the feat/auto-resolve-block-dim-from-stream-limit branch May 25, 2026 06:23
