Update: auto-resolve CallConfig.block_dim to max stream capacity #850
Merged
ChaoWao merged 1 commit on May 25, 2026
Code Review
This pull request introduces an "auto" resolution feature for block_dim by changing its default value from 24 to 0. When set to 0, the DeviceRunner dynamically determines the maximum allowable block_dim based on stream resource limits or platform constants. Feedback from the reviewer highlights a loss of diagnostic logging information regarding specific core limits during validation and identifies performance redundancies where resource limits are queried multiple times during the launch path.
Flip CallConfig::block_dim default from 24 to 0, and treat 0 as a sentinel that DeviceRunner resolves at run() time. Onboard runners ask aclrtGetStreamResLimit (CUBE_CORE / VECTOR_CORE) for the per-stream cap and pick min(cube/AIC_per_bd, vector/AIV_per_bd); sim runners use the static PLATFORM_MAX_BLOCKDIM. Existing explicit positive values still go through the same validation path unchanged.

- Refactor onboard validate_block_dim to share query_max_block_dim with the auto-resolution path (a5 + a2a3)
- Sim runners short-circuit block_dim == 0 to PLATFORM_MAX_BLOCKDIM before the range check (a5 + a2a3)
- Update getting-started and chip-level-arch examples to show the auto default; test_chip_worker now asserts block_dim defaults to 0
- Document the sentinel in CallConfig header and ChipWorker.run docstring
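The resolution rule in the commit message can be sketched in a few lines of Python. This is an illustrative model, not the runner's actual C++ code: the function names `max_block_dim_from_limits` and `resolve_block_dim`, the per-block-dim core counts (1 AIC, 2 AIV), and the stream limits used in the note below are all assumptions made for the example.

```python
# Illustrative model of the block_dim sentinel; constants are assumed, not taken from the repo.
AUTO_BLOCK_DIM = 0       # sentinel: let the runner choose
AIC_PER_BLOCK_DIM = 1    # assumed cube cores consumed per block_dim
AIV_PER_BLOCK_DIM = 2    # assumed vector cores consumed per block_dim


def max_block_dim_from_limits(cube_limit: int, vector_limit: int) -> int:
    """Largest block_dim that both core pools can host."""
    return min(cube_limit // AIC_PER_BLOCK_DIM, vector_limit // AIV_PER_BLOCK_DIM)


def resolve_block_dim(requested: int, cube_limit: int, vector_limit: int) -> int:
    """Map the 0 sentinel to the stream maximum; range-check explicit values."""
    cap = max_block_dim_from_limits(cube_limit, vector_limit)
    if requested == AUTO_BLOCK_DIM:
        return cap
    if not 1 <= requested <= cap:
        raise ValueError(f"block_dim {requested} outside [1, {cap}]")
    return requested
```

With a hypothetical stream exposing 24 cube and 48 vector cores, `resolve_block_dim(0, 24, 48)` returns 24; if the stream is limited to 8 cube cores, the same call resolves to 8, and an explicit positive value is passed through only if it fits under that cap.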
Summary
- `CallConfig::block_dim` default flips from `24` to `0` (sentinel for "auto"); `DeviceRunner::run` resolves `0` at launch time to the max the AICore stream can host.
- Onboard runners query `aclrtGetStreamResLimit(CUBE_CORE / VECTOR_CORE)` and pick `min(cube/AIC_per_bd, vector/AIV_per_bd)`, falling back to `PLATFORM_MAX_BLOCKDIM` if the ACL call fails. Sim runners short-circuit to `PLATFORM_MAX_BLOCKDIM` since there is no per-stream resource query.
- `scene_test.py`'s implicit `block_dim` fallback (`1`) is dropped in favour of `0` so cases that omit `block_dim` exercise the new auto path.

Behavior change
- Callers that construct `CallConfig()` without setting `block_dim` previously got `24`. They now get the stream's max: `36` on a full-die a5 stream, `24` on a2a3, possibly smaller on resource-limited streams.
- `validate_block_dim` now applies the `PLATFORM_MAX_BLOCKDIM` clamp on both the ACL-success path and the ACL-fallback path (previously only the fallback was clamped). This is a defensive narrowing: the runtime handshake/scheduler arrays are statically sized to `RUNTIME_MAX_WORKER = PLATFORM_MAX_BLOCKDIM * PLATFORM_CORES_PER_BLOCKDIM`, so accepting an ACL-reported value larger than the platform cap would over-run those static arrays. No observed-hardware impact today, but explicit values above the platform cap now error early instead of corrupting later.

Why
Most callers want "use everything the stream allows" and were hand-picking `24` arbitrarily. Centralizing the policy in `DeviceRunner` removes the magic number from user code and adapts automatically when stream capacity is constrained (CPU partitioning, model-shared streams, etc.).

Test plan
- `pip install --no-build-isolation -e .` passes
- `tests/ut/py/test_chip_worker.py::TestCallConfig` updated for new default; pre-commit hooks (clang-tidy / cpplint / pyright) clean
- `spmd_basic` ST gains a `Case2_AutoBlockDim` on both a5 and a2a3 that omits `block_dim`, so onboard CI exercises `query_max_block_dim` end-to-end
- `simpler_setup/scene_test.py` fallback flipped from `1` to `0` so future cases that omit `block_dim` also exercise the auto path
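The clamp described under Behavior change can be sketched as follows. This is a minimal Python model rather than the repository's C++: the `query_stream_limits` callable is a hypothetical stand-in for the ACL query (returning `None` on failure), the constant values are placeholders, and the function bodies are assumptions. Only the names `query_max_block_dim`, `validate_block_dim`, `PLATFORM_MAX_BLOCKDIM`, `PLATFORM_CORES_PER_BLOCKDIM`, and `RUNTIME_MAX_WORKER` come from the description.

```python
from typing import Callable, Optional, Tuple

# Placeholder values for illustration; the real constants are platform-specific.
PLATFORM_MAX_BLOCKDIM = 24
AIC_PER_BLOCKDIM = 1              # assumed cube cores per block_dim
AIV_PER_BLOCKDIM = 2              # assumed vector cores per block_dim
PLATFORM_CORES_PER_BLOCKDIM = AIC_PER_BLOCKDIM + AIV_PER_BLOCKDIM
RUNTIME_MAX_WORKER = PLATFORM_MAX_BLOCKDIM * PLATFORM_CORES_PER_BLOCKDIM

# (cube_cores, vector_cores), or None when the resource query failed.
Limits = Optional[Tuple[int, int]]


def query_max_block_dim(query_stream_limits: Callable[[], Limits]) -> int:
    """Return the usable block_dim cap, clamped to the platform maximum on every path."""
    limits = query_stream_limits()
    if limits is None:
        # Query failed: fall back to the static platform cap.
        return PLATFORM_MAX_BLOCKDIM
    cube, vector = limits
    reported = min(cube // AIC_PER_BLOCKDIM, vector // AIV_PER_BLOCKDIM)
    # Clamp the success path too, so statically sized arrays are never over-run.
    return min(reported, PLATFORM_MAX_BLOCKDIM)


def validate_block_dim(block_dim: int, query_stream_limits: Callable[[], Limits]) -> int:
    """Reject explicit values above the cap before any worker array is touched."""
    cap = query_max_block_dim(query_stream_limits)
    if block_dim < 1 or block_dim > cap:
        raise ValueError(f"block_dim {block_dim} outside [1, {cap}]")
    # Invariant the clamp protects: worker count fits the static array size.
    assert block_dim * PLATFORM_CORES_PER_BLOCKDIM <= RUNTIME_MAX_WORKER
    return block_dim
```

In this model a stream that reports 100 cube and 200 vector cores still yields a cap of 24, and an explicit `block_dim` of 30 raises `ValueError` up front instead of indexing past `RUNTIME_MAX_WORKER` entries later.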