Make NVTE tensor handle pool size configurable by lhb8125 · Pull Request #3090 · NVIDIA/TransformerEngine

lhb8125 · 2026-06-05T07:17:06Z

Summary

Add runtime environment variables to configure the internal NVTETensor and NVTEGroupedTensor handle pool sizes.
Keep the default pool size at 20 MiB to preserve existing behavior.
Improve exhaustion errors so they report the configured pool size and name the environment variable to increase.

Motivation

Large model initialization paths can legitimately create more TE tensor handles than the current fixed-size pool allows, even when GPU and CPU memory are otherwise sufficient. Exposing the pool size as an environment variable avoids downstream source patches for these scale-dependent cases.

Testing

git diff --check

Signed-off-by: hongbinl <hongbinl@nvidia.com>

greptile-apps · 2026-06-05T07:24:28Z

Greptile Summary

This PR exposes the internal NVTETensor and NVTEGroupedTensor handle pool sizes as runtime environment variables (NVTE_TENSOR_HANDLE_POOL_SIZE_MB / NVTE_GROUPED_TENSOR_HANDLE_POOL_SIZE_MB), defaulting to 20 MiB. It also improves exhaustion error messages to mention the configured pool size and the env var to increase.

transformer_engine/common/transformer_engine.cpp: Adds a standalone GetTensorHandlePoolSizeMB parser with full-string digit validation and overflow guards, and a GetTensorHandlePoolCapacity helper; both allocator classes now use these to populate their MAX_*_NUM members via default member initializers in declaration order.
docs/envvars.rst: Documents the two new env vars under a new "General" subsection with type, default, and description.
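
The parsing and capacity steps described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the names `ParsePoolSizeMB`, `GetPoolSizeMB`, and `PoolCapacity` are hypothetical, and standard exceptions stand in for `NVTE_CHECK`. It assumes the value must be a non-empty string of decimal digits and that a pool smaller than one handle is rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <limits>
#include <stdexcept>
#include <string>

// Assumed default, matching the 20 MiB default stated in the PR.
constexpr std::size_t kDefaultPoolSizeMB = 20;
constexpr std::size_t kBytesPerMiB = 1024 * 1024;

// Parse a pool size in MiB from the raw text of an environment variable.
// `value` is nullptr when the variable is unset (hypothetical helper).
std::size_t ParsePoolSizeMB(const char* value, const std::string& env_name) {
  if (value == nullptr) return kDefaultPoolSizeMB;
  const std::string text(value);
  if (text.empty()) throw std::invalid_argument(env_name + " must not be empty");

  // Largest MiB count whose byte size still fits in std::size_t.
  constexpr std::size_t kLimit =
      std::numeric_limits<std::size_t>::max() / kBytesPerMiB;
  std::size_t result = 0;
  for (char c : text) {
    // Full-string validation: every character must be a decimal digit.
    if (c < '0' || c > '9') {
      throw std::invalid_argument(env_name + " must be a non-negative integer");
    }
    const std::size_t digit = static_cast<std::size_t>(c - '0');
    // Overflow guard: reject before result * 10 + digit can exceed kLimit.
    if (result > (kLimit - digit) / 10) {
      throw std::out_of_range(env_name + " is too large");
    }
    result = result * 10 + digit;
  }
  return result;
}

// Read the variable from the process environment and parse it.
std::size_t GetPoolSizeMB(const char* env_name) {
  return ParsePoolSizeMB(std::getenv(env_name), env_name);
}

// Convert a pool size in MiB into a number of handles of `handle_size` bytes.
std::size_t PoolCapacity(std::size_t pool_size_mb, std::size_t handle_size) {
  assert(handle_size > 0 && "handle size must be non-zero");
  const std::size_t pool_size_bytes = pool_size_mb * kBytesPerMiB;
  if (pool_size_bytes < handle_size) {
    throw std::invalid_argument("pool is too small to hold a single handle");
  }
  return pool_size_bytes / handle_size;
}
```

Separating the string parser from the `std::getenv` call keeps the validation testable without touching the process environment.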

Confidence Score: 5/5

Safe to merge — the change is additive, backward-compatible (default stays 20 MiB), and the parsing logic is sound.

The env-var parser uses full-string digit validation with a correct overflow guard, member initialization order in both allocator classes ensures the pool-size field is ready before the capacity field consumes it, and all error paths produce clear diagnostics. No existing behavior changes unless the new env vars are explicitly set.
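
The initialization-order point can be illustrated with a small sketch. C++ initializes non-static data members in declaration order, so a default member initializer may safely read a member declared before it. The class and member names below are hypothetical stand-ins, not the allocator classes from the PR, and the 256-byte handle size is an arbitrary assumption.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a tensor handle; the real handle size is not
// stated in the PR discussion.
struct Handle {
  char payload[256];
};

class HandlePool {
 public:
  explicit HandlePool(std::size_t pool_size_mb) : pool_size_mb_(pool_size_mb) {
    // By the time the constructor body runs, both members are initialized,
    // and max_handles_ was computed from the already-set pool_size_mb_.
    assert(max_handles_ == pool_size_mb_ * kBytesPerMiB / sizeof(Handle));
  }

  std::size_t pool_size_mb() const { return pool_size_mb_; }
  std::size_t max_handles() const { return max_handles_; }

 private:
  static constexpr std::size_t kBytesPerMiB = 1024 * 1024;

  // Declared first, therefore initialized first.
  const std::size_t pool_size_mb_;
  // Declared second: its default member initializer reads pool_size_mb_,
  // which is safe only because of the declaration order above.
  const std::size_t max_handles_ = pool_size_mb_ * kBytesPerMiB / sizeof(Handle);
};
```

Swapping the two declarations would make `max_handles_` read an uninitialized value, which is why the declaration order matters here.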

No files require special attention.

Important Files Changed

Filename	Overview
transformer_engine/common/transformer_engine.cpp	Adds configurable pool-size helpers with correct full-string validation, overflow guards, and proper member-declaration order so TENSOR_HANDLE_POOL_SIZE_MB is initialized before MAX_TENSOR_NUM.
docs/envvars.rst	Documents the two new env vars correctly; defaults and types match the implementation.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Program startup / first tensor alloc] --> B{Singleton init\nTensorAllocator / GroupedTensorAllocator}
    B --> C[GetTensorHandlePoolSizeMB\nenv_var]
    C --> D{env var set?}
    D -- No --> E[return kDefault = 20 MiB]
    D -- Yes --> F[Full-string digit validation\n+ overflow check]
    F -- Invalid --> G[NVTE_CHECK throws\nstartup error]
    F -- Valid --> H[return pool_size_mb]
    E --> I[GetTensorHandlePoolCapacity\npool_size_mb, sizeof Handle]
    H --> I
    I --> J{pool_size_bytes >= handle_size?}
    J -- No --> K[NVTE_CHECK: pool too small]
    J -- Yes --> L[capacity = pool_size_bytes / sizeof Handle]
    L --> M[memory.reserve capacity]
    M --> N[Allocate / Free tensors]
    N --> O{available >= N?}
    O -- No --> P[NVTE_CHECK error\nmentions MB and env var]
    O -- Yes --> Q[Allocate from free_list or emplace_back]
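
The allocation branch at the bottom of the flowchart can be sketched as below. This is an assumption-laden illustration rather than the TransformerEngine allocator: `HandleAllocator`, its members, and the error text are hypothetical, a `std::runtime_error` stands in for `NVTE_CHECK`, and handles are identified by their index in the backing vector.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

struct Handle {
  int data = 0;  // Hypothetical payload.
};

class HandleAllocator {
 public:
  HandleAllocator(std::size_t capacity, std::size_t pool_size_mb, std::string env_name)
      : capacity_(capacity), pool_size_mb_(pool_size_mb), env_name_(std::move(env_name)) {
    memory_.reserve(capacity_);  // Reserve the whole pool up front.
  }

  // Number of handles that can still be handed out.
  std::size_t Available() const {
    return capacity_ - memory_.size() + free_list_.size();
  }

  // Returns the index of a handle, reusing a freed slot when possible.
  std::size_t Allocate() {
    if (Available() == 0) {
      // Exhaustion error reports the configured size and the variable to raise.
      throw std::runtime_error("Handle pool exhausted (configured size: " +
                               std::to_string(pool_size_mb_) + " MiB). Increase " +
                               env_name_ + " to allow more handles.");
    }
    if (!free_list_.empty()) {
      const std::size_t index = free_list_.back();
      free_list_.pop_back();
      return index;
    }
    memory_.emplace_back();
    return memory_.size() - 1;
  }

  // Returns a slot to the free list so a later Allocate() can reuse it.
  void Free(std::size_t index) {
    assert(index < memory_.size());
    free_list_.push_back(index);
  }

 private:
  std::size_t capacity_;
  std::size_t pool_size_mb_;
  std::string env_name_;
  std::vector<Handle> memory_;
  std::vector<std::size_t> free_list_;
};
```

Naming the environment variable in the error message lets a user who hits the limit fix it without reading the source.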

ptrendx · 2026-06-09T15:42:16Z

I am not opposed to creating such a variable, but I would really like to see an example of such legitimate use which goes over this limit. Could you run the experiment that is failing for you with https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/common/transformer_engine.cpp#L487 set to true and send me the log of that?

lhb8125 · 2026-06-16T08:37:59Z

te_handle_pool_debug_fakepg_20260616-010629_2134338.pruned.log
I attached a pruned version of the log, in which repeated lines are replaced with a comment. The failing case is Qwen3-235B training with EP1, so there are 128 local experts, and we use full recompute to limit the dynamic activation memory footprint.
I sent you the full log via Slack.

Make NVTE tensor handle pool size configurable

ec48388

Signed-off-by: hongbinl <hongbinl@nvidia.com>

github-actions Bot added the community-contribution label Jun 5, 2026

[pre-commit.ci] auto fixes from pre-commit.com hooks

c3a3285

for more information, see https://pre-commit.ci

greptile-apps Bot reviewed Jun 5, 2026

Comment thread transformer_engine/common/transformer_engine.cpp Outdated

Comment thread transformer_engine/common/transformer_engine.cpp

lhb8125 marked this pull request as draft June 5, 2026 08:17

lhb8125 and others added 2 commits June 8, 2026 08:22

Validate tensor handle pool env vars

5c9ee29

Signed-off-by: hongbinl <hongbinl@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

935054d

for more information, see https://pre-commit.ci

lhb8125 marked this pull request as ready for review June 16, 2026 13:49

lhb8125 wants to merge 4 commits into NVIDIA:main from lhb8125:codex/nvte-tensor-pool-env-var
