
Make NVTE tensor handle pool size configurable #3090

Open
lhb8125 wants to merge 4 commits into NVIDIA:main from lhb8125:codex/nvte-tensor-pool-env-var

Conversation

lhb8125 commented Jun 5, 2026

Contributor

Summary

  • Add runtime environment variables to configure the internal NVTETensor and NVTEGroupedTensor handle pool sizes.
  • Keep the default pool size at 20 MiB to preserve existing behavior.
  • Improve exhaustion errors to mention the configured pool size and the env var to increase.

Motivation

Large model initialization paths can legitimately create more TE tensor handles than the current fixed-size pool allows, even when GPU and CPU memory are otherwise sufficient. Exposing the pool size as an environment variable avoids downstream source patches for these scale-dependent cases.
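
To make the scale dependence concrete, here is a minimal sketch of how a pool size in MiB bounds the number of handles. The helper name `HandleCapacity` and the 256-byte handle size are illustrative assumptions; the actual size of a TE tensor handle is not stated in this PR.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not the PR's code): number of fixed-size handles that
// fit in a pool of pool_size_mb MiB.
inline std::size_t HandleCapacity(std::size_t pool_size_mb, std::size_t handle_size) {
  assert(handle_size > 0);  // A zero-sized handle would divide by zero.
  const std::size_t pool_size_bytes = pool_size_mb * 1024 * 1024;
  return pool_size_bytes / handle_size;
}
```

Under the assumed 256-byte handle, `HandleCapacity(20, 256)` gives 81,920 handles for the default pool, and doubling the pool size doubles that limit.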

Testing

  • git diff --check

Signed-off-by: hongbinl <hongbinl@nvidia.com>
github-actions Bot added the community-contribution label Jun 5, 2026 (label description: PRs from external contributor outside the core maintainers, representing community-driven work.)
greptile-apps Bot commented Jun 5, 2026

Contributor

Greptile Summary

This PR exposes the internal NVTETensor and NVTEGroupedTensor handle pool sizes as runtime environment variables (NVTE_TENSOR_HANDLE_POOL_SIZE_MB / NVTE_GROUPED_TENSOR_HANDLE_POOL_SIZE_MB), defaulting to 20 MiB. It also improves exhaustion error messages to mention the configured pool size and the env var to increase.

  • transformer_engine/common/transformer_engine.cpp: Adds a standalone GetTensorHandlePoolSizeMB parser with full-string digit validation and overflow guards, and a GetTensorHandlePoolCapacity helper; both allocator classes now use these to populate their MAX_*_NUM members via default member initializers in declaration order.
  • docs/envvars.rst: Documents the two new env vars under a new "General" subsection with type, default, and description.
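
As a rough illustration of what "full-string digit validation and overflow guards" can look like, here is a sketch under stated assumptions. The names `ParsePoolSizeMB` and `GetPoolSizeMBFromEnv` are hypothetical, and `std::invalid_argument` stands in for the `NVTE_CHECK` failures mentioned in the summary; the PR's actual implementation may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <limits>
#include <stdexcept>
#include <string>

// Assumed default, matching the 20 MiB stated in the PR summary.
constexpr std::size_t kDefaultPoolSizeMB = 20;

// Hypothetical parser: accepts only a non-empty string of decimal digits.
inline std::size_t ParsePoolSizeMB(const std::string &text) {
  if (text.empty()) throw std::invalid_argument("pool size is empty");
  // Largest MiB value whose byte count still fits in size_t.
  const std::size_t max_mb = std::numeric_limits<std::size_t>::max() / (1024 * 1024);
  std::size_t value = 0;
  for (char c : text) {
    // Full-string validation: any non-digit (sign, space, unit suffix) is rejected.
    if (c < '0' || c > '9') {
      throw std::invalid_argument("pool size must contain digits only: " + text);
    }
    const std::size_t digit = static_cast<std::size_t>(c - '0');
    // Overflow guard: test before multiplying so value * 10 + digit cannot wrap.
    if (value > (max_mb - digit) / 10) {
      throw std::invalid_argument("pool size is too large: " + text);
    }
    value = value * 10 + digit;
  }
  return value;
}

// Reads the variable from the environment; an unset variable means "use the default".
inline std::size_t GetPoolSizeMBFromEnv(const char *env_var) {
  assert(env_var != nullptr);
  const char *raw = std::getenv(env_var);
  return raw == nullptr ? kDefaultPoolSizeMB : ParsePoolSizeMB(raw);
}
```

Walking the whole string by hand, rather than calling `std::strtoul` and ignoring the end pointer, is what makes inputs such as `64MB` or `-1` fail loudly instead of being silently truncated or wrapped.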

Confidence Score: 5/5

Safe to merge — the change is additive, backward-compatible (default stays 20 MiB), and the parsing logic is sound.

The env-var parser uses full-string digit validation with a correct overflow guard, member initialization order in both allocator classes ensures the pool-size field is ready before the capacity field consumes it, and all error paths produce clear diagnostics. No existing behavior changes unless the new env vars are explicitly set.
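
The point about member initialization order can be illustrated with a small sketch. The class and member names below are hypothetical, and the 256-byte `FakeHandle` is an assumed stand-in for the real handle type.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a tensor handle; the real type and size differ.
struct FakeHandle {
  unsigned char bytes[256];
};

// Non-static data members are initialized in declaration order, regardless of
// the order in which a constructor's initializer list names them. Declaring
// the pool size first makes it safe for the capacity initializer to read it.
class AllocatorSketch {
 public:
  // Sanity check: the capacity was derived from an initialized pool size.
  AllocatorSketch() { assert(max_handles_ > 0); }

  std::size_t pool_size_mb() const { return pool_size_mb_; }
  std::size_t max_handles() const { return max_handles_; }

 private:
  // Declared first, so initialized first.
  const std::size_t pool_size_mb_ = 20;
  // Declared second: reads pool_size_mb_ only after it has been set.
  const std::size_t max_handles_ = pool_size_mb_ * 1024 * 1024 / sizeof(FakeHandle);
};
```

If the two declarations were swapped, `max_handles_` would read an uninitialized `pool_size_mb_`, which is undefined behavior; many compilers flag the constructor-initializer-list variant of this mistake with `-Wreorder`.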

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/common/transformer_engine.cpp | Adds configurable pool-size helpers with correct full-string validation, overflow guards, and proper member-declaration order so TENSOR_HANDLE_POOL_SIZE_MB is initialized before MAX_TENSOR_NUM. |
| docs/envvars.rst | Documents the two new env vars correctly; defaults and types match the implementation. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Program startup / first tensor alloc] --> B{Singleton init\nTensorAllocator / GroupedTensorAllocator}
    B --> C[GetTensorHandlePoolSizeMB\nenv_var]
    C --> D{env var set?}
    D -- No --> E[return kDefault = 20 MiB]
    D -- Yes --> F[Full-string digit validation\n+ overflow check]
    F -- Invalid --> G[NVTE_CHECK throws\nstartup error]
    F -- Valid --> H[return pool_size_mb]
    E --> I[GetTensorHandlePoolCapacity\npool_size_mb, sizeof Handle]
    H --> I
    I --> J{pool_size_bytes >= handle_size?}
    J -- No --> K[NVTE_CHECK: pool too small]
    J -- Yes --> L[capacity = pool_size_bytes / sizeof Handle]
    L --> M[memory.reserve capacity]
    M --> N[Allocate / Free tensors]
    N --> O{available >= N?}
    O -- No --> P[NVTE_CHECK error\nmentions MB and env var]
    O -- Yes --> Q[Allocate from free_list or emplace_back]
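
The allocation branch at the bottom of the flowchart can be sketched as follows. This is a simplified, single-threaded illustration that allocates one handle at a time; `HandlePoolSketch`, its members, and the error wording are hypothetical, and `std::runtime_error` stands in for `NVTE_CHECK`.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical pool of index-addressed handles with a fixed capacity.
class HandlePoolSketch {
 public:
  HandlePoolSketch(std::size_t capacity, std::size_t pool_size_mb, std::string env_var)
      : capacity_(capacity), pool_size_mb_(pool_size_mb), env_var_(std::move(env_var)) {
    assert(capacity_ > 0);
    slots_.reserve(capacity_);  // Reserve once so the vector never reallocates.
  }

  // Returns a 1-based handle index; throws when the pool is exhausted.
  std::size_t Allocate() {
    if (!free_list_.empty()) {  // Reuse a previously freed slot first.
      const std::size_t index = free_list_.back();
      free_list_.pop_back();
      return index;
    }
    if (slots_.size() >= capacity_) {
      // Mention both the configured size and the knob that raises it.
      throw std::runtime_error("Handle pool exhausted (configured size: " +
                               std::to_string(pool_size_mb_) + " MiB). Increase " +
                               env_var_ + " to allow more handles.");
    }
    slots_.emplace_back(0);  // Grow into the reserved storage.
    return slots_.size();
  }

  // Returns a slot to the free list for later reuse.
  void Free(std::size_t index) { free_list_.push_back(index); }

 private:
  std::size_t capacity_;
  std::size_t pool_size_mb_;
  std::string env_var_;
  std::vector<int> slots_;
  std::vector<std::size_t> free_list_;
};
```

Naming the environment variable in the exception text means a user who hits the limit can act on the message directly, without searching the source for the constant.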

Reviews (2): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."

Comment thread transformer_engine/common/transformer_engine.cpp Outdated
Comment thread transformer_engine/common/transformer_engine.cpp
@lhb8125 lhb8125 marked this pull request as draft June 5, 2026 08:17
lhb8125 and others added 2 commits June 8, 2026 08:22
Signed-off-by: hongbinl <hongbinl@nvidia.com>
ptrendx commented Jun 9, 2026

Member

I am not opposed to creating such a variable, but I would really like to see an example of such legitimate use which goes over this limit. Could you run the experiment that is failing for you with https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/common/transformer_engine.cpp#L487 set to true and send me the log of that?

lhb8125 commented Jun 16, 2026

Contributor Author

> I am not opposed to creating such a variable, but I would really like to see an example of such legitimate use which goes over this limit. Could you run the experiment that is failing for you with https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/common/transformer_engine.cpp#L487 set to true and send me the log of that?

te_handle_pool_debug_fakepg_20260616-010629_2134338.pruned.log
I attached a pruned version of the log, in which repeated lines are replaced with a comment. The failing case is Qwen3-235B training with EP1, so each rank holds 128 local experts, and we use full recompute to control the memory footprint of dynamic activations.
I sent the full log to you via Slack.

@lhb8125 lhb8125 marked this pull request as ready for review June 16, 2026 13:49