Npu adapt megatron #153

Open

addsubmuldiv wants to merge 10 commits into modelscope:main from addsubmuldiv:npu_adapt_megatron
Conversation

Collaborator

@addsubmuldiv addsubmuldiv commented Apr 13, 2026

PR Type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

Summary

This PR completes Twinkle's NPU Megatron adaptation and targets the Twinkle + Megatron-LM 0.15.3 + MindSpeed 0.15.3 + mcore-bridge stack. The goal is to make the dense / LoRA 8-card training path stable on NPU.

Main changes:

  • Move MindSpeed bootstrap before mcore_bridge is imported to avoid late patching and early binding of TE / Megatron symbols.
  • Build MindSpeed runtime args from the current ModelConfig and the runtime parallel topology, then call repatch() when the runtime signature changes.
  • Fix distributed initialization and metric gathering on NPU:
    • add a default PG fallback for single-rank local smoke
    • reuse Megatron's Gloo DP group for Python object gathering on NPU
  • Fix causal mask handling for NPU FlashAttention:
    • stop feeding Twinkle's 4D dense causal mask directly into the MindSpeed TE flash path
    • let MindSpeed generate the compressed causal mask on the causal NPU path
  • Complete multi-LoRA compatibility for the NPU Megatron path:
    • multi-tenant LoRA training
    • multi-tenant save/export flow
    • optimizer capability selection cleanup

What Changed

1. MindSpeed runtime bootstrap

  • Added an NPU-only runtime bootstrap to ensure MindSpeed patching happens before mcore_bridge import.
  • Unified MindSpeed runtime arg generation into one path so Twinkle and MindSpeed do not read inconsistent runtime state.
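
The repatch-on-signature-change behavior described above can be sketched as a small guard. This is a minimal illustration, not the PR's actual API: `RuntimeSignature`, its fields, and `maybe_repatch` are hypothetical names, and the real signature is synthesized from the full `ModelConfig` plus the parallel topology.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class RuntimeSignature:
    # Hypothetical fields; the real signature would be derived from
    # ModelConfig and the active parallel topology.
    tp_size: int
    pp_size: int
    cp_size: int
    seq_length: int

_last_signature: Optional[RuntimeSignature] = None

def maybe_repatch(sig: RuntimeSignature, repatch: Callable[[], None]) -> bool:
    """Invoke ``repatch()`` only when the runtime signature has changed.

    Returns True if a repatch was performed, False if the cached
    signature matched and patching was skipped.
    """
    global _last_signature
    if sig == _last_signature:
        return False
    _last_signature = sig
    repatch()
    return True
```

The frozen dataclass gives value-based equality for free, so the cache check is a single comparison rather than a field-by-field diff.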

2. Process group / metric gather

  • Fixed default PG initialization for single-rank Megatron smoke.
  • Changed NPU gather_object() to prefer Megatron's Gloo DP group to avoid hangs in metrics / Python object gathering.
  • Kept the DP+CP group selection for CP-enabled runs.
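
The default-PG fallback for single-rank smoke runs can be sketched roughly as follows. The helper name, the `gloo` backend choice, and the localhost rendezvous defaults are illustrative assumptions; the PR's actual implementation may differ.

```python
import os
import torch.distributed as dist

def ensure_default_pg(backend: str = 'gloo') -> None:
    """Hypothetical fallback: initialize a single-rank default process
    group so a local smoke test can run without torchrun or a cluster
    launcher."""
    if dist.is_initialized():
        return
    # Single-rank rendezvous on localhost; real multi-rank runs get these
    # values from the launcher environment instead.
    os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group(backend=backend, rank=0, world_size=1)
```

With a world size of 1, collectives like `all_gather_object` degenerate to local copies, which is exactly what a single-rank smoke run needs.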

3. NPU FlashAttention

  • Fixed causal attention mask handling on NPU.
  • For causal NPU paths, no longer pass Twinkle's 4D dense mask directly, avoiding the MindSpeed TE FlashAttention shape mismatch.
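
The mask-dropping decision reduces to a small predicate. This is a sketch of the condition as described above; the function name and the plain-int `mask_dim` parameter are hypothetical simplifications (the real code inspects the attention-mask tensor and the unwrapped model config at forward time).

```python
def should_drop_dense_causal_mask(device_prefix: str,
                                  mask_dim: int,
                                  attention_mask_type: str,
                                  cp_world_size: int) -> bool:
    """Hypothetical predicate: drop the 4D dense mask only on the NPU
    causal path, where MindSpeed builds its own compressed causal mask
    and a dense mask trips the TE FlashAttention shape checks."""
    if device_prefix != 'npu':
        return False
    if mask_dim != 4:
        return False
    # The cp_world_size > 1 guard mirrors the condition in the PR as
    # reviewed; review feedback suggested relaxing it to all causal
    # NPU configurations.
    return attention_mask_type == 'causal' and cp_world_size > 1
```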

4. LoRA / Multi-LoRA

  • Fixed runtime checks for LoRA finalize so a bare model with ddp_config is not incorrectly treated as a model that can run native finalize.
  • Cleaned up optimizer capability selection for multi-LoRA so it uses the local bf16 optimizer path that fits the model structure.
  • Fixed the multi-LoRA save callback signature so the current tenant adapter is correctly passed through during save.

5. Documentation

  • Updated the NPU support docs with Megatron backend installation and usage guidance.
  • Added installation notes for Megatron / MindSpeed / mcore-bridge and the matching cookbook smoke entrypoints.

Notes

This PR targets the following version stack:

  • Megatron-LM 0.15.3
  • MindSpeed 0.15.3
  • mcore-bridge
  • Twinkle NPU environment

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces comprehensive NPU support for the Megatron backend: documentation updates for environment requirements, a new MindSpeed runtime bootstrap for NPU-specific patching and argument synthesis, refined process group initialization for single-rank environments, optimized attention mask handling for NPU FlashAttention, and the use of Gloo groups for object gathering on NPU to prevent hangs. Review feedback pointed out a potential initialization error from invalid arguments to init_process_group, a hard dependency on megatron-core in utility functions, and hardcoded paths in the documentation, and suggested expanding the mask-dropping logic to all causal NPU configurations.

Comment on lines +176 to +177

```python
if backend == 'nccl':
    init_kwargs['device_id'] = torch.device(Platform.get_local_device())
```
Contributor


high

The torch.distributed.init_process_group function does not accept a device_id argument in standard PyTorch. Including it in init_kwargs will result in a TypeError during initialization. For the nccl backend, device binding is correctly handled by calling torch.cuda.set_device() before initialization, which is already performed via torch_util.set_device() on line 112.

Suggested change

```diff
-if backend == 'nccl':
-    init_kwargs['device_id'] = torch.device(Platform.get_local_device())
+if backend == 'nccl':
+    # NCCL binding is handled by torch.cuda.set_device() called earlier
+    pass
```

Comment on lines +51 to +53

```python
from megatron.core import parallel_state as mpu
process_group = mpu.get_data_parallel_group_gloo(
    with_context_parallel=getattr(device_mesh, 'cp_world_size', 1) > 1)
```
Contributor


high

This block introduces a hard dependency on megatron-core whenever the platform is NPU. Since twinkle can be used on NPU with other backends (such as pure transformers), this will cause an ImportError for users who have not installed Megatron dependencies. Consider wrapping this import in a try...except block to allow a graceful fallback to the default process group.

Suggested change

```diff
-from megatron.core import parallel_state as mpu
-process_group = mpu.get_data_parallel_group_gloo(
-    with_context_parallel=getattr(device_mesh, 'cp_world_size', 1) > 1)
+try:
+    from megatron.core import parallel_state as mpu
+    process_group = mpu.get_data_parallel_group_gloo(
+        with_context_parallel=getattr(device_mesh, 'cp_world_size', 1) > 1)
+except (ImportError, ModuleNotFoundError):
+    pass
```

```python
if isinstance(attention_mask, torch.Tensor) and attention_mask.dim() == 4:
    unwrapped_model = self.strategy.unwrap_model([model])[0]
    attention_mask_type = getattr(unwrapped_model.config, 'attention_mask_type', None)
    if attention_mask_type == 'causal' and self.device_mesh.cp_world_size > 1:
```
Contributor


medium

The condition self.device_mesh.cp_world_size > 1 might be too restrictive. The comment explains that the 4D dense mask causes failures in aclnnFlashAttentionScore on NPU and is redundant for causal training. If the NPU FlashAttention implementation requires a specific mask shape (or no mask for causal models), this issue likely affects TP-only configurations as well. Removing the CP world size check ensures FlashAttention works correctly on NPU for all parallel configurations where causal attention is used.

Suggested change

```diff
-if attention_mask_type == 'causal' and self.device_mesh.cp_world_size > 1:
+if attention_mask_type == 'causal':
```

addsubmuldiv and others added 4 commits April 13, 2026 16:50
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@addsubmuldiv addsubmuldiv marked this pull request as ready for review April 13, 2026 11:42
Copilot AI review requested due to automatic review settings April 13, 2026 11:42

Copilot AI left a comment


Pull request overview

This PR completes Twinkle’s NPU Megatron integration targeting the Megatron-LM 0.15.3 + MindSpeed 0.15.3 + mcore-bridge stack, focusing on stabilizing 8-card dense/LoRA training on NPU by fixing MindSpeed bootstrap timing, distributed/metric collectives, and NPU FlashAttention mask handling.

Changes:

  • Add an NPU MindSpeed bootstrap layer to ensure adaptor patching happens before mcore_bridge imports Megatron/TE, and synthesize/refresh MindSpeed runtime args from ModelConfig.
  • Adjust Megatron initialization for NPU (default PG fallback, Gloo process groups, metrics/object-gather behavior) and fix causal mask handling for NPU FlashAttention.
  • Update NPU documentation and add Megatron NPU smoke cookbooks/scripts.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.

Show a summary per file

| File | Description |
| --- | --- |
| src/twinkle/utils/framework.py | Prefer Megatron’s Gloo DP group for all_gather_object on NPU to avoid HCCL hangs during metric/object collection. |
| src/twinkle/model/megatron/strategy/megatron.py | NPU-specific Megatron init tweaks (Gloo PG creation, device binding cleanup), MoE sequence-parallel auto-enable, and MindSpeed runtime arg configuration. |
| src/twinkle/model/megatron/multi_lora_megatron.py | Reorder MindSpeed patching ahead of mcore_bridge import for the NPU multi-LoRA Megatron path. |
| src/twinkle/model/megatron/megatron.py | Add default-PG fallback for single-rank smoke, ensure early MindSpeed patching, and drop dense 4D causal masks on the NPU causal TE flash path. |
| src/twinkle/model/megatron/_mindspeed_runtime.py | New module implementing early MindSpeed adaptor patching, runtime args synthesis, and conditional repatching. |
| docs/source_en/Usage Guide/NPU-Support.md | Update NPU dependency guidance, add Megatron backend install steps, and point to Megatron NPU smoke cookbooks. |
| cookbook/megatron/ascend/tp_npu.py (+ .sh) | Add 8-card TP/PP/DP NPU Megatron smoke script. |
| cookbook/megatron/ascend/tp_moe_npu.py (+ .sh) | Add 8-card MoE NPU smoke script. |
| cookbook/megatron/ascend/tp_moe_cp_npu.py (+ .sh) | Add 8-card MoE+CP NPU smoke script (megatron_cp_algo path). |

```diff
 from torch.optim import Optimizer
 from torch.optim.lr_scheduler import LRScheduler
-from transformers import PretrainedConfig
+from transformers import PreTrainedConfig
```

Copilot AI Apr 13, 2026


transformers exposes PretrainedConfig (lowercase “t”), not PreTrainedConfig. Importing PreTrainedConfig will raise ImportError at runtime. Please switch the import (and corresponding type hints) back to PretrainedConfig to match the Transformers API and the rest of the codebase.

Suggested change

```diff
-from transformers import PreTrainedConfig
+from transformers import PretrainedConfig
```

Comment on lines 84 to 90

```diff
 def __init__(
     self,
     model_id: str,
-    config: Optional[PretrainedConfig] = None,
+    config: Optional[PreTrainedConfig] = None,
     ddp_config: Optional[Dict[str, Any]] = None,
     device_mesh: Optional[DeviceMesh] = None,
     mixed_precision: Literal['no', 'fp16', 'bf16'] = 'bf16',
```

Copilot AI Apr 13, 2026


The config parameter is annotated as PreTrainedConfig, but Transformers’ config base class is PretrainedConfig. With the current import this will fail to import at runtime; please update the annotation to use PretrainedConfig after fixing the import.

Comment on lines 1 to 6
# Copyright (c) ModelScope Contributors. All rights reserved.
import torch
import torch.nn as nn
from transformers import PreTrainedConfig
from typing import Any, Dict, List, Literal, Optional

Copy link

Copilot AI Apr 13, 2026


transformers uses PretrainedConfig (lowercase “t”), not PreTrainedConfig. This import will fail at runtime and also makes the Megatron strategy inconsistent with other Twinkle model code that uses PretrainedConfig.

Comment on lines +45 to 59

```python
if Platform.device_prefix() == 'npu':
    # On NPU, letting Python object collectives use the default HCCL
    # group previously hung in 8-card metric collection at
    # ``dist.all_gather_object(...)``. Reuse Megatron's dedicated Gloo
    # DP group instead. When CP is enabled we must pick the DP+CP
    # variant, otherwise the rank span for metric aggregation is wrong.
    try:
        from megatron.core import parallel_state as mpu
        process_group = mpu.get_data_parallel_group_gloo(
            with_context_parallel=getattr(device_mesh, 'cp_world_size', 1) > 1)
    except (ImportError, ModuleNotFoundError):
        pass
group_size = dist.get_world_size(group=process_group)
output_objects = [None for _ in range(group_size)]
dist.all_gather_object(output_objects, object, group=process_group)
```

Copilot AI Apr 13, 2026


On NPU this tries to switch all_gather_object onto Megatron’s Gloo DP group, but if Megatron isn’t importable (or its groups aren’t initialized yet) the code silently falls back to the default group and still calls dist.all_gather_object(...), which your comment notes can hang on HCCL. Consider adding a safe fallback (e.g., create/use a dedicated Gloo process group for object collectives) or raise/log a clear error instead of silently using the default group.
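
One way to realize such a safe fallback is to make the selection order explicit. This sketch passes the group providers in as callables purely for illustration; the names and error-handling policy are assumptions, not the project's actual code.

```python
def pick_object_gather_group(get_megatron_gloo_group, make_dedicated_gloo_group):
    """Hypothetical selection order for Python object collectives on NPU:
    prefer Megatron's Gloo DP group, then a dedicated Gloo group, and
    never silently fall back to the default HCCL group (which is known
    to hang in all_gather_object)."""
    try:
        return get_megatron_gloo_group()
    except (ImportError, ModuleNotFoundError, RuntimeError):
        pass  # megatron-core absent, or its groups not initialized yet
    group = make_dedicated_gloo_group()
    if group is None:
        raise RuntimeError(
            'No Gloo group available for object gather on NPU; '
            'refusing to use the default HCCL group')
    return group
```

Failing loudly in the last branch trades a hang for an immediate, diagnosable error, which is usually the better behavior in distributed init paths.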

Comment on lines +110 to +116

```python
default_pg = dist.distributed_c10d._get_default_group()
if getattr(default_pg, 'bound_device_id', None) is not None:
    # If the default HCCL PG keeps a bound device id, PyTorch may
    # propagate that binding into later Gloo subgroup creation. That
    # breaks the metrics/object-gather path on NPU, so clear it
    # before Megatron creates its Gloo DP groups.
    default_pg.bound_device_id = None
```

Copilot AI Apr 13, 2026


This relies on private PyTorch APIs (dist.distributed_c10d._get_default_group()) and unconditionally assigns to default_pg.bound_device_id. Depending on the torch version/backend, bound_device_id may be read-only or absent, which would raise at init time. Please guard both retrieval and assignment with try/except (or hasattr + safe setter) so Megatron init can’t crash on versions where this attribute isn’t writable.

Suggested change

```diff
-default_pg = dist.distributed_c10d._get_default_group()
-if getattr(default_pg, 'bound_device_id', None) is not None:
-    # If the default HCCL PG keeps a bound device id, PyTorch may
-    # propagate that binding into later Gloo subgroup creation. That
-    # breaks the metrics/object-gather path on NPU, so clear it
-    # before Megatron creates its Gloo DP groups.
-    default_pg.bound_device_id = None
+try:
+    default_pg = dist.distributed_c10d._get_default_group()
+except Exception as e:
+    logger.debug(f'Failed to get default process group for NPU bound_device_id workaround: {e}')
+else:
+    if getattr(default_pg, 'bound_device_id', None) is not None:
+        # If the default HCCL PG keeps a bound device id, PyTorch may
+        # propagate that binding into later Gloo subgroup creation. That
+        # breaks the metrics/object-gather path on NPU, so clear it
+        # before Megatron creates its Gloo DP groups.
+        try:
+            default_pg.bound_device_id = None
+        except Exception as e:
+            logger.debug(
+                f'Failed to clear default process group bound_device_id for NPU workaround: {e}')
```

```python
        self.active_group = _default_adapter_name
        MegatronPeft().__call__()

    def _ensure_megatron_process_group(self):
```
Collaborator


The self._try_init_process_group() above should already cover the process_group initialization; why does it need to be handled a second time?

Collaborator Author


There was indeed an issue here; the related logic has been rewritten.

```python
# and padded query positions are ignored by labels == -100. So on
# the NPU TE path, drop this dense mask and let MindSpeed build the
# compressed causal mask it requires.
if Platform.device_prefix() == 'npu':
```
Collaborator


Would it be more appropriate to put this if inside InputProcessor?

Collaborator Author


This check depends on unwrapped_model.config.attention_mask_type, which comes from the Megatron runtime model config and can only be read from the model instance at forward time. InputProcessor is a pure data-processing component that holds no model reference, so it cannot make this decision. I have wrapped this logic up to keep it a bit cleaner.

```python
    from megatron.core import parallel_state as mpu
    process_group = mpu.get_data_parallel_group_gloo(
        with_context_parallel=getattr(device_mesh, 'cp_world_size', 1) > 1)
except (ImportError, ModuleNotFoundError):
```
Collaborator


This try/except seems pointless; if initialization fails, it should just raise.

Collaborator Author


Changed as suggested.
