
support mtp_decoder_input_detach #37

Merged
Jintao-Huang merged 2 commits into modelscope:main from Jintao-Huang:support_mtp_decoder_input_detach
Apr 18, 2026

Conversation

@Jintao-Huang
Collaborator

```shell
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
IMAGE_MAX_TOKEN_NUM=1024 \
VIDEO_MAX_TOKEN_NUM=128 \
FPS_MAX_FRAMES=12 \
megatron sft \
    --model Qwen/Qwen3.5-4B \
    --save_safetensors true \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
              'AI-ModelScope/alpaca-gpt4-data-en#500' \
              'swift/self-cognition#500' \
              'AI-ModelScope/LaTeX_OCR:human_handwrite#2000' \
    --model_author swift \
    --model_name swift-robot \
    --linear_decoupled_in_proj true \
    --load_from_cache_file true \
    --add_non_thinking_prefix true \
    --fp8_recipe blockwise \
    --fp8_format e4m3 \
    --fp8_param_gather true \
    --split_dataset_ratio 0.01 \
    --tuner_type full \
    --mtp_decoder_input_detach true \
    --tensor_model_parallel_size 2 \
    --micro_batch_size 1 \
    --global_batch_size 2 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --num_train_epochs 1 \
    --packing true \
    --finetune true \
    --freeze_llm false \
    --freeze_vit false \
    --freeze_aligner false \
    --cross_entropy_loss_fusion true \
    --lr 1e-5 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-6 \
    --output_dir megatron_output/Qwen3.5-4B-FP8 \
    --eval_steps 200 \
    --save_steps 200 \
    --max_length 4096 \
    --dataloader_num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --mtp_num_layers 1 \
    --attention_backend flash
```

`"mtp_decoder_input_detach": true`
[Screenshot 2026-04-18 21:16:45]

`"mtp_decoder_input_detach": false`
[Screenshot 2026-04-18 21:16:04]

@Jintao-Huang
Collaborator Author

#29


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a new configuration option, `mtp_decoder_input_detach`, in `ModelConfig` and implements the corresponding logic in the MTP layer to detach decoder inputs when the option is enabled. It also moves the `apply_module` import to module level for efficiency and improves the `DSAIndexer` patching logic in `patcher.py` by renaming classes to avoid name shadowing. I have no further feedback: the review comments were purely explanatory or confirmed that the existing implementation is correct.
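To illustrate the behavior the flag controls, here is a minimal hypothetical sketch (not the actual Megatron-SWIFT code): when detaching is enabled, the hidden states fed into the MTP (multi-token prediction) decoder are cut off from the autograd graph, so the MTP loss no longer back-propagates into the main transformer trunk. The helper name `mtp_decoder_input` is an assumption for illustration only.

```python
import torch

def mtp_decoder_input(hidden_states: torch.Tensor, detach: bool) -> torch.Tensor:
    """Return the MTP decoder input, optionally detached from the main graph.

    Hypothetical helper sketching the semantics of mtp_decoder_input_detach:
    detach=True severs the autograd link, so gradients from the MTP loss
    cannot flow back into the trunk that produced `hidden_states`.
    """
    return hidden_states.detach() if detach else hidden_states

hidden = torch.randn(2, 4, requires_grad=True)

detached = mtp_decoder_input(hidden, detach=True)
attached = mtp_decoder_input(hidden, detach=False)

# A detached tensor carries no gradient history back to `hidden`.
assert not detached.requires_grad
assert attached.requires_grad
```

Under this reading, the two screenshots above compare loss curves with the MTP gradient either isolated from (true) or flowing into (false) the main decoder stack.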

@Jintao-Huang merged commit 7e9a765 into modelscope:main Apr 18, 2026
1 check passed
