Add Muon optimizer with 8-bit quantized momentum (bnb.optim.Muon8bit) #1973

@theblackcat102

Feature request

Add a Muon optimizer to bitsandbytes.optim whose momentum buffer is stored in bitsandbytes' quantized formats, primarily an 8-bit blockwise variant (Muon8bit), with the existing 32-bit and 4-bit (NF4 / FP4 / NVFP4) paths as siblings.
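To make the idea of a quantized momentum buffer concrete, here is a minimal pure-Python sketch. It is not the code from the linked branch and not the bitsandbytes kernels. It assumes a simple linear absmax scheme (bitsandbytes' 8-bit optimizers use their own quantization map), a tiny illustrative block size, and the momentum form `m = beta * m + g`. All function names are hypothetical, and Muon's orthogonalization step is left out.

```python
# Sketch only: linear absmax blockwise quantization of a momentum buffer.
# Assumptions (not from the issue): block size, linear int8 codes,
# momentum rule m = beta * m + g, and every function name below.

BLOCK_SIZE = 4  # tiny for illustration; real block sizes are much larger


def quantize_blockwise(values, block_size=BLOCK_SIZE):
    """Return (codes, scales): int codes in [-127, 127] and one scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block)
        # An all-zero block gets a dummy scale to avoid dividing by zero.
        scale = absmax / 127.0 if absmax > 0 else 1.0
        scales.append(scale)
        codes.extend(int(round(v / scale)) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=BLOCK_SIZE):
    """Map each code back to a float using the scale of its block."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


def momentum_step(codes, scales, grad, beta=0.95, block_size=BLOCK_SIZE):
    """Dequantize the stored momentum, update it, and requantize it.

    Returns the full-precision momentum (which the optimizer would pass on
    to the orthogonalization step) plus the new quantized state.
    """
    momentum = dequantize_blockwise(codes, scales, block_size)
    momentum = [beta * m + g for m, g in zip(momentum, grad)]
    new_codes, new_scales = quantize_blockwise(momentum, block_size)
    return momentum, new_codes, new_scales
```

Only `codes` and `scales` persist between steps, so under these assumptions the state costs about one byte per parameter plus one scale per block. The rounding error for each element is at most half of its block's scale.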

Motivation

Major open models are now pretrained with Muon (e.g. Kimi, Laguna XS.2), and recent work suggests that the finetuning optimizer should match the pretraining optimizer: Liu, Wang & Zhang (arXiv:2605.06654) find that "full finetuning with the same optimizer as in pretraining achieves a better learning-forgetting tradeoff." As Muon-pretrained checkpoints proliferate, users will increasingly need a Muon optimizer to finetune them.

Your contribution

I have already implemented 32-bit, 8-bit, and 4-bit versions of Muon here: https://github.com/theblackcat102/bitsandbytes/tree/feature/muon

From a 50-step smoke test over 96 configurations (Qwen/Qwen3-4B-Base, bf16):

  • 8-bit Muon matches 32-bit Muon: the loss curves nearly overlap across all 50 steps, so 8-bit is a practical default rather than a quality trade-off.
  • Memory: Muon 8-bit uses 13.63 GB allocated vs 23.73 GB for Muon 32-bit (−43%), and its peak memory is 26% lower. The saving comes entirely from quantizing the momentum. In bf16 training, 32-bit Muon is a memory wash vs AdamW (AdamW's two bf16 state buffers and Muon's single fp32 momentum buffer both cost 4 B/param: 2×bf16 = 1×fp32), so the quantized buffer is what makes Muon worthwhile in bitsandbytes terms.
  • NF4, FP4, and NVFP4 converge equivalently at a stable LR; 8-bit captures most of the win (going from 8-bit to 4-bit saves only a further ~1.5 GB, since the remaining floor is model weights, not optimizer state).
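The memory argument above can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: it assumes one fp32 scale per block of 256 values for the quantized formats, which may not match the block size or scale format used by a given bitsandbytes path (NVFP4 in particular uses a different scaling layout). The function names are hypothetical.

```python
# Sketch only: per-parameter optimizer-state cost under assumed block settings.
# Assumptions (not from the issue): block_size=256 and a 4-byte scale per block.


def state_bytes_per_param(bits, block_size=256, scale_bytes=4):
    """Bytes per parameter for one state buffer stored at `bits` bits."""
    if bits >= 16:
        return bits / 8  # unquantized buffers carry no per-block scale
    return bits / 8 + scale_bytes / block_size


def relative_saving(before_gb, after_gb):
    """Fractional reduction going from `before_gb` to `after_gb`."""
    return (before_gb - after_gb) / before_gb


ADAMW_BF16 = 2 * state_bytes_per_param(16)  # two bf16 state buffers
MUON_FP32 = state_bytes_per_param(32)       # one fp32 momentum buffer
MUON_8BIT = state_bytes_per_param(8)        # 1 byte plus scale overhead
MUON_4BIT = state_bytes_per_param(4)        # 0.5 byte plus scale overhead

# Allocated-memory figures reported in the smoke test above.
ALLOCATED_SAVING = relative_saving(23.73, 13.63)
```

Under these assumptions the 32-bit to 8-bit step removes roughly 3 B/param of state, while 8-bit to 4-bit removes only about 0.5 B/param. That is consistent with the observation that 8-bit captures most of the saving.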