Feature request
Add a Muon optimizer to bitsandbytes.optim whose momentum buffer is stored in bitsandbytes' quantized formats, primarily an 8-bit blockwise variant (Muon8bit), with the existing 32-bit and 4-bit (NF4 / FP4 / NVFP4) paths as siblings.
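To make the request concrete, here is a minimal NumPy sketch of what a Muon step with an 8-bit blockwise momentum buffer could look like. It is not the proposed implementation and not the bitsandbytes API. The function names, the block size of 256, and the hyperparameters are illustrative assumptions. Plain linear absmax int8 quantization stands in for bitsandbytes' dynamic 8-bit map. The Newton-Schulz coefficients are those of the reference Muon implementation, and the update omits Nesterov momentum and the shape-dependent scale factor for brevity.

```python
import numpy as np

BLOCK = 256  # assumed block size, for illustration only


def quantize_blockwise(x, block=BLOCK):
    """Linear absmax int8 quantization per block (a stand-in, not the bitsandbytes kernel)."""
    flat = x.astype(np.float32).ravel()
    flat = np.pad(flat, (0, (-flat.size) % block))  # pad to a whole number of blocks
    blocks = flat.reshape(-1, block)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(absmax == 0, 1.0, absmax) / 127.0  # one scale per block
    q = np.round(blocks / scale).astype(np.int8)
    return q, scale.astype(np.float32)


def dequantize_blockwise(q, scale, shape):
    """Inverse of quantize_blockwise; drops the padding and restores the shape."""
    flat = (q.astype(np.float32) * scale).ravel()
    return flat[: int(np.prod(shape))].reshape(shape)


def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D matrix with the quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + eps)  # Frobenius norm bounds the spectral norm by 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # iterate on the wide orientation so x @ x.T is the smaller Gram matrix
    for _ in range(steps):
        gram = x @ x.T
        x = a * x + (b * gram + c * (gram @ gram)) @ x
    return x.T if transposed else x


def muon8bit_step(w, grad, q, scale, lr=0.02, beta=0.95):
    """One simplified Muon step; the momentum lives in int8 between steps."""
    m = dequantize_blockwise(q, scale, w.shape)  # momentum back to fp32
    m = beta * m + grad                          # plain (non-Nesterov) momentum
    w = w - lr * newton_schulz(m)                # orthogonalized update
    q, scale = quantize_blockwise(m)             # store momentum in 8 bits again
    return w, q, scale
```

Between steps only the int8 codes and one fp32 scale per block are kept, which is where the memory saving would come from.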
Motivation
Major open models are now pretrained with Muon (e.g. Kimi, Laguna XS.2), and recent work suggests the finetuning optimizer should be consistent with the pretraining optimizer: Liu, Wang & Zhang (arXiv:2605.06654) find that "full finetuning with the same optimizer as in pretraining achieves a better learning-forgetting tradeoff." As Muon-pretrained checkpoints proliferate, users will increasingly need a Muon optimizer to finetune them.
Your contribution
I have already implemented 32-bit, 8-bit, and 4-bit Muon variants in this branch: https://github.com/theblackcat102/bitsandbytes/tree/feature/muon
From a 50-step smoke test over 96 configurations (Qwen/Qwen3-4B-Base, bf16):
- 8-bit Muon matches 32-bit Muon: the loss curves nearly overlap across all 50 steps, so in this test 8-bit looks like a practical default rather than a quality trade-off.
- Memory: Muon 8-bit uses 13.63 GB allocated vs 23.73 GB for Muon 32-bit (−43%), and peak memory is 26% lower. The saving comes entirely from quantizing the momentum: in bf16 training, 32-bit Muon is a memory wash vs AdamW (AdamW's two bf16 states and Muon's single fp32 momentum both cost 4 B/param), so the quantized buffer is what makes Muon worthwhile in bitsandbytes terms.
- NF4 / FP4 / NVFP4 are equivalent on convergence under stable LR; 8-bit captures most of the win (8→4 bit saves only a further ~1.5 GB, since the remaining floor is model weights, not optimizer state).
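The per-parameter arithmetic behind these memory bullets can be sketched as follows. This is a back-of-envelope model, not a measurement. The helper name, the block size of 256, and the 4-byte per-block scale are assumptions made for illustration. It also ignores weights, gradients, activations, and any parameters that are not handled by Muon.

```python
def state_bytes_per_param(bits, n_states=1, blocksize=None, scale_bytes=4):
    """Optimizer-state bytes per parameter, including per-block scale overhead.

    blocksize=None means an unquantized state with no scale overhead.
    """
    overhead = scale_bytes / blocksize if blocksize else 0.0
    return n_states * (bits / 8 + overhead)


adamw_bf16 = state_bytes_per_param(16, n_states=2)   # 4.0 B/param (two bf16 states)
muon_fp32 = state_bytes_per_param(32)                # 4.0 B/param (one fp32 momentum)
muon_8bit = state_bytes_per_param(8, blocksize=256)  # 1.015625 B/param
muon_4bit = state_bytes_per_param(4, blocksize=256)  # 0.515625 B/param

print(f"32-bit -> 8-bit saves {muon_fp32 - muon_8bit:.3f} B/param")
print(f" 8-bit -> 4-bit saves {muon_8bit - muon_4bit:.3f} B/param")
```

Under these assumptions, going from 32-bit to 8-bit removes roughly 3 B/param of state, while going from 8-bit to 4-bit removes only a further 0.5 B/param, which is consistent with 8-bit capturing most of the saving.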
