[M3 Request]  Demand for a localized "M3-Mini" / Flash variant optimized for 128GB RAM environments

### Capability area

Deployment / inference

### What does M2.7 fail to do for you?

Even with the recent local optimization frameworks, running MiniMax M2.7 on standard 128GB workstation configurations (such as standard Mac Studio setups or dual-GPU rigs totaling 128GB unified/VRAM) is heavily bottlenecked.

The 4-bit MXFP4 and standard 4-bit quantizations hover right around 122GB–129GB. This leaves zero headroom for system memory or the massive KV cache needed for long-context workloads. Running a 4-bit variant often results in active OOM faults or severe context capping (failing to go past low-depth generation), rendering the model's signature deep-context capabilities unusable locally without multiple nodes or cluster setups.

Example Workload: Parsing an entire code repository or a 300k token documentation library locally via vLLM or llama.cpp for multi-step agentic execution.

### What would "good" look like in M3?

A native "M3-Mini", "M3-Flash", or a highly-optimized smaller dense variant that occupies an unquantized footprint small enough to comfortably run quantized (e.g., 4-bit or 8-bit) inside a 128GB RAM envelope while preserving a functional KV-cache headroom.

Performance expectations:

Memory Footprint: Quantized model size staying well under 80GB–90GB to dedicate the remaining 38GB+ to the MiniMax Sparse Attention (MSA) context buffer.

Latency: Fast token-per-second decoding speeds (targeting 40-50 tok/s on mid-tier consumer/prosumer clusters).

Quality: Retention of M3’s core agentic capabilities (tool calling, SWE-bench coding depth) even if broad knowledge capacity is scaled down.

### References

The official [MiniMax Local Deployment Guide](https://platform.minimax.io/docs/guides/local-deploy) explicitly indicates that 4-bit variants require a minimum of 136GB–140GB of memory headroom to function safely, excluding a 128GB machine entirely without aggressive 3-bit performance degradation.

Competitor examples such as Qwen 3.6 35B / 27B combinations or DeepSeek V4 Flash variants, which provide explicit tiers for local deployment footprints.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[M3 Request] Demand for a localized "M3-Mini" / Flash variant optimized for 128GB RAM environments #15

Capability area

What does M2.7 fail to do for you?

What would "good" look like in M3?

References

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[M3 Request] Demand for a localized "M3-Mini" / Flash variant optimized for 128GB RAM environments #15

Description

Capability area

What does M2.7 fail to do for you?

What would "good" look like in M3?

References

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions