
[M3 Request] Demand for a localized "M3-Mini" / Flash variant optimized for 128GB RAM environments #15

@DeepakJangra239


Capability area

Deployment / inference

What does M2.7 fail to do for you?

Even with recent local optimization frameworks, running MiniMax M2.7 on a typical 128GB workstation (for example, a Mac Studio or a dual-GPU rig with 128GB of combined unified memory or VRAM) is heavily bottlenecked by memory.

The MXFP4 and other standard 4-bit quantizations come in at roughly 122GB–129GB. That leaves essentially no headroom for the operating system or for the large KV cache that long-context workloads need. In practice, a 4-bit variant either hits out-of-memory (OOM) errors or has to run with a severely capped context length, which makes the model's signature deep-context capabilities unusable locally without a multi-node or cluster setup.
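To make the squeeze concrete, here is a minimal sketch of the headroom arithmetic using only the figures quoted above (128GB total, 122GB–129GB of weights). The function name `headroom_gb` is just an illustrative helper, not part of any tool.

```python
def headroom_gb(total_memory_gb: float, weights_gb: float) -> float:
    """Memory left for the OS, the runtime, and the KV cache once weights are loaded."""
    return total_memory_gb - weights_gb


TOTAL_GB = 128  # the workstation class discussed in this issue

for weights_gb in (122, 129):  # low and high end of the quoted 4-bit sizes
    left = headroom_gb(TOTAL_GB, weights_gb)
    if left < 0:
        print(f"{weights_gb} GB of weights: does not fit ({-left} GB short)")
    else:
        print(f"{weights_gb} GB of weights: {left} GB left for everything else")
```

Even at the low end of the range, whatever is left has to cover the OS, the inference runtime, and the KV cache together.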

Example workload: parsing an entire code repository or a 300k-token documentation library locally via vLLM or llama.cpp for multi-step agentic execution.

What would "good" look like in M3?

A native "M3-Mini", "M3-Flash", or similarly optimized smaller dense variant whose quantized weights (e.g., 4-bit or 8-bit) fit comfortably inside a 128GB memory envelope while leaving usable headroom for the KV cache.
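As a rough guide to what "small enough" could mean, the sketch below converts between parameter count and quantized weight size. It assumes 1 GB = 1e9 bytes and ignores per-block scales and other metadata, so real quantized files would be somewhat larger. The helper names are hypothetical, and the 80 GB and 90 GB budgets are the footprint target stated under the performance expectations in this issue.

```python
def quantized_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB, ignoring quantization scales and metadata."""
    return params_billions * bits_per_weight / 8


def max_params_billions(budget_gb: float, bits_per_weight: float) -> float:
    """Largest parameter count (in billions) whose weights fit in budget_gb."""
    return budget_gb * 8 / bits_per_weight


for budget_gb in (80, 90):
    for bits in (4, 8):
        limit = max_params_billions(budget_gb, bits)
        print(f"{budget_gb} GB budget at {bits}-bit: up to ~{limit:.0f}B parameters")
```

These are upper bounds on the weights alone. They say nothing about what parameter count MiniMax would actually choose.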

Performance expectations:

Memory footprint: quantized model size of 80GB–90GB or less, so that the remaining 38GB or more of a 128GB machine can go to the MiniMax Sparse Attention (MSA) context buffer.

Latency: fast decoding, targeting 40–50 tok/s on mid-tier consumer/prosumer clusters.

Quality: Retention of M3’s core agentic capabilities (tool calling, SWE-bench coding depth) even if broad knowledge capacity is scaled down.
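For the memory footprint target, a back-of-the-envelope way to see how far 38GB of context buffer might stretch is sketched below. It uses the standard dense grouped-query-attention KV-cache formula as a stand-in, because I don't know the actual MSA cache layout. The layer count, KV head count, and head dimension are placeholder values, not real M2.7 or M3 numbers.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_tokens(budget_gb: float, bytes_per_token: int) -> int:
    """Number of whole tokens that fit in budget_gb (1 GB = 1e9 bytes)."""
    return int(budget_gb * 1e9) // bytes_per_token


# HYPOTHETICAL architecture values, chosen only for illustration.
per_token = kv_bytes_per_token(n_layers=60, n_kv_heads=8, head_dim=128,
                               bytes_per_elem=2)  # 2 bytes = fp16/bf16 cache

print(f"KV cache per token: {per_token} bytes")
print(f"Tokens that fit in 38 GB: {max_tokens(38, per_token)}")
print(f"Cache needed for 300k tokens: {300_000 * per_token / 1e9:.1f} GB")
```

With these placeholder values a dense cache for the 300k-token example would not fit in 38GB. That's part of why the sparse-attention buffer size and KV-cache quantization matter so much for this request.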

References

The official MiniMax Local Deployment Guide states that 4-bit variants need at least 136GB–140GB of memory to run safely. That rules out a 128GB machine entirely unless you accept the quality loss of aggressive 3-bit quantization.

Competitors such as the Qwen 3.6 35B / 27B combinations or the DeepSeek V4 Flash variants provide explicit tiers for local deployment footprints.

Labels: enhancement (New feature or request)
