MiniMax M3 is an open-weight LLM launched on June 1, 2026 by Shanghai Hixi Technology. It combines frontier-level coding, a 1-million-token context window, and native multimodality in a single system. It uses a proprietary Sparse Attention (MSA) architecture that cuts per-token compute at 1M context to one-twentieth of the prior generation.

How much does MiniMax M3 cost?

Standard pricing is $0.60 per million input tokens and $2.40 per million output tokens. On OpenRouter, M3 launched with a 50% promotional discount at roughly $0.30 input and $1.20 output per million tokens. Subscription plans start at $20/month for the Plus plan with roughly 1.7 billion tokens.

How does MiniMax M3 compare to Claude Opus and GPT-5.5?

On SWE-Bench Pro, M3 scores 59.0% behind Claude Opus 4.7 (64.3%) but ahead of Gemini 3.1 Pro (54.2%). M3 leads on SVG-Bench (63.7%) and BrowseComp (83.5%). It costs roughly 5% of what Claude Opus charges per task, making it significantly cheaper for long-horizon agentic workflows.

Can I self-host MiniMax M3?

MiniMax describes M3 as open-weight, meaning trained model parameters are available for download and local deployment. Weights and a technical report were expected on Hugging Face around June 11, 2026. vLLM support for the MSA architecture is on MiniMax's roadmap.

MiniMax M3 Review: The First Open-Weight Model to Do Frontier Coding, 1M Context, and Multimodality All at Once

A Shanghai lab quietly dropped the model that closed-source incumbents didn't want to exist.

MiniMax M3 launched on June 1, 2026 — and the developer community has been arguing about it ever since. Not because it's hype. Because it might actually be real.

The pitch: an open-weight model combining frontier-level coding, a 1-million-token context window, and native multimodality — all in one system. Before M3, you could pick one of those things, maybe two. Getting all three meant paying for Claude Opus, GPT-5.5, or Gemini 3.1 Pro. The tradeoffs were baked in. Not anymore.

Here's a full breakdown of what M3 is, how its new attention architecture makes 1M-token context economically viable, what the benchmarks actually say (including the parts worth being skeptical about), and whether it belongs in your stack right now.

What Is MiniMax M3?

MiniMax (formally Shanghai Hixi Technology) is the lab behind the M-series. M2 and M2.7 already had a following among developers who needed something cheaper than the Western frontier models but more capable than Llama-tier options. M3 is their most ambitious release.

Built on MiniMax's proprietary Sparse Attention (MSA) architecture, the M3 API supports up to 1 million tokens of context, with a guaranteed minimum of 512K tokens available. This makes it suitable for long-range agentic AI system tasks, extended coding sessions, and long-form video understanding. The model was pre-trained on over 100 trillion interleaved tokens across text, images, and video — not a text model with vision bolted on later. That distinction matters more than most people realize.

Model weights and a technical report will be published on Hugging Face and GitHub within about ten days, MiniMax says. MiniMax has also updated its in-house agent app, MiniMax Code, which is also set to go open-source. If you're reading this after June 11, check the MiniMax Hugging Face org for the actual files.

The Architecture That Actually Makes 1M Context Work: MSA

This is where most coverage gets lazy. Let's not do that.

Getting a model to have a million-token context window is one thing. Making it fast and cheap enough to actually use at that scale is something completely different. Full attention scales quadratically with sequence length — at 1M tokens, you're not just slow, you're practically bankrupt per inference call.

The architectural centrepiece of M3 is a new sparse attention mechanism. Unlike DeepSeek's Multi-head Latent Attention, MSA works on uncompressed key-values, sidestepping precision-loss issues in long-context inference. The mechanism uses a KV-block selection approach where the model treats KV blocks as the outer loop and queries only the relevant ones — surgical precision instead of scanning an entire library for every answer.

The MSA architecture cuts per-token compute at 1M context to one-twentieth of the prior generation, with more than 9× faster prefill and more than 15× faster decoding.

Early API testing backs up the speed claims on retrieval tasks. Community members on r/LLMDevs have been saying MSA is the real story here — not the SWE-Bench screenshot. Hard to disagree. p95 latency on messy, real-world codebases is still an open question, but the prefill numbers alone change what's possible at scale.

Benchmarks: Read the Numbers AND the Footnotes

Here's where it gets nuanced.

On SWE-Bench Pro — a demanding real-world software engineering test — M3 scores 59.0%, behind Claude Opus 4.7 (64.3%) and GPT-5.5 (58.6%), but ahead of Gemini 3.1 Pro (54.2%). M3 leads on SVG-Bench (63.7%) ahead of Opus 4.7 (62.3%) and Gemini (59.2%), and is ahead of Opus 4.7 on BrowseComp at 83.5%.

The demos are legitimately impressive. M3 was tested autonomously reproducing an ICLR 2025 Outstanding Paper, running for nearly 12 hours, producing 18 commits and 23 experimental figures without human intervention. That is a serious demonstration of long-horizon execution, not just benchmark point-scoring.

It also optimized a CUDA kernel from 7.6% to 71.3% hardware utilization over 24 hours with zero human intervention. That's not theater.

Now the honest part: several results were run on MiniMax's own infrastructure with agent scaffolding, so independent verification is still pending. Some early community evals on YouTube (one reviewer scored 15/20 across four coding projects) show a genuine leap over M2.7, but not consistently beating Claude or GPT-5.5 in actual coding sessions. On DeepSWE, some users report pass@1 around 20% — well below the SWE-Bench Pro headline.

There's also Reddit feedback worth taking seriously: reports of hallucinations and mid-task abandonment on complex flows, occasional instruction-following regressions versus M2.7, and a model that performs worse on abstract reasoning than the frontier closed models. One thread contributor put it bluntly: "sacrificed quality for benches."

None of that makes M3 bad. It makes it a model you should test on your actual workload before you commit to it.

What It Costs (And Why That's the Actual Story)

Standard pricing is $0.60 per million input tokens and $2.40 per million output tokens. On OpenRouter, M3 launched with a temporary 50% promotional discount bringing it to roughly $0.30 per million input tokens and $1.20 output per million tokens.

M3 costs roughly 5% of what Claude Opus charges per task. In long-horizon agentic workflows — where a single job might burn hundreds of thousands of tokens across dozens of tool calls — that difference isn't marginal. It's the difference between a viable product and one that empties your API budget in a day.

Pricing starts at just $20 per month under new subscription token plans. The Plus plan at $20/month gives you roughly 1.7 billion tokens. Max at $50, Ultra at $120. The quota-to-price ratio on those plans is competitive with anything currently on the market. If you're comparing pricing across the broader LLM landscape, our best AI agent coding token plans 2026 roundup breaks down how these tiers stack up against the competition.

Once the weights drop, the economics get more interesting still. Self-hosting means no per-token costs for teams with the infrastructure. vLLM support for MSA is on MiniMax's roadmap — not there yet, but coming.

Actually Native Multimodality

Most multimodal models are text models with a vision adapter grafted on. You can feel it in the outputs — the model treats images like noisy text rather than genuinely integrating visual context.

M3 was pre-trained on 100+ trillion interleaved tokens from day one, including images and video. The model achieves strong performance in coding and agent benchmarks, with autonomous task decomposition, tool invocation, and multi-step reasoning capabilities. On OmniDocBench, M3 outperforms Gemini 3.1 Pro. On VideoMMMU and Video-MME, it processes video at 1 FPS with up to 1,024 frames — roughly 17 minutes of footage per inference call.

The pairing of native video understanding with a 1M-token context window opens up workflows that genuinely didn't exist at this price point before: ingesting entire documentation libraries, analyzing large codebases with mixed text and diagrams, processing recorded meetings alongside their transcripts. That's not a feature checklist. That's a different category of task.

What You Can Try Right Now

The quickest path to M3: OpenRouter or the MiniMax API directly. For a full coding agent experience, MiniMax Code (desktop agent at agent.minimaxi.com/download) is live. OpenCode Zen also has M3 integration for terminal-based workflows.

A few things worth knowing before you start:

Use explicit long-context prompts. The model performs better when you're deliberate about scope. Vague prompts at 500K+ tokens don't end well.

Start with codebase QA or document summarization. This is where the 1M context plus native multimodal combo creates genuine competitive advantage. Complex multi-file refactors are trickier — monitor retry rates.

Toggle thinking mode for agents vs. fast chat. The API supports a reasoning-optimized mode for multi-step flows. For simple queries, skip it — some users find the model verbose when it's on.

Build checkpoints into agent loops. Community reports flag mid-task abandonment on complex flows. M3 sometimes loses the thread after many tool calls. Design your pipelines accordingly. If you're building on top of an AI SaaS boilerplate, this is especially important for your agent architecture.

The Open-Weight Fine Print Worth Reading

MiniMax describes M3 as an open-weight model, but the definition matters here. Open weight means the trained model parameters are made available for download and local deployment. Open source, in the stricter sense, means the training data, training code, and license terms also permit unrestricted commercial use. MiniMax has used a modified-MIT license for prior models, which is closer to open weight than to fully open source.

Read the license before you build a product on top of it.

Enterprise teams with data sensitivity should weigh jurisdiction carefully. MiniMax is a Chinese company subject to China's 2017 National Intelligence Law, which requires organizations to cooperate with state intelligence work when asked. That doesn't mean the model is compromised — but it's a real factor for teams deploying with sensitive data, and honest reviewers shouldn't skip past it. For a wider look at compliance considerations, our SaaS compliance checklist 2026 covers the regulatory landscape you should be aware of.

M3 vs. Claude Opus / GPT-5.5: The Actual Breakdown

Where M3 wins:

Price — not close, it's 10–20× cheaper per token
Context window affordability — 1M tokens you can actually afford to fill and use
BrowseComp and SVG-Bench benchmarks
Open-weight access for self-hosting

Where M3 trails:

SWE-Bench Pro vs. Claude Opus 4.7 (59.0% vs. 64.3%)
Abstract reasoning and complex multi-step instruction following
Latency consistency at scale — some reports put it slower than expected
Production maturity — Claude and GPT-5.5 have years of real-world hardening

The honest summary: for long-horizon agentic workflows where token cost compounds and you can supervise the agent loop, M3 is genuinely competitive. For production systems where reliability is non-negotiable, test it against your specific workload before switching. Benchmarks shaped like benchmarks aren't the same as your codebase.

Why M3 Actually Changes Things

Zoom out for a second.

Before M3, the combination of frontier-level coding, 1M-token context, and native multimodality was a closed-source exclusive. You paid for it per token, you didn't own the weights, and you definitely didn't self-host it. M3 represents a system that pairs a genuinely novel sparse attention architecture with frontier-adjacent benchmark scores at a price point well below Western closed-source competitors. The cost gap is real.

If the MSA architecture holds up under independent scrutiny once the weights ship — and the early evidence suggests it will — then M3 is the most practically significant open-weight release since DeepSeek R1 changed what people thought was possible on reasoning.

The benchmarks are vendor-run. The weights aren't public yet. And yes, some users report regressions versus M2.7 on instruction following. All of that is worth tracking.

But the model is live, the API is cheap, and the architecture claims are testable. That's a better position than most "game-changing" model launches end up in.

Start Testing M3 Today

API: platform.minimaxi.com — standard pricing at $0.60/$2.40 per million tokens; promotional rate on OpenRouter
Desktop agent: agent.minimaxi.com/download — full MiniMax Code experience
Terminal: OpenCode Zen — M3 harness included
Open weights: Check MiniMax's Hugging Face org around June 11 for weights and technical report

When you test, push it where the architecture makes real promises: needle-in-haystack at 500K+ tokens, multi-file refactors across 10+ files, video-plus-code workflows. That's where MSA either earns the headline or doesn't.

Drop your results in the comments. The community evals will matter more than vendor benchmarks here — and the next few weeks will tell us whether M3 is a landmark or a well-marketed preview.

MiniMax M3 is available now via API at platform.minimaxi.com and OpenRouter. Open weights and the full technical report are expected on Hugging Face around June 11, 2026.

Building something with open-weight models? SaaSCity is a free, instant-index directory where every listing becomes a 3D building on a live city map — a permanent indexed page, a DR 40+ backlink, and discoverability that compounds. Drop your project in, claim a plot, and let the next iteration of your work have a real home on the open web. The complete guide to SaaS directory submissions walks through how to stack it with the other 25+ directories worth submitting to for compounding SEO lift.

MiniMax M3 Review: The First Open-Weight Model to Do Frontier Coding, 1M Context, and Multimodality All at Once

What Is MiniMax M3?

The Architecture That Actually Makes 1M Context Work: MSA

Benchmarks: Read the Numbers AND the Footnotes

What It Costs (And Why That's the Actual Story)

Actually Native Multimodality

What You Can Try Right Now

The Open-Weight Fine Print Worth Reading

M3 vs. Claude Opus / GPT-5.5: The Actual Breakdown

Why M3 Actually Changes Things

Start Testing M3 Today

Get your SaaS in front of founders

Founder resources

Related articles

Kimi K3 vs Qwen3.8-Max: China Shipped Two Trillion-Parameter Open Models in One Week

Kimi-K2.7-Code Drops: Moonshot AI's Strongest Open-Source Coding Model Yet (+21.8% on Kimi Code Bench v2)

The "Claw" Era Is Here: Six Agentic AI Systems Quietly Reshaping How Work Gets Done