How MiniMax Sparse Attention Unlocks Million-Token Contexts for AI

Artimouse Prime, June 17, 2026

Attention is the heart of transformer models. But it has a big problem. Its cost grows with the square of the context length. That means doubling the input size quadruples the work. For very long inputs—like hundreds of thousands or millions of tokens—this becomes a huge bottleneck.

MiniMax Sparse Attention (MSA) offers a smart fix. Instead of letting every query look at every past token, it narrows the focus. It divides past tokens into blocks and picks only the most relevant blocks for each query. Then it runs full attention just on those blocks.

Block selection caps each query’s attention cost at a fixed budget of around 2,000 tokens, instead of letting it grow with the full context length. That’s a huge saving when your context runs into the millions.
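To make the budget concrete, here is a rough back-of-the-envelope sketch. It uses the block size (128 tokens) and block count (16) given later in this article, and it counts only the keys the main attention step touches. The function names are made up for illustration. Note that the Index Branch described below still scores every key, so this is not the full cost picture.

```python
BLOCK_SIZE = 128    # tokens per block, as stated later in the article
TOP_K_BLOCKS = 16   # blocks kept per query group, as stated later in the article


def dense_keys_per_query(context_len: int) -> int:
    """Keys seen by the last query under dense causal attention."""
    return context_len


def sparse_keys_per_query(context_len: int) -> int:
    """Keys seen under block selection: capped at the fixed budget."""
    return min(context_len, BLOCK_SIZE * TOP_K_BLOCKS)


def dense_total_pairs(context_len: int) -> int:
    """Query-key pairs for full attention over the whole input: n * n."""
    return context_len * context_len


for n in (8_192, 131_072, 1_000_000):
    # Columns: context length, dense keys per query, sparse keys per query.
    print(n, dense_keys_per_query(n), sparse_keys_per_query(n))
```

The budget works out to 16 × 128 = 2,048 tokens, which matches the "around 2,000" figure. Doubling the context doubles the dense per-query work and quadruples the dense total, while the sparse per-query figure stays flat.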

Two Branches for Smarter Attention

MSA works with two branches. The first is the Index Branch. It scores all key tokens quickly using a small, extra attention layer. Then it pools these scores into blocks of 128 tokens each. For each query group, it picks the top 16 blocks based on these scores.

The second is the Main Branch. It performs exact softmax attention but only on the selected blocks. This keeps attention precise but slashes the amount of work.
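Here is a minimal sketch of what the Main Branch does for a single query, assuming the block choices are already known. It is plain Python for readability, not how a real kernel would be written. The function name and argument layout are hypothetical, and causal masking inside the query's own block is left out to keep it short.

```python
import math


def attend_selected_blocks(query, keys, values, block_ids, block_size):
    """Exact softmax attention for one query over the chosen blocks only."""
    # Gather the token positions covered by the selected blocks.
    positions = [
        pos
        for b in sorted(block_ids)
        for pos in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    scale = 1.0 / math.sqrt(len(query))
    logits = [
        scale * sum(q * k for q, k in zip(query, keys[p])) for p in positions
    ]
    # Numerically stable softmax over the gathered logits.
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[p][d] for w, p in zip(weights, positions)) / total
        for d in range(dim)
    ]
```

If every block is selected, this reduces to ordinary dense attention for that query. The softmax itself is unchanged. Only the set of keys it runs over shrinks.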

Importantly, the block containing the query token itself is always included. This keeps the local context intact and avoids missing nearby important tokens.
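The selection step can be sketched like this. Two details are assumptions, because the article does not spell them out: the pooling here is max pooling, and the query's own block is counted inside the top-k budget rather than added on top of it. The function name is hypothetical.

```python
def select_blocks(token_scores, query_pos, block_size=128, top_k=16):
    """Pick block indices for one query from per-token index scores."""
    # Only tokens at or before the query are visible (causal).
    visible = token_scores[: query_pos + 1]
    n_blocks = (len(visible) + block_size - 1) // block_size
    # Pool token scores into one score per block (max pooling is an assumption).
    block_scores = [
        max(visible[b * block_size : (b + 1) * block_size])
        for b in range(n_blocks)
    ]
    local_block = query_pos // block_size
    # Rank the remaining blocks, keep the best top_k - 1, then force-include
    # the block that contains the query itself.
    others = sorted(
        (b for b in range(n_blocks) if b != local_block),
        key=lambda b: block_scores[b],
        reverse=True,
    )
    return sorted(others[: top_k - 1] + [local_block])
```

For a short context with fewer than `top_k` blocks, the function simply returns every visible block, so sparse and dense attention coincide there.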

Training a Non-Differentiable Selector

Choosing blocks with a top-k operation is not differentiable. That means you can’t directly train the Index Branch with regular backpropagation. MSA solves this by teaching the Index Branch to mimic the Main Branch’s attention distribution using a KL divergence loss.

Think of the Main Branch as the teacher and the Index Branch as the student. The student learns to predict which blocks the teacher would attend to. To keep training stable, gradients from this loss update only the Index Branch’s projection weights and don’t affect the main model.
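As a rough illustration of that teacher-student loss, the sketch below computes a KL divergence between two attention distributions over the same keys. The direction KL(teacher || student) and the toy logits are assumptions. In a real training setup the teacher distribution would be treated as a constant, for example with `detach()` in PyTorch or `jax.lax.stop_gradient` in JAX, so that the loss only moves the Index Branch.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) over a shared set of keys or blocks."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )


# Toy logits, made up for illustration.
teacher = softmax([2.0, 0.5, -1.0, 0.0])  # Main Branch: constant target
student = softmax([1.0, 1.0, -0.5, 0.2])  # Index Branch: what gets trained
loss = kl_divergence(teacher, student)
```

The loss is zero when the student matches the teacher exactly and positive otherwise, which is what pushes the Index Branch toward ranking blocks the way the Main Branch would.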

Training also starts with full attention on both branches for a warmup period. This helps the Index Branch learn good block scoring before the model switches to attending only to the selected blocks.

Custom GPU Kernels Make It Fast

Sparse attention often saves theoretical compute but struggles with real speed. Memory access and irregular computations can kill any gains. MiniMax tackled this with custom GPU kernels designed specifically for MSA.

One clever trick is skipping the softmax exponentiation during top-k selection. Since softmax preserves order, ranking raw scores works just as well. This saves time in the selection step.
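A quick way to convince yourself: the exponential is strictly increasing, and the normalization divides every entry by the same positive number, so the ranking cannot change. The small check below uses made-up scores and a hypothetical `top_k_indices` helper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(xs, k):
    """Indices of the k largest entries, best first."""
    return sorted(range(len(xs)), key=lambda i: xs[i], reverse=True)[:k]


raw = [0.3, 2.1, -1.0, 1.7, 0.9]  # made-up raw scores
# Ranking the raw scores gives the same winners as ranking the probabilities.
assert top_k_indices(raw, 2) == top_k_indices(softmax(raw), 2)
```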

The main attention kernel loops over key-value blocks instead of queries. It gathers queries touching each block and packs them into efficient matrix operations. This boosts arithmetic intensity and matches GPU hardware better.
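The bookkeeping behind that loop order can be sketched in a few lines. The idea is to invert the "query to blocks" mapping into a "block to queries" mapping, so each key-value block is loaded once and multiplied against all the queries that picked it. This is only the grouping logic under that reading of the article, with a hypothetical function name. The actual GPU kernel is far more involved.

```python
from collections import defaultdict


def group_queries_by_block(selected_blocks):
    """Invert query -> blocks into block -> queries."""
    block_to_queries = defaultdict(list)
    for query_id, blocks in enumerate(selected_blocks):
        for b in blocks:
            block_to_queries[b].append(query_id)
    return dict(block_to_queries)


# Made-up selections for four queries.
selected = [[0], [0, 1], [0, 2], [1, 2]]
for block_id, query_ids in sorted(group_queries_by_block(selected).items()):
    # A real kernel would load this block's keys and values once here and
    # run one matrix multiply against the packed batch of queries.
    print(block_id, query_ids)
```

Blocks that many queries share, such as early or recent context, end up with large query batches, which is where the denser matrix operations come from.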

They also split computation into two phases to handle partial attention results and combine them safely. The full system runs 14 times faster when pre-filling a 1 million token context and nearly 8 times faster when decoding tokens one by one on NVIDIA H800 GPUs.
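The article does not say exactly how the partial results are combined. A common approach in attention kernels is to keep a running maximum and a running sum of exponentials for each partial result, then rescale when merging. The sketch below shows that standard trick as an assumption about what "combine them safely" means here, with hypothetical function names.

```python
import math


def partial_attention(logits, values):
    """Return (max_logit, sum_of_exp, unnormalized_output) for one chunk."""
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, sum(weights), out


def merge_partials(a, b):
    """Combine two partials as if softmax had run over both chunks at once."""
    m_a, l_a, o_a = a
    m_b, l_b, o_b = b
    m = max(m_a, m_b)
    # Rescale each side to the shared maximum before adding.
    s_a, s_b = math.exp(m_a - m), math.exp(m_b - m)
    merged_out = [x * s_a + y * s_b for x, y in zip(o_a, o_b)]
    return m, l_a * s_a + l_b * s_b, merged_out


def finalize(partial):
    """Divide by the accumulated sum to get the final attention output."""
    _, total, out = partial
    return [x / total for x in out]
```

Merging two chunks this way gives the same answer as running softmax over all keys in one pass, up to floating-point rounding, so the work for one query can be split across blocks and stitched together afterwards.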

Strong Quality and Practical Deployment

MSA was tested inside a 109-billion-parameter Mixture-of-Experts model trained on 3 trillion tokens of multimodal data. The model kept performance close to full attention across a wide range of tasks, including reasoning, math, code generation, and image and video understanding.

Interestingly, training from scratch with sparse attention sometimes improved performance on certain tasks. This suggests the sparse pattern can help the model learn more efficient representations.

MSA also supports converting existing dense models. You can take a trained full-attention model and continue training with sparse attention. This makes it easier for labs to adopt MSA without starting over.

The MiniMax team released an open-weight 1M-token context model called MiniMax-M3 that uses this attention. It scored 59% on a tough software engineering benchmark, showing it can handle real-world coding tasks at scale.

Why This Matters for Future AI

Long-context capability is a major frontier for large language models. Many applications need the model to process huge documents, long conversations, or persistent memory across sessions.

Dense attention hits a wall as context grows. MSA shows it’s possible to stay close to full softmax attention quality while cutting compute and memory drastically. It’s a practical design that works with existing transformer architectures.

Beyond just theory, MSA’s hardware-aware kernels prove that sparsity can translate to real speedups. This is critical for deploying ultra-long context models in production.

MiniMax Sparse Attention offers a clear path forward. It balances simplicity, quality, and efficiency. And it makes million-token contexts not just a dream but a reality you can run on modern GPUs today.
