Maximizing speed: How continuous batching unlocks unprecedented LLM throughput
In "Unlocking LLM superpowers: How PagedAttention helps the memory maze," I discussed how PagedAttention cracked the code on LLM memory chaos. Think of PagedAttention as the ultimate warehouse manager for your GPU's memory, organizing information so efficiently that it eliminates wasted space and sets the stage for something bigger: continuous batching, also known as unified batch scheduling.
But here’s the thing: A perfectly organized warehouse is only as fast as the logistics system that moves the goods. You can have all the space in the world, but if your trucks are stuck in traffic, nothing gets delivered on time.
That’s where our next game-changer comes in: continuous batching. If PagedAttention gave us the capacity, continuous batching is the nitro boost that sends performance into the stratosphere. It’s the engine that takes LLM serving from pretty fast to absolutely insane speed.
Why old-school batching just doesn’t cut it
To handle multiple users at once, LLM systems bundle requests together. It’s a classic move. The problem? The traditional ways of doing it fall apart against the unpredictable, free-flowing nature of language. Imagine you’re at a coffee shop with a group of friends. The barista says, “I’ll make all your drinks at once, but I can’t hand any out until the last one, a complicated, 10-step caramel macchiato, is finished.” You ordered a simple espresso? Tough luck. You’re waiting.
This is the fundamental flaw of traditional batching, known as head-of-line blocking. The entire batch is held hostage by its slowest member. Other critical issues include:
- Wasted power: If a request finishes early (say, by hitting a stop token), it can’t just leave the batch. The GPU sits there, twiddling its transistors, waiting for everyone else to finish.
- Inflexible workflow: New requests have to wait for the entire current batch to clear before they can even get started, leading to frustrating delays.
The result? Your expensive, powerful hardware is spending more time waiting than working.
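To make that head-of-line blocking concrete, here’s a minimal sketch of a static batcher in Python. The request objects and the model.decode_step call are illustrative stand-ins for this article, not any real serving API:

```python
# Illustrative sketch of static (request-level) batching.
# `model.decode_step(batch)` and the `finished` flag are hypothetical stand-ins.

def serve_static(model, queue, batch_size):
    """Fixed batches: nobody leaves early, and newcomers wait for the whole batch."""
    while queue:
        batch = [queue.pop(0) for _ in range(min(batch_size, len(queue)))]
        # The batch runs until its *slowest* member finishes.
        while any(not r.finished for r in batch):
            model.decode_step(batch)  # finished requests still occupy GPU slots
        # Only now can the next group of waiting requests start.
```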
Enter continuous batching: The nonstop conveyor belt
So, how do we fix this? We throw out the wait-for-everyone model and replace it with something far more dynamic. Continuous batching, or iteration-level scheduling, is like swapping that slow coffee shop for a high-speed sushi conveyor belt. Instead of processing a fixed group of requests from start to finish, the system works one token at a time across all active requests. After each tiny step, it takes a microsecond to reassess the situation.
Here’s the magic in action (sketched in code after this list):
- Work is scheduled in micro-steps: The GPU processes a single decoding step for all active sequences, then immediately checks the queue.
- On-the-fly swaps: The moment a request is done generating, it exits the batch, freeing up its spot. That spot is instantly filled by the next waiting request.
- Constant, maxed-out utilization: The GPU never stops. There’s no more idle time. It’s a continuous, flowing river of computation.
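For contrast, here’s the same toy setup running the iteration-level loop described above. Again, this is only a sketch using the same hypothetical stand-ins, not any particular framework’s scheduler:

```python
# Illustrative sketch of continuous (iteration-level) batching.
# Same hypothetical stand-ins as the static sketch above.

def serve_continuous(model, queue, max_num_seqs):
    """One decode step at a time; the batch is rebuilt after every step."""
    active = []
    while queue or active:
        # Top up the batch from the queue after *every* step, not every batch.
        while queue and len(active) < max_num_seqs:
            active.append(queue.pop(0))
        model.decode_step(active)  # one new token for every active sequence
        # Finished sequences exit immediately, freeing their slot on the spot.
        active = [r for r in active if not r.finished]
```

The only structural change from the static version is that batch membership is reassessed inside the per-token loop, which is exactly what lets a finished request hand its slot to a newcomer mid-flight.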
This transforms wasted cycles into pure, unadulterated throughput. In real-world terms, this isn’t a minor improvement — it’s a paradigm shift, potentially boosting performance by up to 20 times compared to the old way of doing things.
The dream team: PagedAttention meets continuous batching
Now, you might be wondering, “This sounds chaotic! How do you manage memory when requests are constantly jumping on and off the conveyor belt?” This is where the beautiful synergy comes in. Continuous batching’s dynamic, ever-changing workload demands a memory manager that can keep up. PagedAttention isn’t just compatible; it’s essential.
Flexibility on demand
PagedAttention’s block-based memory system is perfect for this chaos. It can instantly allocate and free small blocks of memory as requests enter and exit, without the nightmare of fragmentation.
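As a rough mental model of why that works (a toy sketch, not vLLM’s actual block manager), the allocator only ever hands out fixed-size blocks and takes them back the instant a request leaves the belt:

```python
# Toy sketch of block-based KV-cache bookkeeping. Real systems track GPU memory,
# reference counts and more; the allocate/free pattern is the point here.

class BlockManager:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of fixed-size blocks
        self.block_tables = {}                      # request id -> list of block ids

    def append_token(self, request_id, seq_len):
        """Grab a fresh block only when existing blocks can no longer hold the sequence."""
        table = self.block_tables.setdefault(request_id, [])
        if seq_len > len(table) * self.block_size:
            table.append(self.free_blocks.pop())    # no contiguous region required

    def release(self, request_id):
        """The moment a request exits the batch, its blocks go back to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
```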
Packing a bigger punch
By using memory so efficiently, PagedAttention allows us to fit more active sequences into the same GPU memory. This means our continuous batching conveyor belt can be longer and wider, handling more customers simultaneously.
Together, they are an unstoppable duo. Systems leveraging both techniques have been shown to achieve 2x to 4x higher throughput than other top-tier serving frameworks. It’s a one-two punch that completely redefines what’s possible.
Fine-tuning your speed engine
The best part is that you’re not locked into a single mode. You can tweak the knobs to match your specific needs (sample profiles follow this list):
- For raw throughput: Crank up the max_num_seqs. This packs more sequences into the batch at once, pushing maximum tokens per second, though it might introduce a bit more latency jitter. Perfect for bulk processing jobs.
- For snappy, interactive chat: Set the max_wait_ms (batch wait time) to nearly zero. This prioritizes getting a quick response to a single user over waiting to group them together, ensuring low latency for chat applications.
- Balancing act: The block size for memory is a trade-off. A smaller size reduces waste but adds a tiny bit of overhead. A moderate value often hits the sweet spot.
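To put rough numbers on those trade-offs, here are two hedged configuration profiles. The names mirror the knobs above; max_num_seqs and block_size exist as vLLM engine parameters, while the batch-wait setting and the specific values shown are illustrative and vary by framework:

```python
# Illustrative tuning profiles; exact parameter names and defaults vary by framework.

THROUGHPUT_PROFILE = {
    "max_num_seqs": 512,  # pack the belt wide: more tokens/sec, more latency jitter
    "max_wait_ms": 10,    # tolerate a short wait so batches fill up (bulk jobs)
    "block_size": 16,     # moderate block size: low waste, low bookkeeping overhead
}

INTERACTIVE_PROFILE = {
    "max_num_seqs": 64,   # smaller batches keep each decode step snappy
    "max_wait_ms": 0,     # never hold a request back just to group it with others
    "block_size": 16,
}
```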
By mastering this token-level scheduling, continuous batching ensures you’re getting every last drop of performance from your hardware. It’s the critical piece that makes deploying powerful LLMs in production not just feasible, but incredibly efficient.
The race for faster AI isn’t just about building bigger models; it’s about building smarter engines to run them. And this is one of the smartest engines out there.
In short:
- PagedAttention — for efficient memory management
- Continuous batch scheduling — for efficient request handling and dynamic batching
This article is published as part of the Foundry Expert Contributor Network.
Original link: https://www.infoworld.com/article/4078810/maximizing-speed-how-continuous-batching-unlocks-unprecedented-llm-throughput.html
Originally posted: Mon, 27 Oct 2025 09:45:00 +0000