Unlocking LLM superpowers: How PagedAttention helps the memory maze
Large language models (LLMs) like GPT and PaLM are transforming how we work and interact, powering everything from programming assistants to universal chatbots. But here's the catch: running these incredibly powerful models, especially as hosted services, is very expensive, often costing 10x more than a traditional keyword search. A huge part of this cost comes down to inefficient memory management when serving LLMs.
The hidden memory hog: The KV cache
At the heart of LLMs is the Transformer model, which generates text one token (roughly a word) at a time. To do this efficiently, the model needs to remember the "context" from previous tokens. This memory is stored in something called the Key-Value (KV) cache. Think of it as the LLM's short-term memory for a conversation.
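To get a sense of how big this short-term memory gets, here is a rough back-of-the-envelope sketch. The configuration below is illustrative, roughly matching a 13B-parameter model such as OPT-13B served in FP16; plug in your own model's numbers.

```python
# Rough estimate of KV cache size per token (illustrative 13B-class config).
num_layers = 40        # transformer layers
hidden_size = 5120     # 40 attention heads x 128 dims per head
bytes_per_value = 2    # FP16

# Each token stores one key vector and one value vector in every layer.
kv_bytes_per_token = 2 * num_layers * hidden_size * bytes_per_value
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KB")  # ~800 KB

# A request pre-allocated for a 2048-token sequence therefore reserves:
max_seq_len = 2048
print(f"KV cache per request: {kv_bytes_per_token * max_seq_len / 2**30:.2f} GB")  # ~1.6 GB
```

Roughly 800 KB per token, or about 1.6 GB for a single 2048-token request, and that is before we even look at how this memory is laid out.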
The problem is that this KV cache is huge, and its size grows and shrinks dynamically over the lifetime of each request. Existing systems struggle with this because they usually store the KV cache in a single, contiguous block of memory. This approach leads to two major issues:
1. Memory fragmentation
Internal fragmentation
Systems pre-allocate a large chunk of memory for each request, assuming the maximum possible output length (e.g., 2048 tokens). However, if a request only generates a short output, much of that reserved memory goes unused, leading to significant waste.
External fragmentation
Because different requests reserve chunks of varying sizes, GPU memory becomes scattered with small, unusable gaps, making it hard to fit new requests even when enough total memory is free. The vLLM paper reports that in existing systems, only 20.4% – 38.2% of KV cache memory is actually used to store token states; the rest is waste.
2. No memory sharing
Advanced decoding techniques like parallel sampling or beam search often generate multiple outputs from a single prompt, meaning they could share parts of the KV cache. However, existing systems cannot easily share this memory because each sequence’s KV cache is in its own separate, contiguous block.
These inefficiencies severely limit how many requests can be processed simultaneously (the “batch size”), directly hurting the system’s throughput (how many tokens/requests it can handle per second).
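To see how this waste caps the batch size, here is a toy calculation with assumed (not measured) numbers: a 13B-class model whose weights leave roughly 12 GB of a 40 GB GPU for the KV cache, worst-case reservations of 2048 tokens per request, and an average request that actually generates far fewer tokens.

```python
# Toy illustration of how worst-case KV reservations cap the batch size.
# All numbers are assumptions for illustration, not measurements.
GB = 2**30
kv_budget = 12 * GB           # assumed GPU memory left for KV cache after model weights
kv_per_token = 800 * 1024     # ~800 KB/token for a 13B-class model (see earlier sketch)

max_seq_len = 2048            # worst case the system pre-allocates for
avg_seq_len = 256             # assumed average tokens a request actually uses

reserved_per_request = max_seq_len * kv_per_token   # ~1.6 GB reserved
used_per_request = avg_seq_len * kv_per_token       # ~0.2 GB actually used

print("concurrent requests:", kv_budget // reserved_per_request)                   # 7
print(f"KV memory actually used: {used_per_request / reserved_per_request:.0%}")   # ~12%
```

With contiguous pre-allocation, this hypothetical GPU serves only about seven requests at a time, while most of the reserved memory sits idle.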
The “Aha!” moment: PagedAttention inspired by operating systems
To solve these memory headaches, researchers developed PagedAttention. The brilliant idea behind PagedAttention is inspired by a classic technique from operating systems (OS): virtual memory and paging.
Here’s one analogy that I like to use:
- KV blocks are like pages. Instead of contiguous memory, PagedAttention divides the KV cache of each sequence into small, fixed-size KV blocks. Each block holds the keys and values for a set number of tokens.
- Tokens are like bytes. Individual tokens within the KV cache are like the bytes within a page.
- Requests are like processes. Each LLM request is managed like a process, with its “logical” KV blocks mapped to “physical” KV blocks in GPU memory.
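To make the analogy concrete, here is a minimal sketch (illustrative, not vLLM's actual code) of a block table that maps a sequence's logical KV blocks to physical blocks, assuming vLLM's default block size of 16 tokens.

```python
# Minimal sketch of the paging idea (illustrative, not vLLM's actual code).
# A request's logical KV blocks are mapped to physical blocks that can live
# anywhere in GPU memory, allocated only when a block actually fills up.

BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size)

class BlockAllocator:
    """Hands out fixed-size physical block IDs from a free pool."""
    def __init__(self, num_physical_blocks: int):
        self.free = list(range(num_physical_blocks))

    def allocate(self) -> int:
        return self.free.pop()      # any free block will do: no contiguity needed

    def release(self, block_id: int) -> None:
        self.free.append(block_id)

class Sequence:
    """One request: a block table maps logical block index -> physical block ID."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_physical_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):            # generate 40 tokens
    seq.append_token()
print(seq.block_table)         # 3 physical blocks, wherever they happen to be
```

Because any free physical block will do, a new request never needs one large contiguous region, which is exactly what kills external fragmentation.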
How does PagedAttention solve these problems?
Near-zero fragmentation
Since KV blocks are not required to be contiguous in physical memory, PagedAttention can dynamically allocate blocks on demand. This virtually eliminates internal fragmentation because memory is only allocated when needed, and external fragmentation is removed because all blocks are the same size.
Flexible memory sharing
PagedAttention enables sharing of KV blocks between different sequences, even across different requests. For example, in parallel sampling or beam search, multiple outputs can share the initial prompt’s KV cache, saving significant memory. It even uses a copy-on-write mechanism (another OS concept) for blocks that need to be modified by different sequences, ensuring efficient sharing without unnecessary duplication.
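Here is an equally simplified sketch of that copy-on-write idea, using reference counts on physical blocks; again, this is illustrative rather than vLLM's actual implementation.

```python
# Illustrative sketch of copy-on-write sharing of KV blocks (not vLLM's actual code).
# Sequences born from the same prompt start out pointing at the same physical
# blocks; a block is copied only when a sequence must write to a block that
# other sequences still reference.

class SharedBlockManager:
    def __init__(self, num_physical_blocks: int):
        self.free = list(range(num_physical_blocks))
        self.ref_count: dict[int, int] = {}

    def allocate(self) -> int:
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def fork(self, block_table: list[int]) -> list[int]:
        """Share the parent's blocks with a new sequence (e.g., parallel sampling)."""
        for block in block_table:
            self.ref_count[block] += 1
        return list(block_table)

    def write(self, block_table: list[int], logical_idx: int) -> None:
        """Copy-on-write: duplicate the block only if it is still shared."""
        block = block_table[logical_idx]
        if self.ref_count[block] > 1:
            self.ref_count[block] -= 1
            new_block = self.allocate()
            # (a real system would also copy the block's KV data on the GPU here)
            block_table[logical_idx] = new_block

manager = SharedBlockManager(num_physical_blocks=1024)
prompt_blocks = [manager.allocate() for _ in range(4)]   # prompt fills 4 blocks
sample_a = manager.fork(prompt_blocks)                   # two samples share the prompt
sample_b = manager.fork(prompt_blocks)
manager.write(sample_a, 3)                               # sample A appends a token,
print(sample_a[3] != sample_b[3])                        # so its last block is copied: True
```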
Introducing vLLM: The high-throughput engine
Built on top of PagedAttention, vLLM is an LLM serving system designed for high throughput. It uses block-level memory management and a sophisticated scheduler that works hand-in-hand with PagedAttention.
vLLM’s key benefits are:
- Near-zero waste in KV cache memory.
- Flexible sharing of KV cache within and across requests.
As a result, vLLM improves the throughput of popular LLMs by 2-4x compared to state-of-the-art systems like FasterTransformer and Orca, without increasing latency. The improvement is even more pronounced with longer sequences, larger models, and more complex decoding algorithms. For example, when serving a 13B-parameter LLM, vLLM can process 2.2x more requests concurrently than even an "oracle" version of Orca (which assumes perfect knowledge of output lengths) and 4.3x more than Orca (Max). It also delivers substantial memory savings for parallel sampling (6.1% – 9.8%) and beam search (37.6% – 55.2%).
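If you want to try this yourself, the sketch below uses vLLM's offline Python API (LLM and SamplingParams); the model name and sampling settings are just examples, and you will need a suitably sized GPU.

```python
# Minimal usage sketch of vLLM's offline Python API
# (requires a GPU and `pip install vllm`; the model name is only an example).
from vllm import LLM, SamplingParams

prompts = [
    "The key idea behind PagedAttention is",
    "Operating systems manage memory by",
]
# n=4 asks for four parallel samples per prompt; with PagedAttention they
# all share the prompt's KV blocks instead of duplicating them.
sampling_params = SamplingParams(n=4, temperature=0.8, top_p=0.95, max_tokens=128)

llm = LLM(model="facebook/opt-13b")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}")
    for sample in output.outputs:
        print(f"  -> {sample.text!r}")
```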
The future of LLM serving
My personal take: by intelligently borrowing a page from operating systems, PagedAttention and vLLM make LLM serving dramatically more efficient. That translates into lower costs for cloud providers and faster, more responsive LLM applications for the rest of us. It's a game-changer that addresses a critical bottleneck, enabling the next generation of LLM-powered services.
This article is published as part of the Foundry Expert Contributor Network.
Original Link: https://www.infoworld.com/article/4055048/unlocking-llm-superpowers-how-pagedattention-helps-the-memory-maze.html
Originally Posted: Thu, 11 Sep 2025 09:00:00 +0000