How to Boost GPU Efficiency Without Buying New Hardware
Late last year, a major retailer faced a big challenge. They were running a large AI model for search and recommendations, and during peak times, their GPU costs skyrocketed. They had already doubled their GPU count but still saw latency spikes. That’s when they called in an expert to figure out what was really happening behind the scenes.
Profiling Revealed Hidden Workload Patterns
The first step was to analyze how the GPUs were actually being used. The engineer instrumented the serving system and broke GPU utilization down by inference phase. The result was eye-opening. During prompt processing, where the model reads the user’s input, the GPUs ran at 92% utilization. The tensor cores were close to saturated, which is what you want from a high-end GPU costing around $30,000.
However, this high utilization lasted only about 200 milliseconds per request. The next phase, token generation, lasted several seconds and looked very different. During this time, compute utilization dropped to just 30%. Most of each step was spent waiting for data to arrive from memory, not doing calculations. In other words, the GPUs’ compute units were underused during the longer phase, which is where most of the GPU time, and therefore most of the money, was going.
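A per-phase breakdown like the one described above can be sketched in a few lines. This is a minimal illustration, not the engineer’s actual tooling: the sample format, the `utilization_by_phase` helper, and the 3-second token generation duration are assumptions chosen to mirror the figures in the text.

```python
from collections import defaultdict

# Hypothetical phase-tagged samples: (phase, duration in seconds, compute utilization 0..1).
# The values echo the figures quoted in the article; they are not real measurements.
# "Several seconds" of token generation is assumed to be 3.0 s here.
samples = [
    ("prompt", 0.2, 0.92),
    ("generation", 3.0, 0.30),
    ("prompt", 0.2, 0.92),
    ("generation", 3.0, 0.30),
]


def utilization_by_phase(samples):
    """Return the time-weighted mean utilization for each phase."""
    busy_seconds = defaultdict(float)   # sum of utilization * duration
    total_seconds = defaultdict(float)  # sum of duration
    for phase, seconds, utilization in samples:
        busy_seconds[phase] += utilization * seconds
        total_seconds[phase] += seconds
    return {phase: busy_seconds[phase] / total_seconds[phase] for phase in total_seconds}


per_phase = utilization_by_phase(samples)
for phase, utilization in per_phase.items():
    print(f"{phase}: {utilization:.0%}")
```

Tagging each sample with its phase at collection time is the important design choice: once samples are averaged together, the two phases can no longer be separated.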
The Bimodal Nature of Large Language Model Inference
This discovery reframed the problem. Large language models (LLMs) perform two very different tasks in one process. The first is prompt processing (often called prefill), which runs large matrix multiplications over the whole prompt at once and fully loads the GPU’s compute units. The second is token generation (often called decoding), which produces one token at a time, is mainly memory-bound, and needs far less compute per step. The two phases run one after the other on the same hardware, under the same scheduler.
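One way to see why the two phases behave so differently is to compare how much arithmetic each performs per byte of data moved. The sketch below is a simplified roofline-style estimate for a single square weight matrix. The accelerator figures (`PEAK_FLOPS`, `PEAK_BANDWIDTH`), the model width, and the helper names are illustrative assumptions, not specifications of any particular GPU or model.

```python
# Assumed accelerator limits, for illustration only.
PEAK_FLOPS = 1000e12      # floating-point operations per second
PEAK_BANDWIDTH = 3e12     # bytes per second of memory bandwidth

# Above this many FLOPs per byte, compute is the limit; below it, memory is.
RIDGE_POINT = PEAK_FLOPS / PEAK_BANDWIDTH


def matmul_intensity(tokens, d_model, bytes_per_value=2):
    """FLOPs per byte for multiplying `tokens` activation rows by a
    d_model x d_model weight matrix stored in 16-bit precision."""
    flops = 2 * tokens * d_model * d_model                      # one multiply and one add per weight per token
    weight_bytes = d_model * d_model * bytes_per_value          # weights are read once
    activation_bytes = 2 * tokens * d_model * bytes_per_value   # read inputs, write outputs
    return flops / (weight_bytes + activation_bytes)


def bound(tokens, d_model=8192):
    """Classify the matmul as compute-bound or memory-bound under the assumed limits."""
    if matmul_intensity(tokens, d_model) >= RIDGE_POINT:
        return "compute-bound"
    return "memory-bound"


print("prompt of 2048 tokens:", bound(2048))
print("one generated token:  ", bound(1))
```

With a long prompt, every weight read from memory is reused across many tokens, so the arithmetic dominates. With a single generated token, each weight is read and used once, so memory bandwidth dominates. Batching several requests during token generation raises the effective token count and moves the workload toward the compute-bound side.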
This pattern is unusual. Usually, if a workload has two phases with different resource needs, the phases are split across different servers or scaled independently. In LLM inference, both phases run on the same GPU. Most monitoring tools report a single “average” utilization number, which hides this bimodal pattern and makes the GPU look only partly busy. For example, a GPU might show 55% utilization overall, but that figure is a time-weighted blend of 92% during prompt processing and 30% during token generation, and it shifts with the mix of requests being served.
This averaging can be misleading. It suggests the GPU is only half busy, when in reality its compute units are nearly saturated during the brief prompt phase and mostly stalled on memory during the longer token generation phase. Recognizing this pattern helps teams decide how to allocate and manage resources, and it can save a lot of money.
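The arithmetic behind that single number is a time-weighted average, and it can be worked through directly. The following is a minimal sketch using the utilization figures from the text. The 3-second token generation duration and the function names are assumptions made for illustration.

```python
def blended_utilization(prompt_util, generation_util, prompt_time_fraction):
    """Time-weighted average of the two phase utilizations."""
    return (prompt_util * prompt_time_fraction
            + generation_util * (1 - prompt_time_fraction))


def prompt_fraction_for(blend, prompt_util, generation_util):
    """Fraction of GPU time that must be spent in the prompt phase
    to produce a given blended utilization."""
    return (blend - generation_util) / (prompt_util - generation_util)


PROMPT_UTIL = 0.92       # from the profiling described in the article
GENERATION_UTIL = 0.30   # from the profiling described in the article

# A single request in isolation: 0.2 s of prompt processing, then an
# assumed 3.0 s of token generation.
single_request_fraction = 0.2 / (0.2 + 3.0)
single_request_blend = blended_utilization(
    PROMPT_UTIL, GENERATION_UTIL, single_request_fraction
)

# Share of GPU time in the prompt phase implied by a 55% dashboard reading.
needed_fraction = prompt_fraction_for(0.55, PROMPT_UTIL, GENERATION_UTIL)

print(f"single request blend: {single_request_blend:.1%}")
print(f"prompt-time share behind a 55% reading: {needed_fraction:.1%}")
```

Under these assumed durations, a single request in isolation averages about 34%, while a 55% reading implies that roughly 40% of GPU time is spent in prompt processing. The same two phase figures can therefore produce very different dashboard numbers depending on the traffic mix, which is why the average alone says little about either phase.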
Implications for Cost and Performance Optimization
Understanding this bimodal workload pattern opens new ways to improve efficiency. Instead of provisioning every GPU for peak compute demand all the time, teams can treat the phases separately. For example, they could assign some GPUs to the prompt phase, which demands high compute, and run token generation on other GPUs or configurations tuned for memory-bound work.
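The idea of handling the two phases separately can be illustrated with a toy router that keeps one queue per phase. This is a sketch under strong simplifying assumptions, not a description of any real serving system: `Request`, `PhaseRouter`, and their fields are hypothetical names, no model is executed, and the transfer of intermediate state between phases is left out entirely.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    prompt_tokens: int
    max_new_tokens: int = 4
    generated: int = 0
    phase: str = "prompt"


class PhaseRouter:
    """Toy scheduler with separate queues for prompt processing and token generation."""

    def __init__(self):
        self.prompt_queue = deque()
        self.generation_queue = deque()

    def submit(self, request):
        self.prompt_queue.append(request)

    def run_prompt_step(self):
        """Process one prompt (compute-heavy), then hand the request to the generation queue."""
        if not self.prompt_queue:
            return None
        request = self.prompt_queue.popleft()
        request.phase = "generation"
        self.generation_queue.append(request)
        return request

    def run_generation_step(self):
        """Generate one token for every queued request; return the ones that finished."""
        finished = []
        for _ in range(len(self.generation_queue)):
            request = self.generation_queue.popleft()
            request.generated += 1
            if request.generated >= request.max_new_tokens:
                request.phase = "done"
                finished.append(request)
            else:
                self.generation_queue.append(request)
        return finished


router = PhaseRouter()
router.submit(Request("req-1", prompt_tokens=512, max_new_tokens=2))
router.run_prompt_step()
router.run_generation_step()
completed = router.run_generation_step()
print([request.request_id for request in completed])
```

Keeping the queues separate means each one can be drained by a different worker pool, batch size, or hardware configuration, which is the separation the paragraph above describes.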
Monitoring needs to reflect these differences. A single average utilization figure hides the true resource needs of each phase and can lead to overprovisioning. Per-phase profiling shows where compute is the limit and where memory is, so capacity can be tuned to the actual workload and GPU costs reduced without sacrificing performance.
In fact, recent research from UC San Diego’s Hao AI Lab confirmed these findings. Their measurements on H100 GPUs showed the same bimodal pattern: high utilization during prompt processing and lower during decoding. Recognizing these patterns is key to smarter infrastructure design, saving money, and improving overall efficiency.
By thinking of LLM inference as two distinct workloads happening on the same hardware, organizations can better match their infrastructure to their real needs. This approach can lead to smarter scaling, cost savings, and faster response times, all without purchasing a single new GPU.














