A vLLM server screams 97% utilization on nvidia-smi for a solid eight minutes. Simultaneously, token throughput craters. Both statements, absurdly, are true. And therein lies the digital snake oil. NVIDIA’s ubiquitous GPU utilization metric isn’t a measure of productive work. Nope. It’s a simple duty-cycle counter. It tells you when something was running on the GPU, not if that something was worth running.
We bumped into this absurdity while running an internal repro of a vLLM latency spike. The hardware? A TensorDock RTX 4090. The software? vLLM 0.18.0, with Qwen2.5-0.5B-Instruct chugging along. For eight minutes, the dashboard was a picture of health: nvidia-smi wavered between 92% and 99%, averaging a steady 97%. Fans hummed, memory was stable, power draw held at 320W. All systems go, right?
Wrong.
The culprit was an unassuming request: `n_completions=8` paired with `logprobs=20`. This beast expanded each decode step into eight separate sequences, each demanding a softmax over the full vocabulary, roughly 150,000 entries for this model, at every step. Each of these behemoths effectively held every other co-scheduled request hostage for 9-11 seconds. The GPU stayed busy, yes, but it was busy churning through user-invisible garbage. The throughput? Collapsed.
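For concreteness, here is a minimal offline sketch of that request shape using vLLM's Python API. Treat it as an approximation: the original repro ran through the serving frontend, and `n` is vLLM's name for the per-prompt completion count; the prompt is a placeholder.

```python
# Hedged sketch: reproduces the request *shape* (8 completions, top-20 logprobs
# per token), not the exact traffic mix from the incident described above.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

params = SamplingParams(
    n=8,          # eight completions per prompt -> eight decode streams per step
    logprobs=20,  # top-20 logprobs per generated token -> full-vocab work every step
    max_tokens=256,
)

outputs = llm.generate(["Explain paged attention in two sentences."], params)
for completion in outputs[0].outputs:
    print(completion.text[:80])
```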
This isn’t some fringe scenario. This is the predictable outcome when your only diagnostic tool is a glorified stopwatch.
NVIDIA’s own documentation helpfully defines GPU-Util as: “percent of time over the past sample period during which one or more kernels was executing on the GPU.” Duty cycle. That’s it. It offers zero insight into whether the kernel is efficient, whether it’s bottlenecked, or if it’s actively hindering other operations. It’s like bragging about how many hours you spent in the gym, without mentioning if you were lifting weights or just staring at the ceiling.
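If you want to read that same duty-cycle number programmatically, NVML's Python bindings (the nvidia-ml-py package, assumed installed) expose it directly. A minimal sketch, run while your workload is active:

```python
# Polls the same counter nvidia-smi prints. Note what it is: the fraction of the
# sample period in which *any* kernel was resident -- nothing about efficiency.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0; adjust if needed

for _ in range(5):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    # util.gpu    -> the "GPU-Util" duty cycle
    # util.memory -> % of time the memory controller was busy, also a duty cycle
    print(f"GPU-Util {util.gpu}%  Mem {util.memory}%")
    time.sleep(1)

pynvml.nvmlShutdown()
```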
DCGM, NVIDIA's more advanced toolkit, offers finer granularity with counters like `SM_ACTIVE` and `MEM_COPY_UTIL`. These help, slightly. But a kernel running at a pathetic 5% of its peak potential for 100 milliseconds still registers 100% `SM_ACTIVE` for that interval. The dashboard remains oblivious.
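To make the distinction concrete, here is a toy contrast, assuming any CUDA-capable box with PyTorch (watch nvidia-smi or `dcgmi dmon` in another terminal while it runs). Both loops keep kernels resident, so both pin the duty-cycle counters, but only the first does meaningful arithmetic.

```python
# Two ways to look "100% busy". The matmul loop is compute-bound and genuinely
# productive; the copy loop does essentially zero FLOPs, yet warps stay resident,
# so SM_ACTIVE and GPU-Util read just as high.
import torch

a = torch.randn(8192, 8192, device="cuda")
b = torch.randn(8192, 8192, device="cuda")

torch.cuda.synchronize()
for _ in range(200):        # compute-bound: ~1 TFLOP per matmul
    a @ b
torch.cuda.synchronize()

for _ in range(2000):       # memory-bound: pure copy, negligible arithmetic
    b.copy_(a)
torch.cuda.synchronize()
```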
We’ve dissected this pattern across various workloads. High utilization, plummeting throughput, and a dashboard that might as well be a Magic 8-Ball. The common thread? The root cause lives deeper.
The Usual Suspects: Why Your GPU Thinks It’s Busy
- Prefill/Decode Tango: Frameworks like vLLM, SGLang, and TGI batch prefill (input processing) and decode (output generation) on the same hardware. When prefill demands far more compute than decode, as it does with long contexts, a single long-context request becomes a traffic jam for all the shorter requests behind it. The GPU stays at 100% `SM_ACTIVE` because the prefill kernels are hogging the shader cores; meanwhile, decode latency for the waiting requests stretches toward infinity.
- Distributed Training Gridlock: Imagine a 4-GPU all-reduce operation. If one GPU is a straggler, the others wait. Those waiting GPUs show 100% utilization because the kernel orchestrating the wait is itself a kernel. Overall throughput is dictated by the slowest rank, not the efficient ones.
- Dataloader Deadlock: PyTorch's `DataLoader`, when performing index permutation on the main process, can become a single-threaded bottleneck. The GPU dutifully runs the same forward kernel repeatedly while the launch of the next batch is blocked behind a `cudaStreamSynchronize`. The kernel screams, but the next job is stuck in the driveway.
- CPU Core Chaos: vLLM's engine loop is single-threaded. An OS context switch, a neighboring core's kernel work, a pesky interrupt, or a poorly managed cgroup can stall the `cudaLaunchKernel` call. We've seen p99 `cudaLaunchKernel` times stretch to 13.1 ms (a gargantuan leap from a typical 16.7 us p50), all due to scheduler hiccups. The GPU keeps running whatever kernel was active before the stall, so utilization appears normal. (A rough way to spot this from Python is sketched just after this list.)
- Memory Bandwidth Meltdown: A kernel that demands data faster than DRAM can deliver it still reports 100% `SM_ACTIVE`; the SMs are occupied, but mostly waiting on memory. The real constraint is DRAM bandwidth. Utilization is a red herring; the bottleneck is memory throughput.
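And here is the launch-timing sketch promised above: a rough, PyTorch-level proxy for the `cudaLaunchKernel` stalls described in the CPU Core Chaos item. It times the whole Python-side dispatch rather than the driver call in isolation, and launch-queue backpressure can also inflate the tail, so treat the absolute numbers loosely; the signal is a p99 that sits orders of magnitude above the p50.

```python
# Measures how long it takes the host to *enqueue* a small kernel, without
# waiting for it to run. A healthy box shows tens of microseconds at p99;
# host-side scheduler trouble shows up as multi-millisecond outliers while
# GPU utilization still reads high.
import time
import torch

x = torch.randn(1 << 14, device="cuda")   # small tensor -> cheap, fast kernel
torch.cuda.synchronize()

samples = []
for _ in range(20_000):
    t0 = time.perf_counter_ns()
    x.mul_(1.0001)                         # enqueue one kernel; returns immediately
    samples.append(time.perf_counter_ns() - t0)

torch.cuda.synchronize()
samples.sort()
p50 = samples[len(samples) // 2] / 1e3
p99 = samples[int(len(samples) * 0.99)] / 1e3
print(f"dispatch latency: p50 {p50:.1f} us, p99 {p99:.1f} us")
```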
In every one of these scenarios, the symptom is depressingly familiar: high utilization, low throughput. The cause, however, hides in the layers beneath.
Finding the Real Bottleneck
So, how do you peel back the layers? Forget the aggregate utilization. Ask the hard question: “What was the GPU actually waiting on, second by second?”
Answering this demands correlating data from multiple sources on the same host, synchronized by timestamp:
- CUDA Runtime API Calls: Monitor events like `cudaLaunchKernel`, `cudaMemcpyAsync`, `cudaStreamSynchronize`, and `cudaDeviceSynchronize` via uprobes on `libcudart.so`.
- CUDA Driver API Calls: Track `cuLaunchKernel` and related driver-level operations using uprobes on `libcuda.so`.
- Kernel Execution Traces: Dive into the actual kernels being run. Tools like CUPTI or NVIDIA Nsight can provide detailed profiles of kernel duration, occupancy, and resource utilization within the kernel itself. (A minimal host/device timeline sketch follows this list.)
- Host-Side Activity: Don't ignore the CPU. Monitor CPU thread activity, context switches, and system calls related to GPU driver interaction.
- Memory Bandwidth: Directly measure DRAM bandwidth usage. This is often exposed via DCGM or specific profiling tools.
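If the workload is PyTorch-based, one accessible way to get several of these threads onto a shared timeline is torch.profiler, which records host-side dispatch and the CUDA kernels it maps to with correlated timestamps. This is an assumption about tooling, not the only option; Nsight Systems or raw CUPTI give the same picture with more detail.

```python
# Captures CPU-side calls and the CUDA kernels they launch on one timeline,
# then dumps a Chrome/Perfetto trace you can scrub through to see what the
# device was actually doing (or waiting on) at any instant.
import torch
from torch.profiler import ProfilerActivity, profile

x = torch.randn(4096, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(20):
        y = x @ x          # stand-in for real work
    torch.cuda.synchronize()

prof.export_chrome_trace("gpu_timeline.json")  # open in chrome://tracing or Perfetto
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```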
By weaving these threads together, you can finally see the difference between a GPU that’s churning through productive computation and one that’s merely spinning its wheels—a distinction that 97% utilization conveniently obscures.
This isn’t just a theoretical problem; it’s a persistent, frustrating reality in high-performance computing. And as AI workloads grow more complex, the ability to see beyond the simplistic utilization counter will become not just beneficial, but absolutely essential.