The Hidden Cost of GPU Memory Leaks
GPU memory leaks are insidious. Unlike a segfault or an exception, a memory leak does not announce itself. VRAM usage creeps up by a few megabytes per minute — slow enough that you do not notice during a quick check, but fast enough to kill an 8-hour training run.
The worst part? By the time you see the OOM error, you have already lost hours of compute. Your last checkpoint might be from 6 hours ago. And the leak might not even be in your code — it could be in a framework, a data loader, or a third-party library.
Where GPU Memory Leaks Come From
In ML training, the most common sources of VRAM leaks are:
- Accumulating tensors — logging or storing intermediate tensors without detaching them from the computation graph.
- Data loader workers — worker processes that allocate GPU memory and do not release it between batches.
- Framework bugs — subtle issues in PyTorch, TensorFlow, or JAX that cause memory to be retained longer than expected.
- Custom CUDA kernels — manual memory management that misses edge cases.
- Evaluation loops — running validation without torch.no_grad() (or an equivalent inference mode), which keeps the autograd graph and its saved activations alive.
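The first and last items on that list are the easiest to reproduce. As a minimal PyTorch sketch (not taken from any particular codebase), accumulating a loss tensor keeps every step's autograd graph reachable, while detaching to a Python float — or wrapping validation in torch.no_grad() — breaks that chain:

```python
import torch

model = torch.nn.Linear(10, 1)

running = 0.0
for step in range(100):
    x = torch.randn(32, 10)
    loss = model(x).pow(2).mean()
    # LEAK: `running += loss` would keep each step's entire autograd
    # graph reachable through `running`, growing memory every iteration.
    # Detaching to a Python float breaks the reference chain:
    running += loss.detach().item()
mean_loss = running / 100

# Validation: skip graph construction entirely so nothing is retained.
with torch.no_grad():
    val_loss = model(torch.randn(32, 10)).pow(2).mean().item()
```

The same pattern applies to any logging or metric accumulation: convert to plain Python numbers (.item()) at the boundary, and keep tensors with graphs out of long-lived containers.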
Why Snapshots Miss Leaks
Running nvidia-smi every few minutes is not enough. A leak that grows at 50 MB/minute looks normal in any individual snapshot — you see 12 GB used and think "that seems right." But over 4 hours, that is an extra 12 GB. On a 24 GB card, that is the difference between completing your run and crashing.
You need trend analysis, not snapshots. Specifically, you need algorithms that track memory over time and detect the slope of growth.
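The raw ingredient for trend analysis is just a time series of VRAM readings. As a rough illustration (the function names here are hypothetical, not gpulse's API), you could poll nvidia-smi's machine-readable CSV output on an interval and keep (timestamp, megabytes) pairs for later slope fitting:

```python
import subprocess
import time

def parse_vram_mb(output):
    # nvidia-smi with csv,noheader,nounits prints one bare number per GPU
    return float(output.strip().splitlines()[0])

def read_vram_mb(gpu_index=0):
    # Query used VRAM in MiB for one GPU via nvidia-smi's CSV output
    out = subprocess.check_output(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            "--query-gpu=memory.used",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    return parse_vram_mb(out)

def sample_vram(interval_s=10, samples=30):
    # Collect (timestamp, used_mb) pairs for later trend analysis
    history = []
    for _ in range(samples):
        history.append((time.time(), read_vram_mb()))
        time.sleep(interval_s)
    return history
```

A 10-second interval over a 5-minute window is enough to fit a trend; each individual reading is exactly the snapshot that, on its own, tells you nothing.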
Three Approaches to Leak Detection
gpulse uses three complementary algorithms to catch different types of leaks:
- Linear Trend Detection — fits a least-squares regression to VRAM usage over a sliding window. If the slope is consistently positive and the R-squared value is high, it flags a leak and calculates the estimated time until OOM.
- Spike Detection — catches sudden, large memory increases within a short time window. This is useful for leaks that are not linear but come in bursts (e.g., a function that allocates and fails to free every N iterations).
- Composite Scoring — runs both detectors and returns the highest-confidence result, reducing false positives while catching a wider range of patterns.
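The first two detectors reduce to a small amount of math. This is a sketch of the general technique — an ordinary least-squares fit with an R-squared confidence gate, plus a windowed jump check — not gpulse's actual implementation, and the thresholds shown are illustrative:

```python
def linear_trend(samples):
    """Least-squares fit of used MB vs. time over a sliding window.
    samples: list of (timestamp_s, used_mb). Returns (slope_mb_per_s, r_squared)."""
    n = len(samples)
    ts = [t for t, _ in samples]
    ys = [y for _, y in samples]
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in samples)
    sxx = sum((t - t_mean) ** 2 for t in ts)
    syy = sum((y - y_mean) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0, 0.0  # degenerate window: no time spread or perfectly flat
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, r_squared

def eta_to_oom_s(samples, total_mb, slope):
    """Seconds until the card is full at the current growth rate, or None."""
    if slope <= 0:
        return None
    return (total_mb - samples[-1][1]) / slope

def spike_detected(samples, jump_mb=1024.0, window_s=60.0):
    """Flag any single jump larger than jump_mb between samples
    taken at most window_s apart — catches bursty, non-linear leaks."""
    for (t0, y0), (t1, y1) in zip(samples, samples[1:]):
        if t1 - t0 <= window_s and y1 - y0 >= jump_mb:
            return True
    return False
```

On the 50 MB/minute example above: twenty one-minute samples climbing 50 MB each fit a positive slope with R-squared near 1.0, and the ETA tells you roughly how many hours of runway remain on a 24 GB card — hours before the OOM, not after it.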
You can explore these in the leak detection documentation and see them in action via the Predict view.
The Dollar Cost
An 8-GPU A100 instance on AWS (p4d.24xlarge) costs roughly $30/hour on-demand. An 8-hour training run that crashes at hour 7 due to a leak costs $210 in wasted compute — plus the engineering time to debug and rerun. For teams running multiple experiments in parallel, this can add up to thousands per month.
Catching a leak in the first 15 minutes costs you nothing. Catching it 7 hours later costs you real money.
Set up monitoring before your next training run. The getting started guide takes under five minutes.
Try gpulse free
brew tap gpulseai/gpulse && brew install gpulse