
How We Detect GPU Memory Leaks Before They Crash Your Run

Karthik Kirubakaran 4 min read

Detecting GPU memory leaks in real time is a surprisingly hard problem. Memory usage during ML training is naturally noisy — allocation patterns vary with batch size, model architecture, and training phase. A naive "memory went up" check would produce constant false positives. You need algorithms that distinguish between normal allocation patterns and genuine leaks.

This post explains the three detection approaches we use in gpulse, why each one exists, and how they complement each other.

Algorithm 1: Linear Trend Detection

The core idea is simple: if GPU memory usage is growing at a consistent rate over time, something is probably leaking. We use ordinary least-squares (OLS) linear regression on a sliding window of memory samples.

For each GPU, gpulse maintains a ring buffer of recent memory readings (timestamped). Every update cycle, we fit a line to the data and extract two values:

  • Slope: the rate of memory growth in MB/second. A positive slope means memory is increasing.
  • R-squared (R²): how well the linear model fits the data. A high R² (say, above 0.85) means the growth is consistent and linear, not just noise.

If the slope is positive and R² exceeds a configurable threshold, we flag a leak. With memory measured in MB and the slope in MB/second, the estimated time to OOM in seconds is calculated as:

time_to_oom = (total_vram - current_usage) / slope

This works well for the most common type of leak: steady, continuous memory growth from accumulating tensors, unreleased buffers, or framework bugs.
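As a rough illustration of the steps above, here is a minimal Python sketch of the fit. It is not gpulse's actual code: the names fit_trend and TrendResult, the three-sample minimum, and the default threshold of 0.85 (taken from the "say, above 0.85" example) are assumptions made for this example. A bounded deque stands in for the ring buffer.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class TrendResult:
    slope_mb_per_s: float
    r_squared: float
    is_leak: bool
    time_to_oom_s: Optional[float]  # None when no leak is flagged


def fit_trend(samples: Sequence[Tuple[float, float]],
              total_vram_mb: float,
              r2_threshold: float = 0.85) -> TrendResult:
    """OLS fit of used memory (MB) against time (s).

    samples holds (timestamp_s, used_mb) pairs, oldest first.
    """
    n = len(samples)
    if n < 3:  # assumed minimum; too few points to call anything a trend
        return TrendResult(0.0, 0.0, False, None)

    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    s_tt = sum((t - mean_t) ** 2 for t, _ in samples)
    s_mm = sum((m - mean_m) ** 2 for _, m in samples)
    s_tm = sum((t - mean_t) * (m - mean_m) for t, m in samples)

    if s_tt == 0.0 or s_mm == 0.0:
        # Identical timestamps or perfectly flat memory: nothing to fit.
        return TrendResult(0.0, 0.0, False, None)

    slope = s_tm / s_tt                      # MB per second
    r_squared = s_tm ** 2 / (s_tt * s_mm)    # squared correlation coefficient

    is_leak = slope > 0 and r_squared > r2_threshold
    current_mb = samples[-1][1]
    time_to_oom = (total_vram_mb - current_mb) / slope if is_leak else None
    return TrendResult(slope, r_squared, is_leak, time_to_oom)


if __name__ == "__main__":
    # Ring buffer: once 120 samples are stored, the oldest falls off.
    window = deque(maxlen=120)
    for t in range(300):
        window.append((float(t), 4000.0 + 2.0 * t))  # synthetic 2 MB/s leak
    print(fit_trend(window, total_vram_mb=24000.0))
```

Requiring both a positive slope and a high R² is what keeps noisy but flat memory from being flagged: an oscillating signal can have a slightly positive slope, but its R² stays close to zero.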

Algorithm 2: Spike Detection

Not all leaks are linear. Some manifest as periodic large allocations that are never freed — for example, a validation loop that allocates 500 MB every epoch and does not release it. Over 20 epochs, that is 10 GB.

The spike detector watches for sudden, large increases in memory within a sliding time window. When the magnitude of an increase exceeds a threshold relative to the recent baseline, it is flagged. The detector tracks:

  • The size of each spike (delta from the previous stable baseline)
  • The frequency of spikes (how many spikes per time window)
  • Whether memory returns to baseline after each spike (normal) or stays elevated (leak)

If memory stays elevated after repeated spikes, the detector flags a step-function leak pattern. This catches issues that the linear detector would miss because the growth is not smooth.
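One way to picture the "returns to baseline or stays elevated" check is the sketch below. It is an illustration under stated assumptions, not the gpulse implementation: the function names, the 256 MB spike threshold, the 64 MB tolerance, the five-sample settle window, and the rule of three retained spikes are all hypothetical values chosen for the example.

```python
from typing import List, Sequence


def retained_spikes(readings_mb: Sequence[float],
                    spike_mb: float = 256.0,
                    settle_samples: int = 5,
                    tolerance_mb: float = 64.0) -> List[float]:
    """Return the size (MB) of each spike after which memory stayed elevated."""
    retained: List[float] = []
    if not readings_mb:
        return retained

    baseline = readings_mb[0]
    i = 1
    while i < len(readings_mb):
        if readings_mb[i] - baseline < spike_mb:
            baseline = readings_mb[i]  # quiet sample: the baseline follows it
            i += 1
            continue

        # A spike started at i. Look at it plus the next settle_samples readings.
        window = readings_mb[i:i + settle_samples + 1]
        if len(window) <= settle_samples:
            break  # not enough data yet to judge this spike

        floor = min(window)
        if floor > baseline + tolerance_mb:
            retained.append(floor - baseline)  # never came back down
            baseline = floor                   # the new, higher plateau
        # Otherwise memory dipped back to baseline: a normal transient.
        i += settle_samples + 1
    return retained


def is_step_leak(readings_mb: Sequence[float], min_retained: int = 3) -> bool:
    """Flag a step-function leak after repeated retained spikes."""
    return len(retained_spikes(readings_mb)) >= min_retained
```

Fed a trace that climbs by 500 MB at each epoch boundary and never drops, retained_spikes reports one 500 MB step per boundary. A trace that jumps by 600 MB for a couple of samples and then falls back reports nothing.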

Algorithm 3: Composite Scoring

Running two detectors independently gives you broader coverage but introduces the question: what do you do when they disagree? The composite detector resolves this by combining the confidence scores from both algorithms.

The logic:

  • If both detectors flag a leak, confidence is high and the higher of the two severity estimates is reported.
  • If only the linear detector flags, but with high R², we trust it, because linear leaks are the most common pattern.
  • If only the spike detector flags, we report it but with lower confidence, since spikes can sometimes be normal allocation patterns.
  • If neither flags, the GPU is considered healthy.

The composite approach reduces false positives while catching a wider range of leak patterns than either detector alone. In practice, it produces very few false alarms while catching real leaks within minutes of onset.
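The four rules can be written as a small decision function. The sketch below is hypothetical: the DetectorResult and Verdict types, the 0 to 1 severity scale, the string confidence labels, and the trust_r2 cutoff are assumptions for illustration. The rules do not say what happens when the linear detector flags alone with a modest R², so the sketch assumes low confidence in that case.

```python
from dataclasses import dataclass


@dataclass
class DetectorResult:
    flagged: bool
    severity: float = 0.0  # hypothetical 0..1 scale


@dataclass
class Verdict:
    leak: bool
    confidence: str  # "high", "low", or "none"
    severity: float


def combine(linear: DetectorResult, linear_r2: float,
            spike: DetectorResult, trust_r2: float = 0.85) -> Verdict:
    """Merge the linear and spike detector outputs into one verdict."""
    if linear.flagged and spike.flagged:
        # Both agree: high confidence, report the worse severity.
        return Verdict(True, "high", max(linear.severity, spike.severity))
    if linear.flagged:
        # Linear only: trusted when the fit is strong.
        # The "low" branch is an assumption; the post does not cover it.
        confidence = "high" if linear_r2 >= trust_r2 else "low"
        return Verdict(True, confidence, linear.severity)
    if spike.flagged:
        # Spike only: reported, but spikes can be normal allocations.
        return Verdict(True, "low", spike.severity)
    return Verdict(False, "none", 0.0)
```

For example, combine(DetectorResult(False), 0.1, DetectorResult(True, 0.4)) yields a low-confidence leak verdict with severity 0.4, matching the "spike detector only" rule.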

Seeing It in Action

The Predict view in gpulse visualizes the output of all three detectors in real time. You can see the regression line overlaid on the memory graph, the spike markers, and the composite confidence score. When a leak is detected, the estimated time to OOM is displayed prominently.

You can also test the detection system using demo mode with simulated leak patterns:

gpulse dashboard --demo nvidia:4

The demo includes GPUs with active leak patterns so you can see the detection in action without risking real hardware. Explore the full leak detection documentation for configuration details and threshold tuning.

What's Next

We are exploring additional detection methods including periodic pattern analysis and anomaly detection using moving averages. The goal is always the same: catch leaks early, minimize false positives, and give you enough time to act before OOM.

Try gpulse free

brew tap gpulseai/gpulse && brew install gpulse
