Monitoring GPUs Across Your Training Cluster
When you go from one machine to many, GPU monitoring gets complicated fast. You have 4 nodes, each with 8 GPUs, and suddenly you are managing 32 GPUs across 4 SSH sessions. Which node has a hot GPU? Which one has a memory leak? Is node 3 actually using the GPUs you allocated, or is it sitting idle?
This is the reality for MLOps teams, platform engineers, and anyone running distributed training. And it is where single-machine tools like nvidia-smi and nvitop break down completely.
The Multi-Machine Visibility Problem
Without fleet monitoring, you end up doing one of the following:
- SSH tab hopping — opening a terminal to each node and running nvidia-smi in a loop. This does not scale past 3-4 machines.
- Custom scripts — writing a bash script that SSHes into each node, collects metrics, and aggregates them. Fragile, hard to maintain, and usually missing the metrics you actually need.
- Heavyweight agents — deploying Prometheus exporters, Grafana dashboards, and all the infrastructure that comes with them. Works well, but takes days to set up properly and requires ongoing maintenance.
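To make the custom-script pain concrete, here is a minimal Python sketch of the aggregation half of that approach. It parses the CSV that nvidia-smi emits with `--query-gpu=index,memory.used,memory.total,utilization.gpu,temperature.gpu --format=csv,noheader,nounits`; the hostnames and sample output are made up, and in practice you would collect each node's output over SSH yourself:

```python
import csv
import io

# What each node's nvidia-smi CSV output might look like (placeholder data;
# in a real script you would run nvidia-smi over ssh and capture stdout).
NODE_OUTPUT = {
    "node-1": "0, 31000, 40960, 98, 71\n1, 512, 40960, 0, 34\n",
    "node-2": "0, 40100, 40960, 99, 88\n",
}

def parse_node(text):
    """Parse one node's nvidia-smi CSV output into per-GPU metric dicts."""
    gpus = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        idx, mem_used, mem_total, util, temp = (int(v.strip()) for v in row)
        gpus.append({"index": idx, "mem_used": mem_used,
                     "mem_total": mem_total, "util": util, "temp": temp})
    return gpus

def fleet_summary(outputs):
    """Aggregate per-node output into a crude fleet view."""
    summary = {}
    for node, text in outputs.items():
        gpus = parse_node(text)
        summary[node] = {
            "gpus": len(gpus),
            "avg_util": sum(g["util"] for g in gpus) / len(gpus),
            "hot": [g["index"] for g in gpus if g["temp"] >= 85],
        }
    return summary
```

This covers the happy path only. The parts that make it fragile in production — SSH timeouts, nodes that are down, drivers that report `[N/A]`, keeping the loop running — are exactly the parts this sketch leaves out.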
None of these give you a real-time, unified view from your terminal.
Fleet Monitoring via SSH
gpulse's Fleet view (available in Pro) connects to remote machines over SSH and aggregates GPU metrics into a single dashboard. No agents to install on remote machines — gpulse runs locally and pulls metrics over your existing SSH connections.
This means:
- No agent deployment — if you can SSH to a machine, you can monitor it. No installing software on production nodes.
- Bastion host support — connects through jump hosts, so it works with secure network topologies.
- Up to 20 machines — monitor your entire training cluster from one terminal window.
- Unified alerting — get notified when any GPU in the fleet shows a memory leak, thermal throttling, or utilization anomaly.
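Because the connection rides on your existing SSH setup, jump-host routing is ordinary OpenSSH configuration. A standard ~/.ssh/config entry with ProxyJump is enough for any tool that reuses your SSH config; the hostnames and user below are placeholders:

```
# ~/.ssh/config — route training nodes through a bastion
# (hostnames and user are placeholders; substitute your own)
Host bastion
    HostName bastion.example.com
    User ops

Host node-*
    ProxyJump bastion
    User ops
```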
What Fleet View Shows You
The Fleet view aggregates across machines and shows:
- Health status for each node (healthy, warning, critical)
- Per-GPU metrics across the fleet — VRAM, utilization, temperature, power
- Leak detection alerts flagged per-GPU across all machines
- Node-level summary statistics (total VRAM, average utilization)
You can drill into any node to see its individual Detail view, then jump back to the fleet overview — all without leaving gpulse.
When to Use Fleet vs. Prometheus
Fleet monitoring via SSH is ideal for teams with up to 20 machines who want immediate visibility without infrastructure overhead. If you need long-term metric storage, custom dashboards, or integration with existing Prometheus/Grafana stacks, gpulse also exports Prometheus metrics that you can scrape and store.
The two approaches are complementary, not competing. Use Fleet for real-time awareness, Prometheus for historical analysis and alerting at scale.
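If you go the Prometheus route, the collection side is a standard scrape job. A minimal prometheus.yml fragment might look like the sketch below; the job name, hostnames, and port are assumptions, since the actual metrics endpoint depends on how you configure the exporter:

```yaml
# prometheus.yml (fragment) — scrape GPU metrics from each training node.
# Hostnames and port 9835 are placeholders; substitute your own endpoint.
scrape_configs:
  - job_name: "gpulse"
    scrape_interval: 15s
    static_configs:
      - targets:
          - "node-1:9835"
          - "node-2:9835"
          - "node-3:9835"
          - "node-4:9835"
```

From there, long-term storage, Grafana dashboards, and Alertmanager rules work the same as for any other exporter.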
To get started with local monitoring, see the getting started guide. Fleet features are available in gpulse Pro — join the waitlist to get access.
Try gpulse free
brew tap gpulseai/gpulse && brew install gpulse