Multi-GPU Monitoring

gpulse is built for systems with multiple GPUs. This guide covers the views and workflows for monitoring two or more devices simultaneously.

Grid View Overview

Grid view is the default and the best starting point for any multi-GPU system. Press g to switch to it from any other view.

Each GPU tile shows:

  • GPU index and model name
  • Memory bar: used / total with percentage
  • Utilization bar: current compute (SM) utilization
  • Temperature and power draw
  • A colour-coded health indicator

Colour-Coded Health

Colour   Meaning
Green    All metrics within normal ranges
Yellow   At least one metric in the warning range (e.g., temperature 70-85 °C, memory > 80%)
Red      At least one metric critical (e.g., temperature > 85 °C, memory > 95%, or an uncorrected ECC error)

Scanning the grid top-to-bottom lets you spot an outlier GPU at a glance without reading every number.
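The thresholds above can be read as a simple classification rule. The sketch below is illustrative, not gpulse source: the function name and signature are hypothetical, and the cut-offs are taken directly from the table.

```python
# Hypothetical sketch of the health-indicator logic, using the thresholds
# from the table above. Temperature is in °C; memory use is a fraction 0-1.
def health_colour(temp_c, mem_used_frac, ecc_uncorrected=0):
    """Return 'red', 'yellow', or 'green' for one GPU's current metrics."""
    # Critical: any single metric past its red threshold trumps everything.
    if temp_c > 85 or mem_used_frac > 0.95 or ecc_uncorrected > 0:
        return "red"
    # Warning: at least one metric in the yellow band.
    if temp_c >= 70 or mem_used_frac > 0.80:
        return "yellow"
    return "green"
```

Note that a single red metric outranks any number of yellow ones, which is why one overheating GPU stands out even on an otherwise busy grid.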

Sorting

In Grid and List views, press o to cycle through sort orders:

Sort Order   Description
Index        GPU 0, 1, 2... (default)
Memory Used  Highest memory consumer first
Utilization  Highest compute load first
Temperature  Hottest GPU first
Name         Alphabetical by model name
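Conceptually, each sort order is a key function over per-GPU records, and o cycles through the list. This is a minimal sketch of that idea, not gpulse's actual code; the record fields and function names are assumptions.

```python
# Illustrative sketch: each sort order from the table maps to a key function
# over a per-GPU record (a dict here, for simplicity). Descending orders
# negate the metric so that the biggest value sorts first.
SORT_ORDERS = [
    ("Index",       lambda g: g["index"]),
    ("Memory Used", lambda g: -g["mem_used"]),  # highest consumer first
    ("Utilization", lambda g: -g["util"]),      # highest load first
    ("Temperature", lambda g: -g["temp"]),      # hottest first
    ("Name",        lambda g: g["name"]),       # alphabetical
]

def sorted_gpus(gpus, order_idx):
    """Sort GPU records by the order at order_idx (o cycles this index)."""
    _, key = SORT_ORDERS[order_idx % len(SORT_ORDERS)]
    return sorted(gpus, key=key)
```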

Detail Deep-Dive

To investigate a single GPU:

  1. In Grid or List view, use Up / Down to highlight the GPU
  2. Press Enter to select it, then d for Detail view

Detail view divides the screen into four quadrants:

Top-left      Memory utilization timeline (last N seconds of history)
Top-right     GPU compute utilization timeline
Bottom-left   Temperature and power readings with sparklines
Bottom-right  Live process table: PID, process name, memory, and user

Press g or v to return to the multi-GPU overview.

Compare View

Compare view places two or more GPUs side-by-side with matching metric rows so you can spot imbalances in a distributed training job. Press c to open it.

Typical use cases:

  • Verifying all GPUs in a data-parallel training run consume similar memory and compute
  • Identifying a "slow GPU" causing others to block at synchronisation barriers
  • Checking tensor-parallel model layer splits across devices

Use Left / Right to change the comparison target.
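What Compare view makes visible by eye can also be stated as a rule of thumb: a straggler GPU is one whose utilization sits well below the mean of the group. The helper below is a hypothetical sketch of that check, not a gpulse feature; the function name and tolerance value are assumptions.

```python
# Hypothetical straggler check: flag GPUs whose utilization falls more than
# `tolerance` (as a fraction of the mean) below the group mean. In a
# data-parallel run, such a GPU makes the others wait at sync barriers.
def stragglers(utils, tolerance=0.15):
    """Return indices of GPUs lagging the group. `utils` are fractions 0-1."""
    mean = sum(utils) / len(utils)
    return [i for i, u in enumerate(utils) if u < mean * (1 - tolerance)]
```

With utilizations like [0.90, 0.91, 0.88, 0.45], only the fourth GPU is flagged, matching what a side-by-side comparison of the metric rows would show.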

Topology View

Press t to open Topology view. It renders a diagram of the physical interconnect between GPUs, including:

  • PCIe links: bandwidth class (x8, x16) and CPU socket attachment
  • NVLink bridges: direct GPU-to-GPU links and negotiated bandwidth

Two GPUs connected via NVLink can exchange tensors at up to 600 GB/s aggregate (NVLink 3.0, as on the A100), while GPUs on opposite NUMA nodes communicating over PCIe may see 10-20x lower effective bandwidth. If distributed training is unexpectedly slow, check Topology view for the interconnect path.
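A back-of-envelope calculation shows why the path matters. The numbers below are nominal bandwidths for illustration, not measurements; the 30 GB/s PCIe figure simply instantiates the "20x lower" case from the text.

```python
# Rough transfer-time comparison for moving a 2 GiB gradient bucket
# between two GPUs over NVLink vs a slower PCIe/NUMA-crossing path.
# Bandwidths are nominal aggregates, not measured throughput.
def transfer_seconds(n_bytes, gb_per_s):
    """Time to move n_bytes at a given bandwidth (GB/s, decimal gigabytes)."""
    return n_bytes / (gb_per_s * 1e9)

bucket = 2 * 1024**3                       # 2 GiB of gradients
nvlink = transfer_seconds(bucket, 600)     # NVLink 3.0 aggregate, ~3.6 ms
pcie   = transfer_seconds(bucket, 30)      # 20x lower path, ~72 ms
```

Per bucket the absolute times look small, but a training step may exchange many buckets per iteration, so a 20x slower link quickly dominates step time.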

16+ GPU Systems (Pagination)

On systems with more than 8 GPUs, Grid view paginates automatically.

Key   Action
PgDn  Next page of GPUs
PgUp  Previous page of GPUs

The status bar shows the current page (e.g., GPUs 9-16 of 64). All metrics continue updating for off-screen GPUs. For 16+ GPU systems, consider List view (v) as a denser alternative.
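The status-bar range follows from simple arithmetic. This sketch assumes a fixed page size of 8 (inferred from the text above, since pagination starts beyond 8 GPUs); the function name is hypothetical.

```python
# Sketch of the pagination arithmetic behind the "GPUs 9-16 of 64" status
# line. A page size of 8 is assumed from the guide's description.
PAGE_SIZE = 8

def page_bounds(page, total):
    """1-based page number -> (first_gpu, last_gpu, n_pages)."""
    n_pages = (total + PAGE_SIZE - 1) // PAGE_SIZE  # ceiling division
    first = (page - 1) * PAGE_SIZE + 1
    last = min(page * PAGE_SIZE, total)             # last page may be short
    return first, last, n_pages

# e.g. page 2 of a 64-GPU system corresponds to "GPUs 9-16 of 64"
```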