Performance Tuning

Optimize gpulse for your workload and system configuration with these tuning recommendations.

Performance Goals

gpulse is designed to meet these targets:

| Metric | Target |
| --- | --- |
| GPU Overhead | < 1% impact on GPU workloads |
| Memory Usage | < 50 MB (single GPU), < 150 MB (32 GPUs) |
| Response Time | < 2 s for CLI commands, < 100 ms for analysis |
| Reliability | 99.9% uptime in daemon mode |

Monitoring Interval Tuning

The monitoring interval is the most impactful performance setting.

```toml
[monitoring]
# Default: 30 seconds
# Low overhead: 60-120 seconds
# High frequency: 5-15 seconds
interval-seconds = 30
```

| Scenario | Interval | CPU Impact |
| --- | --- | --- |
| Development / Testing | 5-10 seconds | ~3-5% |
| Production (normal) | 30-60 seconds | ~1-2% |
| Production (low-impact) | 120 seconds | < 1% |
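The arithmetic behind interval selection is simple enough to sketch. The helpers below are illustrative and not part of gpulse; they only relate the interval to data volume and to how much wall-clock time the snapshot history covers.

```python
# Illustrative helpers (not part of gpulse) relating the monitoring
# interval to snapshot volume and history coverage.

def snapshots_per_day(interval_seconds: int) -> int:
    """Snapshots collected per 24-hour day at a given interval."""
    return 86_400 // interval_seconds

def history_hours(interval_seconds: int, history_size: int) -> float:
    """Wall-clock hours covered by the retained snapshot history."""
    return interval_seconds * history_size / 3600

# At the 30 s default, roughly 2,880 snapshots are collected per day,
# and the default history-size of 1000 spans about 8.3 hours.
```

A longer interval therefore both cuts CPU overhead and stretches the same history buffer over a longer observation window, which matters when pairing it with the detection settings below.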

History Size

```toml
[monitoring]
# Default: 1000 snapshots
# Low memory: 500
# Extended analysis: 2000-5000
history-size = 1000
```

| GPU Count | Recommended Snapshots | RAM Impact |
| --- | --- | --- |
| Single GPU | 1000-2000 | ~20-40 MB |
| 4-8 GPUs | 500-1000 | ~10-20 MB |
| 16+ GPUs | 300-500 | ~6-10 MB |
| Memory-constrained | 200-300 | ~4-6 MB |
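The table implies a cost of roughly 20 KB per retained snapshot (1000 snapshots ≈ 20 MB). A back-of-the-envelope estimate under that assumption, which is derived from the table rather than from gpulse internals:

```python
# Rough sizing sketch. KB_PER_SNAPSHOT is an assumption inferred from
# the table above (1000 snapshots ~ 20 MB), not gpulse's actual model.
KB_PER_SNAPSHOT = 20

def history_ram_mb(history_size: int) -> float:
    """Approximate RAM consumed by the snapshot history, in MB."""
    return history_size * KB_PER_SNAPSHOT / 1024

# history_ram_mb(1000) is about 19.5 MB, matching the single-GPU row.
```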

Database Performance

Enable WAL Mode

Write-Ahead Logging significantly improves concurrent access, typically yielding 2-3x write throughput and reduced lock contention.

```toml
[storage]
wal-mode = true
```
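WAL is a SQLite journal mode, which suggests gpulse stores snapshots in SQLite; under that assumption, `wal-mode = true` corresponds to the standard `journal_mode` pragma. A minimal sketch (the database path is illustrative):

```python
import os
import sqlite3
import tempfile

# Sketch of what `wal-mode = true` maps to at the SQLite level,
# assuming a SQLite storage backend. WAL requires a file-backed
# database, so we use a throwaway temp file here.
path = os.path.join(tempfile.mkdtemp(), "gpu_monitoring.db")
conn = sqlite3.connect(path)
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # "wal" once the mode is active
conn.close()
```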

Cache Size

```toml
[storage]
# Default: 20000 KB (20 MB)
# High throughput: 50000-100000 KB
# Memory-constrained: 10000 KB
cache-size-kb = 20000
```
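If the backend is SQLite, a KB-denominated cache setting plausibly maps to SQLite's `cache_size` pragma, where a negative value means a size in KiB rather than a page count. This mapping is an assumption, sketched below:

```python
import sqlite3

# Assumed mapping of cache-size-kb onto SQLite's cache_size pragma:
# a negative value sets the page cache in KiB instead of pages.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA cache_size = -20000")  # ~20 MB of page cache
size = conn.execute("PRAGMA cache_size").fetchone()[0]
conn.close()
```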

Database Location

Place the database on fast storage for best results.

```toml
[storage]
# NVMe SSD recommended; avoid HDD
database-path = "/fast/ssd/gpu_monitoring.db"
```

Leak Detection Tuning

Minimum Samples

Fewer samples yield faster detection but more false positives; more samples yield slower but more accurate detection.

```toml
[detection]
# Fast detection: 10-15
# Balanced (default): 20
# Accurate: 30-50
minimum-samples = 20
```

Analysis Window

```toml
[detection]
# Quick detection: 0.5 hours
# Default: 1.0 hours
# Conservative: 2-4 hours
analysis-window-hours = 1.0
```

Scenario-Based Configurations

Development Environment

Goal: Quick feedback, maximum detail

```toml
[monitoring]
interval-seconds = 5
history-size = 2000
enable-process-attribution = true

[detection]
minimum-samples = 15
confidence-threshold = 0.7
analysis-window-hours = 0.5
```

Production ML Training

Goal: Minimal overhead, accurate long-term detection

```toml
[monitoring]
interval-seconds = 120
history-size = 500
enable-process-attribution = false

[detection]
minimum-samples = 30
confidence-threshold = 0.85
analysis-window-hours = 4.0

[storage]
wal-mode = true
cache-size-kb = 50000
```

Multi-GPU Server (32 GPUs)

Goal: Monitor many GPUs efficiently

```toml
[monitoring]
interval-seconds = 60
history-size = 300
enable-process-attribution = false
batch-insert-size = 1000

[storage]
wal-mode = true
cache-size-kb = 100000
connection-pool-size = 50

[detection]
minimum-samples = 50
analysis-window-hours = 12.0
```

Real-Time Leak Detection

Goal: Fastest possible leak detection

```toml
[monitoring]
interval-seconds = 5
history-size = 5000
enable-process-attribution = true

[detection]
minimum-samples = 10
confidence-threshold = 0.7
analysis-window-hours = 0.25
enable-multiple-strategies = true

[storage]
wal-mode = true
cache-size-kb = 100000
```

Best Practices

  1. Start with defaults — tune only when needed
  2. Monitor before and after — measure the impact of each change
  3. Tune one parameter at a time — isolate effects
  4. Document changes — track what you changed and why
  5. Test under load — verify performance under realistic conditions
  6. Use profiling tools — don't guess, measure
  7. Review regularly — performance needs change over time

Troubleshooting

High CPU Usage

Increase interval-seconds, disable enable-process-attribution, or reduce history-size.

High Memory Usage

Reduce history-size, enable snapshot-retention-days, or reduce cache-size-kb.

Slow Database Queries

Enable WAL mode, run VACUUM, increase cache-size-kb, or apply retention policies.
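Assuming the SQLite backend suggested by the WAL option, the VACUUM step can be sketched as follows; the schema and path here are illustrative, not gpulse's actual schema. VACUUM rebuilds the database file, reclaiming the free pages left behind after a retention policy deletes old rows.

```python
import os
import sqlite3
import tempfile

# Illustrative VACUUM demo on a throwaway database: bulk-insert rows,
# delete them (as a retention policy would), then reclaim the space.
path = os.path.join(tempfile.mkdtemp(), "gpu_monitoring.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE snapshots (ts REAL, used_mb REAL)")
conn.executemany("INSERT INTO snapshots VALUES (?, ?)",
                 [(i, 1024.0) for i in range(10_000)])
conn.execute("DELETE FROM snapshots")  # e.g. retention policy ran
conn.commit()
before = os.path.getsize(path)
conn.execute("VACUUM")  # rewrites the file without the free pages
after = os.path.getsize(path)
conn.close()
print(after < before)  # the file shrinks after VACUUM
```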