Performance Tuning
Optimize gpulse for your workload and system configuration with these tuning recommendations.
Performance Goals
gpulse is designed to meet these targets:
| Metric | Target |
|---|---|
| GPU Overhead | < 1% impact on GPU workloads |
| Memory Usage | < 50 MB (single GPU), < 150 MB (32 GPUs) |
| Response Time | < 2s for CLI commands, < 100ms for analysis |
| Reliability | 99.9% uptime in daemon mode |
Monitoring Interval Tuning
The monitoring interval is the most impactful performance setting.
[monitoring]
# Default: 30 seconds
# Low overhead: 60-120 seconds
# High frequency: 5-15 seconds
interval-seconds = 30

| Scenario | Interval | CPU Impact |
|---|---|---|
| Development / Testing | 5-10 seconds | ~3-5% |
| Production (normal) | 30-60 seconds | ~1-2% |
| Production (low-impact) | 120 seconds | < 1% |
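
The interval also determines how much data gpulse collects. The sketch below estimates daily snapshot volume, assuming one snapshot per GPU per interval (an assumption, since the snapshot model is not described here). The function name `snapshots_per_day` is hypothetical and not part of gpulse.

```python
SECONDS_PER_DAY = 86_400


def snapshots_per_day(interval_seconds: int, gpu_count: int = 1) -> int:
    """Estimate snapshots collected per day.

    Assumption: gpulse records one snapshot per GPU per interval.
    """
    if interval_seconds <= 0 or gpu_count <= 0:
        raise ValueError("interval_seconds and gpu_count must be positive")
    return (SECONDS_PER_DAY // interval_seconds) * gpu_count


if __name__ == "__main__":
    for interval in (5, 30, 120):
        print(f"{interval:>3}s interval: {snapshots_per_day(interval)} snapshots/day per GPU")
```

Under that assumption, moving from a 5-second to a 120-second interval cuts daily volume from 17280 to 720 snapshots per GPU.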
History Size
[monitoring]
# Default: 1000 snapshots
# Low memory: 500
# Extended analysis: 2000-5000
history-size = 1000

| GPU Count | Recommended Snapshots | RAM Impact |
|---|---|---|
| Single GPU | 1000-2000 | ~20-40 MB |
| 4-8 GPUs | 500-1000 | ~10-20 MB |
| 16+ GPUs | 300-500 | ~6-10 MB |
| Memory-constrained | 200-300 | ~4-6 MB |
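
The table rows work out to roughly 20 KB per snapshot (1000 snapshots at about 20 MB). The sketch below uses that derived figure to estimate memory for any history size, and also shows how much wall-clock time a history buffer spans at a given interval. The 20 KB constant and both function names are illustrative assumptions, not gpulse internals.

```python
# Assumption: derived from the table above (1000 snapshots is about 20 MB).
KB_PER_SNAPSHOT = 20


def estimated_history_mb(history_size: int, kb_per_snapshot: float = KB_PER_SNAPSHOT) -> float:
    """Approximate RAM used by the snapshot history, in MB (1 MB = 1000 KB)."""
    return history_size * kb_per_snapshot / 1000


def history_coverage_hours(history_size: int, interval_seconds: int) -> float:
    """Wall-clock time spanned by a full history buffer, in hours."""
    return history_size * interval_seconds / 3600


if __name__ == "__main__":
    print(estimated_history_mb(1000))                   # default history size
    print(round(history_coverage_hours(1000, 30), 2))   # default size and interval
```

With the defaults (1000 snapshots at a 30-second interval) the buffer spans a little over 8 hours, so lowering `history-size` on a short interval shortens the period available for analysis.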
Database Performance
Enable WAL Mode
Write-Ahead Logging (WAL) significantly improves concurrent access: 2-3x write throughput and reduced lock contention.
[storage]
wal-mode = true

Cache Size
[storage]
# Default: 20000 KB (20 MB)
# High throughput: 50000-100000 KB
# Memory-constrained: 10000 KB
cache-size-kb = 20000

Database Location
Place the database on fast storage for best results.
[storage]
# NVMe SSD recommended; avoid HDD
database-path = "/fast/ssd/gpu_monitoring.db"

Leak Detection Tuning
Minimum Samples
Fewer samples mean faster detection but more false positives. More samples mean slower but more accurate detection.
[detection]
# Fast detection: 10-15
# Balanced (default): 20
# Accurate: 30-50
minimum-samples = 20

Analysis Window
[detection]
# Quick detection: 0.5 hours
# Default: 1.0 hours
# Conservative: 2-4 hours
analysis-window-hours = 1.0

Scenario-Based Configurations
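
The configurations in this section combine `interval-seconds`, `history-size`, `minimum-samples`, and `analysis-window-hours`, and these settings constrain one another. The following is a minimal sketch for checking a combination before applying it. It assumes that detection needs at least `minimum-samples` snapshots inside the analysis window and that analysis reads from the in-memory history. Neither assumption is confirmed here, and `check_detection_settings` is a hypothetical helper, not a gpulse command.

```python
from dataclasses import dataclass


@dataclass
class DetectionCheck:
    samples_in_window: int             # snapshots that fit in the analysis window
    earliest_detection_minutes: float  # time needed to collect minimum-samples
    window_fits_min_samples: bool      # window holds at least minimum-samples
    history_covers_window: bool        # history buffer spans the whole window


def check_detection_settings(
    interval_seconds: int,
    history_size: int,
    minimum_samples: int,
    analysis_window_hours: float,
) -> DetectionCheck:
    """Sanity-check how the four settings interact (illustrative only)."""
    window_seconds = analysis_window_hours * 3600
    samples_in_window = int(window_seconds // interval_seconds)
    return DetectionCheck(
        samples_in_window=samples_in_window,
        earliest_detection_minutes=minimum_samples * interval_seconds / 60,
        window_fits_min_samples=samples_in_window >= minimum_samples,
        history_covers_window=history_size * interval_seconds >= window_seconds,
    )


if __name__ == "__main__":
    # Defaults: 30 s interval, 1000 snapshots, 20 samples, 1.0 hour window.
    print(check_detection_settings(30, 1000, 20, 1.0))
```

With the defaults, 120 snapshots fit in the window and the first 20 samples take 10 minutes to collect. If `history_covers_window` comes back false, the window is longer than the in-memory history spans. That only matters if analysis does not also read older snapshots from the database.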
Development Environment
Goal: Quick feedback, maximum detail
[monitoring]
interval-seconds = 5
history-size = 2000
enable-process-attribution = true
[detection]
minimum-samples = 15
confidence-threshold = 0.7
analysis-window-hours = 0.5

Production ML Training
Goal: Minimal overhead, accurate long-term detection
[monitoring]
interval-seconds = 120
history-size = 500
enable-process-attribution = false
[detection]
minimum-samples = 30
confidence-threshold = 0.85
analysis-window-hours = 4.0
[storage]
wal-mode = true
cache-size-kb = 50000

Multi-GPU Server (32 GPUs)
Goal: Monitor many GPUs efficiently
[monitoring]
interval-seconds = 60
history-size = 300
enable-process-attribution = false
batch-insert-size = 1000
[storage]
wal-mode = true
cache-size-kb = 100000
connection-pool-size = 50
[detection]
minimum-samples = 50
analysis-window-hours = 12.0

Real-Time Leak Detection
Goal: Fastest possible leak detection
[monitoring]
interval-seconds = 5
history-size = 5000
enable-process-attribution = true
[detection]
minimum-samples = 10
confidence-threshold = 0.7
analysis-window-hours = 0.25
enable-multiple-strategies = true
[storage]
wal-mode = true
cache-size-kb = 100000

Best Practices
- Start with defaults — tune only when needed
- Monitor before and after — measure the impact of each change
- Tune one parameter at a time — isolate effects
- Document changes — track what you changed and why
- Test under load — verify performance under realistic conditions
- Use profiling tools — don't guess, measure
- Review regularly — performance needs change over time
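
One way to make "monitor before and after" concrete is to record the same metric (for example, CPU percentage sampled by your own tooling) before and after a change and compare the means. The sketch below is a generic comparison under that assumption. `compare_runs` is a hypothetical helper and the sample values are placeholders, not gpulse measurements.

```python
from statistics import mean


def compare_runs(before: list[float], after: list[float]) -> dict:
    """Compare mean values of a metric sampled before and after a change."""
    if not before or not after:
        raise ValueError("both runs need at least one sample")
    before_mean = mean(before)
    after_mean = mean(after)
    delta = after_mean - before_mean
    # Percent change is undefined when the baseline mean is zero.
    percent = (delta / before_mean * 100) if before_mean else None
    return {
        "before_mean": before_mean,
        "after_mean": after_mean,
        "delta": delta,
        "percent_change": percent,
    }


if __name__ == "__main__":
    # Placeholder CPU% samples, for illustration only.
    print(compare_runs([4.0, 5.0, 3.0], [2.0, 2.0, 2.0]))
```

Changing one parameter per comparison keeps the measured difference attributable to that parameter.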
Troubleshooting
High CPU Usage
Increase interval-seconds, disable enable-process-attribution, or reduce history-size.
High Memory Usage
Reduce history-size, set snapshot-retention-days, or reduce cache-size-kb.
Slow Database Queries
Enable WAL mode, run VACUUM, increase cache-size-kb, or apply retention policies.
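
If the gpulse database is a SQLite file (an assumption suggested by the WAL and VACUUM terminology, but not stated in this guide), the same maintenance steps can be sketched with Python's standard `sqlite3` module. The function name `tune_sqlite` is hypothetical. Prefer the `[storage]` settings for normal operation.

```python
import sqlite3


def tune_sqlite(db_path: str, cache_size_kb: int = 20000) -> dict:
    """Enable WAL, set the page cache size, and VACUUM a SQLite database.

    Assumption: the file at db_path is a SQLite database.
    """
    # isolation_level=None keeps the connection in autocommit mode,
    # which VACUUM requires (it cannot run inside a transaction).
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        journal_mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
        # A negative cache_size is interpreted as a size in KiB, not pages.
        conn.execute(f"PRAGMA cache_size = -{int(cache_size_kb)}")
        cache_size = conn.execute("PRAGMA cache_size").fetchone()[0]
        conn.execute("VACUUM")
        return {"journal_mode": journal_mode, "cache_size": cache_size}
    finally:
        conn.close()


if __name__ == "__main__":
    import os
    import tempfile

    # Demonstrate on a throwaway database rather than a live one.
    path = os.path.join(tempfile.mkdtemp(), "example.db")
    print(tune_sqlite(path))
```

In SQLite, the WAL journal mode persists in the database file, while `cache_size` applies only to the connection that sets it, so the cache setting here does not carry over to the gpulse process. VACUUM rebuilds the whole file and can temporarily need up to about twice the database size in free disk space, so it is best run during a quiet period.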