Performance Tuning

Optimize gpulse for your workload and system configuration with these tuning recommendations.

Performance Goals

gpulse is designed to meet these targets:

| Metric | Target |
| --- | --- |
| GPU Overhead | < 1% impact on GPU workloads |
| Memory Usage | < 50 MB (single GPU), < 150 MB (32 GPUs) |
| Response Time | < 2 s for CLI commands, < 100 ms for analysis |
| Reliability | 99.9% uptime in daemon mode |

Monitoring Interval Tuning

The monitoring interval is the most impactful performance setting.

```toml
[monitoring]
# Default: 30 seconds
# Low overhead: 60-120 seconds
# High frequency: 5-15 seconds
interval-seconds = 30
```

| Scenario | Interval | CPU Impact |
| --- | --- | --- |
| Development / Testing | 5-10 seconds | ~3-5% |
| Production (normal) | 30-60 seconds | ~1-2% |
| Production (low-impact) | 120 seconds | < 1% |
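The arithmetic behind interval selection is simple enough to sketch. The helpers below are illustrative and not part of gpulse; they only relate the interval to data volume and to how much wall-clock time the snapshot history covers.

```python
# Illustrative helpers (not part of gpulse) relating the monitoring
# interval to snapshot volume and history coverage.

def snapshots_per_day(interval_seconds: int) -> int:
    """Snapshots collected per 24-hour day at a given interval."""
    return 86_400 // interval_seconds

def history_hours(interval_seconds: int, history_size: int) -> float:
    """Wall-clock hours covered by the retained snapshot history."""
    return interval_seconds * history_size / 3600

# At the 30 s default, roughly 2,880 snapshots are collected per day,
# and the default history-size of 1000 spans about 8.3 hours.
```

A longer interval therefore both cuts CPU overhead and stretches the same history buffer over a longer observation window, which matters when pairing it with the detection settings below.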

History Size

```toml
[monitoring]
# Default: 1000 snapshots
# Low memory: 500
# Extended analysis: 2000-5000
history-size = 1000
```

| GPU Count | Recommended Snapshots | RAM Impact |
| --- | --- | --- |
| Single GPU | 1000-2000 | ~20-40 MB |
| 4-8 GPUs | 500-1000 | ~10-20 MB |
| 16+ GPUs | 300-500 | ~6-10 MB |
| Memory-constrained | 200-300 | ~4-6 MB |
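The table implies a cost of roughly 20 KB per retained snapshot (1000 snapshots ≈ 20 MB). A back-of-the-envelope estimate under that assumption, which is derived from the table rather than from gpulse internals:

```python
# Rough sizing sketch. KB_PER_SNAPSHOT is an assumption inferred from
# the table above (1000 snapshots ~ 20 MB), not gpulse's actual model.
KB_PER_SNAPSHOT = 20

def history_ram_mb(history_size: int) -> float:
    """Approximate RAM consumed by the snapshot history, in MB."""
    return history_size * KB_PER_SNAPSHOT / 1024

# history_ram_mb(1000) is about 19.5 MB, matching the single-GPU row.
```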

Database Performance

Enable WAL Mode

Write-Ahead Logging significantly improves concurrent access, typically yielding 2-3x write throughput and reduced lock contention.

```toml
[storage]
wal-mode = true
```
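WAL is a SQLite journal mode, which suggests gpulse stores snapshots in SQLite; under that assumption, `wal-mode = true` corresponds to the standard `journal_mode` pragma. A minimal sketch (the database path is illustrative):

```python
import os
import sqlite3
import tempfile

# Sketch of what `wal-mode = true` maps to at the SQLite level,
# assuming a SQLite storage backend. WAL requires a file-backed
# database, so we use a throwaway temp file here.
path = os.path.join(tempfile.mkdtemp(), "gpu_monitoring.db")
conn = sqlite3.connect(path)
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # "wal" once the mode is active
conn.close()
```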

Cache Size

```toml
[storage]
# Default: 20000 KB (20 MB)
# High throughput: 50000-100000 KB
# Memory-constrained: 10000 KB
cache-size-kb = 20000
```
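If the backend is SQLite, a KB-denominated cache setting plausibly maps to SQLite's `cache_size` pragma, where a negative value means a size in KiB rather than a page count. This mapping is an assumption, sketched below:

```python
import sqlite3

# Assumed mapping of cache-size-kb onto SQLite's cache_size pragma:
# a negative value sets the page cache in KiB instead of pages.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA cache_size = -20000")  # ~20 MB of page cache
size = conn.execute("PRAGMA cache_size").fetchone()[0]
conn.close()
```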

Database Location

Place the database on fast storage for best results.

```toml
[storage]
# NVMe SSD recommended; avoid HDD
database-path = "/fast/ssd/gpu_monitoring.db"
```

Leak Detection Tuning

Minimum Samples

Fewer samples yield faster detection but more false positives; more samples yield slower but more accurate detection.

```toml
[detection]
# Fast detection: 10-15
# Balanced (default): 20
# Accurate: 30-50
minimum-samples = 20
```

Analysis Window

```toml
[detection]
# Quick detection: 0.5 hours
# Default: 1.0 hours
# Conservative: 2-4 hours
analysis-window-hours = 1.0
```

Scenario-Based Configurations

Development Environment

Goal: Quick feedback, maximum detail

```toml
[monitoring]
interval-seconds = 5
history-size = 2000
enable-process-attribution = true

[detection]
minimum-samples = 15
confidence-threshold = 0.7
analysis-window-hours = 0.5
```

Production ML Training

Goal: Minimal overhead, accurate long-term detection

```toml
[monitoring]
interval-seconds = 120
history-size = 500
enable-process-attribution = false

[detection]
minimum-samples = 30
confidence-threshold = 0.85
analysis-window-hours = 4.0

[storage]
wal-mode = true
cache-size-kb = 50000
```

Multi-GPU Server (32 GPUs)

Goal: Monitor many GPUs efficiently

```toml
[monitoring]
interval-seconds = 60
history-size = 300
enable-process-attribution = false
batch-insert-size = 1000

[storage]
wal-mode = true
cache-size-kb = 100000
connection-pool-size = 50

[detection]
minimum-samples = 50
analysis-window-hours = 12.0
```

Real-Time Leak Detection

Goal: Fastest possible leak detection

```toml
[monitoring]
interval-seconds = 5
history-size = 5000
enable-process-attribution = true

[detection]
minimum-samples = 10
confidence-threshold = 0.7
analysis-window-hours = 0.25
enable-multiple-strategies = true

[storage]
wal-mode = true
cache-size-kb = 100000
```

Best Practices

  1. Start with defaults — tune only when needed
  2. Monitor before and after — measure the impact of each change
  3. Tune one parameter at a time — isolate effects
  4. Document changes — track what you changed and why
  5. Test under load — verify performance under realistic conditions
  6. Use profiling tools — don't guess, measure
  7. Review regularly — performance needs change over time

Troubleshooting

High CPU Usage

Increase interval-seconds, disable enable-process-attribution, or reduce history-size.

High Memory Usage

Reduce history-size, enable snapshot-retention-days, or reduce cache-size-kb.

Slow Database Queries

Enable WAL mode, run VACUUM, increase cache-size-kb, or apply retention policies.
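Assuming the SQLite backend suggested by the WAL option, the VACUUM step can be sketched as follows; the schema and path here are illustrative, not gpulse's actual schema. VACUUM rebuilds the database file, reclaiming the free pages left behind after a retention policy deletes old rows.

```python
import os
import sqlite3
import tempfile

# Illustrative VACUUM demo on a throwaway database: bulk-insert rows,
# delete them (as a retention policy would), then reclaim the space.
path = os.path.join(tempfile.mkdtemp(), "gpu_monitoring.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE snapshots (ts REAL, used_mb REAL)")
conn.executemany("INSERT INTO snapshots VALUES (?, ?)",
                 [(i, 1024.0) for i in range(10_000)])
conn.execute("DELETE FROM snapshots")  # e.g. retention policy ran
conn.commit()
before = os.path.getsize(path)
conn.execute("VACUUM")  # rewrites the file without the free pages
after = os.path.getsize(path)
conn.close()
print(after < before)  # the file shrinks after VACUUM
```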