Metrics Overview
This document describes how rouser collects system metrics from various sources.
Overview
rouser monitors four main categories of system metrics:
| Metric | Source | Polling | Description |
|---|---|---|---|
| CPU Usage | /proc/stat | Top-level update_interval config | Per-core maximum and total average CPU usage (frequency-weighted) |
| GPU Usage | NVML (libnvidia-ml.so) or /sys/class/drm/ | Configurable via top-level update_interval | GPU utilization percentage per device |
| Network I/O | /proc/net/dev | Configurable via top-level update_interval | Network throughput in Mbps |
| Disk Activity | /proc/diskstats | Configurable via top-level update_interval | Disk read/write in MB/s |
All metrics are collected asynchronously using Tokio's async runtime.
CPU Usage
Data Source: /proc/stat
rouser reads CPU statistics from the virtual /proc/stat file, which provides system-wide CPU usage information.
File Format
cpu 234567 12345 2345678 789012345 123456 1234 567 8901 0 0
cpu0 23456 1234 234567 789012 12345 123 56 789 0 0
cpu1 23456 1234 234567 789012 12345 123 56 789 0 0
The first line (starting with cpu) contains aggregate statistics for all cores. Subsequent lines (cpu0, cpu1, etc.) contain per-core statistics.
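Parsing one of these lines can be sketched as follows; `parse_cpu_line` is an illustrative helper, not rouser's actual parser:

```rust
// Illustrative parser for one /proc/stat "cpu" line (not rouser's real code).
// Returns the jiffy counters as a vector, or None for non-CPU lines.
fn parse_cpu_line(line: &str) -> Option<Vec<u64>> {
    let mut fields = line.split_whitespace();
    // The label is "cpu" for the aggregate line, "cpu0", "cpu1", ... per core
    if !fields.next()?.starts_with("cpu") {
        return None;
    }
    // Any unparsable field makes the whole line invalid
    fields.map(|f| f.parse::<u64>().ok()).collect()
}

fn main() {
    let line = "cpu 234567 12345 2345678 789012345 123456 1234 567 8901 0 0";
    let jiffies = parse_cpu_line(line).expect("valid cpu line");
    assert_eq!(jiffies[0], 234567); // user jiffies
    println!("parsed {} fields", jiffies.len());
}
```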
Fields (Verified Against Kernel Documentation)
| Field | Description |
|---|---|
| user | Normal processes executing in user mode (jiffies) |
| nice | Niced processes executing in user mode (jiffies) |
| system | Processes executing in kernel mode (jiffies) |
| idle | Time spent in the idle task (jiffies) |
| iowait | Time waiting for I/O to complete (jiffies) |
| irq | Time servicing hardware interrupts (jiffies) |
| softirq | Time servicing software interrupts (jiffies) |
| steal | Time stolen by the hypervisor while running in a virtual machine (jiffies) |
| guest | Time spent running a guest OS (jiffies, Linux 2.6.24+) |
| guest_nice | Time spent running a niced guest OS (jiffies, Linux 2.6.33+) |
Calculation Method
rouser uses a two-sample delta approach to calculate CPU usage:
- Read current values from /proc/stat
- Wait for the polling interval (set via top-level update_interval)
- Read new values
- Calculate the difference between samples
- Compute the usage percentage based on idle vs non-idle time
Formula
total_time = user + nice + system + idle + iowait + irq + softirq + steal + guest
non_idle_time = user + nice + system + iowait + irq + softirq + steal + guest
cpu_usage = (non_idle_time / total_time) * 100.0
Note: guest_nice is not added separately because the kernel already accounts niced guest time within nice.
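The formula above can be transcribed directly; `CpuSample` and `cpu_usage_percent` are illustrative names, not rouser's internal types:

```rust
// Sketch of the two-sample delta formula. Field names mirror /proc/stat.
#[derive(Clone, Copy)]
struct CpuSample {
    user: u64, nice: u64, system: u64, idle: u64,
    iowait: u64, irq: u64, softirq: u64, steal: u64, guest: u64,
}

impl CpuSample {
    fn total(&self) -> u64 {
        self.user + self.nice + self.system + self.idle + self.iowait
            + self.irq + self.softirq + self.steal + self.guest
    }
    // non_idle_time is every counter except idle
    fn non_idle(&self) -> u64 {
        self.total() - self.idle
    }
}

fn cpu_usage_percent(prev: CpuSample, curr: CpuSample) -> f64 {
    let delta_total = curr.total().saturating_sub(prev.total());
    let delta_non_idle = curr.non_idle().saturating_sub(prev.non_idle());
    if delta_total == 0 {
        0.0 // no time elapsed between samples
    } else {
        (delta_non_idle as f64 / delta_total as f64) * 100.0
    }
}

fn main() {
    let prev = CpuSample {
        user: 0, nice: 0, system: 0, idle: 0,
        iowait: 0, irq: 0, softirq: 0, steal: 0, guest: 0,
    };
    // 50 non-idle jiffies out of 100 elapsed
    let curr = CpuSample { user: 50, idle: 50, ..prev };
    assert_eq!(cpu_usage_percent(prev, curr), 50.0);
    println!("cpu usage: 50%");
}
```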
Edge Cases
First Sample
On the first read, there is no baseline for comparison. rouser:
- Initializes with the first sample
- Returns 0% usage for the first poll cycle
- Begins actual calculation from the second sample
Jiffies Overflow
On long-running systems, the jiffy counters can overflow (wrap back to 0 after reaching their maximum value). rouser handles this by:
- Parsing jiffies as u64 (a 64-bit counter at 100 Hz would take billions of years to wrap, but 32-bit counters and other sources can wrap far sooner)
- Detecting wraparound when a new reading is smaller than the previous one (the counters are monotonic and unsigned, so a naive subtraction would underflow)
- Applying the correction: correction = u64::MAX - prev + curr + 1
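The correction is exactly u64 wrapping subtraction, which can be expressed as a one-line helper (`counter_delta` is illustrative, not rouser's exact code):

```rust
// When curr < prev (the counter wrapped), curr.wrapping_sub(prev)
// evaluates to u64::MAX - prev + curr + 1, i.e. the true delta.
fn counter_delta(prev: u64, curr: u64) -> u64 {
    curr.wrapping_sub(prev)
}

fn main() {
    assert_eq!(counter_delta(100, 250), 150); // normal case
    // Counter wrapped: 5 ticks before the wrap plus 10 after = 15
    assert_eq!(counter_delta(u64::MAX - 4, 10), 15);
    println!("wraparound handled");
}
```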
Implementation
pub struct CpuCollector {
    last_stats: Option<CpuStats>,
    last_time: Option<SystemTime>,
}

impl CpuCollector {
    /// Returns aggregate CPU usage since the previous poll, or 0.0 on the
    /// first sample (no baseline yet).
    pub async fn collect_cpu_usage(&mut self) -> f64 {
        let stats = self.read_stats().await;
        let now = SystemTime::now();
        let usage = match (&self.last_stats, &self.last_time) {
            (Some(prev), Some(_prev_time)) => {
                // total_delta / non_idle_delta are module-level helpers
                let delta_total = total_delta(&stats, prev);
                let delta_non_idle = non_idle_delta(&stats, prev);
                if delta_total > 0 {
                    (delta_non_idle as f64 / delta_total as f64) * 100.0
                } else {
                    0.0
                }
            }
            // First sample: no baseline to diff against
            _ => 0.0,
        };
        // Store the current sample as the baseline for the next poll
        self.last_stats = Some(stats);
        self.last_time = Some(now);
        usage
    }
}
GPU Usage
Data Sources by GPU Type
rouser supports multiple GPU vendors with different data sources.
NVIDIA GPUs
Primary Source: NVML library (libnvidia-ml.so) — the same API used by nvidia-smi and nvtop
# NVML is loaded dynamically from the NVIDIA driver package
# No separate binary required — accessed via libnvidia-ml.so.1
ldconfig -p | grep libnvidia-ml
Implementation:
- Dynamically loads libnvidia-ml.so at runtime (graceful fallback if unavailable)
- Enumerates GPUs by index, matches to sysfs cards via PCI bus ID
- Uses nvmlDeviceGetUtilizationRates() for per-GPU compute utilization
- Falls back to 0% if NVML is unavailable or GPU not supported
AMD/Intel GPUs
Primary Source: /sys/class/drm/cardX/device/gpu_busy_percent
# Check GPU usage (if available)
cat /sys/class/drm/card0/device/gpu_busy_percent
# AMD: rocm-smi provides similar data via ROCm tools
rocm-smi --showgpuutilization
Requirements:
- AMD amdgpu drivers or Intel i915/xe driver installed
- Access to /sys/class/drm/cardX/device/gpu_busy_percent (readable by the rouser process user)
Implementation:
- Scans /sys/class/drm/ directory for GPU devices
- Reads gpu_busy_percent from each device
- Reports per-device utilization independently (no averaging)
- Falls back to 0% if no GPUs detected
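Reading the sysfs interface can be sketched as below; error handling is simplified compared to a real collector, and the helper names are illustrative:

```rust
use std::fs;

// Parse the contents of a gpu_busy_percent file ("42\n" -> 42)
fn parse_busy_percent(raw: &str) -> Option<u8> {
    raw.trim().parse().ok()
}

// Read utilization for one card via the amdgpu/i915 sysfs attribute
fn gpu_busy_percent(card: &str) -> Option<u8> {
    let path = format!("/sys/class/drm/{}/device/gpu_busy_percent", card);
    parse_busy_percent(&fs::read_to_string(path).ok()?)
}

fn main() {
    match gpu_busy_percent("card0") {
        Some(pct) => println!("card0: {}%", pct),
        // Matches the documented fallback: treat missing data as 0% usage
        None => println!("card0: gpu_busy_percent unavailable, falling back to 0%"),
    }
}
```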
Driver Measurement Differences
NVML, amdgpu, and i915 all report a 0–100% value but measure different things under the hood. NVIDIA's SM kernel utilization, AMD's aggregate IP core activity via SMU firmware, and Intel's GT engine ticks are not directly comparable as percentages. See GPU Usage Measurement for a detailed breakdown of what each driver reports and why this doesn't affect rouser's sleep inhibition behavior in practice.
Aggregation Strategy — Per-Device Reporting Over Averaging
rouser reports each physical GPU individually rather than aggregating across devices. Each detected GPU is compared independently against the configured threshold:
card0(nvidia): 95% ← above 90% threshold → inhibits sleep
card1(amdgpu): 78% ← below 90% threshold → does not inhibit alone
A single GPU exceeding its threshold triggers inhibition regardless of other GPUs' states. This provides accurate per-GPU logging and prevents one low-usage card from masking a high-usage card's activity.
EMA Smoothing: Each device has independent EMA smoothing applied to its readings before comparison against the threshold. The ema_alpha value in [metrics.gpu] controls smoothing strength uniformly across all GPUs.
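A minimal sketch of per-device smoothing, assuming the standard EMA recurrence `smoothed = alpha * sample + (1 - alpha) * smoothed` (the `Ema` struct is illustrative, not rouser's internal type):

```rust
// One Ema instance per GPU; alpha corresponds to ema_alpha in [metrics.gpu].
struct Ema {
    alpha: f64,
    value: Option<f64>,
}

impl Ema {
    fn new(alpha: f64) -> Self {
        Ema { alpha, value: None }
    }

    // Fold a new utilization sample into the running average
    fn update(&mut self, sample: f64) -> f64 {
        let next = match self.value {
            Some(prev) => self.alpha * sample + (1.0 - self.alpha) * prev,
            None => sample, // the first sample seeds the average
        };
        self.value = Some(next);
        next
    }
}

fn main() {
    let mut ema = Ema::new(0.5);
    assert_eq!(ema.update(100.0), 100.0); // seeded with first sample
    assert_eq!(ema.update(0.0), 50.0);    // 0.5*0 + 0.5*100
    assert_eq!(ema.update(0.0), 25.0);
    println!("smoothed: 25%");
}
```

A higher alpha weights recent samples more heavily (less smoothing); a lower alpha reacts more slowly to spikes.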
Network I/O
Data Source: /proc/net/dev
rouser reads network statistics from the virtual /proc/net/dev file.
File Format
Inter-| Receive | Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
lo: 1234567 12345 0 0 0 0 0 0 1234567 12345 0 0 0 0 0 0
eth0: 9876543 987654 0 0 0 0 0 0 8765432 876543 0 0 0 0 0 0
Calculation Method
- Read current byte/packet counts for each interface
- Wait for polling interval
- Read new values
- Calculate delta (bytes transferred during interval)
- Convert to Mbps:
(delta_bytes * 8) / interval_seconds / 1,000,000
Formula
bytes_transferred = current_bytes - previous_bytes
interval_seconds = current_time - previous_time
network_io_mbps = (bytes_transferred * 8.0) / interval_seconds / 1_000_000.0
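The formula above transcribes directly into Rust; `network_io_mbps` is an illustrative helper, with wrapping subtraction guarding against counter wraparound between samples:

```rust
// Convert a byte-counter delta over an interval into megabits per second.
fn network_io_mbps(prev_bytes: u64, curr_bytes: u64, interval_seconds: f64) -> f64 {
    let bytes_transferred = curr_bytes.wrapping_sub(prev_bytes);
    (bytes_transferred as f64 * 8.0) / interval_seconds / 1_000_000.0
}

fn main() {
    // 1,250,000 bytes in 1 s = 10,000,000 bits/s = 10 Mbps
    assert_eq!(network_io_mbps(0, 1_250_000, 1.0), 10.0);
    println!("10 Mbps");
}
```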
Interface Filtering
rouser supports two filtering strategies:
Exclude Interfaces
No interfaces are excluded by default — all available interfaces are monitored. To exclude specific interfaces (e.g., loopback):
[network]
exclude_interfaces = ["lo"]
Include Only Specific Interfaces
Monitor only specified interfaces:
[network]
include_interfaces = ["eth0", "ens192"]
Note: If both include_interfaces and exclude_interfaces are specified, include_interfaces takes precedence.
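The precedence rule can be sketched as a single predicate: a non-empty include list wins outright, otherwise the exclude list applies (`should_monitor` and its signature are illustrative):

```rust
// Decide whether an interface participates in network metrics.
fn should_monitor(iface: &str, include: &[&str], exclude: &[&str]) -> bool {
    if !include.is_empty() {
        include.contains(&iface) // include list takes precedence
    } else {
        !exclude.contains(&iface)
    }
}

fn main() {
    // Listed in both: include wins
    assert!(should_monitor("eth0", &["eth0"], &["eth0"]));
    assert!(!should_monitor("lo", &[], &["lo"]));
    assert!(should_monitor("eth0", &[], &["lo"]));
    println!("filtering ok");
}
```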
Edge Cases
Interface Changes
If an interface disappears between polls:
- rouser continues with available interfaces
- Logs a warning at debug level
- Does not fail the metric collection
No Interfaces Available
If no interfaces are available (highly unlikely):
- Returns 0 Mbps network usage
- Logs an error
- Continues operation
Disk Activity
Data Source: /proc/diskstats
rouser reads disk statistics from the virtual /proc/diskstats file.
File Format
   9       0 md0 1000 100 100000 500 5000 200 500000 2000 0 1500 2500
   9       1 md1 500 50 50000 250 2500 100 250000 1000 0 750 1250
Fields (simplified): major minor name reads_completed reads_merged sectors_read time_reading_ms writes_completed writes_merged sectors_written time_writing_ms ... — rouser uses the sectors_read and sectors_written counters.
Calculation Method
- Read current sector counts for each device
- Wait for polling interval
- Read new values
- Calculate delta (sectors transferred)
- Convert to MB/s:
(delta_sectors * 512) / interval_seconds / 1,000,000
Formula
sectors_transferred = current_sectors - previous_sectors
bytes_transferred = sectors_transferred * 512 # 512 bytes per sector
interval_seconds = current_time - previous_time
disk_activity_mbps = (bytes_transferred as f64) / interval_seconds / 1_000_000.0
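As with network I/O, the formula transcribes directly; `disk_activity_mbps` here is an illustrative helper. Note that /proc/diskstats sector counts are always in 512-byte units, regardless of the device's native sector size:

```rust
// Convert a sector-counter delta over an interval into MB/s.
fn disk_activity_mbps(prev_sectors: u64, curr_sectors: u64, interval_seconds: f64) -> f64 {
    // /proc/diskstats sectors are fixed 512-byte units
    let bytes_transferred = curr_sectors.wrapping_sub(prev_sectors) * 512;
    bytes_transferred as f64 / interval_seconds / 1_000_000.0
}

fn main() {
    // 2000 sectors * 512 B = 1,024,000 B in 1 s = 1.024 MB/s
    assert!((disk_activity_mbps(0, 2000, 1.0) - 1.024).abs() < 1e-9);
    println!("1.024 MB/s");
}
```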
Device Filtering
rouser excludes virtual/simulated devices by default:
[disk]
exclude_device_prefixes = ["loop", "fd", "sr", "cdrom"]
Excluded Devices
| Prefix | Type | Reason |
|---|---|---|
| loop | Loop device | Virtual block device for disk images |
| fd | Floppy drive | Legacy device (rarely active) |
| sr | SCSI CD-ROM | Optical drive (rarely active) |
| cdrom | CD-ROM | Optical drive (rarely active) |
Included Devices
| Prefix | Type | Notes |
|---|---|---|
| sdX | SATA/SAS disk | Real storage devices |
| nvmeX | NVMe SSD | Real storage devices |
| dm-X | Device mapper/LVM | LVM volumes (real storage) |
| vdX | VirtIO disk | KVM/virtio block devices |
Edge Cases
Device Changes
If a device disappears between polls:
- rouser continues with available devices
- Logs a warning at debug level
- Does not fail the metric collection
No Devices Available
If no devices are available (highly unlikely):
- Returns 0 MB/s disk activity
- Logs an error
- Continues operation
Collection Timing
Polling Interval
The polling interval is a top-level config field:
update_interval = "1s" # Time between metric collection cycles (default)
log_level = "info"
...
Trade-offs:
- Shorter interval: more responsive to activity, higher CPU usage
- Longer interval: lower CPU usage, less responsive
Asynchronous Collection
rouser uses Tokio's async runtime to collect metrics concurrently:
// Collect all metrics concurrently
let (cpu, gpu, network, disk) = tokio::join!(
cpu_collector.collect(),
gpu_collector.collect(),
network_collector.collect(),
disk_collector.collect()
);
This approach minimizes the time spent collecting metrics, reducing the impact on system performance.
Error Handling
Graceful Degradation
If a metric source is unavailable:
| Metric | Fallback Behavior |
|---|---|
| CPU | Always available via /proc/stat |
| GPU | Returns 0% usage, logs warning |
| Network | Monitors available interfaces, continues |
| Disk | Monitors available devices, continues |
Logging
Errors are logged at appropriate levels:
// Warning for non-critical issues
warn!("GPU metrics unavailable, using 0%");
// Error for critical issues
error!("No disk devices available");
Performance Considerations
Resource Usage
| Metric | Memory Impact | CPU Impact | Disk I/O |
|---|---|---|---|
| CPU | ~100 bytes | Negligible | None (/proc) |
| GPU | ~1 KB | Very low (NVML library call, no subprocess) | None |
| Network | ~100 bytes per interface | Negligible | None (/proc) |
| Disk | ~50 bytes per device | Negligible | None (/proc) |
Optimization Tips
To reduce resource usage:
# Increase polling interval (configurable via top-level update_interval)
update_interval = "10s" # Less responsive but lower overhead
[metrics.network]
exclude_interfaces = ["lo", "docker0", "virbr0"] # Exclude virtual interfaces
[metrics.disk]
exclude_device_prefixes = ["loop", "fd", "sr", "cdrom", "nbd"] # Exclude more virtual devices
See Also
- GPU Usage Measurement — Driver measurement differences (NVML vs amdgpu vs i915/xe)
- Averaging Explained - Understanding threshold calculations with EMA
- Configuration Reference - Configuration options for metrics