
System Hardware Monitoring

During experiment tracking, SwanLab automatically monitors machine hardware resources and records them in the System Charts. Currently supported hardware:

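| Hardware | Info Logging | Resource Monitoring | Script |
| --- | --- | --- | --- |
| NVIDIA GPU | | | nvidia.py |
| Ascend NPU | | | ascend.py |
| Cambricon MLU | | | cambricon.py |
| Kunlunxin XPU | | | kunlunxin.py |
| CPU | | | cpu.py |
| Memory | | | memory.py |
| Disk | | | disk.py |
| Network | | | network.py |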

System Monitoring Metrics

SwanLab automatically monitors hardware resources on the machine running the experiment and generates charts for each metric, displayed under the System Charts tab.

Sampling Strategy & Frequency: SwanLab adjusts how often it samples hardware data as an experiment progresses, balancing chart granularity against monitoring overhead. The sampling interval depends on how many data points have already been collected:

| Data Points Collected | Sampling Frequency |
| --- | --- |
| 0~10 | Every 10 seconds |
| 10~50 | Every 30 seconds |
| 50+ | Every 60 seconds |
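The schedule above can be sketched as a small lookup function. This is an illustrative sketch rather than SwanLab's own code: the function names are hypothetical, and because the ranges in the table share their endpoints, the choice to move to the next interval exactly at 10 and at 50 collected points is an assumption.

```python
def sampling_interval_seconds(points_collected: int) -> int:
    """Return the wait (in seconds) before the next hardware sample.

    Boundary handling is an assumption: the 10th and 50th points
    are treated as the start of the next, slower band.
    """
    if points_collected < 0:
        raise ValueError("points_collected must be non-negative")
    if points_collected < 10:
        return 10
    if points_collected < 50:
        return 30
    return 60


def total_wait_seconds(total_points: int) -> int:
    """Sum the waits for `total_points` samples, one wait per sample."""
    return sum(sampling_interval_seconds(n) for n in range(total_points))
```

Under these assumptions, the first 10 points take 100 seconds of waiting and the first 50 take 1,300 seconds, so long experiments produce far fewer system data points per hour than short ones.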

SwanLab monitors GPU, NPU, MLU, XPU, CPU, system memory, disk, and network metrics relevant to the training process. Each component is described in detail below.

GPU (NVIDIA)

On multi-GPU machines, each GPU's metrics are recorded separately and displayed as individual lines in the charts.

| Metric | Description |
| --- | --- |
| GPU Memory Allocated (%) | GPU memory utilization: percentage of VRAM used. |
| GPU Memory Allocated (MB) | GPU memory usage: VRAM consumption in MB. The chart's Y-axis is capped at the maximum VRAM across GPUs. |
| GPU Utilization (%) | GPU utilization: percentage of computational resources used. |
| GPU Temperature (℃) | GPU temperature in Celsius. |
| GPU Power Usage (W) | GPU power consumption in watts. |
| GPU Time Spent Accessing Memory (%) | Memory access time: percentage of time spent accessing VRAM. |
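To make these metrics concrete, the sketch below reads comparable per-GPU values through NVIDIA's NVML Python bindings (`pynvml`). It is not SwanLab's `nvidia.py`; the function name and dictionary keys are hypothetical, and it assumes `pynvml` is installed. When the package or the NVIDIA driver is missing, it returns an empty list.

```python
from typing import Dict, List


def read_nvidia_metrics() -> List[Dict]:
    """Return one dict of metrics per visible NVIDIA GPU (empty if unavailable)."""
    try:
        import pynvml  # assumption: NVML bindings are installed
    except ImportError:
        return []
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return []  # no driver or no NVML library on this machine

    metrics = []
    try:
        for index in range(pynvml.nvmlDeviceGetCount()):
            try:
                handle = pynvml.nvmlDeviceGetHandleByIndex(index)
                mem = pynvml.nvmlDeviceGetMemoryInfo(handle)       # bytes
                util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent
                temp_c = pynvml.nvmlDeviceGetTemperature(
                    handle, pynvml.NVML_TEMPERATURE_GPU
                )
            except pynvml.NVMLError:
                continue  # skip devices that cannot be queried
            try:
                # NVML reports power in milliwatts.
                power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
            except pynvml.NVMLError:
                power_w = None  # not every GPU exposes power readings
            metrics.append({
                "index": index,
                "memory_percent": mem.used / mem.total * 100.0 if mem.total else 0.0,
                "memory_mb": mem.used / (1024 * 1024),
                "utilization_percent": util.gpu,
                "memory_access_time_percent": util.memory,
                "temperature_c": temp_c,
                "power_w": power_w,
            })
    finally:
        pynvml.nvmlShutdown()
    return metrics
```

Each list entry corresponds to one GPU, which mirrors how the charts draw one line per device.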

NPU (Ascend)

On multi-NPU machines, each NPU's metrics are recorded separately.

| Metric | Description |
| --- | --- |
| NPU Utilization (%) | NPU computational utilization. |
| NPU Memory Allocated (%) | NPU memory utilization. |
| NPU Temperature (℃) | NPU temperature in Celsius. |

MLU (Cambricon)

On multi-MLU machines, each MLU's metrics are recorded separately.

| Metric | Description |
| --- | --- |
| MLU Utilization (%) | MLU computational utilization. |
| MLU Memory Allocated (%) | MLU memory utilization. |
| MLU Temperature (℃) | MLU temperature in Celsius. |
| MLU Power (W) | MLU power draw in watts. |

XPU (Kunlunxin)

On multi-XPU machines, each XPU's metrics are recorded separately.

| Metric | Description |
| --- | --- |
| XPU Utilization (%) | XPU computational utilization. |
| XPU Memory Allocated (%) | XPU memory utilization. |
| XPU Temperature (℃) | XPU temperature in Celsius. |
| XPU Power (W) | XPU power draw in watts. |

CPU

| Metric | Description |
| --- | --- |
| CPU Utilization (%) | CPU computational utilization. |
| Process CPU Threads | Thread count used by the experiment. |

Memory

| Metric | Description |
| --- | --- |
| System Memory Utilization (%) | System-wide memory usage percentage. |
| Process Memory In Use (non-swap) (MB) | Physical memory (excluding swap) consumed by the process. |
| Process Memory Utilization (MB) | Allocated memory (including swap) for the process. |
| Process Memory Available (non-swap) (MB) | Available physical memory (excluding swap) for the process. |

Disk

| Metric | Description |
| --- | --- |
| Disk IO Utilization (MB) | Disk I/O throughput in MB/s (read/write shown separately). |
| Disk Utilization (%) | Disk usage percentage. |

On Linux, SwanLab monitors usage of the root directory (/); on Windows, it monitors the system drive (typically C:).
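The platform rule above can be approximated with the Python standard library alone. This is a minimal sketch, not SwanLab's `disk.py`: the function names are hypothetical, reading the Windows system drive from the `SystemDrive` environment variable is an assumption, and other Unix-like systems are assumed to use `/` as on Linux.

```python
import os
import shutil
import sys


def monitored_disk_path() -> str:
    """Pick the path whose usage is reported, following the rule above."""
    if sys.platform.startswith("win"):
        # Assumption: the system drive is exposed via %SystemDrive%.
        return os.environ.get("SystemDrive", "C:") + "\\"
    return "/"


def disk_utilization_percent(path: str) -> float:
    """Return used space as a percentage of total space for `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total * 100.0


if __name__ == "__main__":
    path = monitored_disk_path()
    print(f"{path}: {disk_utilization_percent(path):.1f}% used")
```

Other tools may compute the percentage slightly differently (for example, by excluding space reserved for the superuser), so small discrepancies with the chart would not be surprising.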

Network

| Metric | Description |
| --- | --- |
| Network Traffic (KB) | Network I/O throughput in KB/s (receive/transmit shown separately). |

Receive and transmit rates are measured in KB/s and plotted as separate lines.
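A throughput figure in KB/s is typically derived from two readings of a cumulative byte counter taken one sampling interval apart. The sketch below shows that arithmetic under stated assumptions: the function name is hypothetical, 1 KB is taken as 1024 bytes, and a counter that goes backwards (for example, after an interface reset) is clamped to zero. How SwanLab itself handles these cases is not specified here.

```python
def rate_kb_per_s(prev_bytes: int, curr_bytes: int, interval_s: float) -> float:
    """Convert two cumulative byte counts into an average rate in KB/s."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Assumption: clamp negative deltas caused by counter resets.
    delta_bytes = max(curr_bytes - prev_bytes, 0)
    return delta_bytes / 1024 / interval_s


# Example: 512,000 bytes received over a 10-second sampling interval.
print(rate_kb_per_s(1_000_000, 1_512_000, 10))  # 50.0
```

The same function would be applied once to the receive counter and once to the transmit counter to obtain the two separate lines.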