# Logging Distributed Training Metrics
SwanLab supports logging metrics from distributed training experiments, helping you track training progress across multiple GPUs or machines.
## Distributed Training Scenarios
In distributed training, the workload typically runs simultaneously across multiple processes, GPUs, or even multiple machines. SwanLab provides the following ways to log these training metrics:

- Log only from the main process: log metrics only from rank 0 (the main/coordinator process)
- Log separately from each process: each process creates its own experiment, and the `group` parameter associates them under the same experiment group

Note: SwanLab will support logging all processes into a single experiment in the future. Currently, each process must be logged as a separate experiment.
## Method 1: Log Only from Main Process
In distributed training frameworks such as PyTorch DDP, you typically only need to log metrics from the main process (rank 0), since gradients and parameters are synchronized across processes and the metrics computed on rank 0 are representative of the whole run.
```python
import os

import torch
import torch.distributed as dist

import swanlab

# Launch with: torchrun --nproc_per_node=<num_gpus> train.py


def main():
    # Initialize distributed training (LOCAL_RANK is set by torchrun)
    dist.init_process_group(backend='nccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)

    # Initialize SwanLab only on the main process
    if local_rank == 0:
        swanlab.init(
            project="distributed_training",
            experiment_name="distributed_training",
        )

    # Training loop
    for epoch in range(10):
        # ... training code that produces `loss` and `accuracy` ...

        # Log metrics only on the main process
        if local_rank == 0:
            swanlab.log({
                "loss": loss.item(),
                "accuracy": accuracy.item()
            })

    # Training complete
    if local_rank == 0:
        swanlab.finish()


if __name__ == "__main__":
    main()
```

## Method 2: Log Separately from Each Process
Each process creates its own experiment, using the `group` parameter to associate them under the same experiment group and the `job_type` parameter to distinguish different types of nodes (e.g., "main" and "worker").
```python
import os
import uuid

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

import swanlab


def setup(rank, world_size):
    """Initialize the distributed training environment."""
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)


def cleanup():
    """Clean up the distributed training environment."""
    dist.destroy_process_group()


def train(rank, world_size, group_name):
    """Training function run by each process."""
    setup(rank, world_size)
    torch.cuda.set_device(rank)

    # Each process initializes its own experiment
    swanlab.init(
        project="distributed_training",
        experiment_name="distributed_experiment",
        group=group_name,  # The same group name associates the experiments under one group
        job_type="main" if rank == 0 else "worker"  # Distinguish the main process from workers
    )

    # Training loop
    for epoch in range(10):
        # ... training code that produces `loss` ...

        # Each process logs its own metrics
        swanlab.log({
            "loss": loss.item(),
            "epoch": epoch,
            "rank": rank
        })

    swanlab.finish()
    cleanup()


def main():
    world_size = 2  # Use 2 processes
    # Generate a shared group name up front; no SwanLab run exists yet in the parent process
    group_name = f"exp_{uuid.uuid4().hex[:8]}"

    # Spawn one training process per GPU
    mp.spawn(
        train,
        args=(world_size, group_name),
        nprocs=world_size,
        join=True
    )


if __name__ == "__main__":
    main()
```

## Using Environment Variables
In addition to setting these options in code, you can also configure distributed training logging through environment variables:
```python
import os

# Set in the main process; spawned worker processes inherit these variables
os.environ["SWANLAB_GROUP"] = "distributed_exp_001"
os.environ["SWANLAB_JOB_TYPE"] = "train"

import swanlab

swanlab.init(experiment_name="distributed_experiment")
```

Supported distributed training environment variables:
| Environment Variable | Description |
|---|---|
| `SWANLAB_GROUP` | Experiment group name, used to associate multiple experiments |
| `SWANLAB_JOB_TYPE` | Job type, e.g., "train", "eval", "inference" |
| `SWANLAB_NAME` | Experiment name |
| `SWANLAB_DESCRIPTION` | Experiment description |
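As a quick reference, the sketch below sets all four variables before initialization; the specific values are only illustrative:

```python
import os

# Illustrative values; set these before calling swanlab.init() in each process
os.environ["SWANLAB_GROUP"] = "distributed_exp_001"
os.environ["SWANLAB_JOB_TYPE"] = "train"
os.environ["SWANLAB_NAME"] = "distributed_experiment_rank0"
os.environ["SWANLAB_DESCRIPTION"] = "DDP run with 2 GPUs on a single node"

import swanlab

swanlab.init(project="distributed_training")
```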
## Multi-Node Training
For training across multiple machines, each process on each machine should:

- Use the same `group` name
- Use a meaningful `job_type` to distinguish different roles (e.g., "node_0", "node_1")
```python
import os

import swanlab

# Get identifiers for the current node and process
node_id = os.environ.get("NODE_RANK", "0")
rank = os.environ.get("RANK", "0")

swanlab.init(
    experiment_name="multi_node_training",
    group=f"distributed_exp_{os.environ.get('EXP_ID', '001')}",
    job_type=f"node_{node_id}_rank_{rank}"
)

swanlab.log({"node": int(node_id), "rank": int(rank)})
swanlab.finish()
```

## FAQ
### 1. Metric Logging Order
SwanLab does not guarantee the order in which metrics from multiple processes are recorded. If ordering matters, handle synchronization at the application layer.
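For example, one common pattern is to aggregate a metric across ranks and log it only from rank 0. The sketch below assumes PyTorch distributed is already initialized and SwanLab is initialized on rank 0; the `log_averaged_loss` helper name is ours:

```python
import torch
import torch.distributed as dist

import swanlab


def log_averaged_loss(loss_value: float, step: int):
    """Average a scalar across all ranks and log it from rank 0 only."""
    loss_tensor = torch.tensor([loss_value], device="cuda")
    dist.all_reduce(loss_tensor, op=dist.ReduceOp.SUM)  # Sum the value across all processes
    loss_tensor /= dist.get_world_size()                # Turn the sum into a mean

    if dist.get_rank() == 0:
        swanlab.log({"loss_avg": loss_tensor.item()}, step=step)
```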
### 2. Program Hangs
If the program appears stuck, make sure `swanlab.finish()` is called in every process that called `swanlab.init()`, so that logging is shut down cleanly.
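A simple safeguard is to put `swanlab.finish()` and the process-group cleanup in a `finally` block. This is only a sketch; `run_training` is a hypothetical wrapper around your training loop:

```python
import torch.distributed as dist

import swanlab


def run_training(rank: int):
    """Hypothetical wrapper: guarantee cleanup even if training raises."""
    if rank == 0:
        swanlab.init(project="distributed_training")
    try:
        # ... training loop with swanlab.log(...) calls ...
        pass
    finally:
        if rank == 0:
            swanlab.finish()          # Close the SwanLab run so buffered data is uploaded
        dist.destroy_process_group()  # Tear down the process group
```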
### 3. Resource Usage
In multi-process scenarios, each process creates its own network connection. Ensure the system has sufficient file descriptors and network resources.
## Best Practices
- Use consistent group names: ensure all processes in the same distributed training run use the same `group`
- Set `job_type`: use `job_type` to distinguish different types of processes for easier filtering and analysis
- Log rank information: include the rank in logged metrics to distinguish which process they came from
- End experiments properly: call `swanlab.finish()` after training to ensure data is uploaded correctly