Experiment Tracking#

What to Track#

  • Hyperparameters: lr, batch size, optimizer, model architecture
  • Metrics: train/val loss, accuracy, logged per epoch (or per step)
  • System info: GPU, CUDA version, commit hash
  • Artifacts: model checkpoints, config files, sample outputs
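The system info above can be collected automatically at run start and logged with the rest of the config. A minimal stdlib-only sketch (the helper name `collect_run_info` is illustrative; GPU/CUDA details would come from your framework, e.g. `torch.version.cuda`, and are omitted here):

```python
import platform
import subprocess

def collect_run_info() -> dict:
    """Gather reproducibility metadata to log with every run."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown"  # not inside a git repo, or git not installed
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "commit": commit,
    }

info = collect_run_info()
```

Pass the resulting dict into whatever tracker you use (e.g. as part of the run config) so every run carries its provenance.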

Weights & Biases (wandb)#

import wandb

wandb.init(project="my-project", config={"lr": 1e-3, "batch_size": 64})

for epoch in range(100):
    loss, acc = ...  # loss from your training loop, acc from validation
    wandb.log({"train/loss": loss, "val/acc": acc, "epoch": epoch})

wandb.finish()

Provides: live dashboards, hyperparameter sweeps, artifact versioning.
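The hyperparameter sweeps mentioned above are driven by a YAML spec. A minimal sketch (the metric name and parameter values are illustrative):

```yaml
# sweep.yaml
method: random          # or grid, bayes
metric:
  name: val/acc
  goal: maximize
parameters:
  lr:
    values: [1e-4, 1e-3, 1e-2]
  batch_size:
    values: [32, 64, 128]
```

Register it with `wandb sweep sweep.yaml`, then start one or more workers with `wandb agent <sweep-id>`.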

MLflow#

import mlflow

with mlflow.start_run():
    mlflow.log_params({"lr": 1e-3})
    mlflow.log_metrics({"val_acc": 0.95})
    mlflow.pytorch.log_model(model, "model")  # model: a trained torch.nn.Module

Self-hostable, open source, integrates with Databricks.
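To self-host, start a tracking server and point clients at it. A command sketch (the SQLite path and port are illustrative choices):

```shell
# serve a local tracking server backed by SQLite
mlflow server --backend-store-uri sqlite:///mlflow.db --port 5000
```

In training code, call `mlflow.set_tracking_uri("http://localhost:5000")` before `mlflow.start_run()` so runs land on the server instead of the local `mlruns/` directory.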

TensorBoard#

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('runs/exp1')          # one directory per run
writer.add_scalar('Loss/train', loss, step)  # loss: float, step: global step
writer.add_histogram('weights', model.fc.weight, step)
writer.add_image('sample', img, step)        # img: CHW image tensor
writer.close()

Launch: tensorboard --logdir runs/

Hydra (Config Management)#

# config.yaml
lr: 1.0e-3   # note: plain 1e-3 is parsed as a string by PyYAML, not a float
batch_size: 64
model:
  hidden_dim: 256
  n_layers: 4

# train.py
import hydra
from omegaconf import DictConfig

@hydra.main(config_path=".", config_name="config")
def main(cfg: DictConfig):
    print(cfg.lr, cfg.model.hidden_dim)

if __name__ == "__main__":
    main()

Override from CLI: python train.py lr=1e-4 model.n_layers=6
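Hydra can also sweep over overrides from the CLI with `--multirun` (alias `-m`), launching one job per combination:

```shell
python train.py -m lr=1e-3,1e-4 model.n_layers=4,6   # 2 x 2 = 4 runs
```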

Best Practices#

  • Log every run: disk is cheap, reproducibility is not
  • Use a random seed and log it
  • Save config alongside checkpoint
  • Use git tags or commit hash to link code to experiment
  • Group related experiments in the same project/experiment group
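The seeding bullet can be made concrete with a small helper. A stdlib-only sketch (`seed_everything` is an illustrative name; the framework calls are commented out because they require numpy/torch):

```python
import os
import random

def seed_everything(seed: int) -> int:
    """Seed the stdlib RNG; extend with framework RNGs as needed."""
    random.seed(seed)
    # Recorded for child processes; does not affect the current interpreter:
    os.environ["PYTHONHASHSEED"] = str(seed)
    # With numpy/torch installed you would also call:
    # np.random.seed(seed); torch.manual_seed(seed)
    return seed  # log this value alongside the run's config

seed_everything(42)
```

Returning the seed makes it easy to drop straight into the logged config dict.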