# Experiment Tracking

## What to Track
- Hyperparameters: lr, batch size, optimizer, model architecture
- Metrics: train/val loss and accuracy, logged per epoch
- System info: GPU, CUDA version, commit hash
- Artifacts: model checkpoints, config files, sample outputs
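The system info above can be captured programmatically at the start of each run. A minimal sketch using only the standard library (the `collect_run_info` name is illustrative; GPU and CUDA details would come from e.g. `torch.cuda`, omitted here to stay dependency-free):

```python
import platform
import subprocess

def collect_run_info():
    """Gather reproducibility metadata to log alongside hyperparameters."""
    try:
        # Current commit hash; falls back gracefully outside a git repo.
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], stderr=subprocess.DEVNULL
        ).decode().strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        commit = "unknown"
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "commit": commit,
    }
```

Log the returned dict as part of the run config so every experiment records the exact code and environment it ran under.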
## Weights & Biases (wandb)

```python
import wandb

wandb.init(project="my-project", config={"lr": 1e-3, "batch_size": 64})
for epoch in range(100):
    # train...
    wandb.log({"train/loss": loss, "val/acc": acc, "epoch": epoch})
wandb.finish()
```

Provides: live dashboards, hyperparameter sweeps, artifact versioning.
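The sweeps mentioned above are driven by a small YAML spec. A minimal sketch (the parameter names and ranges here are illustrative, not recommendations):

```yaml
# sweep.yaml -- example values only
method: bayes
metric:
  name: val/acc
  goal: maximize
parameters:
  lr:
    distribution: log_uniform_values
    min: 1e-5
    max: 1e-2
  batch_size:
    values: [32, 64, 128]
```

Register it with `wandb sweep sweep.yaml`, then launch workers with `wandb agent <sweep-id>` (the sweep ID is printed by the first command).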
## MLflow

```python
import mlflow

with mlflow.start_run():
    mlflow.log_params({"lr": 1e-3})
    mlflow.log_metrics({"val_acc": 0.95})
    mlflow.pytorch.log_model(model, "model")
```

Self-hostable, open source, integrates with Databricks.
## TensorBoard

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('runs/exp1')
writer.add_scalar('Loss/train', loss, step)
writer.add_histogram('weights', model.fc.weight, step)
writer.add_image('sample', img, step)
writer.close()
```

Launch: `tensorboard --logdir runs/`
## Hydra (Config Management)

```yaml
# config.yaml
lr: 1e-3
batch_size: 64
model:
  hidden_dim: 256
  n_layers: 4
```

```python
import hydra
from omegaconf import DictConfig

@hydra.main(config_path=".", config_name="config")
def main(cfg: DictConfig):
    print(cfg.lr, cfg.model.hidden_dim)
```

Override from the CLI: `python train.py lr=1e-4 model.n_layers=6`
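As a dependency-free illustration of what those dotted overrides do (this is not Hydra's actual implementation, which also handles type checking, interpolation, and multirun), each `key.subkey=value` argument walks the nested config and replaces a leaf:

```python
def apply_overrides(cfg, overrides):
    """Apply Hydra-style 'a.b=value' overrides to a nested dict (illustrative only)."""
    for item in overrides:
        dotted, _, raw = item.partition("=")
        *path, leaf = dotted.split(".")
        node = cfg
        for key in path:
            node = node[key]  # walk down to the parent of the leaf key
        try:
            node[leaf] = int(raw)        # keep ints as ints
        except ValueError:
            try:
                node[leaf] = float(raw)  # then try floats
            except ValueError:
                node[leaf] = raw         # otherwise keep the raw string

cfg = {"lr": 1e-3, "batch_size": 64, "model": {"hidden_dim": 256, "n_layers": 4}}
apply_overrides(cfg, ["lr=1e-4", "model.n_layers=6"])
```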
## Best Practices
- Log every run: disk is cheap, reproducibility is not
- Use a random seed and log it
- Save config alongside checkpoint
- Use git tags or commit hash to link code to experiment
- Group related experiments in the same project/experiment group
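A sketch of the seed and config-saving practices above, using only the standard library (the file name and the `save_run_config` helper are illustrative; seed numpy/torch the same way if you use them):

```python
import json
import random

def save_run_config(config, path):
    """Persist the exact config (seed included) next to the checkpoint."""
    with open(path, "w") as f:
        json.dump(config, f, indent=2, sort_keys=True)

# Pick a seed once, record it, and seed every RNG the run uses.
seed = random.randrange(2**31)
random.seed(seed)

config = {"lr": 1e-3, "batch_size": 64, "seed": seed}
save_run_config(config, "run_config.json")
```

Reloading `run_config.json` later gives everything needed to rerun the experiment with identical settings.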