23.8. Optimizing ML Workflows on an AI Cluster Workshop
23.8.1. Workshop Summary
This workshop demonstrates how to optimize machine learning workflows for efficient, reproducible training on an AI cluster. Using Torchvision models such as AlexNet and ResNet, trained on CIFAR-10 and ImageNet-1k, as examples, the workshop walks through the challenges and solutions at each stage of a machine learning pipeline: environment setup, data management, experiment tracking, model training, evaluation, and deployment. It places special emphasis on using Weights & Biases (W&B) to manage experiments and hyperparameter sweeps, and on implementing checkpointing during training.
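As a rough illustration of the checkpointing idea mentioned above, the following framework-agnostic sketch shows a save-and-resume pattern. It is not taken from the workshop materials: the function names `save_checkpoint`, `load_checkpoint`, and `train`, the JSON file format, and the toy "weight" update are all hypothetical stand-ins chosen so the example runs with the Python standard library alone.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to path atomically so an interrupted job never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    # os.replace swaps the file in one step when both paths are on the same filesystem.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None when no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def train(path, total_epochs, stop_after=None):
    """Toy training loop (hypothetical) that resumes from path if a checkpoint exists.

    stop_after simulates a job being cut off after that many epochs in this run.
    """
    state = load_checkpoint(path) or {"epoch": 0, "weight": 0.0}
    epochs_this_run = 0
    while state["epoch"] < total_epochs:
        if stop_after is not None and epochs_this_run >= stop_after:
            break
        state["weight"] += 0.5  # placeholder for one epoch of real training
        state["epoch"] += 1
        epochs_this_run += 1
        save_checkpoint(path, state)  # checkpoint after every completed epoch
    return state


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    train(ckpt, total_epochs=5, stop_after=2)  # first job stops early
    print(train(ckpt, total_epochs=5))         # second job resumes from the saved epoch
```

In a PyTorch workflow, the saved state would typically hold `model.state_dict()` and `optimizer.state_dict()` alongside the epoch counter, written with `torch.save` and restored with `torch.load`. The write-to-temporary-file-then-rename step is one common way to avoid a corrupted checkpoint if a batch job is killed mid-write.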
23.8.1.1. Prerequisites
Familiarity with PyTorch
Familiarity with HPC, including SLURM batch job submission
Access to the FASRC cluster (Kempner AI cluster access is not necessary)
23.8.2. Workshop Slides
To download the “Optimizing ML Workflows on an AI Cluster” workshop slides, click the link below.
Kempner Optimizing ML Workflows on an AI Cluster Workshop