Optimizing ML Workflows on an AI Cluster Workshop

23.8. Optimizing ML Workflows on an AI Cluster Workshop

23.8.1. Workshop Summary

This workshop demonstrates how to optimize machine learning workflows for efficient, reproducible training on an AI cluster. Using Torchvision models such as AlexNet and ResNet, trained on CIFAR-10 and ImageNet-1k, as examples, the workshop walks through the challenges and solutions at each stage of a machine learning pipeline: environment setup, data management, experiment tracking, model training, evaluation, and deployment. Particular attention is given to using Weights & Biases (W&B) to manage experiments and hyperparameter sweeps, and to implementing checkpointing during training.
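To illustrate the checkpointing idea mentioned above, here is a minimal, framework-agnostic sketch; it is not taken from the workshop materials. The names `save_checkpoint`, `load_checkpoint`, and `train`, the JSON file format, and the toy loss are all assumptions made for illustration. In a PyTorch job the saved state would typically hold `model.state_dict()` and `optimizer.state_dict()` written with `torch.save`, but the resume logic has the same shape.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target.

    If the job is killed mid-write, the previous checkpoint stays intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def train(path, total_epochs, stop_after=None):
    """Toy training loop that resumes from the last completed epoch.

    stop_after (hypothetical) simulates a job that ends early, for example
    when a batch job hits its time limit.
    """
    state = load_checkpoint(path) or {"epoch": 0, "loss": None}
    epochs_run = 0
    for epoch in range(state["epoch"], total_epochs):
        if stop_after is not None and epochs_run >= stop_after:
            break
        # Placeholder for one real epoch of training.
        state = {"epoch": epoch + 1, "loss": 1.0 / (epoch + 1)}
        save_checkpoint(path, state)
        epochs_run += 1
    return state
```

Calling `train("ckpt.json", 5, stop_after=2)` and then `train("ckpt.json", 5)` runs epochs 1 and 2 in the first call and picks up at epoch 3 in the second, which is the behavior a resubmitted cluster job relies on.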

23.8.1.1. Prerequisites

  • Familiarity with PyTorch

  • Familiarity with HPC, including SLURM batch job submission

  • Access to the FASRC cluster (Kempner AI cluster access is not necessary)

23.8.2. Workshop Slides

To download the “Optimizing ML Workflows on an AI Cluster” workshop slides, click the link below.

Kempner Optimizing ML Workflows on an AI Cluster Workshop