Optimizing ML Workflows on an AI Cluster Workshop

25.8. Optimizing ML Workflows on an AI Cluster Workshop

25.8.1. Workshop Summary

This workshop demonstrates how to optimize machine learning workflows for efficient, reproducible training on an AI cluster. Using Torchvision models such as AlexNet and ResNet, trained on CIFAR-10 and ImageNet-1k, as examples, the workshop walks through the challenges and solutions at each stage of a machine learning pipeline, from environment setup and data management to experiment tracking, model training, evaluation, and deployment. It places special focus on using Weights & Biases (W&B) to manage experiments and hyperparameter sweeps, and on implementing checkpointing during training.
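As a rough illustration of the checkpointing idea mentioned above, the following is a minimal, framework-agnostic sketch and not the workshop's own code. It assumes a toy training state (a dictionary with an epoch counter and a single placeholder weight) and uses `pickle` as a stand-in for a real serializer. In a PyTorch workflow one would typically save `model.state_dict()` and `optimizer.state_dict()` with `torch.save` instead. The function names `save_checkpoint`, `load_checkpoint`, and `train`, and the `stop_after` parameter, are hypothetical and exist only for this example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the checkpoint atomically so an interrupted job never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # rename over the old file; atomic on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def train(path, total_epochs, stop_after=None):
    """Toy training loop: resume from `path` if present, checkpoint after every epoch.

    `stop_after` (hypothetical) simulates a job that ends after that many
    epochs in the current run, e.g. because it hit its time limit.
    """
    state = load_checkpoint(path) or {"epoch": 0, "weights": 0.0}
    epochs_this_run = 0
    while state["epoch"] < total_epochs:
        if stop_after is not None and epochs_this_run >= stop_after:
            break
        state["weights"] += 0.5  # placeholder for a real optimizer step
        state["epoch"] += 1
        epochs_this_run += 1
        save_checkpoint(path, state)
    return state
```

Calling `train(path, total_epochs=5, stop_after=2)` and then `train(path, total_epochs=5)` with the same path mimics a job that is interrupted and resubmitted: the second call picks up at epoch 2 rather than starting over. Writing to a temporary file and renaming it is one common way to avoid corrupt checkpoints when a job is killed mid-write.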

25.8.1.1. Prerequisites

Familiarity with PyTorch
Familiarity with HPC, including SLURM batch job submission
Access to the FASRC cluster (Kempner AI cluster access is not necessary)

25.8.2. Workshop Slides

To download the “Optimizing ML Workflows on an AI Cluster” workshop slides, click the link below.

Kempner Optimizing ML Workflows on an AI Cluster Workshop
