(s9_workshops_and_trainings:optimizing_ml_workflows)=
# Optimizing ML Workflows on an AI Cluster Workshop


## Workshop Summary

This workshop demonstrates how to optimize machine learning workflows for efficient, reproducible training on an AI cluster. Using [Torchvision models](https://docs.pytorch.org/vision/main/models.html), such as AlexNet and ResNet, trained on [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) and [ImageNet-1k](https://www.image-net.org/index.php) as examples, the workshop walks through the challenges and solutions at each stage of a machine learning pipeline: environment setup, data management, experiment tracking, model training, evaluation, and deployment. It places special focus on using [Weights & Biases (W&B)](https://wandb.ai/home) to manage experiments and hyperparameter sweeps, and on implementing checkpointing during training.
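As a rough illustration of the checkpointing idea mentioned above, the following is a minimal sketch and not the workshop's actual code. The helper names (`save_checkpoint`, `load_checkpoint`, `train`) and the checkpoint layout are hypothetical, and a toy linear model with random data stands in for the Torchvision models and datasets so the example stays self-contained.

```python
import os

import torch
from torch import nn


def save_checkpoint(path, model, optimizer, epoch):
    """Write to a temporary file first, then rename, so an interrupted
    job does not leave a partially written checkpoint behind."""
    tmp_path = path + ".tmp"
    torch.save(
        {
            "epoch": epoch,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        tmp_path,
    )
    os.replace(tmp_path, path)


def load_checkpoint(path, model, optimizer):
    """Restore state if a checkpoint exists; return the epoch to resume from."""
    if not os.path.exists(path):
        return 0
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1


def train(ckpt_path, num_epochs):
    """Toy training loop (assumed setup): resumes from ckpt_path when present."""
    torch.manual_seed(0)
    model = nn.Linear(4, 2)  # stand-in for a Torchvision model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    start_epoch = load_checkpoint(ckpt_path, model, optimizer)

    inputs, targets = torch.randn(8, 4), torch.randn(8, 2)  # stand-in data
    for epoch in range(start_epoch, num_epochs):
        loss = nn.functional.mse_loss(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Checkpoint after every epoch so a requeued job loses little work.
        save_checkpoint(ckpt_path, model, optimizer, epoch)
    return start_epoch
```

Calling `train("ckpt.pt", 2)` and then `train("ckpt.pt", 4)` would run epochs 0 and 1 in the first call and resume at epoch 2 in the second, which mirrors how a batch job that is preempted and requeued can pick up where it left off.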

### Prerequisites

- Familiarity with PyTorch
- Familiarity with HPC, including SLURM batch job submission
- Access to the FASRC cluster (Kempner AI cluster access is not necessary)

## Workshop Slides

To download the "Optimizing ML Workflows on an AI Cluster" workshop slides, click the link below.

{download}`Kempner Optimizing ML Workflows on an AI Cluster Workshop </_static/workshop/Kempner_Optimizing_ML_Workflows_Workshop.pdf>`

<div style="text-align: center;">
 <iframe src="/_static/workshop/Kempner_Optimizing_ML_Workflows_Workshop.pdf" width="90%" height="460px" style="border: none;"></iframe>
</div>