Kempner Institute Spring 2024 Compute Workshop#
Date: March 28, 2024
Time: 1:00 - 4:00 PM
Location: SEC 2.118
Presenters: Ella Batty, Naeem Khoshnevis, Max Shad
Welcome to the Kempner Institute Spring 2024 Compute Workshop! This workshop is designed to provide an introduction to High-Performance Computing (HPC) and the Kempner Institute AI cluster. The workshop will cover the basics of HPC, including an overview of the Kempner Institute AI cluster architecture and storage tiers. We will also discuss data transfer methods, code synchronization, and software modules. The workshop will include an introduction to job management and monitoring, advanced computing techniques, and support and troubleshooting.
Infrastructure Orientation#
Welcome and Introduction
Cluster Access
Overview of the Kempner Institute Cluster Architecture
Understanding Storage Tiers
Shared Open-Source Data Repositories on Cluster
Good Citizenship on the Cluster
Development#
Cluster Access
SSH Access
ssh <username>@login.rc.fas.harvard.edu
Open OnDemand (demo)
See Accessing the Cluster for full details.
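Optionally, you can add a host entry to your local SSH configuration so you do not have to type the full hostname each time; this also simplifies remote development with VSCode later. A minimal sketch (the alias fasrc and <username> are placeholders):
# ~/.ssh/config on your laptop
Host fasrc
    HostName login.rc.fas.harvard.edu
    User <username>
With this in place, ssh fasrc connects you to a login node.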
Software Modules on the AI Cluster
Software modules via module load
module avail
module load python
See Software Modules for full details.
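As a rough sketch of a typical module workflow (the exact module names and versions available on the cluster may differ; check module avail first):
module avail python      # list Python-related modules
module load python       # load the default Python module
module list               # show currently loaded modules
module purge              # unload all modules when finished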
Conda/mamba environments
Why use conda environments?
What is mamba? FASRC uses mamba, a drop-in replacement for conda that is generally much faster.
Try it yourself
Try creating a conda environment called myenv in your home directory by following these steps. Make it usable in Jupyter notebooks with one additional step.
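A rough sketch of those steps (the module name, Python version, and the ipykernel step are assumptions; follow the linked instructions for the exact commands):
module load python                                 # provides mamba/conda on the cluster
mamba create -n myenv python=3.10 -y               # create the environment
mamba activate myenv                               # activate it
mamba install -y ipykernel                         # needed for Jupyter support
python -m ipykernel install --user --name myenv    # register the environment as a Jupyter kernel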
Spack
See Spack Package Manager for full details.
Code Synchronization
Using Git:
Step 1: Create a folder for the workshop exercise and navigate to it.
Step 2: Clone the repository:
git clone https://github.com/KempnerInstitute/intro-compute-march-2024.git
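For example, the two steps together might look like this (the folder name workshop_exercise is just a placeholder):
mkdir workshop_exercise && cd workshop_exercise
git clone https://github.com/KempnerInstitute/intro-compute-march-2024.git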
VSCode
Try it yourself
Set up remote development using VSCode by following these steps.
Data Transfer
scp/rsync: See Data Transfer for full details.
Try it yourself
Navigate to the Data_transfer_example folder here and download data.npy to your computer. Use scp or rsync to transfer this data to your home directory on the cluster.
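A rough sketch of the transfer, run from your laptop (the local path is a placeholder; replace <username> with your cluster account):
scp ~/Downloads/data.npy <username>@login.rc.fas.harvard.edu:~/
# or, equivalently, with rsync:
rsync -avP ~/Downloads/data.npy <username>@login.rc.fas.harvard.edu:~/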
Globus: Follow the steps in Globus to set up endpoints on the cluster and your laptop.
Job Management and Monitoring#
Fairshare Policy and Job Priority Basics (Max)
Example: Check your lab Fairshare score
sshare --account=kempner_grads --all
Example: Check your jobs' fairshare and priority in the queue
sprio -l | head -1 && sprio -l | grep $USER
Example: Check all jobs running on kempner partitions
squeue -p kempner -o "%.18i %.9P %.20u %.50j %.8T %.10M %.5D %.20R" | sort -n -k 7
squeue -p kempner_requeue -o "%.18i %.9P %.20u %.50j %.8T %.10M %.5D %.20R" | sort -n -k 7
Example: Fairshare score calculations
scalc
Example: Monitor Fairshare progress through Grafana
SLURM Partitions
FASRC SLURM Partitions
Example: Check SLURM partition settings
scontrol show partition kempner
scontrol show partition kempner_requeue
Example: Check status of all Kempner partitions
spart | awk 'NR==1 || /kempner/'
Example: Check status of nodes within a Kempner partition
lsload | head -n 1 && lsload | grep "8a19"
lsload | head -n 1 && lsload | grep "8a17"
SLURM Interactive Jobs via Open OnDemand and VSCode
Open OnDemand: See Open OnDemand.
VSCode: See Connecting to the FASRC cluster (Compute node).
SLURM Batch Job Submission Basics
See Batch Jobs.
Try it yourself
Navigate to the SLURM_example_1 directory. Here we have a Python script that simply occupies the CPU and memory for a certain amount of time. Take a look at the job submission script run.sh and the Python script cpu_mem_occupy.py.
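The actual run.sh in the repository may differ, but a submission script for a job like this typically looks roughly like the following sketch (partition, account, resources, and time limit are assumptions):
#!/bin/bash
#SBATCH --job-name=cpu_mem_occupy
#SBATCH --account=kempner_grads      # use your own lab's account
#SBATCH --partition=kempner          # partition is an assumption
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=00:10:00
#SBATCH --output=%x_%j.out           # %x = job name, %j = job ID

python cpu_mem_occupy.py             # arguments controlling the duration, if any, go here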
Test the job submission script:
You can test the job submission by adding the following directive to the run.sh script:
#SBATCH --test-only
This reports what would happen if you submitted the job, including an estimated start time, without actually submitting it. (Try it!)
Submit the job:
Drop the --test-only flag, set the duration to 300 seconds, and submit the job using the following command:
sbatch run.sh
Check the job status:
You can check the status of the job using the following command:
squeue -u <username>
or
squeue -u $USER
or
squeue --me
Note that the squeue wrapper command used on the cluster updates job status with some delay.
Cancel the job:
Resubmit the job and try cancelling it using the following commands.
Cancel the job using the job id:
scancel <job_id>
Cancel all jobs of the user:
scancel -u <username>
Cancel only pending jobs:
scancel --state=pending -u <username>
SLURM Batch Job Submission Advanced
Array Jobs
See Array Jobs.
Try it yourself
Navigate to the SLURM_example_2 directory. Take a look at the job submission script run_array_job.sh, the Python script hyperparameter_tuning.py, and the CSV file hyperparemters.csv. Can you figure out what would happen if you run this job? (A rough sketch of a similar array script appears after this exercise.)
Submit the array job:
sbatch run_array_job.sh
Check the status of the job. Look at the output files created (once it runs). Do they match what you would expect?
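The repository's run_array_job.sh may differ, but a minimal sketch of an array job that hands each task one row of the hyperparameter CSV could look like this (the array size and the Python script's command-line interface are assumptions):
#!/bin/bash
#SBATCH --job-name=hyperparameter_tuning
#SBATCH --account=kempner_grads      # use your own lab's account
#SBATCH --partition=kempner          # partition is an assumption
#SBATCH --array=1-4                  # one task per CSV row (count is an assumption)
#SBATCH --time=00:10:00
#SBATCH --mem=4G
#SBATCH --output=%x_%A_%a.out        # %A = array job ID, %a = task ID

# Each task reads the CSV row matching its index (skipping the header)
# and passes it to the Python script.
ROW=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" hyperparemters.csv)
python hyperparameter_tuning.py "$ROW"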
Useful SLURM commands
Monitoring Job Status and Utilization
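A rough sketch of commands for checking a job's status and resource utilization (<job_id> is a placeholder):
sacct -j <job_id> --format=JobID,JobName,State,Elapsed,MaxRSS,AllocCPUS   # accounting info while running or after completion
seff <job_id>                                                             # CPU and memory efficiency summary after the job completes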
Advanced Computing Techniques#
Best practices for HPC efficiency
Introduction to parallel computing
Containerization with Singularity
Distributed Computing and Training
Support and Troubleshooting#
Troubleshooting Common Issues
Support Framework: FASRC and Kempner Engineering Team
Send a ticket to FASRC (rchelp [at] rc.fas.harvard.edu)
Closing Remarks and Q&A Session