Kempner Institute Spring 2024 Compute Workshop#

Date: March 28, 2024
Time: 1:00 - 4:00 PM
Location: SEC 2.118
Presenters: Ella Batty, Naeem Khoshnevis, Max Shad

Welcome to the Kempner Institute Spring 2024 Compute Workshop! This workshop provides an introduction to High-Performance Computing (HPC) and the Kempner Institute AI cluster. It covers HPC basics, including an overview of the cluster's architecture and storage tiers, along with data transfer methods, code synchronization, and software modules. It also introduces job management and monitoring, advanced computing techniques, and support and troubleshooting.

Infrastructure Orientation#

  • Welcome and Introduction

  • Cluster Access (Click Here)

  • Overview of the Kempner Institute Cluster Architecture (Click Here)

  • Understanding Storage Tiers (Click Here)

  • Shared Open-Source Data Repositories on Cluster (Click Here)

  • Good Citizenship on the Cluster (Click Here)

Development#

Cluster Access
  1. SSH Access

    ssh <username>@login.rc.fas.harvard.edu
    
  2. Open OnDemand (demo)

See Accessing the Cluster for full details.

Software Modules on the AI Cluster
  1. Software modules via module load

    module avail
    module load python
    

    See Software Modules for full details.
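
    A few other standard Lmod commands are handy here; these are core Lmod features, so they should behave the same on the cluster:

    module list            # show currently loaded modules
    module spider python   # search across all available versions of a module
    module purge           # unload all loaded modules for a clean slate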

  2. Conda/mamba environments

    Why use conda environments?

    (Slides: "Why use conda environments?", parts 1 and 2)

    What is mamba?

    FASRC uses mamba, a drop-in replacement for conda that is generally much faster.

    Try it yourself

    Try creating a conda environment called myenv in your home directory by following these steps. With one additional step, you can also make it usable in Jupyter notebooks.
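
    A minimal sketch of those steps; the environment name myenv and the Python version are just examples, and the ipykernel registration is the extra step that makes the environment visible to Jupyter:

    mamba create -n myenv python=3.10                 # create the environment
    mamba activate myenv                              # activate it
    pip install ipykernel                             # needed for Jupyter integration
    python -m ipykernel install --user --name myenv   # register as a Jupyter kernel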

  3. Spack

See Spack Package Manager for full details.

Code Synchronization

Using Git:

Step 1: Create a folder for the workshop exercise and navigate to it.
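
The folder name is up to you; for instance:

mkdir workshop-exercises && cd workshop-exercises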

Step 2: Clone the repository:

git clone https://github.com/KempnerInstitute/intro-compute-march-2024.git

VSCode

Try it yourself

Set up remote development using VSCode by following these steps.

Data Transfer

scp/rsync: See Data Transfer for full details.

Try it yourself

  1. Navigate to the Data_transfer_example folder here and download data.npy to your computer.

  2. Use scp or rsync to transfer this data to your home directory on the cluster.
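
A sketch of the transfer from your laptop, assuming the file sits in your current directory and should land in your cluster home directory (replace <username>):

scp data.npy <username>@login.rc.fas.harvard.edu:~/

# or with rsync, which shows progress and can resume interrupted transfers:
rsync -avP data.npy <username>@login.rc.fas.harvard.edu:~/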

Globus: Follow the steps in Globus to set up endpoints on the cluster and your laptop.

Job Management and Monitoring#

  • Fairshare Policy and Job Priority Basics (Max) (Click Here)

Example: Check your lab's Fairshare score

    sshare --account=kempner_grads --all

Example: Check the Fairshare priority of your jobs in the queue

    sprio -l | head -1 && sprio -l | grep $USER

Example: Check all jobs running on Kempner partitions

    squeue -p kempner -o "%.18i %.9P %.20u %.50j %.8T %.10M %.5D %.20R" | sort -n -k 7
    squeue -p kempner_requeue -o "%.18i %.9P %.20u %.50j %.8T %.10M %.5D %.20R" | sort -n -k 7

Example: Fairshare score calculations

    scalc

Example: Monitor Fairshare progress through Grafana

Example: Check SLURM partition settings

    scontrol show partition kempner
    scontrol show partition kempner_requeue

Example: Check status of all Kempner partitions

    spart | awk 'NR==1 || /kempner/'

Example: Check status of nodes within a Kempner partition

    lsload | head -n 1 && lsload | grep "8a19"
    lsload | head -n 1 && lsload | grep "8a17"
SLURM Interactive Jobs via Open OnDemand and VSCode
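
An interactive session can also be requested from a terminal with salloc. A minimal sketch, assuming a one-hour CPU session (the partition name is a placeholder; use one your lab has access to):

salloc --partition=<partition> --time=01:00:00 --mem=8G --cpus-per-task=2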
SLURM Batch Job Submission Basics

See Batch Jobs.

Try it yourself

  1. Navigate to the SLURM_example_1 directory.

Here we have a Python script that simply occupies CPU and memory for a specified amount of time. Take a look at the job submission script run.sh and the Python script cpu_mem_occupy.py.
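
For orientation, a submission script for a job like this typically looks something like the sketch below. The partition and account values are placeholders, and the actual run.sh in the repository may differ:

#!/bin/bash
#SBATCH --job-name=cpu_mem_occupy
#SBATCH --partition=<partition>   # placeholder: a partition your lab can use
#SBATCH --account=<account>       # placeholder: your lab's SLURM account
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=00:10:00
#SBATCH --output=%x_%j.out        # %x = job name, %j = job id

python cpu_mem_occupy.py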

  2. Test the job submission script:

You can test the job submission by adding the following command to the run.sh script:

#SBATCH --test-only

This will tell you what would happen if you submit the job without actually submitting it. (Try it!)
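The same check also works from the command line, without editing the script:

sbatch --test-only run.sh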

  3. Submit the job:

Remove the --test-only flag, set the duration to 300 seconds, and submit the job using the following command:

sbatch run.sh
  4. Check the job status:

You can check the status of the job using the following command:

squeue -u <username> 

or

squeue -u $USER

or

squeue --me

Note that the squeue wrapper command can lag slightly when updating the status of the job.

  5. Cancel the job:

Resubmit the job, then try cancelling it using the following commands.

  • Cancel the job using the job id:

    scancel <job_id>
    
  • Cancel all jobs of the user:

    scancel -u <username>
    
  • Cancel only pending jobs:

    scancel --state=pending -u <username>
    
SLURM Batch Job Submission Advanced

Array Jobs

See Array Jobs.

Try it yourself

  1. Navigate to the SLURM_example_2 directory.

  2. Take a look at the job submission script run_array_job.sh, the Python script hyperparameter_tuning.py, and the CSV file hyperparameters.csv. Can you figure out what would happen if you run this job? (A sketch of a typical array-job script follows this list.)

  3. Submit the array job

sbatch run_array_job.sh
  4. Check the status of the job. Look at the output files created (once it runs). Do they match what you would expect?
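
For reference, an array-job script of this shape often looks like the sketch below. This is a hypothetical illustration, not the repository's run_array_job.sh: the array size and the way each task consumes its CSV row are assumptions.

#!/bin/bash
#SBATCH --job-name=hyperparam_tuning
#SBATCH --partition=<partition>   # placeholder: a partition your lab can use
#SBATCH --array=1-5               # illustrative: one task per CSV data row
#SBATCH --output=%x_%A_%a.out     # %A = array job id, %a = array task id

# Select the CSV row for this task, skipping the header line,
# so task N reads data row N.
PARAMS=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" hyperparameters.csv)

python hyperparameter_tuning.py "$PARAMS"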

  • Useful Slurm commands (Click Here)

  • Monitoring Job Status and Utilization

Advanced Computing Techniques#

  • Best practices for HPC efficiency

  • Introduction to parallel computing (Click Here)

  • Containerization with Singularity (Click Here)

  • Distributed Computing and Training (Click Here)

Support and Troubleshooting#

  • Troubleshooting Common Issues

  • Support Framework: FASRC and Kempner Engineering Team (Click Here)

    • Send a ticket to FASRC (rchelp@rc.fas.harvard.edu)

  • Closing Remarks and Q&A Session