4.3. Shared Data/Model Repository
We host several popular ML datasets and models on the Kempner AI cluster. Hosting them centrally spares researchers from transferring the same data or model multiple times and gives all Kempner Institute users a single read-only repository for their ML workflows. Only the admin team has write access, but users can request that popular datasets or models be added. After a careful review, we may place a copy in the shared data/model repository. The current paths on the cluster are:
MODEL_PATH=/n/holylfs06/LABS/kempner_shared/Everyone/testbed/models
DATA_PATH=/n/holylfs06/LABS/kempner_shared/Everyone/testbed/<data-type>
where <data-type> is audio, vision, code, etc.
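In a Python workflow, these locations can be wrapped in a small helper so that the hard-coded prefix lives in one place. The following is a minimal sketch, not an official utility: the names `TESTBED_ROOT`, `model_dir`, and `dataset_dir` are hypothetical, and it assumes that datasets are laid out as `<testbed>/<data-type>/<name>`.

```python
from pathlib import Path

# Shared repository root, taken from the paths above. Read-only for users.
TESTBED_ROOT = Path("/n/holylfs06/LABS/kempner_shared/Everyone/testbed")
MODEL_PATH = TESTBED_ROOT / "models"


def model_dir(name: str) -> Path:
    """Return the directory of a shared model (hypothetical helper)."""
    return MODEL_PATH / name


def dataset_dir(data_type: str, name: str) -> Path:
    """Return the directory of a shared dataset (hypothetical helper).

    Assumption: datasets are laid out as <testbed>/<data-type>/<name>,
    where data_type is, for example, "audio", "vision", or "code".
    """
    return TESTBED_ROOT / data_type / name
```

Since the repository is read-only, any outputs (checkpoints, caches, processed copies) need to be written to a location you own.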
Note
We will develop a web interface later for data and model discovery.
4.3.1. The current list of ML models
CodeLlama
Path: $MODEL_PATH/CodeLlama-7b-hf (see on HuggingFace)
Size: 16 G
EleutherAI
Path: $MODEL_PATH/pythia-160m-deduped (see on HuggingFace)
Size: 435 M
Path: $MODEL_PATH/pythia-70m-deduped (see on HuggingFace)
Size: 195 M
OpenAI
Path: $MODEL_PATH/gpt2 (see on HuggingFace)
Size: 4.5 M
Google
Path: $MODEL_PATH/t5-base (see on HuggingFace)
Size: 3.4 M
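The HuggingFace links suggest that these directories hold Hugging Face checkpoints. Assuming each directory is in the layout that `from_pretrained` accepts (not confirmed here) and that the `transformers` package is installed in your environment, a model could be loaded straight from the shared path without downloading it. The helper names below are hypothetical; this is a sketch, not the cluster's official loading procedure.

```python
from pathlib import Path

MODEL_PATH = Path("/n/holylfs06/LABS/kempner_shared/Everyone/testbed/models")


def resolve_model_dir(name: str, root: Path = MODEL_PATH) -> Path:
    """Return root/name, failing early if the directory is missing."""
    path = Path(root) / name
    if not path.is_dir():
        raise FileNotFoundError(f"No shared model directory at {path}")
    return path


def load_causal_lm(name: str, root: Path = MODEL_PATH):
    """Load a tokenizer and causal LM from the shared repository.

    Assumption: the directory is a Hugging Face checkpoint.
    """
    # Imported lazily so the path check works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    path = resolve_model_dir(name, root)
    # local_files_only=True prevents any attempt to reach the Hub.
    tokenizer = AutoTokenizer.from_pretrained(path, local_files_only=True)
    model = AutoModelForCausalLM.from_pretrained(path, local_files_only=True)
    return tokenizer, model
```

For example, `tokenizer, model = load_causal_lm("pythia-70m-deduped")` would load the smallest Pythia model listed above. `t5-base` is an encoder-decoder model, so it would need `AutoModelForSeq2SeqLM` in place of `AutoModelForCausalLM`.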
4.3.2. The current list of ML datasets
c4_original
Path: $DATA_PATH/text/c4_original
Subfolders: preprocessed (434 M), raw (157 G)
Description: The original version of the “Colossal Clean Crawled Corpus” (C4) dataset, designed for training natural language processing models.
dolma
Path: $DATA_PATH/text/dolma
Subfolders: preprocessed (6.8 T), raw (5.9 T)
Description: Dolma is a dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
imagenet_winter21_whole
Path: $DATA_PATH/vision/imagenet_winter21_whole/
Size: 1.3 T (no subfolders listed)
Description: An updated version of the ImageNet dataset, containing a wide variety of annotated images for visual object recognition.
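Because these datasets are large and the repository is read-only, it is often enough to peek at a subfolder rather than copy or walk it in full. The sketch below returns the first few entries of a dataset subfolder. `peek_dataset` is a hypothetical helper, and it assumes that the `$DATA_PATH/text/...` and `$DATA_PATH/vision/...` entries above resolve to `<testbed>/text/...` and `<testbed>/vision/...`, with `raw` and `preprocessed` as direct subfolders.

```python
import os
from itertools import islice
from pathlib import Path

TESTBED_ROOT = Path("/n/holylfs06/LABS/kempner_shared/Everyone/testbed")


def peek_dataset(data_type, name, subfolder="", limit=5, root=TESTBED_ROOT):
    """Return up to `limit` entry names from a dataset subfolder.

    The entries are whichever ones the filesystem yields first,
    sorted only for stable display.
    """
    path = Path(root) / data_type / name / subfolder
    with os.scandir(path) as entries:
        # islice stops early, so huge directories are not fully enumerated.
        return sorted(entry.name for entry in islice(entries, limit))
```

For example, `peek_dataset("text", "c4_original", "raw")` would show a handful of entries from the raw C4 copy, if the layout matches the assumption above.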