Examples¶
Accessing a node in interactive mode with dcejs¶
In the video below, we illustrate how to get an interactive session on a GPU node for 1 hour.
Accessing a remote VSCode server with dcejs¶
In the video below, we show how to access a remote VS Code server with dcejs. Dcejs takes care of running code server on the remote node and of creating the SSH tunnel to access it.
Accessing a node in interactive mode with visual studio code¶
You can install Visual Studio Code locally and connect to the DCE using the Remote SSH extension. Once configured, the Remote SSH extension lets you edit, from your local machine, files that live on the remote servers and submit your jobs from a terminal connected to the gateway.
Below we demonstrate the use of the Remote SSH extension to connect to the DCE, allocate a node for an interactive session and start a Python script.
Note that in the video we are using key-based authentication, which is the recommended way to authenticate, rather than password-based authentication. You can configure key-based authentication by following this guide.
Using a conda virtual environment for running your codes¶
When you run simulations, it is important to know exactly which environment you used, in particular the versions of the libraries and binaries involved.
One convenient program to do so is conda. There are alternatives such as pipenv, virtualenv, docker, ... but the one we provide on the DCE is conda. In a few words, conda lets you install specific versions of libraries, of Python, etc., and be relatively independent of the versions installed on the node you are using. At the time of writing (July 2022), anaconda 4.10.3 is installed; if you want to know more about it, you can read the conda documentation.
To use conda on the DCE, we advise you to proceed in two steps:
- create the environment once and for all: see for example slurm-conda-setup.sbatch and the required requirements.txt,
- in your running simulations, use the created environment: see for example slurm-process.sbatch
Creating a conda environment can take a while, which is why we advise you to create it once and for all. Once created, a conda environment can be activated from any directory; you do not need to be in the directory where it was created (conda environments are stored in ~/.conda/envs/).
To use the sbatch files above, run sbatch slurm-conda-setup.sbatch once. You can then run your simulations, following the guidelines provided in slurm-process.sbatch to load the conda environment you created.
Tips Once you have created the conda environment, you can dump it into a YAML file for easy sharing by exporting it with conda env export > environment.yml. You can then recreate the environment using conda env create -f environment.yml.
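In addition to exporting the environment, it can help to record the package versions seen by each run, next to its results. The following is only a minimal sketch using the Python standard library; the function names `installed_versions` and `dump_versions` are hypothetical helpers, not part of the DCE tooling or of conda.

```python
from importlib import metadata


def installed_versions():
    """Return a sorted list of 'name==version' strings for the active environment."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata has no name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


def dump_versions(path):
    """Write one 'name==version' line per installed package to `path`."""
    lines = installed_versions()
    with open(path, "w") as fp:
        fp.write("\n".join(lines) + "\n")
    return len(lines)
```

Calling `dump_versions("versions.txt")` at the start of a simulation would leave a plain-text trace of the environment that produced the results.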
Starting batch trainings of neural networks¶
Coding and running in interactive mode¶
In the previous example, we started an interactive session. In that interactive session, you can write your code and test it on small datasets running on the GPUs. But then you want to submit a long-running job, a so-called batch job in Slurm terminology. Let us see how to do that on a practical example.
For illustration, I will consider training a simple convolutional neural network on the FashionMNIST dataset. We will mainly cover two aspects:
- how to store your temporary data (e.g. raw datasets or preprocessed datasets) on a fast local SSD drive
- how to submit a range of experiments to be processed by the GPU nodes
I provide you with a basic training script train_emnist.py. The script can be tested in an interactive session like this:
[user@cam10 ~]$ python3 train_emnist.py --model linear --dataset_dir $TMPDIR train
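Inside a training script, the fast local storage can be located through the `TMPDIR` environment variable used in the command above. The sketch below is an assumption about how one might do this, not the actual content of train_emnist.py; the helper name `dataset_dir` and the `datasets` subdirectory are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def dataset_dir(subdir="datasets", env=None):
    """Return a directory for temporary data, created if needed.

    Uses $TMPDIR when it is set and falls back to the system
    temporary directory otherwise (e.g. on a personal machine).
    """
    env = os.environ if env is None else env
    base = env.get("TMPDIR") or tempfile.gettempdir()
    path = Path(base) / subdir
    path.mkdir(parents=True, exist_ok=True)
    return path
```

The fallback keeps the same script usable outside of a Slurm allocation, where `TMPDIR` may not be defined.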
Running a batch job¶
When you are done coding your training script, you want to run your experiment, possibly multiple times without having to start an interactive session. This is done in batch mode. In batch mode, you have the additional benefit that the node is released as soon as the job is completed.
Running a batch simulation starts by defining a job, for example the job.batch script below:
#!/bin/bash
#SBATCH --job-name=emnist
#SBATCH --nodes=1
#SBATCH --partition=gpu_prod_night
#SBATCH --time=1:00:00
#SBATCH --output=logslurms/slurm-%j.out
#SBATCH --error=logslurms/slurm-%j.err
python3 train_emnist.py --model linear --dataset_dir $TMPDIR train
The batch script above requires the directory logslurms to exist before running it:
[user@cam10 ~]$ mkdir -p logslurms
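You can then submit your simulation several times as a job array by calling sbatch from the frontal node. Note that the array range is inclusive, so --array=0-10 starts 11 tasks, indexed from 0 to 10: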
[user@cam10 ~]$ sbatch --array=0-10 job.batch
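Each task of a job array runs the same command, so the runs only differ if the script looks at its task index, which Slurm exposes in the `SLURM_ARRAY_TASK_ID` environment variable. The following is a minimal sketch of deriving a distinct random seed per task; whether train_emnist.py does anything like this is not stated here, and the helper names and the base seed are hypothetical.

```python
import os


def array_task_id(env=None, default=0):
    """Index of the current task in a Slurm job array (default outside of an array)."""
    env = os.environ if env is None else env
    return int(env.get("SLURM_ARRAY_TASK_ID", default))


def run_seed(base_seed=1234, env=None):
    """Derive a distinct, reproducible seed for each array task."""
    return base_seed + array_task_id(env)
```

With `--array=0-10`, the tasks would get the seeds 1234 to 1244, and the same script still runs with seed 1234 in an interactive session.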
Testing multiple parameters¶
But suppose you want to run multiple experiments with different settings, for example to perform a random search in the space of hyperparameters. The previous example might not be sufficient. A more advanced approach is to use a script to generate an sbatch job file, for example using job.py:
#!/usr/bin/python

import os


def makejob(model, nruns):
    return f"""#!/bin/bash

#SBATCH --job-name=emnist-{model}
#SBATCH --nodes=1
#SBATCH --partition=gpu_prod_night
#SBATCH --time=1:00:00
#SBATCH --output=logslurms/slurm-%A_%a.out
#SBATCH --error=logslurms/slurm-%A_%a.err
#SBATCH --array=0-{nruns}

python3 train_emnist.py --model {model} --dataset_dir $TMPDIR train
"""


def submit_job(job):
    with open('job.sbatch', 'w') as fp:
        fp.write(job)
    os.system("sbatch job.sbatch")


# Ensure the log directory exists
os.system("mkdir -p logslurms")

# Launch the batch jobs
submit_job(makejob("linear", 10))
submit_job(makejob("cnn", 10))
We now have a Python script that generates bash files to be submitted with sbatch. The sbatch file itself runs the experiment for a maximum of 1 hour on the gpu_prod_night partition, using 1 node, and runs the experiment multiple times with an array. All the sbatch directives are detailed elsewhere in the documentation.
You can then execute the job.py script and see your running jobs:
[user@chome ~]$ python job.py
[user@chome ~]$ squeue
JOBID       PARTITION       NAME           USER  ST  TIME  NODES  NODELIST(REASON)
869_[0-10]  gpu_prod_night  emnist-cnn     user  PD  0:00  1      (QOSMaxJobsPerUserLimit)
868_[4-10]  gpu_prod_night  emnist-linear  user  PD  0:00  1      (QOSMaxJobsPerUserLimit)
868_0       gpu_prod_night  emnist-linear  user  R   6:55  1      tx09
868_1       gpu_prod_night  emnist-linear  user  R   6:55  1      tx10
868_2       gpu_prod_night  emnist-linear  user  R   6:55  1      tx11
868_3       gpu_prod_night  emnist-linear  user  R   6:55  1      tx12
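The job.py script only varies the model. For the random search mentioned earlier, one would first draw a set of configurations and then generate one job per configuration. The sketch below covers the sampling step only. The model names come from the example, but the `lr` and `batch_size` hyperparameters and their ranges are hypothetical: they are not options of train_emnist.py as shown here, and makejob would have to be extended to pass them on.

```python
import random


def sample_configs(n, seed=0):
    """Draw `n` random hyperparameter configurations, reproducibly."""
    rng = random.Random(seed)  # a local generator keeps the draw reproducible
    configs = []
    for _ in range(n):
        configs.append({
            "model": rng.choice(["linear", "cnn"]),
            # Hypothetical hyperparameter: log-uniform in [1e-4, 1e-1]
            "lr": 10 ** rng.uniform(-4, -1),
            # Hypothetical hyperparameter
            "batch_size": rng.choice([32, 64, 128]),
        })
    return configs
```

Fixing the seed means that the list of sampled configurations can be regenerated later, which helps when matching log files to settings.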
A more advanced sbatch¶
The way of running sbatch described in the previous paragraph is convenient for running multiple experiments, but it has two pitfalls if you are still developing your code and your code base changes: 1) you have no guarantee of which version of your code was actually executed, and 2) you must not change your code while your simulations are waiting to be executed, because your code is read only when the job starts.
What is missing is a tag that certifies the version of your code, and we all know a program that does that: the version control system git. I propose below a modified version of the job.py script which 1) checks that you do not have any modified or staged code, so that a commit id certifies the version of the code, 2) saves the commit id when running the sbatch command and 3) copies the code base and checks out that specific commit id before running.
The last convenient add-on of the code below is to set up a virtual environment every time we run a simulation. Provided a requirements.txt is included in your git repository, the script below will also take care of setting up the virtual environment.
#!/usr/bin/python

import os
import subprocess


def makejob(commit_id, model, nruns):
    return f"""#!/bin/bash

#SBATCH --job-name=emnist-{model}
#SBATCH --nodes=1
#SBATCH --partition=gpu_prod_night
#SBATCH --time=1:00:00
#SBATCH --output=logslurms/slurm-%A_%a.out
#SBATCH --error=logslurms/slurm-%A_%a.err
#SBATCH --array=0-{nruns}

current_dir=`pwd`

echo "Session " {model}_${{SLURM_ARRAY_JOB_ID}}_${{SLURM_ARRAY_TASK_ID}}

echo "Copying the source directory and data"
date
mkdir $TMPDIR/emnist
rsync -r . $TMPDIR/emnist/

echo "Checking out the correct version of the code commit_id {commit_id}"
cd $TMPDIR/emnist/
git checkout {commit_id}

echo "Setting up the virtual environment"
python3 -m pip install virtualenv --user
virtualenv -p python3 venv
source venv/bin/activate
python -m pip install -r requirements.txt

echo "Training"
python3 train_emnist.py --model {model} --dataset_dir $TMPDIR train

# Report a failure to Slurm if the training did not succeed
if [[ $? != 0 ]]; then
    exit 1
fi

# Once the job is finished, you can copy back
# files from $TMPDIR/emnist to $current_dir
"""


def submit_job(job):
    with open('job.sbatch', 'w') as fp:
        fp.write(job)
    os.system("sbatch job.sbatch")


# Ensure all the modified files have been staged and committed
result = int(subprocess.run(
    "expr $(git diff --name-only | wc -l) + $(git diff --name-only --cached | wc -l)",
    shell=True, stdout=subprocess.PIPE).stdout.decode())
if result > 0:
    print(f"We found {result} modifications either not staged or not committed")
    raise RuntimeError("You must stage and commit every modification before submission")

# Identify the commit that the jobs will check out
commit_id = subprocess.check_output("git log --pretty=format:'%H' -n 1", shell=True).decode()

# Ensure the log directory exists
os.system("mkdir -p logslurms")

# Launch the batch jobs
submit_job(makejob(commit_id, "linear", 10))
submit_job(makejob(commit_id, "cnn", 10))
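The cleanliness check in the script above counts lines through a shell pipeline with expr. As an alternative sketch, and not the method used by the script, the same information can be read from `git status --porcelain`, and the commit id from `git rev-parse HEAD`. The function names below are hypothetical. Untracked files (lines starting with `??`) are ignored, which matches the behaviour of the `git diff` based check.

```python
import subprocess


def has_uncommitted_changes(porcelain_output):
    """True if `git status --porcelain` output lists a modified or staged tracked file."""
    for line in porcelain_output.splitlines():
        # Lines starting with '??' are untracked files, which we ignore
        if line and not line.startswith("??"):
            return True
    return False


def current_commit_id():
    """Return the id of HEAD, refusing to proceed if the work tree is dirty."""
    status = subprocess.run(["git", "status", "--porcelain"],
                            check=True, capture_output=True, text=True).stdout
    if has_uncommitted_changes(status):
        raise RuntimeError("You must stage and commit every modification before submission")
    head = subprocess.run(["git", "rev-parse", "HEAD"],
                          check=True, capture_output=True, text=True).stdout
    return head.strip()
```

Passing the commands as argument lists avoids going through a shell, and `check=True` turns a failing git call (for example outside of a repository) into an exception instead of a silent empty result.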
Running an MPI experiment¶
TBD