Fortunately, ETH provides many of its students with access to Leonhard, a computer cluster with GPUs. Note that only students with certain coursework or projects are granted access. Having lost a fair amount of time to poor search results, I will describe the setup for a simple workflow.
In order to ssh to the server, you first need to connect to ETH’s VPN.
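With the VPN connected, logging in looks something like this (the hostname is my assumption of the standard login node; replace the username with your ETH username):

```shell
# Log in to the cluster (hostname assumed; use your ETH username)
ssh username@login.leonhard.ethz.ch
```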
You will find yourself in your personal home directory right away.
You can install Python packages to your personal environment via pip. Quite naturally, though, you will want to use Python virtual environments! Conda is not installed by default, but you can download and run an installer, e.g. Miniconda:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
Once installed, in order to be able to invoke conda from the command line, you want to add it to your .bashrc file. Given that you installed conda in $HOME, add the corresponding export line to your .bashrc.
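Concretely, with Miniconda installed under $HOME/miniconda3 (the installer's default suggestion; adjust if your prefix differs), that line would be:

```shell
# Make conda available on the PATH (install prefix is an assumption)
export PATH="$HOME/miniconda3/bin:$PATH"
```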
Next, clone the git repository of your preference, which should include a conda environment file, and create the environment from it:
conda env create -f myenvironment.yml
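If the repository does not ship an environment file, a minimal myenvironment.yml could look like the following sketch (the package choices are illustrative assumptions, not from the original setup):

```shell
# Write a minimal conda environment file (contents are illustrative)
cat > myenvironment.yml <<'EOF'
name: myenvironment
dependencies:
  - python=3.8
  - pip
EOF
```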
All the steps so far only have to be executed once. You will need to activate the conda environment every time you open a fresh connection to the cluster and intend to run code in this environment:
conda activate myenvironment
An activated environment will automatically be picked up by the submission system. In other words, you activate the environment in your own bash session and can then run your Python scripts by submitting them via bsub.
For CPU usage, run:
bsub "python script.py --batch_size 2 --n_epochs 50000"
For GPU usage, run:
bsub -R "rusage[ngpus_excl_p=1]" "python script.py --batch_size 2 --n_epochs 50000"
In case you have access to more GPUs, you can increase the ngpus_excl_p parameter accordingly. Students will usually only have access to one.
Always make sure to request a sufficiently generous amount of time, as your job may be killed once it exceeds its limit. The default duration is 2h. The -W argument can be given either in minutes or in hours:minutes format:
bsub -W 8:30 -R "rusage[ngpus_excl_p=1]" "python script.py --batch_size 2 --n_epochs 50000"
To retroactively inspect the parameters of a completed job, the following will do:
bhist -l 192240
By default, standard output will be gathered in a file that is delivered to your home directory once the job is no longer running.
A simple way to track progress is to write to standard output and look into this file on the go.
bjobs will tell you the job id of your job, say 192240.
To inspect the standard output produced so far, run:
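LSF's bpeek command shows the output a still-running job has produced up to now; using the example job id from above:

```shell
# Peek at the live standard output of a running job
bpeek 192240
```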
To transfer files, e.g. log files or loss function evaluations, use scp. Note that wildcards have to be escaped when using scp. For instance, to copy files from Leonhard to your machine, run the following locally:
scp email@example.com:~/repo/logs/\*.csv .