The K-Scale Labs clusters are shared clusters for robotics research. This page contains notes on how to access them.
Onboarding
To get onboarded, send us the SSH public key you want to use, along with your preferred username.
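If you don't already have a key pair, you can generate one locally. A minimal sketch (the 4096-bit RSA type is just one common choice; the default path matches the ~/.ssh/id_rsa path used in the examples below):

```shell
# Generate a new RSA key pair. Accepting the default path puts it at
# ~/.ssh/id_rsa, which the connection examples below assume.
ssh-keygen -t rsa -b 4096

# Print the public half -- this is the part you send to us.
cat ~/.ssh/id_rsa.pub
```

Never share the private key (the file without the .pub extension) with anyone.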
Lambda Cluster
After being onboarded, you should receive the following information:
- Your user ID (for this example, we'll use stompy)
- The jumphost address (for this example, we'll use 127.0.0.1)
- The cluster address (for this example, we'll use 127.0.0.2)
To connect, you should be able to use the following command:
ssh -o ProxyCommand="ssh -i ~/.ssh/id_rsa -W %h:%p stompy@127.0.0.1" stompy@127.0.0.2 -i ~/.ssh/id_rsa
Note that ~/.ssh/id_rsa should point to your private key file.
Alternatively, you can add the following to your SSH config file, which will let you connect directly. Use your favorite editor to open the SSH config file (normally located at ~/.ssh/config) and paste the text:
Host jumphost
User stompy
Hostname 127.0.0.1
IdentityFile ~/.ssh/id_rsa
Host cluster
User stompy
Hostname 127.0.0.2
ProxyJump jumphost
IdentityFile ~/.ssh/id_rsa
After setting this up, you can connect directly with the command ssh cluster.
You can also connect through VS Code, using its Remote - SSH support with the same SSH config entry.
Please inform us if you have any issues!
Notes
- You may need to restart ssh to get it working.
- You may be sharing your part of the cluster with other users. If so, it is a good idea to avoid using all the GPUs. If you're training models in PyTorch, you can restrict which GPUs your job sees with the CUDA_VISIBLE_DEVICES environment variable.
- You should avoid storing data files and model checkpoints in your root directory. Instead, use the /ephemeral directory. Your home directory should come with a symlink to a subdirectory which you have write access to.
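As a sketch of the CUDA_VISIBLE_DEVICES approach (the python3 -c command here just echoes the variable; a real run would invoke your training script the same way):

```shell
# Restrict a process to GPUs 0 and 1. CUDA renumbers the visible
# devices, so inside the process they appear as devices 0 and 1.
CUDA_VISIBLE_DEVICES=0,1 python3 -c 'import os; print(os.environ["CUDA_VISIBLE_DEVICES"])'
```

For an actual job this would look like CUDA_VISIBLE_DEVICES=0,1 python3 train.py, where train.py stands in for whatever your entry point is.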
Andromeda Cluster
The Andromeda cluster is a different cluster which uses Slurm for job management. Authentication is different from the Lambda cluster - Ben will provide instructions directly.
Don't do anything computationally expensive on the main node or you will crash it for everyone. Instead, when you need to run some experiments, reserve a GPU (see below).
Slurm Commands
Show all currently running jobs:
squeue
Show your own running jobs:
squeue --me
Show the available partitions on the cluster:
sinfo
You'll see something like this:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up infinite 8 idle compute-permanent-node-[68,285,493,580,625-626,749,801]
This means:
- There is one partition, called compute (the * marks the default partition)
- There are 8 nodes in that partition, all currently in the idle state
- The node names are things like compute-permanent-node-68
To launch a job, use srun or sbatch.
- srun runs a command directly with the requested resources
- sbatch queues the job to run when resources become available
For example, suppose I have the following shell script:
#!/bin/bash
echo "Hello, world!"
nvidia-smi
I can use srun to run this script, with the following result:
$ srun --gpus 8 ./test.sh
Hello, world!
Sat May 25 00:02:23 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
... truncated
Alternatively, I can queue the job using sbatch, which gives me the following result:
$ sbatch --gpus 16 test.sh
Submitted batch job 461
We can specify sbatch options inside our shell script instead, using the following syntax:
#!/bin/bash
#SBATCH --gpus 16
echo "Hello, world!"
After launching the job, we can see it running using our original squeue command:
$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
461 compute test.sh ben R 0:37 1 compute-permanent-node-285
We can cancel an in-progress job by running scancel:
scancel 461
Here is a reference sbatch script for launching machine learning jobs.
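A minimal sketch of what such a script might look like; the job name, resource numbers, and the train.py entry point are placeholder assumptions, not K-Scale conventions:

```shell
#!/bin/bash
#SBATCH --job-name=train-model
#SBATCH --partition=compute
#SBATCH --gpus=8
#SBATCH --cpus-per-gpu=4
#SBATCH --output=slurm-%j.out   # %j expands to the job ID

# Log where the job landed and which GPUs it sees.
echo "Running on $(hostname)"
nvidia-smi

# Placeholder entry point for the actual training run.
srun python3 train.py
```

Submit it with sbatch followed by the script path, then monitor it with squeue --me.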
Reserving a GPU
Here is a script you can use for getting an interactive node through Slurm.
gpunode () {
    # If a gpunode job is already running, attach to it rather than
    # creating a second reservation.
    local job_id=$(squeue -u "$USER" -h -t R -o %i -n gpunode)
    if [[ -n $job_id ]]; then
        echo "Attaching to job ID $job_id"
        srun --jobid="$job_id" --partition="$SLURM_GPUNODE_PARTITION" --gpus="$SLURM_GPUNODE_NUM_GPUS" --cpus-per-gpu="$SLURM_GPUNODE_CPUS_PER_GPU" --pty "$SLURM_GPUNODE_SHELL"
        return 0
    fi
    echo "Creating new job"
    srun --partition="$SLURM_GPUNODE_PARTITION" --gpus="$SLURM_GPUNODE_NUM_GPUS" --cpus-per-gpu="$SLURM_GPUNODE_CPUS_PER_GPU" --interactive --job-name=gpunode --pty "$SLURM_GPUNODE_SHELL"
}
Example env vars:
export SLURM_GPUNODE_PARTITION='compute'
export SLURM_GPUNODE_NUM_GPUS=1
export SLURM_GPUNODE_CPUS_PER_GPU=4
export SLURM_GPUNODE_SHELL='/bin/bash'
Add the function and the environment variables to your shell configuration (e.g. ~/.bashrc), then run gpunode.
You can see partition options by running sinfo.
You might get an error like groups: cannot find name for group ID 1506, but things should still run fine. Check with nvidia-smi.
Useful Commands
Set a node state back to normal:
sudo scontrol update nodename='nodename' state=resume