The K-Scale Labs clusters are shared clusters for robotics research. This page contains notes on how to access them.
=== Onboarding ===
To get onboarded, send us the SSH public key that you want to use and, optionally, your preferred username.
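If you don't already have a key pair, you can generate one locally and send us the public half (a minimal sketch; the ed25519 key type and default file path are just common choices, and the email is a placeholder):

<syntaxhighlight lang="bash">
# Generate a new SSH key pair (accept the default path when prompted):
ssh-keygen -t ed25519 -C "you@example.com"

# Print the public key; this is the part to send us:
cat ~/.ssh/id_ed25519.pub
</syntaxhighlight>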
=== Lambda Cluster ===
After being onboarded, you should receive the following information:
* You should avoid storing data files and model checkpoints in your home directory. Instead, use the <code>/ephemeral</code> directory; your home directory should come with a symlink to a subdirectory of <code>/ephemeral</code> that you have write access to.
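For example (a hypothetical session; the symlink name <code>~/ephemeral</code> is an assumption, so check what your home directory actually contains):

<syntaxhighlight lang="bash">
# See where the symlink in your home directory points:
ls -l ~/ephemeral

# Keep large artifacts like datasets and checkpoints under it:
mkdir -p ~/ephemeral/checkpoints
cp model.ckpt ~/ephemeral/checkpoints/
</syntaxhighlight>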
=== Andromeda Cluster 1 ===

The Andromeda cluster is a different cluster which uses Slurm for job management. Authentication is different from the Lambda cluster - Ben will provide instructions directly.

Don't do anything computationally expensive on the main node or you will crash it for everyone. Instead, when you need to run some experiments, reserve a GPU (see below).

==== SLURM Commands ====

Show all currently running jobs:

<syntaxhighlight lang="bash">
squeue
</syntaxhighlight>

Show your own running jobs:

<syntaxhighlight lang="bash">
squeue --me
</syntaxhighlight>

Show the available partitions on the cluster:

<syntaxhighlight lang="bash">
sinfo
</syntaxhighlight>

You'll see something like this:

<syntaxhighlight lang="bash">
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      8   idle compute-permanent-node-[68,285,493,580,625-626,749,801]
</syntaxhighlight>

This means:

* There is one compute node type, called <code>compute</code>
* There are 8 nodes of that type, all currently in the <code>idle</code> state
* The node names are things like <code>compute-permanent-node-68</code>

To launch a job, use [https://slurm.schedmd.com/srun.html srun] or [https://slurm.schedmd.com/sbatch.html sbatch].

* '''srun''' runs a command directly with the requested resources
* '''sbatch''' queues the job to run when resources become available

For example, suppose I have the following shell script:

<syntaxhighlight lang="bash">
#!/bin/bash

echo "Hello, world!"
nvidia-smi
</syntaxhighlight>

I can use <code>srun</code> to run this script with the following result:
<syntaxhighlight lang="bash">
$ srun --gpus 8 ./test.sh
Hello, world!
Sat May 25 00:02:23 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
... truncated
</syntaxhighlight>

Alternatively, I can queue the job using <code>sbatch</code>, which gives me the following result:

<syntaxhighlight lang="bash">
$ sbatch --gpus 16 test.sh
Submitted batch job 461
</syntaxhighlight>

We can specify <code>sbatch</code> options inside our shell script instead using the following syntax:

<syntaxhighlight lang="bash">
#!/bin/bash
#SBATCH --gpus 16

echo "Hello, world!"
</syntaxhighlight>

After launching the job, we can see it running using our original <code>squeue</code> command:

<syntaxhighlight lang="bash">
$ squeue --me
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   461   compute  test.sh      ben  R       0:37      1 compute-permanent-node-285
</syntaxhighlight>

We can cancel an in-progress job by running <code>scancel</code>:

<syntaxhighlight lang="bash">
scancel 461
</syntaxhighlight>

=== Cluster 2 ===

The cluster has 8 available nodes (each with 8 GPUs):

<syntaxhighlight lang="text">
compute-permanent-node-68
compute-permanent-node-285
compute-permanent-node-493
compute-permanent-node-580
compute-permanent-node-625
compute-permanent-node-626
compute-permanent-node-749
compute-permanent-node-801
</syntaxhighlight>
When you SSH in, you land on the bastion node <code>pure-caribou-bastion</code>, from which you can log in to any other node to test your code.
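If your local SSH config knows how to reach the bastion, you can hop straight to a compute node in one command (a sketch; it assumes <code>pure-caribou-bastion</code> resolves from your machine, e.g. via a <code>Host</code> entry in <code>~/.ssh/config</code>, and the node name is just an example):

<syntaxhighlight lang="bash">
# -J jumps through the bastion (ProxyJump) before connecting to the node:
ssh -J pure-caribou-bastion compute-permanent-node-68
</syntaxhighlight>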
[https://github.com/kscalelabs/mlfab/blob/master/mlfab/task/launchers/slurm.py#L262-L309 Here is a reference] <code>sbatch</code> script for launching machine learning jobs.

==== Reserving a GPU ====
Here is a script you can use for getting an interactive node through Slurm:

<syntaxhighlight lang="bash">
gpunode() {
    srun \
        --partition=$SLURM_GPUNODE_PARTITION \
        --gpus=$SLURM_GPUNODE_NUM_GPUS \
        --cpus-per-gpu=$SLURM_GPUNODE_CPUS_PER_GPU \
        --interactive \
        --job-name=gpunode \
        --pty $SLURM_XPUNODE_SHELL
}
</syntaxhighlight>
Example env vars:
<syntaxhighlight lang="bash">
export SLURM_GPUNODE_PARTITION='compute'
export SLURM_GPUNODE_NUM_GPUS=1
export SLURM_GPUNODE_CPUS_PER_GPU=4
export SLURM_XPUNODE_SHELL='/bin/bash'
</syntaxhighlight>
Integrate the example script into your shell, then run <code>gpunode</code>.
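Concretely, integration might look like this (a sketch; it assumes you appended the exports and the <code>gpunode</code> function to your <code>~/.bashrc</code>):

<syntaxhighlight lang="bash">
# Reload your shell configuration, then request an interactive GPU node:
source ~/.bashrc
gpunode
</syntaxhighlight>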
You can see partition options by running <code>sinfo</code>.
You might get an error like this: <code>groups: cannot find name for group ID 1506</code>. Things should still run fine despite it; check with <code>nvidia-smi</code>, as below.
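For example, a compact check that the GPUs you requested are actually visible (<code>-L</code> lists devices):

<syntaxhighlight lang="bash">
# List the GPUs visible to your job; the count should match your request:
nvidia-smi -L
</syntaxhighlight>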
==== Useful Commands ====
Set a node's state back to normal (for example, after it was marked <code>down</code> or <code>drain</code>):
<syntaxhighlight lang="bash">
sudo scontrol update nodename='nodename' state=resume
</syntaxhighlight>
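To find nodes that need this, you can first list the nodes Slurm has taken out of service along with the recorded reason:

<syntaxhighlight lang="bash">
# Show down/drained nodes and why they were marked that way:
sinfo -R
</syntaxhighlight>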
[[Category:K-Scale]]