K-Scale Cluster

From Humanoid Robots Wiki

Latest revision as of 00:12, 25 May 2024

The K-Scale Labs clusters are shared clusters for robotics research. This page contains notes on how to access them.

Onboarding

To get onboarded, send us the SSH public key that you want to use and, optionally, your preferred username.

Lambda Cluster

After being onboarded, you should receive the following information:

  • Your user ID (for this example, we'll use stompy)
  • The jumphost ID (for this example, we'll use 127.0.0.1)
  • The cluster ID (for this example, we'll use 127.0.0.2)

To connect, you should be able to use the following command:

ssh -o ProxyCommand="ssh -i ~/.ssh/id_rsa -W %h:%p stompy@127.0.0.1" stompy@127.0.0.2 -i ~/.ssh/id_rsa

Note that ~/.ssh/id_rsa should point to your private key file.

Alternatively, you can add the following to your SSH config file, which should allow you to connect directly. Use your favorite editor to open the SSH config file (normally located at ~/.ssh/config on Ubuntu) and paste the text:

Host jumphost
    User stompy
    Hostname 127.0.0.1
    IdentityFile ~/.ssh/id_rsa

Host cluster
    User stompy
    Hostname 127.0.0.2
    ProxyJump jumphost
    IdentityFile ~/.ssh/id_rsa

After setting this up, you can connect directly with the command ssh cluster.
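
To check that the config entries are picked up before connecting, one option is to ask ssh to print the settings it resolves for a host alias. The sketch below uses ssh -G (print the evaluated configuration and exit) and filters for the three values set above. The function name show_resolved is hypothetical, and the optional second argument is only there so that a config file other than ~/.ssh/config can be checked.

```shell
# Print the user, hostname and proxyjump values that ssh resolves for a host alias.
# $1: host alias (for example "cluster")
# $2: config file to read (optional, defaults to ~/.ssh/config)
show_resolved () {
    ssh -G -F "${2:-$HOME/.ssh/config}" "$1" | grep -E '^(user|hostname|proxyjump) '
}
```

With the example config above in place, running show_resolved cluster should list stompy as the user, 127.0.0.2 as the hostname and jumphost as the proxy jump. No connection is made, so this works even when the cluster is unreachable.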

You can also access the cluster via VS Code. A tutorial on using ssh in VS Code is available at https://code.visualstudio.com/docs/remote/ssh-tutorial.

Please inform us if you have any issues!

Notes

  • You may need to restart ssh to get it working.
  • You may be sharing your part of the cluster with other users. If so, it is a good idea to avoid using all the GPUs. If you're training models in PyTorch, you can do this by setting the CUDA_VISIBLE_DEVICES environment variable.
  • You should avoid storing data files and model checkpoints in your root directory. Instead, use the /ephemeral directory. Your home directory should come with a symlink to a subdirectory which you have write access to.
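
As a rough sketch of the GPU-sharing note above: CUDA_VISIBLE_DEVICES takes a comma-separated list of GPU indices, and CUDA programs such as PyTorch only see the listed devices. The indices 0,1 are only an example, and train.py in the final comment is a hypothetical script name.

```shell
# Expose only GPUs 0 and 1 to programs started from this shell.
# (The indices are an example; pick GPUs that nobody else is using.)
export CUDA_VISIBLE_DEVICES=0,1

# Count the entries in the comma-separated list as a sanity check.
num_visible=$(echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .)
echo "Exposing $num_visible GPU(s): $CUDA_VISIBLE_DEVICES"

# To limit a single command instead of the whole shell session:
#   CUDA_VISIBLE_DEVICES=0,1 python train.py
```

Inside the restricted process the visible devices are renumbered from zero, so PyTorch would refer to them as cuda:0 and cuda:1 regardless of the physical indices chosen.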

Andromeda Cluster

The Andromeda cluster is a different cluster which uses Slurm for job management. Authentication is different from the Lambda cluster - Ben will provide instructions directly.

Don't do anything computationally expensive on the main node or you will crash it for everyone. Instead, when you need to run some experiments, reserve a GPU (see below).

SLURM Commands

Show all jobs currently in the queue (running and pending):

squeue

Show only your own jobs:

squeue --me

Show the available partitions on the cluster:

sinfo

You'll see something like this:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      8   idle compute-permanent-node-[68,285,493,580,625-626,749,801]

This means:

  • There is one partition, called compute (the * after its name marks it as the default partition)
  • The partition has 8 nodes, all currently in the idle state
  • The nodes have names like compute-permanent-node-68
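
The same reading of the table can be done in a script. The sketch below is only an illustration: it parses the sample output shown above (stored in a variable so that it runs anywhere) and adds up the NODES column for rows in the idle state. On the cluster, the variable would be filled from the real command instead, as noted in the first comment.

```shell
# Sample output copied from above; on the cluster, use: sinfo_output=$(sinfo)
sinfo_output='PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      8   idle compute-permanent-node-[68,285,493,580,625-626,749,801]'

# Skip the header row (NR > 1), then sum the NODES column (field 4)
# for every row whose STATE column (field 5) is "idle".
idle_nodes=$(printf '%s\n' "$sinfo_output" |
    awk 'NR > 1 && $5 == "idle" { total += $4 } END { print total + 0 }')

echo "Idle nodes: $idle_nodes"
```

For the sample table this prints Idle nodes: 8.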

To launch a job, use srun (https://slurm.schedmd.com/srun.html) or sbatch (https://slurm.schedmd.com/sbatch.html).

  • srun runs a command directly with the requested resources
  • sbatch queues the job to run when resources become available

For example, suppose I have the following shell script, saved as test.sh and made executable with chmod +x test.sh:

#!/bin/bash

echo "Hello, world!"

nvidia-smi

I can use srun to run this script with the following result:

$ srun --gpus 8 ./test.sh
Hello, world!
Sat May 25 00:02:23 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|

... truncated

Alternatively, I can queue the job using sbatch, which gives me the following result:

$ sbatch --gpus 16 test.sh
Submitted batch job 461
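
When submitting from a script, it is often handy to capture the job ID for later use. A minimal sketch, assuming the output has the form shown above: the sample line is stored in a variable so that the snippet runs anywhere, and on the cluster it would come from the sbatch call noted in the first comment. sbatch also has a --parsable option that prints only the job ID, which avoids the string handling.

```shell
# Sample sbatch output from above; on the cluster, use:
#   submit_output=$(sbatch --gpus 16 test.sh)
submit_output='Submitted batch job 461'

# Strip everything up to and including the last space, leaving only the job ID.
job_id=${submit_output##* }
echo "Job ID: $job_id"

# The ID can then be passed to other Slurm commands, for example:
#   squeue -j "$job_id"
#   scancel "$job_id"
```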

We can specify sbatch options inside our shell script instead using the following syntax:

#!/bin/bash
#SBATCH --gpus 16

echo "Hello, world!"
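
A slightly fuller sketch of the same idea is shown below. The job name, GPU count, time limit and log file name are example values rather than cluster requirements. The #SBATCH lines are ordinary comments to bash, so the script also runs outside Slurm, in which case SLURM_JOB_ID is unset and the fallback text is used.

```shell
#!/bin/bash
# Name shown in the NAME column of squeue.
#SBATCH --job-name=hello
# Number of GPUs to request.
#SBATCH --gpus=1
# Wall-clock limit (HH:MM:SS).
#SBATCH --time=00:10:00
# Log file for the job's output; %j is replaced by the job ID.
#SBATCH --output=hello-%j.out

# Slurm sets SLURM_JOB_ID inside a job; fall back to "local" when run by hand.
message="Hello from job ${SLURM_JOB_ID:-local}"
echo "$message"
```

Options given on the sbatch command line take precedence over the #SBATCH lines in the script, so the values here act as defaults.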

After launching the job, we can see it running using our original squeue command:

$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               461   compute  test.sh      ben  R       0:37      1 compute-permanent-node-285

We can cancel an in-progress job by running scancel:

scancel 461

A reference sbatch script for launching machine learning jobs is available at https://github.com/kscalelabs/mlfab/blob/master/mlfab/task/launchers/slurm.py#L262-L309.

Reserving a GPU

Here is a shell function you can use to get an interactive session on a GPU node through Slurm.

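gpunode () {
    # Look for one of your running (-t R) jobs named "gpunode".
    # -h drops the header row and -o %i prints only the job ID.
    local job_id
    job_id=$(squeue -u "$USER" -h -t R -o %i -n gpunode)
    if [[ -n $job_id ]]
    then
        # A session already exists, so attach a new shell to it.
        echo "Attaching to job ID $job_id"
        srun --jobid="$job_id" --partition="$SLURM_GPUNODE_PARTITION" --gpus="$SLURM_GPUNODE_NUM_GPUS" --cpus-per-gpu="$SLURM_GPUNODE_CPUS_PER_GPU" --pty "$SLURM_XPUNODE_SHELL"
        return 0
    fi
    # Otherwise, request a new interactive job named "gpunode".
    echo "Creating new job"
    srun --partition="$SLURM_GPUNODE_PARTITION" --gpus="$SLURM_GPUNODE_NUM_GPUS" --cpus-per-gpu="$SLURM_GPUNODE_CPUS_PER_GPU" --interactive --job-name=gpunode --pty "$SLURM_XPUNODE_SHELL"
}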

Example environment variables:

export SLURM_GPUNODE_PARTITION='compute'
export SLURM_GPUNODE_NUM_GPUS=1
export SLURM_GPUNODE_CPUS_PER_GPU=4
export SLURM_XPUNODE_SHELL='/bin/bash'
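
One possible variation, shown only as a sketch, is to treat the values above as defaults that apply when a variable is not already set. This keeps any value exported earlier in the session or in another startup file.

```shell
# Keep values that are already set; otherwise fall back to the examples above.
export SLURM_GPUNODE_PARTITION="${SLURM_GPUNODE_PARTITION:-compute}"
export SLURM_GPUNODE_NUM_GPUS="${SLURM_GPUNODE_NUM_GPUS:-1}"
export SLURM_GPUNODE_CPUS_PER_GPU="${SLURM_GPUNODE_CPUS_PER_GPU:-4}"
export SLURM_XPUNODE_SHELL="${SLURM_XPUNODE_SHELL:-/bin/bash}"

echo "gpunode will request $SLURM_GPUNODE_NUM_GPUS GPU(s) on partition $SLURM_GPUNODE_PARTITION"
```

In bash, a one-off override can also be given on the command line, for example SLURM_GPUNODE_NUM_GPUS=2 gpunode.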

Add the function and the environment variables to your shell startup file (for example ~/.bashrc), open a new shell, then run gpunode.

You can see partition options by running sinfo.

You might see a message like groups: cannot find name for group ID 1506, but things should still run fine. You can confirm that the GPU is visible by running nvidia-smi.

Useful Commands

Set a node's state back to normal (replace the quoted nodename with the name of the node):

sudo scontrol update nodename='nodename' state=resume