K-Scale Cluster

From Humanoid Robots Wiki

Revision as of 05:08, 7 May 2024

The K-Scale Labs clusters are shared clusters for robotics research. This page contains notes on how to access them.

Onboarding

To get onboarded, send us the public key that you want to use and, optionally, your preferred username.

After being onboarded, you should receive the following information:

  • Your user ID (for this example, we'll use stompy)
  • The jumphost ID (for this example, we'll use 127.0.0.1)
  • The cluster ID (for this example, we'll use 127.0.0.2)

To connect, you should be able to use the following command:

ssh -o ProxyCommand="ssh -i ~/.ssh/id_rsa -W %h:%p stompy@127.0.0.1" stompy@127.0.0.2 -i ~/.ssh/id_rsa

Note that ~/.ssh/id_rsa should point to your private key file.

Alternatively, you can add the following to your SSH config file, which should allow you to connect directly. Use your favorite editor to open the SSH config file (normally located at ~/.ssh/config on Ubuntu) and paste the text:

Host jumphost
    User stompy
    Hostname 127.0.0.1
    IdentityFile ~/.ssh/id_rsa

Host cluster
    User stompy
    Hostname 127.0.0.2
    ProxyJump jumphost
    IdentityFile ~/.ssh/id_rsa

After setting this up, you can connect directly with the command ssh cluster.
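If you would rather script the config change than paste it by hand, the following is a minimal sketch. The helper name add_cluster_hosts is hypothetical (it is not part of any cluster tooling), and the values in the example call are the placeholders used on this page. It appends both Host entries once and tightens the file permissions, since OpenSSH refuses a config file that other users can write to.

```shell
# Append the jumphost and cluster entries to an SSH config file, once.
# Usage: add_cluster_hosts CONFIG_FILE USER JUMPHOST_IP CLUSTER_IP KEY_PATH
# (add_cluster_hosts is a hypothetical helper name.)
add_cluster_hosts() {
    local config="$1" user="$2" jump="$3" cluster="$4" key="$5"

    # Do nothing if a "Host cluster" entry is already present.
    if [ -f "$config" ] && grep -q '^Host cluster$' "$config"; then
        echo "Host cluster already defined in $config"
        return 0
    fi

    mkdir -p "$(dirname "$config")"
    cat >> "$config" <<EOF

Host jumphost
    User $user
    Hostname $jump
    IdentityFile $key

Host cluster
    User $user
    Hostname $cluster
    ProxyJump jumphost
    IdentityFile $key
EOF
    # OpenSSH rejects a config file that other users can write to.
    chmod 600 "$config"
}

# Example with the placeholder values used on this page. The key path is
# quoted so that a literal ~ is written; ssh expands it when it reads the file.
# add_cluster_hosts ~/.ssh/config stompy 127.0.0.1 127.0.0.2 '~/.ssh/id_rsa'
```

Calling the function a second time leaves the file unchanged, so it should be safe to keep in a setup script.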

You can also access the cluster via VS Code. A tutorial on using SSH in VS Code is here.

Please inform us if you have any issues!

Notes

  • You may need to restart ssh to get it working.
  • You may be sharing your part of the cluster with other users. If so, it is a good idea to avoid using all the GPUs. If you're training models in PyTorch, you can do this by setting the CUDA_VISIBLE_DEVICES environment variable.
  • Avoid storing data files and model checkpoints in your root directory. Instead, use the /ephemeral directory. Your home directory should come with a symlink to a subdirectory that you have write access to.
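As a small illustration of the GPU note above, the sketch below limits a shell session to two GPUs. The device indices are arbitrary examples, and train.py is a hypothetical script name; adjust both to your own situation.

```shell
# Expose only GPUs 0 and 1 to every program started from this shell.
export CUDA_VISIBLE_DEVICES=0,1

# To limit a single run instead, prefix the command. train.py is a
# hypothetical script name used only for illustration:
#   CUDA_VISIBLE_DEVICES=2,3 python train.py

echo "Visible GPUs: $CUDA_VISIBLE_DEVICES"
```

Inside the process, the selected devices are renumbered starting from 0, so code that asks for device 0 gets the first GPU in the list.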

Cluster 1

Cluster 2

The cluster has 8 available nodes (each with 8 GPUs):

compute-permanent-node-68
compute-permanent-node-285
compute-permanent-node-493
compute-permanent-node-625
compute-permanent-node-626
compute-permanent-node-749
compute-permanent-node-801
compute-permanent-node-580

When you SSH in, you land on the bastion node pure-caribou-bastion. From there you can log in to any of the other nodes to test your code.

Reserving a GPU

Here is a shell function you can use to get an interactive GPU node through Slurm.

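gpunode () {
    # Look for a running (-t R) job of yours named "gpunode" and print only its ID.
    # If several match, use the first one.
    local job_id
    job_id=$(squeue -u "$USER" -h -t R -o %i -n gpunode | head -n 1)
    if [[ -n $job_id ]]
    then
        # Reuse the existing allocation instead of requesting a new one.
        echo "Attaching to job ID $job_id"
        # The shell variable is left unquoted so that it may include arguments.
        srun --jobid="$job_id" --partition="$SLURM_GPUNODE_PARTITION" --gpus="$SLURM_GPUNODE_NUM_GPUS" --cpus-per-gpu="$SLURM_GPUNODE_CPUS_PER_GPU" --pty $SLURM_XPUNODE_SHELL
        return 0
    fi
    # No running job was found, so request a new interactive allocation.
    echo "Creating new job"
    srun --partition="$SLURM_GPUNODE_PARTITION" --gpus="$SLURM_GPUNODE_NUM_GPUS" --cpus-per-gpu="$SLURM_GPUNODE_CPUS_PER_GPU" --interactive --job-name=gpunode --pty $SLURM_XPUNODE_SHELL
}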

Example environment variables:

export SLURM_GPUNODE_PARTITION='compute'
export SLURM_GPUNODE_NUM_GPUS=1
export SLURM_GPUNODE_CPUS_PER_GPU=4
export SLURM_XPUNODE_SHELL='/bin/bash'

Add the function and the environment variables to your shell startup file (for example ~/.bashrc), reload it, then run gpunode.
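If one of the variables is not set, srun receives an empty option value and the resulting error can be hard to read. The sketch below checks the four variables before you call gpunode. The helper name check_gpunode_env is hypothetical, and it assumes the variables are exported as in the example.

```shell
# Report any gpunode setting that is missing from the environment.
# check_gpunode_env is a hypothetical helper name.
check_gpunode_env() {
    local name missing=0
    for name in SLURM_GPUNODE_PARTITION SLURM_GPUNODE_NUM_GPUS \
                SLURM_GPUNODE_CPUS_PER_GPU SLURM_XPUNODE_SHELL; do
        # printenv exits non-zero when the variable is not exported.
        if ! printenv "$name" > /dev/null; then
            echo "Missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Typical use: only start the interactive job when everything is set.
# check_gpunode_env && gpunode
```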

You can see partition options by running sinfo.

You might get an error like this: groups: cannot find name for group ID 1506. Things should still run fine. Run nvidia-smi to check that the GPUs are visible.

Useful Commands

Set a node's state back to normal (replace nodename with the name of the node):

sudo scontrol update nodename='nodename' state=resume
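If you run this often, a small wrapper can save some typing. The sketch below is only a convenience around the command above; the function name resume_node is hypothetical, and the node name in the usage comment is one of the Cluster 2 nodes listed earlier.

```shell
# resume_node is a hypothetical wrapper around the scontrol command above.
# Usage: resume_node compute-permanent-node-68
resume_node() {
    # Refuse to run without a node name rather than sending an empty one.
    if [ -z "${1:-}" ]; then
        echo "usage: resume_node NODENAME" >&2
        return 2
    fi
    sudo scontrol update nodename="$1" state=resume
}

# To see which nodes are down or drained, and why, before resuming them:
#   sinfo -R
```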