Humanoid Robots Wiki β
K-Scale Cluster

3,249 bytes added, 00:12, 25 May 2024
Unlike the Lambda cluster, the Andromeda cluster uses Slurm for job management. Authentication is also handled differently from the Lambda cluster; Ben will provide instructions directly.
 
Don't do anything computationally expensive on the main node or you will crash it for everyone. Instead, when you need to run some experiments, reserve a GPU (see below).
 
==== SLURM Commands ====
 
Show all currently running jobs:
 
<syntaxhighlight lang="bash">
squeue
</syntaxhighlight>
 
Show your own running jobs:
 
<syntaxhighlight lang="bash">
squeue --me
</syntaxhighlight>
 
Show the available partitions on the cluster:
 
<syntaxhighlight lang="bash">
sinfo
</syntaxhighlight>
 
You'll see something like this:
 
<syntaxhighlight lang="bash">
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up infinite 8 idle compute-permanent-node-[68,285,493,580,625-626,749,801]
</syntaxhighlight>
 
This means:
 
* There is one compute node type, called <code>compute</code>
* There are 8 nodes of that type, all currently in <code>idle</code> state
* The node names are things like <code>compute-permanent-node-68</code>
 
To launch a job, use [https://slurm.schedmd.com/srun.html srun] or [https://slurm.schedmd.com/sbatch.html sbatch].
 
* '''srun''' runs a command directly with the requested resources
* '''sbatch''' queues the job to run when resources become available
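For quick debugging, <code>srun</code> can also start an interactive shell on a compute node rather than running a script. A minimal sketch, using the <code>compute</code> partition name from the <code>sinfo</code> output above:

<syntaxhighlight lang="bash">
# Request one GPU and open an interactive shell on a compute node.
# --pty attaches a pseudo-terminal so the shell behaves normally.
srun --partition=compute --gpus=1 --pty bash
</syntaxhighlight>

When you exit the shell, the allocation is released.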
 
For example, suppose I have the following shell script:
 
<syntaxhighlight lang="bash">
#!/bin/bash
 
echo "Hello, world!"
 
nvidia-smi
</syntaxhighlight>
 
I can use <code>srun</code> to run this script with the following result:
 
<syntaxhighlight lang="bash">
$ srun --gpus 8 ./test.sh
Hello, world!
Sat May 25 00:02:23 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
 
... truncated
</syntaxhighlight>
 
Alternatively, I can queue the job using <code>sbatch</code>, which gives me the following result:
 
<syntaxhighlight lang="bash">
$ sbatch --gpus 16 test.sh
Submitted batch job 461
</syntaxhighlight>
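By default, <code>sbatch</code> writes the job's stdout to a file named <code>slurm-&lt;jobid&gt;.out</code> in the directory where the job was submitted, so once the job has run we can inspect its output:

<syntaxhighlight lang="bash">
# Job ID 461 comes from the "Submitted batch job" message above
cat slurm-461.out
</syntaxhighlight>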
 
Instead of passing options on the command line, we can specify <code>sbatch</code> options inside the shell script itself using <code>#SBATCH</code> directives:
 
<syntaxhighlight lang="bash">
#!/bin/bash
#SBATCH --gpus 16
 
echo "Hello, world!"
</syntaxhighlight>
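Other common options can be set the same way. A sketch with a few standard directives (the job name, output path, and time limit here are illustrative values, not cluster requirements):

<syntaxhighlight lang="bash">
#!/bin/bash
#SBATCH --job-name=hello      # name shown in squeue
#SBATCH --gpus=16             # number of GPUs to allocate
#SBATCH --output=%j.out       # stdout file; %j expands to the job ID
#SBATCH --time=01:00:00       # wall-clock limit (HH:MM:SS)

echo "Hello, world!"
</syntaxhighlight>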
 
After launching the job, we can see it running using our original <code>squeue</code> command:
 
<syntaxhighlight lang="bash">
$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
461 compute test.sh ben R 0:37 1 compute-permanent-node-285
</syntaxhighlight>
 
We can cancel an in-progress job by running <code>scancel</code>:
 
<syntaxhighlight lang="bash">
scancel 461
</syntaxhighlight>
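On reasonably recent Slurm versions, <code>scancel</code> can also select jobs by name or cancel all of your own jobs at once:

<syntaxhighlight lang="bash">
# Cancel every one of your jobs named test.sh
scancel --name test.sh

# Cancel all of your own jobs
scancel --me
</syntaxhighlight>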
 
[https://github.com/kscalelabs/mlfab/blob/master/mlfab/task/launchers/slurm.py#L262-L309 Here is a reference] <code>sbatch</code> script for launching machine learning jobs.
==== Reserving a GPU ====