K-Scale Cluster
* You may be sharing your part of the cluster with other users. If so, it is a good idea to avoid using all the GPUs. If you're training models in PyTorch, you can do this by setting the <code>CUDA_VISIBLE_DEVICES</code> environment variable (see the sketch after this list).
* You should avoid storing data files and model checkpoints in your root directory. Instead, use the <code>/ephemeral</code> directory. Your home directory should come with a symlink to a subdirectory which you have write access to.
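As a rough illustration of both points, here is a minimal sketch. It assumes a hypothetical training script called <code>train.py</code> with a made-up <code>--checkpoint-dir</code> flag, and that the symlink in your home directory is named <code>ephemeral</code>; the actual script, flag, and symlink name on your account may differ.
<syntaxhighlight lang="bash">
# Only expose GPUs 0 and 1 to this process; PyTorch will see them as cuda:0 and cuda:1.
export CUDA_VISIBLE_DEVICES=0,1

# Hypothetical training run: write checkpoints under the /ephemeral-backed symlink
# in your home directory rather than directly in your home/root directory.
python train.py --checkpoint-dir "$HOME/ephemeral/checkpoints"
</syntaxhighlight>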
 
=== Cluster 1 ===
 
=== Cluster 2 ===
The cluster has 8 available nodes (each with 8 GPUs):
<syntaxhighlight lang="text">
compute-permanent-node-68
compute-permanent-node-285
compute-permanent-node-493
compute-permanent-node-625
compute-permanent-node-626
compute-permanent-node-749
compute-permanent-node-801
compute-permanent-node-580
</syntaxhighlight>
When you SSH in, you first land on the bastion node <code>pure-caribou-bastion</code>; from there you can log in to any of the other nodes to test your code.
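A minimal sketch of that two-hop login, assuming your cluster account is <code>your-username</code> and that the bastion hostname you were given resolves as written (both are placeholders, not the real addresses):
<syntaxhighlight lang="bash">
# First hop: log in to the bastion node.
ssh your-username@pure-caribou-bastion

# Second hop (run from the bastion): log in to one of the compute nodes listed above.
ssh compute-permanent-node-68
</syntaxhighlight>
If you connect often, a <code>ProxyJump</code> entry in your <code>~/.ssh/config</code> can collapse the two hops into a single <code>ssh</code> command.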
== Reserving a GPU ==