K-Scale Cluster
* You may be sharing your part of the cluster with other users. If so, it is a good idea to avoid using all the GPUs. If you're training models in PyTorch, you can do this by setting the <code>CUDA_VISIBLE_DEVICES</code> environment variable (see the sketch after this list).
* You should avoid storing data files and model checkpoints in your root directory. Instead, use the <code>/ephemeral</code> directory. Your home directory should come with a symlink to a subdirectory which you have write access to.
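As a rough illustration of both points, here is a minimal sketch. It assumes a hypothetical training script called <code>train.py</code> with a made-up <code>--checkpoint-dir</code> flag, and that the symlink in your home directory is named <code>ephemeral</code>; the actual script, flag, and symlink name on your account may differ.
<syntaxhighlight lang="bash">
# Only expose GPUs 0 and 1 to this process; PyTorch will see them as cuda:0 and cuda:1.
export CUDA_VISIBLE_DEVICES=0,1

# Hypothetical training run: write checkpoints under the /ephemeral-backed symlink
# in your home directory rather than directly in your home/root directory.
python train.py --checkpoint-dir "$HOME/ephemeral/checkpoints"
</syntaxhighlight>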
 
=== Cluster 1 ===
 
=== Cluster 2 ===
The cluster has 8 available nodes (each with 8 GPUs):
<syntaxhighlight lang="text">
compute-permanent-node-68
compute-permanent-node-285
compute-permanent-node-493
compute-permanent-node-625
compute-permanent-node-626
compute-permanent-node-749
compute-permanent-node-801
compute-permanent-node-580
</syntaxhighlight>
When you SSH in, you first land on the bastion node <code>pure-caribou-bastion</code>; from there you can log in to any of the other nodes to test your code.
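A minimal sketch of that two-hop login, assuming your cluster account is <code>your-username</code> and that the bastion hostname you were given resolves as written (both are placeholders, not the real addresses):
<syntaxhighlight lang="bash">
# First hop: log in to the bastion node.
ssh your-username@pure-caribou-bastion

# Second hop (run from the bastion): log in to one of the compute nodes listed above.
ssh compute-permanent-node-68
</syntaxhighlight>
If you connect often, a <code>ProxyJump</code> entry in your <code>~/.ssh/config</code> can collapse the two hops into a single <code>ssh</code> command.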
== Reserving a GPU ==