DCE cluster overview

General introduction

Why do I need this?

Running computations on remote machines can be useful for many reasons:

  • Accessing a machine with the right installation (Linux, ROS, VNC, ...)
  • Accessing a machine with a GPU for intensive computing (deep learning labworks, projects, using a machine more powerful than yours, ...)
  • Accessing a machine and sharing the graphical session with teammates as well as the teacher (remote labworks using VNC or NoMachine, remote work to comply with Covid-19 restrictions).
  • ...

The machines available for that purpose come in groups, called "partitions". You may need only a subset of a cluster (maybe just one computer, or even just a part of its resources), since the whole cluster has to be shared between several users.

Be aware that you are not the only one who needs computational resources, and that you are not supposed to block the work of others. So ask for resources, and release them as soon as you do not need them anymore.

Understanding the basics helps you use the clusters cleverly (and it is quite easy). So read this website!

The global picture of the DCE is given below:

The opera metaphor

Let us introduce the use of a cluster through a more familiar metaphor. Your family (5 people) wants to attend Mozart's opera "The Magic Flute", played on the 21st of January 2031 at the "Opera Garnier". The show starts at 9pm and lasts 3 hours.

There are many opera houses in the world; the "Opera Garnier" is a specific one. An opera house is the metaphor of one of our clusters, since we have several clusters available.

The shared resource here is the seats. Seats represent machines in our metaphor. Seats have different features: some are located in the "balcony", some others on the "floor", and some others in the so-called "orchestra". The experience of the play is not the same whether you sit in the balcony or on the floor, for example. Someone is in charge of registering whether the seats are free or not, and of distributing the seats to newcomers; this is the role of the scheduler. For our clusters, the scheduler is slurm.

Let us call "floor", "balcony" and "orchestra" partitions (of seats). They are predefined by the architect of the opera house, and they gather seats. In the cluster, partitions gather machines with common features.

To allow your family to attend that play at that opera house at that time, you have to book 5 seats on the opera website. When you order on this website, you specify when you want the seats to be available, and where you want to be placed (i.e. in which partition). The website where booking is done is the metaphor of the cluster frontend, where a scheduler runs (the software slurm in our case). Each cluster has its own frontend allowing you to book machines.

Once I have booked the 5 seats, I get a single ticket (i.e. a reservation identifier). This gives me the right to sit somewhere (i.e. to compute) for a specific period of time. Outside this period, other people can use the seats (for another play given another day). If we forget to come to the opera, the seats remain booked for us... but they will not be used, and they are wasted even if other people wanted to attend the show.

Using a seat consists of actually sitting on it to benefit from the show (i.e. actually computing on the machine... the programs of different users do not attend a common "show", they just need to occupy a seat for a certain duration... this is a limitation of the metaphor). On the D-day, at the right time, everyone in my family can sit on the seats that have been booked, since I can justify the booking of 5 seats by showing the ticket (telling the reservation identifier). At that time only, a seat is given to one person who sits on it for actual usage. This is allocation. Each person then knows their seat number, which is called the job id. In other words, each user is given the right to work on a specific machine, and this right is identified by a job id.

Now let us suppose that the play is about to start and that I have not booked seats for it yet. I can ask for a seat at the front desk, in the balcony for example. If there are some free seats still available (i.e. not used and not booked), I will be given one... but this is risky without a reservation, since there may be no seat available for me. Last-minute allocation of machines is allowed on some clusters, but it may not succeed if everything is booked or in use at that time.

How can I get one machine?

Having a machine dedicated to your computation is called allocating a machine. So, to be able to run a program on a machine of some cluster, you need to allocate that machine.

An allocation is limited in time: a maximum duration (called the walltime) comes with your allocation (either the default one or the one you specify). An allocation can also be exclusive: even if you only allocate part of a machine, the rest of it cannot be used by another user.

The usual story for a labwork is that a set of machines has been booked in advance, and when you need to compute, you are allocated a machine taken among the ones that have been booked.

The usual story for a project is that you do not have a reservation, but you freely allocate machines, either in interactive mode for your live coding sessions, or in batch mode for long-running computations that do not require you to be in front of your screen.
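
For the batch case, a minimal slurm script could look like the sketch below (the partition name gpu_prod and the program train.py are hypothetical examples, to be adapted to your cluster and project):

    #!/bin/bash
    #SBATCH --partition=gpu_prod   # hypothetical partition name
    #SBATCH --time=04:00:00        # walltime: the job is stopped after 4 hours
    #SBATCH --output=job_%j.log    # %j is replaced by the job id

    # The long-running computation itself (train.py is just an example)
    python3 train.py

You would submit this script from the frontend (e.g. with "sbatch myscript.sh"), get a job id immediately, and be free to leave while the job waits for a machine and runs.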

Booking

Booking is done in advance, for example by requesting the use of the cluster on some website. This is typically what a teacher who plans a labwork for next week has to do. For example, s/he books 12 machines for next Friday afternoon, since 12 groups of students will attend the labwork at that time. In this use case, the students do not need to book anything.

Booking returns a specific reservation name that identifies it. It is similar to an entrance ticket for the opera, which justifies the booking of your seats. During the booked period, the machines are protected from any other allocation, except those which mention the specific reservation name (you can use the seats only if you show the ticket).

A reservation is assigned a start time, a walltime, and a partition:

  • the start time and walltime (i.e. duration) define a time period for usage, after which the machines can be used by somebody else.
  • the partition is a predefined group of machines.

Once you have a reservation name, mentioning a time period or a partition is useless, since the reservation already carries these two notions.
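
On a slurm cluster, the reservations known to the scheduler can typically be inspected from the frontend (a minimal sketch):

    # Show the existing reservations, with their name, start time,
    # duration and the partition or nodes they cover
    scontrol show reservation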

Allocation

OK, now, someone wants to use a machine for computing... right now. This is typically what a student needs when the labwork starts. The idea is to ask the scheduler of one cluster to give you a machine (or rather to lend you a machine, indeed).

To do so, you have to identify which cluster you want to request a machine from, and send allocation commands to it. Indeed, you have to talk to the scheduler of the cluster (i.e. the slurm server running on that cluster). Each cluster has a frontend machine where the scheduler runs, so you have to know the frontend address to ask for an allocation.

There are various means to allocate a node, depending on your skills in computer science, and we provide in-depth details on the connection page.
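
To give a first idea, an interactive allocation with slurm could look like the sketch below, once you are logged on the frontend (the partition name gpu_prod and the reservation name lab_session are hypothetical examples, to be replaced by the real ones):

    # List the partitions and the state of their machines
    sinfo

    # Ask for one machine on a given partition, for at most 2 hours
    salloc --partition=gpu_prod --time=02:00:00

    # Same request, but within a reservation booked by your teacher
    salloc --reservation=lab_session

    # List your current jobs and their job ids
    squeue -u $USER

    # Release the machine as soon as you are done with it
    scancel <jobid>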

How can I work remotely?

Once a remote machine is dedicated to you, you know the job id for it. You then use this job id to log in to the remote machine and work.
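
With slurm, one common way to do so is the following (a sketch, where 1234 stands for your actual job id):

    # Open an interactive shell within the allocation of job 1234
    srun --jobid=1234 --pty bash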

Logging in requires that you have a valid account (i.e. user) on that machine. This is something that the system administrators should have set up for you; make sure it is the case before you start to work. Otherwise, ask your teachers.

You can use your account to log on the frontend of the cluster as well, for example to access the disk space associated with your account. Once again, this requires that the administrators of the cluster have created the account for you, and this account is different from the one you are using for your ordinary work (mail, ...). So you may need to transfer files between your personal disk space and the disk space associated with your cluster account. We will see how in the page dedicated to data transfer.
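
As a foretaste, such a transfer could look like the sketch below (the address frontend.example.fr and the account myaccount are placeholders; the data transfer page gives the real procedure):

    # Copy a local file to your home directory on the frontend
    scp mydata.tar.gz myaccount@frontend.example.fr:~/

    # Copy a result file back from the frontend to the local current directory
    scp myaccount@frontend.example.fr:~/results.txt .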

You can also work on the remote machine with a graphical interface, as you usually do on a local machine. However, you need specific tools to connect to these graphical sessions; in our case, you can use either VNC or NoMachine. This is explained in detail in the example page.

Last, for security reasons, all accesses to remote machines are done with ssh, so that your bash commands are sent to the remote bash interpreter and the results are sent back to your screen, through an encrypted communication channel. This is why you have to learn ssh.
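
For instance, logging on the frontend, or tunnelling a VNC session through it, could look like the sketch below (frontend.example.fr, node01 and the port 5901 are placeholder values):

    # Log on the frontend of the cluster
    ssh myaccount@frontend.example.fr

    # Forward the VNC port of an allocated node through the frontend, so that
    # a local VNC client can connect to localhost:5901
    ssh -L 5901:node01:5901 myaccount@frontend.example.fr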

The types of accounts

We define two types of accounts:

  • for the labworks
  • for the projects

We give priority to the labworks. Some nodes can be used for projects, but we are in the process of extending the DCE to fully support projects.

By giving priority to the labworks, we mean that jobs identified as project jobs might be preempted if labwork jobs need the resources. In practice, this translates into partitions with specific rights depending on the type of your account.

Account creation requests must be performed by the teachers: either the teacher responsible for a labwork, who needs accounts for his/her students, or the teacher responsible for a project, for the students working on that project. These requests can be addressed by sending an email to dce_support(at)groupes.renater.fr

The teachers can ask for resources for their labwork sessions on https://www.dce-cs.fr.

General information about the infrastructure

  • OS distribution: Ubuntu 18.04 or 20.04
  • Network technology: TBD
  • Storage technology: local 256 GB SSD / 1 TB SATA; network NAS servers

Hardware details

Login nodes

Nodes   Nb of CPUs   CPU reference                      CPU gen        Max memory
chome   48           Intel Xeon Silver 4214R @ 2.4GHz   Cascade Lake   64 GB

CPU resources

Machine name   Nb of nodes   Nb of CPUs   CPU reference                     CPU gen   Max memory
kyle[01-68]    68            32           Intel Xeon Silver 4110 @ 2.1GHz   Skylake   64 GB
sar[01-32]     32            32           Intel Xeon E5-2637 v3 @ 3.50GHz   Haswell   32 GB

GPU resources

Machine name   Nb of nodes   Nb of CPUs   CPU reference                CPU gen   Max memory   GPU               GPU RAM
cam[00-16]     17            8            Intel Xeon W-2125 @ 4GHz     Haswell   32 GB        GeForce 1080      8 GB
tx[00-16]      17            8            Intel Xeon W-2125 @ 4GHz     Haswell   32 GB        GeForce 2080 Ti   11 GB
sh00           1             8            Intel Xeon W-2225 @ 4.1GHz   Haswell   32 GB        GeForce 3080      10 GB
sh[01-09]      9             8            Intel Xeon W-2225 @ 4.1GHz   Haswell   32 GB        GeForce 3090      24 GB
sh[10-16]      7             8            Intel Xeon W-2225 @ 4.1GHz   Haswell   32 GB        GeForce 1080 Ti   11 GB
sh[17-19]      3             8            Intel Xeon W-2225 @ 4.1GHz   Haswell   32 GB        GeForce 3080      10 GB
sh[20-22]      3             8            Intel Xeon W-2225 @ 4.1GHz   Haswell   32 GB        GeForce 3090      24 GB
