## Getting started with the Emmy cluster at RRZE

Programming Techniques for Supercomputers (PTfS2020)

Revision 1.14 -- 2020-04-20

## 1. Login to the HPC machines at RRZE

Course logins for the HPC systems at RRZE are provided after you send your IdM account and name to Georg Hager. The HPC account will be mapped to your activated student or employee account, and you will later be able to change your password, configure email forwarding, etc., for your HPC account using the IdM self-service portal at https://www.idm.uni-erlangen.de/start.

You will perform most of the benchmark work on the Ivy Bridge cluster "Emmy". Detailed information about login, file systems, etc. can be found on the website:

All RRZE machines can be accessed via ssh from any machine within the FAU network. To log into one of Emmy's frontends, type

If you need software components that are not present on the production system you can use our general dialog server, cshpc.rrze.uni-erlangen.de. All network file systems are also available on this dialog server, and special software (e.g., plotting tools) is most probably installed there in reasonably current versions. If anything is missing, tell us so we can install it.

Please do not perform memory-intensive test runs on the frontends or dialog servers as this will disturb user operations. Moreover, many users are active on the frontends and you will not get sensible performance data anyway.

## 2. Compilers

On the Emmy cluster we use the Intel compiler suite. The Intel compilers usually deliver higher performance than GCC, and we are quite familiar with their characteristics. To access the Intel compilers you first have to set up your environment correctly. This can be done for the currently running shell via:
(you can also specify a version number; this will be required from time to time). This sets up the PATH and other environment variables that you need to work with the Intel compilers. The compilers are called ifort (Fortran 77/90), icc (C), and icpc (C++).

### 2.1 Recommended compiler switches

The Intel compilers have loads of command line options; the option -help will give you a complete list. We recommend using -O3 -xHost -fno-alias. The standard options (-c, -g, -o, etc.) are identical to GCC. If you want a report on what the compiler did in the optimization stage you can use -opt_report3, but don't expect too much readable information.

## 3. Batch processing

Short test runs can be started directly on the Emmy frontends. However, for producing reliable benchmark results it is preferable to submit the jobs to the batch queue. The batch system accepts requests for resources (e.g., "6 nodes for 24 hours") and queues them according to some priority scheme. A job gets run, i.e. a previously specified shell script gets executed, when the resources are available and the batch system has chosen the job to be started. Apart from running a batch script (see below) and interactive testing on the frontends you can submit an interactive batch job which gives you, e.g., a shell on a compute node for some time. You can do this by typing:

    $ qsub -l nodes=1:ppn=40,walltime=01:00:00 -I

This command will allocate a complete node (40 virtual cores) for 1 hour. You should always request complete nodes so that you can do your benchmarks on a quiet machine. Unless you do message passing parallelization (later this term) there will be no need to request more than one node.

If you want to run longer benchmarks or parameter studies you have to submit a batch script:

    $ qsub -l nodes=1:ppn=40,walltime=05:00:00 -m be -M you@somewhere script.csh

This will again request one complete node, this time for 5 hours. After job submission, qsub will print the job's ID number. You will be notified by e-mail to the indicated address (-M option) when the job starts and when it ends (-m be). When the job starts, the script script.csh gets executed on the allocated node. Fig. 1 shows a simple example of a batch script.

Figure 1: A simple batch script

    #!/bin/csh
    #
    # the script runs in $HOME, so
    # change to correct directory (i.e. the directory
    # from which the job was submitted)
    cd $PBS_O_WORKDIR
    # start executable
    ./a.out

If your job is not finished after the requested walltime, it will be killed mercilessly. You may request up to 24 hours of runtime, but be aware that shorter runtimes increase the probability that the job starts early. So try to give a sensible estimate for your runtime requirements.

After the job has finished, its stdout and stderr outputs will be saved in the directory from which you submitted it. The names of those files are usually composed of the job name and ID, but can be modified using the -o and -e options to qsub (see manpage).

You can watch and control your jobs using the qstat and qdel commands, respectively.
• qstat will show you all your jobs, whether running (status 'R') or queued (status 'Q').
• qdel takes one or more job IDs (just the numbers) as arguments and allows you to remove a job from the queue, even when it's already running.

## 4. Measuring elapsed time and consumed CPU time

Sample code for measuring elapsed time and consumed CPU time is provided in the files timing.* located in the directory ~unrz55/GettingStarted. An example of how to use the timing functions in C is provided in the file example.c in the same directory.

You can link the timing.o object file to a Fortran program. It should work out of the box because timing.c also defines a wrapper function with an underscore appended to its name.

Of course there are many other ways to measure time. Feel free to use your favorite routines instead of the ones mentioned above. Bear in mind that the only reliable measure for performance is wallclock time; CPU time can often be misleading.
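To illustrate why wallclock time and CPU time can differ, here is a minimal sketch in C. It is not the course's timing.c: the helper names wall_seconds, cpu_seconds, and idle_ms are made up for this illustration, and the code assumes a POSIX system that provides clock_gettime and nanosleep. While the process sleeps, wallclock time advances but almost no CPU time is consumed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Wallclock time in seconds, read from a monotonic clock. */
double wall_seconds(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}

/* CPU time consumed by this process so far, in seconds. */
double cpu_seconds(void)
{
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* Block for roughly `ms` milliseconds without doing any work. */
void idle_ms(long ms)
{
    struct timespec req;
    assert(ms >= 0);
    req.tv_sec = ms / 1000;
    req.tv_nsec = (ms % 1000) * 1000000L;
    nanosleep(&req, NULL);
}
```

Reading both timers before and after idle_ms(50) should show a wallclock difference of about 0.05 s and a CPU time difference close to zero. In a multithreaded program on Linux the opposite can happen, because clock() adds up the CPU time of all threads.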
Also bear in mind that timing functions have limited granularity, i.e., it usually does not make sense to measure time on a scale of microseconds. Always write your benchmarks so that the time intervals to be measured are at least 100 ms.
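One common way to respect the 100 ms rule is to repeat the kernel until the measured interval is long enough. The following is a minimal sketch, assuming a POSIX system; time_kernel, kernel_fn, vec_args, and sample_kernel are hypothetical names chosen for this example and are not part of the provided timing.* files.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Wallclock time in seconds, read from a monotonic clock. */
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}

/* Hypothetical kernel signature: any function working on opaque data. */
typedef void (*kernel_fn)(void *data);

/*
 * Run `kernel` repeatedly, doubling the repetition count until a single
 * measurement lasts at least `min_seconds`. Returns the wallclock time
 * per kernel call and stores the final repetition count in *reps_out
 * (if reps_out is not NULL).
 */
double time_kernel(kernel_fn kernel, void *data, double min_seconds,
                   long *reps_out)
{
    long reps = 1;
    double elapsed;

    assert(kernel != NULL && min_seconds > 0.0);
    for (;;) {
        double start = wall_seconds();
        for (long r = 0; r < reps; ++r)
            kernel(data);
        elapsed = wall_seconds() - start;
        if (elapsed >= min_seconds)
            break;
        reps *= 2;
    }
    if (reps_out != NULL)
        *reps_out = reps;
    return elapsed / (double)reps;
}

/* Hypothetical example kernel: in-place update of a vector. */
typedef struct {
    double *a;
    size_t n;
} vec_args;

void sample_kernel(void *data)
{
    vec_args *v = (vec_args *)data;
    for (size_t i = 0; i < v->n; ++i)
        v->a[i] = 0.5 * v->a[i] + 1.0;
}
```

A call such as time_kernel(sample_kernel, &args, 0.1, &reps) enforces the 100 ms minimum. In a real benchmark, make sure the compiler cannot optimize the kernel away, for instance by using its results afterwards.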

### Clock frequency settings

If you want to get accurate timings in terms of processor cycles, you have to know the exact clock speed of the CPU. The Emmy processors have a nominal clock speed of 2.2 GHz, but "Turbo Mode" is enabled by default. This means that the CPU can "overclock" to some degree, depending on the number of active cores and the temperature. The highest possible clock speed is 3.0 GHz. In order to set the clock frequency to a specific (fixed) value you can specify a parameter at job submit time:

    $ qsub -l nodes=1:ppn=40:f2.2,walltime=01:00:00 ...

In this example, the clock speed for all cores in this job would be set to 2.2 GHz. The available settings are listed in the output of the "pbsnodes -a" command.
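With the clock frequency fixed, a measured wallclock time can be converted to processor cycles by multiplying it with the clock frequency. The following is a minimal sketch in C; cycles_per_iteration is a made-up helper name, and the result is only meaningful if the clock speed really stayed at the given value for the whole measurement.

```c
#include <assert.h>

/*
 * Convert a wallclock time (in seconds) measured for `iterations` loop
 * iterations into CPU cycles per iteration, assuming a fixed clock
 * frequency of `clock_ghz` GHz during the measurement.
 */
double cycles_per_iteration(double seconds, double clock_ghz, long iterations)
{
    assert(seconds >= 0.0 && clock_ghz > 0.0 && iterations > 0);
    return seconds * clock_ghz * 1.0e9 / (double)iterations;
}
```

For example, a loop with 10^9 iterations that takes 0.5 s at a fixed 2.2 GHz corresponds to 1.1 cycles per iteration.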