## 1. Login to the HPC machines at RRZE

Course logins for the HPC systems at RRZE are provided during the first exercise. You will get a password for your course accounts.
You will perform all of the benchmark work on the IvyBridge-EP cluster "Emmy". Detailed information about login, file systems etc. can be found on the website. To log in, type:

$ ssh -Y -p 8196 <your_account>@grid.rrze.uni-erlangen.de

This will take you to the cluster frontend "emmy1". We use SSH port forwarding because the cluster is usually not visible from the outside world (private IP addresses).
Please do not perform memory-intensive test runs on the frontends or dialog servers, as this will disturb user operations. Moreover, many users are active on the frontends, so you will not get sensible performance data there anyway.
Example code for the hands-on exercises can be found in the directory ~j02y0000. E.g., for copying the directory DIV to your home you can type:

$ cp -a ~j02y0000/DIV ~

The descriptions of the exercises contain the names of the respective folders.

## 2. Compilers

On the Emmy cluster we use the Intel compiler suite. Usually the Intel compilers deliver higher performance than GCC, and we are quite familiar with their characteristics. To access the Intel compilers you first have to set up your environment correctly. Execute once per shell:

$ module load intel64

You can also specify a version number; this will be required from time to time. The command sets up PATH and the other variables that you need to work with the Intel compilers. The compilers are called ifort (Fortran 77/90), icc (C) and icpc (C++).

### 2.1 Recommended compiler switches

The Intel compilers have loads of command line options. We recommend using -O3 -xHost -fno-alias. The option -help will give you a complete list. The standard options (-c, -g, -o etc.) are identical to GCC. If you want a report on what the compiler did in the optimization stage you can use -opt_report3, but don't expect too much readable information.

## 3. Batch processing

Short test runs can be started directly on the Emmy frontends. However, for producing reliable benchmark results it is preferable to submit the jobs to the batch queue. The batch system accepts requests for resources (e.g., "6 nodes for 24 hours") and queues them according to some priority scheme. A job gets run, i.e. a previously specified shell script gets executed, when the resources are available and the batch system has chosen the job to be started.

Some nodes of the Emmy cluster will be reserved for your exclusive use during tutorial hours. During the rest of the time, no CPUs are reserved, but you have a high priority.

Apart from running a batch script (see below) and interactive testing on the frontends, you can submit an interactive batch job which gives you, e.g., a shell on a compute node for some time. You can do this by typing:

$ qsub -l nodes=1:ppn=40,walltime=02:00:00 -I

This command will allocate a complete node (40 CPUs) for 2 hours. You should always request complete nodes so that you can do your benchmarks on a quiet machine. Unless you do message passing parallelization (later this term) there will be no need to request more than one node.
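
Once the interactive shell on the compute node is open, the environment setup and the recommended compiler switches can be combined into one build step. The following is only a sketch: the helper name `build_benchmark` and the file names `bench.c` and `bench` are placeholders that do not come from the course material.

```shell
# Sketch of a build step inside an interactive job.
# Assumptions: build_benchmark, bench.c and bench are hypothetical names.

build_benchmark() {
    src="$1"              # C source file to compile
    exe="${2:-a.out}"     # name of the executable, a.out if omitted
    # set up PATH and friends for the Intel compilers (once per shell)
    module load intel64 || return 1
    # compile with the recommended switches
    icc -O3 -xHost -fno-alias -o "$exe" "$src"
}

# Typical use on the compute node:
#   build_benchmark bench.c bench && ./bench
```

For Fortran or C++ sources, ifort or icpc would take the place of icc with the same switches.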
If you want to run longer benchmarks or parameter studies you have to submit a batch script:

Figure 1: A simple batch script

#!/bin/csh
#
# the script runs in $HOME, so
# change to correct directory (i.e. the directory
# from which the job was submitted)
cd $PBS_O_WORKDIR
# start executable
./a.out

$ qsub -l nodes=1:ppn=40,walltime=05:00:00 -m be -M you@somewhere script.csh

This will again request one complete node, this time for 5 hours. After job submission, qsub will print the job's ID number. You will be notified when the job starts and when it ends (-m be) via e-mail to the indicated address (-M option). When the job starts, the script script.csh gets executed on the allocated node. Fig. 1 shows a simple example of a batch script.

If your job is not finished after the requested walltime, it will be killed mercilessly. You may request up to 24 hours of runtime, but be aware that shorter runtimes increase the job's probability of running early. So try to give a sensible estimate of your runtime requirements.

After the job has finished, its stdout and stderr outputs will be saved in the directory from which you submitted it. The names of those files are usually composed from the job name and ID, but can be modified using the -o and -e options to qsub (see manpage).

You can watch and control your jobs using the qstat and qdel commands, respectively.

• qstat will show you all your jobs, whether running (status 'R') or queued (status 'Q').
• qdel takes one or more job IDs (just the numbers) as arguments and allows you to remove a job from the queue, even when it's already running.

### Clock frequency settings

If you want to get accurate timings in terms of processor cycles, you have to know the exact clock speed of the CPU. The Emmy processors have a nominal clock speed of 2.2 GHz, but "Turbo Mode" is enabled by default. This means that the CPU can "overclock" to some degree, depending on the number of active cores and the temperature. The highest possible clock speed is 3.0 GHz. In order to set the clock frequency to a specific (fixed) value you can specify a parameter at job submit time:

$ qsub -I -l nodes=1:ppn=40:f2.2,walltime=01:00:00 ...

In this example, the clock speed for all cores in this job would be set to 2.2 GHz. You can select from the following options: f2.2, f2.1, f2.0, f1.9, f1.8, f1.7, f1.6, f1.5, f1.4, f1.3, f1.2
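
With a fixed clock, a measured runtime converts to cycles as cycles = runtime in seconds × clock frequency in Hz. The sketch below illustrates this and also assembles the resource string for `qsub -l` from a walltime and one of the frequencies listed above. Both function names, `qsub_resources` and `seconds_to_cycles`, are hypothetical helpers invented for this example; they are not part of the batch system.

```shell
# Sketch under assumptions: qsub_resources and seconds_to_cycles are
# hypothetical helper names, not part of the batch system.

# Build the argument for "qsub -l": one full node, optional fixed clock.
qsub_resources() {
    walltime="$1"        # e.g. 01:00:00
    freq="${2:-}"        # e.g. 2.2; empty keeps the default (Turbo Mode)
    res="nodes=1:ppn=40"
    if [ -n "$freq" ]; then
        case "$freq" in
            2.2|2.1|2.0|1.9|1.8|1.7|1.6|1.5|1.4|1.3|1.2)
                res="$res:f$freq" ;;
            *)
                echo "unsupported frequency: $freq" >&2
                return 1 ;;
        esac
    fi
    echo "$res,walltime=$walltime"
}

# Convert a runtime in seconds to cycles at a fixed clock given in GHz.
seconds_to_cycles() {
    LC_ALL=C awk -v t="$1" -v f="$2" 'BEGIN { printf "%.0f\n", t * f * 1e9 }'
}

# Print the resource string for a one-hour job at 2.2 GHz,
# then the cycle count for a 2 s run at that clock.
qsub_resources 01:00:00 2.2
seconds_to_cycles 2 2.2
```

The resource string could then be passed on as `qsub -I -l "$(qsub_resources 01:00:00 2.2)"`. The cycle conversion is only meaningful if the job actually ran at the fixed frequency, i.e. not in Turbo Mode.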

## 4. LIKWID Tools

To use likwid-perfctr on the Emmy cluster you have to specify a special property to qsub:

$ qsub -I -l nodes=1:ppn=40:likwid,walltime=01:00:00 ...

likwid-topology, likwid-pin and almost all other LIKWID tools work without the property.
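
As a rough illustration of how the tools are typically invoked inside such a job, the sketch below wraps a pinned run and a counter measurement. Several things here are assumptions rather than course material: the helper names `run` and `likwid_measure`, the core list `0-9`, the executable `./a.out`, and the performance group `MEM` (`likwid-perfctr -a` lists the groups that are actually available on a node). By default the helper only prints the commands, so they can be checked before anything is executed.

```shell
# Sketch only: a.out, the core list and the MEM group are assumptions.
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

likwid_measure() {
    cores="$1"; shift
    # pin the threads of the program to the given cores
    run likwid-pin -c "$cores" "$@"
    # pin and additionally read the hardware counters of one group
    run likwid-perfctr -C "$cores" -g MEM "$@"
}

likwid_measure 0-9 ./a.out
```

Setting `DRY_RUN=0` before calling `likwid_measure` would execute the commands; the likwid-perfctr call then requires a job submitted with the likwid property.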