Exercise 1: simple benchmarks


Start up MobaXterm, start a local session, and log into the frontend of the Emmy system at RRZE:

$ ssh -Y -p 8196 c79nXXXX@grid.rrze.fau.de

(Substitute "XXXX" with the number from your personal account).

Get the source files micro.c and stream.c from ~c79n0000/micro and ~c79n0000/stream, respectively. Have a look at the source code (always do this when you use a benchmark you did not write yourself).

Get an interactive job on the Emmy cluster with:

$ qsub -l nodes=1:ppn=40:f2.2,walltime=01:00:00 -I

(This also disables Turbo Mode and sets a fixed clock speed of 2.2 GHz for all the cores on the node that you have just allocated.)

Set up the environment:

$ module load intel64/19.0up05 likwid

Exploring the memory hierarchy

Build the executable with:

$ icc -Ofast -xhost -qopenmp -std=c99 -D_GNU_SOURCE   -o ./micro ./micro.c

Test if it is working:

$ ./micro
$ likwid-pin -c S0:2 ./micro  0 5000

There is a helper script ./bench.pl that scans the data set size. Use it as follows:

$ ./bench.pl <numcores> <seq|tp|ws>

You can generate a png plot of the result with gnuplot with:

$ gnuplot bench.plot 

The output is expected in bench.dat!

Run the following tests:

  1. Explore the bandwidth across the memory hierarchy using ./bench.pl 1 seq > bench.dat (this should take about 1 minute)
  2. Explore the node topology using likwid-topology -g and compare the reported cache sizes with the results of test 1.
  3. Explore parallel bandwidth across memory hierarchy using ./bench.pl <N> tp > bench-<N>.dat. Use 1,2,4,8,10,20 for N. There is a special bench-tp.plot file for gnuplot. What does this benchmark test?
  4. Now do the same with work sharing: ./bench.pl <N> ws > bench-<N>.dat. You can use the bench-ws.plot gnuplot config; this one also needs bench-seq.dat. What does this benchmark test?

What do you learn from the results?

Are these the theoretical limits? How can we know?

What is this useful for?

Main Memory Bandwidth

Build the executable with:

$ icc -Ofast -xhost -qopenmp -std=c99 -D_GNU_SOURCE   -o ./stream ./stream.c

Test if it is working:

$ likwid-pin -c S0:0 ./stream

There is a script to get the absolute best memory bandwidth on the socket level:

$ ./bench.pl ./stream  <N from>-<N to>  <reps>

Questions to ask:

  • Why does the bandwidth depend on the operation?
  • What is the best single core bandwidth?
  • What is the best saturated socket/node bandwidth?
  • Where is the saturation point?

Last modified: Monday, 20 January 2020, 4:09 PM