## Exercise 1: Microarchitectural exploration

Get the source files micro.c and stream.c from ~k89u0000/MICRO and ~k89u0000/STREAM. Have a look at the source code (always do this if you use a benchmark you did not write yourself).

Get an interactive job on the Emmy cluster with:

$ qsub -I -l nodes=1:ppn=40:f2.2:likwid -l walltime=08:00:00

Set up the environment:

$ module load intel64/19.0up02
$ module load likwid/4.3.4

#### Exploring the memory hierarchy

Build the executable with:

$ icc -Ofast -xhost -qopenmp -std=c99 -D_GNU_SOURCE -o ./micro ./micro.c

Test if it is working:

$ ./micro
$ likwid-pin -c S0:2 ./micro 0 5000

There is a helper script ./bench.pl that allows you to scan the data set size. Use it as follows:

$ ./bench.pl <numcores> <seq|tp|ws>

You can generate a PNG plot of the result with gnuplot:

$ gnuplot bench.plot

Note that gnuplot expects the benchmark output in bench.dat!

Run the following tests:

1. Explore the bandwidth across the memory hierarchy using ./bench.pl 1 seq > bench.dat (this should take about one minute).
2. Explore the node topology using likwid-topology -g and compare its reported cache sizes with the results from the previous step.
3. Explore parallel bandwidth across memory hierarchy using ./bench.pl <N> tp > bench-<N>.dat. Use 1,2,4,8,10,20 for N. There is a special bench-tp.plot file for gnuplot. What does this benchmark test?
4. Now do the same with work sharing: ./bench.pl <N> ws > bench-<N>.dat. You can use the bench-ws.plot gnuplot config, this one also needs bench-seq.dat. What does this benchmark test?

What do you learn from the results?

Are these the theoretical limits? How can we know?

What is this useful for?

#### Main Memory Bandwidth

Build the executable with:

$ icc -Ofast -xhost -qopenmp -std=c99 -D_GNU_SOURCE -o ./stream ./stream.c

Test if it is working:

$ likwid-pin -c S0:0 ./stream

There is a script to get the absolute best memory bandwidth on the socket level:

$ ./bench.pl ./stream <N from>-<N to> <reps>