Exercise 1: Microarchitectural exploration

Preparation

Get the source files from ~zzfbjanet/MICRO (micro.c) and ~zzfbjanet/STREAM (stream.c). Read the source code first; always do this when you use a benchmark you did not write yourself.

Get an interactive job on the Emmy cluster with:

$ srun -p medium -N1 -c80 --ntasks-per-core=2 -t 01:00:00 --reservation=zzfbjanet_88 --pty bash

Set up the environment:

$ module load intel/19.0.5
$ export PATH=/opt/likwid/4.3.4/bin:$PATH
$ likwid-setFrequencies -f 2.4 -t 0

Exploring the memory hierarchy

Build the executable with:

$ icc -Ofast -xhost -qopenmp -std=c99 -D_GNU_SOURCE -o ./micro ./micro.c

Test if it is working:

$ ./micro
$ likwid-pin -c S0:2 ./micro 0 5000

There is a helper script ./bench.pl that scans the data set size. Use it as follows:

$ ./bench.pl <numcores> <seq|tp|ws>

Generate a PNG plot of the result with gnuplot:

$ gnuplot bench.plot 

The plot script expects the benchmark output in bench.dat!

Run the following tests:

  1. Explore bandwidth across the memory hierarchy using ./bench.pl 1 seq > bench.dat (this should take about 1 minute).
  2. Explore the node topology using likwid-topology -g and compare the reported cache sizes with the results from step 1.
  3. Explore parallel bandwidth across the memory hierarchy using ./bench.pl <N> tp > bench-<N>.dat. Use 1, 2, 4, 8, 10, and 20 for N. There is a special bench-tp.plot file for gnuplot. What does this benchmark test?
  4. Now do the same with work sharing: ./bench.pl <N> ws > bench-<N>.dat. This reuses the file names from step 3, so plot the tp results before you overwrite them. You can use the bench-ws.plot gnuplot config, which also needs bench-seq.dat. What does this benchmark test?

What do you learn from the results?

Are these the theoretical limits? How can we know?

What is this useful for?

Main Memory Bandwidth

Build the executable with:

$ icc -Ofast -xhost -qopenmp -std=c99 -D_GNU_SOURCE -o ./stream ./stream.c

Test if it is working:

$ likwid-pin -c S0:0 ./stream

There is a script to determine the best achievable memory bandwidth at the socket level:

$ ./bench.pl ./stream  <N from>-<N to>  <reps>

Questions to ask:

  • Why does the bandwidth depend on the operation?
  • What is the best single core bandwidth?
  • What is the best saturated socket/node bandwidth?
  • Where is the saturation point?



Last modified: Tuesday, 19 November 2019, 1:39 PM