## Exercise: The STREAM benchmarks

The STREAM benchmarks are the standard for measuring the main memory bandwidth capabilities of systems. You can find the Fortran and C source codes in the folder STREAM. The program reports performance in MByte/s for the four variants COPY (A=C), SCALE (A=s*C), ADD (A=B+C), and TRIAD (A=B+s*C).

1. Compile and link the Fortran code with

       $ icc -DUNDERSCORE -c mysecond.c
       $ ifort -O3 -xHost -qopenmp -nolib-inline -qopt-streaming-stores never stream.f mysecond.o

   There is also a C variant, which you can compile directly using

       $ icc -O3 -xHost -qopenmp -nolib-inline -qopt-streaming-stores never stream.c

2. Run the STREAM benchmarks on one to all physical cores of one socket and on all physical cores of two sockets. How many cores do you need to saturate the memory interface of a socket? Does the clock speed setting influence the scaling behavior (if you can influence the clock speed)? Does SMT help? Does the performance scale from one to two sockets? If it doesn't, fix it!

3. What is the real data bandwidth over the memory bus (as opposed to the value reported by the benchmark) for the four kernels? What happens if you use "always" instead of "never" as the argument to the -qopt-streaming-stores option? What could those "streaming stores" be then?

4. Modify the SCALE benchmark loop so that the output of the program for SCALE reflects the actual traffic over the memory bus, independent of the streaming stores setting.

5. After the talk on performance counters: Endow the code with LIKWID API markers so that you can directly measure the data traffic caused by the SCALE loop through the memory hierarchy. Does the measurement meet your expectation?

6. What performance do you get with 20 threads (2 sockets) if you run the benchmark with interleaved pages?

       $ env OMP_NUM_THREADS=20 numactl --interleave=0,1 likwid-pin -c N:0-19 ./a.out

7. Use the numactl command to generate a NUMA map of one compute node. What is the bandwidth penalty for non-local memory access?

You will need the STREAM results for generating light speed estimates for the other exercises.