Exercise: The STREAM benchmarks
The STREAM benchmarks are the de facto standard for measuring the sustainable main memory bandwidth of a system. You can find the Fortran source code in the folder STREAM. The program reports performance in MByte/s for the four kernels COPY (A=C), SCALE (A=s*C), ADD (A=B+C), and TRIAD (A=B+s*C).
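To make the four kernels and the reported number concrete, here is a minimal C sketch. It is not the STREAM source itself. The function names and the enum are hypothetical and chosen for illustration only. The byte counts follow STREAM's convention of charging one 8-byte word per array read or written in each loop iteration, which is what the program's MByte/s output is based on.

```c
#include <stddef.h>

/* Hypothetical kernel identifiers, used only in this sketch. */
enum stream_kernel { STREAM_COPY, STREAM_SCALE, STREAM_ADD, STREAM_TRIAD };

/* COPY: A = C */
void stream_copy(double *a, const double *c, long n)
{
    #pragma omp parallel for   /* ignored if OpenMP is not enabled */
    for (long i = 0; i < n; i++)
        a[i] = c[i];
}

/* SCALE: A = s*C */
void stream_scale(double *a, const double *c, double s, long n)
{
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        a[i] = s * c[i];
}

/* ADD: A = B + C */
void stream_add(double *a, const double *b, const double *c, long n)
{
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* TRIAD: A = B + s*C */
void stream_triad(double *a, const double *b, const double *c,
                  double s, long n)
{
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}

/* Bytes per iteration as counted by the benchmark: COPY and SCALE touch
   two arrays, ADD and TRIAD touch three. Only the reads and writes that
   appear in the loop body are counted. */
double stream_bytes_per_iter(enum stream_kernel k)
{
    int arrays = (k == STREAM_COPY || k == STREAM_SCALE) ? 2 : 3;
    return (double)arrays * sizeof(double);
}

/* Reported bandwidth in MByte/s (1 MByte = 10^6 bytes) for n loop
   iterations that took 'seconds' of wall-clock time. */
double stream_mbytes_per_s(enum stream_kernel k, long n, double seconds)
{
    return stream_bytes_per_iter(k) * (double)n / seconds / 1.0e6;
}
```

For example, a TRIAD run over 10^6 elements that takes one second would be reported as 24 MByte/s under this counting scheme. Whether that number equals the traffic actually crossing the memory bus is the subject of the questions below.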
- Compile and link the plain code with
icc -DUNDERSCORE -c mysecond.c
ifort -O3 -xHost -qopenmp -nolib-inline -opt-streaming-stores never stream.f mysecond.o
There is also a C variant, which you can compile directly using
icc -O3 -xHost -qopenmp -nolib-inline -opt-streaming-stores never stream.c
- Run the STREAM benchmarks on 1 to 10 cores (one socket) and on 20 cores (two sockets) of one Emmy node. How many cores do you need to saturate the memory interface of a socket? Does the clock speed setting influence the scaling behavior? Does SMT help? Does the performance scale from one to two sockets? If it doesn't, fix it!
- What is the real data bandwidth over the memory bus (as opposed to the value reported by the benchmark) for the four kernels? What happens if you use "always" instead of "never" as the argument to the -opt-streaming-stores option? What could those "streaming stores" be then?
- Modify the SCALE benchmark loop so that the output of the program for SCALE reflects the actual traffic over the memory bus, independent of the streaming stores setting.
- What performance do you get with 20 threads (2 sockets) if you run the benchmark with interleaved pages?
$ env OMP_NUM_THREADS=20 numactl --interleave=0,1 likwid-pin -c N:0-19 ./a.out
- Use the numactl command to generate a NUMA map of one Emmy node. What is the bandwidth penalty for non-local memory access?
You will need the STREAM results for generating lightspeed estimates for the other exercises.
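As a rough illustration of how a measured bandwidth feeds into such an estimate, the sketch below assumes that "lightspeed" is meant in the usual bandwidth-bound sense: the upper performance limit of a loop is the smaller of the peak performance and the memory bandwidth divided by the loop's code balance (bytes transferred per flop). The function name and all parameter names are hypothetical, and any numbers passed in are placeholders rather than measurements from Emmy.

```c
/* Upper performance limit ("lightspeed") of a loop in GFlop/s:
       min(P_peak, b_S / B_c)
   where b_S is the saturated memory bandwidth in GByte/s (for example
   taken from a STREAM run) and B_c = bytes_per_iter / flops_per_iter is
   the code balance in byte/flop. Hypothetical helper for illustration. */
double lightspeed_gflops(double peak_gflops, double bw_gbytes_per_s,
                         double bytes_per_iter, double flops_per_iter)
{
    double code_balance = bytes_per_iter / flops_per_iter; /* byte/flop */
    double bw_limit = bw_gbytes_per_s / code_balance;      /* GFlop/s   */
    return bw_limit < peak_gflops ? bw_limit : peak_gflops;
}
```

A call such as lightspeed_gflops(peak, b_S, bytes, flops) needs the byte count that really crosses the memory bus per iteration, so the estimate is only as good as the traffic analysis behind it.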