## Assignment 5: Peak performance, energy to solution

1. Peak performance.
(a) Calculate the arithmetic single-precision floating-point peak performance of an Intel Xeon Phi "Knights Landing" chip with
• 64 cores
• a clock speed of 1.3 GHz
• the AVX-512 instruction set (512-bit registers, 2 full-width FMA instructions per cycle).

(b) Calculate the arithmetic SP peak performance of an Nvidia "Volta" chip with
• 80 SM units (streaming multiprocessors), each with 64 SP FMA units
• a clock speed of 1.4 GHz
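
As a starting point for both parts, peak performance can be written as a product of the hardware parameters. This is a sketch under stated assumptions: one FMA counts as two floating-point operations (one multiply, one add), and a 512-bit register holds 512/32 = 16 single-precision values. The symbols below are introduced here for illustration and are not part of the original task.

$$
P_\mathrm{peak} = n_\mathrm{cores} \times f \times n_\mathrm{FMA} \times n_\mathrm{lanes} \times 2\,\frac{\mathrm{flops}}{\mathrm{FMA}}
$$

Here $n_\mathrm{cores}$ is the number of cores (or SMs), $f$ the clock frequency, $n_\mathrm{FMA}$ the number of FMA instructions issued per cycle per core (or FMA units per SM), and $n_\mathrm{lanes}$ the number of SP values per SIMD register (1 for the scalar FMA units of the GPU).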

2. Optimal energy to solution. We want to model the power consumption of an 8-core processor under a certain workload, and make the following assumptions:

• The chip dissipates a "baseline power" W0 when all cores are idle, independent of the clock frequency.
• If a core is executing instructions, it dissipates the additional ("dynamic") power Wd, which adds to the chip's baseline power.
• The performance of a certain code running on a single core of this processor is P0.
• The maximum performance of the parallel version of this code is Pmax > P0. Hence, when solving a given problem in parallel with n cores, the overall performance is P(n) = min(n·P0, Pmax). This models a code that is limited by a bottleneck when it runs on multiple cores (i.e., performance "saturates" at some point).
We want to solve a given problem with n=1..8 cores, and calculate the energy E(n) it takes to solve it. This is our "energy to solution" metric. You can assume that the amount of work to be done is normalized to 1, i.e., the time to solution on n cores is T(n)=1/P(n). For the following tasks you may find it helpful to use a spreadsheet program:

(a) Calculate E(n) for W0 = 20 W, Wd = 12 W, P0 = 1 s⁻¹, Pmax = 5 s⁻¹. For which nmin is E(n=nmin) minimal? Can it make sense to use more than nmin cores?

(b) Now assume we apply a code optimization (such as SIMD vectorization) that improves the single-core performance to P0 = 2.5 s⁻¹, but Pmax is unaffected. How does that change nmin and E(nmin)? What is the general conclusion from this result?

(c) Keeping P0 as it is, we now perform an optimization that improves the saturated performance Pmax to 7.5 s⁻¹. How does that change nmin and E(nmin)? What is the general conclusion from this result?

(d) What qualitative impact do you expect from changing the clock speed (higher or lower)? Remember that the dynamic power dissipation of each core scales with the third power of the frequency. Assume that Pmax is unaffected by the clock speed.
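
Instead of a spreadsheet, the model in task 2 can also be tabulated with a few lines of C. The sketch below follows directly from the assumptions above: with n active cores the chip draws W0 + n·Wd, and the time to solution is T(n) = 1/P(n), so E(n) = (W0 + n·Wd) / min(n·P0, Pmax). The function names `performance`, `energy_to_solution`, and `find_nmin` are hypothetical helpers chosen for this example, and the clock-frequency dependence from task (d) is not modeled.

```c
#include <assert.h>

/* Performance on n cores: scales linearly with n until the
 * bottleneck caps it at pmax, i.e. P(n) = min(n*p0, pmax). */
double performance(int n, double p0, double pmax)
{
    double linear = n * p0;
    return linear < pmax ? linear : pmax;
}

/* Energy to solution for one unit of work:
 * E(n) = (w0 + n*wd) * T(n) with T(n) = 1/P(n). */
double energy_to_solution(int n, double w0, double wd,
                          double p0, double pmax)
{
    assert(n >= 1 && p0 > 0.0 && pmax > 0.0);
    return (w0 + n * wd) / performance(n, p0, pmax);
}

/* Smallest core count in 1..max_cores for which E(n) is minimal. */
int find_nmin(int max_cores, double w0, double wd,
              double p0, double pmax)
{
    int nmin = 1;
    for (int n = 2; n <= max_cores; ++n) {
        if (energy_to_solution(n, w0, wd, p0, pmax) <
            energy_to_solution(nmin, w0, wd, p0, pmax)) {
            nmin = n;
        }
    }
    return nmin;
}
```

Calling `energy_to_solution` in a loop over n = 1..8 and printing the result gives the same table a spreadsheet would; changing `p0` or `pmax` then covers tasks (b) and (c).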

3. Parallel vector triad. Parallelize the standard vector triad benchmark from earlier assignments by using OpenMP. To do this, compile and link with the -qopenmp switch and use the "fused" parallel for/do directives in the following way:

Fortran:

    !$OMP PARALLEL DO
    do i=1,N
      a(i) = b(i) + c(i) * d(i)
    enddo
    !$OMP END PARALLEL DO

C:

    #pragma omp parallel for
    for(i=0; i
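
The C variant in the listing above is cut off after `for(i=0; i`. A minimal sketch of what the parallel triad loop could look like in C follows, reconstructed by analogy with the Fortran version. The function name `triad`, its signature, and the loop bound `i < n` are assumptions made for this example; the timing and repetition code from the earlier assignments is not shown.

```c
#include <assert.h>

/* Vector triad a[i] = b[i] + c[i] * d[i], parallelized with the fused
 * "parallel for" directive. If the code is compiled without an OpenMP
 * switch, the pragma is ignored and the loop runs serially. */
void triad(double *a, const double *b, const double *c,
           const double *d, long n)
{
    assert(n >= 0);
#pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        a[i] = b[i] + c[i] * d[i];
    }
}
```

Declaring the loop variable inside the `for` statement makes it private to each thread automatically, so no extra data-sharing clauses are needed in this simple case.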

To choose the number of threads, set the OMP_NUM_THREADS environment variable to the desired value before starting your executable, e.g.:

    $ export OMP_NUM_THREADS=6
    $ ./vectortriad.exe

Perform benchmark runs with the parallel triad on one Emmy socket. Draw performance graphs for N = 10¹...10⁷ (use log scaling on the x axis) for 1...10 threads (draw them all in the same diagram). Use likwid-pin to control the placement of threads. Example:

    $ module load likwid
    $ likwid-pin -c S0:0-9 ./vectortriad.exe

This runs the benchmark with 10 threads on the first socket (likwid-pin sets the OMP_NUM_THREADS variable automatically according to the pin mask if it is not set already). Comment on the scalability of the benchmark for different sizes (L1 cache, L2 cache, memory) and compare with the purely serial code (compiled without -qopenmp).
For which working set sizes do you get a "good" speedup with multiple threads?