## Assignment 8: Performance models

1. Resource-driven modeling. There are two fundamental bottlenecks that govern the runtime of a loop on a CPU chip: memory data transfer and instruction execution. Based on this insight, we can construct a simple model for the runtime of a loop.

Assume a loop of length N which, per iteration, transfers V bytes of data to/from memory and executes W instructions. The CPU has a memory bandwidth of b_S (in bytes/s) and a peak execution capability of p_E (in instructions per cycle). The clock frequency is f. Assume that memory transfer and instruction execution are the only relevant resources.

(a) What is the expected execution time of the loop in cycles per iteration, assuming that execution and data transfer cannot overlap?
(b) How does your answer to (a) change if you assume full overlap between execution and data transfer?
(c) Construct a model for execution performance (instructions per time unit) from cases (a) and (b).

2. Barrier pain. Write a benchmark that measures the time it takes to execute an OpenMP barrier.

(a) Plot the barrier time in cycles on 1..20 cores on one Emmy node using the default Intel compiler and the default GNU compiler.
(b) On an Intel Xeon Phi coprocessor (1.05 GHz, 60 cores, 8-way SIMD at double precision, 16 Flops per core per cycle), the barrier takes 18000 cycles (Intel compiler). How can we usefully compare this number with the one from (a)? Is 18000 cycles "slow" or "fast" compared to the Emmy barrier? Hint: Think about how much "work" (i.e., Flops) can be done per cycle on either system, and thus how "precious" a cycle is. If the same problem is executed on the two systems and you assume that the code (excluding the barriers) can reach peak performance, you can calculate the "pain" incurred by the barrier, i.e., how much performance loss it causes for a fixed amount of work.
(c) What is the expected performance of the Schönauer vector triad at a loop length of N=500 with 10 threads (cores) on a socket, assuming that the executed loop code is "perfect" and the barrier is the only overhead?

3. Balance and performance prediction. Calculate the code balance for the following loop kernels, assuming that all arrays have to be loaded from memory and ignoring the latency problem ("perfect streaming"). A long inner loop over the counter variable i is always implied:

(a) `a[i] = b[i] + c[i]` (STREAM add)
(b) `s = s + a[i] / b[i]` ("scalar ratio" - a divide is one flop!)
(c) `y[i] = y[i] + v[j][i] * b[k[i]]` (JDS sparse matrix-vector product)

All arrays are double precision except k[], which stores 4-byte integers. s is a double precision scalar. Using the Roofline Model, calculate the expected application performance in MFlop/s for those kernels on one 2.2 GHz Ivy Bridge (Emmy) socket (measure the memory bandwidth using a suitable benchmark and assume that 4-way SIMD vectorization (AVX) is possible). The Ivy Bridge has a cache line size of 64 bytes. You may assume that N is large so that the arrays do not fit into any cache, and that non-temporal stores can be used where applicable. For (c), give numbers for best and worst case scenarios.

Which of the above kernels would benefit from (i) an increase in clock speed or (ii) an improved memory bandwidth of the processor? What bandwidth would be required to run kernel (a) at full peak performance?