There are two fundamental bottlenecks that govern the runtime of a loop on a CPU: memory data transfer and instruction execution. Based on this insight, we can construct a simple model for the runtime of a loop.
Assume a loop of length N which, per iteration, transfers V bytes of data to/from memory and executes W instructions. The CPU has a memory bandwidth of b_S (in bytes/s) and a peak execution capability of p_E (in instructions per cycle). The clock frequency is f. Assume that memory transfer and instruction execution are the only relevant resources.
(a) What is the expected execution time of the loop in cycles per iteration, assuming that execution and data transfer cannot overlap? (b) How does your answer to (a) change if you assume that there is full overlap between execution and data transfer? (c) Construct a model for execution performance (instructions per time unit) from cases (a) and (b).
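As a sanity check on parts (a)-(c), the model can be evaluated numerically. The sketch below uses made-up example values for W, V, p_E, b_S, and f (they are illustrations, not measurements from any real machine):

```python
# Simple loop runtime model: cycles per iteration.
# All parameter values below are hypothetical examples.
W   = 4        # instructions per iteration
V   = 16       # bytes transferred per iteration
p_E = 4        # peak instruction throughput (instructions/cycle)
b_S = 40e9     # memory bandwidth (bytes/s)
f   = 2.2e9    # clock frequency (Hz)

c_exec = W / p_E          # cycles spent on instruction execution
c_mem  = V * f / b_S      # cycles spent on data transfer

# (a) no overlap: the two contributions add up
c_no_overlap = c_exec + c_mem
# (b) full overlap: the slower resource alone determines the runtime
c_overlap = max(c_exec, c_mem)

# (c) performance in instructions per second: P = W * f / c
P_no_overlap = W * f / c_no_overlap
P_overlap    = W * f / c_overlap

print(c_no_overlap, c_overlap)  # 1.88 vs. 1.0 cycles per iteration
```

With full overlap the loop is limited by whichever resource is slower, so the overlap case is always at least as fast as the non-overlap case.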
Barrier pain. Write a benchmark that measures the time it takes to execute an OpenMP barrier.
(a) Plot the barrier time in cycles on 1..20 cores on one Emmy node using the default Intel compiler and the default GNU compiler. (b) On an Intel Xeon Phi coprocessor
(1.05 GHz, 60 cores, 8-way SIMD at double precision, 16 Flops per core
per cycle) the barrier takes 18000 cycles (Intel compiler). How can we usefully compare
this number with the one from (a)? Is 18000 cycles "slow" or "fast" compared to the Emmy barrier? Hint: Think about
how much "work" (i.e., Flops) can be done per cycle on either system,
and how "precious" a cycle thus is. Hence, if the same problem is
executed on the two systems and you assume that the code (excluding the
barriers) can reach peak performance, you can calculate the "pain"
incurred by the barrier, i.e., how much performance loss it causes for a
fixed amount of work. (c) What is the expected performance of the Schönauer vector triad at a loop length of N=500 with 10 threads (cores) on a socket, assuming that the executed loop code is "perfect" and the barrier is the only overhead?
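Following the hint, a barrier can be compared across machines by the amount of floating-point work its cycles could have done. The sketch below uses the Xeon Phi numbers from the exercise; the Emmy barrier time is a placeholder to be replaced by the measurement from (a), and the 8 flops/cycle/core figure assumes 4-way AVX with concurrent add and multiply on Ivy Bridge:

```python
# "Pain" of a barrier = flops that could have been executed during it.

# Xeon Phi: numbers given in the exercise text.
phi_flops_per_cycle = 60 * 16        # 60 cores x 16 flops/core/cycle = 960
phi_barrier_cycles  = 18000
phi_flops_lost      = phi_barrier_cycles * phi_flops_per_cycle

# Emmy socket: 10 Ivy Bridge cores, assumed 8 flops/core/cycle
# (4-way AVX, one add + one mult per cycle).
# The barrier time is a PLACEHOLDER -- substitute your measurement from (a).
emmy_flops_per_cycle = 10 * 8        # 80
emmy_barrier_cycles  = 2000          # hypothetical
emmy_flops_lost      = emmy_barrier_cycles * emmy_flops_per_cycle

print(phi_flops_lost, emmy_flops_lost)  # 17280000 vs. 160000
```

On this metric, "slow" or "fast" depends on the lost work relative to the work done between barriers, not on the raw cycle count.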
Balance and performance prediction. Calculate the
code balance for the following loop kernels, assuming that all arrays
have to be loaded from memory and ignoring the latency problem ("perfect
streaming"). A long inner loop over the counter variable i is always assumed.
(a) a[i] = b[i] + c[i] (STREAM add)
(b) s = s + a[i] / b[i] ("scalar ratio" - a divide counts as one flop!)
(c) y[i] = y[i] + v[j][i] * b[k[i]] (JDS sparse matrix-vector product)
All arrays are double precision except k, which stores 4-byte
integers. s is a double precision scalar. Using the Roofline Model,
calculate the expected application performance in MFlop/s for those
kernels on one 2.2 GHz Ivy Bridge (Emmy) socket (measure the memory
bandwidth using a suitable benchmark and assume that 4-way SIMD
vectorization (AVX) is possible). The Ivy Bridge has a cache line size
of 64 bytes. You may assume that N is large so that the arrays do not
fit into any cache, and that non-temporal stores can be used where
applicable. For (c), give numbers for best and worst case scenarios.
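One possible way to organize the calculation is sketched below. The traffic counts per iteration follow the stated assumptions (non-temporal stores for (a), s kept in a register for (b), and for (c) a best case where b[k[i]] hits cached lines vs. a worst case where every access pulls a fresh 64-byte line). The bandwidth b_S is a placeholder for your own measurement, and the peak figure assumes 8 flops/cycle/core:

```python
# Code balance B_c = data traffic per flop [bytes/flop];
# roofline prediction: P = min(P_peak, b_S / B_c).

# (a) STREAM add: load b,c (16 B), non-temporal store a (8 B), 1 flop
B_a = (16 + 8) / 1                  # 24 B/flop
# (b) scalar ratio: load a,b (16 B); 2 flops (add + divide); s in a register
B_b = 16 / 2                        # 8 B/flop
# (c) JDS MVM: v (8 B) + k (4 B) + b[k[i]] + y load/store (16 B); 2 flops
B_c_best  = (8 + 4 + 8 + 16) / 2    # 18 B/flop: b[k[i]] uses cached lines
B_c_worst = (8 + 4 + 64 + 16) / 2   # 46 B/flop: each b[k[i]] loads a 64 B line

# Roofline for one Emmy socket. b_S is a PLACEHOLDER -- use your measured value.
b_S    = 40e9                       # hypothetical bytes/s
P_peak = 10 * 2.2e9 * 8             # 10 cores x 2.2 GHz x 8 flops/cycle

predictions = {}
for name, B in [("a", B_a), ("b", B_b), ("c_best", B_c_best), ("c_worst", B_c_worst)]:
    predictions[name] = min(P_peak, b_S / B)
    print(f"kernel {name}: B_c = {B:g} B/flop, "
          f"P = {predictions[name] / 1e6:.0f} MFlop/s")
```

With these placeholder numbers all four kernels land far below P_peak, i.e., firmly in the memory-bound region of the roofline.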
Which of the above kernels would benefit from (i) an increase in clock speed
or (ii) an improved memory bandwidth of the processor? What bandwidth
would be required to run kernel (a) at full peak performance?
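For the last sub-question, the required bandwidth follows from inverting the roofline: b = P * B_c. A subtlety worth hedging is what "peak" means here, since kernel (a) executes only adds and thus cannot use the multiply port; the sketch computes both readings:

```python
# Bandwidth required to run kernel (a) at peak performance.
B_a = 24.0                       # bytes/flop (with non-temporal stores)

# Nominal socket peak: 10 cores x 2.2 GHz x 8 flops/cycle (add + mult).
P_peak_full = 10 * 2.2e9 * 8     # 176 GFlop/s
# Kernel (a) has only adds, so at most the add port can be used:
P_peak_add  = 10 * 2.2e9 * 4     # 88 GFlop/s

b_full = P_peak_full * B_a       # bandwidth for the nominal peak
b_add  = P_peak_add * B_a        # bandwidth for the add-only "peak"
print(b_full / 1e9, b_add / 1e9) # 4224.0 vs. 2112.0 GB/s
```

Either way, the required bandwidth exceeds what any current memory interface delivers by roughly two orders of magnitude, which is the point of the exercise.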