## Assignment 2: In-core models, cache hierarchy

1. π reloaded. (10 credits) In Assignment 0 you studied a program that calculated the value of π via numerical integration (midpoint rule). in double precision, one iteration took almost exactly 7 cycles, and this time is entirely determined by the divide instructions (everything else happens so fast that it can be hidden behind them). Re-run your code but compile it with the additional Intel compiler option -no-vec. This disables SIMD vectorization, i.e., only scalar instructions are used. Knowing that the Ivy Bridge CPU in Emmy supports AVX and that the compiler actually uses (if allowed) an instruction that does four double-precision divides, what is your conclusion from the observed scalar performance? What is the inverse throughput (in cycles per instruction) of the AVX and of the scalar double-precision divide instructions on this machine?
2. Vector triad reloaded. (20 credits) Calculate the maximum achievable in-core performance $$P_\max$$ of the AVX-vectorized vector triad (see Assignment 1a) on an Ivy Bridge core at 2.2 GHz! What fraction of this limit could you achieve with your own code? Also calculate $$P_\max$$ for the SSE and scalar variants. Did the Intel compiler actually use AVX? How can you tell?
3. Pipelines and SIMD units. (50 credits) One way to increase the peak performance of a microprocessor is to use wider SIMD units and registers, just as Intel has done it when going from SSE (16 byte) to AVX (32 byte). 64-byte SIMD or AVX512 registers are now available in the "Xeon Skylake-X" architecture. Consider the following loop (a scalar product) on a modern processor core:

double s=0.0;double a[N],b[N];// ... initialization omittedfor(i=0; i<N; ++i) s = s + a[i] * b[i];

Assume that the core can execute at most one load instruction per cycle, and that the add (multiply) instruction has a latency of three (five) cycles and a maximum throughput of one per cycle. The load instruction has a latency of four cycles (if the data is in L1 cache) and a maximum throughput of one per cycle.

(a) Assuming that the compiler applies no unrolling and no SIMD vectorization, calculate the maximum possible performance of the loop in Flops/cycle if the data comes from the L1 cache.
(b) Calculate the performance to expect at N=8.
(c) Now assume that the compiler can SIMD-vectorize the loop with AVX, but does no unrolling beyond that. Also assume that adding up the four slots of an AVX register takes eight additional cycles. Calculate how this changes the  predictions from (a) and (b). How would the numbers change for 8-wide SIMD units (64-byte registers)?
(d) Now assume that the compiler performs a 2-way modulo unrolling of the loop, on top of AVX. Calculate the maximum expected performance with AVX if the data comes from the L1 cache. Would 4-way unrolling change anything? Why (not)?

4. Strided access. (20 credits) Use your vector triad code from Assignment 1a and modify it so that only each Mth element is used:

for(i=0; i<N; i+=M) {  a[i] = b[i] + c[i] * d[i];}
This is called a "strided triad" with stride M. Plot the performance of this loop versus M∈{1,2,4,8} for N=2x106 and explain the change in performance with growing M. What happens if you increase M even further, i.e., to 16,32,64,128,256,512,1024,...? Explain why this behavior is expected.