## Assignment 2: pipelines and loops

*Pipelines and SIMD units*. One way to increase the peak performance of a microprocessor is to use wider SIMD units and registers, just as Intel did when going from SSE (16-byte) to AVX (32-byte) registers. 64-byte SIMD (AVX-512) registers are now available in the "Xeon Skylake-X" architecture. Consider the following loop (a scalar product) on a modern processor core:

```c
double s = 0.0;
double a[N], b[N];
// ... initialization omitted
for(i=0; i<N; ++i) s = s + a[i] * b[i];
```

Assume that the core can execute at most one load instruction per cycle, and that the add (multiply) instruction has a latency of three (five) cycles and a maximum throughput of one per cycle. The load instruction has a latency of four cycles (if the data is in L1 cache) and a maximum throughput of one per cycle.

(a) Assuming that the compiler applies no unrolling and no SIMD vectorization, what is the maximum possible performance of the loop in Flops/cycle if the data comes from the L1 cache?

(b) What performance do you expect at N=8? How large must N be to get half the expected value from (a)?

(c) Now assume that the compiler can SIMD-vectorize the loop with AVX, but does no unrolling beyond that. Also assume that adding up the four slots of an AVX register takes eight additional cycles. How does that change the predictions from (a) and (b)? How would the numbers change for 8-wide SIMD units (64-byte registers)?

(d) Now assume that the compiler performs a 2-way modulo unrolling of the loop, on top of AVX. What is now the maximum expected performance with AVX if the data comes from the L1 cache? Would 4-way unrolling change anything?

*Strided access*. Use your vector triad code from assignment 1 and modify it so that only every Mth element is used:

```c
for(i=0; i<N; i+=M) {
  a[i] = b[i] + c[i] * d[i];
}
```

This is called a "strided triad" with stride M. Plot the performance of this loop versus M∈{1,2,4,8} for N=2×10⁶ and explain the change in performance with growing M. What happens if you increase M even further, i.e., to 16, 32, 64, 128, 256, ...?

*Optimizing loop code*. What is wrong with the following code (in terms of performance)? Can you suggest simple optimizations that will improve the performance?

```c
double mat[8192][8192], s[8192][8192];
// ...
srand(1); // set the random seed
for(i=0; i<N; ++i) {
  val = (double)rand(); // random number
  for(j=0; j<N; ++j) {
    mat[j][i] = s[j][i] / val;
  }
}
```