## Assignment 3: In-core performance, caches, data transfers

*Dense MVM*. The following loop nest multiplies a matrix a[][] with a vector b[] and stores the result in a vector c[]:for(i=0; i<N; ++i)

for(j=0; j<N; ++j)

c[i] += a[i][j] * b[j];

You can assume that N is "large" (the matrix does not fit into any cache), and that all data in c, a, and b are double-precision floating-point numbers. This serial code is run on a quad-core processor with cache sizes of 32 KiB (per-core L1), 256 KiB (per-core L2), and 8 MiB (shared L3), a peak performance of 4 GFlop/s (ADD+MULT) per core, and a memory bandwidth of 16 GB/s. Apart from one multiply and one add instruction, a core can execute one load or one store instruction per cycle (throughput). There is no SIMD capability. The ADD instruction has a latency of four cycles, and the MULT instruction has a latency of six cycles.

(a) If all data came from the L1 cache (i.e., if N were very small), what is the highest possible performance you could expect from this code? What is the worst performance you may expect if the compiler produces "naive" code, i.e, without any unrolling?

(b) How would the analysis change if the processor supported 4-way SIMD with double precision operands?

(c) Identify spatial and temporal access locality in this code. Which data structures lead to which locality of access? Does this depend on N?-
*An intricate stencil code*. Predict the maximum in-L1 performance of the following loop nest on an Ivy Bridge and a Haswell core in Flops/cycle (assuming double-precision arithmetic):for(int k=4; k < N-4; k++) { for(int j=4; j < N-4; j++) { for(int i=4; i < N-4; i++) { U[k][j][i] = 2*V[k][j][i] - U[k][j][i] + C[k][j][i] * ( c0 * V[k][j][i] +c1 * (V[ k ][ j ][i+1]+V[ k ][ j ][i-1] +V[ k ][j+1][ i ]+V[ k ][j-1][ i ] +V[k+1][ j ][ i ]+V[k-1][ j ][ i ]) +c2 * (V[ k ][ j ][i+2]+V[ k ][ j ][i-2] +V[ k ][j+2][ i ]+V[ k ][j-2][ i ] +V[k+2][ j ][ i ]+V[k-2][ j ][ i ]) +c3 * (V[ k ][ j ][i+3]+V[ k ][ j ][i-3] +V[ k ][j+3][ i ]+V[ k ][j-3][ i ] +V[k+3][ j ][ i ]+V[k-3][ j ][ i ]) +c4 * (V[ k ][ j ][i+4]+V[ k ][ j ][i-4] +V[ k ][j+4][ i ]+V[ k ][j-4][ i ] +V[k+4][ j ][ i ]+V[k-4][ j ][ i ])); } } }

Consider AVX(2), SSE2, and scalar execution on both architectures and identify the relevant execution bottleneck. *Loop nests*. What is "wrong" with the following code? How would you fix it?

#define N 16384 //... s = 0.0; for (int i=0; i<N; i++) { for(int j=0; j<N; j++) { s += a[j][i]; } }