## Assignment 3: In-core performance, cache hierarchy

1. Dense MVM. The following loop nest multiplies a matrix a[][] with a vector b[] and stores the result in a vector c[]:

for(i=0; i<N; ++i)   for(j=0; j<N; ++j)    c[i] += a[i][j] * b[j];

You can assume that N is "large" (the matrix does not fit into any cache), and that all data in c, a, and b are double-precision floating-point numbers. This serial code is run on a quad-core processor with cache sizes of 32 KiB (per-core L1), 256 KiB (per-core L2), and 8 MiB (shared L3), a peak performance of 4 GFlop/s (ADD+MULT) per core, and a memory bandwidth of 16 GB/s. Apart from one multiply and one add instruction, a core can execute one load or one store instruction per cycle (throughput). There is no SIMD capability. The ADD instruction has a latency of four cycles, and the MULT instruction has a latency of six cycles.

(a) (5 crd) Identify spatial and temporal access locality in this code. Which data structures lead to which locality of access? How does this depend on N?

(b) (10 crd) Calculate the code balance with respect to the matrix size for all caches and for main memory. You can assume that the matrix does not fit into any cache (i.e., $$8\times N^2$$ is much larger than any cache).

2. An intricate stencil code. (20 crd) Predict the maximum in-L1 performance $$P_\mathrm{max}$$ of the following loop nest on an Ivy Bridge and a Haswell core in flops/cycle (assuming double-precision arithmetic):

for(int k=4; k < N-4; k++) {
for(int j=4; j < N-4; j++) {
for(int i=4; i < N-4; i++) {
U[k][j][i] = 2*V[k][j][i] - U[k][j][i] + C[k][j][i] * ( c0 *  V[k][j][i]
+c1 * (V[ k ][ j ][i+1]+V[ k ][ j ][i-1] +V[ k ][j+1][ i ]+V[ k ][j-1][ i ]
+V[k+1][ j ][ i ]+V[k-1][ j ][ i ]) +c2 * (V[ k ][ j ][i+2]+V[ k ][ j ][i-2]
+V[ k ][j+2][ i ]+V[ k ][j-2][ i ] +V[k+2][ j ][ i ]+V[k-2][ j ][ i ])
+c3 * (V[ k ][ j ][i+3]+V[ k ][ j ][i-3] +V[ k ][j+3][ i ]+V[ k ][j-3][ i ]
+V[k+3][ j ][ i ]+V[k-3][ j ][ i ]) +c4 * (V[ k ][ j ][i+4]+V[ k ][ j ][i-4]
+V[ k ][j+4][ i ]+V[ k ][j-4][ i ] +V[k+4][ j ][ i ]+V[k-4][ j ][ i ]));
}
}
}

Consider AVX(2) and scalar execution on both architectures and identify the relevant execution bottleneck. For HSW you can assume that any ADD or MULT is actually carried out as an FMA instruction.

3. Balance. Calculate the in-memory code balance for loops with the following loop bodies, assuming that all arrays do not fit into any cache (a loop over i is always implied):

(a) (5 crd) s = s + a[i];     (double precision)
(b) (5 crd) s = s + a[i]*b[i]; (double precision)
(c) (5 crd) a[i] = b[i] + s*c[i]; (single precision)
(d) (5 crd) a[i] = a[i] + s*c[i]; (single precision)
(e) (15 crd) y[i] = y[i] + x[i] * v[index[i]];  (DP for x[],y[], and v[], and int32 for index[])

In case (e) you have to be a little imaginative; obviously it is impossible to give a single, definite answer...

4. Optimizing for code balance. Suppose you have a code like this:

for(i=0; i<N; ++i) {
a[i] = b[i] * c[i];
}
for(i=0; i<N-1; ++i) {
b[i] = a[i+1] * 0.5;
}


All arrays hold DP floating-point numbers, and N is "large".

(a) (10 crd) Suggest an optimization that will improve the performance of the code. What is the expected change in the memory code balance? Discuss if the optimized code is SIMD vectorizable.
(b) (20 crd) Construct a single-core performance model (along the lines of what we did in the lecture with the vector triad) for the optimized code on an Intel Haswell core. Calculate the expected performance in Gflop/s and the expected memory bandwidth in Gbyte/s for an in-memory data set. You can assume that the compiler generates separate load instructions for loads to successive elements, e.g., a[i] and a[i+1].