## Assignment 1: Loop kernel benchmarking

1. Write a benchmark program that measures the performance in MFlop/s of the following four computation kernels:

(a) vector triad: a[i] = b[i] + c[i] * d[i]
(b) vector update: a[i] = s * a[i]
(c) vector update with dependency: a[i] = s * a[i-1]
(d) matrix-vector multiply: y = A • x, with an inner j loop of y[i] += A[i][j] * x[j]

a, b, c, d, x and y are double precision arrays of length N. s is a double precision scalar. A is an NxN matrix of plain double precision elements. Allocate memory for those data structures on the heap, i.e. using malloc() in C, new in C++ or allocate() in Fortran90. Do not forget to initialize all data elements with valid floating-point (FP) numbers. Using calloc() is not sufficient - see this blog entry for an explanation.

For the vector triad (a) and the vector updates (b) and (c), run your code using the following vector lengths: N = int(2.5^k), k = 3...20. For the MVM (d), use matrix sizes N = int(1.5^k), k = 10...22 (choose the vector length accordingly).

Perform the measurement on one core of the Emmy cluster for the given loop lengths. For reasons of accuracy, make sure that the runtime of each kernel at each vector/matrix size exceeds 0.1 seconds by repeating the computation kernel in an outer loop. A good approach is to adjust the number of repetitions dynamically, depending on the measured runtime of the kernel:

```c
int repeat = 1;
double runtime = 0.;
for(; runtime < .1; repeat *= 2) {
  timing(&wcs, &ct);
  for(r = 0; r < repeat; ++r) {
    /* PUT THE KERNEL BENCHMARK LOOP HERE */
    if(CONDITION_NEVER_TRUE) dummy(a); // for the compiler
  }
  timing(&wce, &ct);
  runtime = wce - wcs;
}
repeat /= 2;
```

Make sure that the operations in the kernel actually get executed - compilers are smart! (This is what the bogus if statement is for. The dummy() function must reside in a separate source file, and the compiler must not be able to determine the result of the condition at compile time. Example: if(a[N>>1]<0.) - if all arrays are initialized with positive numbers, this condition is never true.)

Use your favorite graphics program (e.g. gnuplot or xmgrace or Excel if you must) to generate plots of the performance in MFlop/s vs. N. Choose a logarithmic scale on the x axis. Check your results for plausibility by comparing the performance of the vector triad to typical achievable bandwidth numbers on the IvyBridge processor in an Emmy node (memory bandwidth: approx. 15 GB/s, L2/L3 cache bandwidth: tens of GB/s, L1 cache bandwidth: close to 100 GB/s). The triad is limited by data transfer to cache or main memory (depending on the vector length), and each array access (read or write) incurs a data transfer of eight bytes.

Does the compiler do "the right thing" when optimizing the inner loop of benchmark (d)? How can you tell?

2. (C++ only) Repeat the measurement for the vector triad (Assignment 1(a) above), but now using STL std::vector<double> objects instead of plain arrays, i.e., declare the arrays as

```c++
std::vector<double> a(N), b(N), c(N), d(N);
```

For which vector lengths do you see a change in performance? For comparison, run the benchmark with a binary that has been compiled with the -fno-inline option (which prevents inlining of all function calls).