Assignment 1: Loop kernel benchmarking

  1. Write a benchmark program that measures the performance in MFlop/s of the following three computation kernels:

    (a) (20 credits) vector triad: a[i] = b[i] + c[i] * d[i]
    (b) (20 credits) vector update:  a[i] = s * a[i] 
    (c) (30 credits) vector update "with a twist": a[i] = s * a[i-1]


    a, b, c, d are double precision arrays of length N. s is a double precision scalar. Allocate memory for those data structures on the heap, i.e. using malloc() in C, new in C++ or allocate() in Fortran90. Do not forget to initialize all data elements with valid floating-point (FP) numbers. Using calloc() is not sufficient - see this blog entry for an explanation.
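
    A minimal sketch of the allocation and initialization step in C. The helper name alloc_init and the fill value are assumptions made for illustration; any valid positive FP number works. Positive values also keep a guard such as a[N>>1]<0. false at run time.

```c
#include <stdlib.h>

/* Allocate n doubles on the heap and store `value` into every element.
 * alloc_init is a hypothetical helper name, not part of the assignment.
 * Returns NULL if the allocation fails; the caller must free() the result. */
double *alloc_init(size_t n, double value)
{
    double *v = malloc(n * sizeof *v);
    if (v == NULL)
        return NULL;
    for (size_t i = 0; i < n; ++i)
        v[i] = value;   /* explicit store to every element */
    return v;
}
```

    Usage could look like double *a = alloc_init(N, 1.0); followed by a NULL check, and free(a) once all measurements for this N are done.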

    Run your code using the following vector lengths (in elements): N = int(1.5^k), k = 8 ... 43. This will give you a decent resolution and equidistant points on the x axis (see below).
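
    One way to generate these vector lengths, taking N as 1.5 raised to the power k and truncated to an integer. The helper name vector_length is an assumption; repeated multiplication is used here only to avoid a dependency on the math library.

```c
/* Returns int(1.5^k), computed by repeated multiplication.
 * vector_length is a hypothetical helper name. */
int vector_length(int k)
{
    double x = 1.0;
    for (int i = 0; i < k; ++i)
        x *= 1.5;
    return (int)x;   /* truncate toward zero */
}
```

    Looping k from 8 to 43 yields 36 sizes, starting at N = 25 and ending at roughly 3.7e7 elements.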

    Perform the measurement on one core of the Emmy cluster for the given loop lengths (do not forget to fix the clock frequency). For reasons of accuracy, make sure that the runtime of each kernel with each vector/matrix size is larger than 0.1 seconds by repeating the computation kernel in an outer loop. Maybe it is a good idea to dynamically adjust the number of repetitions depending on the runtime of the kernel:

    int repeat = 1, r;
    double wcs, wce, runtime = 0.;
    for (; runtime < .1; repeat *= 2) {
        wcs = getTimeStamp();
        for (r = 0; r < repeat; ++r) {
            /* PUT THE KERNEL BENCHMARK LOOP HERE */
            if (CONDITION_NEVER_TRUE) dummy(a); // for the compiler
        }
        wce = getTimeStamp();
        runtime = wce - wcs;
    }
    repeat /= 2; /* undo the final doubling: this is the count that was actually timed */

    Make sure that the operations in the kernel actually get executed - compilers are smart! (This is what the bogus if statement is for. The dummy() function must reside in a separate source file, and the compiler must not be able to determine the result of the condition at compile time. Example: if(a[N>>1]<0.) - if all arrays are initialized with positive numbers, this condition is never true.) Use the standard compiler options -O3 -xHost -fno-alias.
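
    As an illustration of the guard, here is a sketch of the repeated vector triad with the never-true condition in place. The function name triad_sweeps is an assumption. dummy() is defined in the same block only to keep the sketch self-contained; in the real benchmark it has to live in its own source file, as stated above.

```c
#include <stddef.h>

/* In the real benchmark this definition belongs in a separate source
 * file so the compiler cannot see its body. It is shown here only to
 * keep the sketch self-contained. */
void dummy(double *a) { (void)a; }

/* Runs `repeat` sweeps of the vector triad, kernel (a).
 * triad_sweeps is a hypothetical name. */
void triad_sweeps(double *a, const double *b, const double *c,
                  const double *d, size_t n, int repeat)
{
    for (int r = 0; r < repeat; ++r) {
        for (size_t i = 0; i < n; ++i)
            a[i] = b[i] + c[i] * d[i];
        /* Never true when all arrays hold positive numbers, but the
         * compiler cannot prove that at compile time. */
        if (a[n >> 1] < 0.) dummy(a);
    }
}
```

    For kernels (b) and (c) only the inner loop changes. Kernel (c) reads a[i-1], so its loop has to start at i = 1.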

    Use your favorite graphics program (e.g., gnuplot or xmgrace or Excel if you must) to generate plots of the performance in MFlop/s vs. N. Choose a logarithmic scale on the x axis (think about why this is advisable). Always let the y axis start at zero. If you don't, the graph may be misleading and you will collect some very bad Karma.
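
    The plotted quantity follows from the measured runtime, the vector length and the repetition count. A small sketch of that conversion, assuming the usual flop counts (2 per iteration for the triad, 1 for the vector updates); the function name mflops is hypothetical.

```c
/* Performance in MFlop/s.
 * flops_per_iter: FP operations per loop iteration (assumed: 2 for the
 *                 triad, 1 for the vector updates)
 * n:              number of loop iterations per sweep
 * repeat:         number of sweeps that were timed
 * runtime:        measured wall-clock time in seconds (must be > 0) */
double mflops(double flops_per_iter, double n, double repeat, double runtime)
{
    return flops_per_iter * n * repeat / (runtime * 1.0e6);
}
```

    For example, 1000 sweeps over 1000 triad iterations in one second give mflops(2.0, 1000.0, 1000.0, 1.0), i.e. 2 MFlop/s.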

    For benchmark (c), describe how it differs fundamentally from the other benchmarks. Is the actual performance number expected? Why?

  2. (30 credits) Repeat the measurement for the vector triad (Assignment 1(a) above), but now using STL std::vector<double> objects instead of plain arrays, i.e., declare the arrays as 

    std::vector<double> a(N), b(N), c(N), d(N);

    This is C++ code now, so take care to use the ".cc" file extension and the Intel C++ compiler (icpc). Note also that you have to wrap C headers in an extern "C" { ... } block whenever you include them in a C++ source file. You will also need standard C++ headers instead of the standard C headers:

    #include <iostream>
    #include <vector>
    #include <cmath>

    extern "C" {
    #include "timing.h"
    #include "dummy.h"
    }


    to make everything work as intended. You should also change the line where the performance is printed to something like

    std::cout << "Perf: " << performance << " MFlop/s." << std::endl;

    Present the data in the same manner as above for (a) the default Intel compiler with recommended command line options, (b) the default Intel compiler with the additional -fno-inline switch (which prevents function inlining),  and (c) the default g++ compiler (options -Ofast -mavx -fargument-noalias). Draw the three data sets in the same diagram. What conclusions can you draw from the data? What is the best and the worst value for "CPU cycles per iteration" that you could observe for all three data sets?
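
    "CPU cycles per iteration" can be derived from the same measurements once the clock frequency is fixed. A minimal sketch; the function name is hypothetical, and the clock frequency is a parameter that has to be filled in with the value actually set for the run.

```c
/* CPU cycles spent per loop iteration.
 * clock_hz: fixed core clock frequency in Hz (to be supplied by the user)
 * runtime:  measured wall-clock time in seconds
 * n:        number of loop iterations per sweep
 * repeat:   number of sweeps that were timed */
double cycles_per_iteration(double clock_hz, double runtime,
                            double n, double repeat)
{
    return clock_hz * runtime / (n * repeat);
}
```

    As a made-up example, 1e6 sweeps over 1000 iterations in one second at 2 GHz correspond to cycles_per_iteration(2.0e9, 1.0, 1000.0, 1.0e6), i.e. 2 cycles per iteration.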