## Exercise: Measuring the divide instruction/operation throughput

Note: This exercise requires that the clock speed of the CPU is known. If it is, for some reason, not possible to fix the clock frequency, execute the code with the following command:

```
$ likwid-perfctr -C S1:0 -g CLOCK <your_executable>
```

This prints (among other things) the average actual clock frequency after the binary terminates. In order for this to work you have to load the likwid module first. It is also a good idea to always have a compiler module loaded:

```
$ module load intel/17 likwid/4.3.2
```

Exercise: We want to numerically integrate the function

f(x) = 4/(1+x²)

from 0 to 1. The result should be an approximation to π, of course. You may use a very simple rectangular integration scheme that works by summing up areas of rectangles centered around x_i with a width of Δx and a height of f(x_i):

```c
int SLICES = 1000000000;
double delta_x = 1.0 / SLICES;
double sum = 0.0, x, Pi;

for (int i = 0; i < SLICES; i++) {
    x = (i + 0.5) * delta_x;
    sum += 4.0 / (1.0 + x * x);
}
Pi = sum * delta_x;
```

You can find example programs in C and Fortran in the DIV folder. Make sure that your code actually computes an approximation to π, and report runtime and performance in MFlops/s as obtained on one core of the cluster. How many flops per cycle are performed?
1. Assuming that the divide instruction dominates the runtime of the code (and everything else is hidden behind the divides), can you estimate the inverse throughput of a divide operation (i.e., the number of cycles per operation) in CPU cycles?
2. In the Makefile we have deactivated SIMD vectorization for the Intel compiler via the `-no-vec` option. If you delete this option, the compiler will use AVX (32-byte) SIMD instructions. How does the divide throughput change? Did you expect this result?
3. Parallelize the code using OpenMP. Run it with different numbers of cores on one socket (if you know already how to do this). Does the performance scale with the number of cores? Did you expect this result?