## Exercise 1: Fun with divides

We want to numerically integrate the function

$$f(x) = \frac{4}{1+x^2}$$

from 0 to 1. The result should be an approximation to π, of course. We use a very simple rectangular integration scheme that works by summing up the areas of rectangles centered around $x_i$ with a width of $\Delta x$ and a height of $f(x_i)$:
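Written out as a formula, the scheme described above is the midpoint rule. The symbol $N$ is introduced here for the number of rectangles (it corresponds to `SLICES` in the code); it is not part of the original notation.

$$\pi = \int_0^1 \frac{4}{1+x^2}\,dx \;\approx\; \Delta x \sum_{i=0}^{N-1} f(x_i), \qquad x_i = \left(i + \tfrac{1}{2}\right)\Delta x, \qquad \Delta x = \frac{1}{N}$$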

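    int SLICES = 1000000000;
    double delta_x = 1.0/SLICES;
    double x, sum = 0.0, Pi;
    for (int i = 0; i < SLICES; i++) {
        x = (i + 0.5) * delta_x;       // midpoint of rectangle i
        sum += 4.0 / (1.0 + x * x);    // add the height f(x)
    }
    Pi = sum * delta_x;                // sum of heights times width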

You can find example programs in C and Fortran in the DIV folder. Make sure that your code actually computes an approximation to π, and report the runtime and the performance in MFlop/s as obtained on one core of the cluster. How many flops per cycle are performed?
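The measurement described above could be sketched as follows. This is a minimal illustration, not the code from the DIV folder: the function names (`integrate_pi`, `time_integration`, `mflops`, `flops_per_cycle`) are hypothetical, `clock()` is just one possible timer, and both the number of flops counted per iteration and the clock frequency are inputs you have to supply for your own machine and counting convention.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Midpoint-rule approximation of pi using `slices` rectangles. */
double integrate_pi(long slices) {
    assert(slices > 0);
    double delta_x = 1.0 / (double)slices;
    double sum = 0.0;
    for (long i = 0; i < slices; i++) {
        double x = (i + 0.5) * delta_x;   /* midpoint of rectangle i */
        sum += 4.0 / (1.0 + x * x);       /* add the height f(x) */
    }
    return sum * delta_x;
}

/* Runs the integration once and returns the elapsed CPU time in seconds.
   Assumption: clock() is precise enough for runs of about a second or more. */
double time_integration(long slices, double *pi_out) {
    clock_t start = clock();
    *pi_out = integrate_pi(slices);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Performance in MFlop/s. `flops_per_iter` is a counting convention chosen
   by the caller (assumption), e.g. 6 if every add, multiply, and divide in
   the loop body is counted. */
double mflops(long slices, double seconds, double flops_per_iter) {
    return (double)slices * flops_per_iter / seconds / 1.0e6;
}

/* Flops per cycle, given the performance in MFlop/s and the core clock
   frequency in GHz (assumption: the clock stays fixed during the run). */
double flops_per_cycle(double mflops_value, double clock_ghz) {
    return mflops_value * 1.0e6 / (clock_ghz * 1.0e9);
}
```

To use it, call `time_integration` from `main` with the desired number of slices, print the returned value of π to confirm the result, and pass the measured time to `mflops`. Printing the result also keeps the compiler from optimizing the loop away.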
1. Assuming that the divide instruction dominates the runtime of the code (and everything else is hidden behind the divides), can you estimate the inverse throughput (i.e., the number of cycles per operation) of a divide operation in CPU cycles?
2. In the Makefile we have deactivated SIMD vectorization for the Intel compiler via the -no-vec option. If you delete this option, the compiler will use AVX (32-byte) SIMD instructions. How does the divide throughput change? Did you expect this result?
3. Parallelize the code using OpenMP. Run it with different numbers of cores on one socket (if you already know how to do this). Does the performance scale with the number of cores? Did you expect this result? (Hint: It should scale. If it doesn't, something is wrong.)
4. What happens if you run the scaling experiment from 3. in "Turbo Mode"? Does the performance still scale?
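For tasks 1 and 3, one possible starting point is sketched below. It is not a reference solution: the function names are hypothetical, the clock frequency must be supplied by you, and the divide estimate relies on the assumption stated in task 1 that the divide alone determines the runtime. The parallel loop uses a standard OpenMP `reduction` clause so that each thread accumulates a private partial sum.

```c
#include <assert.h>
#include <stdio.h>

/* OpenMP version of the midpoint-rule loop. Without OpenMP enabled at
   compile time the pragma is ignored and the loop runs serially. */
double integrate_pi_omp(long slices) {
    assert(slices > 0);
    double delta_x = 1.0 / (double)slices;
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (long i = 0; i < slices; i++) {
        double x = (i + 0.5) * delta_x;   /* x is private to each iteration */
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * delta_x;
}

/* Estimated cycles per divide for a serial run, assuming one divide per
   iteration and that the divide alone limits the runtime (task 1).
   `clock_ghz` is the core clock frequency, to be supplied by the caller. */
double cycles_per_divide(double seconds, double clock_ghz, long divides) {
    assert(divides > 0);
    return seconds * clock_ghz * 1.0e9 / (double)divides;
}

/* Parallel efficiency: 1.0 means perfect scaling from 1 to n cores. */
double parallel_efficiency(double t_one_core, double t_n_cores, int n) {
    assert(n > 0 && t_n_cores > 0.0);
    return t_one_core / (t_n_cores * (double)n);
}
```

With GCC the OpenMP pragma is enabled by `-fopenmp` and with the Intel compiler by `-qopenmp`; the thread count can be set with the `OMP_NUM_THREADS` environment variable. For wall-clock timing of the parallel run, `omp_get_wtime()` from `omp.h` is a better choice than `clock()`, which adds up the CPU time of all threads.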