## Measuring the throughput of the divide instruction

We want to numerically integrate the function

$$f(x) = \frac{4}{1+x^2}$$

from 0 to 1. The result should be an approximation to π, of course. You may use a very simple rectangular integration scheme that works by summing up areas of rectangles centered around $x_i$ with a width of $\Delta x$ and a height of $f(x_i)$:

    int SLICES = 100000000;
    double delta_x = ....;
    for (int i=0; i < SLICES; i++) {
      x = (i+0.5)*delta_x;
      sum += (4.0 / (1.0 + x * x));
    }
    Pi = sum * delta_x;
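
For orientation, here is a minimal sketch of how the fragment above could be completed and timed. It is not the code from the DIV folder: the function names `integrate_pi`, `wall_seconds`, and `timed_run` are made up for this illustration, the slice width is assumed to be `1.0 / slices` (the unit interval divided evenly), and POSIX `clock_gettime` is assumed to be available for timing.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* Midpoint-rule approximation of the integral of 4/(1+x^2) over [0,1]. */
double integrate_pi(int slices)
{
    /* Assumption: the interval [0,1] is split into equally wide slices. */
    const double delta_x = 1.0 / slices;
    double sum = 0.0;
    for (int i = 0; i < slices; i++) {
        double x = (i + 0.5) * delta_x;   /* center of rectangle i */
        sum += 4.0 / (1.0 + x * x);       /* one divide per iteration */
    }
    return sum * delta_x;
}

/* Wall-clock time in seconds, read from a monotonic POSIX clock. */
double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}

/* Time one integration run; printing the result keeps the compiler
   from discarding the loop as dead code. Returns the runtime in seconds. */
double timed_run(int slices)
{
    double start = wall_seconds();
    double pi = integrate_pi(slices);
    double runtime = wall_seconds() - start;
    printf("pi ~ %.12f, runtime %.3f s\n", pi, runtime);
    return runtime;
}
```

Calling `timed_run(100000000)` from `main` would reproduce the problem size used above. Checking the printed value against π is a quick way to confirm that the loop really did the intended work.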

You can find (not quite complete) example programs in C and Fortran in the DIV folder. Make sure that your code actually computes an approximation to π, and report runtime and performance in MFlop/s as obtained on one core of the cluster. How many flops per cycle are performed?
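
The conversion from a measured runtime to these metrics is simple arithmetic, sketched below. The type `perf_metrics` and the function `compute_metrics` are hypothetical names for this illustration. The number of flops per iteration and the clock frequency are inputs you have to supply yourself: the exercise does not prescribe a counting convention, and the clock frequency depends on the machine.

```c
#include <stdio.h>

/* Hypothetical container for the derived performance numbers. */
typedef struct {
    double mflops;           /* 10^6 floating-point operations per second */
    double flops_per_cycle;  /* floating-point operations per CPU cycle */
    double cycles_per_iter;  /* CPU cycles spent per loop iteration */
} perf_metrics;

/* Derive performance metrics from a measured runtime.
   runtime_s      : measured wall-clock time in seconds
   iterations     : number of loop iterations (SLICES)
   flops_per_iter : flops counted per iteration (a convention you choose)
   clock_hz       : CPU clock frequency in Hz (must match the actual clock) */
perf_metrics compute_metrics(double runtime_s, double iterations,
                             double flops_per_iter, double clock_hz)
{
    perf_metrics m;
    double total_flops  = iterations * flops_per_iter;
    double total_cycles = runtime_s * clock_hz;

    m.mflops          = total_flops / runtime_s / 1.0e6;
    m.flops_per_cycle = total_flops / total_cycles;
    m.cycles_per_iter = total_cycles / iterations;
    return m;
}
```

With purely hypothetical inputs of a 1.0 s runtime, 10^8 iterations, 6 flops per iteration, and a 2.0 GHz clock, this yields 600 MFlop/s, 0.3 flops per cycle, and 20 cycles per iteration. Since the scalar loop performs one divide per iteration, the cycles per iteration give an estimate of the divide's inverse throughput, provided that everything else really is hidden behind the divides.
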
1. Assuming that the divide instruction dominates the runtime of the code (and everything else is hidden behind the divides), can you estimate the inverse throughput for a divide in CPU cycles?
2. Try compiling without the additional option `-no-vec` (which disables SIMD vectorization). How does that change your result? What is your conclusion from this?
3. Parallelize the code using OpenMP. Run it with different numbers of cores on one socket (if you already know how to do this). Does the performance scale with the number of cores? Did you expect this result?
4. Can you modify the code so that it measures the divide instruction latency instead of the inverse throughput?
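
As a hint for the last question, one possible approach (not necessarily the intended solution) is to make every divide depend on the result of the previous one, so that consecutive divides cannot overlap in the pipeline. The function name `divide_chain` and the constants below are assumptions chosen for this sketch. It also assumes that the compiler does not algebraically simplify the chain, which holds for standard-conforming floating-point settings but may not hold with options such as fast-math.

```c
#include <stdio.h>

/* Loop-carried dependency through the divide: each iteration needs the
   quotient of the previous one as its divisor, so the loop time per
   iteration approximates the divide latency rather than its throughput. */
double divide_chain(long n, double seed)
{
    double y = seed;
    for (long i = 0; i < n; i++) {
        y = 3.0 / y;   /* cannot start before the previous divide finishes */
    }
    return y;          /* return the value so the loop is not optimized away */
}
```

With a seed of 1.5 the value alternates between 2.0 and 1.5, so it stays well inside the normal floating-point range for any number of iterations. Timing this loop and dividing the elapsed cycles by `n` gives a latency estimate in cycles per divide. The dependency also prevents SIMD vectorization of the loop.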