## Measuring the throughput of the divide instruction

We want to numerically integrate the function

f(x) = 4 / (1 + x^2)

from 0 to 1. The result should be an approximation to π, of course. You may use a very simple rectangular integration scheme that works by summing up the areas of rectangles centered around x_i, each with width Δx and height f(x_i):

```c
int SLICES = 100000000;
double delta_x = ...;            /* width of one slice */
double sum = 0.0, x, Pi;
for (int i = 0; i < SLICES; i++) {
    x = (i + 0.5) * delta_x;
    sum += 4.0 / (1.0 + x * x);
}
Pi = sum * delta_x;
```

You can find (not quite complete) example programs in C and Fortran in the DIV folder. Make sure that your code actually computes an approximation to π, and report runtime and performance in MIterations/s as obtained on one core of the cluster (you will need the clock speed in GHz).
1. Assuming that the (non-pipelined) divide operation dominates the runtime of the code (and everything else is hidden behind the divides), can you estimate the latency (or inverse throughput, which is practically the same for a non-pipelined unit) of a divide in CPU cycles?
2. Try compiling without the additional option -no-vec (which disables SIMD vectorization). How does that change your result? What is your conclusion from this?
3. Parallelize the code using OpenMP. Run it with different numbers of cores on one socket (if you already know how to do this). Does the performance scale with the number of cores? Did you expect this result?
4. Can you modify the code so that it measures the divide instruction latency instead of the throughput?