Exercise: Measuring the divide throughput

We want to numerically integrate the function

f(x) = 4/(1+x^2)

from 0 to 1. The result should be an approximation to π, of course. You may use a very simple rectangular integration scheme that works by summing up areas of rectangles centered around x_i with a width of Δx and a height of f(x_i):

int SLICES = 100000000;
double delta_x = ....;
double x, sum = 0.0;
for (int i=0; i < SLICES; i++) {
  x = (i+0.5)*delta_x;
  sum += (4.0 / (1.0 + x * x));
}
double Pi = sum * delta_x;

You can find (not quite complete) example programs in C and Fortran in the DIV folder. Make sure that your code actually computes an approximation to π, and report runtime and performance in MIterations/s as obtained on one core of the cluster.
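For the runtime and MIterations/s figures, a timing helper along the following lines may be useful. This is only a sketch: the function names wall_seconds and miterations_per_second are made up for illustration and are not taken from the programs in the DIV folder, and it assumes a POSIX system that provides clock_gettime() with CLOCK_MONOTONIC.

```c
#define _POSIX_C_SOURCE 199309L  /* exposes clock_gettime() under strict -std=c99 */
#include <assert.h>
#include <time.h>

/* Wall-clock time in seconds, read from a monotonic POSIX clock.
 * Hypothetical helper, not part of the provided example programs. */
double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}

/* Loop iterations per second in units of 10^6 (MIterations/s).
 * Hypothetical helper, not part of the provided example programs. */
double miterations_per_second(long iterations, double seconds)
{
    assert(iterations > 0 && seconds > 0.0);
    return (double)iterations / (seconds * 1.0e6);
}
```

One possible usage: store wall_seconds() in a variable directly before the loop and again directly after it, then pass SLICES and the difference of the two timestamps to miterations_per_second(). Printing Pi after the timed region keeps the compiler from discarding the loop as dead code.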
  1. Assuming that the (non-pipelined) divide operation dominates the runtime of the code (and everything else is hidden behind the divides), can you estimate the latency for a divide in CPU cycles?
  2. Try compiling without the additional option -no-vec (which disables SIMD vectorization). How does that change your result? What is your conclusion from this?
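
The conversion behind question 1 can be sketched as follows: the number of cycles spent per loop iteration is the clock frequency divided by the iteration rate. The function name cycles_per_iteration is hypothetical, and the clock frequency has to be looked up for the actual cluster node; this sketch assumes the core really runs at the frequency you pass in.

```c
#include <assert.h>

/* CPU cycles spent per loop iteration.
 * clock_ghz:   core clock frequency in GHz (must be supplied by the user)
 * miter_per_s: measured performance in MIterations/s
 * Hypothetical helper for illustration only. */
double cycles_per_iteration(double clock_ghz, double miter_per_s)
{
    assert(clock_ghz > 0.0 && miter_per_s > 0.0);
    /* (clock_ghz * 1e9 cycles/s) / (miter_per_s * 1e6 iterations/s) */
    return clock_ghz * 1.0e3 / miter_per_s;
}
```

With made-up numbers, a 2.0 GHz clock and a measured 100 MIterations/s give cycles_per_iteration(2.0, 100.0) = 20 cycles per iteration. In the scalar case there is one divide per iteration, so under the assumption stated in question 1 this value is also the estimate for the divide latency. With SIMD vectorization one divide instruction serves several iterations, which has to be taken into account when interpreting the number for question 2.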
Last modified: Sunday, 22 March 2015, 10:25 PM